MongoDB 3.2 Goodies Coming Your Way: More ways to $unwind

The aggregation framework has a slew of new operators and pipeline improvements. One notable improvement is the more robust $unwind pipeline stage.

One of the main motivations for using Mongo is its flexible document model. Any document in a collection can have arbitrarily different fields and data types in fields. This is great for many reasons. But when it came to aggregating array elements it posed a few problems. Unwinding an array is done with the $unwind pipeline stage, but you could only specify the field to unwind.

Problem one was that the variety across documents created for non-existent arrays. Consider a cake collection, containing documents about cakes.

> db.cakes.find()
{ "_id" : "pound cake", "recipe" : [ "butter", "flour", "eggs", "sugar" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

Pound cake has a recipe field, containing ingredients. Brownies are defined in a different document schema, listing the main constituents of the brownie in a makeup field.

If we wanted to create a listing of possible cakes by ingredient, we’d write something like:

db.cakes.aggregate([
{$unwind:'$recipe'},
{$group:{_id: '$recipe', found_in: {$addToSet:'$_id'}}}
])

But this will skip brownies because brownies don’t have field recipe at all. We’d have to synthetically add a recipe field in order to fix that like so:

db.cakes.aggregate([
{$project:{ fixed: {$ifNull: ['$recipe', ['-']]}}},
{$unwind:'$fixed'},
{$group:{_id: '$fixed', found_in: {$addToSet:'$_id'}}}
])

Adding an array with the single element ‘-‘ (or whatever token you want) was necessary because $unwind insisted that the field being unwound would both be an array type and contain an element. No element, not an array? $unwind didn’t emit the document at all.

But with 3.2, $unwind would allow you to emit a document placeholder even if there is no field or the field contains null. You activate this option by adding a second argument to $unwind named preserveNullAndEmptyArrays with a value of true or false. If you don’t specify the extra argument, then no document is emitted for null or empty array fields. This allows us the more concise expression

db.cakes.aggregate([
{$unwind: {path: '$recipe', preserveNullAndEmptyArrays: true}},
{$group: {_id: '$recipe', found_in: {$addToSet: '$_id'}}}
])
// result of this aggregation:
{ "_id" : "sugar", "found_in" : [ "pound cake" ] }
{ "_id" : "eggs", "found_in" : [ "pound cake" ] }
{ "_id" : "flour", "found_in" : [ "pound cake" ] }
{ "_id" : null, "found_in" : [ "brownies" ] }
{ "_id" : "butter", "found_in" : [ "pound cake" ] }

Using preserveNullAndEmptyArrays will emit documents that have either null value, or an empty array. The above aggregation example will produce a result with _id null for all cakes that don’t have a recipe field.

Nice.

But variety doesn’t stop at existence or non-existence of fields. What about a field that contains an array in some documents, but is a straight out string in another?

Consider these 3 cakes:

{ "_id" : "princess", "makeup" : [ "sponge", "jam", "sponge", "custard", "sponge", "whipped-cream", "marzipan" ] }
{ "_id" : "angel cake", "makeup" : [ "sponge", "whipped-cream", "sponge", "icing" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

The first two contain an array for their multi-part “cakiness” in the makeup field. But the brownie, well.. it’s made out of brownie! The value of the makeup field there is just a string.

$unwind used to error out on this condition. If it encountered any document who’s field was not an underlying BSON type of array it would halt the pipeline and throw an error. But not anymore!

Running a straightforward aggregation:

db.cakes.aggregate([
{$unwind: {path: '$makeup'}},
{$group: {_id: '$makeup', makes: {$addToSet: '$_id'}}}
])
// results in the following:
{ "_id" : "marzipan", "makes" : [ "princess" ] }
{ "_id" : "sponge", "makes" : [ "princess", "angel cake" ] }
{ "_id" : "icing", "makes" : [ "angel cake" ] }
{ "_id" : "jam", "makes" : [ "princess" ] }
{ "_id" : "custard", "makes" : [ "princess" ] }
{ "_id" : "brownie", "makes" : [ "brownies" ] }
{ "_id" : "whipped-cream", "makes" : [ "princess", "angel cake" ] }

Is perfectly acceptable in the new unwinding world. When $unwind sees a field with a non-array value, it treats it as an array with that single value, and emits a single document containing that value. No more error. Since in the past such data condition would have produced an error, your legacy code should largely work OK under upgrade. If it didn’t error then this condition wasn’t present – presumably clean data or you took the time to fashion fancy $match clause or expressions that prevented single-valued fields from entering the $unwind stage.

Nice.

This feature adds nuances to a common scenario. Initially, a document schema contains only a single value because it is version 0.9 of the software, and the use case was simple. Later on, we discover that a single value doesn’t cover the future feature and convert a field to an array. But then we may have some limbo time when some documents are saved with one schema, and others with another. Either way, we can now aggregate across that field and handle single-valued fields as an array with single a value without the need for elaborate $project of $match.

Oh, and one more thing:

We may have a need to figure out how cake composition varies across cakes. For that, I want to aggregate around the ordinal position of a component across cakes. I’d like to have an idea across the layer 1’s, layer 2’s etc what’s the amount of variation. To assist with that, $unwind has a new option to emit the array offset alongside the value.

> db.cakes.aggregate([
{$unwind: {path: '$makeup', includeArrayIndex: 'offset'}},
{$group: {_id: '$offset', options: {$addToSet: '$makeup'}}}
])
// returns these results:
{ "_id" : NumberLong(6), "options" : [ "marzipan" ] }
{ "_id" : NumberLong(5), "options" : [ "whipped-cream" ] }
{ "_id" : NumberLong(4), "options" : [ "sponge" ] }
{ "_id" : NumberLong(3), "options" : [ "custard", "icing" ] }
{ "_id" : NumberLong(2), "options" : [ "sponge" ] }
{ "_id" : NumberLong(1), "options" : [ "jam", "whipped-cream" ] }
{ "_id" : NumberLong(0), "options" : [ "sponge" ] }
{ "_id" : null, "options" : [ "brownie" ] }

Which lets us see that our cakes thus far have sponge layers and a variety of fillings in between for the first 4 components (offsets 0 - 4)

When we evaluate our customer schema and use cases, we also spend time ensuring that we can produce requisite (and sometimes intricate) queries using available Mongo server syntax. The 3.2 release includes an improved $unwind and a bevy of other operators which helps us cover more use cases in a more compact and efficient way.