+N Consulting, Inc.

AI Query Generator Added to MongoDB Compass

MongoDB’s aggregate command (and its predecessor, find) takes some time to get used to.

This is seen as a barrier to entry for those coming from a relational background as well as those who are early in the journey of MongoDB adoption.

With all the new AI and awesome natural language tools, can we get some help for generating queries? “Yes!” says the Compass team.

Announced at the .Local event in London, Compass is rolling out a natural language text prompt which translates your natural language input into a properly stated query using either the find syntax or an aggregation pipeline (MQL).

Compass AI Assisted Query Generator

I ran some experiments to kick the tires on this feature.

Taking a Test Run

The following are prompts and results I ran against a collection which has documents like the one below. Documents have different field values. The fields in the sub-document contact are sparse: a document may have one, two, or all of the fields cell, email, and fax. These fields - if missing - are absent (not assigned null).

{
"_id": 6,
"name": "Elmer",
"status": "prospect",
"contact": {
"cell": "07969 773415",
"email": "[email protected]",
"fax": "01949 525019"
}
}

Prompt

How many Elmers are there?

Generated Query

[
{
$match: {
name: "Elmer"
}
},
{
$group: {
_id: null,
count: {
$sum: 1
}
}
}
]

Success. The generated query answers the prompt as intended. Note that the field name appears nowhere in the prompt. The tool understood that we’re looking for Elmers (a proper name, pluralized) and figured it should look in the field name. Nice!

Prompt

What are the 3 most common names of prospects with the status “customer” who have an email in the system?

Generated Query

[
{
$match: {
status: "customer",
"contact.email": { $exists: true }
}
},
{
$group: {
_id: "$name",
count: { $sum: 1 }
}
},
{
$sort: { count: -1 }
},
{
$limit: 3
},
{
$project: {
_id: 0,
name: "$_id",
count: 1
}
}
]

Correct. The AI understood the AND filter for documents with the status set to “customer” and an existing email. It used the common “top N by V” trope: group by the value of the field (name) and produce a count per grouping, then $sort by the count descending, then $limit to the intended number.

I noted that the generated queries use the old-school {$sum: 1} trick to produce the count. Understandable, since Compass may be used against older MongoDB versions that lack the $count group accumulator operator, but it is a bit more cryptic for the untrained when reading the resultant pipeline.
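
For reference, on MongoDB 5.0 and newer the same count can be expressed with the $count group accumulator, which reads a bit friendlier than the {$sum: 1} idiom. A sketch of an equivalent pipeline (not what Compass emitted):

[
{ $match: { name: "Elmer" } },
{ $group: { _id: null, count: { $count: {} } } }
]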

Let’s try some more:

Prompt

Which customers have and odd id number, and a name that starts with D?

Generated Query

[
{
$match: {
_id: { $mod: [2, 1] },
name: /^D/
}
}
]

Nicely interpreted. The $mod operator is a bit obscure, but fits the bill. The name is compared with a regex match on the expression /^D/. Left-anchored regex matches are the lesser evil compared to non-anchored matches, and can use indexes. But maybe a range expression such as {name: {$gte: "D", $lt: "E"}} would perform just as well?
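
For comparison, that range-based alternative might look like the sketch below; the collection name customers is just a stand-in, since the article doesn't name the collection:

// Range on name instead of the left-anchored regex (hypothetical collection name)
db.customers.find({ _id: { $mod: [2, 1] }, name: { $gte: "D", $lt: "E" } })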

I decided to change one little thing in the prompt: switch from an upper case D (which, by the way, is how my data is - names are capitalized in the dataset) to a lower case d.

Prompt

Which customers have and odd id number, and a name that starts with d?

Generated Query

[
{
$match: {
$and: [
{
_id: { $mod: [2, 1] }
},
{
name: /^d/
}
]
}
},
{
$project: {
name: 1
}
}
]

Interestingly, the tool decided to add a $project stage to the pipeline. Re-reading the prompt, I can see that I under-specified which fields to return. But what caused the previous query to not $project at all and this one to do so is a mystery to me. Not strictly wrong, but a mystery nonetheless.

A fix to this is to explicitly specify what to show. The prompt Which customers have and odd id number, and a name that starts with d? Give me their name labeled "Customer Name" and their id labeled "Customer Id" only. indeed generated the expected pipeline.

[
{
$match: {
$and: [
{ _id: { $mod: [2, 1] } },
{ name: /^d/i }
]
}
},
{
$project: {
"Customer Name": "$name",
"Customer Id": "$_id"
}
}
]

The projection renames the fields as the prompt described.

My standing conclusion is to be as explicit and concise as possible with the prompt. I guess this should be expected.

Prompt

How likely are people to have both a cell phone and a fax number?

Generated Query

[
{
$match: {
"contact.cell": {
$exists: true
},
"contact.fax": {
$exists: true
}
}
},
{
$group: {
_id: null,
count: {
$sum: 1
}
}
}
]

Chalk this one up to “not correct”. The prompt’s intent was to get a ratio or number representing the likelihood of a document containing both fields. The generator understood the AND condition correctly in the $match stage, but it failed to produce an aggregation which compares the count of documents matching the criteria against those which do not.
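
For reference, one way to express the intended likelihood is sketched below. This is my own pipeline, not something the generator produced; it relies on the aggregation $type operator returning "missing" for absent fields:

[
{ $group: {
_id: null,
total: { $sum: 1 },
both: { $sum: { $cond: [
{ $and: [
{ $ne: [ { $type: "$contact.cell" }, "missing" ] },
{ $ne: [ { $type: "$contact.fax" }, "missing" ] }
] },
1, 0 ] } }
} },
{ $project: { _id: 0, likelihood: { $divide: [ "$both", "$total" ] } } }
]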

Non-English

Many people speak English as a second language. But many do not. How about typing in your native tongue?

I gave it a whirl with Hebrew. It is not remotely a popular language, so I can’t imagine the team built specific support for it.

This query asks to find those who are named “יוסי“.

Prompt

מצא לי מי ששמם יוסי

Generated Query

{"name": "יוסי"}

The generator correctly figured out the pluralized word for “name” in my query, and determined the field to search against is name. It understood the value sought was “יוסי“, and kept the text as is - again, correct. It did not try to translate the value into English (Yossi), but that’s a good thing. I would expect that if I knew the name was stored in English, I’d maybe type a hybrid question such as

Prompt

מצא לי מי ששמם Elmer

Generated Query

{"name": "Elmer"}

Compass did the needful and produced my expected query. A note here is that BiDi editing isn’t perfect… Hebrew is written right to left, and the textbox moved the word “Elmer” to the very “end” of the sentence from an English perspective, but a native Hebrew speaker would expect it on the very left - at the end of the sentence when reading right to left. But the results were correct, and I won’t argue with “it worked.”

How about some aggregation? The Hebrew query asks how many people have an email but not a cell phone.

Prompt

לכמה אנשים יש דואל אבל אין סלולרי

Generated Query

[
{
$match: {
"contact.cell": { $exists: false },
"contact.email": { $exists: true }
}
},
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
]

Correct, again. The Hebrew prompt used the acronym term for “email” (דואל) and the common word for “mobile phone” or “cell phone” which is “cellular” (סלולרי). Not only did it match the terms to the fields cell and email, it also figured out correctly that the fields are in the sub-document contact and generated the dot-path to the fields.

Conclusion

I have reservations about using natural language to describe queries. For one, spoken language was not designed to be concise, and the range of misinterpretation that can be caused by imprecise language, as well as ambiguities within the language itself, can yield wild results. I was surprised by how well this new feature performed. Color me impressed. Not only did the generator create my intended queries, it picked up on nuances that I did not imagine it would.

Can there be test cases where it comes up with the wrong answer? Yeah, sure. But it performed remarkably well on what I gave it, and in doing so proved it is suitable for a wide enough audience and a wide enough set of reporting and analysis needs.

Am I going to use this daily? Time will tell. For simple queries? Maybe. For more complex ones? Maybe also - I can see it creating a skeleton query for me to approve or refine. Does this replace learning and knowing the breadth and depth of MQL? No. But for newcomers and occasional analysis this may just do the trick.

Give it a try, see what you think!

Compound Wildcard Indexes in MongoDB 7

What

Wildcard indexes have improved in several ways in MongoDB 7.0.

Notably, wildcard indexes can now be compound, including additional non-wildcard fields.

Previously, wildcard indexes only acted as a single-field index on each individual target field, stored in a single index which included those fields.

Newly supported is the ability to include both wildcard fields and named individual fields in the same index.

This capability supports common use cases such as the attribute-pattern document design. In such cases, a sub-document contains a set of keys, but the presence of the keys varies from document to document.

Consider the case of the following documents describing prospect records in a CRM system:

{ _id: 35, name: 'Francis', status: 'customer', contact: {} }

{ _id: 36, name: 'Connor', status: 'customer', contact: { cell: '07787 304928', fax: '0191 558 5860' }}

{ _id: 1, name: 'Ethan', status: 'customer', contact: { cell: '07943 819337' }}
  1. Francis has no contact fields at all - just a name.
  2. Connor has both a cell and a fax.
  3. Ethan has only a cell phone number.

A wildcard index on contact.$** would allow a single-field match against any one of the contact fields.
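
For reference, that plain wildcard index (wildcard fields only, no extra named fields) would be created like so:

db.prospect.createIndex({ "contact.$**": 1 })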

But query filtering on the status field and one of the wildcard fields would not be supported without a compound wildcard index.

Using a Compound Wildcard Index

First, create an index by supplying both a wildcard expression and one or more additional plain (non-wildcard) fields.

db.prospect.createIndex({ "contact.$**": 1, status:1 }, {name: "idx_compound_wild"})

As with other compound indexes, the key structure supports key prefixes that include the first, or first and second fields.
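
For example, a filter on just the wildcard field - the leading part of the prefix - can use the index on its own. A quick sketch against the same sample data:

// Uses idx_compound_wild via the leading ($_path, contact.cell) keys; status remains unbounded
db.prospect.find({ 'contact.cell': '07408 926850' })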

This query leverages the compound index:

db.prospect.find({'contact.cell':'07408 926850', status:'customer'})

The query includes a full prefix of the index: a single wildcard field contact.cell, and the non-wildcard field status.

Running .explain(true) on this query shows a winning plan that states our index named idx_compound_wild was used. The indexBounds entry shows that both the contact.cell and status fields were considered when processing the query against the index.

"winningPlan": {
"queryPlan": {
"stage": "FETCH",
"planNodeId": 2,
"inputStage": {
"stage": "IXSCAN",
"planNodeId": 1,
"keyPattern": {
"$_path": 1,
"contact.cell": 1,
"status": 1
},
"indexName": "idx_compound_wild",
"isMultiKey": false,
"multiKeyPaths": {
"$_path": [],
"contact.cell": [],
"status": []
},
"isUnique": false,
"isSparse": true,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"$_path": [
"[\"contact.cell\", \"contact.cell\"]"
],
"contact.cell": [
"[\"07408 926850\", \"07408 926850\"]"
],
"status": [
"[\"customer\", \"customer\"]"
]
}
}
}
}...

No Index Intersection

Index intersection is a strategy where more than one indexed field is stated in a filter, and the result is computed by performing a set intersection on the keys of two indexes.

Intersection seems to make sense when two separate simple indexes are at play on two separate fields. What about a wildcard index? Will the mongo query planner choose to “intersect” the same wildcard index with itself somehow? Do we expect it to be smart about choosing to scan for two values, one found in a certain path contact.cell, then the second found in a different path contact.fax - both of which are contained within the same index? The short answer: no!

Consider the following query:

db.prospect.find({ 'contact.email': '[email protected]', 'contact.fax': '0121 062 9173'})

The query filters on two separate wildcard fields email and fax.

The query planner shows the following output:

"winningPlan": {
"queryPlan": {
"stage": "FETCH",
"planNodeId": 2,
"filter": {
"contact.email": {
"$eq": "[email protected]"
}
},
"inputStage": {
"stage": "IXSCAN",
"planNodeId": 1,
"keyPattern": {
"$_path": 1,
"contact.fax": 1,
"status": 1
},
"indexName": "idx_compound_wild",
"isMultiKey": false,
"multiKeyPaths": {
"$_path": [],
"contact.fax": [],
"status": []
},
"isUnique": false,
"isSparse": true,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"$_path": [
"[\"contact.fax\", \"contact.fax\"]"
],
"contact.fax": [
"[\"0121 062 9173\", \"0121 062 9173\"]"
],
"status": [
"[MinKey, MaxKey]"
]
}
}
}...

The above is a bit awkward. The filter term mentions the contact.email field alone. The indexBounds field mentions the contact.fax field only. Though one might intuit that the strategy is to use the index to scan on contact.fax first, then filter the results by contact.email, it is neither clear that this is the case nor expected. One would expect that the index contains the list of documents under either of the keys, and therefore an AND query would hit the index key structure twice and do some nested loop join or something similar.

Let’s try specifying three of the wildcard fields in the index, and see if things become any clearer:

db.prospect.find({
'contact.cell': '07404 190465',
'contact.email': '[email protected]',
'contact.fax': '01995 364874'
}).explain("queryPlanner")

The plan shown below seems to imply the same general strategy. The index bounds only mention one of the three query terms, then the filter mentions the other two.

"winningPlan": {
"queryPlan": {
"stage": "FETCH",
"planNodeId": 2,
"filter": {
"$and": [
{
"contact.cell": {
"$eq": "07404 190465"
}
},
{
"contact.fax": {
"$eq": "01995 364874"
}
}
]
},
"inputStage": {
"stage": "IXSCAN",
"planNodeId": 1,
"keyPattern": {
"$_path": 1,
"contact.email": 1,
"status": 1
},
"indexName": "idx_compound_wild",
"isMultiKey": false,
"multiKeyPaths": {
"$_path": [],
"contact.email": [],
"status": []
},
"isUnique": false,
"isSparse": true,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"$_path": [
"[\"contact.email\", \"contact.email\"]"
],
"contact.email": [
"[\"[email protected]\", \"[email protected]\"]"
],
"status": [
"[MinKey, MaxKey]"
]
}
}
},...

Well, the plan is not lying. MongoDB does indeed choose only one of the query terms to serve as the value sought in the index. It evaluates the selectivity of the terms in the query, and picks one it deems most suitable.

This still doesn’t answer the question of why it doesn’t compare the other terms against the index either. Which brings up the topic of index intersection. MongoDB introduced an implementation of index intersection at some point, then walked back its defaults to a degree that pretty much prevents intersection from ever being exercised. It turns out that the performance gains were rather low for the intersection candidates compared to other strategies. Worse: the planner would often pick intersection in cases where other strategies would have performed way better. In other words: index intersection turned out to be the wrong answer too often, and would actually hurt performance.

But this is a single index, you say - wouldn’t intersection make sense in this particular wildcard scenario? The way I like to think about it, a wildcard index is actually several indexes in a trench-coat pretending to be a single index. Each wildcard field value is processed into its own index structure, but all of those structures are stored together as one. A hint to this notion is that the planner outputs the $_path term, stating which wildcard field within the index is targeted.

The addition of non-wildcard fields to the index would then shape the key prefixes to contain the additional non-wildcard value so that the index is more selective and the (theoretical) need for intersection is reduced.

Conclusion

Compound wildcard indexes are a welcome addition to MongoDB 7.0. Having no intersection mechanism doomed many queries to only partially use indexes and resort to loading documents based on a single field, then filtering them further. With this feature, the number of documents loaded can be reduced. Reducing I/O should help many workloads. Whether it does for your workload is subject to - as always - experimentation and monitoring. Give it a try, hope it helps!

For limitations and further details, see Compound Wildcard Indexes

Increase Security by Limiting Access to Atlas

Security is a huge topic of immense importance. Ironically, it’s all too common to sacrifice security for the sake of convenience.

This is often the case when it comes to network access security: we are quick to open up network access “just for now”, but fail to close things down later. This increases the exposure surface into the protected resource (A MongoDB Atlas cluster for example) indefinitely. But it doesn’t have to be that way.

TL;DR: Use expiring network access permissions for IPs instead of permanent ones.

How to Make it Better

For a while now, Atlas has had a simple measure that helps clean up temporary access grants. Instead of the network access permission lasting until you explicitly delete it, you can set it to expire automatically and be removed.

Flip the toggle at the bottom left to enforce a time limit on the access grant. Pick a time frame: 6 hours, one day, or a week - the shorter the better. Atlas will automatically clean up the entry and close the IP access after that duration elapses.

Add IP Access List Entry

Some might ask “But what if I want to keep it open longer?”

Here’s my cheat-sheet:

IP Address Space Access Type Auto-Expire?
A. Private (10. …, 192.168. …) Production Application No
B. Private (10. …, 192.168. …) Interactive Yes!
C. Public IP / Block Interactive Yes!
D. Public IP / Block Production Application Yes!

This cheat sheet is of the paranoid opinion “Don’t allow public IP access to Atlas.” But paranoia is good for security.

Here are some supporting thoughts. They all share the following core belief:

Sacrifice convenience to gain security

With that in mind, let’s talk about interactive vs. application access.

Interactive Users

An interactive user is a human running MongoDB Compass, the shell, or a program during development.

This is the danger zone. “I just need to do my job now“ motivates allowing such traffic. But tomorrow is not now anymore… better expire it. If you are the grantor (you have an Atlas role allowing you to add the network entry), then you can just as well grant yourself permissions again after expiration. Sacrifice convenience.

If the interactive user does not have access-control privileges, the nuisance of provisioning access is higher. But then the question arises: why is access to Atlas done over the public IP?

  1. No peering? Ask to set it up.
  2. No private-link? Ask to set it up.
  3. Atlas in different zone? Ask to set up peering.
  4. Private IP block? Expire anyway.

Why expire an interactive IP anyway? Well, because it provides another surface for would-be attackers. Interactive users have potential OS exposure to compromise via channels such as email, malware, random downloads and just plain old neglect.

For Applications

An application is a program or process running unattended.

Seek to route traffic via either peering or a private link. This would turn the traffic into non-public traffic, and therefore fall under use case A in the cheat sheet above.

Auto-Expire in Other Ways

Alternatives to built-in auto-expiration have existed for a while. Shun the “I’m personally responsible, so it’s OK for me to have permanent access” argument.

If the DBaaS or DB doesn’t auto-expire entries, a good sys-admin or DevOps practice is to automate tearing down old entries. Still valid: the Atlas management API allows for creating and removing network access entries, so that’s a viable option.

Manual “expiration” entails manual auditing and manual removal of such entries. This is both error-prone and less reliable in general. But if that is all your organization is willing to do for security - so be it. Chalk that one up as “acceptable within parameters”.

Permanent Access

As you can surmise by now, the approach in this post strongly shuns permanence, especially for public IP space.

What about permanence within private space? Those entries should also be expired! But in this case, automatic expiration is a bit trickier. A production application losing connectivity is a risk not many would take. There can be arrangements that either audit and expire entries with no observed traffic, or renew/re-provision access on a cadence as long as some sign of life is detected. But such automation is a chance for error, with higher risk to the availability of the production application. So let practicality win here: revert to manual auditing of such entries. Yes, they are internal traffic, but still another open vector.

Bottom line: Your data is priceless, consequences are high - keep it as safe as possible. Sacrifice convenience to gain security.

Common Rollup $unionWith and $merge

TL;DR

Schedule periodic rollup and aggregation on historic data using $merge into a collection, then run ad-hoc reports on cumulative live data for the current period, $unionWith historic data.

WUT???

We often need to produce reports summarizing sales over time which include the current partial period, such as “sales this year to date”. While it’s perfectly fine to run ad hoc queries on the data set, this can become a performance issue when report frequency increases, data sets become extremely large, or the periods are very long (e.g. all sales since inception).

It’s been common practice to pre-aggregate historic data and persist it to some collection. This lets us query only the most recent yet-to-be-rolled-up period, and add it to the historic data queried from the cumulative collection. Trivial concept, nothing terribly new here. So where’s the catch? Details (it’s always the details…):

sequenceDiagram
    autonumber
    participant a as Query
    participant ms as [monthlySales]
    participant s as [Sales]

    a->>+ms: January - July?
    ms-->>-a: <>

    a->>+s: Current Month?
    s-->>-a: <>

    a-->> a: Union

The implementation of this technique relies on two high-level operations:

  1. A periodic calculation to populate the historic data.
  2. An ad-hoc query over both historic and current data.

Historic Data

The technique used with MongoDB to persist historic data used to be mapReduce in the old days. Very old days. Using mapReduce allowed you to persist the results into a collection all server-side. Before or without that, you’d need to have client code computing a result-set, then turning around and writing it explicitly back into MongoDB.

With the introduction of the Aggregation Framework, more efficient query operators arrived, as well as the $out pipeline stage. The issue with that, though, was that $out did not play very well with sharded clusters, and did not allow for incremental data insertion. Instead, $out wiped the output collection clean and replaced its content with the new data. This implied that for a long history, you would have to compute again over a growing number of documents - not very efficient, and it exacerbates the problem we were trying to avoid in the first place. You could of course run a query across just a year, say, and store it into a yearly collection. But then summing up across years would require more and more collections to be queried independently by the client, and you’d end up with many more collections just to handle this case. Not elegant.

Now with $merge the story becomes manageable again. $merge lets you perform incremental updates to an existing collection, so your historic data can largely remain as is, save for the newest addition or update to the latest period.

In the example below, the sales collection contains one document per sale, with a total and a date field:

{ "date" : ISODate("2021-08-10T23:54:01.028Z"), "total" : NumberDecimal("178.4349") }
{ "date" : ISODate("2021-08-10T23:48:37.671Z"), "total" : NumberDecimal("17.6805") }
{ "date" : ISODate("2021-08-10T23:45:37.238Z"), "total" : NumberDecimal("135.4978") }
// etc.

To produce the sum of sales and the count of sales across August, we just run an aggregation, writing the result into the monthlySales collection:

db.sales.aggregate([
{$match: { date: {$gte: ISODate('2021-08-01'), $lt: ISODate('2021-09-01')}}},

{$group: { _id: ISODate('2021-08-01'), total: {$sum: '$total'}, saleCount: {$sum:1}}},

{$merge: { into: {db: 'test', coll: 'monthlySales'}}}
])

The shape of the document in monthlySales is verbatim the document(s) produced by the pipeline stage preceding $merge:

// db.monthlySales.findOne()

{ "_id" : ISODate("2021-08-01T00:00:00Z"), "saleCount" : 1613, "total" : NumberDecimal("815527.3523") }

What happens though if it is still August? A previous value for August may already exist. We have to address what would be done in case a document with the same _id is not present, as well as what to do if it is.

When a document is not present, $merge defaults to inserting it. But you can also tell it to discard the new document, or produce an error. For example, you may want to prevent accidental creation of documents in the target collection - preventing arbitrary items from being created unless some approved, well-known document already exists. The full universe of choices is expressed via the whenNotMatched field, with the value being one of insert, discard, or fail. Note that in the case of fail, other documents may have already been inserted, so the state is somewhat non-deterministic - it is not wrapped in a transaction.

This example shows silently discarding any documents that $merge found no match for in the target collection. Someone would have had to create a document for August 2021; otherwise the result for August is effectively ignored by $merge and nothing will be written.

{$merge: { 
into: {db: 'test', coll: 'monthlySales'},
whenNotMatched: 'discard'
}}

When a document is matched, you have a few more choices. The default is to replace the existing document. The full list of options: replace, keepExisting, merge, fail, and pipeline. You can choose to keep the existing document to preserve things as they were. Replace and fail are pretty straightforward as well.

Specifying whenMatched: 'merge' will slam fields from the new document into the existing document.

Document Value
Existing {_id:1, name: 'bob', age: 33}
New Document {_id:1, age: 34, height: 1.74 }
Resultant Document {_id:1, name: 'bob', age: 34, height: 1.74 }

An existing field’s value is replaced and previously non-existing fields will be created. Fields not present in the new document which existed before are left as is.
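
In the $merge stage, that behavior is selected with the whenMatched option - for example:

{$merge: {
into: {db: 'test', coll: 'monthlySales'},
whenMatched: 'merge'
}}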

That’s good for having multiple separate calculations affecting some destination fields. But for incremental update, we’ll be looking to combine the value of the existing field in the target collection with a value from the new calculation on the origin collection.

Given the document in an origin collection, and a corresponding document already in the destination collection:

db.origin.insert({ _id:1, x:2})
db.destination.insert({ _id:1, x:40})

We can now set the destination document’s field x to the sum of its existing value and the value of the field x coming from the preceding pipeline stage ($match, for simplicity, here):

db.origin.aggregate([{$match: {}}, {$merge: {into: 'destination', whenMatched: [{$set: {x: {$add: ["$x","$$new.x"]}}} ]}}] );

db.destination.find();
{ "_id" : 1, "x" : 42 }

Using the special pipeline variable $$new, we can distinguish the incoming field x from the existing field x that was in the destination document.

The pipeline constructed to produce the merge of existing documents can only use the stages $addFields, $set, $project, $unset, $replaceRoot, or $replaceWith. This mini-pipeline only ever sees a single input document from the destination collection - the one that matched - so other pipeline operators really don’t make a lot of sense. Only single-document-modifier types are allowed - or at least that’s how I make sense of it.

Union Queries

With all this syntax background, let’s turn back to our original problem. We want cheaper queries over historical periods, yet combine them with recent on-the-fly values. Storing the past data can be done at some frequency - monthly, weekly, or daily. Given that we can incrementally update the destination collection, we’re free to pick a frequency that makes sense.

To combine documents for a partial month though, we’ll need to query two collections: the historic sales in the monthlySales collection, and the results of an on-the-fly month-to-date query against sales. It would be nice to do it in one command to Mongo. Historically, we would shoot two queries and combine the results client-side. But that’s extra code someone had to write, and it makes populating generic graphs more involved.

With the introduction of $unionWith, a single aggregation command can now return results from more than one collection! Harnessing this for our scenario, we’ll query the “main” collection sales for the current month-to-date, then union the results with a pipeline over monthlySales for the already computed documents since the beginning of the year:

db.sales.aggregate([
{ $match: { date: { $gte: ISODate('2021-08-01'), $lt: ISODate('2021-09-01') } } },

{ $group: { _id: ISODate('2021-08-01'), saleCount: { $sum: 1 }, total: { $sum: '$total'} }},

{
$unionWith: {
coll: 'monthlySales',
pipeline: [
{ $match: { _id: { $gte: ISODate('2021-01-01'), $lt: ISODate('2021-08-01') } } },
]
}
},

{ $sort: { _id: -1 } }
])

{ "_id" : ISODate("2021-08-01T00:00:00Z"), "saleCount" : 1613, "total" : NumberDecimal("815527.3523") }
{ "_id" : ISODate("2021-07-01T00:00:00Z"), "saleCount" : 5255, "total" : NumberDecimal("2615967.9695") }
{ "_id" : ISODate("2021-06-01T00:00:00Z"), "saleCount" : 5134, "total" : NumberDecimal("2540986.0768") }
{ "_id" : ISODate("2021-05-01T00:00:00Z"), "saleCount" : 5294, "total" : NumberDecimal("2680115.7638") }
{ "_id" : ISODate("2021-04-01T00:00:00Z"), "saleCount" : 5083, "total" : NumberDecimal("2532271.9124") }
{ "_id" : ISODate("2021-03-01T00:00:00Z"), "saleCount" : 5270, "total" : NumberDecimal("2647041.1003") }
{ "_id" : ISODate("2021-02-01T00:00:00Z"), "saleCount" : 4809, "total" : NumberDecimal("2424930.7338") }
{ "_id" : ISODate("2021-01-01T00:00:00Z"), "saleCount" : 5252, "total" : NumberDecimal("2608231.7712") }

Mic drop. A single query processed all on the server side, giving us everything we need in one shot.

Which leaves us with the small matter of granularity. How much of the work do we want to do on the fly vs. just digging up the computed data?

If we populate monthlySales after the end of each month, we’ll need to query up to 31 days’ worth of data on the ad-hoc side. If we update monthlySales daily, we’ll have less than a day’s worth to sum on the fly, with the onus of determining exactly what cutoff window separates the current data from the already-merged data. Whatever we choose, we’ll just need to make sure the time-window boundaries don’t overlap, so we don’t count the same sale too many or too few times.
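
For example, a daily rollup job might fold yesterday's sales into the current month's document incrementally. A sketch, assuming the same sales and monthlySales collections as above, with hard-coded dates for brevity:

db.sales.aggregate([
// yesterday's sales only
{ $match: { date: { $gte: ISODate('2021-08-10'), $lt: ISODate('2021-08-11') } } },

// roll them up under the month's _id
{ $group: { _id: ISODate('2021-08-01'), total: { $sum: '$total' }, saleCount: { $sum: 1 } } },

// fold into the existing monthly document, adding to whatever is already there
{ $merge: {
into: { db: 'test', coll: 'monthlySales' },
whenMatched: [ { $set: {
total: { $add: ['$total', '$$new.total'] },
saleCount: { $add: ['$saleCount', '$$new.saleCount'] }
} } ],
whenNotMatched: 'insert'
} }
])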

Getting There

To round things off, you might want to start accumulating pre-calculated data but already have a slew of individual documents in your live collection. You could write some script client side to iterate the months and years of the past. Or you could run a query to populate the monthlySales like so:

db.sales.aggregate([
{
$group:
{
_id: { $dateFromParts: { year: { $year: '$date' }, month: { $month: '$date' } } },
total: { $sum: '$total' },
saleCount: { $sum: 1 }
}
},
{
$merge:
{
into: { db: 'test', coll: 'monthlySales' }
}
}
])

There are more nuances, bells, and whistles that come along with $unionWith and $merge, but I think the above techniques already cover a good portion of the historic-plus-current type scenarios. You will of course need a recent enough MongoDB version - $merge was introduced in 4.2, and $unionWith in 4.4.

Time Series Collections - Native Support

Time Series & Mongo

Time series data is a fairly common use case and has been “solved” before. Indeed, in blog posts around the net as well as in my course on schema design and data modeling, a common approach to storing such data was discussed at length.

But not everyone knows that bit of common wisdom. Further, it’s easy to get caught up in the storage details and make silly little mistakes that wreck the exercise.

Someone in MongoDB had the brilliant idea to simplify this further (I should buy them a drink of their choice if I find out who it was). Instead of you painstakingly bucketing each “reading” or time series item, then un-bucketing it or peppering your queries with time-bucket-boundary match clauses - why not provide a new collection type for time series which does the tricky bits for you?

As far as the client is concerned, every document is a stand-alone point in time item which has a timestamp and some measurement - a reading - composed of one or more fields recording your data. If you insert a document, you insert one per reading. You don’t have to push into arrays, or compute any time-window or anything like that.

When reading a document from a time series collection, each point-in-time measurement returns as a single document - not a bucket with an array of readings within a window.

It appears largely un-magical, until and unless you compare it to what you had to do manually before this existed.

The “before” document was a bucket with multiple readings, so for a 12h bucket and, say, 4 temperature readings in the field temperature in a day, you would see something like:


db.sensings.find({timestamp: {$gte: ISODate('2021-09-01'), $lt: ISODate('2021-09-02')}})

> [
{_id: ..., bucketStart: ISODate("2021-09-01T00:00:00Z"), sensorId: 123, readings: [
{temperature: 97.1, timestamp: ISODate("2021-09-01T00:00:00")},
{temperature: 98.6, timestamp: ISODate("2021-09-01T00:06:00")}]
},
{_id: ..., bucketStart: ISODate("2021-09-01T00:00:00Z"), sensorId: 123, readings: [
{temperature: 96.9, timestamp: ISODate("2021-09-01T00:12:00")},
{temperature: 97.2, timestamp: ISODate("2021-09-01T00:18:00")}]
}
]


Using a Time-Series collection, the readings are presented to you individually:


db.sensings.find({timestamp: {$gte: ISODate('2021-09-01'), $lt: ISODate('2021-09-02')}})

> [
{_id: ..., temperature: 97.1, timestamp: ISODate("2021-09-01T00:00:00"), info: {sensorId: 123}},
{_id: ..., temperature: 98.6, timestamp: ISODate("2021-09-01T00:06:00"), info: {sensorId: 123}},
{_id: ..., temperature: 96.9, timestamp: ISODate("2021-09-01T00:12:00"), info: {sensorId: 123}},
{_id: ..., temperature: 97.2, timestamp: ISODate("2021-09-01T00:18:00"), info: {sensorId: 123}}
]

You will note that the bucket granularity is nowhere in the results! You may have chosen buckets spanning seconds (the default), minutes, or hours. But that is a storage choice, and your queries and users don’t need to care. This awesome arrangement lets users just be users: insert readings individually as they occur, read and summarize over time as if they were stored individually. On the surface this matches the business expectation and mental model of periodic measurements written and read as they are.

The internals are what’s nice here. The “collection” presented to the user (sensings in the example above) is actually more of a view provided over internal collections that physically exist but which you do not interact with directly. The backing storage collections are organized to bucket the individual readings into some fixed time window size. If you choose an hours granularity, there can be up to 1 bucket per hour. Multiple readings within the same hour would all fall into the same bucket, and the next hour’s readings will fall into the next bucket. Except now you don’t worry about structuring your document in any special way - Mongo does it for you!

At the time of writing, the internal physical backing collection is created when you declare your special Time-Series collection. For my example of the collection named sensings, Mongo has created a backing collection system.buckets.sensings and stores physical documents under that namespace, in buckets.

show collections

> sensings
> system.buckets.sensings

Oh, yes: you have to declare your special collection. If you have Mongo implicitly create a collection by inserting data into a yet-to-be-created collection, Mongo will assume a normal collection. The syntax requires one thing only - the timeField. I like to also include the metaField and granularity parameters, for reasons expanded on below.

To create my sensings collection, storing at an ingestion rate that fits hours-long buckets, issue the following command:

db.createCollection("sensings", {timeseries: {timeField: "timestamp", metaField: "info", granularity: "hours"}})

The collection named sensings will be created (as well as a corresponding system.buckets.sensings behind the scenes). It treats the field named timestamp on your inserted documents as the time for bucketing purposes. It also lets you add extra non-measurement data or “meta” details in the field named “info”. The collection will optimize bucketing at a granularity consistent with multiple items arriving over a matter of hours. For higher ingestion rates you should choose minutes or seconds.

What is the business with the metaField and granularity about? It lets Mongo optimize the bucket sizes internally and organize storage efficiently. The theory behind creating time buckets is that:

  1. Events are written once
  2. Events are read for a time period
  3. Loading fewer documents that contain all the readings is more efficient than if each reading resided in one physical document.

When doing things manually, we could shove the metadata extras such as a sensor id or patient id into any field. But for Mongo to understand that the metadata is a label and not a metric, and to enable it to create buckets specific to groups of readings, we have to tell it where we store the metadata. In our case here, I used the info field. In that field, I can store a single value or several different fields. This way, when I query for the average temperature for a particular sensorId, since it is written inside the info field, Mongo knows that and can load buckets for that meta value only. It should try to create a bucket for each time window and for each meta value or combination of values. That way, a query for sensor 123 doesn’t need to read buckets for sensor 456.

But the magic ends here, at least as things stand now: Mongo does not automatically create indexes on the meta fields. It leaves that up to you. Mongo does create indexes on the bucket boundary timestamps on the backing collection, but that is usually not enough for prime time. You will no doubt have some narrower interest in slicing the time series by some of the meta fields. I’ll add a compound index on my sensorId like so:

db.sensings.createIndex({"info.sensorId": 1, "timestamp": 1})

My query can now look to get the average temperature across the devices on a certain day, and leverage the index I just created:

db.sensings.aggregate([
{$match: {timestamp: {$gte: ISODate("2021-09-01"), $lt: ISODate("2021-09-02")}}},
{$group: {_id: "$info.sensorId", avgTemperature: {$avg: "$temperature"}}}
] )

A query chased with an explain shows the index is used:

db.sensings.find({"timestamp" : ISODate("2021-09-01T00:18:00Z"), "info.sensorId" : 123}).explain()

// "inputStage" : {
// "stage" : "IXSCAN",
// "keyPattern" : {
// "meta.sensorId" : 1,
// "control.min.timestamp" : 1,
// "control.max.timestamp" : 1
// },
// "indexName" : "info.sensorId_1_timestamp_1",

The inputStage section shows an IXSCAN leveraging my index named info.sensorId_1_timestamp_1.

What is interesting to compare is the index definitions on the nominal collection sensings versus those on the internal collection system.buckets.sensings.

The nominal collection shows what you expect: an index on the direct structural fields timestamp and the embedded info.sensorId:

> db.sensings.getIndexes()
[
{
"v" : 2,
"key" : {
"info.sensorId" : 1,
"timestamp" : 1
},
"name" : "info.sensorId_1_timestamp_1"
}
]
>

But the backing collection shows a different picture:

> db.system.buckets.sensings.getIndexes()
[
{
"v" : 2,
"key" : {
"meta.sensorId" : 1,
"control.min.timestamp" : 1,
"control.max.timestamp" : 1
},
"name" : "info.sensorId_1_timestamp_1"
}
]

The 2-field index I specified was actually created as a 3-field key behind the scenes. The timestamp - a single field in my nominal documents - was translated into a bucket boundary defined by the control.min.timestamp and control.max.timestamp fields. Makes sense. The raison d’être of this whole thing is exactly that: your nominal collection facade deals with individual reading documents, but the physical storage of the documents is something entirely different and more efficient than the facade lets on. For a lesson on how the more efficient document is actually structured, I inserted a few documents:

db.sensings.insertMany([
{temperature: 97.1, timestamp: ISODate("2021-09-01T00:00:00"), info: {sensorId:123}},
{temperature: 98.6, timestamp: ISODate("2021-09-01T00:06:00"), info: {sensorId:123}},
{temperature: 96.9, timestamp: ISODate("2021-09-01T00:12:00"), info: {sensorId:123}},
{temperature: 97.2, timestamp: ISODate("2021-09-01T00:18:00"), info: {sensorId:123}}
])

Then I chased it with a query on the backing collection system.buckets.sensings:

> db.system.buckets.sensings.findOne()
{
"_id" : ObjectId("612ec2809f952fa914335e8e"),
"control" : {
"version" : 1,
"min" : {
"_id" : ObjectId("6104819959cb4aa1c00f0b9c"),
"temperature" : 96.9,
"timestamp" : ISODate("2021-09-01T00:00:00Z")
},
"max" : {
"_id" : ObjectId("6104819959cb4aa1c00f0b9f"),
"temperature" : 98.6,
"timestamp" : ISODate("2021-09-01T00:18:00Z")
}
},
"meta" : {
"sensorId" : 123
},
"data" : {
"_id" : {
"0" : ObjectId("6104819959cb4aa1c00f0b9c"),
"1" : ObjectId("6104819959cb4aa1c00f0b9d"),
"2" : ObjectId("6104819959cb4aa1c00f0b9e"),
"3" : ObjectId("6104819959cb4aa1c00f0b9f")
},
"temperature" : {
"0" : 97.1,
"1" : 98.6,
"2" : 96.9,
"3" : 97.2
},
"timestamp" : {
"0" : ISODate("2021-09-01T00:00:00Z"),
"1" : ISODate("2021-09-01T00:06:00Z"),
"2" : ISODate("2021-09-01T00:12:00Z"),
"3" : ISODate("2021-09-01T00:18:00Z")
}
}
}

The control fields describe the bucket itself. They provide boundary info, minimum and maximum timestamps for the bucket. The meta field contains my fields that I told Mongo I’d store under the original info field in my inserted documents. In my case, a single field but there could be more. It would make little logical sense for Mongo to have a bucket spanning multiple sensors. So Mongo will find or create a bucket for every differing meta values for a given time period. Finally, the data. Fields not originally contained in the info field as a sub-document and fields which are not the timestamp itself are stored under the data field in backing storage. This includes the temperature - as expected. It also contains a corresponding timestamp that the measurement was taken at. Since every document has an _id in Mongo, the _id of the reading is also stored here. The correlation among the triplet of values is positional, so the first reading in the bucket is spread across data._id[0], data.temperature[0] and data.timestamp[0], the next reading in the bucket is ...[1] and so forth.

Back to my aggregation from before: db.sensings.aggregate([{$match: {timestamp: {$gte: ISODate("2021-09-01"), $lt: ISODate("2021-09-02")}}}, {$group: {_id: "$info.sensorId", avgTemperature: {$avg: "$temperature"}}}] ). Nowhere did I include any positional magic or unwinds or extra filtering logic. MongoDB did the work for me, taking my straightforward mental model of the nominal collection - one reading per document - and doing all the heavy lifting. Life is good.

In the course of writing this post and poking around, I found this bit interesting: when you run collection stats on the nominal collection you get a new section named timeseries:

> db.sensings.stats().timeseries
{
"bucketsNs" : "test.system.buckets.sensings",
"bucketCount" : 0,
"numBucketInserts" : 0,
"numBucketUpdates" : 0,
"numBucketsOpenedDueToMetadata" : 0,
"numBucketsClosedDueToCount" : 0,
"numBucketsClosedDueToSize" : 0,
"numBucketsClosedDueToTimeForward" : 0,
"numBucketsClosedDueToTimeBackward" : 0,
"numBucketsClosedDueToMemoryThreshold" : 0,
"numCommits" : 0,
"numWaits" : 0,
"numMeasurementsCommitted" : 0
}

The new section gives some clues that the optimization and salient metrics to watch have to do with buckets: their management in terms of “opening” and “closing” them with respect to aspects such as the metadata, the number of items, the total size of the bucket, and the time boundaries, as well as memory pressure. Pretty interesting, but I have no insight or clue on how to further interpret these numbers. Time will tell (rim-shot, exit stage).

What’s next for Time-Series? I don’t know. Now that the abstraction and use case have been introduced, the team can presumably focus on even more efficient storage or ways to process time-series queries more efficiently. Judging from the past, this is the way things unfold: a feature is introduced, which “makes it a thing”, then future iterations make it even better and more useful. I for one see this as a great new feature.

If it ain't broke - HTTP PATCH it!

Is your document “broken” inside your database? Sure, from time to time a value on a document needs to be updated, because the real world thing it describes changes. But does that mean what you currently have is “broken”? No, you say - it is not broken, just that “field xyz needs to change value”. Or “field xyz is no longer needed”.

Fine then: the document structure - the schema if you will - is still sound.

Where am I going with this? If the document is not broken, then you should not “touch” the whole document. You should use a surgical update, one that only sends a command to modify a field inside the document.

In MongoDB, the update command takes 2 forms:

  1. Document Replacement (AKA: “Slam”)
  2. Surgical Field Manipulation

In the Document Replacement form, you supply the update command with a full document value which will replace the whole document in the database. Doing so will logically target a single document (by _id or some alternate key) and supply the future document in its entirety.

db.peeps.update({_id: 'superbob'}, {name: 'bob', likes: ['coffee','bacon']})

The update above slams in a new document with only the _id, name, and likes fields remaining. Beyond losing any previous document field values, this update also assumes you intended to replace both the name and the likes. This may be true. But usually it’s just that in order to update one thing, such as adding ‘bacon’ to the list of things ‘superbob’ likes, you had to include previously unchanged values such as the name. Two problems here: one, you need to read the value from the database ahead of the update; two, the value you just read may be stale - updated by someone else - by the time you write back into the database.

The second - and my preferred - way to update a document is using surgical updates. If all I want to do is add ‘bacon’ to the list of likes, I can issue the command:

db.peeps.update({_id: 'superbob'}, {$addToSet: {likes: 'bacon'}})

This form will

  1. Add ‘bacon’ to the likes field if it doesn’t already have ‘bacon’ in it
  2. Not touch the name field, or any other field in the document
  3. Not require you to read the document before you issue the update, since you will not touch other fields.

Which brings me to the point of this post: What about REST?

REST APIs use HTTP methods to represent the action taken. The usual suspects are GET, POST, PUT, and DELETE. The least controversial one is probably DELETE. GET gains some bad-boy street cred over whether queries should be specified solely using the URI path or query variables. POST and PUT generate lots of lengthy distinction discussions: whether PUT should or can create original objects, and whether in the context of databases it is permissible to return the value of the created or modified object in full, or at least its id or URI. Lots of discussions. But my focus here is on update. I’d like to avoid having to query the current document value, and I’d like to only issue a change for a subset of the fields on my entity - umm, document.

HTTP’s PATCH method aims to do exactly that. It lets a caller supply a future state of some of the fields, against a background of the existing, current entity. The docs state:

“The difference between the PUT and PATCH requests is reflected in the way the server processes the enclosed entity to modify the resource identified by the Request-URI. In a PUT request, the enclosed entity is considered to be a modified version of the resource stored on the origin server, and the client is requesting that the stored version be replaced. With PATCH, however, the enclosed entity contains a set of instructions describing how a resource currently residing on the origin server should be modified to produce a new version.”

This has “Surgical Update” written all over it, and maps naturally to the intent of the surgical update. And with MongoDB, some operators such as $addToSet, $set, and $unset are idempotent.

Where’s the rub?

PATCH is not always implemented as a surgical update. Some API frameworks don’t explicitly support PATCH, and others implement PATCH as a PUT (read: “slam” semantics) instead of the intended surgical semantics.

Some, like Mongoose, support the surgical update semantics explicitly with the Document.prototype.update() function, or by collecting single-field assignments to an existing document and issuing a concise update. Careful here! Some frameworks* will just load the document first, even though PATCH is supposed to handle things in a more efficient way. Other blogs* might advise you to build it yourself in the same way. This is because Mongoose, by default, tries to entity-track the state of the document in memory, then does the math to update or create the document backing the entity.

The culprit here is not the frameworks, actually - it’s the consumer. Uninformed or under-curious consumers of the frameworks will cause extra back-end round trips, concurrency issues, or even data loss. And though there are applications that would not be horribly impacted by these nuanced issues, I don’t like to risk it.

Resolving this requires that you build an update outside of the document instance itself. Create an update command and issue it directly to the underlying database. If using something like mongoose, this means calling updateOne() at the Model level, something along the lines of the code below. It does not have to find() or save() anything ahead of updating the document, and therefore saves that round trip. Consequently, it also makes no assumptions as to the current values of other fields, and leaves any unmentioned field values intact.

// ... in your handler code:

// Get id and patch values from your REST somehow
const id = req.params.id; // or whatever your REST API gives you
const patchValues = JSON.parse(req.body); // eg: { age: 23}

// ... validate somehow, then:
const updateResults = await Person.updateOne({ _id: id }, patchValues);

console.log(updateResults);
// { n: 1, nModified: 1, ok: 1 } ... or something to that effect.

This handles the simple cases. To build more elaborate manipulation such as adding/removing array items or nested fields, you would want to explore more expressive PATCH parsing so that nuanced intents are clearer. You may want to take a look at jsonpatch.js for inspiration.
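
As a rough illustration (not a full RFC 6902 implementation), a couple of JSON Patch style operations could be translated into a surgical update document along these lines; patchOpsToUpdate is a hypothetical helper, not part of any framework:

// Maps 'replace'/'add' ops to $set and 'remove' ops to $unset.
// A real implementation would validate paths and handle arrays, moves, tests, etc.
function patchOpsToUpdate(ops) {
  const update = {};
  for (const op of ops) {
    const path = op.path.replace(/^\//, '').replace(/\//g, '.');
    if (op.op === 'replace' || op.op === 'add') {
      (update.$set = update.$set || {})[path] = op.value;
    } else if (op.op === 'remove') {
      (update.$unset = update.$unset || {})[path] = '';
    }
  }
  return update;
}

// patchOpsToUpdate([{ op: 'replace', path: '/age', value: 23 }])
// => { $set: { age: 23 } }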

Why doesn’t update itself clarify the semantics? AFAIK, for historic reasons. To remedy this, official MongoDB drivers expose a replaceOne command (exact naming depends on the language), which more precisely describes the operation when you intend to slam the existing document, leaving update and updateOne to convey surgical updates. This makes programs read better, but it does not prevent you from issuing a slam using update, so you will need to take care of that yourself, especially since the update signature still allows “slam” values to be provided.
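
Side by side, using the earlier example, the intent then reads clearly from the call site:

// Explicit "slam": the whole document is replaced
db.peeps.replaceOne({_id: 'superbob'}, {name: 'bob', likes: ['coffee','bacon']})

// Surgical: only the likes array is touched
db.peeps.updateOne({_id: 'superbob'}, {$addToSet: {likes: 'bacon'}})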

Hopefully, this post revealed nothing new. But if you are unsure how your REST API maps to backend DB mutation commands, it’s worth taking a look.

* Examples provided to illustrate the topic of this post only. They are not in any way meant to offend or diminish the contribution of respective referenced codebase authors.

Positional Array Indexing in MongoDB

Overview

Did you know that MongoDB lets you index a specific positional element in an array? Well, it does! I stumbled across this feature while reading the mongod source code during an unrelated feature exploration. It can help you create more efficient indexes for specific scenarios.

How it Works

What is that good for you ask? Consider the following scenario:

Your company makes widgets as kits. Each widget is made of parts - each of which can be also sold separately. When a customer buys a “product” they are actually buying a box of items - a list of parts where the first one is by convention the “widget” and the rest are the parts of that widget. The schema for the things collection might look like this:

{  
_id: 123,
name: 'Widget A',
kit_sku: 'jk',
"parts": [
{name: 'widget a', sku: 'abc' },
{name: 'hex key', sku: 'xyz' },
{name: 'obscure manual in Klingon', sku: 'qwe' },
{name: 'widget stand', sku: 'cbc' },
]
}

The schema above has a field kit_sku which contains a unique selling id that folks can order. The parts list all items in the kit. The company sells other kits featuring the same widget just with differing extra parts. For example, there might be an obscure manual in Klingon for one kit, and a different kit with instructions in ancient Phoenician (you know, because somebody called and complained that they need one).

The catalog application needs to call up the widget “widget a”. Except it doesn’t, really. It knows that the widget is sold under the sku abc in a variety of kits, and that the widget is, by convention, the first item in the array. So the kit_sku is not useful here.

The query could look something like this:

db.things.find({ 'parts.0.sku': 'abc' })

This is a frequent enough query, so we’d want to support it with an index. But while indexing on 'parts.sku' will produce a usable index, it is not necessarily optimal. Consider that you have a long list of parts for each kit across a large collection. The index would have to contain a key for every part, and point to all documents. This is both wasteful and misses the point. It is wasteful because memory and disk will need to contain entries that are not useful to the user - we only ever look for the first element in the parts list. It misses the point for the same reason: we create this index only to satisfy queries interested in the first element of the array, not any part in the list.
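
For contrast, the whole-array index would be created like this - a multi-key index with a key for every element's sku:

db.things.createIndex({'parts.sku': 1})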

You can create an index that includes a positional path though. Here:

db.things.createIndex({'parts.0.sku':1})

The index created above includes the positional marker 0, telling mongo to only index the sku of the first element of the parts array!

Given the index above, running our query will make use of this index, as long as we use the path to the first element.

db.things.find({ 'parts.0.sku': 'abc' }).explain()

The query plan looks something like this:

       "queryPlanner" : {
...
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"parts.0.sku" : 1
},
"indexName" : "parts.0.sku_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"parts.0.sku" : [ ]
},
...
"direction" : "forward",
"indexBounds" : {
"parts.0.sku" : [
"[1.0, 1.0]"
]
}
}
},
"rejectedPlans" : [ ]
},...
"ok" : 1
}

The plan above shows a few reassuring features. First and foremost, an index is used for our query, as evidenced by the winning plan’s "IXSCAN" indicator. This is good.

Second, note that "isMultiKey" reports false. When you index an array field, MongoDB creates a multi-key index. A multi-key index includes a pointer to the document for each element in its array, hence the name. A non-multiKey index contains only one pointer to a given document from a single key, since the indexed field contains only one value. This affects query processing because with a multiKey index, mongo needs to work a bit harder to eliminate duplicates. TL;DR: when you use a positional index on an array element, it is not a multi-key index.

The third thing to note is the path. The index will be useful for queries that include the specific path "parts.0.sku". Other queries that do not include the positional path part will NOT use this index:

db.things.find({ 'parts.sku': 'abc' }).explain()
// ...
// "parsedQuery" : {
// "parts.sku" : {
// "$eq" : "abc"
// }
// },
// "winningPlan" : {
// "stage" : "COLLSCAN",
// "filter" : {
// "parts.sku" : {
// "$eq" : "abc"
// }
// ...

The path used in the query above did not target the first positional element of the array, so the query planner chose a full collection scan "COLLSCAN". It would not be correct to use the positional index here, because the general query asks whether any element in the array has sku “abc”. The positional index we created does not include entries for all elements, so it would be “missing” documents if mongo were to use it for this query.
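
If the application also needs the any-element form of the query to be fast, nothing prevents keeping a conventional index alongside the positional one. A sketch - and note this second index carries the multikey cost discussed earlier:

// Index every element's sku; MongoDB will mark this index as multikey
db.things.createIndex({ 'parts.sku': 1 })

// The general query can now use an IXSCAN (with isMultiKey: true) instead of a COLLSCAN
db.things.find({ 'parts.sku': 'abc' }).explain()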

Conclusion

If your query targets a specific known positional array element, this indexing strategy can buy you a performance boost. This index will be smaller than one on a whole array and will be a non-multiKey one, reducing the work the server has to do to process your query.

MongoDB’s support for this and other interesting indexing strategies is pretty phenomenal. This gives DBAs and developers finer surgical indexing capabilities to support a wide range of schema and query scenarios.

Unwind MongoDB Array - A Pivotal Pipeline Stage

When your document contains an array, $unwind is the aggregation pipeline stage that you most likely need. $unwind lets you un-pivot array elements, so that you can perform specific operations on each (or some) array elements independently.

Why Unwind?

In relational databases, we used to need to pivot rows into columns. Say we had an IoT readings table like this one:

| Device | Measure | Reading | Time |
| --- | --- | --- | --- |
| 1 | 'Temp' | 42 | 1:00 |
| 1 | 'Speed' | 91 | 1:00 |
| 2 | 'Speed' | 94 | 1:00 |
| 2 | 'Temp' | 33 | 1:00 |
| 2 | 'Temp' | 37 | 2:00 |

And we want a query to return results such as:

| Device | Speed | Temp | Time |
| --- | --- | --- | --- |
| 1 | 91 | 42 | 1:00 |
| 2 | 94 | 33 | 1:00 |
| 2 |  | 37 | 2:00 |

We can achieve that with a SQL pivoting function. The need to pivot stems from the model, in which each “soft property” is stored as its own row in the data table. We didn’t want to create a column for each measure, and we didn’t want to add a column to the table each time a new measure was needed.

In MongoDB (and any other document-oriented database), the model for a single reading document might look more like the one below. Each document is a point in time that contains a bunch of measures. If a new measure is ever added - no problem! An extra element can be added to the array with no schema change. If a device reported more or fewer measurements at a point in time - no problem either. The array contains a set of readings which can grow or shrink.

{
  device: 1,
  time: ISODate('2019-01-31T01:00:00Z'),
  readings: [
    { name: 'speed', value: 91 },
    { name: 'temp', value: 42 }
  ]
}

With this document model, we would not need to do anything special for the report of readings by hour. The document already contains relevant measurements for each time period.
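
For example, pulling up everything measured at 1:00 is a plain find - a minimal sketch against the same readings collection:

// Every device's readings document for the 1:00 time slot,
// with all of its measurements already embedded in the readings array
db.readings.find({ time: ISODate('2019-01-31T01:00:00Z') })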

But if we wanted to calculate the average temperature across the devices, we’d need to do some digging. We don’t want to average every readings.value field - only the ones labeled 'temp'.

Using the aggregation framework and $unwind we can do this in 3 steps:

  1. Unwind the readings array.
  2. Filter out anything except 'temp'.
  3. Perform $avg on the values.

How to Unwind?

Let’s do this step by step. First, just unwind the array:

db.readings.aggregate([
{$unwind:'$readings'}
]);

/*
{ "_id" : ..., "device" : 1, "readings" : { "name" : "speed", "value" : 91 }, "time" : ISODate("2019-01-31T01:00:00Z") }
{ "_id" : ..., "device" : 1, "readings" : { "name" : "temp", "value" : 42 }, "time" : ISODate("2019-01-31T01:00:00Z") }
{ "_id" : ..., "device" : 2, "readings" : { "name" : "speed", "value" : 94 }, "time" : ISODate("2019-01-31T01:00:00Z") }
...
*/

The result of $unwind is that for each element of the original document’s readings array, a new document is produced that contains all the fields of the original document, but with readings set to just that one element. As you can see above, after $unwind the readings field holds a single sub-document rather than the original array. Device 1 had 2 elements in its original document, so the $unwind output contains 2 documents, each carrying one element from the original array: one document holds the reading for the measurement named speed, the other holds the reading for the measurement named temp.

Next step - we want only the temperature readings, getting rid of any speed or other measurements that may exist. We can do this with the $match pipeline stage. $match filters documents and only emits ones that match the criteria supplied - just like find() does.

db.readings.aggregate([
{ $unwind: '$readings'},
{ $match: { 'readings.name': 'temp' }}
]);
/*
{ "_id" : ..., "device" : 1, "readings" : { "name" : "temp", "value" : 42 }, "time" : ISODate("2019-01-31T01:00:00Z"), }
{ "_id" : ..., "device" : 2, "readings" : { "name" : "temp", "value" : 33 }, "time" : ISODate("2019-01-31T01:00:00Z"), }
{ "_id" : ..., "device" : 2, "readings" : { "name" : "temp", "value" : 37 }, "time" : ISODate("2019-01-31T02:00:00Z"), }
...
*/

This got rid of any measurements not labeled temp. With only temperature values in hand, we can now calculate the average temperature. The $group pipeline stage will let us calculate an average for each device:

db.readings.aggregate([
{ $unwind: '$readings'},
{ $match: { 'readings.name': 'temp' }},
{ $group: { _id: '$device', averageTemperature: {$avg: '$readings.value'}}}
]);

/*
{ "_id" : 2, "averageTemperature" : 35 }
{ "_id" : 1, "averageTemperature" : 42 }
*/

The behavior of $unwind has evolved over time, matching common usage patterns and expectations the community found useful. Several scenarios and special cases arise when unwinding arrays:

  • What if the document contains a field with a single value - not an array?
  • What if the array has no elements?
  • What if the document contains no such field at all?

All of these need to be handled in some way, and understanding the behavior will help you avoid surprising results.

Since MongoDB version 3.2, $unwind treats a target field containing a scalar or other non-array value as if it were a single-element array containing that one value. These two documents are treated the same way by $unwind:

{
device: 3,
readings: { name: 'temp', value: 29 },
time: ISODate('2019-01-31T03:00:00Z')
}
{
device: 4,
readings: [ { name: 'temp', value: 29 } ],
time: ISODate('2019-01-31T03:00:00Z')
}

Device 3 above has a single reading in the readings field - not an array. Device 4 has a single-element array for readings. Since $unwind treats them the same, we don’t need fancy $project and $cond checks to normalize input documents (Squee!). If we added these documents to the readings collection and ran the same 3-stage pipeline as before, we’d get the results we expect. Prior to mongo 3.2, a non-array field handed to $unwind would have produced an error.

db.readings.aggregate([
{ $unwind: '$readings'},
{ $match: { 'readings.name': 'temp' }},
{ $group: { _id: '$device', averageTemperature: {$avg: '$readings.value'}}}
]);

/*
{ "_id" : 3, "averageTemperature" : 37 }
{ "_id" : 2, "averageTemperature" : 35 }
{ "_id" : 4, "averageTemperature" : 29 }
{ "_id" : 1, "averageTemperature" : 42 }
*/

The Nitty Gritty

Now what about a missing field, an empty array, or a field containing null? Here are examples of these documents:

{ device: 10, time: ISODate('2019-01-31T06:00:00Z') } 
{ device: 10, readings: null, time: ISODate('2019-01-31T07:00:00Z') }
{ device: 10, readings: [], time: ISODate('2019-01-31T08:00:00Z') }

If we ran $unwind on those documents, none of them would be emitted past $unwind. The example below passes the documents with device == 10 to $unwind, but since none of them has a “useful” readings value to attach to an output document, none is emitted:

db.readings.aggregate([
{ $match: { device: 10}},
{ $unwind: '$readings'},
])
// No documents returned

This is just the default behavior though. The word “useful” reflects our choice: in our case we don’t want an empty or missing reading to pollute the averaging math. Other use cases exist where you actually want every document entering $unwind to be emitted forward, even if it doesn’t have an array element to contribute. If you want that behavior, you can set the option preserveNullAndEmptyArrays to true. To supply this option, you need to use a slightly different argument to $unwind. Until now, we gave $unwind a bare value-of-field argument $<field name>. The more elaborate syntax is {$unwind: { path: '$<field name>'}}. The following two forms are identical in meaning:

{ $unwind: '$readings' }
{ $unwind: { path: '$readings' }}

The latter expression is where we can add our preserveNullAndEmptyArrays option. Putting it all together looks like this:

db.readings.aggregate([
  { $match: { device: 10 }},
  { $unwind: {
      path: '$readings',
      preserveNullAndEmptyArrays: true
  }}
])

/*
{ "_id" : ..., "device" : 10, "time" : ISODate("2019-01-31T06:00:00Z") }
{ "_id" : ..., "device" : 10, "readings" : null, "time" : ISODate("2019-01-31T07:00:00Z") }
{ "_id" : ..., "device" : 10, "time" : ISODate("2019-01-31T08:00:00Z") }
*/

This time, one document was emitted for each of the incoming device 10 documents. For each of those 3 documents, MongoDB considered that the incoming document had to be emitted at least once. But what value should the output document carry?

In the case of the document containing readings: null, mongo emitted a null in the readings field. For the document with the missing field and the one with an empty array, $unwind emitted a document with no readings field at all. The notion is that a missing field should stay missing - no inventing a field that never existed. As for an existing but empty array, think of it as “a document with no value there”: emitting the field would have required assigning it a value, and null is still a value - it is not the same as having no value at all. If mongo had emitted {readings: null} in response to an empty array, that would be like saying the original document had a null array element - which is not the case.

Bottom line: the behavior is well defined. We can leverage it when we need to, and we can also manipulate the fields further using other pipeline operators.
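
If a downstream stage would rather see a uniform shape, the emitted documents can be normalized right after the $unwind. A sketch using $addFields (available since MongoDB 3.4) and $ifNull; the placeholder sub-document here is purely my own invention, not anything the server produces:

db.readings.aggregate([
  { $match: { device: 10 }},
  { $unwind: { path: '$readings', preserveNullAndEmptyArrays: true }},
  // $ifNull substitutes the placeholder both when readings is null
  // and when the field is absent from the emitted document
  { $addFields: { readings: { $ifNull: [ '$readings', { name: 'none', value: null } ] }}}
])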

Unwind Without $unwind?

The three-step pipeline to calculate the average is not very difficult to read, understand, or produce. But the real need is expressed in the $group stage; the filtering and unwinding stages add verbiage - though not for nothing.

Tempted to shorten the syntax, one might think to use some of the projection operators, such as $filter or $arrayToObject. The $filter example would go something along the lines of:

db.readings.aggregate([
  { $match: { readings: { $type: 4 } } },
  { $project: {
      temperature: {
        $filter: {
          input: '$readings',
          as: 'elm',
          cond: { $eq: [ '$$elm.name', 'temp' ] }
        }
      }
  }}
])

This plucks only the 'temp' elements out of the readings array. It works, but must be preceded by something that ensures readings is actually an array. Unlike the more forgiving $unwind, $filter insists that its input field is an array, or it produces an error. That’s why I included a $match that only passes along documents where that condition holds (see the BSON spec for type information; 0x04 is Array). This seems almost palatable, but it would still require more syntax to turn the filtered array into something we can use directly for the math. Since the amount of text required to achieve the goal is not reduced, and the overall pipeline does not become more readable, I see little reason to go to those lengths.
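
To make that concrete, here is roughly what finishing the job without $unwind might look like. A sketch only - the helper field names (tempValues, totalTemp, countTemp) are mine:

db.readings.aggregate([
  // Only documents whose readings field is an actual array (BSON type 4)
  { $match: { readings: { $type: 4 } } },
  // Reduce each readings array to just the temperature values
  { $project: {
      device: 1,
      tempValues: {
        $map: {
          input: {
            $filter: {
              input: '$readings',
              as: 'elm',
              cond: { $eq: [ '$$elm.name', 'temp' ] }
            }
          },
          as: 'r',
          in: '$$r.value'
        }
      }
  }},
  // Keep a running sum and count per device - the per-element bookkeeping
  // that $unwind would have handed to $group for free
  { $group: {
      _id: '$device',
      totalTemp: { $sum: { $sum: '$tempValues' } },
      countTemp: { $sum: { $size: '$tempValues' } }
  }},
  { $project: {
      averageTemperature: {
        $cond: [ { $eq: [ '$countTemp', 0 ] }, null, { $divide: [ '$totalTemp', '$countTemp' ] } ]
      }
  }}
])

More operators, deeper nesting, and no gain in readability over the three-stage $unwind version.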

Conclusion

$unwind is an important pipeline stage in the document-oriented world. Dealing with arrays and aggregation will undoubtedly get you using $unwind sooner or later. In a previous post we focused on data scenarios where this all made sense. In this post, we focused more on the balance between syntax and the meaning of the stage and related operators. Either way, it’s well worth learning the nuanced behavior of this key pipeline stage.
