+N Consulting, Inc.

Positional Array Indexing in MongoDB

Overview

Did you know that MongoDB lets you index a specific positional element in an array? Well, it does! I stumbled across this feature when reading the source code for mongod for an unrelated feature exploration, and stumbled upon this great feature. This feature can help you create more efficient indexes for specific scenarios.

How it Works

What is that good for you ask? Consider the following scenario:

Your company makes widgets as kits. Each widget is made of parts - each of which can be also sold separately. When a customer buys a “product” they are actually buying a box of items - a list of parts where the first one is by convention the “widget” and the rest are the parts of that widget. The schema for the things collection might look like this:

{  
_id: 123,
name: 'Widget A',
kit_sku: 'jk',
"parts": [
{name: 'widget a', sku: 'abc' },
{name: 'hex key', sku: 'xyz' },
{name: 'obscure manual in Klingon', sku: 'qwe' },
{name: 'widget stand', sku: 'cbc' },
]
}

The schema above has a field kit_sku which contains a unique selling id that folks can order. The parts list all items in the kit. The company sells other kits featuring the same widget just with differing extra parts. For example, there might be an obscure manual in Klingon for one kit, and a different kit with instructions in ancient Phoenician (you know, because somebody called and complained that they need one).

The catalog application needs call up the widget “widget a”. Except, it doesn’t really. It knows that the widget is sold under the sku abc in a variety of kits and that the widget is by convention the first item in the array. So the kit_sku is not useful here.

The query could look something like this:

db.things.find({ 'parts.0.sku': 'abc' })

This is a frequent enough query, so we’d want to support it with an index. But while indexing on 'parts.sku' will produce a usable index, it is not necessarily optimal. Consider that you have a long list of parts for each kit across a large collection. The index would have to contain a key for every part, and point to all documents. This is both wasteful and misses the point. It is wasteful because memory and disk will need to contain entries that are not useful to the user - Only the first element in the parts list is what we’re looking for ever. It misses the point because of the same reason. We create this index only to satisfy queries that are interested in the first element of the array, not any part in the list.

You can create an index that includes a positional path though. Here:

db.things.createIndex({'parts.0.sku':1})

The index create above includes the positional marker 0, telling mongo only to index the sku of the first element of the parts array!

Given the index above, running our query will make use of this index, as long as we use the path to the first element.

db.things.find({ 'parts.0.sku': 'abc' }).explain()

The query plan looks something like this:

       "queryPlanner" : {
...
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"parts.0.sku" : 1
},
"indexName" : "parts.0.sku_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"parts.0.sku" : [ ]
},
...
"direction" : "forward",
"indexBounds" : {
"parts.0.sku" : [
"[1.0, 1.0]"
]
}
}
},
"rejectedPlans" : [ ]
},...
"ok" : 1
}

The plan above shows a few reassuring feature. First and foremost, an index is used for our query as evident by the winning plan’s "IXSCAN" indicator. This is good.

Second, note the "isMultiKey" reports false. When you index an array field, MongoDB creates a multi-key index. A multi index index includes a pointer to the document for each element in its array hence the name. Non-multiKey indexes will only contain one pointer to a given document from a single key, since the indexed field contains only one document. This affects query processing because with multiKey index, mongo needs to work a bit harder to eliminate duplicates. TL;DR : when you use the positional index on an array element , it is not a multi-key index.

Third thing to note is the path. The index will be useful for queries that include the specific path "parts.0.sku". Other queries that do not include the positional path part will NOT be using this index:

db.things.find({ 'parts.sku': 'abc' }).explain()
// ...
// "parsedQuery" : {
// "parts.sku" : {
// "$eq" : "abc"
// }
// },
// "winningPlan" : {
// "stage" : "COLLSCAN",
// "filter" : {
// "parts.sku" : {
// "$eq" : "abc"
// }
// ...

The path used in the query above did not target the first positional element in the array, and therefore the query planner chose a full collection scan "COLLSCAN". It is not correct to use the positional array index because the general query above asks are there any elements in the array with sku “abc”. The positional index we created doesn’t include pointers based on all elements, so it would be “missing” documents if mongo was to use the positional index for this query.

Conclusion

If your query targets a specific known positional array element, this indexing strategy can buy you a performance boost. This index will be smaller than one on a whole array and will be a non-multiKey one, reducing the work the server has to do to process your query.

MongoDB’s support for this and other interesting indexing strategies is pretty phenomenal. This gives DBAs and developers finer surgical indexing capabilities to support a wide range of schema and query scenarios.

Notice

We use cookies to personalise content, to allow you to contact us, to provide social media features and to analyse our site usage. Information about your use of our site may be combined by our analytics partners with other information that you’ve provided to them or that they’ve collected from your use of their services. You consent to our cookies if you continue to use our website.