MongoDB 3.2 goodies coming your way: Partial Indexes

MongoDB includes sparse indexing for a while now.

What is a sparse index? A sparse index is space optimized index which only contains pointers to documents that contain value(s) in the indexed fields.

Lets say our Person collection had a document for each user in our system. All people have to supply a first name. But the prefix is optional. Some people have a “Mrs” or “Prince” or something, but most users don’t:

{ _id: 1, name: 'Jan', prefix: 'Mrs' }
{ _id: 2, name: 'Dude' }

Jan (in our contrived example) has a prefix, but Dude doesn’t. Now if we created an index on the field “prefix”, then an entry for both documents would be created. The key “Mrs” would point to the document with _id 1 (and any other documents with the prefix “Mrs”, and the index key (Null) would point to the document with _id 2 and any other document that happens not to have a prefix field at all, or which has a value of null in the prefix field. This is what inverted indexes do after all. They hold a fast-to-find key value, which in turns contains a list of documents in which that key value exists.

Consider a 10 million user collection where only 5% of users have supplied a prefix. The index will still be responsible to represent 9.5 million documents. That’s wasteful. It takes extra space on disk. It takes space in memory. It takes extra effort to maintain. Waste.

A sparse index simply avoids including the “empty” key. Instead of storing a key with references to all the documents that DON’T have the value, it just doesn’t. This brings to space savings. Less space on disk, less space in memory. Less work to maintain for documents that don’t have that field at all.

A sparse index can be created using the {sparse: true} option, like so:

db.person.createIndex( {prefix: 1 }, {sparse: true });

So what is a partial index? Let’s say our company offered college tours for anyone about to finish high school. Our marketing department is interested in anyone who has 11 to 12 years of schooling. Anything lower and it’s too far in the future. Anything higher, well, they are already in college and not eligible.

To support this, we would maybe create an index on the yearsInSchool field, like so:

db.person.createIndex( {yearsInSchool :1});

But we find out that most of the people are younger, older, or didn’t even supply their years in school number. So that index is going to be a good candidate to be sparse. But the sparse property only handles cases where the field doesn’t exist or is null. People with 9 years or 17 years of schooling will also be in the index. Wouldn’t it be nice to have an index only on documents where the years in school are 11 or 12? That index would be super compact and therefore super efficient in space and usage!

A Partial Index does just that. A partial index allows you to specify a static criteria, and include in the index only keys where the value matches that criteria. So in our example, here’s how we create a partial index to help college-bound candidates:

db.person.createIndex(
{ yearsInSchool: 1},
{ partialFilterExpression: { yearsInSchool: {$gte: 11, $lte:12} }}
);

The partialFilterExpression option lets you supply a criteria using equality, $eq, $gt, $gte, $lt, $lte, $type, $exists and $and (these restrictions are early pre-release) at the top level as the filter. Any document not matching the criteria will be excluded (not indexed) by this index.

Pretty nifty!

But wait, there’s more!

The college tours are only offered to people with a GPA (Grade Point Average) of 3.0 and above. Hey – you want to go to college, better get your grades up! The partial index criteria can include arbitrary fields for the filter. The indexed fields need not be the fields mentioned in the filter expression. So we can have instead:

db.person.createIndex(
{ yearsInSchool: 1},
{ partialFilterExpression: { yearsInSchool: {$gte: 11, $lte:12}, gpa: {$gte:3.0} }}
);

The GPA field is mentioned in the filter but is NOT the indexed field. Now our index can be even more concise and compact, which is very cool.

In fact, we can have filter expressions with none of the indexed fields in them. Or frequent-shopper program is only open to people who shopped with us 3 or more times. And our marketing can send them email periodically. To do that, they need to actually have an email address. So to send targeted email to those we might want a partial index on the orderCount field, and only include in the index people with an email:

db.person.ensureIndex(
{ orderCount: 1},
{ partialFilterExpression: { email:{$exists: true}}});

Now if I query for people with orderCount greater than 2, I’m in for a disappointment: mongo will NOT use the index for this query:

db.person.find({ orderCount: {$gt: 2}})

That’s because it can’t determine that the partial index even may apply. For the optimizer to choose the partial index, the query criteria must include the index filter expression fields with a non-empty value. So this CAN use the index:

db.person.find({orderCount: {$gt: 2}, email: {$exists: true}})

Which is both true to the form and expressive: I want shoppers with 3 or more past purchases that have an email address for the email campaign.

Other restrictions?

  • Mongo allows only one index on a set of fields regardless of the index options. So we can’t create several partial indexes on the same set of index-field definitions.* The _id field index can’t be partial. And since sharding relies on the shard-key index to locate documents in the cluster, the shard-key index cannot be partial either.
  • Can’t create a partial index with the sparse option as well, but that’s just silly anyway. The partial index is a superset of the sparse index. The documentation suggests that we actually use partial filter expression in index creation to satisfy sparse index definitions (by using {$exists: true} ).

I’m very jazzed over this feature. It’s certain to offer an efficient indexing option for many query scenarios.

This post is based on early pre-release information (3.1.8 currently, dev-only-release) , so please be patient with pushing code to production (which is to say: don’t do it!) . Please see dev release notes.