MongoDB Change Streams - More event processing, less queue infrastructure.

What are Change Streams?

Change streams are a new way to tap into all of the data being written (or deleted) in mongo. Using change streams, you can do nifty things like triggering any reaction you want in response to very specific document changes.

For example, say a user registers on your website. You want to send them an email, or maybe enroll them in an on-boarding campaign. Maybe both? Using change streams you can hook into the live event itself - the insert of a document into the Users collection - and react to it by immediately pinging your remarketing system with the details of the new user.

The flow looks something like this:

Simple Integration

Your website code only worries about registering the new user. Want to also send the new user a gift? No problem: have the listener also ping the warehouse. Want to add the user to your CRM system? You guessed it: just add a hookup.

Now you may wonder “Why is this news? Didn’t mongo always have tailable cursors? Can’t we just tail the oplog and do the same?” Sure we can. But when you tail the oplog, every mutation is going to be returned to you. Every. Single. Write.

Consider a moderate website, say 1,000 user sign-ups a day. That’s roughly one every minute or two. Not too bad. But what about the social media documents? Every tweet? Every page view you logged? Every catalog item visited? That can get pretty big. Each of these will be sent to you from the oplog, because each one gets written into the oplog, regardless of whether you need it or not. Extra work, extra load. Extra data you’re not interested in. Filtering through this is like putting a butterfly net over a firehose nozzle.

Change streams allow you to do this more efficiently. With change streams you get to customize several aspects of the stream that is returned to you:

  1. The specific collection you are interested in.
  2. The type of change you are interested in (inserts and updates only, for example).
  3. The specific parts of the changed document (if any) you want back.

These options give you great control over the source, size, and nature of the changes your listener app wants to handle. Rather than returning all writes to you and making you filter them out in application memory, the filtering is done at the server level. This reduces wire traffic and reduces memory and CPU usage on the client. Further, it can save another round trip per document: if the mutation touched only one field - say, the last login date - tailing the oplog would have yielded only the content of the mutation command - the date value itself. If you needed the user’s email and name - which were not part of the command - they are not in the oplog. Tailing the oplog would then not be enough, and you’d have to turn around and shoot another query (Joy! More I/O!) to get the extra details. With change streams, you can include the extra fields in the returned event data just by configuring the stream to return them.

How to Consume Change Streams

Subscribing to change events depends on your particular driver support. Early beta had Node and Java driver support. I’ll focus here on Node.

There are 2 modes of consuming change streams: event driven, and aggregation pipeline. The aggregation pipeline way is to issue an aggregate() on a collection, with the new $changeStream pipeline operator as the mandatory first pipeline stage. You can introduce other pipeline stages after that stage, but it must be the first one. The resulting cursor then returns change items as they occur.

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
  { $changeStream: {} }
]);
// .. now consume the cursor as you would any other cursor

In the example above, we’re just saying that any write to the peeps collection in the demo database will be returned in the cursor we opened. The cursor remains open (barring errors, network issues, or the collection disappearing), so we can process the events coming back sequentially.

Speaking of sequentially: the stream guarantees that events are returned in the order they were executed by MongoDB. The change stream system uses a logical ordering that ensures you get the events in the same order mongo would have serialized them internally.

To include the full document affected, you can add the fullDocument option with a value of updateLookup. The other value is default, which indicates you don’t need any extra data. Omitting the fullDocument option is equivalent to specifying the value default.

With the updateLookup value set, an event will include the document as looked up at some point after the operation that prompted the event was applied. The subtlety here is that under a high rate of change, your event may return a later value of the document. For example, if you update a pageview.clickCount field twice in quick succession, the event resulting from the first update may already reflect the result of the second update. You will get 2 change notifications because 2 writes happened, and they will arrive in order. But the lookup that brings in the current state of the document is not guaranteed to contain the first change, because it is, in fact, a lookup performed after the event is queued for discovery by your change stream, not a point-in-time capture of the document.

var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } }
]);

In the example above, the event returned from the cursor will have a field named fullDocument containing the document looked up at the server, without an additional round trip.

Speaking of event data, what else is returned?

{
  "_id": { "clusterTime": { "ts": "..." }, "uuid": "...", "documentKey": { "_id": "..." } },
  "operationType": "update",
  "fullDocument": { "_id": "waldo", "at": "2017-10-01", "seen": "beach" }, // this is my document!
  "ns": { "db": "changeStreamDemo", "coll": "peeps" },
  "documentKey": { "_id": "waldo" },
  "updateDescription": {
    "updatedFields": { "at": "2017-10-17" },
    "removedFields": []
  }
}

The change event includes a bunch of useful information:

  • An _id for the event. In case you wanted to stop and start again, save this one and ask only for changes that occurred since that id.
  • The operationType - discussed shortly.
  • The full namespace ns, including the database name and collection names.
  • documentKey pointing to the _id of the document that is the subject of this event.
  • fullDocument containing the document as it was just after the write occurred. The word full may be a misnomer though: it depends on further projections you may add to retrieve a subset or a modified document. More about that in a bit. For updates, this field will not be present unless the fullDocument: 'updateLookup' option was specified.
  • An updateDescription, which contains the exact field changes: the values assigned in updatedFields, and the fields removed (think $unset) in the removedFields array.

Thus far, we just asked for any change. You can narrow down the nature of the change using the various operation types during the stream setup. The options are:

  • insert
  • update
  • delete
  • replace
  • invalidate

The first 4 are self explanatory. invalidate occurs when a collection-level change makes the documents unavailable by some means other than a document delete. Think collection.drop().

To get notified only on a subset of the change types, you can request the change stream, and add a $match stage, like so:

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } },
{ $match: { operationType: { $in: ['update', 'replace'] } } }
]);

Now only updates and document-replacements would be emitted back, filtered at the source.

This pipeline can also include transformations, so the aforementioned fullDocument field is really what you want to make of it.

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } },
{ $match: { operationType: { $in: ['update', 'replace'] } } },
{ $project: {secret:0}}
]);

The code above will omit the secret field from the emitted event.

The $match stage can also apply other arbitrary conditions. When the updateLookup option is set, you can pretty much filter based on any document field or set of fields you want, using familiar aggregation syntax.
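For instance, here is a minimal sketch of that idea (the country field and its value are illustrative, not part of the original example), filtering at the source on a looked-up document field:

var cursor = coll.aggregate([
  { $changeStream: { fullDocument: 'updateLookup' } },
  { $match: { operationType: 'update', 'fullDocument.country': 'IL' } }
]);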

That was a quick rundown of consuming change streams using aggregation. But I actually like another option in the node driver: subscribing to change stream events as asynchronous events!

Here’s how that plays out:

// a filter describing the things i'm interested in
var filter = {
  $match: {
    $or: [{ operationType: 'update' }, { operationType: 'replace' }],
    'fullDocument._id': 'waldo'
  }
};
// the option to return the full document
var options = { fullDocument: 'updateLookup' };
// creating a change stream definition
var changeStream = coll.watch([filter], options);
// subscribing to events
changeStream.on('change', c => {
  console.log('Waldo alert!', c)
});
changeStream.on('error', err => console.log('Oh snap!', err));

Using the code above, any update or replace to the document with _id ‘waldo’ in the collection coll will emit a change event.

My preference for this syntax is that it fits better into the modular async coding model. It also makes it easy to hook up the change function but define the handler elsewhere. This makes testing and reuse easy too. But that’s me - you do you.
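As a rough sketch of that separation (the handler name and module path are purely illustrative), the handler can live in its own file and the wiring stays a one-liner:

// handler defined (and unit-tested) elsewhere, e.g. handlers/waldo.js
function onPeepChange(change) {
  console.log('Waldo alert!', change.operationType, change.documentKey);
}
// wiring:
coll.watch([filter], options).on('change', onPeepChange);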

Resume / Retry

In the face of a network error, the driver will automatically attempt to re-connect once. If that fails again, you can use the last _id of the change event (which will also be in the exception thrown) after you reconnect to your cluster. The responsibility for remembering where you were is on you (after the one-shot recovery the driver does for you). But both syntaxes allow you to add a resumeAfter option when requesting the stream, setting the value to the last change stream _id you successfully processed.

var lastProcessedChangeId = ...;
var options = { resumeAfter: lastProcessedChangeId, fullDocument: 'updateLookup' };
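To tie it together, here is a rough sketch of a resume loop. The handleChange function and the in-memory bookkeeping are illustrative placeholders only - in a real system you would persist the last processed _id somewhere durable:

var lastProcessedChangeId = null;

function listen() {
  var options = { fullDocument: 'updateLookup' };
  if (lastProcessedChangeId) options.resumeAfter = lastProcessedChangeId;
  var stream = coll.watch([filter], options);
  stream.on('change', c => {
    handleChange(c);               // your idempotent handler (see below)
    lastProcessedChangeId = c._id; // remember how far we got
  });
  stream.on('error', () => setTimeout(listen, 1000)); // reopen and resume
}
listen();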

Due to the nature of large interconnected systems, it is useful to design the change stream events in your application as ‘at least once’, and to make any operation you trigger based on a notification idempotent. You should be able to run the same operation more than once with the same event data without adverse effect on your system.

Even when you limit the number, nature, and collection you listen to, you may still end up with a large amount of changes to process. Or you may have subordinate systems that can’t handle the calls emanating from the events (think slow legacy systems). One thing you can do is deploy multiple listeners, each one taking on the same change definition, but with an added partition filter. A partition filter can be employed on any field, actually - the _id is present even when you don’t ask for updateLookup. The idea is to just run several listeners, each one with an extra field match on the modulus of a timestamp or similar. If you need more capacity, you spin up more listeners and re-partition according to the number of listeners you need.
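Here is a minimal sketch of that idea, assuming updateLookup and a numeric field to partition on (fullDocument.accountId is a hypothetical field used only for illustration); listener number 0 of 2 takes the even bucket:

var totalListeners = 2;
var partition = 0; // this listener's bucket
var cursor = coll.aggregate([
  { $changeStream: { fullDocument: 'updateLookup' } },
  { $match: { $expr: { $eq: [ { $mod: ['$fullDocument.accountId', totalListeners] }, partition ] } } }
]);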

The change stream feature is currently in RC0 as of this writing, and should hit GA soon. I look forward to writing less code and incurring less I/O in the event processing component of our systems. It’s kind of nice to be able to use the same database infrastructure for all kinds of workloads, and this operational integration feature is a very welcome addition.

MongoDB - Don't be so (case) sensitive!

I have a collection of people, named peeps:

db.peeps.insert({UserName: 'BOB'})
db.peeps.insert({UserName: 'bob'})
db.peeps.insert({UserName: 'Bob'})
db.peeps.insert({UserName: 'Sally'})

And I want to be able to find the user named “Bob”. Except I don’t know if the user name is lower, upper or mixed case:

db.peeps.find({UserName:'bob'}).count() // 1 result
db.peeps.find({UserName:'BOB'}).count() // 1 result
db.peeps.find({UserName:'Bob'}).count() // 1 result

But I don’t really want to require my users to type the name the same case as was entered into the database…

db.peeps.find({UserName:/bob/i}).count() // 3 results

Me: Yey, Regex!!!

MongoDB: Ahem! Not so fast… Look at the query plan.

db.peeps.find({UserName:/bob/}).explain()
// ugg, collection scan:
// "winningPlan" : {
// "stage" : "COLLSCAN", ...

Me: Oh, I’ll create an index!

db.peeps.createIndex({UserName:1});
// query again...
db.peeps.find({UserName:/bob/}).explain()
// Ah! IXSCAN!
// "winningPlan" : {
// "stage" : "FETCH",
// "inputStage" : {
// "stage" : "IXSCAN", ...
// ignore case, still uses index
db.peeps.find({UserName:/bob/i}).explain()
// Ah! IXSCAN!
// "winningPlan" : {
// "stage" : "FETCH",
// "inputStage" : {
// "stage" : "IXSCAN", ...

Me: Yey!!!

MongoDB: Dude, dig deeper… and don’t forget to left-anchor your query.

// Run explain(true) to get full blown details:
db.peeps.find({UserName:/^bob/i}).explain(true)
// "executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 3,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 4, // <<<=== Oy! Each key examined.
// "totalDocsExamined" : 3,

Me: Yey?

MongoDB: Each key in the index was examined! That’s not scalable… for a million documents, mongo will have to evaluate a million keys.

Me: But, but, but…

db.peeps.find({UserName:/^bob/}).explain(true)
//"executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 1,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 1, // <<<=== Ok,
// "totalDocsExamined" : 1, // <<<=== Not Ok...

Me: This is back to exact match :-) Only one document returned. I want case insensitive match!

Old MongoDB: ¯\_(ツ)_/¯ Normalize string case for that field, or add another field where you store a lowercase version just for this comparison, then do an exact match?

Me:

New MongoDB: Dude: Collation!

Me: Oh?

Me: (Googles MongoDB Collation frantically…)

Me: Ahh!

db.peeps.createIndex({UserName:-1}, { collation: { locale: 'en', strength: 2 } })
db.peeps.find({UserName:'bob'}).collation({locale:'en',strength:2}) // 3 results!
// "executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 3,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 3, // <<<=== Good! Only matching keys examined
// "totalDocsExamined" : 3,

Me: Squee!

MongoDB: Indeed.


Collation is a very welcome addition to MongoDB.

You can set Collation on a whole collection, or use it in specific indexing strategies.
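For instance, here is a sketch of the collection-level approach (the collection name is reused from above for illustration): a default collation set at creation time is inherited by the collection’s indexes and queries, so case-insensitive matching becomes the default behavior.

db.createCollection('peeps', { collation: { locale: 'en', strength: 2 } });
db.peeps.createIndex({ UserName: 1 });  // inherits the collection's default collation
db.peeps.find({ UserName: 'bob' });     // matches 'bob', 'BOB', and 'Bob'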

The main pain point it solves for me is the case-insensitive string match, which previously required either changing the schema just for that (ick!), or using regex (index supported, but not nearly as efficient as exact match).

Beyond case-sensitivity, collation also addresses character variants, diacritics, and sorting concerns. This is a very important addition to the engine, and critical for wide adoption in many languages.
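Sorting is a good example: the same collation can be applied to a query’s sort so that ‘bob’ and ‘Bob’ collate together (again using the peeps collection from above):

db.peeps.find().sort({ UserName: 1 }).collation({ locale: 'en', strength: 2 });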

Check out the docs: Collation

MongoDB Sharding & Chunks

If you want to scale your reads or writes using MongoDB, you are going to use sharding.

The nice thing about sharding is auto-sharding: Mongo doesn’t just distribute the data across nodes for you, it automatically splits collection data and keeps the nodes fairly equally loaded with data. The core mechanism supporting this arrangement is the chunk.

So, what are chunks anyway?

When you want to split a collection’s data horizontally, you need a way to track which documents live where. When you define a sharded cluster you pick a shard key. The shard key is the value taken from each document that decides which physical shard that document will live on. To keep track of all the documents, the config servers need some way to persist the shard association.

If we kept an entry mapping each shard key value to a shard for every document, it wouldn’t scale very well. A billion documents would require a billion entries, making the config servers work hard and consume lots of memory and I/O. If instead we defined a hash function that takes the key and distributes the key space across some number of nodes, we wouldn’t need to save each key value, and could predict the shard a document lives on just by applying the function. The trouble with that is that if we added or removed shards, we’d have a hard time re-balancing documents across the available nodes, since the original function wouldn’t account for a different number of nodes. Further, a formulaic key-to-shard association would not necessarily result in even load on the shards. Although a key space (like an integer) may be evenly round-robin’ed across shards (think mod(x) where x is the number of shards), the actual documents may occur in clumps that pile up on one shard, regardless of the theoretical key-space distribution. No, these methods have severe challenges. What Mongo does instead is define chunks.

A chunk is simply a small record that describes a key range, and the shard that key range is associated to. By having a chunk describe a key range rather than each discrete key, we get around the issue of storing every shard key value found. Although chunks are stored on the config servers, the storage size is much reduced compared to storing each key, since ranges can/would encompass a large number of keys in one small descriptor.

Initially, a chunk is created which encompasses all possible keys. For a sharded collection with a shard key “score” this may look like:

{ "score" : { "$minKey" : 1 } } -->> { "score" : {"$maxKey":1} } on : shard0000

This descriptor states that any document with any score value (from theoretical minimum to maximum) will be located on the shard named “shard0000”.

Once you start populating the collection with documents, Mongo will start splitting chunks and migrating them to other shards in an attempt to keep data evenly spread across the shards. This action takes place when Mongo sees a chunk containing 64MB worth of data. Once a chunk is split, Mongo will move one of the two chunks to another shard. It will copy all the documents which fall into the chunk’s range over to the new shard, update the config servers to point the chunk to the new shard, and finally clean up the documents from the old shard. All this work is automatic and baked into the sharding mechanism in Mongo.
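As an aside, you can watch this machinery from the shell. The sketch below uses an illustrative demo.scores namespace (not from the original example) to enable sharding and then peek at the chunk descriptors the config servers keep:

sh.enableSharding('demo');                        // enable sharding for the demo database
sh.shardCollection('demo.scores', { score: 1 });  // "score" as the shard key
// chunk descriptors live in the config database:
db.getSiblingDB('config').chunks.find(
  { ns: 'demo.scores' },
  { min: 1, max: 1, shard: 1, _id: 0 }
);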

Chunks are split based on actual observed key values. Mongo computes a splitting point on the shard key based on the actual keys in the shard. The nice thing about this is that given an uneven distribution along a key space, the splitting will follow the density and reduce it. Let’s say you have a shard key on the credit score of your customers. The theoretical range is 0 - 850. But in reality, your observed scores are such that the majority fall between 600 and 750, with a few outliers below and a few above. The chunk-split progression may look something like:

// one chunk, covering all
{ "score" : { "$minKey" : 1 } } -->> { "score" : {"$maxKey":1} }
// then 2 chunks,
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : {"$maxKey":1} }
// then the upper chunk will split
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : 419 }
{ "score" : 419 } -->> { "score" : {"$maxKey":1} }
// since most scores fall above 419, that chunk will again split given enough data
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : 419 }
{ "score" : 419 } -->> 639 }
{ "score" : 639 } -->> { "score" : {"$maxKey":1} }

This is all data and insertion order related. Based on the addition of documents at a point in time, a chunk may grow and need to split. Chunks that remain small enough are left alone. Over time, and given a good enough shard key, the data will be roughly evenly split across the shards. This mechanism does require active management on Mongo’s part, as opposed to a fixed hash function over the number of shards. But it has the advantage of letting you grow the cluster with ease. If you add another shard to the cluster, Mongo can migrate chunks (without splitting) to drain some load off the existing shards. Another issue with static hash functions is that they rely on theoretical values, not actual ones. Imagine all scores are even for some reason, and that your hash function is simply odd/even to place documents on 2 shards. All the documents will end up on the “even” shard, and the “odd” shard will be empty. Not so with Mongo’s chunk mechanism: Mongo will split chunks as they become full, and will move chunks to the least full shard at that time. It’s a dynamic thing. It’s easy to figure out the data load because you can just count chunks. Sure, there might be some pretty empty chunks, but the imbalance over a large enough dataset is negligible.

Mongo takes care of leveling the data size load on each shard. But what it can’t do is level the query load or write load on the shards. Given 3 shards - call them A, B, and C - Mongo doesn’t know if your next 3 documents will end up all on A, or go to B and C. It’s up to you. You must pick a shard key that has such distribution properties. This is one of the tougher tasks in designing a sharded cluster. The effect of picking a good key can make or break your cluster. (And no, the “score” field is not a good shard key by any means.) The principal considerations for picking the shard key are:

  1. We want writes to be spread across shards as evenly as possible
  2. We want reads to be as local as possible
  3. We want chunks to always be splittable

We’re sharding because we want to realize more I/O and processing bandwidth using more servers. Writes are I/O intensive, and therefore spreading the writes across shards evenly will give us the best write-performance. A bad shard key will cause consecutive document write load to go to one of the shards, leaving others idle.

mongos will route queries to the shards it thinks the documents are on. It is capable of clubbing together query results from multiple shards and handing them over to you as one result set. But if your query can be served by one shard only, the rest of the shards are free to answer more/other queries in parallel. Query locality has to do with picking a shard key that is likely present in your queries, and for which the matching documents largely or wholly live on one or a few shards.

A source of imbalance can arise from a giant chunk. A chunk will grow beyond 64MB if Mongo can’t split it. This happens when the shard key is such that a chunk fills up and all documents in it have the same shard key value. The score value in our previous example is susceptible to this. Over time, many customers may end up with a score of, say, 720. Once a chunk contains only those, Mongo won’t be able to split it because it can’t find a splitting point between 720 and 720… That’s bad. It makes for a giant chunk, and the situation only worsens as more and more documents take on that key value, because they all end up on the same shard. More load goes to that shard relative to the others, because Mongo can’t split that chunk and won’t move it either (balancing the load is based on chunk count). So it really pays to pick a key that has ample range in reality, with your expected documents. An integer such as score has a huge theoretical range, but when used for this problem domain of credit scores it doesn’t.
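If you suspect this kind of skew, the shell can show it. The helpers below are standard shell commands; the demo.scores namespace is the illustrative one used in the earlier sketch:

sh.status();                                             // chunk counts per shard
db.getSiblingDB('demo').scores.getShardDistribution();   // data and document counts per shard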

Chunks are the basis of the sharding mechanism in Mongo. They offer some clear benefits over other distribution mechanisms. There are even more advanced scenarios supported by chunks, such as tag-aware sharding. As an administrator, you can do a few things to affect chunk splitting and migration, taking over some control when needed. But sharding is largely automatic and self managing, letting you focus on writing and deploying your awesome apps.

For a video tutorial on configuring sharding and much more, see this Pluralsight.com course.

MongoDB 3.2 Goodies Coming Your Way: More ways to $unwind

The aggregation framework has a slew of new operators and pipeline improvements. One notable improvement is the more robust $unwind pipeline stage.

One of the main motivations for using Mongo is its flexible document model. Any document in a collection can have arbitrarily different fields and data types in fields. This is great for many reasons. But when it came to aggregating array elements it posed a few problems. Unwinding an array is done with the $unwind pipeline stage, but you could only specify the field to unwind.

Problem one was that this variety across documents means some documents don’t have the array field at all. Consider a cake collection, containing documents about cakes.

> db.cakes.find()
{ "_id" : "pound cake", "recipe" : [ "butter", "flour", "eggs", "sugar" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

Pound cake has a recipe field, containing ingredients. Brownies are defined in a different document schema, listing the main constituents of the brownie in a makeup field.

If we wanted to create a listing of possible cakes by ingredient, we’d write something like:

db.cakes.aggregate([
{$unwind:'$recipe'},
{$group:{_id: '$recipe', found_in: {$addToSet:'$_id'}}}
])

But this will skip brownies, because brownies don’t have a recipe field at all. We’d have to synthetically add a recipe field in order to fix that, like so:

db.cakes.aggregate([
{$project:{ fixed: {$ifNull: ['$recipe', ['-']]}}},
{$unwind:'$fixed'},
{$group:{_id: '$fixed', found_in: {$addToSet:'$_id'}}}
])

Adding an array with the single element ‘-‘ (or whatever token you want) was necessary because $unwind insisted that the field being unwound both be an array type and contain at least one element. No elements? Not an array? $unwind didn’t emit the document at all.

But with 3.2, $unwind allows you to emit a document placeholder even if there is no field or the field contains null. You activate this option by adding a second argument to $unwind named preserveNullAndEmptyArrays with a value of true or false. If you don’t specify the extra argument, then no document is emitted for null or empty array fields. This allows us the more concise expression:

db.cakes.aggregate([
{$unwind: {path: '$recipe', preserveNullAndEmptyArrays: true}},
{$group: {_id: '$recipe', found_in: {$addToSet: '$_id'}}}
])
// result of this aggregation:
{ "_id" : "sugar", "found_in" : [ "pound cake" ] }
{ "_id" : "eggs", "found_in" : [ "pound cake" ] }
{ "_id" : "flour", "found_in" : [ "pound cake" ] }
{ "_id" : null, "found_in" : [ "brownies" ] }
{ "_id" : "butter", "found_in" : [ "pound cake" ] }

Using preserveNullAndEmptyArrays will emit documents that have either null value, or an empty array. The above aggregation example will produce a result with _id null for all cakes that don’t have a recipe field.

Nice.

But variety doesn’t stop at existence or non-existence of fields. What about a field that contains an array in some documents, but is a straight out string in another?

Consider these 3 cakes:

{ "_id" : "princess", "makeup" : [ "sponge", "jam", "sponge", "custard", "sponge", "whipped-cream", "marzipan" ] }
{ "_id" : "angel cake", "makeup" : [ "sponge", "whipped-cream", "sponge", "icing" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

The first two contain an array for their multi-part “cakiness” in the makeup field. But the brownie, well.. it’s made out of brownie! The value of the makeup field there is just a string.

$unwind used to error out on this condition. If it encountered any document whose field was not an underlying BSON array type, it would halt the pipeline and throw an error. But not anymore!

Running a straightforward aggregation:

db.cakes.aggregate([
{$unwind: {path: '$makeup'}},
{$group: {_id: '$makeup', makes: {$addToSet: '$_id'}}}
])
// results in the following:
{ "_id" : "marzipan", "makes" : [ "princess" ] }
{ "_id" : "sponge", "makes" : [ "princess", "angel cake" ] }
{ "_id" : "icing", "makes" : [ "angel cake" ] }
{ "_id" : "jam", "makes" : [ "princess" ] }
{ "_id" : "custard", "makes" : [ "princess" ] }
{ "_id" : "brownie", "makes" : [ "brownies" ] }
{ "_id" : "whipped-cream", "makes" : [ "princess", "angel cake" ] }

This is perfectly acceptable in the new unwinding world. When $unwind sees a field with a non-array value, it treats it as an array containing that single value, and emits a single document containing that value. No more error. Since in the past such a data condition would have produced an error, your legacy code should largely work fine after an upgrade. If it didn’t error before, this condition wasn’t present – presumably clean data, or you took the time to fashion fancy $match clauses or expressions that prevented single-valued fields from entering the $unwind stage.

Nice.

This feature adds nuance to a common scenario. Initially, a document schema contains only a single value because it is version 0.9 of the software and the use case was simple. Later on, we discover that a single value doesn’t cover a future feature and convert the field to an array. But then we may have some limbo time when some documents are saved with one schema, and others with another. Either way, we can now aggregate across that field and handle single-valued fields as an array with a single value, without the need for an elaborate $project or $match.

Oh, and one more thing:

We may have a need to figure out how cake composition varies across cakes. For that, I want to aggregate around the ordinal position of a component across cakes. I’d like to get an idea, across the layer 1’s, layer 2’s, etc., of how much variation there is. To assist with that, $unwind has a new option to emit the array offset alongside the value.

> db.cakes.aggregate([
{$unwind: {path: '$makeup', includeArrayIndex: 'offset'}},
{$group: {_id: '$offset', options: {$addToSet: '$makeup'}}}
])
// returns these results:
{ "_id" : NumberLong(6), "options" : [ "marzipan" ] }
{ "_id" : NumberLong(5), "options" : [ "whipped-cream" ] }
{ "_id" : NumberLong(4), "options" : [ "sponge" ] }
{ "_id" : NumberLong(3), "options" : [ "custard", "icing" ] }
{ "_id" : NumberLong(2), "options" : [ "sponge" ] }
{ "_id" : NumberLong(1), "options" : [ "jam", "whipped-cream" ] }
{ "_id" : NumberLong(0), "options" : [ "sponge" ] }
{ "_id" : null, "options" : [ "brownie" ] }

Which lets us see that our cakes thus far have sponge layers, with a variety of fillings in between, for the first 5 components (offsets 0 - 4).

When we evaluate our customer schema and use cases, we also spend time ensuring that we can produce requisite (and sometimes intricate) queries using available Mongo server syntax. The 3.2 release includes an improved $unwind and a bevy of other operators which helps us cover more use cases in a more compact and efficient way.

MongoDB 3.2 Goodies coming your way: Schema Validator

Mongo has a flexible document model, allowing you to save pretty much any data you want, any way you want, without any ceremony around declaring your document structure ahead of time. This makes some folks uneasy. After all, if there’s no schema structure enforcement, what prevents humans from making silly programming errors like naming a field “wrong” or storing a number as a string rather than an integer? Data integrity or coherency issues can surely kick you in the rear binary area.

On the other hand, documents are really cool. The inherent flexibility lets us ingest arbitrary variety of data and deal with fields and sub-documents as first class citizens with rich query and update operators. This is not something we want to give up.

So what to do?

How about a compromise: For a given highly important collection, Mongo will let you declare a mini-spec for acceptable documents. A schema(ish) declaration which describes specific fields which must exist and which should be of a certain type. This is called a validator. A validator is a rule that helps you ensure that documents in a collection all adhere to that rule.

Let’s start with an example. Your company has customers. You do business on the internet. You figure every customer must have an email address. To ensure that, you bust open a Mongo shell, and create a validator like so.

db.createCollection("customer", {
validator: {
email: { $exists: true }
}
})

The collection named “customer“ is now created, and a validator has been assigned to it. The validator simply states that a field named email must exist on documents stored in the customer collection.

Let’s try and add some documents:

> db.customer.insert({_id: 1, email:'bob@example.com'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 2, name: 'iggy'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 3, emailAddress: 'bob@example.com'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 4, email: 42})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 5, email: null})
WriteResult({ "nInserted" : 1 })
>

The first insert (_id: 1) shows that adding a document with an email field works. No problem, no surprise.

The second insert (_id: 2) shows that attempting to add a document with no email field fails. We expected that; it’s what we validated for: an email field must exist.

The third insert (_id: 3) shows that attempting to add a document with a different field that happens to contain an email is not allowed. Ok, makes sense. Mongo is not fishing around to find some other field. We just told it that the field email must exist!

The fourth insert (_id: 4) shows that our rule is a bit weak: we can have a document that has an email field, but with a numeric value. We’ll fix this in a bit.

The last insert (_id: 5) points out another weakness: we can add a document that has the field, but with a null value. Clearly, not our intent.

Well, with what we’ve learned from these tests, we want to improve our rule. What we want is an email field, that has a value, which is an email address. So let’s craft a validator for that:

var newRule = { email: { $exists: true, $type: "string", $regex: /^\w+@\w+\.gov$/ } }
db.runCommand( { collMod: "customer", validator: newRule} )

The new rule we created states that an email field must exist. It should contain a string data type (remember, BSON has precise knowledge of data types, so we can ensure the exact type we want). The new rule also says that the email address must follow the syntax stated in the given regular expression. The regex ensures that our email address indeed looks like a valid one ending in “.gov”. It also - by nature of requiring some characters - ensures that the field is not empty. Null doesn’t match that regex. Neither will a number.

This time, I applied the rule to an existing collection. The collection already exists, so I can modify it and assign the new validator to it. This new validator replaces the old validator.

So lets try a few inserts then:

> db.customer.insert({_id: 6, email:'bob@example.gov'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 7, email: 'bob@example.com'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 8, email: 42})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 9, email: null})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 10 })
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>

Only the document with the email address “bob@example.gov” was inserted. The document with the email “bob@example.com” doesn’t match the pattern because it doesn’t end with “.gov”. A document with the email field containing a numeric or null value is rejected. And of course, a document with no email field at all is rejected. So our new rule works! It’s a bit overly verbose - we don’t really need the $exists and $type because the $regex criterion subsumes that meaning - but here I wanted to show how you can apply multiple rules to one field.

The matching criteria in a validator can be any of the query operators you use with the find() command. The exceptions are operators $geoNear, $near, $nearSphere, $text, and $where. And we can combine validation rules together with logical $or or $and to create more complex scenarios.

Here’s a scenario that aims to make sure customer documents adhere to some credit score and status rules:

var newRule = { $or: [
  {
    $or: [
      { creditScore: { $lt: 700 } },
      { creditScore: { $exists: false } }
    ],
    status: 'rejected'
  },
  {
    creditScore: { $gte: 700 },
    status: { $in: ['pending', 'rejected', 'approved'] }
  }
]}
db.runCommand( { collMod: "customer", validator: newRule} )

We’re trying to ensure that

  • A customer with creditScore less than 700 or no creditScore at all is in the ‘rejected’ status.
  • A customer with a creditScore at or above 700 is in one of the 3 statuses: rejected, pending, or approved.

The validator itself is a bit more involved, including an $or combination which allows us to validate several alternative acceptable states of a document.

To test this, lets try a few inserts:

> db.customer.insert({_id:11, status: 'rejected'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:12, status: 'approved'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id:13, creditScore: 700, status: 'approved'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:14, creditScore: 300, status: 'approved'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id:15, creditScore: 300})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>

A customer with no creditScore field at all is possible, as long as the status is ‘rejected’. One with a status ‘approved’ but no creditScore field is invalid.

A customer with a creditScore of 700 and a status ‘approved’ is valid and Mongo accepts it.

A customer with a creditScore of 300 and a status ‘approved’ is not valid, since those with this low score should have a status ‘rejected’.

A customer with a low creditScore alone, without a status at all, is not accepted either.

By now, I have updated the validator definition on the collection several times. Which begs the question: what about existing documents in the collection? The answer is: nothing. Mongo will not reject a new validator because of existing documents. It will not delete, fix, or flinch about documents already in the collection. Mongo applies the validator to mutations only: updates and inserts.

What happens when you update a document that already existed? That depends on the validationLevel you mark your validator with. When you apply your validator to a collection, you can add a validationLevel field with a value of ‘off’, ‘strict’, or ‘moderate’. The default validation level is ‘strict’. Under strict validation, every update or insert must pass validation. If you use the moderate level, validation is applied when inserting a document and when updating an existing and valid document. An update to an existing but invalid document doesn’t undergo validation.

Earlier on, we inserted documents that have no creditScore and no status fields. Under our latest rule with strict validation, we can’t update them unless they adhere to the current new rule. But if we change the validator and add validationLevel of moderate, we can. For example:

> db.runCommand( { collMod: "customer", validator: newRule, validationLevel: 'moderate'} )
{ "ok" : 1 }
// Existing document is
// {_id: 4, email: 42}
> db.customer.update({_id:4},{$set: {email: 'bob@example.gov'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
// Now modified document is
// {_id: 4, email: 'bob@example.gov'}

The new validator assigned to the collection is the same as our latest credit-score one. We just added the validationLevel option to exempt existing invalid documents from validation upon update. This lets me change the existing document’s email field to an email string. Under the strict validation level this would not be allowed, because that document doesn’t have a creditScore field and doesn’t have a status of ‘rejected’ either.

It’s useful to know what the current validator is (or any other collection specific setting). To do that, use the shell function db.getCollectionInfos({name: 'collection_name'}) like so:

> db.getCollectionInfos({name: 'customer'})
[
  {
    "name" : "customer",
    "options" : {
      "validator" : {
        "$or" : [
          {
            "$or" : [
              { "creditScore" : { "$lt" : 700 } },
              { "creditScore" : { "$exists" : false } }
            ],
            "status" : "rejected"
          },
          {
            "creditScore" : { "$gte" : 700 },
            "status" : { "$in" : [ "pending", "rejected", "approved" ] }
          }
        ]
      },
      "validationLevel" : "moderate",
      "validationAction" : "error"
    }
  }
]

Which gives you a nicely formatted account of the current validator expression and the validation level.

Notice the extra field validationAction set to “error”? The validation action controls what the Mongo server returns to a client that attempts to write something that is not valid. The choices are ‘error’ or ‘warn’, with ‘error’ being the default. As the names imply, an error is returned to the client if the action is ‘error’, and a warning if set to ‘warn’. To roll in a new validator, you might want to start with ‘warn’. When you use ‘warn’, a failed validation does not produce an error. The write continues regardless of validation. The validation failure will be logged into Mongo’s log file, but the write operation will not be prevented.

Applying a validation action is a matter of adding that field:

> db.runCommand( {
collMod: "customer",
validator: newRule,
validationLevel: 'moderate',
validationAction: 'warn'
})
{ "ok" : 1 }
>
> db.customer.insert({_id: 16, status: 'pending'})
WriteResult({ "nInserted" : 1 })
>

Here we apply our latest rule with a moderate validation level, and a validation action of ‘warn’. Under ‘moderate’ validation level, an insert will always be validated. But with a validationAction of ‘warn’, the write will be accepted and only a log entry will tell us that something went awry. The insert itself went fine as far as a connected client is concerned.

Before you declare complete victory on the side of corporate auditing and the proponents of schema lock-down, there’s something else you should know: a client performing an update or insert can easily ignore all this validation goodness. All you have to do to circumvent validation is say so. When a client adds a bypassDocumentValidation field with the value true, the validator will not kick in and won’t prevent any writes, in either strict or moderate levels. This goes something like:

> db.customer.insert({_id: 17, status: 'arbitrary_junk'}, { bypassDocumentValidation: true})
WriteResult({ "nInserted" : 1 })
/* log entry: ...Document would fail validation collection: validators.customer doc: { _id: 17.0, status: "arbitrary_junk" } */

Which will produce a warning in the log, but not prevent the operation. So this whole feature should be deemed a collaborative tool to improve data quality, not an enforcement hammer in the hands of the person controlling the collection.

This is a pretty interesting new feature. It straddles the line between declared strict schema and flexible, software-defined document in a very interesting way. It can help people protect against some of the dangers of flexible document model, yet leaves open the ability to operate without the harsh schema restrictions that a pre-defined schema engine would have imposed.

If you are interested in learning more, let us know!

MongoDB 3.2 goodies coming your way: Partial Indexes

MongoDB has included sparse indexes for a while now.

What is a sparse index? A sparse index is a space-optimized index which only contains pointers to documents that contain value(s) in the indexed fields.

Let’s say our Person collection had a document for each user in our system. All people have to supply a first name, but the prefix is optional. Some people have a “Mrs” or “Prince” or something, but most users don’t:

{ _id: 1, name: 'Jan', prefix: 'Mrs' }
{ _id: 2, name: 'Dude' }

Jan (in our contrived example) has a prefix, but Dude doesn’t. Now if we created an index on the field “prefix”, an entry for both documents would be created. The key “Mrs” would point to the document with _id 1 (and any other documents with the prefix “Mrs”), and the index key null would point to the document with _id 2 and any other document that happens not to have a prefix field at all, or which has a value of null in the prefix field. This is what inverted indexes do, after all: they hold a fast-to-find key value, which in turn points to a list of documents in which that key value exists.

Consider a 10 million user collection where only 5% of users have supplied a prefix. The index will still be responsible to represent 9.5 million documents. That’s wasteful. It takes extra space on disk. It takes space in memory. It takes extra effort to maintain. Waste.

A sparse index simply avoids including the “empty” key. Instead of storing a key with references to all the documents that DON’T have the value, it just doesn’t. This brings space savings: less space on disk, less space in memory, and less work to maintain for documents that don’t have that field at all.

A sparse index can be created using the {sparse: true} option, like so:

db.person.createIndex( {prefix: 1 }, {sparse: true });

So what is a partial index? Let’s say our company offered college tours for anyone about to finish high school. Our marketing department is interested in anyone who has 11 to 12 years of schooling. Anything lower and it’s too far in the future. Anything higher, well, they are already in college and not eligible.

To support this, we would maybe create an index on the yearsInSchool field, like so:

db.person.createIndex( {yearsInSchool :1});

But we find out that most of the people are younger, older, or didn’t even supply their years-in-school number. So that index is a good candidate to be sparse. But the sparse property only handles cases where the field doesn’t exist or is null. People with 9 or 17 years of schooling will also be in the index. Wouldn’t it be nice to have an index only on documents where the years in school are 11 or 12? That index would be super compact and therefore super efficient in space and usage!

A Partial Index does just that. A partial index allows you to specify a static criteria, and include in the index only keys where the value matches that criteria. So in our example, here’s how we create a partial index to help college-bound candidates:

db.person.createIndex(
{ yearsInSchool: 1},
{ partialFilterExpression: { yearsInSchool: {$gte: 11, $lte:12} }}
);

The partialFilterExpression option lets you supply a criteria using equality, $eq, $gt, $gte, $lt, $lte, $type, $exists and $and (these restrictions are early pre-release) at the top level as the filter. Any document not matching the criteria will be excluded (not indexed) by this index.

Pretty nifty!

But wait, there’s more!

The college tours are only offered to people with a GPA (Grade Point Average) of 3.0 and above. Hey – you want to go to college, better get your grades up! The partial index criteria can include arbitrary fields for the filter. The indexed fields need not be the fields mentioned in the filter expression. So we can have instead:

db.person.createIndex(
{ yearsInSchool: 1},
{ partialFilterExpression: { yearsInSchool: {$gte: 11, $lte:12}, gpa: {$gte:3.0} }}
);

The GPA field is mentioned in the filter but is NOT the indexed field. Now our index can be even more concise and compact, which is very cool.

In fact, we can have filter expressions with none of the indexed fields in them. Our frequent-shopper program is only open to people who shopped with us 3 or more times. And our marketing team can send them email periodically. To do that, they need to actually have an email address. So to send targeted email we might want a partial index on the orderCount field, and only include in the index people with an email:

db.person.ensureIndex(
{ orderCount: 1},
{ partialFilterExpression: { email:{$exists: true}}});

Now if I query for people with orderCount greater than 2, I’m in for a disappointment: mongo will NOT use the index for this query:

db.person.find({ orderCount: {$gt: 2}})

That’s because it can’t determine that the partial index even may apply. For the optimizer to choose the partial index, the query criteria must include the index filter expression fields with a non-empty value. So this CAN use the index:

db.person.find({orderCount: {$gt: 2}, email: {$exists: true}})

Which is both true to the form and expressive: I want shoppers with 3 or more past purchases that have an email address for the email campaign.

Other restrictions?

  • Mongo allows only one index on a set of fields regardless of the index options. So we can’t create several partial indexes on the same set of index-field definitions.
  • The _id field index can’t be partial. And since sharding relies on the shard-key index to locate documents in the cluster, the shard-key index cannot be partial either.
  • You can’t create a partial index with the sparse option as well, but that’s just silly anyway. The partial index is a superset of the sparse index. The documentation suggests that we actually use a partial filter expression in index creation to satisfy sparse index definitions (by using {$exists: true}), as sketched below.

I’m very jazzed over this feature. It’s certain to offer an efficient indexing option for many query scenarios.

This post is based on early pre-release information (3.1.8 currently, dev-only-release) , so please be patient with pushing code to production (which is to say: don’t do it!) . Please see dev release notes.

Node Mongoose Demo Code to ignore

There’s lots of “getting started” tutorials out there. Some are great, but some, well, shall we say “sub-optimal”?

When using Mongoose, you get an entity-centric model to work with. Very often, it becomes the basis for a RESTful API. The verb mapping typically just rips through POST, GET (list), GET (one by _id), and DELETE no problem. When it comes to PUT though, things become a bit trickier. Generally understood simply as an “update”, implementing PUT can get you into all sorts of funkiness.

The code to ignore in general is something to this effect: (error checking removed for brevity)

// define schema for Animal
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var AnimalSchema = new Schema({
  name: String,
  isCute: Boolean
});
module.exports = mongoose.model('Animal', AnimalSchema);
/// then wire it up in some other place
router.route('/animal/:animal_id')
  .put(function (req, res) {
    // dig up existing entity
    Animal.findById(req.params.animal_id, function (err, existing) {
      // update the field
      existing.name = req.body.name;
      // save the animal
      existing.save(function (err) {
        res.json({ message: 'Yey!' });
      });
    });
  });

TL;DR;

Don’t fetch an entity in order to update it.

Why? Performance, data loss and concurrency.

Let’s talk performance first. The number of round trips to the mongo server here is 2, in order to perform only one logical operation. In the code above, there’s an Animal.findById() and then a .save() operation. This is wasteful. Mongo has explicit update syntax which allows you to perform an update without 2 round-trips to the server. While 2 operations can be fast, this limits speed at scale and consumes more resources on both the mongodb side and the node application side since, well, double the work. In addition, the opportunity for failure just increased, as now we have 2 operations happening. How do you do an update? Here’s an example:

router.route('/animal/:animal_id')
  .put(function (req, res) {
    Animal.update(
      { _id: req.params.animal_id },
      { $set: { name: req.body.name } },
      function (err, updateResult) {
        res.json(updateResult);
      });
  });

We are shipping the work of finding the right document and updating a field within the document to Mongo server, which then saves us from doing another round trip. Since the command itself is shipped, the mongo server doesn’t need to send us the entire object over the network and then have us return the same object (modified) again. The larger the document, the bigger the drag on resources if you don’t use the update syntax.

Update takes a query term as the first argument, and an update term as the second argument. The query term is just id equality. So we know that the search for the document is going to be fast since the _id field is always indexed. The update term here is pretty simple, we just set the name field value to the value sent from the API client.

TL;DR; – use the update() function!

What about the data loss potential? Imagine 2 clients trying to update the same animal at roughly the same time. Instead of changing the name, one client wants to update the isCute field only, and the second client wants to update the name field only. So someone might update the original code to look like this:

router.route('/animal/:animal_id')
  .put(function (req, res) {
    // dig up existing entity
    Animal.findById(req.params.animal_id, function (err, existing) {
      if (req.body.name) existing.name = req.body.name;
      if (req.body.isCute) existing.isCute = req.body.isCute;
      // save the animal
      existing.save(function (err) {
        res.json({ message: 'Yey!' });
      });
    });
  });

Here the “improved” code first checks if the client even submitted a value for isCute or name, and only if the caller supplied a value it replaces it. Seems like it should work. But there’s a chance of data loss here.

Let’s say the animal right now is {_id: 1, name: ‘Fido’, isCute: false};

  1. Client A reads the animal, gets: {_id: 1, name: 'Fido', isCute: false}
  2. Client B reads the animal, gets: {_id: 1, name: 'Fido', isCute: false}
  3. Client A updates its in-memory copy, renaming the animal to Rex: {_id: 1, name: 'Rex', isCute: false}
  4. Client B updates its in-memory copy, making the animal cute: {_id: 1, name: 'Fido', isCute: true}
  5. Client A saves her in-memory object to mongo. Mongo will now have: {_id: 1, name: 'Rex', isCute: false}
  6. Client B saves his in-memory object to mongo. Mongo will now have: {_id: 1, name: 'Fido', isCute: true}

After they are both done, we would have expected to see {_id: 1, name: 'Rex', isCute: true}, but that’s not what we have. Client B overwrote A’s update. Worse, client A has no idea that her renaming from Fido to Rex has been lost. In fact, it even succeeded for a small window of time between steps 5 and 6. But the change is nonetheless lost.

What should be done? You guessed it: update!

router.route('/animal/:animal_id')
  .put(function (req, res) {
    var values = {};
    if (req.body.name) values.name = req.body.name;
    if (req.body.isCute) values.isCute = req.body.isCute;
    Animal.update(
      { _id: req.params.animal_id },
      { $set: values },
      function (err, updateResult) {
        res.json(updateResult);
      });
  });

Here, we exercise the same logic to conditionally touch only the fields the API client submitted. This time however, since we are telling Mongo to update the document and only touch the fields submitted, any field not submitted will not be affected!

The sequence then becomes

  1. Client A sends an update to mongo for {_id: 1} {$set: {name: 'Rex'}}
  2. Client B sends an update to mongo for {_id: 1} {$set: {isCute: true}}

Since the mongo server performs these, the result is that the animal ends up named Rex and declared isCute: true. It doesn’t matter if 1 and 2 occurred out of order. Since each update touches a different field, they won’t step over each other.

There are probably plugins or middleware that help build update() calls correctly. But I wanted to make sure the principle and reasons are made clear here. Also, if you are doing a REST API, consider distinguishing a PUT from a PATCH, whereby a PUT replaces the whole entity with the submitted representation alone (destructively, not field-wise) and a PATCH specifies that only select parts of the entity be touched. Whatever path you choose, take care you don’t subject yourself to the performance and data loss potential coming from a read-then-save cycle.
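As a rough sketch of that PUT/PATCH split, reusing the Animal route from above (illustrative only: newer Mongoose and driver versions expose replaceOne(); on older versions, update() with the overwrite option plays the same whole-document-replace role):

router.route('/animal/:animal_id')
  // PUT: replace the whole animal with the submitted representation
  .put(function (req, res) {
    Animal.replaceOne({ _id: req.params.animal_id }, req.body, function (err, result) {
      res.json(result);
    });
  })
  // PATCH: touch only the fields the client actually sent
  .patch(function (req, res) {
    var values = {};
    if (req.body.name) values.name = req.body.name;
    if (req.body.isCute) values.isCute = req.body.isCute;
    Animal.update({ _id: req.params.animal_id }, { $set: values }, function (err, result) {
      res.json(result);
    });
  });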

Who would save() me now? MongoDB 2.0 C#Driver deprecates save()

For years now, I’ve been using mongo’s save() method on a collection. It’s convenient: hand it a document with an id, slam it in, and done. With the C# 2.0 driver (and other drivers as well), it’s now gone!

Will we miss it? Should we miss it? Lets take a closer look:

First – what is the syntactic meaning of “save”? The save function provided add-or-replace semantics. If a document with that id existed, it would be overwritten with the new document. If a document with that id did not exist, then the document at hand would become a new document. Seems legit, right?

Consider, though, what would happen when a document already existed. It would be gone. Gone in the sense that the new document would overwrite the existing one. I know, I know. We know that! But not everyone catches on to this. Some people have in mind a merge-and-save behavior. A non-existent behavior where save will somehow:

  1. Overwrite fields from the new document over any existing ones.
  2. Add fields from the new document that didn’t exist before.
  3. Leave alone existing fields in the old document which aren’t present in the new document.

Well, effectively, 1 and 2 would actually happen, but 3 will not. And more than one naïve developer has been surprised to find skimpy documents “missing” previous values. The remedy, of course, is education. But on the other hand, maybe there’s a better way (please read on).

Second – what did save actually do? “It saved it” would be the first inclination. Yes, it did. But how? Turns out, it had a bit of logic behind it. If the new document you hand to save() didn’t have an id field defined, then save would attempt to assign it an id and then simply insert() the document. This depended on an id generator being assumed or present or inferred. In the shell, an ObjectId() would be assigned. Language drivers had conventions and defaults to cover such a scenario.

In pseudo code, this would look something like

if (newDocument._id === undefined) {
  // assign an id
  newDocument._id = _idGenerator.getNewId(); // conceptually
  db.collectionName.insert(newDocument);
}
...

If the document did have an id defined, then save() would turn around and execute an update(): that is, send an update command to the mongo server with the {upsert: true} option set, using the _id to identify which document to update. If a document with that id did not exist, the document would be created, with that _id. That seems fine, right? But here is where things get interesting.

The update command can operate with 2 different interpretations of the “update” part of it.

When the update term is “plain”, Mongo would take the update term and use it as a verbatim document, setting the entire document to that update. Plain means that no fields in the update term started with the dollar sign (“$”). Plain means that the update term did not contain any operators.

If mongo sensed that the update term contained operators, then it would have done a surgical update, carrying out only the field updates specified and potentially maintaining the values of fields not mentioned in the update.

Since update() used the “plain” mode of the update, any existing document would have been replaced ( the update() behavior is documented quite well here).
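To illustrate the two interpretations (the collection and field names here are just for illustration):

// "plain" update term - the whole document is replaced:
db.people.update({ _id: 1 }, { name: 'Bob' });            // doc becomes { _id: 1, name: 'Bob' }
// operator update term - only the named field is touched:
db.people.update({ _id: 1 }, { $set: { name: 'Bob' } });  // other fields are preserved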

The pseudo code for this would just look like an update, since an id was guaranteed present (otherwise the insert() path would have been chosen), something along the lines of:

db.collectionName.update({_id: newDocument._id}, newDocument, {upsert: true});

Fine then, one might say. But why not just transform the new document into a bunch of $set operators? Well, that’s just not how update works. And even if it did, would that be the correct behavior? If a user supplied a document with 3 fields, and the previous document had 5, did the user intend that the new document contain the 3 new fields plus the old 2? Or did the user intend that the new document contain only the new 3 fields?

Deprecation feels a bit like a loss. But the semantic meaning is, in fact, supported, albeit with a different syntax. Consider this C# snippet:

var person = new Person { Id = "some_id", Name = "Bob" };
var filter = Builders<Person>.Filter.Eq(p => p.Id, person.Id);
var task = mongoCollection.ReplaceOneAsync(filter, person, new UpdateOptions {IsUpsert = true});

Given a person object with some assigned id, ReplaceOneAsync with IsUpsert = true will carry out the intended save(). The syntax is a bit more elaborate, but the meaning is clear.

The words “replace one” refer to the whole document, not individual fields. This conveys the meaning well. The “upsert” intent is also explicit. When the value is true, the document will be inserted if it doesn’t already exist. When false, the document will only be replaced if it exists. Secondly, this syntax has you set the filter specifying which document to update on your own. You can, for instance, set a filter on a field other than the _id field.

Theoretically this gives you the flexibility to not care about the _id at all. Technically, you can express a filter on a field other than _id. But in practice, this will go nowhere fast: the “upserted” document must have some _id. If another document is found first by the filter but its _id doesn’t match the incoming document, an error will occur. When we run mongo training courses, questions around these kinds of things arise quite often. Hopefully this sheds a bit of light on the why, and on how to properly address such concerns.

The save() function may be deprecated, but the intended functionality is not. In the new C# driver, you can achieve the same task using ReplaceOneAsync. I like software that says what it does and does what it says!

Developers should do better since things are explicit, and the nuances of save() vs. insert() vs. update are less of a mystery.