MongoDB 3.2 Goodies coming your way: Schema Validator

Mongo has a flexible document model, allowing you to save pretty much any data you want, any way you want, without any ceremony around declaring your document structure ahead of time. This makes some folks uneasy. After all, if there’s no schema enforcement, what prevents humans from making silly programming errors like naming a field wrong, or storing a number as a string instead of an integer? Data integrity or coherency issues can surely kick you in the rear binary area.

On the other hand, documents are really cool. The inherent flexibility lets us ingest an arbitrary variety of data and deal with fields and sub-documents as first-class citizens with rich query and update operators. This is not something we want to give up.

So what to do?

How about a compromise: for a given highly important collection, Mongo will let you declare a mini-spec for acceptable documents. A schema(ish) declaration describing specific fields that must exist and the types they should have. This is called a validator. A validator is a rule that helps you ensure that documents in a collection all adhere to that rule.

Let’s start with an example. Your company has customers. You do business on the internet. You figure every customer must have an email address. To ensure that, you bust open a Mongo shell, and create a validator like so.

db.createCollection("customer", {
  validator: {
    email: { $exists: true }
  }
})

The collection named “customer“ is now created, and a validator has been assigned to it. The validator simply states that a field named email must exist on documents stored in the customer collection.

Let’s try and add some documents:

> db.customer.insert({_id: 1, email: 'bob@corp.gov'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 2, name: 'iggy'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 3, emailAddress: 'bob@corp.gov'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 4, email: 42})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 5, email: null})
WriteResult({ "nInserted" : 1 })
>

The first insert (_id: 1) shows that adding a document with an email field works. No problem, no surprise.

The second insert (_id: 2) shows that attempting to add a document with no email field fails. We expected that. It’s what we validated for: an email field must exist.

The third insert (_id: 3) shows that attempting to add a document with a different field that happens to contain an email is not allowed. OK, makes sense. Mongo is not fishing around to find some other field. We just told it that the field email must exist!

The fourth insert (_id: 4) shows that our rule is a bit weak: we can have a document that has an email field, but with a numeric value. We’ll fix this in a bit.

The fifth insert (_id: 5) points out another weakness: we can add a document that has the field, but with a null value. Clearly, not our intent.

Well, with what we’ve learned from these tests, we want to improve our rule. What we want is an email field, that has a value, which is an email address. So let’s craft a validator for that:

var newRule = { email: { $exists: true, $type: "string", $regex: /^\w+@\w+\.gov$/ } }
db.runCommand( { collMod: "customer", validator: newRule} )

The new rule we created states that an email field must exist. It must contain a string data type (remember, BSON keeps precise type information, so we can ensure the exact type we want). The new rule also says that the email address must follow the syntax stated in the given regular expression. The regex ensures that our email address indeed looks like a valid one. It also, by nature of requiring some characters, ensures that the field is not empty. Null doesn’t match that regex. Neither will a number.
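To preview what the server-side $regex will and won’t match, here’s a quick sketch exercising the same pattern client-side in plain Node.js (the sample addresses are made up for illustration):

```javascript
// Same pattern as in the validator; this is just a client-side preview,
// not how the server evaluates it.
const rule = /^\w+@\w+\.gov$/;

console.log(rule.test("bob@corp.gov")); // true: word chars, "@", word chars, ".gov"
console.log(rule.test("bob@corp.com")); // false: does not end with ".gov"
console.log(rule.test(""));             // false: empty string can't match \w+
console.log(rule.test(String(42)));     // false: "42" lacks the "@domain.gov" shape
```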

This time, I applied the rule to an existing collection. The collection already exists, so I can modify it and assign the new validator to it. This new validator replaces the old validator.

So let’s try a few inserts then:

> db.customer.insert({_id: 6, email: 'bob@corp.gov'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 7, email: 'bob@corp.com'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 8, email: 42})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 9, email: null})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 10 })
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>

Only the first document (_id: 6), whose address matches the pattern, was inserted. The next one (_id: 7) fails because its address doesn’t end with “.gov”. A document with the field email containing a numeric or null value is rejected. And of course, a document with no email field at all is rejected. So our new rule works! It’s a bit overly verbose; we don’t really need the $exists and $type because the $regex criteria subsumes that meaning. But here I wanted to show how you can apply multiple rules to one field.

The matching criteria in a validator can be any of the query operators you use with the find() command. The exceptions are operators $geoNear, $near, $nearSphere, $text, and $where. And we can combine validation rules together with logical $or or $and to create more complex scenarios.
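As a quick sketch, a hypothetical validator combining several criteria with $and might look like the one below. The plan and seats fields are invented for illustration; they are not from this post’s customer example.

```javascript
// Hypothetical sketch: a validator requiring a plan from a fixed list
// AND a seat count of at least one. Field names are made up.
var comboRule = {
  $and: [
    { plan:  { $in: ["free", "pro", "enterprise"] } },
    { seats: { $gte: 1 } }
  ]
};
// It would be applied the same way as any other validator:
// db.runCommand({ collMod: "customer", validator: comboRule })
```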

Here’s a scenario that aims to make sure customer documents adhere to some credit score and status rules:

var newRule = { $or: [
  {
    $or: [
      {creditScore: {$lt: 700}},
      {creditScore: {$exists: false}}
    ],
    status: 'rejected'
  },
  {
    creditScore: {$gte: 700},
    status: {$in: ['pending','rejected','approved']}
  }
]}
db.runCommand( { collMod: "customer", validator: newRule} )

We’re trying to ensure that

  • A customer with creditScore less than 700 or no creditScore at all is in the ‘rejected’ status.
  • A customer with a creditScore at or above 700 is in one of the 3 statuses: rejected, pending, or approved.

The validator itself is a bit more involved, using an $or combination that lets us validate several alternative acceptable states of a document.
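To trace which documents this $or validator lets through, here’s the same accept/reject logic written as a plain Node.js predicate. This is only a client-side sketch of the rule’s semantics, not how the server evaluates it.

```javascript
// Sketch: mirrors the two $or branches of the credit-score validator.
function isValidCustomer(doc) {
  // Branch 1: creditScore below 700, or missing entirely, requires 'rejected'.
  const lowOrMissing = !("creditScore" in doc) || doc.creditScore < 700;
  const branch1 = lowOrMissing && doc.status === "rejected";
  // Branch 2: creditScore at or above 700 allows any of the three statuses.
  const branch2 =
    "creditScore" in doc &&
    doc.creditScore >= 700 &&
    ["pending", "rejected", "approved"].includes(doc.status);
  return branch1 || branch2;
}

console.log(isValidCustomer({ _id: 11, status: "rejected" }));                   // true
console.log(isValidCustomer({ _id: 12, status: "approved" }));                   // false
console.log(isValidCustomer({ _id: 13, creditScore: 700, status: "approved" })); // true
console.log(isValidCustomer({ _id: 14, creditScore: 300, status: "approved" })); // false
console.log(isValidCustomer({ _id: 15, creditScore: 300 }));                     // false
```

The five sample documents match the five inserts attempted below.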

To test this, let’s try a few inserts:

> db.customer.insert({_id:11, status: 'rejected'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:12, status: 'approved'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id:13, creditScore: 700, status: 'approved'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:14, creditScore: 300, status: 'approved'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id:15, creditScore: 300})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>

A customer with no creditScore field at all is possible, as long as the status is ‘rejected’. One with a status ‘approved’ but no creditScore field is invalid.

A customer with a creditScore of 700 and a status ‘approved’ is valid and Mongo accepts it.

A customer with a creditScore of 300 and a status ‘approved’ is not valid, since those with this low score should have a status ‘rejected’.

A customer with a low creditScore alone, without any status at all, is not accepted either.

By now, I have updated the validator definition on the collection several times. Which begs the question: what about existing documents in the collection? The answer is: nothing. Mongo will not reject a new validator because of existing documents. It will not delete, fix, or otherwise flinch about documents already in the collection. Mongo applies the validator to mutations only: updates and inserts.

What happens when you update a document that already existed? That depends on the validationLevel you mark your validator with. When you apply your validator to a collection, you can add a validationLevel field with a value of ‘off’, ‘strict’, or ‘moderate’. The default validation level is ‘strict’. Under strict validation, every update or insert must pass validation. If you use the moderate level, validation is applied when inserting a document and when updating an existing, valid document. An update to an existing but invalid document doesn’t undergo validation.

Earlier on, we inserted documents that have no creditScore and no status fields. Under our latest rule with strict validation, we can’t update them unless they adhere to the current new rule. But if we change the validator and add a validationLevel of moderate, we can. For example:

> db.runCommand( { collMod: "customer", validator: newRule, validationLevel: 'moderate'} )
{ "ok" : 1 }
// Existing document is
// {_id: 4, email: 42}
> db.customer.update({_id:4},{$set: {email: 'bob@corp.gov'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
// Now modified document is
// {_id: 4, email: 'bob@corp.gov'}

The new validator assigned to the collection is the same as our latest credit score one. We just added the validationLevel option to exempt existing invalid documents from validation upon update. This lets me change the existing document’s email field to an email string. Under the strict validation level this would not be allowed, because that document has no creditScore field and no status of ‘rejected’ either.

It’s useful to know what the current validator is (or any other collection-specific setting). To do that, use the shell function db.getCollectionInfos({name: 'collection_name'}) like so:

> db.getCollectionInfos({name: 'customer'})
[
  {
    "name" : "customer",
    "options" : {
      "validator" : {
        "$or" : [
          {
            "$or" : [
              {
                "creditScore" : {
                  "$lt" : 700
                }
              },
              {
                "creditScore" : {
                  "$exists" : false
                }
              }
            ],
            "status" : "rejected"
          },
          {
            "creditScore" : {
              "$gte" : 700
            },
            "status" : {
              "$in" : [
                "pending",
                "rejected",
                "approved"
              ]
            }
          }
        ]
      },
      "validationLevel" : "moderate",
      "validationAction" : "error"
    }
  }
]

Which gives you a nicely formatted account of the current validator expression and the validation level.

Notice the extra field validationAction set to “error”? The validation action controls what the Mongo server returns to a client that attempts to write something invalid. The choices are ‘error’ or ‘warn’, with ‘error’ being the default. As the names imply, an error is returned to the client if the action is ‘error’, and a warning if it is set to ‘warn’. To roll in a new validator, you might want to start with ‘warn’. When you use ‘warn’, a failed validation does not produce an error. The write continues regardless of validation. The validation failure is logged to Mongo’s log file, but the write operation is not prevented.

Applying a validation action is a matter of adding that field:

> db.runCommand( {
    collMod: "customer",
    validator: newRule,
    validationLevel: 'moderate',
    validationAction: 'warn'
  })
{ "ok" : 1 }
>
> db.customer.insert({_id: 16, status: 'pending'})
WriteResult({ "nInserted" : 1 })
>

Here we apply our latest rule with a moderate validation level, and a validation action of ‘warn’. Under ‘moderate’ validation level, an insert will always be validated. But with a validationAction of ‘warn’, the write will be accepted and only a log entry will tell us that something went awry. The insert itself went fine as far as a connected client is concerned.

Before you declare complete victory on the side of corporate auditing and the proponents of schema lock-down, there’s something else you should know: a client performing an update or insert can easily ignore all this validation goodness. All you have to do to circumvent validation is say so. When a client adds a bypassDocumentValidation field with the value true, the validator will not kick in and won’t prevent any writes at either the strict or moderate level. This goes something like:

> db.customer.insert({_id: 17, status: 'arbitrary_junk'}, { bypassDocumentValidation: true})
WriteResult({ "nInserted" : 1 })
/* log entry: ...Document would fail validation collection: validators.customer doc: { _id: 17.0, status: "arbitrary_junk" }
*/

Which will produce a warning in the log, but not prevent the operation. So this whole feature should be deemed a collaborative tool to improve data quality, not an enforcement hammer in the hands of the person controlling the collection.

This is a pretty interesting new feature. It straddles the line between a declared strict schema and a flexible, software-defined document in a very interesting way. It can help people protect against some of the dangers of a flexible document model, yet leaves open the ability to operate without the harsh restrictions that a pre-defined schema engine would impose.

If you are interested in learning more, let us know!