Who would save() me now? MongoDB C# Driver 2.0 deprecates save()

For years now, I’ve been using mongo’s save() method on a collection. It’s convenient: hand it a document with an id, slam it in, and done. With the C# 2.0 driver (and other drivers as well), it’s now gone!

Will we miss it? Should we miss it? Let’s take a closer look:

First – what does “save” actually mean? The save() function provided add-or-replace semantics. If a document with that id existed, it would be overwritten with the new document. If a document with that id did not exist, then the document at hand would become a new document. Seems legit, right?

Consider, though, what would happen when a document already existed. It would be gone. Gone in the sense that the new document would overwrite the existing one. I know, I know. We know that! But not everyone catches on to this. Some people have in mind a merge-and-save behavior, a non-existent behavior where save() would somehow:

  1. Overwrite fields from the new document over any existing ones

  2. Add fields from the new document that didn’t exist before

  3. Leave alone existing fields in the old document which aren’t present in the new document.

Well, effectively, 1 and 2 would actually happen, but 3 would not. And more than one naïve developer would then be surprised to find skimpy documents “missing” previous values. The remedy, of course, is education. But on the other hand, maybe there’s a better way (please read on).

Second – what did save() actually do? “It saved it” would be the first inclination. Yes, it did. But how? Turns out, it had a bit of logic behind it. If the new document you hand to save() didn’t have an id field defined, then save() would attempt to assign it an id and then simply insert() the document. This depended on an id generator being present, assumed, or inferred. In the shell, an ObjectId() would be assigned. Language drivers had conventions and defaults to cover this scenario.

In pseudo code, this would look something like:

if (newDocument._id === undefined) {
    // assign an id, then insert
    newDocument._id = _idGenerator.getNewId(); // conceptually
    db.collectionName.insert(newDocument);
}
// otherwise, take the update() path described below

If the document did have an id defined, then save() would turn around and execute an update(). That is, it would send an update command to the mongo server with the {upsert: true} option set, using the _id to identify which document to update. If a document with that _id did not exist, the document would be created with that _id. That seems fine, right? But here is where things get interesting.

The update command can operate with 2 different interpretations of the “update” term it is given.

When the update term is “plain”, Mongo takes the update term and uses it as a verbatim document, setting the entire stored document to that update. Plain means that no field in the update term starts with a dollar sign (“$”) – in other words, the update term contains no operators.

If mongo senses that the update term contains operators, it performs a surgical update instead, carrying out only the field updates specified and preserving the values of fields not mentioned in the update.
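To make the distinction concrete, here is a rough sketch using the 2.0 C# driver (the Person type and field values are illustrative, borrowed from the snippet later in this post, and collection is assumed to be an IMongoCollection&lt;Person&gt;):

// operator ("surgical") update: only the Name field changes; other fields are preserved
var surgical = Builders<Person>.Update.Set(p => p.Name, "Bob");
var t1 = collection.UpdateOneAsync(p => p.Id == "some_id", surgical);

// "plain" update, i.e. a whole-document replacement: the stored document becomes newPerson
var newPerson = new Person { Id = "some_id", Name = "Bob" };
var t2 = collection.ReplaceOneAsync(p => p.Id == "some_id", newPerson);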

Since save() invoked update() in this “plain” mode, any existing document would be replaced (the update() behavior is documented quite well here).

The pseudo code for this would just look like an update, since an id was guaranteed to be present (otherwise the insert() path would have been chosen), something along the lines of:

db.collectionName.update({_id: newDocument._id}, newDocument, {upsert: true});

Fine then, one might say. But why not just transform the new document into a bunch of $set operators? Well, that’s just not how save() worked. And even if it did, would that be the correct behavior? If a user supplied a document with 3 fields, and the previous document had 5, did the user intend that the new document contain the 3 new fields and the old 2? Or did the user intend that the new document contain only the new 3 fields?

Deprecation feels a bit like a loss. But the semantics are, in fact, still supported, albeit with a different syntax. Consider this C# snippet:

var person = new Person { Id = "some_id", Name = "Bob" };
var filter = Builders<Person>.Filter.Eq(p => p.Id, person.Id);
var task = mongoCollection.ReplaceOneAsync(filter, person, new UpdateOptions {IsUpsert = true});

Given a person object with some assigned id, ReplaceOneAsync with IsUpsert = true will carry out the intended save(). The syntax is a bit more elaborate, but the meaning is clear.

The words “replace one” refer to the whole document, not individual fields, which conveys the meaning well. The “upsert” intent is also explicit: when the value is true, the document will be inserted if it doesn’t already exist; when false, the document will only be replaced if it exists. In addition, this syntax has you set the filter specifying which document to update on your own. You can, for instance, filter on a field other than _id.

Theoretically this gives you the flexibility to not care about the _id at all. Technically, you can express a filter on a field other than _id. But in practice, this will go nowhere fast: the “upserted” document must have some _id, and if the filter first finds a document whose _id doesn’t match the incoming document’s, an error will occur. When we run mongo training courses, questions around these kinds of things arise quite often. Hopefully this sheds a bit of light on the why, and on how to properly address such concerns.

The save() function may be deprecated, but the intended functionality is not. In the new C# driver, you can achieve the same thing using ReplaceOneAsync. I like software that says what it does and does what it says!

Developers should fare better now that things are explicit, and the nuances of save() vs. insert() vs. update() are less of a mystery.

Of transactions and Mongo

What’s the first thing you hear about NoSQL databases? That they lose your data? That there’s no transactions? No joins? No hope for “real” applications?

Well, you should be wondering whether a certain kind of database is the right one for your job. But if you do, you should be wondering that about “traditional” databases as well!

In the spirit of exploration let’s take a look at a common challenge:

  1. You are a bank.
  2. You have customers with accounts.
  3. Customer A wants to pay B.
  4. You want to allow that only if A can cover the amount being transferred.

Let’s look at the problem without any particular database engine in mind. What would you do? How would you ensure that the transfer is done “properly”? Would you prevent a “transaction” from taking place unless A can cover the amount?

There are several options:

  1. Prevent any change to A’s account while the transfer is taking place. That boils down to locking.
  2. Apply the change, and allow A’s balance to go below zero. Charge person A some interest on the negative balance. Not friendly, but certainly a choice.
  3. Don’t do either.

Options 1 and 2 are difficult to attain in the NoSQL world. Mongo won’t save you headaches here either.

Option 3 looks a bit harsh. But here’s where this can go: a ledger. See, an account doesn’t need to be represented by a single row in a table of all accounts with only the current balance on it. More often than not, accounting systems use ledgers. And entries in ledgers – as it turns out – don’t actually get updated. Once a ledger entry is written, it is not removed or altered. A transaction is represented by an entry in the ledger stating an amount withdrawn from A’s account and an entry stating the addition of said amount to B’s account. To save space, both sides of the transfer can be recorded as a single entry. Think {Timestamp, FromAccountId, ToAccountId, Amount}.
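As a sketch only (the class, field, and collection names here are my own assumptions, not a prescribed schema), a ledger entry and the recording of a transfer with the C# driver could look like this:

// assuming: using MongoDB.Bson; using MongoDB.Driver;
public class LedgerEntry
{
    public ObjectId Id { get; set; }
    public DateTime Timestamp { get; set; }
    public string FromAccountId { get; set; }
    public string ToAccountId { get; set; }
    public double Amount { get; set; }
    public bool Validated { get; set; } // set later by the validation process
}

// recording a transfer is a single-document insert: nothing is updated or deleted
var ledger = database.GetCollection<LedgerEntry>("ledger");
var entry = new LedgerEntry
{
    Timestamp = DateTime.UtcNow,
    FromAccountId = "A",
    ToAccountId = "B",
    Amount = 100
};
var task = ledger.InsertOneAsync(entry);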

The implication of the original question – “how do you enforce the non-negative balance rule?” – then boils down to:

  1. Insert entry in ledger
  2. Run validation of recent entries
  3. Insert reverse entry to roll back transaction if validation failed.

What is validation? Sum up the transactions on A’s account (all deposits and debits), and ensure the balance is positive. For the sake of efficiency, one can roll up transactions and “close the book” with a pseudo entry stating the balance as of midnight or something. This lets you avoid doing math on the fly over too many transactions; you simply run from the latest “approved balance” marker to date. But that’s an optimization, and premature optimization is the root of (some? most?) evil.
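A sketch of such a validation pass with the C# driver, reusing the LedgerEntry shape assumed above (in a real system you would start from the latest “approved balance” marker rather than scanning everything):

// requires: using System.Linq; runs inside an async method
// pull A's entries and sum them: credits to A add, debits from A subtract
var entries = await ledger
    .Find(e => e.FromAccountId == "A" || e.ToAccountId == "A")
    .ToListAsync();
var balance = entries.Sum(e => e.ToAccountId == "A" ? e.Amount : -e.Amount);

if (balance < 0)
{
    // validation failed: insert a compensating (reverse) entry rather than deleting anything
}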

Back to some nagging questions though: “But mongo is only eventually consistent!” Well, yes, kind of. It’s not actually true that Mongo has no transactions. It would be more accurate to say that Mongo’s transaction scope is a single document in a single collection.

A write to a Mongo document happens completely or not at all. So although it is true that you can’t update more than one document “at the same time” under a “transaction” umbrella as an atomic update, it is NOT true that there is no isolation. Two concurrent updates competing for the same document are handled coherently: the writes are serialized, and they will not scribble on the same document at the same time. In our case – having chosen a ledger approach – we’re not even trying to “update” a document; we’re simply adding a document to a collection. So there goes the “no transactions” issue.

Now let’s turn our attention to consistency. What you should know about mongo is that at any given moment, only one member of a replica set is writable. This means that the writable instance in a set of replicated instances always has “the truth”. There could be replication lag such that a reader going to one of the replicas still sees an “old” state of a collection or document. But in our ledger case, things fall nicely into place: run your validation against the writable instance. It is guaranteed to have a ledger either with (after) or without (before) the new ledger entry. No funky states. Again, writing to the ledger adds a document, so there’s no inconsistent document state to be had either way.

Next, we might worry about data loss. Here, mongo offers several write-concerns. A write-concern in Mongo marshals how uptight you want the db engine to be about actually persisting a document write to disk before it reports to the application that it is “done”. The most volatile is to say you don’t care. In that case, mongo would just accept your write command and say “thanks” back, with no guarantee of persistence. If the server loses power at the wrong moment, it may have said “ok” but not actually written the data to disk. That’s kind of bad. Don’t do that with data you care about. It may be good for votes in a poll about how cute a furry animal is, but not so good for business.

There are several other write-concerns, varying from flushing the write to the disk of the writable instance, to acknowledgment by several members of the replica set, a majority of the replica set, or all of its members. The first choice is the quickest, as no network coordination is required beyond the main writable instance. The others impose extra network and time cost. Depending on your tolerance for latency and read-lag, you will face a choice of what works for you.
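With the C# driver, for instance, the write-concern is a per-collection (or per-client) setting; a minimal sketch, reusing the ledger collection handle assumed earlier:

var fireAndForget = ledger.WithWriteConcern(WriteConcern.Unacknowledged); // "don't care"
var acknowledged  = ledger.WithWriteConcern(WriteConcern.W1);             // acknowledged by the writable instance
var majority      = ledger.WithWriteConcern(WriteConcern.WMajority);      // acknowledged by a majority of the replica set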

It’s really important to understand that no data loss occurs once a document is flushed to an instance. The record is on disk at that point. From that point on, backup strategies and disaster recovery are your worry, not loss of power to the writable machine. This scenario is not different from a relational database at that point.

Where does this leave us? Oh, yes. Eventual consistency. By now, we ensured that the “source of truth” instance has the correct data, persisted and coherent. But because of lag, the app may have gone to the writable instance, performed the update and then gone to a replica and looked at the ledger there before the transaction replicated. Here are 2 options to deal with this.

Similar to write concerns, mongo supports read preferences. An app may choose to read only from the writable instance. This is not an awesome choice to make for every read, because it burdens that one instance and doesn’t make use of the other read-only servers. But this choice can be made on a query-by-query basis. So for the app that our person A is using, we can have person A issue the transfer command to B, and then, if that same app is going to immediately ask “are we there yet?”, we’ll query that same writable instance. But B and anyone else in the world can just chill and read from a read-only instance. They have no basis to expect that the ledger has just been written to, so as far as they know, the transaction hasn’t happened until they see it appear later.

We can further relax the demand by creating an application UI that reacts to a write command with “thank you, we will post it shortly” instead of “thank you, we just did everything and here’s the new balance”. This is a very powerful thing. UI design for highly scalable systems can’t insist that all databases be locked just to paint an “all done” on screen. People understand. They have already been trained by many online businesses that placing an order does not mean the product is already outside your door waiting (yes, I know, large retailers are working on it… but we’re not there yet).
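Back to the read-preference mechanics: with the C# driver the choice can be made per collection handle, and therefore per query; a small sketch, again using the assumed ledger collection:

// person A's own "are we there yet?" check goes to the writable (primary) member
var authoritative = ledger.WithReadPreference(ReadPreference.Primary);

// everyone else can read from a secondary and tolerate a little lag
var relaxed = ledger.WithReadPreference(ReadPreference.SecondaryPreferred);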

The second thing we can do is add some artificial delay to a transaction’s visibility on the ledger. The way that works is simply adding some logic such that queries run for customers never return a ledger entry that is newer than, say, 15 minutes and whose validation flag is not yet set (a sketch of such a query follows the list below).

This buys us time 2 ways:

  1. Replication can catch up to all instances by then, and validation rules can run and determine if this transaction should be “negated” with a compensating transaction.
  2. In case we do need to “roll back” the transaction, the backend system can place the timestamp of the compensating transaction at the exact same time or 1ms after the original one. Effectively, once A or B visits their ledger, both transactions would be visible and the overall balance “as of now” would reflect no change. The 2 transactions (attempted/reverted) would be visible, since we do actually account for the attempt.
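Here is a sketch of that delayed-visibility query with the C# driver, once more against the assumed LedgerEntry shape: an entry is visible only if it has already been validated or is older than 15 minutes.

var cutoff = DateTime.UtcNow.AddMinutes(-15);
var visibleFilter = Builders<LedgerEntry>.Filter.Or(
    Builders<LedgerEntry>.Filter.Eq(e => e.Validated, true),
    Builders<LedgerEntry>.Filter.Lte(e => e.Timestamp, cutoff));
var visibleEntries = await ledger.Find(visibleFilter).ToListAsync(); // inside an async method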

Hold on a second. There’s a hole in the story: what if several transfers from A to various accounts are registered, and 2 independent validators attempt to compute the balance concurrently? Is there a chance that both would conclude non-sufficient-funds, even though rolling back transaction 100 would free up enough for transaction 117 (some random later transaction)? Yes, there is that chance. But the integrity of the business rule is not compromised, since the prime rule is: don’t dispense money you don’t have. To minimize or eliminate this scenario, we can also assign a single validation process per origin account. This may seem non-scalable, but it can easily be done as a “sharded” distribution. Say we have 11 validation threads (or processing nodes, etc.). We divide the account number space such that each validator is exclusively responsible for a certain range of account numbers. Sounds cunningly similar to Mongo’s sharding strategy, doesn’t it? Each validator then works in isolation. More capacity needed? Chop the account space into more chunks.
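A tiny sketch of that partitioning idea (numeric account ids and the validator count are assumptions made purely for illustration):

// each validator exclusively owns a contiguous slice of the account-number space
static int ValidatorFor(long accountNumber, long maxAccountNumber, int validatorCount = 11)
{
    var rangeSize = (maxAccountNumber / validatorCount) + 1;
    return (int)(accountNumber / rangeSize); // 0 .. validatorCount - 1
}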

So where are we now with the nagging questions?

  • “No joins”: Huh? What are those for?
  • “No transactions”: You mean no cross-collection and no cross-document transactions? Granted – but you don’t always need them either.
  • “No hope for real applications”: Well… if you want locking transactions, look to another database.

There are more issues and edge cases to slog through, I’m sure. But hopefully this gives you some ideas of how to solve common problems without distributed locking and relational databases. But then again, you can choose relational databases if they suit your problem.

MVC Model State Validation–DRY to the rescue!

ASP.NET MVC comes with nice features to aid model validation. Unfortunately, you are still stuck writing boilerplate code on all the data entry actions. The boilerplate code looks something like:

public ActionResult DoSomething(Foo value)
{
    if (!ModelState.IsValid)
    {
        return View();
    }
    // ... do some actual work ...
    return View("AllGoodThanks");
}

The common desired behavior is that when the submitted model is invalid the view is immediately returned so the user can fix erroneous entries. But since the flow is such that a value needs to be returned, you can’t just refactor this into a common method.

What to do? Let’s implement DRY (don’t repeat yourself. Duh! I just did…) based on an ActionFilterAttribute.

public class ValidateModelAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        if (filterContext.Controller.ViewData.ModelState.IsValid)
        {
            return;
        }
        filterContext.Result = new ViewResult
        {
            ViewName = filterContext.ActionDescriptor.ActionName,
            ViewData = filterContext.Controller.ViewData,
            TempData = filterContext.Controller.TempData
        };
    }
}

This custom attribute uses the same mechanism the Controller would have used and relies on model attributes to signal data fitness.

A straightforward behavior returning the user to the same form (view) is sufficient in most cases:

[ValidateModel]
public ActionResult DoSomething(Foo value)
{
    // ... do some work ...
    return View("AllGoodThanks");
}

The total amount of code saved grows as you add more and more actions (as my projects tend to do once they gain momentum), and it is quite significant.

Manufacturing a MongoDB ObjectId for the past

MongoDB’s ObjectId() has some nice sequential properties. One of the interesting ones is the fact that the most significant 4 bytes are a timestamp with seconds granularity.

Suppose you want to query your collection for items created on or after a certain date. Since the timestamp portion can be constructed (seconds since the epoch), and the rest can be manufactured (zeros would be fine), we can write a function to generate what an ObjectId from that moment would be, or be just higher or lower than:

var past = new Date((new Date()).getTime() - (90 * 24 * 60 * 60 * 1000));
var stamp = ObjectId(Math.floor(past.getTime() / 1000).toString(16) + "0000000000000000");

The stamp variable now holds an ObjectId whose value is the floor of any ObjectId generated 90 days ago, to seconds granularity. Using the stamp value, we can then write a query for objects created on or after that time.
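In the shell, the query is then a simple find() with $gte on _id. For completeness, here is a rough equivalent with the C# driver (the collection name and document type are assumptions), built the same way by padding the hex seconds with zeros:

// assuming: using MongoDB.Bson; using MongoDB.Driver;
var past = DateTime.UtcNow.AddDays(-90);
var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var seconds = (int)(past - epoch).TotalSeconds;
var stamp = new ObjectId(seconds.ToString("x8") + "0000000000000000");

var filter = Builders<BsonDocument>.Filter.Gte("_id", stamp);
var recent = await collection.Find(filter).ToListAsync(); // inside an async method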

While this value may not be suitable for exact reporting (the rounding may exclude or include some values because of the lack of granularity), it is well suited to finding records inserted at or around that time, for purposes such as retiring older records.

MongoDB log rotation

MongoDB’s engine can log quite a bit of useful detail. Whether due to a high transaction rate or a verbose log setting, the log can get quite large.

While setting the log mode to append helps you retain the old / existing log, mongo does not currently have a facility to rotate the log at prescribed times or when a size limit is reached. In other words, the log will grow indefinitely.

There are 2 ways to have the engine release the current file and start a new one:

  1. Send the mongod process a SIGUSR1 signal
  2. Issue a command to mongod via a client connection

The first option, available on Unix variants, is issued like so:

killall -SIGUSR1 mongod

This would force log rotation on all instances of mongod on that machine.

The second option requires a connection to mongo. The mongo shell is capable of running in non-interactive mode in 2 ways: using an --eval command line expression, or running a named JavaScript file. Let’s pick the eval method, since we only have one command to send. Since the logRotate command needs to be issued against the admin database, we specify that on the connection string directly:

mongo localhost/admin --eval "db.runCommand({logRotate:1})"

The result of either of these methods is that mongod will take its current log file, say /var/log/mongo/mongod.log, and rename it with a date/time stamp suffix, such as /var/log/mongo/mongod.log.2012-01-31T12-34-56 if invoked on January 31st at 12:34:56.

The next sticking point is that you may want to compress that file down, and clean out older log files. There are some tools out there, logrotate being one, but I decided to write a small shell script:

#!/bin/bash
### log rotate
mongo localhost/admin --eval "db.runCommand({logRotate:1})"
### compress newly rotated
for f in /var/log/mongo/mongod.log.????-??-??T??-??-??; do
    7za a "$f.z" "$f"
    rm -f "$f"
done
### remove files older than x days
find /var/log/mongo/mongod.log.????-??-??T??-??-??.z -ctime +14 -delete

You might like a different compression tool; 7z works for me. The script names the archives with a .z suffix, so the cleanup step looks for that.

Notice how the find command is issued against the file change time (-ctime), and is configured here to delete files older than 14 days. Your rotation and deletion policy may require otherwise. You can run this as often as you wish to keep files small and granular, or less frequently to get logs covering extended periods. It is sometimes useful to spelunk a single larger file instead of scrolling through several hourly files to track a single event.

This solution takes care of time-triggered rotation, and does not sense file size in any way. But it should be easy enough to modify the script to only rotate if the current mongod.log is larger than some predefined size.

Happy admin!

Custom domain for Windows Azure in the new portal

The new Windows Azure Portal looks great, but has moved things around a bit. This post serves as a note to self and others:

How do I set a custom domain name for blob / table / queue storage?

  1. Go to the new portal https://manage.windowsazure.com/
  2. Click the “Storage” item on the left (icon reminiscent of a table or spreadsheet)
  3. Click on the storage item for which you want to create a custom domain
  4. Click the “configure” tab (you are in “dashboard” by default)
  5. Click the “manage domain” icon on the bottom action bar (look all the way at the bottom between “manage keys” and “delete”)
  6. Enter the full domain name you want to have point to the storage, e.g. “bob.mydomain.com” (assuming you own mydomain.com)
  7. Set up a CNAME record in your DNS server for the domain you own as instructed
  8. Validate the CNAME entry (may need a bit of time to propagate, so let it).

Steps 6-8 described here: http://www.windowsazure.com/en-us/develop/net/common-tasks/custom-dns/

Operations in action–Marking WCF interface as down for maintenance

As many of us deploy our shiny web services and expose them to the world (or just our apps), we invariably encounter these pesky maintenance windows. During those windows, a database, another web service, or some other IO-dependent resource is unavailable, and tasks that depend on it cannot be performed.

Wouldn’t it be nice to tell the caller of your web API that the operation is currently unavailable? It can get pretty ugly if we don’t solve this. If we simply bring down the whole endpoint, connecting clients will experience a pile-up of timed-out connection attempts. If we leave it up, every operation attempted will experience its own slow, excruciating failure, with the same IO timeout pile-up, this time on your server, often bringing the server to its knees with too many doomed connection requests queued up. My game plan shaped up to:

  1. Each service operation shall return a standard response, exposing some status flag
  2. A configuration controls whether services are to be marked as unavailable
  3. A WCF extension will take care of returning the standard response with the proper flag when so configured, but will let the regular response return under normal conditions.

The requirement that each operation return a standard response may seem peculiar. You may have created operations like:

string GetUserName(string id);
DateTime GetUserBirthdate(string id);

The thing is, when operations fail you have no way to signal the caller except smelly nulls or thrown exceptions. Although a SOAP fault exception can do the trick, I find it distasteful to throw a client-fault exception, because exceptions are more costly and validation of request data often enough finds client faults. For that and other reasons, I use code that looks like the following:

[DataContract(Namespace = "...")]
public class ServiceResponse
{
    [DataMember]
    public string Error { get; set; }

    [DataMember]
    public ResponseStatus Status { get; set; }
}

Where the status is an enumeration:

[DataContract(Namespace = "...")]
[Flags]
public enum ResponseStatus
{
    [EnumMember]
    None = 0,
    /// <summary>
    /// Operation completed without failure
    /// </summary>
    [EnumMember]
    Success = 1,
    /// <summary>
    /// General failure
    /// </summary>
    [EnumMember]
    Failure = 2,
    /// <summary>
    /// Client request not valid or not acceptable
    /// </summary>
    [EnumMember]
    ClientFault = 4,
    /// <summary>
    /// Server failed processing request
    /// </summary>
    [EnumMember]
    ServerFault = 8,
    /// <summary>
    /// The underlying service is not available, down for maintenance or otherwise marked as non-available.
    /// </summary>
    [EnumMember]
    BackendFault = 16,
    /// <summary>
    /// Convenience value for client fault failure comparison
    /// </summary>
    ClientFailure = Failure | ClientFault,
    /// <summary>
    /// Convenience value for server fault failure comparison
    /// </summary>
    ServerFailure = Failure | ServerFault,
    /// <summary>
    /// Convenience value for backend failure comparison.
    /// </summary>
    BackendFailure = Failure | BackendFault
}

One may also abstract the ServiceResponse to an interface, allowing any response object to implement the interface rather than inherit the base response. For this post, let’s just go with the base class.

Now the return type of every operation would be an object derived from ServiceResponse. Rather than a fragmented GetName, GetBirthdate etc. – a chatty interface anyway – we would expose:

[DataContract(Namespace = "...")]
public class GetUserResponse : ServiceResponse
{
    [DataMember]
    public string Name { get; set; }

    [DataMember]
    public DateTime Birthdate { get; set; }

    // whatever else a user profile has..
}

// then the operation signature becomes
[ServiceContract]
public interface IMyService
{
    [OperationContract]
    GetUserResponse GetUser(string id);

    // and other operations
}

Now that we have that out of the way, you get the payoff: we can define a fail fast attribute to decorate operations we know rely on some back-end which may be turned off on us. We’ll utilize the IOperationBehavior extension point of WCF, allowing us to specify behavior on an operation by operation basis.

I’ve created an attribute implementing the IOperationBehavior. It replaces the operation invoker with my own implementation when ApplyDispatchBehavior is called. All other IOperationBehavior methods remain blank.

public class FailFastOperationAttribute : Attribute, IOperationBehavior
{
    public void Validate(OperationDescription operationDescription) { }

    public void ApplyDispatchBehavior(OperationDescription operationDescription, DispatchOperation dispatchOperation)
    {
        var returnType = operationDescription.SyncMethod.ReturnType;
        dispatchOperation.Invoker = new FailFastOperationInvoker(dispatchOperation.Invoker, returnType);
    }

    public void ApplyClientBehavior(OperationDescription operationDescription, ClientOperation clientOperation) { }

    public void AddBindingParameters(OperationDescription operationDescription, BindingParameterCollection bindingParameters) { }
}

The finishing piece is to implement the operation invoker. It will check a special configuration, and based on that would either invoke the underlying operation as the stock implementation would have, or construct a new response with the failed flags set.

public class FailFastOperationInvoker : IOperationInvoker
{
    private readonly IOperationInvoker _operationInvoker;
    private readonly Type _returnType;

    public FailFastOperationInvoker(IOperationInvoker operationInvoker, Type returnType)
    {
        _operationInvoker = operationInvoker;
        _returnType = returnType;
    }

    #region IOperationInvoker Members

    public object[] AllocateInputs()
    {
        return _operationInvoker.AllocateInputs();
    }

    public object Invoke(object instance, object[] inputs, out object[] outputs)
    {
        object result;
        if (Config.ShouldFailFast())
        {
            outputs = new object[0];
            // construct a response of the type the specific method expects to return
            result = Activator.CreateInstance(_returnType);
            // mark the response as a fail-fast failure
            (result as ServiceResponse).Error = "Not available";
            (result as ServiceResponse).Status = ResponseStatus.Failure | ResponseStatus.BackendFault;
        }
        else
        {
            result = _operationInvoker.Invoke(instance, inputs, out outputs);
        }
        return result;
    }

    public IAsyncResult InvokeBegin(object instance, object[] inputs, AsyncCallback callback, object state)
    {
        return _operationInvoker.InvokeBegin(instance, inputs, callback, state);
    }

    public object InvokeEnd(object instance, out object[] outputs, IAsyncResult result)
    {
        return _operationInvoker.InvokeEnd(instance, out outputs, result);
    }

    public bool IsSynchronous
    {
        get { return _operationInvoker.IsSynchronous; }
    }

    #endregion
}

A method for determining whether the API should be up or down hides behind the Config.ShouldFailFast() call. Read an app setting, check a file, do whatever you like to make that determination.
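As an example only (the Config class shape and the “FailFast” app-setting key are assumptions, not part of any framework), the flag could simply come from configuration:

// requires a reference to System.Configuration
public static class Config
{
    public static bool ShouldFailFast()
    {
        var setting = System.Configuration.ConfigurationManager.AppSettings["FailFast"];
        bool failFast;
        return bool.TryParse(setting, out failFast) && failFast;
    }
}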

The next thing is manufacturing an instance of a response object. Here we need to create the same type, or a type assignable to the one the method formally returns. Note that the type needs a parameter-less constructor for this to work. Since all my service DTOs are plain POCOs, this is rarely a restriction. With this code in place, all we need to do is decorate specific methods with [FailFastOperation] and bingo!
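Usage then looks like decorating the operations that depend on a backend, for example:

[ServiceContract]
public interface IMyService
{
    [OperationContract]
    [FailFastOperation]
    GetUserResponse GetUser(string id);

    // operations without the attribute behave as usual
}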

Of Image Exif, PropertyItem and reflection

As documented here on MSDN the PropertyItem object does not have a public constructor.

What to do, then, when you want to add a property to an image using the Image.SetPropertyItem(..) method?

This post suggests you create a bank of all the property items you might want, hold it in memory, and clone from it.

A commenter on that blog suggested using reflection: get the non-public parameter-less constructor and invoke it. A notable downside of this approach is its reliance on the internal implementation of the object. True. I’ll risk it though.

In my implementation, I added a helper method which simply generates the PropertyItem using System.Activator like so:

public static PropertyItem CreatePropertyItem(int id, int length, short exifType, byte[] buffer)
{
    var instance = (PropertyItem)Activator.CreateInstance(typeof(PropertyItem), true);
    instance.Id = id;
    instance.Len = length;
    instance.Type = exifType;
    instance.Value = buffer;
    return instance;
}

Pretty clean and simple. Under the covers, Activator will use some reflection to create the instance, but it also utilizes some caching and speed written by not-me. I like not-me code because it means I don’t have to write it.

Since one of my upcoming talks at http://socalcodecamp.com is on the subject of reflection, this all falls neatly into place.