Time Series & Mongo
Time series data is a fairly common use case and has been “solved” before. Indeed, in blog posts around the net as well as in my course on schema design and data modeling, a common approach to storing such data was discussed at length.
But not everyone knows that bit of common wisdom. Further, it’s easy to get caught up in the storage details and make silly little mistakes that wreck the exercise.
Someone at MongoDB had the brilliant idea to simplify this further (I should buy them a drink of their choice if I find out who it was). Instead of you painstakingly bucketing each “reading” or time series item, then un-bucketing it or peppering your queries with time-bucket-boundary match clauses - why not provide a new collection type for time series which does the tricky bits for you?
As far as the client is concerned, every document is a stand-alone point-in-time item which has a timestamp and some measurement - a reading - composed of one or more fields recording your data. When you insert documents, you insert one per reading. You don’t have to push into arrays, compute any time window, or anything like that.
When reading from a time series collection, each point-in-time measurement returns as a single document - not a bucket with an array of readings within a window.
It appears largely un-magical, until and unless you compare it to what you had to do manually before this existed.
The “before” document was a bucket with multiple readings. For 12-hour buckets and, say, 4 temperature readings in the field temperature over a day, you would see something like this:
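A sketch of one such bucket document (field names and values illustrative - the exact shape of a manual bucket was up to you):

{
  _id: ObjectId("..."),
  sensorId: 123,
  bucketStart: ISODate("2021-09-01T00:00:00Z"),
  readings: [
    { timestamp: ISODate("2021-09-01T00:18:00Z"), temperature: 37.2 },
    { timestamp: ISODate("2021-09-01T03:12:00Z"), temperature: 36.9 },
    { timestamp: ISODate("2021-09-01T07:43:00Z"), temperature: 37.5 },
    { timestamp: ISODate("2021-09-01T11:05:00Z"), temperature: 37.1 }
  ]
}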
Using a Time-Series collection, the readings are presented to you individually:
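Each reading comes back as its own document, along these lines (values illustrative; the info field is explained further below):

{ _id: ObjectId("..."), timestamp: ISODate("2021-09-01T00:18:00Z"), info: { sensorId: 123 }, temperature: 37.2 }
{ _id: ObjectId("..."), timestamp: ISODate("2021-09-01T03:12:00Z"), info: { sensorId: 123 }, temperature: 36.9 }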
You will note that the bucket granularity is nowhere in the results! You may have chosen buckets spanning seconds (the default), minutes, or hours. But that is a storage choice, and your queries and users don’t need to care. This awesome arrangement lets users just be users: insert readings individually as they occur, read and summarize over time as if they were stored individually. On the surface this matches the business expectation and mental model of periodic measurements, written and read as they are.
The internals, however, are what’s nice here. The “collection” presented to the user (sensings in the example above) is really more of a view over internal collections that physically exist but which you do not interact with directly. The backing storage collections bucket the individual readings into fixed time windows. If you choose an hours granularity, multiple readings within the same window all fall into the same bucket, and the next window’s readings fall into the next bucket. Except now you don’t worry about structuring your document in any special way - Mongo does it for you!
At the time of writing, the internal physical backing collection is created when you declare your special Time-Series collection. For my example collection named sensings, Mongo has created a backing collection system.buckets.sensings and stores the physical documents under that namespace, in buckets.
show collections
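The listing includes both the nominal collection and its backing collection (other collections, if any, omitted):

sensings
system.buckets.sensings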
Oh, yes: you have to declare your special collection. If you have Mongo implicitly create a collection by inserting data into a yet-to-be-created collection, Mongo will assume a normal collection. The syntax requires one thing only - the timeField. I also like to include the metaField and granularity parameters, for reasons expanded on below.
To create my sensings collection, storing at an ingestion rate that fits hours-long buckets, issue the following command:
db.createCollection("sensings", {timeseries: {timeField: "timestamp", metaField: "info", granularity: "hours"}})
The collection named sensings will be created (as well as a corresponding system.buckets.sensings behind the scenes). It will treat the field named timestamp on your inserted documents as the time for bucketing purposes. It will also let you add extra non-measurement data or “meta” details in the field named “info”. The collection will optimize the buckets for a granularity consistent with multiple items arriving over a matter of hours. For high ingestion rates you should choose minutes or seconds. The default is seconds.
What is the business with the metaField and granularity about? It lets Mongo optimize the bucket sizes internally and organize storage efficiently. The theory behind creating time buckets is that:
- Events are written once
- Events are read over a time period
- Loading fewer documents that contain all the readings is more efficient than if each reading resided in its own physical document.
When doing things manually we could shove the metadata extras, such as a sensor ID or patient ID, into any field. But for Mongo to understand that the metadata is a label and not a metric, and to enable it to create buckets specific to groups of readings, we have to tell it where we store the metadata. In our case here, I used the info field. In that field I can store a single value or several different fields. This way, when I query for the average temperature of a particular sensorId, since it is written inside the info field, Mongo knows that and can load only the buckets for that meta value. Mongo will try to create a bucket for each time window and for each meta value or combination of values. That way, a query for sensor 123 doesn’t need to read buckets for sensor 456.
But the magic ends here, as things stand now, and Mongo does not automatically create indexes on the meta fields. It leaves that up to you. Mongo does create indexes on the bucket-boundary timestamps on the backing collection, but that is usually not enough for prime time. You will no doubt have some narrower interest in slicing the time series by some of the meta fields. I’ll add a compound index on my sensorId and the timestamp like so:
db.sensings.createIndex({"info.sensorId": 1, "timestamp": 1})
My query can now get the average temperature across the devices on a certain day, and leverage the index I just created:
db.sensings.aggregate([
  {$match: {timestamp: {$gte: ISODate("2021-09-01"), $lt: ISODate("2021-09-02")}}},
  {$group: {_id: "$info.sensorId", avgTemperature: {$avg: "$temperature"}}}
])
A query chased with an explain shows the index is used:
db.sensings.find({"timestamp" : ISODate("2021-09-01T00:18:00Z"), "info.sensorId" : 123}).explain()
The inputStage section shows an IXSCAN leveraging my index named info.sensorId_1_timestamp_1.
What is interesting to compare is the index definitions on the nominal collection sensings versus the internal collection system.buckets.sensings.
The nominal collection shows what you expect: an index on the direct structural fields timestamp and the embedded info.sensorId:
> db.sensings.getIndexes()
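The relevant entry looks roughly like this (abridged):

[
  {
    v: 2,
    key: { "info.sensorId": 1, "timestamp": 1 },
    name: "info.sensorId_1_timestamp_1"
  }
]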
But the backing collection shows a different picture:
> db.system.buckets.sensings.getIndexes()
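Roughly, and abridged (the exact shape may vary by server version):

[
  {
    v: 2,
    key: {
      "meta.sensorId": 1,
      "control.min.timestamp": 1,
      "control.max.timestamp": 1
    },
    name: "info.sensorId_1_timestamp_1"
  }
]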
The 2-field index I specified was actually created as a 3-field key behind the scenes. The timestamp - a single field in my nominal documents - was translated into a bucket boundary defined by the control.min.timestamp and control.max.timestamp fields. Makes sense. The raison d’être of this whole thing is exactly that: your nominal collection facade deals with individual reading documents, but the physical storage of the documents is something entirely different and more efficient than the facade lets on. For a lesson on how the more efficient document is actually structured, I inserted a few documents:
db.sensings.insertMany([   // a few example readings (values illustrative)
  {timestamp: ISODate("2021-09-01T00:18:00Z"), info: {sensorId: 123}, temperature: 37.2},
  {timestamp: ISODate("2021-09-01T00:24:00Z"), info: {sensorId: 123}, temperature: 37.4},
  {timestamp: ISODate("2021-09-01T00:18:00Z"), info: {sensorId: 456}, temperature: 36.9}
])
Then I chased it with a query on the backing collection system.buckets.sensings:
> db.system.buckets.sensings.findOne()
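Assuming the illustrative inserts above, the bucket document comes back roughly in this shape (abridged; the exact layout may vary by server version):

{
  _id: ObjectId("..."),
  control: {
    version: 1,
    min: { _id: ObjectId("..."), timestamp: ISODate("2021-09-01T00:18:00Z"), temperature: 37.2 },
    max: { _id: ObjectId("..."), timestamp: ISODate("2021-09-01T00:24:00Z"), temperature: 37.4 }
  },
  meta: { sensorId: 123 },
  data: {
    _id: { "0": ObjectId("..."), "1": ObjectId("...") },
    timestamp: { "0": ISODate("2021-09-01T00:18:00Z"), "1": ISODate("2021-09-01T00:24:00Z") },
    temperature: { "0": 37.2, "1": 37.4 }
  }
}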
The control fields describe the bucket itself. They provide boundary info: minimum and maximum timestamps for the bucket. The meta field contains the fields that I told Mongo I’d store under the original info field in my inserted documents. In my case that is a single field, but there could be more. It would make little logical sense for Mongo to have a bucket spanning multiple sensors, so Mongo will find or create a bucket for every differing meta value for a given time period. Finally, the data. Fields not originally contained in the info field as a sub-document, and fields which are not the timestamp itself, are stored under the data field in backing storage. This includes the temperature - as expected. It also includes the corresponding timestamp that each measurement was taken at. Since every document in Mongo has an _id, the _id of the reading is also stored here. The correlation among the triplet of values is positional, so the first reading in the bucket is spread across data._id[0], data.temperature[0], and data.timestamp[0], the next reading in the bucket is ...[1], and so forth.
Back to my aggregation from before: db.sensings.aggregate([{$match: {timestamp: {$gte: ISODate("2021-09-01"), $lt: ISODate("2021-09-02")}}}, {$group: {_id: "$info.sensorId", avgTemperature: {$avg: "$temperature"}}}] )
. Nowhere did I include any positional magic, unwinds, or extra filtering logic. MongoDB took my straightforward mental model of the nominal collection - one reading per document - and did all the heavy lifting. Life is good.
In the course of writing this post and poking around, I found this bit interesting: when you run collection stats on the nominal collection you get a new section named timeseries:
> db.sensings.stats().timeseries
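It looks roughly like this (abridged; field names as observed in MongoDB 5.0, values illustrative):

{
  bucketCount: 2,
  numBucketInserts: 2,
  numBucketsOpenedDueToMetadata: 2,
  numBucketsClosedDueToCount: 0,
  numBucketsClosedDueToSize: 0,
  numBucketsClosedDueToTimeForward: 0,
  numBucketsClosedDueToMemoryThreshold: 0,
  numMeasurementsCommitted: 3,
  ...
}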
The new section gives some clues that the optimization and the salient metrics to watch have to do with buckets: their management in terms of “opening” and “closing” them with respect to aspects such as the metadata, the number of items, the total size of the bucket, the time boundaries, as well as memory pressure. Pretty interesting, but I have no insight or clue yet on how to interpret these numbers further. Time will tell (rim-shot, exit stage).
What’s next for Time-Series? I don’t know. Now that the abstraction and use case have been introduced, the team can presumably focus on even more efficient storage or ways to process time-series queries more efficiently. Judging from the past, this is the way things unfold: a feature is introduced, which “makes it a thing”, then future iterations make it even better and more useful. I, for one, see this as a great new feature.