<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>+N Consulting, Inc.</title>
  
  <subtitle>Websites | Databases | Consulting | Training</subtitle>
  <link href="https://www.plusnconsulting.com/atom.xml" rel="self"/>
  
  <link href="https://www.plusnconsulting.com/"/>
  <updated>2024-12-05T23:30:01.263Z</updated>
  <id>https://www.plusnconsulting.com/</id>
  
  <author>
    <name>+N Consulting</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>The MongoDB Durability and High Availability Advantage</title>
    <link href="https://www.plusnconsulting.com/post/mongodb-availability-and-durability/"/>
    <id>https://www.plusnconsulting.com/post/mongodb-availability-and-durability/</id>
    <published>2024-12-05T17:08:07.000Z</published>
    <updated>2024-12-05T23:30:01.263Z</updated>
    
    <content type="html"><![CDATA[<h1 id="The-MongoDB-Durability-and-High-Availability-Advantage"><a href="#The-MongoDB-Durability-and-High-Availability-Advantage" class="headerlink" title="The MongoDB Durability and High Availability Advantage"></a>The MongoDB Durability and High Availability Advantage</h1><p>In this era of digital transformation, businesses rely on data-driven applications and data driven insights to make informed decisions. The central piece of this is data. Moreover, the unspoken features of the data are widely understood to be threefold: That the data is accurate, that the data is available, and that the data is durable. The following discussion will touch on the last two features. When selecting a database to serve your real-time applications, the reliability and availability of your databases are crucial. How do we acheive that? How do we determine the levels of availability and durability appropriate for our application? Which database provides us with the strategic advantage that drives agility and rapid innovation without forgoing the three key data features? This article will examine the MongoDB database architecture with respect to those questions, and help shed light on some of these aspects.</p><blockquote><p>MongoDB provides increased durability and high availability compared to traditional databasesThis translates into improved data integrity, reduced downtime, and higher system resilience.</p></blockquote><h2 id="About-Data-durability-and-high-availability"><a href="#About-Data-durability-and-high-availability" class="headerlink" title="About Data durability and high availability"></a>About Data durability and high availability</h2><p>Availability and durability are key concepts that all databases address, or at least should address to be considered for production workloads. Especially in the world of OLTP (Online Transaction Processing), data which is not guaranteed to be durable and available makes for low utility. 
Before digging into MongoDB’s approach, let’s define these two terms:</p><p><strong>Data durability</strong> refers to the ability of a database system to persist data even in cases of hardware or operating system failure. A database system that guarantees durability should have the same data that was available pre-crash completely present and usable after crash recovery. The database system ensures that data is not lost due to abnormal system behavior. If power is lost, a disk is corrupted, the operating system crashes unexpectedly, or the database process crashes - data which was written prior to any such event should not be lost.</p><p><strong>High availability</strong> refers to the ability of a database system to remain accessible and operational, even in the event of hardware failures or network outages. The word “system” in “database system” is key here, since a certain redundancy is assumed. Whether by stuffing a chassis with extra hardware (CPU, memory, I&#x2F;O bus, disks, power supply, network, etc.) or by enrolling multiple host machines in the database system as a whole, this feature requires redundancy to a degree where no single resource failure causes the system as a whole to stop fulfilling its function. This is typically accomplished through techniques like replication, load balancing, and automatic failover. A highly available database system minimizes downtime and continues to function in the face of component failure.</p><p>Traditionally, relational database systems struggled to provide both strong durability and high availability. While they offer robust data consistency, their high-availability offerings were often complex to configure, complex to manage, and frequently required manual intervention or considerable downtime to stabilize in the face of failure. The bigger the system, the more pronounced this became. 
This complexity itself sometimes led to failure, increasing - rather than decreasing - the risk of data loss and system downtime.</p><p>Here are a few examples of traditional database system approaches, with noted risks to durability and high availability.</p><table><thead><tr><th>Setup</th><th>Aspect</th><th>Risk</th></tr></thead><tbody><tr><td>RAID Subsystem</td><td><code>H/A</code>, <code>Durability</code></td><td>RAID subsystem failure, connection failure, enclosure failure.</td></tr><tr><td>Block Replication</td><td><code>H/A</code>, <code>Durability</code></td><td>Replication lag, data loss, integrity if blocks are corrupt, consistency issues, difficult to determine logical data point of recovery.</td></tr><tr><td>Logical Replication</td><td><code>H/A</code>, <code>Durability</code></td><td>Replication lag, cross-node consistency.</td></tr><tr><td>Replication - hot standby</td><td><code>H/A</code>, <code>Durability</code></td><td>Extended failover period, manual intervention, replication lag, non-deterministic data integrity.</td></tr><tr><td>Replication - bidirectional</td><td><code>H/A</code>, <code>Durability</code></td><td>Data consistency, slower writes, complex conflict resolution.</td></tr><tr><td>Backup</td><td><code>Durability</code></td><td>Manual recovery, lengthy downtime, potential data loss.</td></tr></tbody></table><blockquote><p>Database system designers attempt to address all of these risks to some degree, and have developed remarkable point solutions to specific risks. The question one faces, though, is: in totality, does my chosen system provide an adequate answer to these risks?</p></blockquote><p>Before moving on to how MongoDB addresses durability and high availability, a note about ACID. The ACID acronym characterizes the desirable system properties of Atomicity, Consistency, Isolation, and Durability. 
While durability is a focus of this article, the first three terms can use a bit of framing.</p><p>The properties concern a “transaction” - a much-used term which traditionally is interpreted as a form of write-bundling. In that interpretation, read and write operations within a single transaction would either succeed or fail as a unit. When successful, all writes within a transaction become visible “at once” and readers would see them as the new state of the data. But in the traditional interpretation of these terms, things were quite narrow. Atomicity was focused on a database process writing locally, on a single host. It did not govern how transactions would apply writes on a multi-host system or on systems with hot-standby replication. Consistency was designed to guarantee even less: when a multi-byte piece of data is written, it should appear to be written completely. For example, if a long string is written into a column in a single row, it would have to be completely written, and not in parts, as far as the users are concerned. This therefore had more to do with the integrity of multi-byte data not being interrupted or corrupted. No “half strings”, no long numbers partially written due to OS time-slicing or other parallel conditions. This did not in any way extend to writing the same data on two hosts. The idea of “eventual consistency” arose from the more collaborative clustering techniques modern systems (MongoDB among them) use to deliver a highly available database system. And finally, isolation: the property of a transaction spanning multiple logical operations (say, check a row condition, write data to multiple rows and tables, etc.). Again, this was designed as a property of a single-node database, and protects data within a transaction from parallel operations by other users at the same time. 
It’s the notion of a view of the data which is maintained by the system to support the atomicity: not to expose effects of transactions not yet committed to others, and not to be affected by neighboring transactions of others during the execution of our own transaction.</p><p>When more modern multi-node database systems came on the scene, attempts to fault them for not being “ACID compliant” imposed the notion that single-node ACID guarantees would magically extend to multi-node clusters. Indeed, distributed locking and other pessimistic concurrency mechanisms did emerge, but they were typically too slow on writes and are largely disfavored. The understanding of “eventual consistency” has thankfully evolved, and now we talk about applications being able to have consistent reads rather than insisting that replication magic would always ensure any piece of data written to the system is synchronized under lock. Without this understanding, it’s hard to discuss why and which databases fulfill ACID and consistency guarantees.</p><h2 id="MongoDB’s-Durability-Mechanism"><a href="#MongoDB’s-Durability-Mechanism" class="headerlink" title="MongoDB’s Durability Mechanism"></a>MongoDB’s Durability Mechanism</h2><p>The ability of data to survive unforeseen crashes and various failures in MongoDB relies on the internal journal mechanism, whereby writes are first applied to disk in raw form and later processed into the main storage blocks. This ensures that if the host crashes before data makes it to the main storage, the transactions are recovered from the journal upon restart, and applied before the database is open for any operation.</p><p>Once data is on disk - whether in journal or main storage form - it is considered durable. But this is just the beginning. 
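</p><p>The journal-first write path described above can be sketched with a toy model - illustrative JavaScript, not MongoDB’s actual storage engine. Acknowledged writes land in an append-only journal first; after a simulated crash, replaying that journal recovers them all:</p>

```javascript
// Toy model of write-ahead journaling (not MongoDB internals):
// writes are appended to a journal first; main storage is updated
// lazily. Crash recovery replays the journal before serving reads.
class JournaledStore {
  constructor() {
    this.journal = [];  // append-only log, assumed synced to disk per entry
    this.storage = {};  // main storage blocks, updated some time later
  }
  write(key, value) {
    this.journal.push({ key, value });  // durable once appended
  }
  applyJournal() {
    for (const { key, value } of this.journal) this.storage[key] = value;
  }
  static recover(journal) {
    // post-crash: replay the surviving journal before opening for business
    const store = new JournaledStore();
    store.journal = journal.slice();
    store.applyJournal();
    return store;
  }
}

const node = new JournaledStore();
node.write("a", 1);
node.write("b", 2);
// crash before main storage was updated: only the journal survives
const recovered = JournaledStore.recover(node.journal);
// recovered.storage now holds both writes: { a: 1, b: 2 }
```

<p>The real mechanism is far more involved, but the invariant is the same: once a write is in the journal, a crash cannot erase it.</p><p>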
In a replica set cluster, MongoDB further applies those writes to other members of the cluster, thereby protecting not only against a temporary crash, but also against complete, irrecoverable failure of a host (or even several hosts). Having the data distributed among several different hosts allows MongoDB to offer higher durability than any single-instance database system, especially when the cluster hosts reside in different geographical locations.</p><p>As for transactional isolation, MongoDB supports multi-document transactions (and has always supported single-document transaction isolation). You may read more about <a href="https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity">transactions in the official documentation</a>, especially atomicity and isolation.</p><p>The journaling mechanism allows customers to pick single-host and cross-host durability levels. At minimum, applications can send unacknowledged write operations. This is fast, but does not guarantee durability: in the face of an error, the application receives no guarantee that the operation was carried out. From there, write concerns of <code>j</code> or <code>w:majority</code> ensure that the writing application receives an acknowledgement only after data has been locally applied to the journal, and therefore would withstand a crash.</p><p>But as mentioned, MongoDB can extend the durability guarantee beyond a single host by using a Replica Set configuration - something traditional database systems struggle to provide.</p><h2 id="The-MongoDB-Replica-Set"><a href="#The-MongoDB-Replica-Set" class="headerlink" title="The MongoDB Replica Set"></a>The MongoDB Replica Set</h2><p>A replica set in MongoDB is an arrangement of three or more hosts, where one of the replica set’s members is writable at any given point in time, and the other members mirror the data from that writable member. 
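</p><p>The mirroring just described amounts to each member replaying the primary’s logical writes in order. A toy model (illustrative JavaScript, not MongoDB’s implementation) shows why ordered replay makes members converge:</p>

```javascript
// Toy model: a log of logical write operations, applied in order.
// Any node that replays the same operations in the same sequence
// ends up with the exact same state.
const opLog = [
  { op: "set", key: "name", value: "Ada" },
  { op: "set", key: "status", value: "customer" },
  { op: "unset", key: "status" },
];

function replay(ops) {
  const state = {};
  for (const { op, key, value } of ops) {
    if (op === "set") state[key] = value;
    if (op === "unset") delete state[key];
  }
  return state;
}

const primaryState = replay(opLog);
const secondaryState = replay(opLog);  // same ops, same order
// both states are { name: "Ada" } - the members converge
```

<p>This ordering guarantee is what lets a secondary be an exact mirror of the primary.</p><p>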
In case of failure of the primary member, a secondary node takes over the primary role quickly, automatically, and without manual intervention.</p><p>The data replication mechanism relies on a special linearized log of logical write operations called the OpLog. Every write received by MongoDB results in an entry in the OpLog as well as the journal. Operations in the OpLog are written in the sequence in which they were applied on that server. Therefore, any secondary member reading and applying them locally in the same order is guaranteed to end up with the exact same state the primary had when applying that write.</p><p>With this mechanism, MongoDB can guarantee a higher level of durability. A write which is sent to the system with a write concern of <code>w:majority</code> and is acknowledged by the primary guarantees that the data already resides on a majority of the replica set’s members, not just on one server. At that point, even a catastrophic, complete, and irrecoverable loss of the primary would not lose our data.</p><p>MongoDB employs a single writable node in a replica set, meaning that applications must connect and write to that primary. But applications do not have to implement monitoring or connection logic. All official drivers provide cluster discovery, connection, connection pooling, and routing out of the box. When a primary node fails, a secondary node becomes primary, and applications are largely oblivious to that fact. In cases where an in-flight write has been interrupted by a primary failure, applications receive an error and can retry their transaction once a new primary is available. The whole failover can take seconds. Traditional hot-standby cutovers and other recovery systems in RDBMS can take tens of seconds, minutes, or even require manual intervention. 
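</p><p>The retry-on-failover behavior described above can be sketched with a hypothetical wrapper - illustrative JavaScript, not the official drivers’ retryable-writes implementation:</p>

```javascript
// Hypothetical retry wrapper: an in-flight write interrupted by a
// failover surfaces an error; the application simply retries, and
// succeeds once a new primary has been elected.
function withRetry(operation, attempts) {
  let lastError;
  for (let left = attempts; left > 0; left -= 1) {
    try {
      return operation();
    } catch (err) {
      lastError = err;  // e.g. "no primary available" during an election
    }
  }
  throw lastError;
}

// Simulate a failover: the first attempt fails, the next one succeeds.
let newPrimaryElected = false;
function writeOnce(doc) {
  if (!newPrimaryElected) {
    newPrimaryElected = true;  // the election completes meanwhile
    throw new Error("no primary available");
  }
  return { acknowledged: true, doc };  // write lands on the new primary
}

const result = withRetry(() => writeOnce({ _id: 1 }), 3);
// result.acknowledged === true after a single retry
```

<p>In practice the official drivers do this for you: with retryable writes enabled, a transient election error is largely absorbed without any application code.</p><p>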
That level of high availability is much more difficult to achieve - let alone maintain - with such systems.</p><h2 id="Eventual-Consistency"><a href="#Eventual-Consistency" class="headerlink" title="Eventual Consistency"></a>Eventual Consistency</h2><p>At this point the concept of <em>Eventual Consistency</em> typically comes up. This entails the observation that at some point data may be written and visible at the primary, but not yet visible or even present at some secondary node. Luckily - by design, actually - MongoDB addresses this with what is called a <em>Read Concern</em>. A <em>Read Concern</em> is a mode that an application sends along with commands, instructing the replica set node to only consider certain data while processing. Specifying a <em>Read Concern</em> of <code>majority</code> ensures that MongoDB will reply to a query only with data which has already been acknowledged by the majority of nodes. This guarantees data is both durable and highly available. For data written with <em>Write Concern</em> <code>w:majority</code>, this reciprocal <em>Read Concern</em> provides a “read your own writes” behavior which generally satisfies consistency needs across the board. There are several <a href="https://www.mongodb.com/docs/manual/reference/read-concern/"><em>Read Concern</em></a> levels, and nuances regarding transactional isolation levels, but the general idea is: yes, you can have consistent reads with MongoDB. Surprisingly, not all NoSQL systems - let alone RDBMS - provide this level of consistency. Multi-primary (or bi-directional replication) databases don’t always allow for it. Hot-standby and storage-level replication arrangements struggle to provide it too. 
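</p><p>The reciprocal pairing of <code>w:majority</code> writes with <code>majority</code> reads can be sketched with a toy model (illustrative JavaScript, not the server’s implementation): a write is acknowledged only once a majority of members hold it, and a read returns only values a majority of members hold - so an acknowledged write is always readable:</p>

```javascript
// Toy model of w:majority writes paired with read concern "majority".
const nodes = [{}, {}, {}];                        // a three-member replica set
const majority = Math.floor(nodes.length / 2) + 1; // 2 of 3

// Replicate to `count` members; acknowledge only if that is a majority.
function writeMajority(key, value, count) {
  nodes.slice(0, count).forEach((node) => { node[key] = value; });
  return { acknowledged: count >= majority };
}

// Return only a value that a majority of members already hold.
function readMajority(key) {
  const seen = new Map();
  for (const node of nodes) {
    if (key in node) seen.set(node[key], (seen.get(node[key]) || 0) + 1);
  }
  for (const [value, count] of seen) {
    if (count >= majority) return value;  // majority-committed
  }
  return undefined;  // not yet majority-committed
}

const ack = writeMajority("x", 7, majority);
// ack.acknowledged === true, and readMajority("x") === 7:
// the writer reads its own majority-committed write
```

<p>The real semantics involve commit points and snapshots rather than counting, but the effect is the same: majority-acknowledged data survives failover and is visible to majority reads.</p><p>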
Some systems boast speed, yet fail to update indexes at the same time as the writes - causing further confusion when attempting to reason about consistency.</p><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><p>MongoDB’s Replica Set mechanism provides a combination of consistency and high availability not common in other distributed database systems. Further, this mechanism provides stronger durability and availability than any single-node database can. It therefore offers organizations an attractive database solution that can support the most demanding needs.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h1 id=&quot;The-MongoDB-Durability-and-High-Availability-Advantage&quot;&gt;&lt;a href=&quot;#The-MongoDB-Durability-and-High-Availability-Advantage&quot; class=&quot;hea</summary>
      
    
    
    
    
  </entry>
  
  <entry>
    <title>AI Query Generator Added to MongoDB Compass</title>
    <link href="https://www.plusnconsulting.com/post/AI-Query-Generator-Added-to-MongoDB-Compass/"/>
    <id>https://www.plusnconsulting.com/post/AI-Query-Generator-Added-to-MongoDB-Compass/</id>
    <published>2023-10-04T17:38:06.000Z</published>
    <updated>2023-10-05T22:50:52.348Z</updated>
    
    <content type="html"><![CDATA[<p>MongoDB’s <code>aggregate</code> command (and its predecessor <code>find</code>) take some time to get used to.</p><p>This is seen as a barrier to entry for those coming from a relational background as well as those who are early in the journey of MongoDB adoption.</p><p>With all the new AI and awesome natural language tools, can we get some help for generating queries? “Yes!” says the <a href="https://www.mongodb.com/products/tools/compass">Compass</a> team.</p><p>Announced at <a href="https://www.mongodb.com/press/mongo-db-announces-four-new-ai-powered-capabilities-to-improve-developer-productivity">the .Local event in London</a>, Compass is rolling out a natural language text prompt which translates your natural language input into a properly stated query using either the find syntax or an aggregation pipeline (MQL).</p><p><img src="/../images/compass-ai-generated-query.png" alt="Compass AI Assisted Query Generator" title="Compass AI Assisted Query Generator"></p><p>I ran some experiments to kick the tiers on this feature.</p><h2 id="Taking-a-Test-Run"><a href="#Taking-a-Test-Run" class="headerlink" title="Taking a Test Run"></a>Taking a Test Run</h2><p>The following are prompts and results I ran against a collection which has documents like the one below. Documents have different field values. The fields in the sub-document <code>contact</code> are sparse: a document may have one, two, or all of the fields <code>cell</code>, <code>email</code>, and <code>fax</code>. 
These fields - if missing - are absent (not assigned <em>null</em>).</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;_id&quot;</span><span class="punctuation">:</span> <span class="number">6</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Elmer&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;status&quot;</span><span class="punctuation">:</span> <span class="string">&quot;prospect&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;contact&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;cell&quot;</span><span class="punctuation">:</span> <span class="string">&quot;07969 773415&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;email&quot;</span><span class="punctuation">:</span> <span class="string">&quot;wupah@what.not&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;fax&quot;</span><span class="punctuation">:</span> <span class="string">&quot;01949 525019&quot;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><h3 id="Prompt"><a href="#Prompt" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>How How many Elmers are there?</p></blockquote><h2 id="Generated-Query"><a href="#Generated-Query" class="headerlink" title="Generated Query"></a>Generated Query</h2><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span 
class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      name<span class="punctuation">:</span> <span class="string">&quot;Elmer&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $group<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="literal"><span class="keyword">null</span></span><span class="punctuation">,</span></span><br><span class="line">      count<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        $sum<span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line">      <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>Success. The query generated answers the prompted text as intended. Note that the field <code>name</code> is nowhere in the prompt. The tool understood that we’re looking for Elmers (a proper name, plural) and figured it should look in the field <code>name</code>. 
Nice!</p><h3 id="Prompt-1"><a href="#Prompt-1" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>What are the 3 most common names of prospects with the status “customer” who have an email in the system?</p></blockquote><h3 id="Generated-Query-1"><a href="#Generated-Query-1" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      status<span class="punctuation">:</span> <span class="string">&quot;customer&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;contact.email&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span> $exists<span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span> <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $group<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="string">&quot;$name&quot;</span><span class="punctuation">,</span></span><br><span class="line">      count<span class="punctuation">:</span> <span class="punctuation">&#123;</span> $sum<span class="punctuation">:</span> <span class="number">1</span> <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span 
class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $sort<span class="punctuation">:</span> <span class="punctuation">&#123;</span> count<span class="punctuation">:</span> <span class="number">-1</span> <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $limit<span class="punctuation">:</span> <span class="number">3</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $project<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="number">0</span><span class="punctuation">,</span></span><br><span class="line">      name<span class="punctuation">:</span> <span class="string">&quot;$_id&quot;</span><span class="punctuation">,</span></span><br><span class="line">      count<span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>Correct. 
The AI understood the <em>AND</em> filter for documents with the <code>status</code> set to “customer” and <code>email</code> existing. It used the common trope for “top N by V”, which groups by the value of the field (<code>name</code>) and produces a count per grouping, then <code>$sort</code>s by the count descending, then <code>$limit</code>s to the intended number.</p><p>I noted that the generated queries use the old-school <code>&#123;$sum: 1&#125;</code> trick to produce the count. Understandable, since Compass may be used on older MongoDB versions that lacked the <code>$count</code> group accumulator operator, but this is a bit more cryptic for the untrained when reading the resultant pipeline.</p><p>Let’s try some more:</p><h3 id="Prompt-2"><a href="#Prompt-2" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>Which customers have and odd id number, and a name that starts with D?</p></blockquote><h3 id="Generated-Query-2"><a href="#Generated-Query-2" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="punctuation">&#123;</span> $mod<span class="punctuation">:</span> <span class="punctuation">[</span><span class="number">2</span><span class="punctuation">,</span> <span class="number">1</span><span class="punctuation">]</span> <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">      name<span class="punctuation">:</span> /^D/</span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span 
class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>Nicely interpreted. The <code>$mod</code> operator is a bit obscure, but fits the bill. The name is compared with a <em>regex</em> match on the expression <code>/^D/</code>. Left-anchored regex matches are the lesser evil of non-anchored matches, and can use indexes. But maybe a range expression such as <code>&#123;name: &#123;$gte: &quot;D&quot;, $lt: &quot;E&quot;&#125;&#125;</code> would perform just as well?</p><p>I decided to change one little thing in the query: change from an upper case D (which is by the way how my data is - names are capitalized in the dataset) to a lower case d.</p><h3 id="Prompt-3"><a href="#Prompt-3" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>Which customers have and odd id number, and a name that starts with d?</p></blockquote><h3 id="Generated-Query-3"><a href="#Generated-Query-3" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      $and<span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          _id<span class="punctuation">:</span> <span class="punctuation">&#123;</span> $mod<span class="punctuation">:</span> <span class="punctuation">[</span><span class="number">2</span><span class="punctuation">,</span> <span class="number">1</span><span class="punctuation">]</span> <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">        <span 
class="punctuation">&#123;</span></span><br><span class="line">          name<span class="punctuation">:</span> /^d/</span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $project<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      name<span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Interestingly, the tool decided to add a <code>$project</code> stage to the pipeline. Re-reading the prompt, I can see that I under specified which fields to return. But what caused the previous query to not <code>$project</code> at all and this one to do so is a mystery to me. Not strictly wrong, but a mystery nonetheless.</p><p>A fix to this is to explicitly specify what to show.The prompt <code>Which customers have and odd id number, and a name that starts with d? 
Give me their name labeled &quot;Customer Name&quot; and their id labeled &quot;Customer Id&quot; only.</code> indeed generated the expected pipeline.</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      $and<span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span> _id<span class="punctuation">:</span> <span class="punctuation">&#123;</span> $mod<span class="punctuation">:</span> <span class="punctuation">[</span><span class="number">2</span><span class="punctuation">,</span> <span class="number">1</span><span class="punctuation">]</span> <span class="punctuation">&#125;</span> <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="punctuation">&#123;</span> name<span class="punctuation">:</span> /^d/i <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $project<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Customer Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;$name&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Customer Id&quot;</span><span class="punctuation">:</span> <span 
class="string">&quot;$_id&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>The projection renames the fields as the prompt described.</p><p>My standing conclusion is to be as explicit and concise as possible with the prompt. I guess this should be expected.</p><h3 id="Prompt-4"><a href="#Prompt-4" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>How likely are people to have both a cell phone and a fax number?</p></blockquote><h3 id="Generated-Query-4"><a href="#Generated-Query-4" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;contact.cell&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        $exists<span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">      <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;contact.fax&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        $exists<span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">      <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span 
class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $group<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="literal"><span class="keyword">null</span></span><span class="punctuation">,</span></span><br><span class="line">      count<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        $sum<span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line">      <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>Chalk this one to “not correct”. The prompt intent was to generate a ratio or number representing the <em>chance</em> or <em>likelihood</em> of a document to contain both fields. The generator understood the <em>and</em> condition correctly in the <code>$match</code> stage. But it failed to produce an aggregation which compares the count of documents matching the criteria vs. those which do not.</p><h2 id="Non-English"><a href="#Non-English" class="headerlink" title="Non-English"></a>Non-English</h2><p>Many people speak English as a second language. But many do not. How about typing in your native tongue?</p><p>I gave it a whirl with Hebrew. 
Not remotely a popular language, so I can’t imagine the team built specific support for it.</p><p>This query asks to find those who are named “<code>יוסי</code>”.</p><h3 id="Prompt-5"><a href="#Prompt-5" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>מצא לי מי ששמם יוסי</p></blockquote><h3 id="Generated-Query-5"><a href="#Generated-Query-5" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;יוסי&quot;</span><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>The generator correctly figured out the pluralized word for “name” in my query, and determined the field to search against is <code>name</code>. It understood the value sought was “<code>יוסי</code>”, and kept the text as is - again, correct. It did not try to translate the value into English (Yossi), but that’s a good thing. I would expect that if I knew the name was in English, I’d maybe type a hybrid question such as</p><h3 id="Prompt-6"><a href="#Prompt-6" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>מצא לי מי ששמם Elmer</p></blockquote><h3 id="Generated-Query-6"><a href="#Generated-Query-6" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Elmer&quot;</span><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>Compass did the needful and produced my expected query. 
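</p><p>For readers following along without Compass handy, here is a minimal plain-JavaScript sketch of what the generated exact-match filter selects. The sample documents are made up for illustration; only the filter shape comes from the generated query:</p>

```javascript
// Hypothetical sample documents shaped like the collection in this post.
const customers = [
  { _id: 1, name: "Elmer" },
  { _id: 2, name: "יוסי" },
  { _id: 3, name: "Dana" },
];

// The generated filter {"name": "Elmer"} is a plain equality match:
// case-sensitive, with no translation or transliteration applied.
const matches = customers.filter((doc) => doc.name === "Elmer");

console.log(matches); // → [ { _id: 1, name: 'Elmer' } ]
```

<p>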
A note here is that BiDi editing isn’t perfect… Hebrew is written right to left, and the textbox moved the word “Elmer” to the very “end” of the sentence from an English perspective, but a native Hebrew speaker would expect it to be on the very left, at the end from a right-to-left perspective. But the results were correct, and I won’t argue with “it worked.”</p><p>How about some aggregation? The Hebrew query asks how many people have an email but not a cell phone.</p><h3 id="Prompt-7"><a href="#Prompt-7" class="headerlink" title="Prompt"></a>Prompt</h3><blockquote><p>לכמה אנשים יש דואל אבל אין סלולרי</p></blockquote><h3 id="Generated-Query-7"><a href="#Generated-Query-7" class="headerlink" title="Generated Query"></a>Generated Query</h3><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">[</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $match<span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;contact.cell&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span> $exists<span class="punctuation">:</span> <span class="literal"><span class="keyword">false</span></span> <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;contact.email&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span> $exists<span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span> <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">&#123;</span></span><br><span class="line">    $group<span class="punctuation">:</span> <span 
class="punctuation">&#123;</span></span><br><span class="line">      _id<span class="punctuation">:</span> <span class="literal"><span class="keyword">null</span></span><span class="punctuation">,</span></span><br><span class="line">      count<span class="punctuation">:</span> <span class="punctuation">&#123;</span> $sum<span class="punctuation">:</span> <span class="number">1</span> <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">]</span></span><br></pre></td></tr></table></figure><p>Correct, again. The Hebrew prompt used the acronym term for “email” (דואל) and the common word for “mobile phone” or “cell phone” which is “cellular” (סלולרי). Not only did it match the terms to the fields <code>cell</code> and <code>email</code>, it also figured out correctly that the fields are in the sub-document <code>contact</code> and generated the dot-path to the fields.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>I have reservations about using natural language to describe queries. For one, spoken language was not designed to be concise, and the range of misinterpretation that can be cause by using imprecise language as well as ambiguities within the language itself can yield wild results. I was surprised how well this new feature performed.  Color me impressed.Not only has the generator created my intended queries, it picked up on nuances that I did not imagine it would.</p><p>Can there be test cases where it comes up with the wrong answer? Yeah, sure. But id performed remarkably well on what I gave it, and in doing so proved it is suitable for a wide enough audience for a wide enough set of reporting and analysis needs.</p><p>Am I going to use this daily? Time will tell. For simple quests? Maybe. For more complex ones? 
Maybe also - I can see it creating a skeleton query for me to approve or refine. Does this replace learning and knowing the breadth and depth of MQL? No. But for newcomers and occasional analysis this may just do the trick.</p><p>Give it a try, see what you think!</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;MongoDB’s &lt;code&gt;aggregate&lt;/code&gt; command (and its predecessor &lt;code&gt;find&lt;/code&gt;) take some time to get used to.&lt;/p&gt;
&lt;p&gt;This is seen as a </summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="MongoDB Compass" scheme="https://www.plusnconsulting.com/tags/MongoDB-Compass/"/>
    
    <category term="Aggregation" scheme="https://www.plusnconsulting.com/tags/Aggregation/"/>
    
    <category term="Query" scheme="https://www.plusnconsulting.com/tags/Query/"/>
    
    <category term="MQL" scheme="https://www.plusnconsulting.com/tags/MQL/"/>
    
    <category term="AI" scheme="https://www.plusnconsulting.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>Compound Wildcard Indexes in MongoDB 7</title>
    <link href="https://www.plusnconsulting.com/post/Compound-Wildcard-Indexes-in-MongoDB-7/"/>
    <id>https://www.plusnconsulting.com/post/Compound-Wildcard-Indexes-in-MongoDB-7/</id>
    <published>2023-09-20T16:31:24.000Z</published>
    <updated>2023-09-20T16:48:24.154Z</updated>
    
    <content type="html"><![CDATA[<h2 id="What"><a href="#What" class="headerlink" title="What"></a>What</h2><p>Wildcard indexes have improved in several ways in MongoDB 7.0.</p><p>Notably, wildcard indexes can now be compound, including additional non-wildcard fields.</p><p>Previously, wildcard indexes only acted as a single-field index on each individual target field, stored in a single index which included those fields.</p><p>Newly supported is the ability to include both wildcard fields and named individual fields in the same index.</p><p>This capability supports common use cases such as the <code>attribute-pattern</code> document design. In such cases, a sub-document contains a set of keys, but the presence of the keys varies from document to document.</p><p>Consider the case of the following documents describing prospect records in a CRM system:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123; <span class="attr">_id</span>: <span class="number">35</span>, <span class="attr">name</span>: <span class="string">&#x27;Francis&#x27;</span>, <span class="attr">status</span>: <span class="string">&#x27;customer&#x27;</span>, <span class="attr">contact</span>: &#123;&#125; &#125;</span><br><span class="line"></span><br><span class="line">&#123; <span class="attr">_id</span>: <span class="number">36</span>, <span class="attr">name</span>: <span class="string">&#x27;Connor&#x27;</span>,  <span class="attr">status</span>: <span class="string">&#x27;customer&#x27;</span>, <span class="attr">contact</span>: &#123; <span class="attr">cell</span>: <span class="string">&#x27;07787 304928&#x27;</span>, <span class="attr">fax</span>: <span class="string">&#x27;0191 558 5860&#x27;</span> &#125;&#125;</span><br><span class="line"></span><br><span class="line">&#123; <span class="attr">_id</span>: <span class="number">1</span>,  <span class="attr">name</span>: <span class="string">&#x27;Ethan&#x27;</span>,   <span 
class="attr">status</span>: <span class="string">&#x27;customer&#x27;</span>, <span class="attr">contact</span>: &#123; <span class="attr">cell</span>: <span class="string">&#x27;07943 819337&#x27;</span> &#125;&#125;</span><br></pre></td></tr></table></figure><ol><li>Francis has no contact fields at all - just a name.</li><li>Connor has both a <code>cell</code> and a <code>fax</code>.</li><li>Ethan has only a <code>cell</code> phone number.</li></ol><p>A wildcard index on <code>contact.$**</code> would allow a single-field match against one of the contact fields.</p><p>But query filtering on the <code>status</code> field <em>and</em> one of the wildcard fields would not be supported without a compound wildcard index.</p><h2 id="Using-a-Compound-Wildcard-Index"><a href="#Using-a-Compound-Wildcard-Index" class="headerlink" title="Using a Compound Wildcard Index"></a>Using a Compound Wildcard Index</h2><p>First, create an index by supplying both a wildcard expression and one or more plain (non-wildcard) fields.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">prospect</span>.<span class="title function_">createIndex</span>(&#123; <span class="string">&quot;contact.$**&quot;</span>: <span class="number">1</span>, <span class="attr">status</span>:<span class="number">1</span> &#125;, &#123;<span class="attr">name</span>: <span class="string">&quot;idx_compound_wild&quot;</span>&#125;)</span><br></pre></td></tr></table></figure><p>As with other compound indexes, the key structure supports key prefixes that include the first, or first and second fields.</p><p>This query leverages the compound index:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">prospect</span>.<span class="title function_">find</span>(&#123;<span class="string">&#x27;contact.cell&#x27;</span>:<span class="string">&#x27;07408 926850&#x27;</span>, <span 
class="attr">status</span>:<span class="string">&#x27;customer&#x27;</span>&#125;)</span><br></pre></td></tr></table></figure><p>The query includes a full prefix of the index: a single wildcard field <code>contact.cell</code>, and the non-wildcard field <code>status</code>.</p><p>Running <code>.explain(true)</code> on this query shows a winning plan that states our index named <em>idx_compound_wild</em> was used. The <code>indexBounds</code> entry shows that both the <code>contact.cell</code> and <code>status</code> fields were considered when processing the query against the index.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line"><span class="string">&quot;winningPlan&quot;</span>: &#123;</span><br><span class="line">      <span class="string">&quot;queryPlan&quot;</span>: &#123;</span><br><span class="line">        <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;FETCH&quot;</span>,</span><br><span class="line">        <span class="string">&quot;planNodeId&quot;</span>: <span class="number">2</span>,</span><br><span class="line">        <span class="string">&quot;inputStage&quot;</span>: &#123;</span><br><span class="line">          <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;IXSCAN&quot;</span>,</span><br><span class="line">          <span class="string">&quot;planNodeId&quot;</span>: <span class="number">1</span>,</span><br><span class="line">          <span class="string">&quot;keyPattern&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;contact.cell&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: <span class="number">1</span></span><br><span class="line">          &#125;,</span><br><span class="line">          <span 
class="string">&quot;indexName&quot;</span>: <span class="string">&quot;idx_compound_wild&quot;</span>,</span><br><span class="line">          <span class="string">&quot;isMultiKey&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;multiKeyPaths&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;contact.cell&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: []</span><br><span class="line">          &#125;,</span><br><span class="line">          <span class="string">&quot;isUnique&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;isSparse&quot;</span>: <span class="literal">true</span>,</span><br><span class="line">          <span class="string">&quot;isPartial&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;indexVersion&quot;</span>: <span class="number">2</span>,</span><br><span class="line">          <span class="string">&quot;direction&quot;</span>: <span class="string">&quot;forward&quot;</span>,</span><br><span class="line">          <span class="string">&quot;indexBounds&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;contact.cell\&quot;, \&quot;contact.cell\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            <span class="string">&quot;contact.cell&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;07408 926850\&quot;, \&quot;07408 926850\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            
<span class="string">&quot;status&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;customer\&quot;, \&quot;customer\&quot;]&quot;</span></span><br><span class="line">            ]</span><br><span class="line">          &#125;</span><br><span class="line">        &#125;</span><br></pre></td></tr></table></figure><h3 id="No-Index-Intersection"><a href="#No-Index-Intersection" class="headerlink" title="No Index Intersection"></a>No Index Intersection</h3><p>Index intersection is a strategy where more than one indexed fields are stated in a filter, and the result is computed by perforing a set intersection on the keys of 2 indexes.</p><p>Intersection seems to make sense when two separate simple indexes are at play on two separate fields. What about wildcard index? Will the mongo query planner choose to “intersect” the same wildcard index with itself somehow? Do we expect it will be smart about choosing to scan for two values, one found in a certain path <code>contact.cell</code>, then the second found in a different path <code>contact.fax</code> - both of which are contained with the same index? 
The short answer: <strong>no!</strong></p><p>Consider the following query:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">prospect</span>.<span class="title function_">find</span>(&#123; <span class="string">&#x27;contact.email&#x27;</span>: <span class="string">&#x27;madtiwvab@what.not&#x27;</span>, <span class="string">&#x27;contact.fax&#x27;</span>: <span class="string">&#x27;0121 062 9173&#x27;</span>&#125;)</span><br></pre></td></tr></table></figure><p>The query filters on two separate <em>wildcard fields</em> <code>email</code> and <code>fax</code>.</p><p>The query planner shows the following output:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line"><span class="string">&quot;winningPlan&quot;</span>: &#123;</span><br><span class="line">      <span class="string">&quot;queryPlan&quot;</span>: &#123;</span><br><span class="line">        <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;FETCH&quot;</span>,</span><br><span class="line">        <span class="string">&quot;planNodeId&quot;</span>: <span class="number">2</span>,</span><br><span class="line">        <span class="string">&quot;filter&quot;</span>: &#123;</span><br><span class="line">          <span class="string">&quot;contact.email&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$eq&quot;</span>: <span class="string">&quot;madtiwvab@what.not&quot;</span></span><br><span class="line">          &#125;</span><br><span class="line">        &#125;,</span><br><span class="line">        <span class="string">&quot;inputStage&quot;</span>: &#123;</span><br><span class="line">          <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;IXSCAN&quot;</span>,</span><br><span class="line">          <span class="string">&quot;planNodeId&quot;</span>: <span class="number">1</span>,</span><br><span class="line">         
 <span class="string">&quot;keyPattern&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;contact.fax&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: <span class="number">1</span></span><br><span class="line">          &#125;,</span><br><span class="line">          <span class="string">&quot;indexName&quot;</span>: <span class="string">&quot;idx_compound_wild&quot;</span>,</span><br><span class="line">          <span class="string">&quot;isMultiKey&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;multiKeyPaths&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;contact.fax&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: []</span><br><span class="line">          &#125;,</span><br><span class="line">          <span class="string">&quot;isUnique&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;isSparse&quot;</span>: <span class="literal">true</span>,</span><br><span class="line">          <span class="string">&quot;isPartial&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;indexVersion&quot;</span>: <span class="number">2</span>,</span><br><span class="line">          <span class="string">&quot;direction&quot;</span>: <span class="string">&quot;forward&quot;</span>,</span><br><span class="line">          <span class="string">&quot;indexBounds&quot;</span>: &#123;</span><br><span class="line">            <span 
class="string">&quot;$_path&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;contact.fax\&quot;, \&quot;contact.fax\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            <span class="string">&quot;contact.fax&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;0121 062 9173\&quot;, \&quot;0121 062 9173\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[MinKey, MaxKey]&quot;</span></span><br><span class="line">            ]</span><br><span class="line">          &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;...</span><br></pre></td></tr></table></figure><p>The above is a bit awkward. The <code>filter</code> term mentions the <code>contact.email</code> field alone. The <code>indexBounds</code> field mentions the <code>contact.fax</code> field only. Though one might intuit that the strategy is to us the index to scan the index <code>contact.fax</code> first, then filter the entries in the index by <code>contact.email</code>, it is neither clear that this is the case nor expected. 
One would expect that the index contains the list of documents under either of the keys, and therefore an AND query would hit the index key structure twice and do some nested loop join or something similar.</p><p>Let’s try specifying three of the wildcard fields in the query, and see if things become any clearer:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">prospect</span>.<span class="title function_">find</span>(&#123;</span><br><span class="line">  <span class="string">&#x27;contact.cell&#x27;</span>: <span class="string">&#x27;07404 190465&#x27;</span>,</span><br><span class="line">  <span class="string">&#x27;contact.email&#x27;</span>: <span class="string">&#x27;kagaje@what.not&#x27;</span>,</span><br><span class="line">  <span class="string">&#x27;contact.fax&#x27;</span>: <span class="string">&#x27;01995 364874&#x27;</span></span><br><span class="line">&#125;).<span class="title function_">explain</span>(<span class="string">&quot;queryPlanner&quot;</span>)</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The plan shown below seems to imply the same general strategy. 
The index bounds only mention one of the three query terms, then the filter mentions the other two.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line"><span class="string">&quot;winningPlan&quot;</span>: &#123;</span><br><span class="line">      <span class="string">&quot;queryPlan&quot;</span>: &#123;</span><br><span class="line">        <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;FETCH&quot;</span>,</span><br><span class="line">        <span class="string">&quot;planNodeId&quot;</span>: <span class="number">2</span>,</span><br><span class="line">        <span class="string">&quot;filter&quot;</span>: &#123;</span><br><span class="line">          <span class="string">&quot;$and&quot;</span>: [</span><br><span class="line">            &#123;</span><br><span class="line">              <span class="string">&quot;contact.cell&quot;</span>: &#123;</span><br><span class="line">                <span class="string">&quot;$eq&quot;</span>: <span class="string">&quot;07404 190465&quot;</span></span><br><span class="line">              &#125;</span><br><span class="line">            &#125;,</span><br><span class="line">            &#123;</span><br><span class="line">              <span class="string">&quot;contact.fax&quot;</span>: &#123;</span><br><span class="line">                <span class="string">&quot;$eq&quot;</span>: <span class="string">&quot;01995 364874&quot;</span></span><br><span class="line">              &#125;</span><br><span class="line">            &#125;</span><br><span class="line">          ]</span><br><span class="line">        &#125;,</span><br><span class="line">        <span class="string">&quot;inputStage&quot;</span>: &#123;</span><br><span class="line">          <span class="string">&quot;stage&quot;</span>: <span class="string">&quot;IXSCAN&quot;</span>,</span><br><span class="line">          <span class="string">&quot;planNodeId&quot;</span>: <span 
class="number">1</span>,</span><br><span class="line">          <span class="string">&quot;keyPattern&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;contact.email&quot;</span>: <span class="number">1</span>,</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: <span class="number">1</span></span><br><span class="line">          &#125;,</span><br><span class="line">          <span class="string">&quot;indexName&quot;</span>: <span class="string">&quot;idx_compound_wild&quot;</span>,</span><br><span class="line">          <span class="string">&quot;isMultiKey&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;multiKeyPaths&quot;</span>: &#123;</span><br><span class="line">            <span class="string">&quot;$_path&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;contact.email&quot;</span>: [],</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: []</span><br><span class="line">          &#125;,</span><br><span class="line">          <span class="string">&quot;isUnique&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;isSparse&quot;</span>: <span class="literal">true</span>,</span><br><span class="line">          <span class="string">&quot;isPartial&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">          <span class="string">&quot;indexVersion&quot;</span>: <span class="number">2</span>,</span><br><span class="line">          <span class="string">&quot;direction&quot;</span>: <span class="string">&quot;forward&quot;</span>,</span><br><span class="line">          <span class="string">&quot;indexBounds&quot;</span>: &#123;</span><br><span 
class="line">            <span class="string">&quot;$_path&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;contact.email\&quot;, \&quot;contact.email\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            <span class="string">&quot;contact.email&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[\&quot;kagaje@what.not\&quot;, \&quot;kagaje@what.not\&quot;]&quot;</span></span><br><span class="line">            ],</span><br><span class="line">            <span class="string">&quot;status&quot;</span>: [</span><br><span class="line">              <span class="string">&quot;[MinKey, MaxKey]&quot;</span></span><br><span class="line">            ]</span><br><span class="line">          &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;,...  </span><br></pre></td></tr></table></figure><p>Well, the plan is not lying. MongoDB does indeed choose only one of the query terms to serve as the value sought in the index. It evaluates the selectivity of the terms in the query, and picks one it deems most suitable.</p><p>This still doesn’t answer the question <em>why</em> it doesn’t compare the other terms against the index either. Which brings up the topic of <strong>index intersection</strong>. MongoDB introduced an implementation of index intersection at some point, then restricted its use by default to a degree that pretty much prevents intersection from ever being exercised. Turns out that the performance gains were rather low for the intersection candidates as compared to other strategies. Worse: the planner would often pick intersection in cases where other strategies would have performed way better. 
In other words: Index intersection turned out to be the wrong answer too often, and would actually <em>hurt</em> performance.</p><p>But this is a <em>single index</em>, you say - wouldn’t intersection make sense in this particular wildcard scenario? The way I like to think about it, a wildcard index is actually several indexes in a trench-coat pretending to be a single index. Each wildcard field value is processed into its own index structure, but all of those structures are stored together as one. A hint to this notion is that the planner outputs the <code>$_path</code> term, stating which wildcard field within the index is targeted.</p><p>The addition of non-wildcard fields to the index then shapes the key prefixes to contain the additional non-wildcard value, so that the index is more selective and the (theoretical) need for intersection is reduced.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>Compound wildcard indexes are a welcome addition to MongoDB 7.0. Having no intersection mechanism doomed many queries to only partially use indexes and resort to loading documents based on a single field, then filtering them further. With this feature, the number of documents loaded can be reduced. Reducing I&#x2F;O should help many workloads. Whether it does for your workload is subject to - as always - experimentation and monitoring. Give it a try, hope it helps!</p><blockquote><p>For limitations and further details, see <a href="https://www.mongodb.com/docs/v7.0/core/indexes/index-types/index-wildcard/index-wildcard-compound/">Compound Wildcard Indexes</a></p></blockquote>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;What&quot;&gt;&lt;a href=&quot;#What&quot; class=&quot;headerlink&quot; title=&quot;What&quot;&gt;&lt;/a&gt;What&lt;/h2&gt;&lt;p&gt;Wildcard indexes have improved in several ways in MongoDB 7.0.</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Performance" scheme="https://www.plusnconsulting.com/tags/Performance/"/>
    
    <category term="Query Planner" scheme="https://www.plusnconsulting.com/tags/Query-Planner/"/>
    
  </entry>
  
  <entry>
    <title>Increase Security by Limiting Access to Atlas</title>
    <link href="https://www.plusnconsulting.com/post/Increase-Security-by-Limiting-Access-to-Atlas/"/>
    <id>https://www.plusnconsulting.com/post/Increase-Security-by-Limiting-Access-to-Atlas/</id>
    <published>2023-03-16T19:20:20.000Z</published>
    <updated>2023-03-16T22:32:37.371Z</updated>
    
<content type="html"><![CDATA[<p>Security is a huge topic of immense importance. Ironically, it’s all too common to sacrifice security for the sake of convenience.</p><p>This is often the case when it comes to network access security: we are quick to open up network access “just for now”, but fail to close things down later. This increases the exposure surface into the protected resource (a MongoDB Atlas cluster, for example) indefinitely. But it doesn’t have to be that way.</p><blockquote><p><strong>TL;DR</strong>: Use expiring network access permissions for IPs instead of permanent ones.</p></blockquote><h2 id="How-to-Make-it-Better"><a href="#How-to-Make-it-Better" class="headerlink" title="How to Make it Better"></a>How to Make it Better</h2><p>For a while now, Atlas has had a simple measure that helps clean up temporary access grants. Instead of the network access permission lasting until you explicitly delete it, you can set it to expire automatically and get removed.</p><p>Flip the toggle at the bottom left to enforce a time limit on the access grant. Pick a time frame: 6 hours, one day, or a week - the shorter the better. Atlas will automatically clean up the entry and close the IP access after that duration elapses.</p><p><img src="/../images/atlas-network-access-temporary-choices.png" alt="Add IP Access List Entry" title="Add IP Access Entry"></p><p>Some might ask “But what if I want to keep it open longer?”</p><p>Here’s my cheat-sheet:</p><table><thead><tr><th>IP Address Space</th><th>Access Type</th><th>Auto-Expire?</th></tr></thead><tbody><tr><td>A. Private (10. …, 192.168. …)</td><td>Production Application</td><td>No</td></tr><tr><td>B. Private (10. …, 192.168. …)</td><td>Interactive</td><td>Yes!</td></tr><tr><td>C. Public IP &#x2F; Block</td><td>Interactive</td><td>Yes!</td></tr><tr><td>D. 
Public IP &#x2F; Block</td><td>Production Application</td><td>Yes!</td></tr></tbody></table><p>This cheat sheet is of the paranoid opinion “Don’t allow public IP access to Atlas.” But paranoia is good for security.</p><p>Here are some supporting thoughts. They all share the following core belief:</p><blockquote><p>Sacrifice convenience to gain security</p></blockquote><p>With that in mind, let’s talk about interactive vs. application access.</p><h3 id="Interactive-Users"><a href="#Interactive-Users" class="headerlink" title="Interactive Users"></a>Interactive Users</h3><p>An interactive user is a human running MongoDB Compass, or the shell, or a program during development.</p><p>This is the danger zone. “I just need to do my job <strong>now</strong>” motivates allowing such traffic. But tomorrow is not <strong>now</strong> anymore… better expire it. If you are the grantor (you have an Atlas role allowing you to add the network), then you can just as well grant yourself permissions again after expiration. Sacrifice convenience.</p><p>If the interactive user does not have access-control privileges, the nuisance of provisioning access is higher. But then the question arises: why is access to Atlas done over the public IP?</p><ol><li>No peering? Ask to set it up.</li><li>No private-link? Ask to set it up.</li><li>Atlas in different zone? Ask to set up peering.</li><li>Private IP block? Expire anyway.</li></ol><p>Why expire an interactive IP anyway? Well, because it provides another surface for would-be attackers. Interactive users have potential OS exposure to compromise via channels such as email, malware, random downloads, and just plain old neglect.</p><h3 id="For-Applications"><a href="#For-Applications" class="headerlink" title="For Applications"></a>For Applications</h3><p>An application is a program or process running unattended.</p><p>Seek to route traffic via either peering or a private link. 
This would turn the traffic into non-public traffic, and it would therefore fall under use case A in the cheat sheet above.</p><h3 id="Auto-Expire-in-Other-Ways"><a href="#Auto-Expire-in-Other-Ways" class="headerlink" title="Auto-Expire in Other Ways"></a>Auto-Expire in Other Ways</h3><p>Alternatives to built-in auto-expiration have existed for a while. Shun the “I’m personally responsible so it’s ok for me to have permanent access” argument.</p><p>If the DBaaS or DB doesn’t auto-expire, a good sys-admin or DevOps practice would automate tearing down old entries. Still valid: the Atlas management API allows creating and removing network access entries, so that remains a viable option.</p><p>Manual “expiration” entails manual auditing, and manual removal of such entries. This is both error-prone and less reliable in general. But if that is all your organization is willing to do for security - so be it. Chalk that one up for “acceptable within parameters”.</p><h3 id="Permanent-Access"><a href="#Permanent-Access" class="headerlink" title="Permanent Access"></a>Permanent Access</h3><p>As you can surmise by now, the approach in this post strongly shuns permanence, especially for public IP space.</p><p>What about permanence within private space? Those should also be expired! But in this case, automatic expiration is a bit trickier. A production application losing connectivity is a risk not many would take. There can be arrangements that either audit and expire connections with no observed traffic, or that renew &#x2F; re-provision access on a cadence as long as some sign of life is detected. But such automation is a chance for error, with higher risk to the availability of the production application. So let practicality win here: revert to manual auditing of such entries. Yes: they are internal traffic, but still another open vector.</p><p>Bottom line: Your data is priceless and the consequences are high - keep it as safe as possible. Sacrifice convenience to gain security.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;Security is a huge topic of immense importance. Ironically, it’s all too common to sacrifice security for the sake of convenience.&lt;/p&gt;
&lt;p</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Atlas" scheme="https://www.plusnconsulting.com/tags/Atlas/"/>
    
    <category term="Security" scheme="https://www.plusnconsulting.com/tags/Security/"/>
    
  </entry>
  
  <entry>
    <title>Common Rollup $unionWith and $merge</title>
    <link href="https://www.plusnconsulting.com/post/Common-Rollup-unionWith-and-merge/"/>
    <id>https://www.plusnconsulting.com/post/Common-Rollup-unionWith-and-merge/</id>
    <published>2021-08-19T17:17:38.000Z</published>
    <updated>2021-08-20T00:58:42.194Z</updated>
    
<content type="html"><![CDATA[<h2 id="TL-DR"><a href="#TL-DR" class="headerlink" title="TL;DR"></a>TL;DR</h2><p>Schedule periodic rollup and aggregation on historic data using <code>$merge</code> into a collection, then run ad-hoc reports on cumulative live data for the current period, <code>$unionWith</code> historic data.</p><h2 id="WUT"><a href="#WUT" class="headerlink" title="WUT???"></a>WUT???</h2><p>We often need to produce reports summarizing sales over time which include both full historic periods and the current partial period, such as “sales this year to date”. While it’s perfectly fine to run ad hoc queries on the data set, this can become a performance issue when report frequency increases, data sets become extremely large, or the periods are very long (e.g. all sales to date since inception).</p><p>It’s been common practice to pre-aggregate historic data and persist it to some collection. This lets us query only the most recent yet-to-be-rolled-up period, and add it up to historic data queried from the other cumulative collection. Trivial concept, nothing terribly new here. So where’s the catch? Details (it’s always the details…):</p><pre class="mermaid">sequenceDiagram    autonumber    participant a as Query    participant ms as [monthlySales]    participant s as [Sales]    a->>+ms: January - July?    ms-->>-a: <<precomputed>>    a->>+s: Current Month?    s-->>-a: <<month-to-date>>    a-->> a: Union</pre><p>The implementation of this technique relies on 2 high-level operations:</p><ol><li>A periodic calculation to populate the historic data. </li><li>An ad-hoc query over both historic and current data.</li></ol><h2 id="Historic-Data"><a href="#Historic-Data" class="headerlink" title="Historic Data"></a>Historic Data</h2><p>The technique used with MongoDB to persist historic data used to be <code>mapReduce</code> in the old days. Very old days. Using <code>mapReduce</code> allowed you to persist the results into a collection all server-side. 
Before or without that, you’d need to have client code computing a result-set, then turning around and writing it explicitly back into MongoDB.</p><p>With the introduction of the Aggregation Framework, more efficient query operators were introduced, as well as the <code>$out</code> pipeline operator. The issue with that, though, was that $out did not play very well with sharded clusters, and did not allow for incremental data insertion. Instead, $out wiped the output collection clean and replaced its content with the new data. This implied that for a long history, you would have to compute again over a growing number of documents - not very efficient, and it exacerbates the problem we were trying to avoid in the first place. You could of course run a query just across a year, say, and store it into a yearly collection. But then the problem of summing up across years would require more and more collections to be queried independently by the client, and you’d end up with many more collections just to handle this case. Not elegant.</p><p>Now with <code>$merge</code> the story becomes manageable again. <code>$merge</code> lets you perform incremental updates to an existing collection, so your historic data can largely remain untouched, save for the newest addition or update to the latest period. 
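To make the difference concrete, here is a plain-JavaScript sketch that simulates the two behaviors client-side (illustrative only - this is not how the server implements either stage, and the function names are made up):

```javascript
// Simulate a collection as a Map keyed by _id (illustrative only).
// $out semantics: the target is wiped and replaced with the new results.
function outStage(target, results) {
  return new Map(results.map(doc => [doc._id, doc]));
}

// $merge (default) semantics: existing documents stay; new results upsert by _id.
function mergeStage(target, results) {
  const merged = new Map(target);
  for (const doc of results) merged.set(doc._id, doc);
  return merged;
}

const history = new Map([["2021-06", { _id: "2021-06", total: 100 }]]);
const julyRollup = [{ _id: "2021-07", total: 120 }];

outStage(history, julyRollup).has("2021-06");   // false - June was wiped
mergeStage(history, julyRollup).has("2021-06"); // true - June preserved, July added
```

The upsert-by-key behavior is exactly what makes incremental rollups cheap: each run only touches the latest period.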
</p><p>Collection <strong>sales</strong> contains document-per-sale with a field <strong>total</strong> and <strong>date</strong> in the example below:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123; <span class="string">&quot;date&quot;</span> : <span class="title class_">ISODate</span>(<span class="string">&quot;2021-08-10T23:54:01.028Z&quot;</span>), <span class="string">&quot;total&quot;</span> : <span class="title class_">NumberDecimal</span>(<span class="string">&quot;178.4349&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;date&quot;</span> : <span class="title class_">ISODate</span>(<span class="string">&quot;2021-08-10T23:48:37.671Z&quot;</span>), <span class="string">&quot;total&quot;</span> : <span class="title class_">NumberDecimal</span>(<span class="string">&quot;17.6805&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;date&quot;</span> : <span class="title class_">ISODate</span>(<span class="string">&quot;2021-08-10T23:45:37.238Z&quot;</span>), <span class="string">&quot;total&quot;</span> : <span class="title class_">NumberDecimal</span>(<span class="string">&quot;135.4978&quot;</span>) &#125;</span><br><span class="line"><span class="comment">// etc.</span></span><br></pre></td></tr></table></figure><p>To produce sum of sales, count of sale numbers across August, we just run an aggregation, producing the result into the <strong>monthlySales</strong> collection</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sales</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;<span class="attr">$match</span>: &#123; <span class="attr">date</span>: &#123;<span class="attr">$gte</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2021-08-01&#x27;</span>), <span class="attr">$lt</span>: <span class="title 
class_">ISODate</span>(<span class="string">&#x27;2021-09-01&#x27;</span>)&#125;&#125;&#125;,</span><br><span class="line"></span><br><span class="line">    &#123;<span class="attr">$group</span>: &#123; <span class="attr">_id</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2021-08-01&#x27;</span>), <span class="attr">total</span>: &#123;<span class="attr">$sum</span>: <span class="string">&#x27;$total&#x27;</span>&#125;, <span class="attr">saleCount</span>: &#123;<span class="attr">$sum</span>:<span class="number">1</span>&#125;&#125;&#125;,</span><br><span class="line"></span><br><span class="line">    &#123;<span class="attr">$merge</span>: &#123; <span class="attr">into</span>: &#123;<span class="attr">db</span>: <span class="string">&#x27;test&#x27;</span>, <span class="attr">coll</span>: <span class="string">&#x27;monthlySales&#x27;</span>&#125;&#125;&#125;</span><br><span class="line">])</span><br></pre></td></tr></table></figure><p>The shape of the document in <strong>monthlySales</strong> is verbatim the document(s) produced by the predecessor pipeline stage of <code>$merge</code>:</p><figure class="highlight reasonml"><table><tr><td class="code"><pre><span class="line"><span class="comment">// db.monthlySales.findOne()</span></span><br><span class="line"></span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="constructor">ISODate(<span class="string">&quot;2021-08-01T00:00:00Z&quot;</span>)</span>, <span class="string">&quot;saleCount&quot;</span> : <span class="number">1613</span>, <span class="string">&quot;total&quot;</span> : <span class="constructor">NumberDecimal(<span class="string">&quot;815527.3523&quot;</span>)</span> &#125;</span><br></pre></td></tr></table></figure><p>What happens though if it is still August? A previous value for August may exist already. 
We have to address what would be done in case a document with the same <code>_id</code> is not present, as well as what to do if it is.</p><p>When a document is not present, <code>$merge</code> defaults to inserting it. But you can also tell it to discard the new document, or produce an error. For example, you may want to prevent accidental creation of documents in the target collection, so that arbitrary items are not created unless some approved, well-known document already exists. The full universe of choices is expressed via the <code>whenNotMatched</code> field with the value being one of <code>insert</code>, <code>discard</code>, or <code>fail</code>. Note that in the case of <code>fail</code>, other documents may already have been inserted, so the state is non-deterministic - the operation is not wrapped in a transaction.</p><p>This example shows silently discarding any documents that <code>$merge</code> found no match for in the target collection. Someone would have had to create a document for August 2021, otherwise the result for August is effectively ignored by $merge and nothing will be written.</p><figure class="highlight lasso"><table><tr><td class="code"><pre><span class="line">&#123;$merge: &#123; </span><br><span class="line">    <span class="keyword">into</span>: &#123;db: <span class="string">&#x27;test&#x27;</span>, coll: <span class="string">&#x27;monthlySales&#x27;</span>&#125;, </span><br><span class="line">    whenNotMatched: <span class="string">&#x27;discard&#x27;</span></span><br><span class="line">&#125;&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>When a document <em>is</em> matched, you have a few more choices. The default is to <em>replace</em> the existing document. The full list of options: <code>replace</code>, <code>keepExisting</code>, <code>merge</code>, <code>fail</code>, and <code>pipeline</code>. You can choose to keep the existing document to preserve things as they were. 
Replace and fail are pretty straightforward as well. </p><p> Specifying <code>whenMatched: &#39;merge&#39;</code> will slam fields from the new document into the existing document.</p><table><thead><tr><th>Document</th><th>Value</th></tr></thead><tbody><tr><td>Existing</td><td><code>&#123;_id:1, name: &#39;bob&#39;, age: 33&#125;</code></td></tr><tr><td>New Document</td><td><code>&#123;_id:1, age: 34, height: 1.74 &#125;</code></td></tr><tr><td>Resultant Document</td><td><code>&#123;_id:1, name: &#39;bob&#39;, age: 34, height: 1.74 &#125;</code></td></tr></tbody></table><p>An existing field’s value is replaced and previously non-existing fields will be created. Fields not present in the new document which existed before are left as is.</p><p>That’s good for having multiple separate calculations affecting some destination fields. But for incremental update, we’ll be looking to combine the value of the existing field in the target collection with a value from the new calculation on the origin collection.</p><p>Given the document in an <strong>origin</strong> collection, and a corresponding document already in the <strong>destination</strong> collection:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">origin</span>.<span class="title function_">insert</span>(&#123; <span class="attr">_id</span>:<span class="number">1</span>, <span class="attr">x</span>:<span class="number">2</span>&#125;)</span><br><span class="line">db.<span class="property">destination</span>.<span class="title function_">insert</span>(&#123; <span class="attr">_id</span>:<span class="number">1</span>, <span class="attr">x</span>:<span class="number">40</span>&#125;)</span><br></pre></td></tr></table></figure><p>We can now add the existing value in the <strong>destination</strong>‘s document field <strong>x</strong> to the sum of that value and the value of the field <strong>x</strong> from the preceding pipeline operator ($match 
for simplicity here):</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">origin</span>.<span class="title function_">aggregate</span>([&#123;<span class="attr">$match</span>: &#123;&#125;&#125;, &#123;<span class="attr">$merge</span>: &#123;<span class="attr">into</span>: <span class="string">&#x27;destination&#x27;</span>, <span class="attr">whenMatched</span>: [&#123;<span class="attr">$set</span>: &#123;<span class="attr">x</span>: &#123;<span class="attr">$add</span>: [<span class="string">&quot;$x&quot;</span>,<span class="string">&quot;$$new.x&quot;</span>]&#125;&#125;&#125; ]&#125;&#125;] );</span><br><span class="line"></span><br><span class="line">db.<span class="property">destination</span>.<span class="title function_">find</span>();</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="number">1</span>, <span class="string">&quot;x&quot;</span> : <span class="number">42</span> &#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Using the special pipeline variable <code>$$new</code>, we can distinguish the field x from the existing field x that was in the destination document. </p><p>The pipeline constructed to produce the merge of existing documents can only take one of the stages <code>$addFields</code>, <code>$set</code>, <code>$project</code>, <code>$unset</code>, <code>$replaceRoot</code>, or <code>$replaceWith</code>. This mini-pipeline only ever sees a single document input from the destination collection - the one that matched - so other pipeline operators really don’t make a lot of sense. Only single-document-modifier types are allowed - or at least that’s how I make sense of it.</p><h2 id="Union-Queries"><a href="#Union-Queries" class="headerlink" title="Union Queries"></a>Union Queries</h2><p>With all this syntax background, let’s turn back to our original problem. 
We want to have cheaper queries over historical periods yet combine them with recent on-the-fly values. Storing the past data can be done at some frequency, either monthly or weekly or daily. Given that we can incrementally update the destination collection, we’re free to pick a frequency that makes sense.</p><p>To combine documents for a partial month, we’ll need to query 2 collections though. The historic sales in the <strong>monthlySales</strong> collection, and the results from the month-to-date query on the fly from <strong>sales</strong>. Would be nice to do it in one command to Mongo. Historically, we would shoot 2 queries and combine the results into one client-side. But that’s extra code that someone had to write, and it makes populating generic graphs more involved. </p><p>With the introduction of <code>$unionWith</code>, a single aggregation command can now return results from more than one collection! Harnessing this for our scenario, we’ll query the “main” collection <strong>sales</strong> for the current month-to-date, then union the results with a pipeline over the <strong>monthlySales</strong> collection for the already-computed documents since the beginning of the year:</p><figure class="highlight php"><table><tr><td class="code"><pre><span class="line">db.sales.<span class="title function_ invoke__">aggregate</span>([</span><br><span class="line">    &#123; <span class="variable">$match</span>: &#123; <span class="attr">date</span>: &#123; <span class="variable">$gte</span>: <span class="title function_ invoke__">ISODate</span>(<span class="string">&#x27;2021-08-01&#x27;</span>), <span class="variable">$lt</span>: <span class="title function_ invoke__">ISODate</span>(<span class="string">&#x27;2021-09-01&#x27;</span>) &#125; &#125; &#125;,</span><br><span class="line"></span><br><span class="line">    &#123; <span class="variable">$group</span>: &#123; <span class="attr">_id</span>: <span class="title function_ invoke__">ISODate</span>(<span 
class="string">&#x27;2021-08-01&#x27;</span>), <span class="attr">saleCount</span>: &#123; <span class="variable">$sum</span>: <span class="number">1</span> &#125;, <span class="attr">total</span>: &#123; <span class="variable">$sum</span>: <span class="string">&#x27;$total&#x27;</span>&#125; &#125;&#125;,</span><br><span class="line">    </span><br><span class="line">    &#123;</span><br><span class="line">        <span class="variable">$unionWith</span>: &#123;</span><br><span class="line">            <span class="attr">coll</span>: <span class="string">&#x27;monthlySales&#x27;</span>,</span><br><span class="line">            <span class="attr">pipeline</span>: [</span><br><span class="line">                &#123; <span class="variable">$match</span>: &#123; <span class="attr">_id</span>: &#123; <span class="variable">$gte</span>: <span class="title function_ invoke__">ISODate</span>(<span class="string">&#x27;2021-01-01&#x27;</span>), <span class="variable">$lt</span>: <span class="title function_ invoke__">ISODate</span>(<span class="string">&#x27;2021-08-01&#x27;</span>) &#125; &#125; &#125;,</span><br><span class="line">            ]</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;,</span><br><span class="line">    </span><br><span class="line">    &#123; <span class="variable">$sort</span>: &#123; <span class="attr">_id</span>: -<span class="number">1</span> &#125; &#125;</span><br><span class="line">])</span><br><span class="line"></span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-08-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">1613</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;815527.3523&quot;</span>) &#125;</span><br><span class="line">&#123; <span 
class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-07-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">5255</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2615967.9695&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-06-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">5134</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2540986.0768&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-05-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">5294</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2680115.7638&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-04-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">5083</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2532271.9124&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-03-01T00:00:00Z&quot;</span>), <span 
class="string">&quot;saleCount&quot;</span> : <span class="number">5270</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2647041.1003&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-02-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">4809</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2424930.7338&quot;</span>) &#125;</span><br><span class="line">&#123; <span class="string">&quot;_id&quot;</span> : <span class="title function_ invoke__">ISODate</span>(<span class="string">&quot;2021-01-01T00:00:00Z&quot;</span>), <span class="string">&quot;saleCount&quot;</span> : <span class="number">5252</span>, <span class="string">&quot;total&quot;</span> : <span class="title function_ invoke__">NumberDecimal</span>(<span class="string">&quot;2608231.7712&quot;</span>) &#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Mic drop. A single query processed all on the server side, giving us everything we need in one shot.</p><p>Which leaves us with the small matter of granularity. How much of the work do we want to do on the fly vs. just digging up the computed data?</p><p>If we populate the <strong>monthlySales</strong> monthly after the edn of each month, we’ll need to query up to 31 days worth of data on the ad-hoc side. If we update the sales daily we’ll have less than a day’s worth to sum on the fly, with the onus of determining exactly what cutoff window we set for the current vs the existing already merged data. 
Whatever we choose, we’ll just need to make sure our boundaries for the time window don’t overlap so we don’t count the same sale too many or too few times.</p><h2 id="Getting-There"><a href="#Getting-There" class="headerlink" title="Getting There"></a>Getting There</h2><p>To round things off, you might want to start accumulating pre-calculated data but already have a slew of individual documents in your live collection. You could write some script client side to iterate the months and years of the past. Or you could run a query to populate the monthlySales like so:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sales</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="attr">$group</span>:</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="attr">_id</span>: &#123; <span class="attr">$dateFromParts</span>: &#123; <span class="attr">year</span>: &#123; <span class="attr">$year</span>: <span class="string">&#x27;$date&#x27;</span> &#125;, <span class="attr">month</span>: &#123; <span class="attr">$month</span>: <span class="string">&#x27;$date&#x27;</span> &#125; &#125; &#125;,</span><br><span class="line">            <span class="attr">total</span>: &#123; <span class="attr">$sum</span>: <span class="string">&#x27;$total&#x27;</span> &#125;,</span><br><span class="line">            <span class="attr">saleCount</span>: &#123; <span class="attr">$sum</span>: <span class="number">1</span> &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;,</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="attr">$merge</span>:</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="attr">into</span>: &#123; <span class="attr">db</span>: <span 
class="string">&#x27;test&#x27;</span>, <span class="attr">coll</span>: <span class="string">&#x27;monthlySales&#x27;</span> &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">])</span><br></pre></td></tr></table></figure><p>There are more nuances, bells, and whistles that come along with <code>$unionWith</code> and <code>$merge</code>, but I think with the above techniques we can already achieve a good portion of the historic+current type scenarios. You will of course need a recent MongoDB version - <code>$merge</code> was introduced in 4.2, and <code>$unionWith</code> in 4.4.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;TL-DR&quot;&gt;&lt;a href=&quot;#TL-DR&quot; class=&quot;headerlink&quot; title=&quot;TL;DR&quot;&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;&lt;p&gt;Schedule periodic rollup and aggregation on historic data </summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="BI" scheme="https://www.plusnconsulting.com/tags/BI/"/>
    
  </entry>
  
  <entry>
    <title>Time Series Collections - Native Support</title>
    <link href="https://www.plusnconsulting.com/post/Time-Series-Collections-Native-Support/"/>
    <id>https://www.plusnconsulting.com/post/Time-Series-Collections-Native-Support/</id>
    <published>2021-07-30T21:14:51.000Z</published>
    <updated>2021-07-30T23:43:54.950Z</updated>
    
    <content type="html"><![CDATA[<h2 id="Time-Series-amp-Mongo"><a href="#Time-Series-amp-Mongo" class="headerlink" title="Time Series &amp; Mongo"></a>Time Series &amp; Mongo</h2><p>Time series data is a fairly common use case and has been “solved” before. Indeed, in blog posts around the net as well as in my course on schema design and data modeling, a common approach to storing such data was discussed at length.</p><p>But not everyone knows that bit of common wisdom. Further, it’s easy to get caught up in the storage details and make silly little mistakes that wreck the exercise.</p><p>Someone at MongoDB had the brilliant idea to simplify this further (I should buy them a drink of their choice if I find out who it was). Instead of you painstakingly bucketing each “reading” or time series item, then un-bucketing it or peppering your queries with time-bucket-boundary match clauses - why not provide a new collection type for time series which does the tricky bits for you?</p><p>As far as the client is concerned, every document is a stand-alone point-in-time item which has a timestamp and some measurement - a reading - composed of one or more fields recording your data. If you insert a document, you insert one per reading. You don’t have to push into arrays, or compute any time window, or anything like that. </p><p>When reading a document from a time series collection, each point-in-time measurement returns as a single document - not a bucket with an array of readings within a window. </p><p>It appears largely un-magical, until and unless you compare it to what you had to do manually before this existed. 
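</p><p>To make that concrete, inserting a reading is just a plain insert of one document - a sketch assuming the collection and field names used in the examples later in this post (<code>sensings</code>, <code>timestamp</code>, <code>info</code>):</p>

```javascript
// One document per reading - no buckets, no arrays.
// (Collection and field names assumed from the later examples.)
db.sensings.insertOne({
  temperature: 98.6,
  timestamp: ISODate("2021-09-01T00:06:00Z"),
  info: { sensorId: 123 }
})
```
<p>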
</p><p>The <strong>before</strong> document was a bucket with multiple readings, so for a 12h bucket and, say, 4 temperature readings in the field <code>temperature</code> in a day you would see something like: </p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line">db.sensings.find(&#123;bucketStart: &#123;$gte: ISODate(&#x27;2021-09-01&#x27;), $lt: ISODate(&#x27;2021-09-02&#x27;)&#125;&#125;)</span><br><span class="line"></span><br><span class="line">&gt; [</span><br><span class="line">    &#123;_id: ..., bucketStart: ISODate(&quot;2021-09-01T00:00:00Z&quot;), sensorId: 123, readings: [</span><br><span class="line">        &#123;temperature: 97.1, timestamp: ISODate(&quot;2021-09-01T00:00:00&quot;)&#125;,</span><br><span class="line">        &#123;temperature: 98.6, timestamp: ISODate(&quot;2021-09-01T00:06:00&quot;)&#125;,</span><br><span class="line">        &#123;temperature: 96.9, timestamp: ISODate(&quot;2021-09-01T00:12:00&quot;)&#125;,</span><br><span class="line">        &#123;temperature: 97.2, timestamp: ISODate(&quot;2021-09-01T00:18:00&quot;)&#125;</span><br><span class="line">    ]&#125;</span><br><span class="line">]</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Using a Time-Series Collection, the readings are presented to you individually:</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"></span><br><span 
class="line">db.sensings.find(&#123;timestamp: &#123;$gte: ISODate(&#x27;2021-09-01&#x27;), $lt: ISODate(&#x27;2021-09-02&#x27;)&#125;&#125;)</span><br><span class="line"></span><br><span class="line">&gt; [</span><br><span class="line">    &#123;_id: ..., temperature: 97.1, timestamp: ISODate(&quot;2021-09-01T00:00:00&quot;), info: &#123;sensorId: 123&#125;&#125;,</span><br><span class="line">    &#123;_id: ..., temperature: 98.6, timestamp: ISODate(&quot;2021-09-01T00:06:00&quot;), info: &#123;sensorId: 123&#125;&#125;,</span><br><span class="line">    &#123;_id: ..., temperature: 96.9, timestamp: ISODate(&quot;2021-09-01T00:12:00&quot;), info: &#123;sensorId: 123&#125;&#125;,</span><br><span class="line">    &#123;_id: ..., temperature: 97.2, timestamp: ISODate(&quot;2021-09-01T00:18:00&quot;), info: &#123;sensorId: 123&#125;&#125;</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>You will note that the bucket granularity is nowhere in the results! You may have chosen buckets spanning seconds (the default), minutes, or hours. But that is a storage choice, and your queries and users don’t need to care. 
This awesome arrangement lets users just be users. Insert readings individually as they occur, read and summarize over time as if they were stored individually. On the surface this matches the business expectation and mental model of periodic measurements written or read as they are.</p><p>The internals are what’s nice here. The “collection” presented to the user (<code>sensings</code> in the example above) is actually more of a view over internal collections that physically exist but that you do not directly interact with. The backing storage collections are organized to bucket the individual readings into some fixed time window size. If you choose an hours granularity, there can be up to 1 bucket per hour. Multiple readings within the same hour would all fall into the same bucket, and the next hour’s readings will fall into the next bucket. Except now you don’t worry about structuring your document in any special way - Mongo does it for you!</p><p>At the time of writing, the internal physical backing collection is created when you declare your special Time-Series collection. For my example of the collection named <code>sensings</code> Mongo has created a backing collection <code>system.buckets.sensings</code> and stores physical documents under that namespace, in buckets.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">show collections</span><br><span class="line"></span><br><span class="line">&gt; sensings </span><br><span class="line">&gt; system.<span class="property">buckets</span>.<span class="property">sensings</span></span><br></pre></td></tr></table></figure><p>Oh, yes: you have to declare your special collection. If you have Mongo implicitly create a collection by inserting data into a yet-to-be-created collection, Mongo will assume a normal collection. The syntax requires one thing only - the <code>timeField</code>. 
I also like to include the <code>metaField</code> and <code>granularity</code> parameters, for reasons expanded below.</p><p>To create my sensings collection, optimized for an ingestion rate fitting hours-long buckets, issue the following command:</p><figure class="highlight reasonml"><table><tr><td class="code"><pre><span class="line">db.create<span class="constructor">Collection(<span class="string">&quot;sensings&quot;</span>, &#123;<span class="params">timeseries</span>: &#123;<span class="params">timeField</span>: <span class="string">&quot;timestamp&quot;</span>, <span class="params">metaField</span>: <span class="string">&quot;info&quot;</span>, <span class="params">granularity</span>: <span class="string">&quot;hours&quot;</span>&#125;&#125;)</span></span><br></pre></td></tr></table></figure><p>The collection named <code>sensings</code> will be created (as well as a corresponding <code>system.buckets.sensings</code> behind the scenes). It treats the field named <code>timestamp</code> on your inserted documents as the time for bucketing purposes. It will also let you add extra non-measurement data or “meta” details in the field named “info”. The collection will optimize its buckets for a granularity where multiple readings span a matter of hours. For high ingestion rates you should choose <code>minutes</code> or <code>seconds</code>.</p><p>What is the business with the <code>metaField</code> and <code>granularity</code> about? It lets Mongo optimize the bucket sizes internally and organize storage efficiently. The theory behind creating time buckets is that:</p><ol><li>Events are written once</li><li>Events are read over a time period</li><li>Loading fewer documents that contain all the readings is more efficient than if each reading resided in one physical document.</li></ol><p>When doing things manually we could shove the metadata extras such as sensor-id or patient-id in any field. 
But for Mongo to understand that the metadata is a label and not a metric, and to enable it to create buckets specific to groups of readings, we have to tell it where we store the metadata. In our case here, I used the <code>info</code> field. In that field, I can store a single value or several different fields. This way, when I query for the average temperature for a particular <code>sensorId</code>, since it is written inside the <code>info</code> field, Mongo knows that and can load buckets for that meta only. It will try to create a bucket for each time window and for each meta value or combination of values. That way, a query for sensor 123 doesn’t need to read buckets for sensor 456.</p><p>But the magic ends there, at least for now: <strong>Mongo does not automatically create indexes on the meta fields</strong>. It leaves that up to you. Mongo <strong>does</strong> create indexes on the bucket boundary timestamps on the backing collection, but that is usually not enough for prime time. You will no doubt have some narrower interest in slicing the time series by some of the meta fields. 
I’ll add a compound index on my <code>sensorId</code> like so:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sensings</span>.<span class="title function_">createIndex</span>(&#123;<span class="string">&quot;info.sensorId&quot;</span>: <span class="number">1</span>, <span class="string">&quot;timestamp&quot;</span>: <span class="number">1</span>&#125;)</span><br></pre></td></tr></table></figure><p>My query can now get the average temperature across the devices on a certain day, leveraging the index I just created:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sensings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;<span class="attr">$match</span>: &#123;<span class="attr">timestamp</span>: &#123;<span class="attr">$gte</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01&quot;</span>), <span class="attr">$lt</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-02&quot;</span>)&#125;&#125;&#125;,</span><br><span class="line">    &#123;<span class="attr">$group</span>: &#123;<span class="attr">_id</span>: <span class="string">&quot;$info.sensorId&quot;</span>, <span class="attr">avgTemperature</span>: &#123;<span class="attr">$avg</span>: <span class="string">&quot;$temperature&quot;</span>&#125;&#125;&#125;</span><br><span class="line">    ] )</span><br></pre></td></tr></table></figure><p>A query chased with an <em>explain</em> shows the index is used:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sensings</span>.<span class="title function_">find</span>(&#123;<span class="string">&quot;timestamp&quot;</span> : <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01T00:18:00Z&quot;</span>), <span 
class="string">&quot;info.sensorId&quot;</span> : <span class="number">123</span>&#125;).<span class="title function_">explain</span>()</span><br><span class="line"></span><br><span class="line"><span class="comment">// &quot;inputStage&quot; : &#123;</span></span><br><span class="line"><span class="comment">// &quot;stage&quot; : &quot;IXSCAN&quot;, </span></span><br><span class="line"><span class="comment">// &quot;keyPattern&quot; : &#123;</span></span><br><span class="line"><span class="comment">//         &quot;meta.sensorId&quot; : 1,</span></span><br><span class="line"><span class="comment">//         &quot;control.min.timestamp&quot; : 1,</span></span><br><span class="line"><span class="comment">//         &quot;control.max.timestamp&quot; : 1</span></span><br><span class="line"><span class="comment">// &#125;,</span></span><br><span class="line"><span class="comment">// &quot;indexName&quot; : &quot;info.sensorId_1_timestamp_1&quot;,</span></span><br></pre></td></tr></table></figure><p>The inputStage section shows an <code>IXSCAN</code> leveraging my index named <code>info.sensorId_1_timestamp_1</code>.</p><p>What is interesting to compare is the index definitions on the nominal collection <code>sensings</code> versus the internal collection <code>system.buckets.sensings</code>.</p><p>The nominal collection shows what you expect: an index on the direct structural fields <code>timestamp</code> and the embedded <code>info.sensorId</code>: </p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&gt; db.<span class="property">sensings</span>.<span class="title function_">getIndexes</span>()</span><br><span class="line">[</span><br><span class="line">        &#123;</span><br><span class="line">                <span class="string">&quot;v&quot;</span> : <span class="number">2</span>,</span><br><span class="line">                <span class="string">&quot;key&quot;</span> : &#123;</span><br><span class="line">                
        <span class="string">&quot;info.sensorId&quot;</span> : <span class="number">1</span>,</span><br><span class="line">                        <span class="string">&quot;timestamp&quot;</span> : <span class="number">1</span></span><br><span class="line">                &#125;,</span><br><span class="line">                <span class="string">&quot;name&quot;</span> : <span class="string">&quot;info.sensorId_1_timestamp_1&quot;</span></span><br><span class="line">        &#125;</span><br><span class="line">]</span><br><span class="line">&gt;</span><br></pre></td></tr></table></figure><p>But the backing collection shows a different picture: </p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&gt; db.<span class="property">system</span>.<span class="property">buckets</span>.<span class="property">sensings</span>.<span class="title function_">getIndexes</span>()</span><br><span class="line">[</span><br><span class="line">        &#123;</span><br><span class="line">                <span class="string">&quot;v&quot;</span> : <span class="number">2</span>,</span><br><span class="line">                <span class="string">&quot;key&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;meta.sensorId&quot;</span> : <span class="number">1</span>,</span><br><span class="line">                        <span class="string">&quot;control.min.timestamp&quot;</span> : <span class="number">1</span>,</span><br><span class="line">                        <span class="string">&quot;control.max.timestamp&quot;</span> : <span class="number">1</span></span><br><span class="line">                &#125;,</span><br><span class="line">                <span class="string">&quot;name&quot;</span> : <span class="string">&quot;info.sensorId_1_timestamp_1&quot;</span></span><br><span class="line">        &#125;</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>The 2-field index I 
specified was actually created with three key fields behind the scenes. The timestamp - a single field in my nominal documents - was translated into a bucket boundary defined by the <code>control.min.timestamp</code> and <code>control.max.timestamp</code> fields. Makes sense. The raison d’être of this whole thing is exactly that: your nominal collection facade deals with individual reading documents, but the physical storage of the documents is something entirely different and more efficient than the facade lets on. For a lesson on how the more efficient document is actually structured, I inserted a few documents: </p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">sensings</span>.<span class="title function_">insertMany</span>([</span><br><span class="line">&#123;<span class="attr">temperature</span>: <span class="number">97.1</span>, <span class="attr">timestamp</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01T00:00:00&quot;</span>), <span class="attr">info</span>: &#123;<span class="attr">sensorId</span>:<span class="number">123</span>&#125;&#125;,</span><br><span class="line">&#123;<span class="attr">temperature</span>: <span class="number">98.6</span>, <span class="attr">timestamp</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01T00:06:00&quot;</span>), <span class="attr">info</span>: &#123;<span class="attr">sensorId</span>:<span class="number">123</span>&#125;&#125;,</span><br><span class="line">&#123;<span class="attr">temperature</span>: <span class="number">96.9</span>, <span class="attr">timestamp</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01T00:12:00&quot;</span>), <span class="attr">info</span>: &#123;<span class="attr">sensorId</span>:<span class="number">123</span>&#125;&#125;,</span><br><span class="line">&#123;<span class="attr">temperature</span>: 
<span class="number">97.2</span>, <span class="attr">timestamp</span>: <span class="title class_">ISODate</span>(<span class="string">&quot;2021-09-01T00:18:00&quot;</span>), <span class="attr">info</span>: &#123;<span class="attr">sensorId</span>:<span class="number">123</span>&#125;&#125;</span><br><span class="line">])</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Then I chased it with a query on the backing collection <code>system.buckets.sensings</code> : </p><figure class="highlight ada"><table><tr><td class="code"><pre><span class="line">&gt; db.system.buckets.sensings.findOne()</span><br><span class="line">&#123;</span><br><span class="line">        <span class="string">&quot;_id&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;612ec2809f952fa914335e8e&quot;</span>),</span><br><span class="line">        <span class="string">&quot;control&quot;</span> : &#123;</span><br><span class="line">                <span class="string">&quot;version&quot;</span> : 1,</span><br><span class="line">                <span class="string">&quot;min&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;_id&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;6104819959cb4aa1c00f0b9c&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;temperature&quot;</span> : 96.9,</span><br><span class="line">                        <span class="string">&quot;timestamp&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:00:00Z&quot;</span>)</span><br><span class="line">                &#125;,</span><br><span class="line">                <span class="string">&quot;max&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;_id&quot;</span> : <span class="type">ObjectId</span>(<span 
class="string">&quot;6104819959cb4aa1c00f0b9f&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;temperature&quot;</span> : 98.6,</span><br><span class="line">                        <span class="string">&quot;timestamp&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:18:00Z&quot;</span>)</span><br><span class="line">                &#125;</span><br><span class="line">        &#125;,</span><br><span class="line">        <span class="string">&quot;meta&quot;</span> : &#123;</span><br><span class="line">                <span class="string">&quot;sensorId&quot;</span> : 123</span><br><span class="line">        &#125;,</span><br><span class="line">        <span class="string">&quot;data&quot;</span> : &#123;</span><br><span class="line">                <span class="string">&quot;_id&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;0&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;6104819959cb4aa1c00f0b9c&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;1&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;6104819959cb4aa1c00f0b9d&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;2&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;6104819959cb4aa1c00f0b9e&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;3&quot;</span> : <span class="type">ObjectId</span>(<span class="string">&quot;6104819959cb4aa1c00f0b9f&quot;</span>)</span><br><span class="line">                &#125;,</span><br><span class="line">                <span class="string">&quot;temperature&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;0&quot;</span> : 97.1,</span><br><span class="line">                 
       <span class="string">&quot;1&quot;</span> : 98.6,</span><br><span class="line">                        <span class="string">&quot;2&quot;</span> : 96.9,</span><br><span class="line">                        <span class="string">&quot;3&quot;</span> : 97.2</span><br><span class="line">                &#125;,</span><br><span class="line">                <span class="string">&quot;timestamp&quot;</span> : &#123;</span><br><span class="line">                        <span class="string">&quot;0&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:00:00Z&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;1&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:06:00Z&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;2&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:12:00Z&quot;</span>),</span><br><span class="line">                        <span class="string">&quot;3&quot;</span> : <span class="type">ISODate</span>(<span class="string">&quot;2021-09-01T00:18:00Z&quot;</span>)</span><br><span class="line">                &#125;</span><br><span class="line">        &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>control</code> fields describe the bucket itself. They provide boundary info: the minimum and maximum timestamps for the bucket. The <code>meta</code> field contains the fields that I told Mongo I’d store under the original <code>info</code> field in my inserted documents. In my case it’s a single field, but there could be more. It would make little logical sense for Mongo to have a bucket spanning multiple sensors. So Mongo will find or create a bucket for each distinct meta value or combination of values in a given time period. Finally, the data. 
Fields not originally contained in the <code>info</code> field as a sub-document, and which are not the timestamp itself, are stored under the <code>data</code> field in backing storage. This includes the <code>temperature</code> - as expected. It also contains the corresponding timestamp each measurement was taken at. Since every document has an _id in Mongo, the _id of the reading is also stored here. The correlation among the triplet of values is positional, so the first reading in the bucket is spread across <code>data._id[0]</code>, <code>data.temperature[0]</code> and <code>data.timestamp[0]</code>, the next reading in the bucket is <code>...[1]</code> and so forth.</p><p>Back to my aggregation from before: <code>db.sensings.aggregate([&#123;$match: &#123;timestamp: &#123;$gte: ISODate(&quot;2021-09-01&quot;), $lt: ISODate(&quot;2021-09-02&quot;)&#125;&#125;&#125;, &#123;$group: &#123;_id: &quot;$info.sensorId&quot;, avgTemperature: &#123;$avg: &quot;$temperature&quot;&#125;&#125;&#125;] )</code>. Nowhere did I include any positional magic or unwinds or extra filtering logic. MongoDB did the work for me, taking my straightforward mental model of the nominal collection - one reading per document - and doing all the heavy lifting. 
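</p><p>To see how that positional correlation works, here is a small illustrative sketch (my own, not MongoDB’s internals) that flattens a bucket document back into individual readings the way the facade presents them:</p>

```javascript
// Illustrative only - conceptually mimics what the time-series facade
// does: walk the positional columns of a bucket and re-assemble each
// triplet (_id, timestamp, temperature) into a stand-alone reading.
function unbucket(bucket) {
  return Object.keys(bucket.data.timestamp).map(i => ({
    _id: bucket.data._id[i],
    timestamp: bucket.data.timestamp[i],
    temperature: bucket.data.temperature[i],
    info: bucket.meta // the meta label is shared by the whole bucket
  }));
}
```
<p>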
Life is good.</p><p>In the course of writing this post and poking around, I found this bit interesting: when you run collection stats on the nominal collection you get a new section named <code>timeseries</code>:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&gt; db.<span class="property">sensings</span>.<span class="title function_">stats</span>().<span class="property">timeseries</span></span><br><span class="line">&#123;</span><br><span class="line">        <span class="string">&quot;bucketsNs&quot;</span> : <span class="string">&quot;test.system.buckets.sensings&quot;</span>,</span><br><span class="line">        <span class="string">&quot;bucketCount&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketInserts&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketUpdates&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsOpenedDueToMetadata&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsClosedDueToCount&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsClosedDueToSize&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsClosedDueToTimeForward&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsClosedDueToTimeBackward&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numBucketsClosedDueToMemoryThreshold&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numCommits&quot;</span> : <span class="number">0</span>,</span><br><span 
class="line">        <span class="string">&quot;numWaits&quot;</span> : <span class="number">0</span>,</span><br><span class="line">        <span class="string">&quot;numMeasurementsCommitted&quot;</span> : <span class="number">0</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The new section gives some clues that the salient metrics to watch have to do with buckets and their management: when they are “opened” and “closed” with respect to aspects such as metadata, the number of items, the total size of the bucket, the time boundaries, and memory pressure. Pretty interesting, but I have no further insight into how to interpret these numbers. Time will tell (rim-shot, exit stage).</p><p>What’s next for Time-Series? I don’t know. Now that the abstraction and use-case have been introduced, the team can presumably focus on even more efficient storage or faster ways to process time-series queries. Judging from the past, this is how things unfold: a feature is introduced, which “makes it a thing”, then future iterations make it even better and more useful. I for one see this as a great new feature.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;Time-Series-amp-Mongo&quot;&gt;&lt;a href=&quot;#Time-Series-amp-Mongo&quot; class=&quot;headerlink&quot; title=&quot;Time Series &amp;amp; Mongo&quot;&gt;&lt;/a&gt;Time Series &amp;amp; Mon</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="IoT" scheme="https://www.plusnconsulting.com/tags/IoT/"/>
    
    <category term="Modeling" scheme="https://www.plusnconsulting.com/tags/Modeling/"/>
    
  </entry>
  
  <entry>
    <title>If it ain&#39;t broke - HTTP PATCH it!</title>
    <link href="https://www.plusnconsulting.com/post/If-it-ain-t-broke-patch-it/"/>
    <id>https://www.plusnconsulting.com/post/If-it-ain-t-broke-patch-it/</id>
    <published>2021-01-19T17:59:12.000Z</published>
    <updated>2021-01-19T23:47:08.284Z</updated>
    
<content type="html"><![CDATA[<p>Is your document “broken” inside your database? Sure, from time to time a value on a document needs to be updated, because the real world thing it describes changes. But does that mean what you currently have is “broken”? No, you say - it is not broken, just that “field xyz needs to change value”. Or “field xyz is no longer needed”. </p><p>Fine then: the document structure - the schema if you will - is still sound. </p><p>Where am I going with this? If the document is not broken, then you should not “touch” the whole document. You should use a <em>surgical</em> update, one that only sends a command to modify a field inside the document.</p><p>In MongoDB, the <code>update</code> command takes two forms:</p><ol><li>Document Replacement (AKA: “Slam”)</li><li>Surgical Field Manipulation</li></ol><p>In the Document Replacement form, you supply the <code>update</code> command with a full document value which will replace the whole document in the database. Doing so will logically target a single document (by <code>_id</code> or some alternate key) and supply the future document in its entirety.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">peeps</span>.<span class="title function_">update</span>(&#123;<span class="attr">_id</span>: <span class="string">&#x27;superbob&#x27;</span>&#125;, &#123;<span class="attr">name</span>: <span class="string">&#x27;bob&#x27;</span>, <span class="attr">likes</span>: [<span class="string">&#x27;coffee&#x27;</span>,<span class="string">&#x27;bacon&#x27;</span>]&#125;)</span><br></pre></td></tr></table></figure><p>The <code>update</code> above slams in a new document with only the <em>_id</em>, <em>name</em>, and <em>likes</em> fields remaining. Beyond losing any previous document field values, this update also assumes you intended to replace both the <em>name</em> and the <em>likes</em>. This may be true. 
But usually it’s just that in order to update one thing, such as adding <em>‘bacon’</em> to the list of things ‘superbob’ likes, you had to include previous <em>unchanged</em> values such as the <em>name</em>. Two problems here: first, you need to read the value from the database ahead of the update; second, the value you just read may be stale - updated by someone else - by the time you write it back into the database.</p><p>The second - and my preferred - way to update a document is using surgical updates. If all I want to do is add <em>‘bacon’</em> to the list of likes, I can issue the command:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">peeps</span>.<span class="title function_">update</span>(&#123;<span class="attr">_id</span>: <span class="string">&#x27;superbob&#x27;</span>&#125;, &#123;<span class="attr">$addToSet</span>: &#123;<span class="attr">likes</span>: <span class="string">&#x27;bacon&#x27;</span>&#125;&#125;)</span><br></pre></td></tr></table></figure><p>This form will</p><ol><li>Add ‘bacon’ to the <em>likes</em> field if it doesn’t already have ‘bacon’ in it</li><li>Not touch the <em>name</em> field, or any other field in the document</li><li>Not require you to read the document before you issue the update, since you will not touch other fields.</li></ol><p>Which brings me to the point of this post: What about REST?</p><p>REST APIs use HTTP actions to represent the action taken. The usual suspects are <em>GET</em>, <em>POST</em>, <em>PUT</em>, <em>DELETE</em>. The least controversial one is probably DELETE. GET gains some bad-boy street cred over whether queries should be specified solely using the URI path or query variables. 
POST and PUT generate lots of lengthy discussions over whether PUT should or can create new objects, and whether, in the context of databases, it is permissible to return the value of the created or modified object in full, or at least its id or URI. Lots of discussions. But my focus here is on update. I’d like to avoid having to query the current document value, and I’d like to only issue a change for a subset of the fields on my entity, umm, document.</p><p>HTTP’s <a href="https://tools.ietf.org/html/rfc5789"><strong>PATCH</strong></a> method aims to do exactly that. It lets a caller supply a future state of some of the fields, against a background of the existing, current entity. The RFC states:</p><blockquote><p>“The difference between the PUT and PATCH requests is reflected in the    way the server processes the enclosed entity to modify the resource    identified by the Request-URI.  In a PUT request, the enclosed entity is considered to be a modified version of the resource stored on the origin server, and the client is requesting that the stored version be replaced.  With PATCH, however, the enclosed entity contains a set of instructions describing how a resource currently residing on the origin server should be modified to produce a new version.”</p></blockquote><p>This has “Surgical Update” written all over it, and maps naturally to the intent of a surgical update. And with MongoDB, some operators, such as <code>$addToSet</code>, <code>$set</code>, and <code>$unset</code>, are idempotent. </p><p>Where’s the rub?</p><p>PATCH is not always implemented as a surgical update. 
Some API frameworks don’t explicitly support PATCH, and others implement PATCH as a PUT (read: “slam” semantics) instead of the intended surgical semantics.</p><p>Some, like <a href="https://mongoosejs.com/">Mongoose</a>, support the surgical update semantics explicitly with the <code>Document.prototype.update()</code> function, or by collecting single field assignments to an <em>existing</em> document and issuing a concise update. Careful here! * <a href="https://github.com/saintedlama/restify-mongoose/blob/master/index.js">Some frameworks</a> will just load the document first, even though PATCH is supposed to be able to handle things in a more efficient way. Other * <a href="https://rahmanfadhil.com/express-rest-api/">blogs</a> might advise you to build it yourself in the same way. This is because Mongoose, by default, tracks the state of the document entity in memory and then does the math to update or create the document backing the entity.</p><p>The culprit here is not frameworks, actually - it’s the consumer. Uninformed or under-curious consumers of the frameworks will cause extra back-end round trips, concurrency issues, or even data loss. And though there are applications that would not be badly impacted by these nuanced issues, I don’t like to risk it. </p><p>Resolving this requires that you build an update outside of the document instance itself. Create an update command and issue it directly to the underlying database. If using something like Mongoose, this means calling <code>updateOne()</code> at the <a href="https://mongoosejs.com/docs/api.html#model_Model.updateOne">Model</a> level, something along the lines of the code below. It does not have to <code>find()</code> or <code>save()</code> anything ahead of updating the document and therefore saves that round-trip. 
Consequently, it also makes no assumptions about the current values of other fields, and leaves any unmentioned field values intact.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line"><span class="comment">// ... in your handler code:</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// Get id and patch values from your REST somehow</span></span><br><span class="line">    <span class="keyword">const</span> id = req.<span class="property">params</span>.<span class="property">id</span>; <span class="comment">// or whatever your REST API gives you</span></span><br><span class="line">    <span class="keyword">const</span> patchValues = <span class="title class_">JSON</span>.<span class="title function_">parse</span>(req.<span class="property">body</span>); <span class="comment">// eg: &#123; age: 23&#125;</span></span><br><span class="line">    </span><br><span class="line"><span class="comment">// ... validate somehow, then: </span></span><br><span class="line">    <span class="keyword">const</span> updateResults = <span class="keyword">await</span> <span class="title class_">Person</span>.<span class="title function_">updateOne</span>(&#123; <span class="attr">_id</span>: id &#125;, patchValues);</span><br><span class="line">    </span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(updateResults);</span><br><span class="line"><span class="comment">// &#123; n: 1, nModified: 1, ok: 1 &#125; ... or something to that effect.</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This handles the simple cases. To support more elaborate manipulation, such as adding&#x2F;removing array items or nested fields, you would want to explore more expressive PATCH parsing so that nuanced intents are clear. 
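As a sketch of what such parsing could look like - the operation names and shapes below follow the JSON-Patch style but are my own simplified assumption, not a full RFC 6902 implementation - a handful of ops can be translated into a single surgical update document:

```javascript
// Sketch: translate a tiny, assumed subset of JSON-Patch-style operations
// into a MongoDB update document. Not a full RFC 6902 implementation.
function patchOpsToUpdate(ops) {
  const update = {};
  for (const op of ops) {
    // '/profile/age' -> 'profile.age' (dot notation for nested fields)
    const path = op.path.replace(/^\//, '').replace(/\//g, '.');
    if (op.op === 'add' || op.op === 'replace') {
      (update.$set = update.$set || {})[path] = op.value;
    } else if (op.op === 'remove') {
      (update.$unset = update.$unset || {})[path] = '';
    }
  }
  return update;
}

const update = patchOpsToUpdate([
  { op: 'replace', path: '/age', value: 23 },
  { op: 'remove', path: '/nickname' }
]);
console.log(update);
```

The resulting document can then be handed straight to a Model-level <code>updateOne()</code> as in the snippet above, keeping the slam semantics out of the picture entirely.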
You may want to take a look at <a href="http://jsonpatchjs.com/">jsonpatch.js</a> for inspiration.</p><blockquote><p>Why doesn’t <code>update</code> clarify the semantics? AFAIK: For historical reasons. To remedy this, official MongoDB drivers expose a <strong>replaceOne</strong> command (exact naming depends on language) which more precisely describes the operation when you intend to slam the existing document, leaving <code>update</code> and <code>updateOne</code> to convey surgical updates. This makes programs easier to read, but does not prevent you from issuing a slam using <code>update</code>, so you will need to take care of it manually, especially since the <code>update</code> signature still allows “slam” values to be provided.</p></blockquote><p>Hopefully, this post revealed nothing new. But if you are unsure how your REST API maps to backend DB mutation commands, it’s worth taking a look.</p><blockquote><p> <strong>*</strong> <em>Examples provided to illustrate the topic of this post only. They are not in any way meant to offend or diminish the contribution of respective referenced codebase authors.</em> </p></blockquote>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;Is your document “broken” inside your database? 
Sure, from time to time a value on a document needs to be updated, because the real worl</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Update" scheme="https://www.plusnconsulting.com/tags/Update/"/>
    
    <category term="REST" scheme="https://www.plusnconsulting.com/tags/REST/"/>
    
  </entry>
  
  <entry>
    <title>Positional Array Indexing in MongoDB</title>
    <link href="https://www.plusnconsulting.com/post/Positional-Array-Indexing-in-MongoDB/"/>
    <id>https://www.plusnconsulting.com/post/Positional-Array-Indexing-in-MongoDB/</id>
    <published>2019-02-26T16:33:32.000Z</published>
    <updated>2019-02-26T17:59:26.407Z</updated>
    
<content type="html"><![CDATA[<h2 id="Overview"><a href="#Overview" class="headerlink" title="Overview"></a>Overview</h2><p>Did you know that MongoDB lets you index a specific positional element in an array? Well, it does! I stumbled across this feature while reading the <code>mongod</code> source code during an unrelated exploration. It can help you create more efficient indexes for specific scenarios.</p><h2 id="How-it-Works"><a href="#How-it-Works" class="headerlink" title="How it Works"></a>How it Works</h2><p>What is that good for, you ask? Consider the following scenario:</p><p>Your company makes widgets as kits. Each widget is made of parts, each of which can also be sold separately. When a customer buys a “product” they are actually buying a box of items - a list of parts where the first one is by convention the “widget” and the rest are the parts of that widget. The schema for the <em>things</em> collection might look like this:</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span>  </span><br><span class="line">    _id<span class="punctuation">:</span> <span class="number">123</span><span class="punctuation">,</span></span><br><span class="line">    name<span class="punctuation">:</span> &#x27;Widget A&#x27;<span class="punctuation">,</span></span><br><span class="line">    kit_sku<span class="punctuation">:</span> &#x27;jk&#x27;<span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;parts&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span>name<span class="punctuation">:</span> &#x27;widget a&#x27;<span class="punctuation">,</span> sku<span class="punctuation">:</span> &#x27;abc&#x27; <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span
class="line">        <span class="punctuation">&#123;</span>name<span class="punctuation">:</span> &#x27;hex key&#x27;<span class="punctuation">,</span> sku<span class="punctuation">:</span> &#x27;xyz&#x27; <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="punctuation">&#123;</span>name<span class="punctuation">:</span> &#x27;obscure manual in Klingon&#x27;<span class="punctuation">,</span> sku<span class="punctuation">:</span> &#x27;qwe&#x27; <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="punctuation">&#123;</span>name<span class="punctuation">:</span> &#x27;widget stand&#x27;<span class="punctuation">,</span> sku<span class="punctuation">:</span> &#x27;cbc&#x27; <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>The schema above has a field <code>kit_sku</code> which contains a unique selling id that folks can order. The <code>parts</code> array lists all items in the kit. The company sells other kits featuring the same widget, just with different extra parts. For example, there might be an obscure manual in Klingon for one kit, and a different kit with instructions in ancient Phoenician (you know, because somebody called and complained that they need one).</p><p>The catalog application needs to call up the widget “widget a”. Except, it doesn’t really. It knows that the widget is sold under the sku <em>abc</em> in a variety of kits and that the widget is by convention the first item in the array. 
So the <code>kit_sku</code> is <em>not</em> useful here.</p><p>The query could look something like this:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">things</span>.<span class="title function_">find</span>(&#123; <span class="string">&#x27;parts.0.sku&#x27;</span>: <span class="string">&#x27;abc&#x27;</span> &#125;)</span><br></pre></td></tr></table></figure><p>This is a frequent enough query, so we’d want to support it with an index. But while indexing on <code>&#39;parts.sku&#39;</code> will produce a usable index, it is not necessarily optimal. Consider a long list of parts for each kit across a large collection. The index would have to contain a key for every part, and point to all documents. This is both wasteful and misses the point. It is wasteful because memory and disk would have to hold entries that are never useful - we only ever look for the first element of the parts list. And it misses the point for the same reason: we create this index only to satisfy queries interested in the <em>first element</em> of the array, not in any part in the list.</p><p>You can create an index that includes a positional path, though. 
Here:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">things</span>.<span class="title function_">createIndex</span>(&#123;<span class="string">&#x27;parts.0.sku&#x27;</span>:<span class="number">1</span>&#125;)</span><br></pre></td></tr></table></figure><p>The index created above includes the positional marker <code>0</code>, telling MongoDB to index only the <code>sku</code> of the first element of the <code>parts</code> array!</p><p>Given the index above, running our query will make use of this index, as long as we use the path to the first element.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">things</span>.<span class="title function_">find</span>(&#123; <span class="string">&#x27;parts.0.sku&#x27;</span>: <span class="string">&#x27;abc&#x27;</span> &#125;).<span class="title function_">explain</span>()</span><br></pre></td></tr></table></figure><p>The query plan looks something like this:</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line">       <span class="attr">&quot;queryPlanner&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                ...</span><br><span class="line">                <span class="attr">&quot;winningPlan&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                        <span class="attr">&quot;stage&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;FETCH&quot;</span><span class="punctuation">,</span></span><br><span class="line">                        <span class="attr">&quot;inputStage&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                                <span class="attr">&quot;stage&quot;</span> <span
class="punctuation">:</span> <span class="string">&quot;IXSCAN&quot;</span><span class="punctuation">,</span></span><br><span class="line">                                <span class="attr">&quot;keyPattern&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                                        <span class="attr">&quot;parts.0.sku&quot;</span> <span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line">                                <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">                                <span class="attr">&quot;indexName&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;parts.0.sku_1&quot;</span><span class="punctuation">,</span></span><br><span class="line">                                <span class="attr">&quot;isMultiKey&quot;</span> <span class="punctuation">:</span> <span class="literal"><span class="keyword">false</span></span><span class="punctuation">,</span></span><br><span class="line">                                <span class="attr">&quot;multiKeyPaths&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                                        <span class="attr">&quot;parts.0.sku&quot;</span> <span class="punctuation">:</span> <span class="punctuation">[</span> <span class="punctuation">]</span></span><br><span class="line">                                <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">                                ...</span><br><span class="line">                                <span class="attr">&quot;direction&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;forward&quot;</span><span class="punctuation">,</span></span><br><span class="line">                                <span 
class="attr">&quot;indexBounds&quot;</span> <span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">                                        <span class="attr">&quot;parts.0.sku&quot;</span> <span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">                                                <span class="string">&quot;[1.0, 1.0]&quot;</span></span><br><span class="line">                                        <span class="punctuation">]</span></span><br><span class="line">                                <span class="punctuation">&#125;</span></span><br><span class="line">                        <span class="punctuation">&#125;</span></span><br><span class="line">                <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">                <span class="attr">&quot;rejectedPlans&quot;</span> <span class="punctuation">:</span> <span class="punctuation">[</span> <span class="punctuation">]</span></span><br><span class="line">        <span class="punctuation">&#125;</span><span class="punctuation">,</span>...</span><br><span class="line">        <span class="attr">&quot;ok&quot;</span> <span class="punctuation">:</span> <span class="number">1</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>The plan above shows a few reassuring features. First and foremost, an index is used for our query, as evidenced by the winning plan’s <code>&quot;IXSCAN&quot;</code> indicator. This is good.</p><p>Second, note that <code>&quot;isMultiKey&quot;</code> reports <code>false</code>. When you index an array field, MongoDB creates a multi-key index. A multi-key index includes a pointer to the document for each element in the array, hence the name. 
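As a simplified sketch (my own illustration of the idea, not mongod's actual key-extraction code), compare the keys an index would derive from one kit document for the array path versus the positional path:

```javascript
// Sketch only: the index keys derived from one document for an array
// path ('parts.sku', multi-key) vs. a positional path ('parts.0.sku').
const doc = {
  _id: 123,
  parts: [{ sku: 'abc' }, { sku: 'xyz' }, { sku: 'qwe' }, { sku: 'cbc' }]
};

// Multi-key: one index key per array element, all pointing at this document.
const multiKeys = doc.parts.map(p => p.sku);

// Positional: a single key taken from the first element only.
const positionalKey = doc.parts[0].sku;

console.log(multiKeys);     // [ 'abc', 'xyz', 'qwe', 'cbc' ]
console.log(positionalKey); // 'abc'
```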
Non-multiKey indexes contain only one pointer to a given document from a single key, since the indexed field contains only one value. This affects query processing because with a multi-key index, MongoDB needs to work a bit harder to eliminate duplicates. <em>TL;DR</em>: when you use the positional index on an array element, it is not a multi-key index.</p><p>The third thing to note is the path. The index will be useful for queries that include the specific path <code>&quot;parts.0.sku&quot;</code>. Queries that do not include the positional path part will NOT use this index:</p><figure class="highlight awk"><table><tr><td class="code"><pre><span class="line">db.things.find(&#123; <span class="string">&#x27;parts.sku&#x27;</span>: <span class="string">&#x27;abc&#x27;</span> &#125;).explain()</span><br><span class="line"><span class="regexp">//</span> ...</span><br><span class="line"><span class="regexp">//</span> <span class="string">&quot;parsedQuery&quot;</span> : &#123;</span><br><span class="line"><span class="regexp">//</span>         <span class="string">&quot;parts.sku&quot;</span> : &#123;</span><br><span class="line"><span class="regexp">//</span>                 <span class="string">&quot;$eq&quot;</span> : <span class="string">&quot;abc&quot;</span></span><br><span class="line"><span class="regexp">//</span>         &#125;</span><br><span class="line"><span class="regexp">//</span> &#125;,</span><br><span class="line"><span class="regexp">//</span> <span class="string">&quot;winningPlan&quot;</span> : &#123;</span><br><span class="line"><span class="regexp">//</span>         <span class="string">&quot;stage&quot;</span> : <span class="string">&quot;COLLSCAN&quot;</span>,</span><br><span class="line"><span class="regexp">//</span>         <span class="string">&quot;filter&quot;</span> : &#123;</span><br><span class="line"><span class="regexp">//</span>                 <span class="string">&quot;parts.sku&quot;</span> : &#123;</span><br><span
class="line"><span class="regexp">//</span>                         <span class="string">&quot;$eq&quot;</span> : <span class="string">&quot;abc&quot;</span></span><br><span class="line"><span class="regexp">//</span>                 &#125;</span><br><span class="line"><span class="regexp">//</span> ...</span><br></pre></td></tr></table></figure><p>The path used in the query above <em>did not</em> target the first positional element in the array, and therefore the query planner chose a full collection scan <code>&quot;COLLSCAN&quot;</code>. It would not be correct to use the positional array index here, because the general query asks <em>are there any elements in the array with sku “abc”</em>. The positional index we created doesn’t include keys for all elements, so it would be “missing” documents if MongoDB were to use the positional index for this query.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>If your query targets a specific known positional array element, this indexing strategy can buy you a performance boost. This index will be smaller than one on the whole array and will be a non-multiKey one, reducing the work the server has to do to process your query.</p><p>MongoDB’s support for this and other interesting indexing strategies is pretty phenomenal. It gives DBAs and developers finer surgical indexing capabilities to support a wide range of schema and query scenarios.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;Overview&quot;&gt;&lt;a href=&quot;#Overview&quot; class=&quot;headerlink&quot; title=&quot;Overview&quot;&gt;&lt;/a&gt;Overview&lt;/h2&gt;&lt;p&gt;Did you know that MongoDB lets you index a spe</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Performance" scheme="https://www.plusnconsulting.com/tags/Performance/"/>
    
  </entry>
  
  <entry>
    <title>Unwind MongoDB Array - A Pivotal Pipeline Stage</title>
    <link href="https://www.plusnconsulting.com/post/Unwind-MongoDB-Array-A-Pivotal-Pipeline-Stage/"/>
    <id>https://www.plusnconsulting.com/post/Unwind-MongoDB-Array-A-Pivotal-Pipeline-Stage/</id>
    <published>2019-01-31T17:46:48.000Z</published>
    <updated>2019-01-31T22:15:51.314Z</updated>
    
<content type="html"><![CDATA[<p>When your document contains an array, <code>$unwind</code> is the aggregation pipeline stage that you most likely need. <code>$unwind</code> lets you un-pivot array elements, so that you can perform specific operations on each (or some) array elements independently. </p><h2 id="Why-Unwind"><a href="#Why-Unwind" class="headerlink" title="Why Unwind?"></a>Why Unwind?</h2><p>In relational databases, we used to need to pivot rows into columns. Say we had an IoT <code>readings</code> table like this one:</p><table><thead><tr><th>Device</th><th>Measure</th><th>Reading</th><th>Time</th></tr></thead><tbody><tr><td>1</td><td>‘Temp’</td><td>42</td><td>1:00</td></tr><tr><td>1</td><td>‘Speed’</td><td>91</td><td>1:00</td></tr><tr><td>2</td><td>‘Speed’</td><td>94</td><td>1:00</td></tr><tr><td>2</td><td>‘Temp’</td><td>33</td><td>1:00</td></tr><tr><td>2</td><td>‘Temp’</td><td>37</td><td>2:00</td></tr></tbody></table><p>And we want a query to return results such as:</p><table><thead><tr><th>Device</th><th>Speed</th><th>Temp</th><th>Time</th></tr></thead><tbody><tr><td>1</td><td>91</td><td>42</td><td>1:00</td></tr><tr><td>2</td><td>94</td><td>33</td><td>1:00</td></tr><tr><td>2</td><td></td><td>37</td><td>2:00</td></tr></tbody></table><p>We can achieve that with a SQL pivoting function. The need to pivot stems from the model, in which each measure is a “soft property” stored as a row in the data table. We didn’t want to create a column for each measure, and we didn’t want to add a column to the table each time a new measure is needed.</p><p>In MongoDB (and any other document oriented database), the model for a single <code>reading</code> document might look more like the one below. Each document is a point in time that contains a bunch of measures. If ever a new measure was added - no problem! An extra element can be added to the array with no schema change. If ever a reading had more or fewer measurements at a point in time - no problem either. 
The array contains a set of readings which can grow or shrink.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123;</span><br><span class="line">    <span class="attr">device</span>: <span class="number">1</span>,</span><br><span class="line">    <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T01:00:00Z&#x27;</span>),</span><br><span class="line">    <span class="attr">readings</span>: [&#123;</span><br><span class="line">            <span class="attr">name</span>: <span class="string">&#x27;speed&#x27;</span>,</span><br><span class="line">            <span class="attr">value</span>: <span class="number">91</span></span><br><span class="line">        &#125;,</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="attr">name</span>: <span class="string">&#x27;temp&#x27;</span>,</span><br><span class="line">            <span class="attr">value</span>: <span class="number">42</span></span><br><span class="line">        &#125;,</span><br><span class="line">    ]</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>With this document model, we would not need to do anything special for the report of readings by hour. The document already contains relevant measurements for each time period.</p><p>But if we wanted to calculate the average temperature across the devices, we’d need to do some digging. 
We don’t want to average <em>every</em> <code>readings.value</code> field - only the ones labeled <code>&#39;temp&#39;</code>.</p><p>Using the aggregation framework and <code>$unwind</code> we can do this in 3 steps:</p><ol><li>Unwind the <code>readings</code> array.</li><li>Filter out anything except <code>&#39;temp&#39;</code>.</li><li>Perform <code>$avg</code> on the values.</li></ol><h2 id="How-to-Unwind"><a href="#How-to-Unwind" class="headerlink" title="How to Unwind?"></a>How to Unwind?</h2><p>Let’s do this step by step. First, just unwind the array</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;<span class="attr">$unwind</span>:<span class="string">&#x27;$readings&#x27;</span>&#125;</span><br><span class="line">]);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* </span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 1, &quot;readings&quot; : &#123; &quot;name&quot; : &quot;speed&quot;, &quot;value&quot; : 91 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T01:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 1, &quot;readings&quot; : &#123; &quot;name&quot; : &quot;temp&quot;,  &quot;value&quot; : 42 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T01:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 2, &quot;readings&quot; : &#123; &quot;name&quot; : &quot;speed&quot;, &quot;value&quot; : 94 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T01:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">...</span></span><br><span class="line"><span 
class="comment">*/</span></span><br></pre></td></tr></table></figure><p>The result of <code>$unwind</code> is that for each original document’s <code>readings</code> array element, a new document is produced that contains all the fields from the original document, but with a single element assigned as the value of <code>readings</code>. As you can see above, after <code>$unwind</code> the field <code>readings</code> became a single-valued field rather than the original array. Device <code>1</code> had 2 elements in the original document, so in the <code>$unwind</code> output we have 2 <em>documents</em>, each attached to one <em>element</em> from the original array. One document contains a single reading for the measurement named <code>speed</code>, the other contains a reading for the temperature measurement named <code>temp</code>.</p><p>Next step - we want only the temperature readings, getting rid of any <code>speed</code> or other measurements that may exist. We can do this with the <code>$match</code> pipeline stage. 
<code>$match</code> filters documents and only emits ones that match the criteria supplied - just like <code>find()</code> does.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$unwind</span>: <span class="string">&#x27;$readings&#x27;</span>&#125;,</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="string">&#x27;readings.name&#x27;</span>: <span class="string">&#x27;temp&#x27;</span> &#125;&#125;</span><br><span class="line">]);</span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 1,  &quot;readings&quot; : &#123; &quot;name&quot; : &quot;temp&quot;, &quot;value&quot; : 42 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T01:00:00Z&quot;), &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 2,  &quot;readings&quot; : &#123; &quot;name&quot; : &quot;temp&quot;, &quot;value&quot; : 33 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T01:00:00Z&quot;), &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 2,  &quot;readings&quot; : &#123; &quot;name&quot; : &quot;temp&quot;, &quot;value&quot; : 37 &#125;, &quot;time&quot; : ISODate(&quot;2019-01-31T02:00:00Z&quot;), &#125;</span></span><br><span class="line"><span class="comment">...</span></span><br><span class="line"><span class="comment">*/</span></span><br></pre></td></tr></table></figure><p>This got rid of any measurements that were not labeled <code>temp</code>. 
With temperature-only values in hand, we can now calculate the average temperature. The <code>$group</code> pipeline operator will let us calculate an average for each device:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$unwind</span>: <span class="string">&#x27;$readings&#x27;</span>&#125;,</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="string">&#x27;readings.name&#x27;</span>: <span class="string">&#x27;temp&#x27;</span> &#125;&#125;,</span><br><span class="line">    &#123; <span class="attr">$group</span>: &#123; <span class="attr">_id</span>: <span class="string">&#x27;$device&#x27;</span>, <span class="attr">averageTemperature</span>: &#123;<span class="attr">$avg</span>: <span class="string">&#x27;$readings.value&#x27;</span>&#125;&#125;&#125;</span><br><span class="line">]);</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 2, &quot;averageTemperature&quot; : 35 &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 1, &quot;averageTemperature&quot; : 42 &#125;</span></span><br><span class="line"><span class="comment">*/</span></span><br></pre></td></tr></table></figure><p>The behavior of <code>$unwind</code> evolved over time, matching common usage patterns and expectations the community found useful. 
Several scenarios and special cases arise when unwinding arrays:</p><ul><li>What if the document contains a field with a single value - not an array?</li><li>What if the array has no elements?</li><li>What if the document contains no such field at all?</li></ul><p>All these need to be handled in <em>some</em> way, and understanding the behavior would help you avoid fluky results.</p><p>Since  MongoDB version 3.2, <code>$unwind</code> treats a target field containing a scalar or non-array value as if it were a single element array containing that one non-array value. These two documents are treated the same way by <code>$unwind</code>:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123;</span><br><span class="line">    <span class="attr">device</span>: <span class="number">3</span>,</span><br><span class="line">    <span class="attr">readings</span>: &#123; <span class="attr">name</span>: <span class="string">&#x27;temp&#x27;</span>, <span class="attr">value</span>: <span class="number">29</span> &#125;,</span><br><span class="line">    <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T03:00:00Z&#x27;</span>)</span><br><span class="line">&#125;</span><br><span class="line">&#123;</span><br><span class="line">    <span class="attr">device</span>: <span class="number">4</span>,</span><br><span class="line">    <span class="attr">readings</span>: [ &#123; <span class="attr">name</span>: <span class="string">&#x27;temp&#x27;</span>, <span class="attr">value</span>: <span class="number">29</span> &#125; ],</span><br><span class="line">    <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T03:00:00Z&#x27;</span>)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Device 3 above has a single reading in the <code>readings</code> field - not an array. 
Device 4 has a single-element array for <code>readings</code>. Since <code>$unwind</code> treats them the same, we don’t need fancy <code>$project</code> and <code>$cond</code> checks to normalize input documents (Squee!). If we added these documents to the readings collection and ran the same 3-stage pipeline as before, we’d get the results we expect. Prior to MongoDB 3.2, a non-array field handed to <code>$unwind</code> would have produced an error.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$unwind</span>: <span class="string">&#x27;$readings&#x27;</span>&#125;,</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="string">&#x27;readings.name&#x27;</span>: <span class="string">&#x27;temp&#x27;</span> &#125;&#125;,</span><br><span class="line">    &#123; <span class="attr">$group</span>: &#123; <span class="attr">_id</span>: <span class="string">&#x27;$device&#x27;</span>, <span class="attr">averageTemperature</span>: &#123;<span class="attr">$avg</span>: <span class="string">&#x27;$readings.value&#x27;</span>&#125;&#125;&#125;</span><br><span class="line">]);</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 3, &quot;averageTemperature&quot; : 37 &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 2, &quot;averageTemperature&quot; : 35 &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 4, &quot;averageTemperature&quot; : 29 &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : 1, &quot;averageTemperature&quot; : 42 &#125;</span></span><br><span 
class="comment">*/</span></span><br></pre></td></tr></table></figure><h2 id="The-Nitty-Gritty"><a href="#The-Nitty-Gritty" class="headerlink" title="The Nitty Gritty"></a>The Nitty Gritty</h2><p>Now what about a missing field, an empty array, or a field containing null? Here are examples of these documents:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123; <span class="attr">device</span>: <span class="number">10</span>, <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T06:00:00Z&#x27;</span>) &#125; </span><br><span class="line">&#123; <span class="attr">device</span>: <span class="number">10</span>, <span class="attr">readings</span>: <span class="literal">null</span>, <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T07:00:00Z&#x27;</span>) &#125; </span><br><span class="line">&#123; <span class="attr">device</span>: <span class="number">10</span>, <span class="attr">readings</span>: [], <span class="attr">time</span>: <span class="title class_">ISODate</span>(<span class="string">&#x27;2019-01-31T08:00:00Z&#x27;</span>) &#125;</span><br></pre></td></tr></table></figure><p>If we ran <code>$unwind</code> on those documents, <strong>none</strong> of them would be emitted past $unwind. 
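</p><p>As an aside, these emission rules can be modeled in a few lines of plain JavaScript. The sketch below is an illustration of the documented behavior only - it is not MongoDB’s implementation, and the helper name <code>unwind</code> is made up for this demo:</p>

```javascript
// A plain-JavaScript model of the $unwind emission rules described above.
// Illustration of the documented behavior (MongoDB 3.2+), not MongoDB code;
// `unwind` is a made-up helper name for this demo.
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    const noValue = !(field in doc) || value === null || value === undefined;
    // A scalar is treated as a single-element array (the 3.2+ behavior).
    const elements = noValue ? [] : Array.isArray(value) ? value : [value];
    if (elements.length === 0) {
      // Missing field, null, or empty array: emit nothing by default.
      if (preserveNullAndEmptyArrays) {
        const copy = { ...doc };
        // A missing field stays missing; an empty array is dropped;
        // an explicit null is kept as-is.
        if (Array.isArray(value)) delete copy[field];
        out.push(copy);
      }
      continue;
    }
    // One output document per array element.
    for (const element of elements) {
      out.push({ ...doc, [field]: element });
    }
  }
  return out;
}
```

<p>Running <code>unwind(docs, 'readings')</code> over the three device 10 documents above emits nothing, while passing <code>{ preserveNullAndEmptyArrays: true }</code> emits each of them once.</p><p>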
The example below passes documents with <code>device</code> &#x3D;&#x3D; <code>10</code> to <code>$unwind</code>, but since none of them had a “useful” <code>readings</code> value to attach to an output document - none of them got emitted:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="attr">device</span>: <span class="number">10</span>&#125;&#125;,</span><br><span class="line">    &#123; <span class="attr">$unwind</span>: <span class="string">&#x27;$readings&#x27;</span>&#125;,</span><br><span class="line">])</span><br><span class="line"><span class="comment">// No documents returned</span></span><br></pre></td></tr></table></figure><p>This is just the defined behavior, though. The word “useful” is our choice: in our case, we don’t want an empty or missing reading to pollute our averaging math. Other use cases may exist where you actually want every document entering <code>$unwind</code> to be emitted forward even if it doesn’t have an array element to contribute. If you want that behavior, you can set the option <code>preserveNullAndEmptyArrays</code> to <code>true</code>. To supply this option though, you need to use a slightly different argument to <code>$unwind</code>. Until now, we gave <code>$unwind</code> a bare value-of-field argument <code>$&lt;field name&gt;</code>. The more elaborate syntax is <code>&#123;$unwind: &#123; path: &#39;$&lt;field name&gt;&#39;&#125;&#125;</code>. 
The following two forms are identical in meaning:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">&#123; <span class="attr">$unwind</span>: <span class="string">&#x27;$readings&#x27;</span> &#125;</span><br><span class="line">&#123; <span class="attr">$unwind</span>: &#123; <span class="attr">path</span>: <span class="string">&#x27;$readings&#x27;</span> &#125;&#125;</span><br></pre></td></tr></table></figure><p>The latter expression is where we can add our <code>preserveNullAndEmptyArrays</code> option. Putting it all together looks like this:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="attr">device</span>: <span class="number">10</span>&#125;&#125;,</span><br><span class="line">    &#123; <span class="attr">$unwind</span>: &#123;</span><br><span class="line">        <span class="attr">path</span>: <span class="string">&#x27;$readings&#x27;</span>,</span><br><span class="line">        <span class="attr">preserveNullAndEmptyArrays</span>: <span class="literal">true</span></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;,</span><br><span class="line">])</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 10, &quot;time&quot; : ISODate(&quot;2019-01-31T06:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 10, &quot;readings&quot; : null, &quot;time&quot; : ISODate(&quot;2019-01-31T07:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">&#123; &quot;_id&quot; : ..., &quot;device&quot; : 10, &quot;time&quot; : 
ISODate(&quot;2019-01-31T08:00:00Z&quot;) &#125;</span></span><br><span class="line"><span class="comment">*/</span></span><br></pre></td></tr></table></figure><p>This time, a single document was emitted for each of the incoming device 10 documents. For each of those 3 documents, MongoDB determined that the incoming document should be emitted at least once. But what value should the output document have? In the case of the document containing <code>readings: null</code>, Mongo emitted a null in the <code>readings</code> field. For the ones with a missing field or an empty array, <code>$unwind</code> emitted a document with no <code>readings</code> field at all. The notion is that a missing field should be preserved - no invention of a non-existing field. As for an existing but empty array - think of it as “a document with no value there”. Emitting a field would have required a value to be assigned to the field. The value <code>null</code> is <em>still a value</em> and is not the same as <em>not having any value</em>. If Mongo had emitted a <code>&#123;readings: null&#125;</code> field in response to an empty array, that would be like saying that the original document <em>had</em> an array element <code>null</code> - which is not the case. Bottom line: the behavior is well defined. We can leverage it when we need to, and we can also manipulate the fields further using other pipeline operators.</p><h2 id="Unwind-Without-unwind"><a href="#Unwind-Without-unwind" class="headerlink" title="Unwind Without $unwind?"></a>Unwind Without $unwind?</h2><p>The three-step pipeline to calculate the average is not very difficult to read, understand, or produce. But the real need is expressed in the <code>$group</code> stage. 
The filtering and unwinding stages add verbiage - but not for nothing.</p><p>Tempted to shorten the syntax, one might reach for some of the projection operators, such as <code>$filter</code> or <code>$arrayToObject</code>. A <code>$filter</code> example would go something like this:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">readings</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123; <span class="attr">$match</span>: &#123; <span class="attr">readings</span>: &#123; <span class="attr">$type</span>:<span class="number">4</span> &#125; &#125; &#125;,</span><br><span class="line">    &#123; <span class="attr">$project</span>: &#123;</span><br><span class="line">        <span class="attr">temperature</span>: &#123;</span><br><span class="line">            <span class="attr">$filter</span>: &#123;</span><br><span class="line">                <span class="attr">input</span>: <span class="string">&#x27;$readings&#x27;</span>,</span><br><span class="line">                <span class="attr">as</span>: <span class="string">&#x27;elm&#x27;</span>,</span><br><span class="line">                <span class="attr">cond</span>: &#123;</span><br><span class="line">                     <span class="attr">$eq</span>: [ <span class="string">&#x27;$$elm.name&#x27;</span>, <span class="string">&#x27;temp&#x27;</span> ]</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;])</span><br></pre></td></tr></table></figure><p>Here we pluck only the <code>&#39;temp&#39;</code> elements out of the <code>readings</code> array. This works, but it must be preceded by something to ensure that <code>readings</code> is actually an array. 
Unlike the more forgiving <code>$unwind</code>, <code>$filter</code> insists that the input field is an array or it produces an error. That’s why I included a <code>$match</code> that only passes along documents where that condition holds true (See <a href="http://bsonspec.org/spec.html">BSON spec</a> for type information. 0x04 is Array). This seems almost palatable, but would still require more syntax to transform the one-element array to something we can use directly for the math. So since the amount of text required to achieve the goal is not reduced, and the overall pipeline does not become more readable, I see little reason to go to those lengths.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p><code>$unwind</code> is an important pipeline stage in the document-oriented world. Dealing with arrays and aggregation will undoubtedly get you using <code>$unwind</code> sooner or later. In a <a href="./MongoDB-3-2-Goodies-Coming-Your-Way-More-ways-to-unwind.html">previous post</a> we focused on data scenarios where this all made sense. In this post, we focused more on the balance between syntax and the meaning of the stage and related operators. Either way, it’s well worth learning the nuanced behavior of this key pipeline stage.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;When your document contains an array, &lt;code&gt;$unwind&lt;/code&gt; is the aggregation pipeline stage that you most likely need. &lt;code&gt;$unwind&lt;/co</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Data" scheme="https://www.plusnconsulting.com/tags/Data/"/>
    
    <category term="Database" scheme="https://www.plusnconsulting.com/tags/Database/"/>
    
  </entry>
  
  <entry>
    <title>Better Cosmos DB Cost Sharing</title>
    <link href="https://www.plusnconsulting.com/post/Better-Cosmos-DB-Cost-Sharing/"/>
    <id>https://www.plusnconsulting.com/post/Better-Cosmos-DB-Cost-Sharing/</id>
    <published>2019-01-07T23:11:28.000Z</published>
    <updated>2019-01-11T00:39:02.203Z</updated>
    
    <content type="html"><![CDATA[<p>Until recently, Cosmos DB’s pricing model was based on per-collection provisioning. This model was a good fit for single-entity applications using Cosmos as a flow-through data point collection, or other narrow, well-defined collection usage.</p><p>But for applications spanning multiple entities with broader usage patterns, this could prove cost prohibitive: Each document type or entity would naturally have been designed to go into its own collection. But with the minimum price for the smallest collection at around $30 a month for 400 RU, an enterprise would think twice before designing an application that used Cosmos as a full back-end, because low-usage collections would steadily add to the solution cost. If nothing else, this pricing strategy flew in the face of the code-first benefits which document-oriented databases promise.</p><p>Worse, this pricing led some to use a single collection for all entities of all types. While that move saved money by not over-provisioning collections for small, infrequent documents, it fostered no separation of concerns in the data and did not promote separation of concerns in code, as microservices otherwise would.</p><h2 id="Good-News-Everyone"><a href="#Good-News-Everyone" class="headerlink" title="Good News Everyone!"></a>Good News Everyone!</h2><p>This is all in the past!</p><p>Cosmos DB now lets you provision throughput at the database level. Collections created under that database will then share the cost based on that provisioned RU level. You can still provision specific collections with specific RU to meet your needs. But if you do not, then all the collections in that database will share the provisioned throughput.</p><p>To do this, you will start by creating a Cosmos DB account. 
Once an account is created, you can go to the data explorer and then hit the “New Database” link at the top.</p><p><img src="/images/cosmos-shared-db-provisioning/share-db-provisioning-data-explorer.png" alt="Create Database" title="Create Database"></p><p>In the dialog, type the name of your database, and check the “Provision throughput” checkbox.</p><p><img src="/images/cosmos-shared-db-provisioning/share-db-provisioning-new-database.png" alt="Provision Throughput" title="Provision Throughput"></p><p>This will reveal a secondary input where you can enter the desired throughput. The minimum provisioning is 400 RU, but the default value appearing in that box is 1000 RU. Choose the throughput level your application demands - that’s something you have to determine on your own.</p><p><img src="/images/cosmos-shared-db-provisioning/share-db-provisioning-lower-ru.png" alt="Choose RU" title="Choose RU"></p><p>That’s it! Collections created under the shared throughput will share the cost and throughput at the database level. No more kludge code to force all documents of all types into one collection. No more skimping on collection creation for low-volume &#x2F; low-frequency data to save a buck. Sharing the throughput lets you architect and code your application without weirdness coming from the pricing model.</p><p>Kudos to the team that made this possible.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;Until recently, Cosmos DB’s pricing model was based on a per-collection provisioning. This model was a good fit for single-entity applica</summary>
      
    
    
    
    
    <category term="Azure" scheme="https://www.plusnconsulting.com/tags/Azure/"/>
    
    <category term="Cosmos DB" scheme="https://www.plusnconsulting.com/tags/Cosmos-DB/"/>
    
  </entry>
  
  <entry>
    <title>Transactions with MongoDB 4.0 and .Net</title>
    <link href="https://www.plusnconsulting.com/post/Transactions-with-MongoDB-4-0/"/>
    <id>https://www.plusnconsulting.com/post/Transactions-with-MongoDB-4-0/</id>
    <published>2018-10-02T19:27:32.000Z</published>
    <updated>2019-01-18T21:50:36.147Z</updated>
    
    <content type="html"><![CDATA[<h1 id="Multi-Document-Transactions-and-MongoDB"><a href="#Multi-Document-Transactions-and-MongoDB" class="headerlink" title="Multi-Document Transactions and MongoDB"></a>Multi-Document Transactions and MongoDB</h1><p>TL;DR - MongoDB 4.0 supports multi-document transactions!</p><p>When you have 2 mutations (write operations), each affecting 1 document, Mongo used to apply each write as an independent commit. Consequently, there was a point in time in which mutation 1 was applied but 2 was not. If 2 failed for whatever reason, 1 was still applied.</p><p>This behavior caused some pain in attempting to manage all-or-nothing style operations which affect multiple documents. For example, if you had to lend a book to a person, you might have wanted the book marked as lent-out, and the library visitor to have the book appended to their lent-out list.</p><p>If marking the book succeeded but marking the visitor failed, then the book is lent to no one. If the visitor was marked first but marking the book as lent failed, the book could be double-lent. Further, there is a point in time in which the visitor “has” the book, but the book is not yet “held”.</p><p>Now, there are potential structural solutions to this problem in a document-oriented world, but let’s go with transactions. The newly available transaction support lets you wrap a transaction around the 2 operations. I like thinking of a Mongo transaction in terms of visibility:</p><ul><li>Before the “begin-transaction”, other readers see data as it was.</li><li>During the transaction’s lifetime, other readers see data as it was. Mutations happening during the transaction are not visible to readers (there are some choices here - more on that in a bit).</li><li>After the transaction commits, other readers see the results of all mutations.</li></ul><p>Speaking of visibility, one of the core changes that needed to occur in the Mongo engine is marking OpLog entries with a global logical cluster time. 
Why is that important? Because transactions are really about controlling the visibility of written items across the Replica Set. The implementation of point-in-time reads is another key piece of the puzzle. This feature provides a view for reading such that the document states visible to the reader are of the same version they were when the read operation started. Modifications occurring during a long-running operation would not be exposed to the reader, so a consistent view is ensured.</p><p>To scope several operations into a transaction, Mongo relies on the already available session implementation. Sessions existed in 3.6, so the leap is smaller. A session groups a sequence of commands, and is tracked throughout the cluster. It is therefore already suitable for the notion of a transaction. All that was added is a syntax for a developer to pass along a session into the commands themselves. All method signatures that mutate data now accept a session handle in the latest official drivers. From a client perspective (using a supported driver version), creating a transaction looks like:</p><ol><li>Create a session.</li><li>Issue some CRUD operations with the session handle.</li><li>Commit the session.</li></ol><p>Mongo will control the visibility of the constituent operations according to the session settings.</p><p>Consider this session initiation C# code:</p><blockquote><p>Demo code based on .Net driver <strong>mongodb.driver</strong> version <em>2.7.0</em></p></blockquote><figure class="highlight csharp"><table><tr><td class="code"><pre><span class="line"><span class="keyword">using</span> (<span class="keyword">var</span> session = client.StartSession())</span><br><span class="line">&#123;</span><br><span class="line">    session.StartTransaction(<span class="keyword">new</span> TransactionOptions(</span><br><span class="line"></span><br><span class="line">            readConcern: ReadConcern.Snapshot,</span><br><span class="line"></span><br><span 
class="line">            writeConcern: WriteConcern.WMajority));</span><br><span class="line"></span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>A few things are readily apparent from this small code structure.</p><p>A session is a disposable object, so proper disposal is guaranteed by a <code>using</code> clause.</p><p>A session by itself is not a transaction. We explicitly start a transaction by calling the <code>StartTransaction()</code> method. Within a session, only one transaction may be “live” at a given time. Since we are within a <code>using</code> scope, this code has a low risk of breaking that rule.</p><p><code>TransactionOptions</code> describes 2 key parts of the transaction treatment: read and write concerns. The write-concern describes the durability expectation of the mutations. Just like any Replica Set write, it lets us control the risk of roll-back of individual writes in case of primary transitions or other adverse events.</p><p>The read-concern describes the visibility mode of the mutation during the transaction - the time between the start and the would-be commit or abort commands. As mentioned earlier, what happens during the transaction lifetime before it is finished - successful or not - is really what transactions are about.</p><p>The particular setting of <code>ReadConcern.Snapshot</code>, when paired with the write-concern <code>WriteConcern.WMajority</code>, guarantees that any reads occurring as part of a transaction view data that is majority committed. Those reads are “stable” and should not roll back since the majority of nodes already have applied that data. You might be tempted to use a weaker read-concern such as <code>ReadConcern.Local</code> or <code>ReadConcern.Majority</code> for the sake of speed. That choice may not be treated as you expect. For one, Mongo might “upgrade” the concern to a higher one such as snapshot. 
Further, Mongo does not guarantee that the writes won’t be rolled back in the face of adverse cluster events. In case of a rollback, your whole transaction might be rolled back - so what’s the point, really?</p><h2 id="Snapshot"><a href="#Snapshot" class="headerlink" title="Snapshot"></a>Snapshot</h2><p>Snapshot is a read-concern relating to <em>read-your-own-writes</em> and <em>causal consistency</em>. Causal consistency describes a relationship between operations where one <em>causes</em> the other: A read operation returning the value of field <em>count</em> right after a write operation setting <em>count &#x3D; 123</em> expects the count to be <em>123</em>. The write preceding the read <em>caused</em> the mutation, and the reader expects the write to be “the one caused by the preceding operation”. An implied order is what this is about. As mentioned before, one of the underpinnings of transactions is a global timestamp, allowing a strict order of operations. Within a causally consistent session, pure read operations following a write are guaranteed to see the results of that write. It may seem trivial - desirable certainly - but keep in mind that other concurrent writes may occur during <em>your sequence</em> which may affect the state of a document. Causal consistency assures that the state of a read document following a write is seen as <em>that</em> writer’s result.</p><p>In the diagram below, a session with causal consistency ensures the reader sees the results of its preceding write. A session with no causal consistency does not ensure that, and depending on timing may result in Client One reading a document modified by Client Two rather than the result of Client One’s own write.</p><p><img src="/images/transactions-with-mongodb-4-0-seq1.svg" alt="With and without causal relationship"></p><p>The default for creating a session is to create it <em>with</em> causal consistency. The code below creates a session with the default value or explicit option. 
Either of these results in the same behavior.</p><figure class="highlight csharp"><table><tr><td class="code"><pre><span class="line"><span class="comment">// setting CausalConsistency explicitly</span></span><br><span class="line"><span class="keyword">using</span> (<span class="keyword">var</span> session = client.StartSession(<span class="keyword">new</span> ClientSessionOptions &#123; CausalConsistency = <span class="literal">true</span> &#125;))</span><br><span class="line">&#123; ...</span><br><span class="line"></span><br><span class="line"><span class="comment">// setting CausalConsistency implicitly</span></span><br><span class="line"><span class="keyword">using</span> (<span class="keyword">var</span> session = client.StartSession())</span><br><span class="line">&#123; ...</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Now we can state this: A transaction in a causally consistent session with a read-concern of “snapshot” and a write-concern of “majority”, containing reads, will view documents committed to a majority of the nodes. This guarantee level extends to reads within the transaction, such that not only will the transaction writes succeed if majority acknowledged, but the reads within the transaction will also rely only on majority-committed documents according to the snapshot’s time. This shuts down the concern of having a transaction rely on document state which might be rolled back: once majority-committed, it won’t be rolled back. This provides a consistent view, since causal consistency kicks in and guarantees it.</p><h2 id="Code-it-Up"><a href="#Code-it-Up" class="headerlink" title="Code it Up"></a>Code it Up</h2><p>The theory above gives us the background necessary to understand what’s going on. 
The code below implements a multi-document transaction touching 3 documents across 3 different collections.</p><p>The scenario is that we have some <strong>Tool</strong> which can be borrowed by some <strong>Person</strong> and is then logged in the <strong>LendingLedger</strong>. We start by creating a new session. We then perform the sequence of operations inside the transaction:</p><ol><li>Mark the tool as held by the person.</li><li>Check that the tool was indeed found and marked.</li><li>If the tool is already held, does not exist, or the update otherwise failed, we throw an exception, which is then caught and aborts the transaction.</li><li>Add a ledger entry detailing the tool, person, and time the tool was lent out.</li><li>Increase the number of tools the person has by 1.</li></ol><p>Under the cover of a transaction, performing this sequence gives us assurance that either all three entities are modified to satisfaction, or everything is rolled back completely. Further - other concurrent operations would not interfere with the operations happening inside this transaction.</p><p>For a more complete demonstration, <a href="https://github.com/plusn-nuri/mongodb-transactions-csharp">please see my GitHub repo</a>.</p><figure class="highlight csharp"><table><tr><td class="code"><pre><span class="line"><span class="keyword">using</span> (<span class="keyword">var</span> session = client.StartSession(<span class="keyword">new</span> ClientSessionOptions &#123; CausalConsistency = <span class="literal">true</span> &#125;))</span><br><span class="line">&#123;</span><br><span class="line"></span><br><span class="line">    session.StartTransaction(<span class="keyword">new</span> TransactionOptions(</span><br><span class="line">            readConcern: ReadConcern.Snapshot,</span><br><span class="line">            writeConcern: WriteConcern.WMajority));</span><br><span class="line"></span><br><span class="line">    <span class="keyword">try</span></span><br><span 
class="line">    &#123;</span><br><span class="line">        <span class="keyword">var</span> personCollection = db.GetCollection&lt;Person&gt;(<span class="keyword">nameof</span>(Person));</span><br><span class="line">        <span class="keyword">var</span> toolCollection = db.GetCollection&lt;Tool&gt;(<span class="keyword">nameof</span>(Tool));</span><br><span class="line">        <span class="keyword">var</span> lendLogCollection = db.GetCollection&lt;LendingLedger&gt;(<span class="keyword">nameof</span>(LendingLedger));</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">        <span class="keyword">var</span> holdTool = toolCollection.UpdateOne(session,</span><br><span class="line">            Builders&lt;Tool&gt;.Filter.Eq(t =&gt; t.Id, toolId) &amp; Builders&lt;Tool&gt;.Filter.Eq(t =&gt; t.HeldBy, <span class="literal">null</span>),</span><br><span class="line">            Builders&lt;Tool&gt;.Update.Set(t =&gt; t.HeldBy, personId));</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (holdTool.ModifiedCount != <span class="number">1</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="keyword">throw</span> <span class="keyword">new</span> InvalidOperationException(<span class="string">$&quot;Failed updating hold on tool <span class="subst">&#123;toolId&#125;</span>. 
It might be held or non-existent&quot;</span>);</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        lendLogCollection.InsertOne(session, <span class="keyword">new</span> LendingLedger</span><br><span class="line">        &#123;</span><br><span class="line">            ToolId = toolId,</span><br><span class="line">            PersonId = personId,</span><br><span class="line">            CheckOutTime = DateTime.UtcNow</span><br><span class="line">        &#125;);</span><br><span class="line">        </span><br><span class="line">        <span class="keyword">var</span> toolCount = personCollection.UpdateOne(</span><br><span class="line">                    session,</span><br><span class="line">                    Builders&lt;Person&gt;.Filter.Eq(p =&gt; p.Id, personId),</span><br><span class="line">                    Builders&lt;Person&gt;.Update.Inc(p =&gt; p.ToolCount, <span class="number">1</span>)</span><br><span class="line">                    );</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (toolCount.ModifiedCount != <span class="number">1</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="keyword">throw</span> <span class="keyword">new</span> InvalidOperationException(<span class="string">$&quot;Failed updating tool count on person <span class="subst">&#123;personId&#125;</span>&quot;</span>);</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">catch</span> (Exception exception)</span><br><span class="line">    &#123;</span><br><span class="line">        Logger.Error(<span class="string">$&quot;Caught exception during transaction, aborting: <span class="subst">&#123;exception.Message&#125;</span>.&quot;</span>);</span><br><span class="line">        session.AbortTransaction();</span><br><span class="line">    
    <span class="keyword">throw</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    session.CommitTransaction();</span><br><span class="line"></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h1 id="Epilog"><a href="#Epilog" class="headerlink" title="Epilog"></a>Epilog</h1><p>Transactions have been long awaited by some. Others see transactions as a performance and scalability hindrance, placing an undue burden on the core engine. There are performance implications to transactions. Measuring those is tricky because the effect depends on concurrency, velocity, and size of data. Transactions also introduce more controls on timing, with defaults favoring quick transactions and opting to abort rather than consume precious resources. How much overhead will a transaction introduce? I don’t know - better measure it. <a href="https://docs.mongodb.com/manual/core/transactions/index.html">The documentation</a> currently states only a vague warning:</p><blockquote><p>…In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transaction should not be a replacement for effective schema design …</p></blockquote><p>Certainly, something to consider, and I for one definitely model entities with the mindset of embedding where appropriate. After all - if you want completely “independent” entities with cross-references, there’s an app for that… We chose a document-oriented database for its document-oriented nature - let’s leverage that. A good rule of thumb is that if your RDBMS schema was translated into one-collection-per-table in MongoDB - try again.</p><p>Lastly, I should mention that the v4.0 release includes multi-document transactions on replica sets but not on sharded collections. 
Support for transactions on sharded collections is slated for v4.2.</p><p>Happy transacting!</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h1 id=&quot;Multi-Document-Transactions-and-MongoDB&quot;&gt;&lt;a href=&quot;#Multi-Document-Transactions-and-MongoDB&quot; class=&quot;headerlink&quot; title=&quot;Multi-Document</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="Database" scheme="https://www.plusnconsulting.com/tags/Database/"/>
    
  </entry>
  
  <entry>
    <title>The Good, the Bad, and the Join</title>
    <link href="https://www.plusnconsulting.com/post/The-Good-the-Bad-and-the-Join/"/>
    <id>https://www.plusnconsulting.com/post/The-Good-the-Bad-and-the-Join/</id>
    <published>2018-07-06T16:16:51.000Z</published>
    <updated>2018-07-09T17:07:54.107Z</updated>
    
<content type="html"><![CDATA[<p>When modeling in the document-oriented world, one-to-few relationships are often modeled as arrays embedded in the containing entity. Sometimes the relationship is one-to-many, or such that an embedded array would not work well. In those cases, the need for some sort of JOIN comes up.</p><p>The Aggregation Framework in MongoDB gives us a convenient option: <code>$lookup</code>! But <code>$lookup</code> is not JOIN - at least not the one we are familiar with from SQL and relational databases. This distinction is important to understand, as it may affect your modeling and performance.</p><h2 id="What-lookup-Is"><a href="#What-lookup-Is" class="headerlink" title="What $lookup Is"></a>What $lookup Is</h2><p><code>$lookup</code> queries a second collection for each document entering the pipeline stage. Any documents from the secondary collection that pass the lookup condition are attached to an array field on the entering document.</p><p>The behavior differs from SQL’s INNER and LEFT JOIN in its geometry. In SQL, an INNER JOIN emits a row for each match in the right table, producing a wide, flat row with all columns from the left and all columns from the right. Rows in the left table that had no match in the right table are not produced at all. Similarly, a LEFT JOIN produces flat rows, with a NULL-filled set of right-table columns when no match is found.</p><p>The behavior also differs in the <a href="https://en.wikipedia.org/wiki/Data_definition_language">DDL</a> aspects and expectations. MongoDB does not represent or maintain any kind of foreign key relationship between collections. 
While relational databases expose foreign keys in the DDL and extend them with things like cascade-delete and insert&#x2F;update validation of key existence, MongoDB has no such constructs.</p><p>Apart from the geometrical and <a href="https://en.wikipedia.org/wiki/Data_manipulation_language">DML</a>-only nature of <code>$lookup</code>, it is also a bit more limited. You cannot look up into a sharded collection. This limits the scale of the database where one could apply the syntax, and should give you pause to think through your modeling carefully before relying on it.</p><p>The closest analog of <code>$lookup</code> to SQL JOIN varieties is the LEFT OUTER JOIN.</p><h2 id="Simulating-INNER-JOIN"><a href="#Simulating-INNER-JOIN" class="headerlink" title="Simulating INNER JOIN"></a>Simulating INNER JOIN</h2><p>At times, we want to simulate INNER JOIN. We may want to emit only documents from the first collection which have one or more matches in the second collection.</p><p>There’s a way to do this using aggregation by chaining <code>$lookup</code> and an <code>$unwind</code>.</p><blockquote><p>Full props to <a href="https://twitter.com/asya999">Asya Kamsky</a> for this technique!</p></blockquote><p>The subtle point is that <code>$lookup</code> appends an array of values from the second collection for each input document, and an empty array if no match is found. That empty array is the key to this solution. <code>$unwind</code> has the ability to suppress emitting documents where the array field is empty or missing. 
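That suppression is what produces the INNER JOIN effect, and it can be illustrated with a small plain-JavaScript analogy over in-memory arrays (hypothetical helper names - this is a sketch of the semantics, not the MongoDB API):

```javascript
// In-memory analogy of $lookup followed by a default $unwind.
// `lookup` attaches matching right-side documents as an array;
// `unwind` emits one output document per array element, and documents
// whose array is empty or missing produce no output at all.
function lookup(left, right, localField, foreignField, as) {
  return left.map(doc => ({
    ...doc,
    [as]: right.filter(r => r[foreignField] === doc[localField])
  }));
}

function unwind(docs, field) {
  // Default $unwind behavior: an empty or missing array suppresses the document.
  return docs.flatMap(doc =>
    (doc[field] || []).map(item => ({ ...doc, [field]: item }))
  );
}

const cities = [{ _id: 'PIE' }, { _id: 'TOL' }];
const routes = [
  { _id: 2, departure: 'PIE', arrival: 'TOL' },
  { _id: 4, departure: 'YUM', arrival: 'TOL' }
];

const joined = unwind(lookup(cities, routes, '_id', 'arrival', 'inbound'), 'inbound');
// 'PIE' had no inbound routes - its empty array suppresses it entirely,
// while 'TOL' is emitted once per matching route.
console.log(joined.length); // 2
```

Only documents with at least one match survive - exactly the INNER JOIN geometry described above.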
In fact, that was the default behavior of <code>$unwind</code> all along.</p><p>Consider the following data sets:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">cities</span>.<span class="title function_">insert</span>([</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="string">&quot;PIE&quot;</span>, <span class="string">&quot;name&quot;</span> : <span class="string">&quot;St. Petersburg&quot;</span>, <span class="string">&quot;state&quot;</span> : <span class="string">&quot;FL&quot;</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="string">&quot;YUM&quot;</span>, <span class="string">&quot;name&quot;</span> : <span class="string">&quot;Yuma&quot;</span>, <span class="string">&quot;state&quot;</span> : <span class="string">&quot;AZ&quot;</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="string">&quot;TOL&quot;</span>, <span class="string">&quot;name&quot;</span> : <span class="string">&quot;Toledo&quot;</span>, <span class="string">&quot;state&quot;</span> : <span class="string">&quot;OH&quot;</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="string">&quot;FAR&quot;</span>, <span class="string">&quot;name&quot;</span> : <span class="string">&quot;Fargo&quot;</span>, <span class="string">&quot;state&quot;</span> : <span class="string">&quot;ND&quot;</span> &#125;</span><br><span class="line">]);</span><br><span class="line"></span><br><span class="line">db.<span class="property">routes</span>.<span class="title function_">insert</span>([</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="number">1</span>, <span class="string">&quot;departure&quot;</span> : <span class="string">&quot;PIE&quot;</span>, <span class="string">&quot;arrival&quot;</span> : <span class="string">&quot;YUM&quot;</span>, <span class="string">&quot;d&quot;</span> : <span class="number">90</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="number">2</span>, <span class="string">&quot;departure&quot;</span> : <span class="string">&quot;PIE&quot;</span>, <span class="string">&quot;arrival&quot;</span> : <span class="string">&quot;TOL&quot;</span>, <span class="string">&quot;d&quot;</span> : <span class="number">60</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="number">3</span>, <span class="string">&quot;departure&quot;</span> : <span class="string">&quot;FAR&quot;</span>, <span class="string">&quot;arrival&quot;</span> : <span class="string">&quot;YUM&quot;</span>, <span class="string">&quot;d&quot;</span> : <span class="number">40</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="number">4</span>, <span class="string">&quot;departure&quot;</span> : <span class="string">&quot;YUM&quot;</span>, <span class="string">&quot;arrival&quot;</span> : <span class="string">&quot;TOL&quot;</span>, <span class="string">&quot;d&quot;</span> : <span class="number">50</span> &#125;,</span><br><span class="line">    &#123; <span class="string">&quot;_id&quot;</span> : <span class="number">5</span>, <span class="string">&quot;departure&quot;</span> : <span class="string">&quot;TOL&quot;</span>, <span class="string">&quot;arrival&quot;</span> : <span class="string">&quot;FAR&quot;</span>, <span class="string">&quot;d&quot;</span> : <span class="number">30</span> &#125;</span><br><span class="line">]);</span><br></pre></td></tr></table></figure><p>Now query the <code>cities</code> to discover the inbound flights arriving in each city. For each city, a list of routes is appended from the <em>routes</em> collection. 
The match is on the <em>cities</em> <em>_id</em> field against the <em>routes</em> <em>arrival</em> field, both of which contain the FAA 3-letter airport code.</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">cities</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="attr">$lookup</span>: &#123;</span><br><span class="line">            <span class="attr">from</span>: <span class="string">&#x27;routes&#x27;</span>,</span><br><span class="line">            <span class="attr">localField</span>:<span class="string">&#x27;_id&#x27;</span>,</span><br><span class="line">            <span class="attr">foreignField</span>:<span class="string">&#x27;arrival&#x27;</span>,</span><br><span class="line">            <span class="attr">as</span>: <span class="string">&#x27;inbound&#x27;</span> </span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">])</span><br></pre></td></tr></table></figure><p>The result for Toledo (‘TOL’) will show 2 inbound flights:</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;_id&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;TOL&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;name&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;Toledo&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;state&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;OH&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;inbound&quot;</span> <span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">            <span class="punctuation">&#123;</span></span><br><span 
class="line">                    <span class="attr">&quot;_id&quot;</span> <span class="punctuation">:</span> <span class="number">2</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;departure&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;PIE&quot;</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;arrival&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;TOL&quot;</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;d&quot;</span> <span class="punctuation">:</span> <span class="number">60</span></span><br><span class="line">            <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="punctuation">&#123;</span></span><br><span class="line">                    <span class="attr">&quot;_id&quot;</span> <span class="punctuation">:</span> <span class="number">4</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;departure&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;YUM&quot;</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;arrival&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;TOL&quot;</span><span class="punctuation">,</span></span><br><span class="line">                    <span class="attr">&quot;d&quot;</span> <span class="punctuation">:</span> <span class="number">50</span></span><br><span class="line">            <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>The result for St. 
Petersburg (PIE) will contain no inbound flights - an empty array. This is because no route arrives there.</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;_id&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;PIE&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;name&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;St. Petersburg&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;state&quot;</span> <span class="punctuation">:</span> <span class="string">&quot;FL&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;inbound&quot;</span> <span class="punctuation">:</span> <span class="punctuation">[</span> <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>This behavior simulates an SQL LEFT OUTER JOIN. But we want to simulate an INNER JOIN. This is where <code>$unwind</code> comes in. Since version 3.2, <code>$unwind</code> includes the <code>preserveNullAndEmptyArrays</code> option. When <code>true</code>, documents whose <code>$unwind</code> field is an empty array (or null, or missing) will still be emitted from the stage. When <code>false</code>, documents entering <code>$unwind</code> with an empty array or missing field will be suppressed - won’t be emitted. The default value of <code>preserveNullAndEmptyArrays</code> is false. 
So we can actually use the simplified, backwards-compatible syntax:</p><figure class="highlight javascript"><table><tr><td class="code"><pre><span class="line">db.<span class="property">cities</span>.<span class="title function_">aggregate</span>([</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="attr">$lookup</span>: &#123;</span><br><span class="line">            <span class="attr">from</span>: <span class="string">&#x27;routes&#x27;</span>,</span><br><span class="line">            <span class="attr">localField</span>:<span class="string">&#x27;_id&#x27;</span>,</span><br><span class="line">            <span class="attr">foreignField</span>:<span class="string">&#x27;arrival&#x27;</span>,</span><br><span class="line">            <span class="attr">as</span>: <span class="string">&#x27;inbound&#x27;</span> </span><br><span class="line">        &#125;</span><br><span class="line">    &#125;,</span><br><span class="line"></span><br><span class="line">    &#123;</span><br><span class="line">        <span class="attr">$unwind</span>: <span class="string">&#x27;$inbound&#x27;</span></span><br><span class="line">    &#125;</span><br><span class="line">])</span><br></pre></td></tr></table></figure><p>This time, St. Petersburg (PIE) will be suppressed, because it had an empty array. Simulated INNER JOIN done! In addition to effectively simulating the INNER JOIN, this technique produces a geometry that matches the SQL counterpart a bit more closely. 
The result documents after <code>$unwind</code> contain a “flat” one-to-one document with all fields from the first collection and all fields from one document from the second collection (albeit demoted one level into the <em>inbound</em> field).</p><h2 id="Not-Full-Parity"><a href="#Not-Full-Parity" class="headerlink" title="Not Full Parity"></a>Not Full Parity</h2><p>Relational databases expose 4 varieties of JOIN: INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.</p><p>A RIGHT OUTER JOIN is just a LEFT OUTER JOIN, but on the opposite collection, right? Well, not exactly. First, the limitations for <code>$lookup</code> with regard to sharding might prevent you from enjoying full symmetry. Second, you might need to “join” in a late pipeline stage, after a few manipulations of the first collection. In that case - again - it may be either difficult or impossible to reverse the roles of the two collections. Consider, for example, a collection A that you run a $text match on in order to leverage the full-text index for filtering. Reversing the roles and aggregating on collection B would prevent you from using that index effectively.</p><p>FULL OUTER JOIN is flatly not supported by <code>$lookup</code>. It would be difficult to implement at scale, and would obviously collide with the sharding limitation on one of the collections involved. Performance questions aside, one would also need to consider the meaning of a pipeline - the basis of the <code>aggregate()</code> command. The pipeline feeds documents from the previous stage or a collection into a given stage. Under no circumstance do all documents from two separate collections enter a stage. <code>$lookup</code> is designed to bring in a subset of matched documents from a different collection, not all documents from a second collection. 
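As an aside, when both collections are small enough to fetch entirely, FULL OUTER semantics can be assembled in application code instead. A minimal plain-JavaScript sketch, with a hypothetical helper operating on already-fetched arrays rather than any MongoDB API:

```javascript
// Application-side FULL OUTER JOIN sketch (hypothetical helper).
// Emits: every left/right matched pair, left rows with no match
// (right: null), and right rows with no match (left: null).
function fullOuterJoin(left, right, leftKey, rightKey) {
  const results = [];
  const matchedRight = new Set();

  for (const l of left) {
    const matches = right.filter(r => r[rightKey] === l[leftKey]);
    if (matches.length === 0) {
      results.push({ left: l, right: null }); // unmatched left row
    } else {
      for (const r of matches) {
        matchedRight.add(r);
        results.push({ left: l, right: r });
      }
    }
  }
  // Right rows that never matched any left row
  for (const r of right) {
    if (!matchedRight.has(r)) results.push({ left: null, right: r });
  }
  return results;
}

const cities = [{ _id: 'PIE' }, { _id: 'TOL' }];
const routes = [
  { _id: 2, arrival: 'TOL' },
  { _id: 9, arrival: 'XXX' } // arrival with no matching city
];

const merged = fullOuterJoin(cities, routes, '_id', 'arrival');
// PIE (no match), TOL paired with route 2, and the orphan route 9
console.log(merged.length); // 3
```

This is O(n×m) and holds everything in memory, so it only makes sense for the small, metadata-sized collections discussed below.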
It may be tempting to simulate a FULL OUTER JOIN by using an <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#uncorrelated-subquery">uncorrelated sub-query</a> within the <code>$lookup</code> syntax, using a <code>pipeline</code> option that effectively drags in every document from the second collection. This attaches every document from collection B as an array on <strong>every</strong> document from collection A. Ouch! Even if this worked, you’d surely need to chase this with a <code>$group</code> or similar operator to get distinct values somehow. But very likely the pipeline will abort: the memory impact on the stage will be arbitrarily large. Finally, and even with allow-disk-usage, the document size limit of 16MB applies. So if collection B holds more than 16MB of documents, the resulting documents will exceed that limit and the pipeline will fail. If both collections are sufficiently small, an application-side join might be your best bet. An application-side join is not very efficient, but can be acceptable for small-ish collections such as metadata-type documents.</p><p>In general though, taking SQL operators and hoping that MongoDB or the Aggregation syntax would provide full parity is perilous. Taking the time to analyze the actual use cases and modeling properly using document-oriented techniques would lead to better performance and cleaner designs.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;When modeling in the document-oriented world, One to few relationships are often modeled as arrays embedded in the containing entity. Som</summary>
      
    
    
    
    
    <category term="MongoDB" scheme="https://www.plusnconsulting.com/tags/MongoDB/"/>
    
    <category term="NoSQL" scheme="https://www.plusnconsulting.com/tags/NoSQL/"/>
    
    <category term="JOIN" scheme="https://www.plusnconsulting.com/tags/JOIN/"/>
    
  </entry>
  
</feed>
