The Good, the Bad, and the Join

2018-07-06

When modeling in the document-oriented world, One to few relationships are often modeled as arrays embedded in the containing entity. Sometimes the relationship is one to many, or such that an embedded array would not work well. In those cases, the need for some sort of JOIN comes up.

The Aggregation Framework in MongoDB gives us a convenient option: $lookup! But $lookup is not JOIN - at least not the one we are familiar with from SQL and relational databases. This distinction is important to understand, as it may affect your modeling and performance.

What $lookup Is

$lookup queries a second collection for each document entering the pipeline stage. Any documents from the secondary collection that pass the lookup condition are attached to an array field on the entering document.

The behavior differs from SQL’s INNER and LEFT JOIN in its geometry. In SQL a row is emitted for each match of the right table, and produces a wide row with all columns from the left and all columns from the right - flat. Rows in the left table that had no match in the right table would not be produced at all. Similarly, a LEFT JOIN would have produced flat rows, and a NULL filled row for the right table in case no match was found for the right table.

The behavior also differs in the DDL aspects and expectation. MongoDB does not represent or maintain any kind of foreign key relationship between collection. While relational databases expose foreign keys in the DDL and extend them with things like cascade-delete and insert/update validation of key existence, MongoDB has no such constructs.

Apart from the geometrical and DML-only nature of $lookup, it is also a bit more limited. You cannot lookup into a collection that is sharded. This limits the scale of the database where one could apply the syntax, and should give you pause to think through your modeling carefully before relying on this syntax.

The closest analog of $lookup to SQL JOIN varieties is the LEFT OUTER JOIN.

Simulating INNER JOIN

At times, we want to simulate INNER JOIN. We may want to emit only documents from the first collection which have one or more match from the in the second collection.

There’s a way to do this using aggregation by chaining $lookup and an $unwind.

Full props to Asya Kamsky for this technique!

The subtle point is that $lookup appends an array of values from the second collection for each input document, and an empty array if no match is found. That empty array is the key to this solution. $unwind has the ability to suppress emitting documents where the array field is empty or missing. In fact, that was the default behavior of $unwind all along.

Consider the following data sets:

db.cities.insert([
    { "_id" : "PIE", "name" : "St. Petersburg", "state" : "FL" }
    { "_id" : "YUM", "name" : "Yuma", "state" : "AZ" }
    { "_id" : "TOL", "name" : "Toledo", "state" : "OH" }
    { "_id" : "FAR", "name" : "Fargo", "state" : "ND" }
]);

db.routes.insert([
    { "_id" : 1, "departure" : "PIE", "arrival" : "YUM", "d" : 90 }
    { "_id" : 2, "departure" : "PIE", "arrival" : "TOL", "d" : 60 }
    { "_id" : 3, "departure" : "FAR", "arrival" : "YUM", "d" : 40 }
    { "_id" : 4, "departure" : "YUM", "arrival" : "TOL", "d" : 50 }
    { "_id" : 5, "departure" : "TOL", "arrival" : "FAR", "d" : 30 }
]);

Now query the cities to discover the inbound flights arriving in for each city. For each city, a list of routes is appended from the routes collection. The match is on the cities _id field against the routes arrival field, both of which contain the FAA 3 letter airport code.

    {
        $lookup: {
            from: 'routes',
            localField:'_id',
            foreignField:'arrival',
            as: 'inbound' 
        }
    }
])

The result for Toledo (‘TOL’) will show 2 inbound flights:

{
    "_id" : "TOL",
    "name" : "Toledo",
    "state" : "OH",
    "inbound" : [
            {
                    "_id" : 2,
                    "departure" : "PIE",
                    "arrival" : "TOL",
                    "d" : 60
            },
            {
                    "_id" : 4,
                    "departure" : "YUM",
                    "arrival" : "TOL",
                    "d" : 50
            }
    ]
}

The result for St. Petersburg (PIE), will contain no inbound flights - an empty array. This is because there is no route with an arrival into it.

{
    "_id" : "PIE",
    "name" : "St. Petersburg",
    "state" : "FL",
    "inbound" : [ ]
}

This behavior simulates an SQL LEFT OUTER JOIN. But we want to simulate an INNER JOIN. This is where $unwind comes in. Since version 3.2, $unwind includes the preserveNullAndEmptyArrays option. When true, documents containing empty arrays in the $unwind field will be emitted from the stage. When false, documents entering $unwind with an empty array or missing field will be suppressed - won’t be emitted. The default value of preserveNullAndEmptyArrays is false. So we can actually use the simplified backwards compatible syntax:

db.cities.aggregate([
    {
        $lookup: {
            from: 'routes',
            localField:'_id',
            foreignField:'arrival',
            as: 'inbound' 
        }
    },

    {
        $unwind: '$inbound'
    }
])

This time, St. Petersburg (PIE) will be suppressed, because it had an empty array. Simulated INNER JOIN done! In addition to effectively simulating the INNER JOIN, this technique produces a geometry that matches the SQL counterpart a bit closer. The result documents after $unwind contain a “flat” one-to-one document with all fields from the first collection and all fields from one document from the second collection (albeit demoted one level into the inbound field).

Not Full Parity

Relational databases expose 4 varieties of JOIN: INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.

A RIGHT OUTER JOIN is just a LEFT OUTER JOIN, but on the opposite collection, right? Well, not exactly. First, the limitations for $lookup with regards to sharding might prevent you from enjoying full symmetry. Second, you might need to “join” in a late pipeline stage, after a few manipulation of the first collection. In that case - again - it may be either difficult or impossible to reverse the roles of the 2 collections. Consider for example collection A that you use a $text match on, and want to leverage the full text index for filtering. Reversing and aggregating on collection B will prevent you from using the same index effectively.

FULL OUTER JOIN is flatly not supported by $lookup. It would be difficult to implement at scale, and would obviously collide with the sharding limitation on one of the collections involved. Performance questions aside, one would also need to consider the meaning of a pipeline - the basis of the aggregate() command. The pipeline feeds documents from the previous stage or a collection into a given stage. Under no circumstance do all documents from 2 separate collection enter into a stage. $lookup is designed to bring in a subset of matched documents from a different collection, not all documents from a second collection. It may be tempting to simulate a FULL OUTER by using uncorrelated sub-query within the $lookup syntax, using a pipeline option that effectively drags in every document from the second collection. This will attach an array of every document from collection B into an array on every document from collection A. Ouch! Even if this worked, you’d surely need to chase this with a $group or similar operator to get distinct values somehow. But very likely the pipeline will abort: The memory impact on the stage will be arbitrarily large. Finally, and even with allow-disk-usage, the document size limit of 16MB applies. So if collection B has more than 16MB, it will fail. If both collections are sufficiently small, an application-side join might be your best bet. Application side join is not very efficient, but can be acceptable for small-ish collections such as metadata type documents.

In general though, taking SQL operators and hoping that MongoDB or the Aggregation syntax would provide full parity is perilous. Taking the time to analyze the actual use cases and modeling properly using Document Oriented techniques would lead to better performance and cleaner designs.

+N Consulting, Inc.