Compound Wildcard Indexes in MongoDB 7

2023-09-20

What

Wildcard indexes have improved in several ways in MongoDB 7.0.

Notabley, wildcard indexes can now be compound, including additional non-wildcard fields.

Previously, wildcard indexes only acted as a single-field index on each indivudual target field, stored in a single index which included those fields.

Newly supported is the ability to include both wildcard fields and named individual fields in the same index.

This capability supports common use cases such as the attribute-pattern document design. In such cases, a sub-document contains a set of keys, but the presence of the keys varies from document to document.

Consider the case of the followin documents describing a prospect record in a CRM system

{ _id: 35, name: 'Francis', status: 'customer', contact: {} }

{ _id: 36, name: 'Connor',  status: 'customer', contact: { cell: '07787 304928', fax: '0191 558 5860' }}

{ _id: 1,  name: 'Ethan',   status: 'customer', contact: { cell: '07943 819337' }}

Francis has no contact fields at all - just a name.
Connor has both a cell and a fax .
Ethan has only a cell phone number.

A wildcard index on contact.** would allow single-field match against one of the contact fields.

But query filtering on the status field and one of the wildcard fields would not be supported without a compound wildcard index.

Using a Compound Wildcard Index

First, create an index by supplying both a wildcard expression and an additional plain (non-wildcard) field(s).

db.prospect.createIndex({ "contact.$**": 1, status:1 }, {name: "idx_compound_wild"})

As with other compound indexes, the key structure supports key prefixes that include the first, or first and second fields.

This query leverages the compound index:

db.prospect.find({'contact.cell':'07408 926850', status:'customer'})

The query includes the a full prefix in the index: a single wildcard field contact.cell, and the non-wildcard field status.

Running .explain(true) on this query shows a winning plan that states our index named idx_compound_wild was used. The indexBounds entry shows that both the contact.cell and status fields were considered when processing the query agains the index.

"winningPlan": {
      "queryPlan": {
        "stage": "FETCH",
        "planNodeId": 2,
        "inputStage": {
          "stage": "IXSCAN",
          "planNodeId": 1,
          "keyPattern": {
            "$_path": 1,
            "contact.cell": 1,
            "status": 1
          },
          "indexName": "idx_compound_wild",
          "isMultiKey": false,
          "multiKeyPaths": {
            "$_path": [],
            "contact.cell": [],
            "status": []
          },
          "isUnique": false,
          "isSparse": true,
          "isPartial": false,
          "indexVersion": 2,
          "direction": "forward",
          "indexBounds": {
            "$_path": [
              "[\"contact.cell\", \"contact.cell\"]"
            ],
            "contact.cell": [
              "[\"07408 926850\", \"07408 926850\"]"
            ],
            "status": [
              "[\"customer\", \"customer\"]"
            ]
          }
        }

No Index Intersection

Index intersection is a strategy where more than one indexed fields are stated in a filter, and the result is computed by perforing a set intersection on the keys of 2 indexes.

Intersection seems to make sense when two separate simple indexes are at play on two separate fields. What about wildcard index? Will the mongo query planner choose to “intersect” the same wildcard index with itself somehow? Do we expect it will be smart about choosing to scan for two values, one found in a certain path contact.cell, then the second found in a different path contact.fax - both of which are contained with the same index? The short answer: no!

Consider the following query:

db.prospect.find({ 'contact.email': '[email protected]', 'contact.fax': '0121 062 9173'})

The query filters on two separate wildcard fields email and fax.

The query planner shows the following output:

"winningPlan": {
      "queryPlan": {
        "stage": "FETCH",
        "planNodeId": 2,
        "filter": {
          "contact.email": {
            "$eq": "[email protected]"
          }
        },
        "inputStage": {
          "stage": "IXSCAN",
          "planNodeId": 1,
          "keyPattern": {
            "$_path": 1,
            "contact.fax": 1,
            "status": 1
          },
          "indexName": "idx_compound_wild",
          "isMultiKey": false,
          "multiKeyPaths": {
            "$_path": [],
            "contact.fax": [],
            "status": []
          },
          "isUnique": false,
          "isSparse": true,
          "isPartial": false,
          "indexVersion": 2,
          "direction": "forward",
          "indexBounds": {
            "$_path": [
              "[\"contact.fax\", \"contact.fax\"]"
            ],
            "contact.fax": [
              "[\"0121 062 9173\", \"0121 062 9173\"]"
            ],
            "status": [
              "[MinKey, MaxKey]"
            ]
          }
        }
      }...

The above is a bit awkward. The filter term mentions the contact.email field alone. The indexBounds field mentions the contact.fax field only. Though one might intuit that the strategy is to us the index to scan the index contact.fax first, then filter the entries in the index by contact.email, it is neither clear that this is the case nor expected. One would expect that the index contains the list of documents under either of the keys, and therefore an AND query would hit the index key structure twice and do some nested loop join or something similar.

Let’s try specifying three of the wildcard fields in the index, and see if things become any clearer:

db.prospect.find({
  'contact.cell': '07404 190465',
  'contact.email': '[email protected]',
  'contact.fax': '01995 364874'
}).explain("queryPlanner")

The plan shown below seems to imply the same general strategy. The index bounds only mention one of the three query terms, then the filter mentions the other two.

"winningPlan": {
      "queryPlan": {
        "stage": "FETCH",
        "planNodeId": 2,
        "filter": {
          "$and": [
            {
              "contact.cell": {
                "$eq": "07404 190465"
              }
            },
            {
              "contact.fax": {
                "$eq": "01995 364874"
              }
            }
          ]
        },
        "inputStage": {
          "stage": "IXSCAN",
          "planNodeId": 1,
          "keyPattern": {
            "$_path": 1,
            "contact.email": 1,
            "status": 1
          },
          "indexName": "idx_compound_wild",
          "isMultiKey": false,
          "multiKeyPaths": {
            "$_path": [],
            "contact.email": [],
            "status": []
          },
          "isUnique": false,
          "isSparse": true,
          "isPartial": false,
          "indexVersion": 2,
          "direction": "forward",
          "indexBounds": {
            "$_path": [
              "[\"contact.email\", \"contact.email\"]"
            ],
            "contact.email": [
              "[\"[email protected]\", \"[email protected]\"]"
            ],
            "status": [
              "[MinKey, MaxKey]"
            ]
          }
        }
      },...

Well, the plan is not lying. MongoDB does indeed choose only one of the query terms to serve as the value sought in the index. It evaluates the selectivity of the terms in the query, and picks one it deems most suitable.

This still doesn’t answer the question why it doesn’t compare the other terms agains the index either. Which brings the topic of index intersection. MongoDB introduced an implementation of index intersection at some point, then reversed the default implementation of it to a degree that pretty much prevents intersection from every being excercised. Turns out that the performance gains were rather low for the intersection candidates as compared to other strategies. Worse: the planner would often pick intersection in cases where other strategies would have performed way better. In other words: Index intersection turned out to be the wrong answer too often, and would actually hurt performance.

But this is a single index you say - wouldn’t intersection make sense in this particular wildcard scenario? The way I like to think about it, a wildcard index is actually several indexes in a trench-coat pretending to be a single index. Each wildcard field value is processed into it’s own index structure, but all of those structures are stored together as one. A hint to this notion is that the planner outputs the $_path term, stating which wildcard field within the index is targeted.

The addition of non-wildcard fields to the index would then shape the key prefixes to contain the additional non-wildcard value so that the index is more selective and the (theoretical) need for intersection is reduced.

Conclusion

Compound wildcard indexes are a welcome addition to MongoDB 7.0. Having no intersection mechanism doomed many queries to only partially use indexes and resort to loading documents based on a single field, then filtering them further. With this feature, the number of documents loaded can be reduced. Reducing I/O should help many workloads. Whether it does for your workload is subject to - as always - experimintation and monitoring. Give it a try, hope it helps!

For limitations and further details, see Compound Wildcard Indexes

+N Consulting, Inc.