Datomic Cloud multi-tenancy

There are a multitude of ways of building multi-tenancy into a Datomic Cloud application. The approach you take is application dependent, but some common patterns arise.

It is worth noting that the approaches discussed in this post are specific to Datomic Cloud. Datomic On-Prem has a different set of strategies that can be employed. Unless otherwise noted, when I refer to Datomic below, I am referring to the Cloud product.

What is multi-tenancy?

Generally, when we talk about multi-tenancy, we refer to the sharing of infrastructure components across multiple tenants while maintaining data isolation between them. The primary benefit of a multi-tenant architecture is cost savings, as it allows organizations to consolidate their resources and reduce their infrastructure and maintenance costs. Developers gain time back, no longer needing to focus on operating individual customer instances.

The level of data isolation can range from hard data boundaries to soft query-level filtering. The choice of data isolation rigor should be based on various factors such as the sensitivity and confidentiality of the tenant data, regulatory compliance requirements, and performance and scalability considerations. Achieving the right balance between data isolation and system performance can be challenging, and it often requires careful planning, testing, and optimization.

Multi-tenancy in Datomic

In Datomic, we are provided with a small set of primitives: systems, primary compute groups, query groups, and databases. A system contains exactly one primary compute group and can have many query groups, all of which serve the same set of databases. All data resides in a database.

Cloud differs from On-Prem in that it is designed to serve many databases. Each node in the primary compute group is designated to handle transactions for a particular database. If we have read contention, we can provision query groups to service particular types of queries or databases. This design difference opens the door to a simple form of multi-tenancy — by database.

By Database

The simplest form of multi-tenancy and data isolation is by database. The typical architecture followed is one database for administrative tasks and one database for each tenant.

For example, say each tenant is identified by a UUID. We can provision a tenant database by adding a prefix to the database name followed by the tenant’s UUID.

(d/create-database client {:db-name (str "tenant-" tenant-uuid)})

Complying with data deletion requirements (e.g., GDPR) is straightforward — we just delete the database when the tenant requests it, and all the data is inaccessible.

Listing tenants can be done by filtering the list of databases.

(filter #(str/starts-with? % "tenant-") (d/list-databases client {}))

Alternatively, a mapping of tenant UUID to tenant database name can be kept in the administrative database and retrieved with a simple pull.

(d/pull admin-db [:tenant/db-name] [:tenant/id tenant-uuid])

We’re easily able to piece together a useful set of functions for operating with our new primitives.

(def tenant-db-name-prefix "tenant-")
(def admin-db-name "admin")

(defn admin-conn
  [client]
  (d/connect client {:db-name admin-db-name}))

(defn tenant-db-name
  [tenant-uuid]
  (str tenant-db-name-prefix tenant-uuid))

(defn tenant-conn
  [client tenant-uuid]
  (d/connect client {:db-name (tenant-db-name tenant-uuid)}))

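(defn list-tenant-uuids
  [client]
  (->> (d/list-databases client {})
       (filter #(str/starts-with? % tenant-db-name-prefix))
       ;; Strip the prefix so we return tenant UUIDs rather than database names.
       (map #(java.util.UUID/fromString (subs % (count tenant-db-name-prefix))))))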

Boom. Multi-tenancy solved. Only, there are trade-offs. While the database-level separation approach enables multi-tenancy with minimal code and strong data isolation, it has some scalability limitations. As the number of databases grows, particularly if there are large databases, the performance of the system can degrade. This is especially true if some databases dominate the system size, as it can impact the performance of all other databases. To address this issue, a viable solution is to provision multiple systems and introduce a routing layer to distribute the tenants across those systems. By doing so, we balance the load and prevent a single database from affecting the performance of other tenants. Additionally, this approach allows us to dedicate specific systems to high-value or "white-glove" clients who require extra resources or customized settings.
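A minimal sketch of such a routing layer, assuming tenant-to-system assignments are kept in a plain map. In practice the mapping would need to be persisted somewhere, for example in the administrative database. The system names and the assign-tenant and system-for-tenant functions are hypothetical and not part of the Datomic API.

```clojure
;; Hypothetical system names; each would correspond to a separate Datomic Cloud system.
(def shared-systems ["system-a" "system-b"])

(defn assign-tenant
  "Returns assignments with tenant-uuid placed on the shared system that
   currently holds the fewest tenants. Existing assignments never move,
   because a tenant's database lives in exactly one system."
  [assignments systems tenant-uuid]
  (if (contains? assignments tenant-uuid)
    assignments
    (let [counts (merge (zipmap systems (repeat 0))
                        (frequencies (vals assignments)))]
      (assoc assignments tenant-uuid (apply min-key counts systems)))))

(defn system-for-tenant
  "Looks up the system a tenant was assigned to."
  [assignments tenant-uuid]
  (or (get assignments tenant-uuid)
      (throw (ex-info "Tenant has no system assignment" {:tenant tenant-uuid}))))

;; A white-glove tenant can be pinned to a dedicated system up front;
;; everyone else is spread across the shared systems.
(def assignments
  (-> {#uuid "00000000-0000-0000-0000-000000000001" "system-dedicated"}
      (assign-tenant shared-systems #uuid "00000000-0000-0000-0000-000000000002")
      (assign-tenant shared-systems #uuid "00000000-0000-0000-0000-000000000003")))
```

The result of system-for-tenant would then select which client to pass to tenant-conn.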

However, with multiple systems comes greater cloud cost. Each system requires a minimum of two nodes to maintain high availability. Since each system can only serve a small number of tenants, the cost per tenant is quite high. For applications where we expect substantial margins per tenant, such as B2B enterprise applications, this may be an acceptable trade-off. In cases where we expect thousands of tenants, such as B2C apps, the cost becomes unreasonable. As a result, we need to evaluate a different strategy.

One database to rule them all

Storing all tenant data in a single database works around the total database count constraint while introducing a new problem: it’s exceedingly easy to query data for the wrong tenant, since all data resides in a single database. We want to ensure that when we issue a query, the resulting data is scoped to a single tenant. There’s no built-in way to do this. As such, we must assert information about who owns what.

At write time, we attach permissions to the data each tenant has access to. Later, when reading data, we must ensure the tenant can access the data requested. For both write and read, we trade off developer ergonomics against security.

For the sake of explanation, let’s start with an example schema for a blog.

[{:db/ident       :post/content
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 {:db/ident       :post/revisions
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/many}

 {:db/ident       :tenant/id
  :db/valueType   :db.type/uuid
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}]

Then a model for tenant data access. We intentionally keep the access model simple for the sake of the example. Additional use-case complexity could be added later.

[{:db/ident       :access/entity
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident       :access/tenant
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one}]

The tenant who is being granted access is defined by the :access/tenant attribute. The entity the tenant has access to is referenced by the :access/entity attribute. This assumes there is an entity in the database referring to the tenant. If this is stored elsewhere, this could be a UUID or some other external reference. When the access entity exists for a tenant + entity, we assume the tenant has access to the entity, else the tenant does not have access.

Access is granted by transacting an access entity for each database entity the tenant can access. We will later use these attributes to check access at read time. For example, to grant a tenant access to a hypothetical blog post entity, we transact the following.

(d/transact tenant-conn {:tx-data [{:db/id        "blog-post"
                                    :post/content "im a little teapot"}
                                   {:access/entity "blog-post"
                                    :access/tenant [:tenant/id tenant-uuid]}]})

The access entity references the post entity via its temporary id. We assign access to the tenant with tenant-uuid via the :access/tenant attribute. This approach necessitates remembering to add the access entity with each transaction. Fortunately, forgetting to add the access entity fails securely. By forgetting the access entity, the tenant won’t be able to see the newly transacted data, which is not great, but it’s better than seeing another tenant’s data.
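One way to make the access entity harder to forget is a small helper that builds it for us. The following is only a sketch: with-access is a hypothetical name, and it assumes every top-level entity map carries a :db/id (a tempid or an existing entity id). Nested maps inside the tx-data are not covered.

```clojure
(defn with-access
  "Returns tx-data containing entity-maps plus one access entity per map,
   granting tenant-uuid access to each of them."
  [tenant-uuid entity-maps]
  {:pre [(every? :db/id entity-maps)]}
  (into (vec entity-maps)
        (map (fn [entity]
               {:access/entity (:db/id entity)
                :access/tenant [:tenant/id tenant-uuid]}))
        entity-maps))
```

The earlier transaction could then pass (with-access tenant-uuid [{:db/id "blog-post" :post/content "im a little teapot"}]) as its :tx-data.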

At query time, we check access by requiring that each entity bound in the query has an access entity defined for the tenant passed in.

(d/q '[:find ?p
       :in $ ?tenant
       :where
       [?p :post/content]
       [?access :access/entity ?p]
       [?access :access/tenant ?tenant]]
  tenant-db
  [:tenant/id tenant-uuid])

That’s a bit verbose to write each time we want to check access. We can improve by pulling the access check out into a rule.

(def access-rules
  '[[(allowed? [?tenant ?e])
     [?access :access/entity ?e]
     [?access :access/tenant ?tenant]]])

(d/q '[:find ?p
       :in $ % ?tenant
       :where
       [?p :post/content]
       (allowed? ?tenant ?p)]
  tenant-db access-rules [:tenant/id tenant-uuid])

While this implementation provides a fully functional multi-tenant system that can scale to a large number of tenants, it has some limitations. It places the responsibility for access checks solely in the developer’s lap. For some companies, this might be enough. Their developers can meticulously adhere to the convention. You might think that when someone does forget to add an access check, it will certainly be caught in PR review. Perhaps you even introduce a linter to catch common cases. Even if you have exceptionally high confidence in your and your team’s ability to ensure access checks, I bet you’ll miss one. Take this particularly nasty example.

(d/q '[:find (pull ?p [*])
       :in $ % ?tenant
       :where
       [?p :post/content]
       (allowed? ?tenant ?p)]
  tenant-db access-rules [:tenant/id tenant-uuid])

We’ve changed the previous query to pull all attributes (via *) for ?p. If ?p has component ref attributes, the entities they point to will be returned by this query, even if the tenant has not been granted access to them. Oops.

For some teams, this is an okay trade-off. The cost of occasional data leaks is not high, and you value the ability to write queries unencumbered by access constraints. For others, this may not be acceptable.

If security is of the highest concern, you will need to introduce an API layer replacing Datomic’s. This layer will always take in the access context, in this case the tenant-uuid, and ensure the tenant has access to the resulting data. At a high level, a possible design for this could look like this.

(defn q
  ([tenant-uuid arg-map]
   ;; TODO: Implement d/q w/ access checks
   )
  ([tenant-uuid query & args]
   (q tenant-uuid {:query query :args args})))

(defn pull
  ([tenant-uuid db arg-map]
   (q tenant-uuid
     {:query '[:find (pull ?e pattern)
               :in $ pattern ?e]
      :args  [db (:selector arg-map) (:eid arg-map)]}))
  ([tenant-uuid db selector eid]
   (pull tenant-uuid db {:selector selector :eid eid})))

The implementation of q will do exactly what we wrote above with the allowed? rule check, only programmatically. We first inspect the query for the set of variables used in :find. For each of these variables, we ensure that the allowed? rule is appended to the :where clauses. For example, we’d expect the partial query below to be transformed into the query that follows it.

;; Partial Query
[:find ?a ?b ?c]

;; Transformed Query
[:find ?a ?b ?c
 :in $ % ?tenant
 :where
 (allowed? ?tenant ?a)
 (allowed? ?tenant ?b)
 (allowed? ?tenant ?c)]
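The transformation above can be sketched as a pure function over the query data. This is one possible shape for add-access, not a definitive implementation: query->map and find-vars are hypothetical helpers, the query is assumed not to use % already, and every :find variable is assumed to be bound to an entity by the existing :where clauses.

```clojure
(require '[clojure.string :as str])

(def access-rules
  '[[(allowed? [?tenant ?e])
     [?access :access/entity ?e]
     [?access :access/tenant ?tenant]]])

(defn query->map
  "Converts a vector-form query into map form, e.g. {:find [...] :where [...]}."
  [query]
  (if (map? query)
    query
    (into {}
          (map (fn [[[k] clauses]] [k (vec clauses)]))
          (partition 2 (partition-by keyword? query)))))

(defn find-vars
  "Returns the logic variables in :find, including the ?e of (pull ?e ...)."
  [find-elems]
  (distinct
    (keep (fn [elem]
            (cond
              (and (symbol? elem) (str/starts-with? (name elem) "?")) elem
              (and (seq? elem) (= 'pull (first elem))) (second elem)))
          find-elems)))

(defn add-access
  "Appends % and ?tenant to :in, an allowed? clause per :find variable to
   :where, and the matching rules and tenant lookup ref to the args."
  [tenant-uuid {:keys [query args]}]
  (let [{find-elems :find in-elems :in where-clauses :where :as qmap} (query->map query)
        checks (map (fn [v] (list 'allowed? '?tenant v)) (find-vars find-elems))]
    {:query (assoc qmap
                   :in    (conj (vec (or in-elems ['$])) '% '?tenant)
                   :where (into (vec where-clauses) checks))
     :args  (conj (vec args) access-rules [:tenant/id tenant-uuid])}))
```

Appending to the end of :in and args keeps the caller's own inputs in their original positions.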

Stopping here doesn’t protect us against the pull case discussed previously. We’ll need to explicitly handle pull queries. When a pull is used in a :find element, we have two options: 1) We remove it, only querying for the eid, subsequently issuing the pull for each returned eid, checking access at each nested level. 2) Allow the query to go through as written and walk the resulting data, removing elements the tenant does not have access to.

While implementing 1 was exceedingly fun, it is not performant with the client API. The code essentially reimplements the `d/pull` operation, and so it needs quick access to lots of data. When running on an Ion, the implementation is fast, since all the data is available as if in memory. However, we often work with a Cloud system locally. When doing so, each operation issues a network request, and the overhead of those requests adds up, making for poor performance.

Fortunately, the implementation of 2 is straightforward. We issue two queries: one to gather the set of eids the tenant has access to and another to run the user query. The latter still needs to be modified to have the allowed? rule added for each :find variable. A second modification is required as well. For each pull in the :find, we must add :db/id to the pattern at every level. The pulled `:db/id`s will be used for access checking the resulting pull data.
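Adding :db/id at every level of a pull pattern might look like the following sketch. add-db-id is a hypothetical helper that only rewrites vector patterns and the nested vector patterns inside map specs; anything else (recursion limits, attribute option vectors) passes through unchanged.

```clojure
(defn add-db-id
  "Returns pattern with :db/id present at every level of nesting."
  [pattern]
  (let [nested (mapv (fn [spec]
                       (if (map? spec)
                         ;; Map specs look like {:attr [sub-pattern]}; recurse into the sub-pattern.
                         (into {}
                               (map (fn [[attr sub]]
                                      [attr (if (vector? sub) (add-db-id sub) sub)]))
                               spec)
                         spec))
                     pattern)]
    (if (some #{:db/id} nested)
      nested
      (into [:db/id] nested))))
```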

After running the user query, we walk the result data, perhaps using clojure.walk/postwalk, and for each map we check whether the map’s :db/id value is in the set of eids our tenant has access to. If it is not, we remove the map from the result.
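A minimal sketch of that walk follows. It uses plain recursion rather than clojure.walk/postwalk, because postwalk also visits map entries, which makes removal fiddly. It assumes every pulled map carries a :db/id and that results are nested vectors and maps. The ::denied marker is an internal detail of this sketch.

```clojure
(defn ensure-result-access
  "Removes every map whose :db/id is not in accessible-eids, at any depth.
   Maps without a :db/id are kept."
  [result accessible-eids]
  (letfn [(allowed-map? [m]
            (or (not (contains? m :db/id))
                (contains? accessible-eids (:db/id m))))
          (prune [x]
            (cond
              (map? x)
              (if (allowed-map? x)
                ;; Keep the map, dropping any key whose value was denied.
                (reduce-kv (fn [m k v]
                             (let [v' (prune v)]
                               (if (= ::denied v') m (assoc m k v'))))
                           {} x)
                ::denied)

              (sequential? x)
              (into [] (comp (map prune) (remove #(= ::denied %))) x)

              :else x))]
    (prune result)))
```

A cardinality-many attribute whose children are all removed is left as an empty vector; whether to drop the key instead is a design choice.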

At a high level, the code flows something like this. The actual implementation is left as an exercise to the reader.

(defn q
  ([tenant-uuid arg-map]
   (let [q-arg-map (add-access tenant-uuid arg-map)
         accessible-eid-set (q-accessible-eids tenant-uuid (first (:args arg-map)))
         q-result (d/q q-arg-map)]
     (ensure-result-access q-result accessible-eid-set)))
  ([tenant-uuid query & args]
   (q tenant-uuid {:query query :args args})))

While we could stop here, there is another helpful mechanism to add. For our domain, it is useful to express that if a tenant has access to a post, they should also have access to every revision in :post/revisions. While we could explicitly add an access entity for each revision entity, that may be unnecessary since we know that in our domain the tenant will always have access to every revision. To support a scenario where the tenant has implicit access to the entities under a particular attribute, we annotate the attribute schema as such, and add one more initial query to our q implementation.

;; New schema attribute
[{:db/ident       :access.schema/implicitly-allowed?
  :db/valueType   :db.type/boolean
  :db/cardinality :db.cardinality/one}]

Add the :access.schema/implicitly-allowed? attribute to the :post/revisions entity.

(d/transact tenant-conn {:tx-data [{:db/id                             :post/revisions
                                    :access.schema/implicitly-allowed? true}]})

Now we can pass the set of implicitly allowed attributes to the ensure-result-access for consideration.

(defn q
  ([tenant-uuid arg-map]
   (let [q-arg-map (add-access tenant-uuid arg-map)
         accessible-eid-set (q-accessible-eids tenant-uuid (first (:args arg-map)))
         implicitly-allowed-attrs (q-implicitly-allowed-attrs (first (:args arg-map)))
         q-result (d/q q-arg-map)]
     (ensure-result-access q-result accessible-eid-set implicitly-allowed-attrs)))
  ([tenant-uuid query & args]
   (q tenant-uuid {:query query :args args})))
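To illustrate how the three-argument ensure-result-access might honor implicit access, here is a sketch under the same assumptions as before (results are nested vectors and maps, and pulled maps carry :db/id). Entities found directly under an implicitly allowed attribute are kept without an eid check; anything nested further down is checked as usual. The :post/author attribute in the usage note is hypothetical.

```clojure
(defn ensure-result-access
  "Removes maps the tenant cannot access. A map is kept when its :db/id is in
   accessible-eids, or when it sits directly under an attribute listed in
   implicitly-allowed-attrs."
  [result accessible-eids implicitly-allowed-attrs]
  (letfn [(allowed-map? [m implicit?]
            (or implicit?
                (not (contains? m :db/id))
                (contains? accessible-eids (:db/id m))))
          (prune [x implicit?]
            (cond
              (map? x)
              (if (allowed-map? x implicit?)
                (reduce-kv (fn [m k v]
                             ;; Implicit access applies only to values of annotated attributes.
                             (let [v' (prune v (contains? implicitly-allowed-attrs k))]
                               (if (= ::denied v') m (assoc m k v'))))
                           {} x)
                ::denied)

              (sequential? x)
              (into [] (comp (map #(prune % implicit?)) (remove #(= ::denied %))) x)

              :else x))]
    (prune result false)))
```

With accessible eids #{1} and implicit attributes #{:post/revisions}, a post with id 1 keeps all of its revisions, while a value under :post/author pointing at an entity outside the set is removed.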

This approach will get you quite far. You’ll be able to scale to many tenants, many more than you’d have been able to with the by-database design, and the query layer has minimal impact on developer ergonomics.

With large enough scale, there comes a point where performance is problematic. This will likely manifest as thrashing in valcache, but the end result is the same: queries run slowly. While this is a great problem to have (you have lots of tenants), it requires additional infrastructure to handle the higher load. You’ll need to provision additional query groups to serve reads for a particular set of tenants. An application-level routing layer, perhaps using a consistent hashing strategy, can be added to automatically direct a tenant’s reads to the correct query group.
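One way to sketch such a routing layer is a consistent-hash ring, so that adding or removing a query group only moves the tenants that hashed to it. The query group names, build-ring, and query-group-for are hypothetical; how a group name maps to a client configuration is left out.

```clojure
(defn build-ring
  "Places each query group at `replicas` points on a ring, represented as a
   sorted map of hash -> query group."
  [query-groups replicas]
  (into (sorted-map)
        (for [group query-groups
              i     (range replicas)]
          [(hash (str group "#" i)) group])))

(defn query-group-for
  "Returns the group at the first ring point at or after the tenant's hash,
   wrapping around to the start of the ring."
  [ring tenant-uuid]
  {:pre [(seq ring)]}
  (let [h (hash (str tenant-uuid))]
    (val (or (first (subseq ring >= h))
             (first ring)))))

;; Hypothetical query group names.
(def ring (build-ring ["qg-a" "qg-b" "qg-c"] 64))
```

Using several points per group evens out how tenants are spread across the groups.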

A caveat with this approach is that you cannot easily comply with data deletion requirements (e.g., GDPR). We cannot follow the database deletion strategy since multiple tenants are stored in a single database. Datomic Cloud does not have data excision, so that won’t work either. We’re left with encrypting any personally identifiable tenant information with a tenant-specific key. If the tenant requests deletion of their data, we throw away the encryption key.
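The key-per-tenant idea can be sketched with the JDK's built-in AES-GCM support. Everything here is an assumption for illustration: the in-memory tenant-keys atom stands in for a real key management service, and the function names are hypothetical. How the IV and ciphertext are stored in Datomic is not shown.

```clojure
(import '(javax.crypto Cipher KeyGenerator SecretKey)
        '(javax.crypto.spec GCMParameterSpec)
        '(java.security SecureRandom))

;; Hypothetical in-memory key store, keyed by tenant UUID.
(defonce tenant-keys (atom {}))

(defn tenant-key!
  "Returns the tenant's key, generating and remembering one on first use."
  [tenant-uuid]
  (get (swap! tenant-keys update tenant-uuid
              #(or % (.generateKey (doto (KeyGenerator/getInstance "AES")
                                     (.init (int 256))))))
       tenant-uuid))

(defn encrypt-pii
  "Encrypts plaintext with AES-GCM under a fresh random 12-byte IV."
  [^SecretKey secret-key ^String plaintext]
  (let [iv     (byte-array 12)
        _      (.nextBytes (SecureRandom.) iv)
        cipher (doto (Cipher/getInstance "AES/GCM/NoPadding")
                 (.init Cipher/ENCRYPT_MODE secret-key (GCMParameterSpec. 128 iv)))]
    {:iv iv :ciphertext (.doFinal cipher (.getBytes plaintext "UTF-8"))}))

(defn decrypt-pii
  "Decrypts a map produced by encrypt-pii."
  [^SecretKey secret-key {:keys [iv ciphertext]}]
  (let [cipher (doto (Cipher/getInstance "AES/GCM/NoPadding")
                 (.init Cipher/DECRYPT_MODE secret-key (GCMParameterSpec. 128 iv)))]
    (String. ^bytes (.doFinal cipher ^bytes ciphertext) "UTF-8")))

(defn forget-tenant!
  "Discards the tenant's key, leaving their encrypted values unreadable."
  [tenant-uuid]
  (swap! tenant-keys dissoc tenant-uuid))
```

Once forget-tenant! has run and no other copy of the key exists, the encrypted values remaining in the database can no longer be decrypted.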

Datomic On-Prem

For the sake of completeness, I should mention Datomic On-Prem. On-Prem offers a unique feature: database filters. These filters let you subset the data in a database and operate on that subset using the regular Datomic API. For an example of how this could work, I suggest watching Lucas Cavalcanti’s portion of the talk Exploring four hidden superpowers of Datomic starting at 11:04.

On the looseness of words

Many words used here are vague: "small," "large," "lots," "slow," etc. While this may be frustrating, it was done intentionally. I cannot be prescriptive about how certain decisions will affect your system without knowledge of the shape and size of your data. When in doubt, test, test, test.

Written on 2023-04-16