Recently I read about a new Rust web framework called rwf. I was especially curious about its approach to handling database migrations, as many frameworks released in recent years tend to either overlook this or provide a mechanism that won't scale to larger projects. Reading through the documentation I learned that rwf falls in the second category: it generates two SQL files, one for migrating forward and one for migrating backwards, and evaluates the appropriate files based on the migration direction.
The problem here isn't necessarily the choice of using SQL (though it certainly doesn't help), but rather that the system is built in such a way that it becomes useless the moment your team and/or database starts growing in size.
For a while now I've been thinking about what a better solution to data migrations might look like, so let's explore my current thoughts on the subject.
Since developers love to argue about terminology rather than focusing on the subject as a whole, we'll start off by defining a few terms so it's clear what we're talking about.
A database is a relational database in the context of this article, so something like PostgreSQL or SQLite. While technically much of what we'll discuss here also applies to document-oriented databases such as MongoDB, I'll be focusing on relational databases.
An application is a program that uses the database and is exposed to the users. I assume no further explanation is needed for this particular term.
A database is considered large when conventional ways of mutating it, whether changing its structure or its content, require a significant amount of time, forcing you to use a non-conventional approach to reduce the time needed to make the changes.
The application data is the data and the data structures the application needs, whether stored in a database or elsewhere (e.g. AWS S3).
A migration is a function that transitions the application data from state A to B.
Migrations move in a particular direction: up to move forward in time (= apply the changes), and down to move backwards in time (= revert the changes).
A migration process is the act of running one or more migrations.
With the terminology out of the way, let's define a few requirements our migration process must meet.
There may be more requirements depending on your needs, but these are the ones I'll consider as part of this article.
To help better understand what we'll be dealing with, let's look at a real-world example of a sufficiently complex piece of software: GitLab.
GitLab started as a GitHub clone with only a few developers working on it. For performing database migrations, it uses the migration framework provided by Ruby on Rails. As GitLab's popularity and the number of features it offered grew, so did the size of its database. When I joined GitLab in 2015, the size of the GitLab.com database was somewhere around 200-300 GiB. By the time I left in December 2021, it had grown to 1-2 TiB. We did manage to briefly shrink the database to a size of zero, but only because I removed the entire production database by accident.
Jokes aside, this growth in size meant that the traditional approach of running database migrations was no longer suitable. For example, renaming a column was no longer possible for large tables as it would take far too long to complete. Similarly, migrating data from format A to format B could easily take weeks to complete when performed using the traditional Rails approach.
Solving these problems required separate solutions. These are as follows:
While this setup allowed GitLab to migrate both small and large tables as well as data stored outside the database, it highlights several problems with the migration system provided by Rails:
There are also problems with the setup used by GitLab:
While the resulting situation isn't ideal, I believe it's the best we could come up with at the time given all the constraints we had to deal with.
If we were to create a framework or application from scratch, we could do better than this, but doing so requires that we first understand the problems existing solutions face.
The first problem is that existing migration systems assume that the existence of a migration implies it's timeless, and that it will therefore always work. If all the migration does is create a table or column, that's likely true, but for anything more complex this is no longer the case. Isolating a migration from the application it's a part of may help, but there will be plenty of cases where this just isn't practical due to the code duplication it requires. Even if the amount of code that has to be duplicated is small, it adds up when you have to do it many times for different migrations.
What we need is a way to capture some sort of snapshot of the application logic at the time the migration is written, then run the migration using that snapshot. This ensures we can always migrate up or down, as migrations are always run against a known state. This is similar to minimal version selection ensuring a dependency always uses the minimal version that satisfies its requirements, instead of the maximum version.
The second problem is that to build a migration system that scales, you must understand what that actually means, which in turn means you must've felt the pain of dealing with a system that doesn't scale. For whatever reason it seems that the people building new migration systems (or the frameworks they're a part of) don't have such experience or just don't care for it, resulting in setups only capable of catering to the most basic of scenarios.
The people who do have the necessary experience in turn don't seem to be interested in building a better solution, perhaps because they've burned out on the subject. I certainly had no energy to think about this subject the first 1-2 years after leaving GitLab.
The third problem, related to the second, is that there aren't many projects that grow large enough to need more sophisticated solutions for migrating their data. As a result, there's not enough of a push to come up with something better than the status quo.
The fourth problem is that different projects have different requirements for applying their data migrations, and these requirements may result in different solutions. A mobile application using a small SQLite database is going to migrate its data in a way different from a SaaS application using a 10 TiB database, and building a solution that works for both may prove difficult.
Now that we've defined a set of requirements, discussed a real-world example to better understand what we're dealing with, and listed several problems faced by existing solutions, what would a better solution look like?
For applications deployed to controlled environments (i.e. not a mobile application deployed to phones you have no control over), I think I have a rough idea. What follows is only applicable to deployments to such controlled environments.
As I've hinted at before, migrations should be functions. Functions as in "a function written in a programming language", not functions as in "SQL snippets I just refer to as functions because it makes me sound smart". This means they're written in something like Ruby, Lua, Rust, or whatever language you prefer. I would like to say this seems obvious, but the fact that new frameworks and database toolkits (rwf, Diesel, and probably many more) tend to only support SQL files/expressions for migrations seems to suggest otherwise.
A function should be provided for both directions, such that one doesn't need to write a new migration to undo the changes of a previous migration. This allows for fast rollbacks in production environments if it's determined that a migration broke things. This does of course assume you can in fact revert the migration (i.e. data isn't mutated in a non-reversible manner), which unfortunately isn't always the case. Even if you revert by creating a new migration to undo the relevant changes, having a separate "down" function is still useful for testing purposes and development environments.
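As a rough sketch, and using Rust purely as an example, such a migration could look something like the following. The `Connection` and `Error` types and the `Migration` trait are made-up stand-ins for whatever a framework would actually provide:

```rust
// A minimal sketch of a migration written as code rather than SQL files.
// `Connection` and `Error` stand in for whatever database handle and error
// type the framework provides.
pub struct Connection;

#[derive(Debug)]
pub struct Error;

impl Connection {
    pub fn execute(&mut self, _sql: &str) -> Result<(), Error> {
        // A real toolkit would send the statement to the database here.
        Ok(())
    }
}

pub trait Migration {
    /// Applies the changes (moves forward in time).
    fn up(&self, db: &mut Connection) -> Result<(), Error>;

    /// Reverts the changes (moves backwards in time).
    fn down(&self, db: &mut Connection) -> Result<(), Error>;
}

pub struct AddWidgetsTable;

impl Migration for AddWidgetsTable {
    fn up(&self, db: &mut Connection) -> Result<(), Error> {
        // Being a regular function, this is free to reuse application logic,
        // loop over rows, call other services, and so on.
        db.execute("CREATE TABLE widgets (id BIGINT PRIMARY KEY, name TEXT NOT NULL)")
    }

    fn down(&self, db: &mut Connection) -> Result<(), Error> {
        db.execute("DROP TABLE widgets")
    }
}
```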
Migrations should be run against specific VCS revisions rather than whatever the latest revision is. For this to work you need to maintain a file of sorts that tracks the migrations to run along with the revisions to run them against. This also means there's a two-step process to creating migrations: first you write the migration and commit it, then you record the migration and the revision to run it against in the tracking file.
While this may seem annoying, it's trivial to automate the second step such that it shouldn't be a problem in practice.
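Such a tracking file could be as simple as a list of migration names and the revisions to run them against. The format and the names here are made up, but something along these lines would do:

```
# migration                            revision
20240101120000_add_widgets_table       4f2a9c1d
20240215093000_backfill_widget_titles  8b03e77a
```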
To perform the migration process, the migration system determines the range of migrations to run based on the system's current state and the desired state, resulting in the migration range [M1, M2, …, MN]. For each migration in this range, the system checks out the corresponding revision and then runs the migration against the revision. Once the process is done, the system checks out the initial revision again.
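A rough sketch of what this process could look like, assuming Git is used for version control and each revision ships some entry point for running a single migration (here an imaginary `bin/migrate` command):

```rust
use std::process::Command;

/// A single entry in the migration tracking file: the migration to run and
/// the VCS revision to run it against. The structure is hypothetical.
struct Entry {
    migration: String,
    revision: String,
}

/// Checks out the given revision using Git. A real implementation would
/// handle errors and possibly other version control systems.
fn checkout(revision: &str) -> std::io::Result<()> {
    Command::new("git").args(["checkout", revision]).status()?;
    Ok(())
}

/// Runs a single migration using an entry point shipped with the checked
/// out revision, such as an imaginary `bin/migrate` command.
fn run_migration(name: &str, direction: &str) -> std::io::Result<()> {
    Command::new("bin/migrate").args([direction, name]).status()?;
    Ok(())
}

fn migrate(pending: &[Entry], direction: &str, current_revision: &str) -> std::io::Result<()> {
    for entry in pending {
        // Run each migration against the revision it was written for, so it
        // can safely reuse the application logic of that revision.
        checkout(&entry.revision)?;
        run_migration(&entry.migration, direction)?;
    }

    // Restore the revision we started from once all migrations are done.
    checkout(current_revision)
}
```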
By running the migration against a known revision of the code we ensure that it's timeless, even when it reuses application logic defined outside of the migration. This also removes the need for duplicating any application logic the migration may need to perform its work, making the migration easier to write, review and maintain. Another nice benefit is that we're free to remove the migration from the latest revision once we no longer need it, rather than keeping it around for an unspecified amount of time: as long as it's still listed in the migration revision file, we can still run it by checking out the revision it's pinned to.
Migrations need to be split into pre-deployment and post-deployment migrations, similar to GitLab's approach. This allows you to use pre-deployment migrations for additions and other backwards compatible changes, while using post-deployment migrations for changes that first require code changes. A simple example is renaming a column: a pre-deployment migration adds the new column and copies over its data (and maybe installs a trigger to keep the two in sync). The deployment updates the code to start using the new column, and the post-deployment migration (running after the code is deployed) removes the old column.
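Using the made-up `Migration` trait from earlier, the two migrations for such a rename could look roughly like this (the table, column, and trigger details are just examples):

```rust
// Pre-deployment migration: runs before the new code is deployed, and only
// makes backwards compatible changes.
struct AddWidgetTitle;

impl Migration for AddWidgetTitle {
    fn up(&self, db: &mut Connection) -> Result<(), Error> {
        db.execute("ALTER TABLE widgets ADD COLUMN title TEXT")?;
        db.execute("UPDATE widgets SET title = name")?;
        // A trigger (not shown) would keep `name` and `title` in sync while
        // both the old and new code are running.
        Ok(())
    }

    fn down(&self, db: &mut Connection) -> Result<(), Error> {
        db.execute("ALTER TABLE widgets DROP COLUMN title")
    }
}

// Post-deployment migration: runs after the new code (which only uses
// `title`) has been deployed, so it's safe to drop the old column.
struct DropWidgetName;

impl Migration for DropWidgetName {
    fn up(&self, db: &mut Connection) -> Result<(), Error> {
        db.execute("ALTER TABLE widgets DROP COLUMN name")
    }

    fn down(&self, db: &mut Connection) -> Result<(), Error> {
        db.execute("ALTER TABLE widgets ADD COLUMN name TEXT")?;
        db.execute("UPDATE widgets SET name = title")
    }
}
```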
Dealing with large data migrations is a more difficult problem to solve. This would require at least the following:
This does come with the (potentially unfortunate) requirement that you also provide (or depend on) a background processing system of some sort (e.g. Sidekiq). For a full-stack framework that might not be a problem, but for e.g. a standalone database toolkit this may not be desirable.
An important requirement here is that the deployment must not finish until the background jobs complete their work. This makes monitoring easier as the migration in charge of the background jobs can report its progress by e.g. writing to STDOUT, which is then collected as part of the deployment output/logs. This also makes it easier to reason about the system: when a deployment is done, so is all its work, rather than there being an unspecified number of jobs still running in the background.
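As a sketch, a migration in charge of such background jobs could enqueue the work in batches and then block until the queue reports no remaining work, writing its progress to STDOUT along the way. The `Queue` API here is entirely made up:

```rust
use std::thread::sleep;
use std::time::Duration;

/// A stand-in for whatever background processing system is used (e.g. a
/// Sidekiq-like queue). The API is hypothetical.
struct Queue;

impl Queue {
    fn enqueue_batch(&self, _job: &str, _start_id: u64, _end_id: u64) {}
    fn pending(&self, _job: &str) -> u64 {
        0
    }
}

/// Enqueues jobs that migrate rows in batches, then waits until all jobs
/// have finished so the deployment doesn't complete before the data is
/// fully migrated.
fn migrate_widget_data(queue: &Queue, max_id: u64) {
    let batch_size = 10_000;
    let mut start = 0;

    while start <= max_id {
        queue.enqueue_batch("MigrateWidgetData", start, start + batch_size - 1);
        start += batch_size;
    }

    loop {
        let pending = queue.pending("MigrateWidgetData");

        // Progress goes to STDOUT so it ends up in the deployment logs.
        println!("{} batches remaining", pending);

        if pending == 0 {
            break;
        }

        sleep(Duration::from_secs(30));
    }
}
```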
A set of primitives should exist to make it easy to write tests for migrations, such that one can verify they in fact work as intended. For example, there should be a primitive that migrates the test database to the expected starting state and back, so you can test the migration against the expected initial state instead of the latest state.
This is probably the least exciting idea, but the fact this isn't widely done suggests it's not as obvious as I'd like it to be.
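For example, a test for the earlier column rename could look something like this, where `migrate_to`, `query_scalar`, and `has_column` are made-up helpers that a migration system could provide:

```rust
#[test]
fn test_add_widget_title() {
    // `migrate_to` is a hypothetical helper that brings the test database to
    // the state just before this migration, instead of the latest state.
    let mut db = migrate_to("20240215093000_add_widget_title");

    db.execute("INSERT INTO widgets (id, name) VALUES (1, 'test')").unwrap();

    // Applying the migration should copy `name` into the new `title` column.
    AddWidgetTitle.up(&mut db).unwrap();
    assert_eq!(db.query_scalar("SELECT title FROM widgets WHERE id = 1"), "test");

    // Reverting should leave the database in its original state.
    AddWidgetTitle.down(&mut db).unwrap();
    assert!(!db.has_column("widgets", "title"));
}
```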
The above is just a rough idea of what I think can be done to provide a better migration system. I think the idea of running migrations against known VCS revisions is especially interesting and worth exploring further. Perhaps I'll explore this in the future if I ever get around to building a web framework in Inko.