Monday, February 12, 2018

Changing the SQL schema (10)

The database so far is pretty simple: there are a couple of name fields, and a date of birth. Eventually, I'd like to add in a load of other fields.

One way to do this is to over-engineer an Inner Platform that allows some non-programmer admin to add in fields as they see fit. Another way is to hardcode the choice of fields into the codebase, and that is the way I'm going to proceed.That means to customise for a particular event, you need to fork the codebase and change stuff all the way through the code, which is icky, but this is really only for a couple of events for now.

As an example, I'd like to ask if a registrant can swim, and store the result as a PostgreSQL BOOLEAN column.

On the database side, there are a couple of ways to add this column: if you don't care about the data in your database, modify the appropriate CREATE statement in tables.sql, delete the database and create a new fresh database.

If you have any data, that's not so good - all the previously registered registrations will disappear.

Instead you can modify the schema of a database using PostgreSQL's ALTER command. For example:

postgres=# ALTER TABLE registration ADD COLUMN swim BOOLEAN;
ALTER TABLE
postgres=#

This doesn't lose existing registration, so is better.

But now for every deployment of the app, you need to remember to run the appropriate ALTER statements. (and yes there will be more than one deployment, even for such a small use case: there's a copy on my laptop for hacking on, and the copy people will use for real).

As a programmer, who cares? You throw the software over the wall to the lazy incompetant ops team, and leave it to them to screw up following your poorly written upgrade instructions. As an ops person, though, you'd much rather the programmers wrote their poorly written upgrade instructions in something that had actually been tested by the testing system before being passed on to you to discover the bugs.

Fear not! There's a pretty standard way of handling this, and there's a Haskell implementation called postgresql-simple-migration. Install that now with stack build postgresql-simple-migration.

What we do is write out each bit of SQL that needs to run in a series of files in a directory. When a new SQL statement needs running on all existing databases, put that in a new file in that directory. Let postgresql-simple-migration handle the tracking and executing of the commands when each installation of the application is upgraded.

To begin, set this up so that it will just create the same database schema as we've been using already: make a directory called migrations/ and rename the existing tables.sql to migrations/0000-create.sql. Then delete the existing database and let postgresql-simple-migrations take charge:

psql --user=postgres -c "DROP TABLE registration;"

stack exec migrate init user=postgres
Initializing schema

stack exec migrate migrate user=postgres migrations/
Execute:        0000-create.sql

If you run migrate again, it will report a different status for 0000-create:

stack exec migrate migrate user=postgres migrations/
Ok:     0000-create.sql

... which says that migrate didn't need to run that migration but that it is happy that it has been done already. So to get a nice mathsy term in, migrate is idempotent.

Now it's time to run that ALTER command to add the SWIM column.

Open your favourite SQL-editing text editor, or /bin/cat, and create a file called 0001-swim.sql containing the single line:

ALTER TABLE registration ADD COLUMN swim BOOLEAN;

... and run the migration command again:

stack exec migrate migrate user=postgres migrations/
Ok:     0000-create.sql
Execute:        0001-swim.sql

(Note the poor indentation. It's using \t tabs. We can have a debate about that. But we won't.)

migrate has seen that 0000-create has been run already, and that 0001-swim is new and so has executed it.

If this was a fresh database, both of the scripts would have been run, hopefully resulting in the same final schema. They're run in alphabetical order, so using a fixed-width sequence number gives an obvious way to put things in order until you run out of four digit integers. I've seen a date stamp commonly used too - I don't have a huge amount of opinion on the matter.

You need to be a bit careful here though. It is possible to add migrations out of order: for example, having added 0001-swim and had it deployed, I could add 0001-a which is *before* 0001-swim. Now, on a fresh database, they'll run in the order: 0000-create, 0001-a, 0001-swim. But in the database that has already been around a while, they'll have been applied in the order 0000-create, 0001-swim, 0001-a. This is the sort of thing that can happen with multiple developers working on different branches and merging their results without remembering to look out for this, although it's often the case that swapping the order doesn't matter much (for example, two column adds can happen in either order - the effect will be observable but shouldn't break properly written code).

Another degeneracy is when someone messes with existing migrations in version control to "fix" a problem that has happened. This can easily lead to database schemas in existence which can't be recreated by any one version of the software, an unfortunate circumstance.

And yet another downside is we now don't have a single set of CREATE statements to create the final database schema. Instead you have to run all the migrations and observe them in a live database.

Nevertheless, it is a pretty simple, and very common, solution to the problem of SQL schema upgrades.

It's also possible to make migrations happen at the beginning of reg-exe. That makes sense if there is one copy of reg-exe pointing at the database, and nothing else.

But did you know you can run *two* copies of reg-exe pointing at the same database? (or at least, I hope so - I haven't tried it). Magic scalability. In that case, it doesn't (always) make sense for an older version of the software to be talking to a database with a newer schema.

So although I'm not expecting to have super-scalability with this project, I still have a personal preference to manage the migrations separately from other processes which touch the database - even though it's one more deployment step to fuck up.

Here's the commit for this post - 599735f5. In the next post, I'll go over the changes to the Haskell code needed to support this new swim field.

ps. My notes for this post also say "some waffle about how the migrations are like a fold, done over database schema".

No comments:

Post a Comment