Database servers are usually long-lived and important parts of the infrastructure that we build on. We rarely set them up from scratch, because we have to take such good care of them over time. I think this causes a lot of people to think that setting up a database server is some mysteriously difficult ordeal. To be clear, that’s actually true if you need high availability and a solid recovery point objective. But there are a lot of use cases where that’s overkill, for example short-lived test environments or CI/CD pipelines.
Edit: There’s a Hacker News thread here, in which several people recommend testcontainers, which is great for running tests both locally and in CI/CD. pgtemp can automate exactly the steps I describe in this post and seems like a great tool.
The postgres data directory
A postgres installation is roughly split into two components:
- The binaries, which are usually installed somewhere in PATH and are shared across the entire server.
- The data directory, which contains the state of the database cluster, including all the databases, tables, indexes, configuration and so on. Most binaries in the postgres installation accept -D or --pgdata as an argument to specify where to keep the state, or where to start from.
Many times, we run one postgres instance per server, so this split is not obvious to people. I also think that on ubuntu, the default installation comes with a postgres instance already set up somewhere under /var/lib/postgresql, which makes it sort of look like you can only have the one.
But it’s really easy to set up a new postgres instance on a server that has the postgres binaries. Here’s what you need to do:
# Create a new directory for the data
mkdir -p /tmp/throwaway-postgres
# Set up the data directory
initdb -D /tmp/throwaway-postgres
# Start postgres in there
pg_ctl -D /tmp/throwaway-postgres start
# Stop it again
pg_ctl -D /tmp/throwaway-postgres stop
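While the instance is running, you can connect to it with the usual client tools. A quick sanity check, assuming nothing else is listening on the default port 5432:
# initdb makes the OS user that ran it the superuser, and the default
# local authentication method is trust, so no password is needed
psql -h localhost -p 5432 -d postgres -c 'select version()'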
If you need to configure anything, you can do that in the postgresql.conf that is generated in the data directory, or you can pass additional arguments to pg_ctl when you start the server. If you prefer to keep postgres in the foreground in the shell, you can use postgres -D /tmp/throwaway-postgres instead of pg_ctl start. If you prefer to use pg_ctl but have to pass options, you can use -o to pass options to the postgres command. For example, to run on a different port, you’d say:
pg_ctl -D /tmp/throwaway-postgres -o "-p 5433" start
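If it’s a setting you always want, you can also bake it into the generated config file instead; a small sketch of the same port change:
# Append the setting to the postgresql.conf that initdb generated;
# later entries in the file override earlier ones
echo 'port = 5433' >> /tmp/throwaway-postgres/postgresql.conf
pg_ctl -D /tmp/throwaway-postgres start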
This makes it really easy to run many postgres instances on the same server, and you also don’t really need something like docker to run postgres in your local dev environment. You can just set up a new data directory for each project, and start and stop the server as you need it, or just run it in the foreground in the shell. If you do this, you probably want to use something like mise to make sure that all developers use the version of postgres that you plan on using in production.
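As a sketch, assuming you use mise with its postgres plugin (the version number here is just an example), pinning a version for the whole team is a one-liner:
# Pin the postgres version for everyone working in this project
mise use postgres@16.4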
I need my instances to have data
What we’ve learned about the -D option makes it really easy to set up many empty postgres instances. The next hurdle is generally to populate the instances with some data, so that our tests or applications can actually use them. My preferred pattern for this is to have a single instance that is stable and set up with the data that I need, then clone from that to make new instances. Suppose you have a stable, permanent test environment, for example. Then this is also really simple:
# Either set up PGPASSWORD or use a .pgpass file
export PGPASSWORD=postgres
pg_basebackup -D /tmp/throwaway-postgres \
--checkpoint=fast --progress \
--host the-stable-instance -U postgres
pg_ctl -D /tmp/throwaway-postgres -o "-p 5433" start
This is a really fast way to set up a temporary postgres instance, even if there’s a bit of data in the-stable-instance. The pg_basebackup command takes a binary snapshot of the -D directory of the stable instance and replays any transactions that happened during the backup. This shouldn’t take more than a couple of minutes for a small database of 15-30GB, and you get an exact copy of the stable instance, with users, databases, configuration, and all the data.
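If you do this a lot, it’s convenient to wrap the clone-and-start dance in a small script. Here’s a rough sketch; the-stable-instance and the postgres user are placeholders for your own environment:
#!/usr/bin/env bash
# Clone the stable instance into a fresh throwaway directory and start it
set -euo pipefail
DATADIR=$(mktemp -d /tmp/throwaway-postgres.XXXXXX)
PORT=${1:-5433}
PGPASSWORD=postgres pg_basebackup -D "$DATADIR" \
  --checkpoint=fast --progress \
  --host the-stable-instance -U postgres
pg_ctl -D "$DATADIR" -l "$DATADIR/server.log" -o "-p $PORT" start
echo "postgres is up on port $PORT, stop with: pg_ctl -D $DATADIR stop"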
Note that pg_basebackup isn’t guaranteed to work across different major versions of postgres, or across different CPU architectures, so you’ll still need pg_dump for some situations. But if you want to be able to quickly set up databases for testing, you should check whether pg_basebackup will work for you before reaching for pg_dump.
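When you do end up on the logical route, the flow is similar, just slower; a sketch where mydb is a placeholder database name and a throwaway instance is already running on port 5433:
# A logical copy works across major versions and CPU architectures,
# but doesn't bring roles or instance-level configuration along
createdb -p 5433 mydb
pg_dump --host the-stable-instance -U postgres mydb | psql -p 5433 -d mydb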
Making dynamic instances in kubernetes
If you’re running in a kubernetes cluster, you can use the same pattern to set up a new postgres instance in a pod very easily. You’ll need to create a new pod with an initContainer that runs pg_basebackup to set up the data directory, then start the postgres instance in the main container. We do this at my current project to set up complete short-lived environments for manual testing and demo purposes, and it works quite well. It takes a few minutes to deploy a complete environment with a ~15GB database instance. It’s also a useful trick for testing database migrations on a realistic dataset before deploying them to production.
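Here’s a rough sketch of what such a pod could look like. The image tag, hostname and credentials are placeholders, and in a real setup you’d take the password from a Secret:
apiVersion: v1
kind: Pod
metadata:
  name: throwaway-postgres
spec:
  securityContext:
    runAsUser: 999  # the postgres user in the official image
    fsGroup: 999
  volumes:
    - name: pgdata
      emptyDir: {}
  initContainers:
    - name: clone
      image: postgres:16
      command: ["pg_basebackup", "-D", "/pgdata/data",
                "--checkpoint=fast", "--host=the-stable-instance",
                "-U", "postgres"]
      env:
        - name: PGPASSWORD
          value: postgres  # use a Secret in anything real
      volumeMounts:
        - name: pgdata
          mountPath: /pgdata
  containers:
    - name: postgres
      image: postgres:16
      command: ["postgres", "-D", "/pgdata/data"]
      ports:
        - containerPort: 5432
      volumeMounts:
        - name: pgdata
          mountPath: /pgdata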
This seems like complete overkill for my use case
I think this is overkill if you only have a single database in the instance, and it doesn’t contain a lot of data. At that point, you can still use this template idea: maintain a single stable database in a single database server, then use create database $uuid template stable_db to set up a new database quickly. You can read more about that idea in a previous blog post here.
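On the command line, the same idea looks something like this; test_run_42 is just an example name:
# Clone the stable database inside the same instance; this requires that
# nobody else is connected to stable_db while the copy runs
createdb -T stable_db test_run_42
# ... run tests against test_run_42, then throw it away
dropdb test_run_42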
Using it for fun and profit for eugene
For a while now, I’ve been working on the eugene trace command over at the eugene repository, and I’ve been pondering how to make it easier for people to get started with it. Over the weekend, I remembered this pattern that I’ve used many times before, and I thought maybe eugene trace could just bring its own postgres instance to the party. With the 0.5.0 release, it does just that. If you have docker installed, you can run eugene trace in a directory that has flyway-style migrations like this:
docker run --rm -v $(pwd):/migrations \
ghcr.io/kaaveland/eugene:0.5.0 trace /migrations
There’s no setup required and the container is gone when the command finishes.