Linting postgres migration scripts

I have been working quite a bit on picking up dangerous migration patterns in migration scripts over at the eugene repository lately. A major feature I’ve added is syntax tree analysis, so that we can pick up some patterns without having to run the SQL scripts. This isn’t quite as precise as running the scripts, but it’s a lot faster and can catch quite a few common mistakes. So let’s take a look at how it works!

Linting SQL with pq_query.rs

One of the fun things about writing eugene in Rust, is that I get to use native libraries. This is a pretty useful capability to have, as it turns out that SQL is a really huge language that is painful to parse, and there is an excellent open source parser for it in Postgres itself.

pg_query.rs is a Rust wrapper around the Postgres parser, and it’s been really easy to start using it, by just running cargo add pg_query.

Note that this doesn’t mean it’s easy to use! Or rather, it’s not that hard to use, but even though someone else wrote the parser, the language itself is still huge and complex. But it’s done now, eugene lint can parse SQL and give some useful hints:

eugene lint -f json examples/alter_column_unique.sql

{
  "name": "examples/alter_column_unique.sql",
  "lints": [
    {
      "statement_number": 1,
      "sql": "alter table books add constraint unique_title unique(title)",
      "lints": [
        {
          "id": "E7",
          "name": "Creating a new unique constraint",
          "condition": "Found a new unique constraint and a new index",
          "effect": "This blocks all writes to the table while the index is being created and validated",
          "workaround": "`CREATE UNIQUE INDEX CONCURRENTLY`, then add the constraint using the index",
          "help": "New constraint unique_title creates implicit index on `public.books`, blocking writes until index is created and validated"
        },
        {
          "id": "E9",
          "name": "Taking dangerous lock without timeout",
          "condition": "A lock that would block many common operations was taken without a timeout",
          "effect": "This can block all other operations on the table indefinitely if any other transaction holds a conflicting lock while `idle in transaction` or `active`",
          "workaround": "Run `SET LOCAL lock_timeout = '2s';` before the statement and retry the migration if necessary",
          "help": "Statement takes lock on `public.books`, but does not set a lock timeout"
        }
      ]
    }
  ]
}

Splitting a SQL script into statements is super simple:

/// Separate SQL script into statements
pub fn sql_statements(sql: &str) -> Result<Vec<&str>> {
    Ok(pg_query::split_with_parser(sql)?
        .into_iter()
        .map(|s| s.trim())
        .collect())
}

Parsing a statement is also pretty simple, you just call pg_query::parse. The hard part starts once you have gotten the syntax tree.

Navigating `pg_query::NodeRef`

At the start, I tried to work with the enum pg_query::NodeRef. This uses a nice pattern I hadn’t seen before, here’s an excerpt:

#[derive(Debug, Copy, Clone)]
pub enum NodeRef<'a> {
    Alias(&'a protobuf::Alias),
    RangeVar(&'a protobuf::RangeVar),
    TableFunc(&'a protobuf::TableFunc),
    // ...
}

Each of the variants of NodeRef contains a reference to a protobuf struct (generated by prost) that contains the actual data.

This is much nicer to work with than putting the actual data in enum variants, because it means you can type a function to accept “the variant”, so it’s clear what kind of data it expects, and you avoid having to match on the enum in more than one place. This kind of makes it feel like a scala sealed trait with case classes in terms of capabilities, which is nice. The disadvantage is that there’s an extra layer of unpacking to do in matches, in order to get the data, you’d have to do something like this to bind the aliasname field of an Alias:

fn example(stmt: pg_query::NodeRef) {
    match stmt {
        NodeRef::Alias(pg_query::protobuf::Alias { aliasname, .. }) => {} // Do something with `aliasname`
        _ => {}
    }
}

NodeRef and its sibling Node turn up in a lot of places in the protobuf structs, and are a bit difficult to work with, with all the borrowing, referencing and this extra layer.

For this reason I decided to translate the part of the grammar I wanted to analyze into something that was easier to work with, which I named StatementSummary:

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StatementSummary {
    Ignored,
    LockTimeout,
    CreateTable {
        // ...
    },
    CreateTableAs {
        // ...
    },
    CreateIndex {
        // ...
    },
    AlterTable {
        // ...
    },
}

pub fn describe(statement: &pg_query::NodeRef) -> anyhow::Result<StatementSummary> {
    match statement {
        pg_query::NodeRef::VariableSetStmt(child) => set_statement(child),
        pg_query::NodeRef::CreateStmt(child) => create_table(child),
        pg_query::NodeRef::CreateTableAsStmt(child) => create_table_as(child),
        pg_query::NodeRef::IndexStmt(child) => create_index(child),
        pg_query::NodeRef::AlterTableStmt(child) => alter_table(child),
        _ => Ok(StatementSummary::Ignored),
    }
}

Note that because of the way the protobuf structs are typed, we’re being forced to return Result instead of StatementSummary because something like NodeRef::CreateStmt contains Vec<Node> in its .table_elt attribute, instead of a more narrow type, so there is no guarantee in the types that a table element isn’t something like a CreateMaterializedViewStmt, even though we know that the parser would never do that to us.

Once the translation is done, we have something that’s pretty easy to lint against, but it is still context free. We can’t know if a CREATE INDEX statement is dangerous without knowing whether the table it’s created on is visible to other transactions or not. For that reason, we also need to keep track of some sort of context, or result, of the transaction so far.

I named this concept a TransactionState and it looks like this:

#[derive(Debug, Default, Eq, PartialEq)]
pub struct TransactionState {
    locktimeout: bool,
    created_objects: Vec<(String, String)>,
    altered_tables: Vec<(String, String)>,
    has_access_exclusive: bool,
}
#[derive(Copy, Clone)]
pub struct LintContext<'a> {
    pub(crate) ctx: &'a TransactionState,
    pub(crate) statement: &'a StatementSummary,
}

When bundled together with the StatementSummary, we can start to write some useful rules.

Writing rules

Initially, I wanted to aim for feature parity with the eugene trace command. This is pretty difficult, because the eugene trace command actually runs the SQL scripts, and can inspect their effects, whereas we’re going to be working with a static analysis only. But once I started writing rules, I realized that we may be able to implement most rules that eugene trace has, it’s just going to be less precise.

For example, here’s the check for create index without concurrently:

fn create_index_nonconcurrently(stmt: LintContext) -> Option<String> {
    match stmt.statement {
        StatementSummary::CreateIndex {
            schema,
            idxname,
            target,
            concurrently: false,
            ..
        } if stmt.is_visible(schema, target) => {
            let schema = if schema.is_empty() { "public" } else { schema };
            Some(format!("Statement takes `ShareLock` on `{schema}.{target}`, blocking \
             writes while creating index `{schema}.{idxname}`"))
        }
        _ => None,
    }
}

eugene trace gets to just check if the transaction holds ShareLock on any tables, or whether any new indexes have been created, but we’re checking whether there was a create index without concurrently. This would fail to match statements that implicitly create an index, such as adding a unique constraint.

We’re also “guessing” whether a statement takes AccessExclusiveLock for our warnings about running more statements after taking dangerous locks. But on the whole, this approach has worked out quite well, I would say. Here’s an example of some test cases that pass now:

#[test]
fn test_locktimeout_alter_table_with_timeout() {
    let report =
        anon_lint("set lock_timeout = '2s'; \
        create index books_title_idx on books(title);")
            .unwrap();
    assert!(!matched_lint_rule(&report, rules::LOCKTIMEOUT_WARNING.id()));
}
#[test]
fn test_create_index_concurrently_is_not_dangerous_lock() {
    let report =
        anon_lint("create index concurrently \
         books_title_idx on books(title);").unwrap();
    assert!(!matched_lint_rule(&report, rules::LOCKTIMEOUT_WARNING.id()));
}

#[test]
fn test_create_index_on_existing_table() {
    let report = anon_lint("create index books_title_idx \
           on books(title);").unwrap();
    assert!(matched_lint_rule(
        &report,
        rules::CREATE_INDEX_NONCONCURRENTLY.id()
    ));
}

I have already added specific lints for things like implicit index creation for instances that I am aware of, and I think the lints already pretty robust in total.

Results

There’s a lot of code, but I think it’s been worth it. For what it’s worth, a lot of the new code is pretty simple matching against the two different AST representations and I think writing the rules is easy enough. I know that there are false positives on the lints and that they won’t pick up all the things eugene trace can pick up, but I think they’re useful already. Aside from the table rewrite detection in eugene trace, we can pick up all the other dangerous patterns sometimes.

Since I know the lints can trigger false positives, I’ve had to make sure that the user can disable lints and to make sure that they can be enabled/disable on a per-statement basis. So you can either eugene lint --ignore E3 if you really disagree with the rule that says you shouldn’t add json columns, or you can add a comment to a single statement where there’s an exception to the rule, like this:

-- eugene: ignore E3
alter table books add column data json;

Since this feature seems useful, I’ve added it to eugene trace as well and made both commands fail if any lints are triggered (which you can opt out of with -a or --accept-failures).

All the examples in the examples folder now contain both a linted and a traced variant, to show the difference.

Linting SQL with pq_query.rs#

Navigating pg_query::NodeRef#

Writing rules#

Results#

Linting SQL with pq_query.rs

Navigating `pg_query::NodeRef`

Writing rules

Results