Lately, I have been working quite a bit on detecting dangerous patterns in migration scripts over at the eugene repository. A major feature I’ve added is syntax tree analysis, so that we can pick up some patterns without having to run the SQL scripts. This isn’t quite as precise as running the scripts, but it’s a lot faster and still catches quite a few common mistakes. So let’s take a look at how it works!
Linting SQL with pg_query.rs
One of the fun things about writing eugene in Rust is that I get to use native libraries. This is a pretty useful capability to have, as it turns out that SQL is a huge language that is painful to parse, and there is an excellent open source parser for it in Postgres itself. pg_query.rs is a Rust wrapper around the Postgres parser, and it’s been really easy to start using it, just by running cargo add pg_query.
Note that this doesn’t mean it’s easy to use! Or rather, it’s not that hard to use, but even though someone else wrote the parser, the language itself is still huge and complex. But it’s done now: eugene lint can parse SQL and give some useful hints:
eugene lint -f json examples/alter_column_unique.sql
{
"name": "examples/alter_column_unique.sql",
"lints": [
{
"statement_number": 1,
"sql": "alter table books add constraint unique_title unique(title)",
"lints": [
{
"id": "E7",
"name": "Creating a new unique constraint",
"condition": "Found a new unique constraint and a new index",
"effect": "This blocks all writes to the table while the index is being created and validated",
"workaround": "`CREATE UNIQUE INDEX CONCURRENTLY`, then add the constraint using the index",
"help": "New constraint unique_title creates implicit index on `public.books`, blocking writes until index is created and validated"
},
{
"id": "E9",
"name": "Taking dangerous lock without timeout",
"condition": "A lock that would block many common operations was taken without a timeout",
"effect": "This can block all other operations on the table indefinitely if any other transaction holds a conflicting lock while `idle in transaction` or `active`",
"workaround": "Run `SET LOCAL lock_timeout = '2s';` before the statement and retry the migration if necessary",
"help": "Statement takes lock on `public.books`, but does not set a lock timeout"
}
]
}
]
}
Splitting a SQL script into statements is super simple:
/// Separate SQL script into statements
pub fn sql_statements(sql: &str) -> Result<Vec<&str>> {
Ok(pg_query::split_with_parser(sql)?
.into_iter()
.map(|s| s.trim())
.collect())
}
Parsing a statement is also pretty simple: you just call pg_query::parse. The hard part starts once you have gotten the syntax tree.
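To give an idea of how the pieces fit together, here’s a minimal sketch of a driver loop. I’m assuming the protobuf layout that pg_query.rs generates, with a stmts list of RawStmt wrappers, each holding an optional root Node:

/// Sketch: split a script, parse each statement, and print the root
/// node of every syntax tree (assumes the protobuf layout described above)
fn parse_script(sql: &str) -> anyhow::Result<()> {
    for stmt in sql_statements(sql)? {
        let parsed = pg_query::parse(stmt)?;
        // One RawStmt per statement; the syntax tree hangs off `stmt`
        for raw in &parsed.protobuf.stmts {
            if let Some(node) = raw.stmt.as_ref().and_then(|s| s.node.as_ref()) {
                println!("{node:?}");
            }
        }
    }
    Ok(())
}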
Navigating pg_query::NodeRef
At the start, I tried to work with the enum pg_query::NodeRef. This uses a nice pattern I hadn’t seen before; here’s an excerpt:
#[derive(Debug, Copy, Clone)]
pub enum NodeRef<'a> {
Alias(&'a protobuf::Alias),
RangeVar(&'a protobuf::RangeVar),
TableFunc(&'a protobuf::TableFunc),
// ...
}
Each of the variants of NodeRef contains a reference to a protobuf struct (generated by prost) that holds the actual data. This is much nicer to work with than putting the data directly in the enum variants, because you can type a function to accept “the variant”, so it’s clear what kind of data it expects, and you avoid having to match on the enum in more than one place. This makes it feel a bit like a Scala sealed trait with case classes in terms of capabilities, which is nice. The disadvantage is that there’s an extra layer of unpacking to do in matches; to get at the data, you’d have to do something like this to bind the aliasname field of an Alias:
fn example(stmt: pg_query::NodeRef) {
match stmt {
NodeRef::Alias(pg_query::protobuf::Alias { aliasname, .. }) => {} // Do something with `aliasname`
_ => {}
}
}
NodeRef and its sibling Node turn up in a lot of places in the protobuf structs, and they are a bit awkward to work with, given all the borrowing, referencing, and that extra layer. For this reason, I decided to translate the part of the grammar I wanted to analyze into something easier to work with, which I named StatementSummary:
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StatementSummary {
Ignored,
LockTimeout,
CreateTable {
// ...
},
CreateTableAs {
// ...
},
CreateIndex {
// ...
},
AlterTable {
// ...
},
}
pub fn describe(statement: &pg_query::NodeRef) -> anyhow::Result<StatementSummary> {
match statement {
pg_query::NodeRef::VariableSetStmt(child) => set_statement(child),
pg_query::NodeRef::CreateStmt(child) => create_table(child),
pg_query::NodeRef::CreateTableAsStmt(child) => create_table_as(child),
pg_query::NodeRef::IndexStmt(child) => create_index(child),
pg_query::NodeRef::AlterTableStmt(child) => alter_table(child),
_ => Ok(StatementSummary::Ignored),
}
}
Note that because of the way the protobuf structs are typed, we’re forced to return Result instead of a plain StatementSummary: something like NodeRef::CreateStmt contains Vec<Node> in its .table_elts attribute, instead of a more narrow type, so there is no guarantee in the types that a table element isn’t something like a CreateMaterializedViewStmt, even though we know that the parser would never do that to us.
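To give a concrete feel for the translation step, here’s a sketch of what a translator for CREATE INDEX could look like. The real variant’s fields are elided above, so I’m assuming it carries just the fields the rule further down matches on (schema, idxname, target, concurrently):

fn create_index(stmt: &pg_query::protobuf::IndexStmt) -> anyhow::Result<StatementSummary> {
    // The target relation is optional in the protobuf types, even though
    // the parser always fills it in for CREATE INDEX
    let rel = stmt
        .relation
        .as_ref()
        .ok_or_else(|| anyhow::anyhow!("CREATE INDEX without a target relation"))?;
    Ok(StatementSummary::CreateIndex {
        schema: rel.schemaname.clone(),
        target: rel.relname.clone(),
        idxname: stmt.idxname.clone(),
        concurrently: stmt.concurrent,
    })
}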
Once the translation is done, we have something that’s pretty easy to lint against, but it is still context free. We can’t know if a CREATE INDEX statement is dangerous without knowing whether the table it’s created on is visible to other transactions or not. For that reason, we also need to keep track of some sort of context, or result, of the transaction so far. I named this concept a TransactionState and it looks like this:
#[derive(Debug, Default, Eq, PartialEq)]
pub struct TransactionState {
locktimeout: bool,
created_objects: Vec<(String, String)>,
altered_tables: Vec<(String, String)>,
has_access_exclusive: bool,
}
#[derive(Copy, Clone)]
pub struct LintContext<'a> {
pub(crate) ctx: &'a TransactionState,
pub(crate) statement: &'a StatementSummary,
}
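The state presumably gets built up by folding the summaries over the script, statement by statement. Here’s a rough sketch of what that could look like; the CreateTable field names are my assumption, since the variant’s fields are elided above:

// Rough sketch of updating the running state after each statement; the
// CreateTable field names (`schema`, `name`) are assumptions
fn update_state(state: &mut TransactionState, summary: &StatementSummary) {
    match summary {
        StatementSummary::LockTimeout => state.locktimeout = true,
        StatementSummary::CreateTable { schema, name, .. } => state
            .created_objects
            .push((schema.clone(), name.clone())),
        _ => {}
    }
}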
When bundled together with the StatementSummary, we can start to write some useful rules.
Writing rules
Initially, I wanted to aim for feature parity with the eugene trace command. This is pretty difficult, because eugene trace actually runs the SQL scripts and can inspect their effects, whereas we’re working with static analysis only. But once I started writing rules, I realized that we may be able to implement most of the rules eugene trace has; they’re just going to be less precise.
For example, here’s the check for create index without concurrently:
fn create_index_nonconcurrently(stmt: LintContext) -> Option<String> {
match stmt.statement {
StatementSummary::CreateIndex {
schema,
idxname,
target,
concurrently: false,
..
} if stmt.is_visible(schema, target) => {
let schema = if schema.is_empty() { "public" } else { schema };
Some(format!("Statement takes `ShareLock` on `{schema}.{target}`, blocking \
writes while creating index `{schema}.{idxname}`"))
}
_ => None,
}
}
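The is_visible guard is what makes the rule context sensitive: an index built on a table created earlier in the same transaction blocks nobody, because no other transaction can see the table yet. A minimal sketch of how such a check could work against the transaction state (the actual implementation in eugene may differ):

impl LintContext<'_> {
    // Sketch: an object is visible to other transactions unless this
    // transaction created it
    fn is_visible(&self, schema: &str, name: &str) -> bool {
        !self
            .ctx
            .created_objects
            .iter()
            .any(|(s, n)| s.as_str() == schema && n.as_str() == name)
    }
}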
eugene trace gets to just check if the transaction holds ShareLock on any tables, or whether any new indexes have been created, but we’re checking whether there was a create index without concurrently. This would fail to match statements that implicitly create an index, such as adding a unique constraint.
We’re also “guessing” whether a statement takes AccessExclusiveLock for our warnings about running more statements after taking dangerous locks. But on the whole, this approach has worked out quite well, I would say. Here are some test cases that pass now:
#[test]
fn test_locktimeout_alter_table_with_timeout() {
let report =
anon_lint("set lock_timeout = '2s'; \
create index books_title_idx on books(title);")
.unwrap();
assert!(!matched_lint_rule(&report, rules::LOCKTIMEOUT_WARNING.id()));
}
#[test]
fn test_create_index_concurrently_is_not_dangerous_lock() {
let report =
anon_lint("create index concurrently \
books_title_idx on books(title);").unwrap();
assert!(!matched_lint_rule(&report, rules::LOCKTIMEOUT_WARNING.id()));
}
#[test]
fn test_create_index_on_existing_table() {
let report = anon_lint("create index books_title_idx \
on books(title);").unwrap();
assert!(matched_lint_rule(
&report,
rules::CREATE_INDEX_NONCONCURRENTLY.id()
));
}
I have already added specific lints for things like implicit index creation, for the instances that I am aware of, and I think the lints are already pretty robust in total.
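For instance, a test mirroring the E7 example from the top of the post might look something like this; the rule constant name here is my guess, not necessarily what eugene calls it:

#[test]
fn test_unique_constraint_creates_implicit_index() {
    // Mirrors the E7 example from the JSON output above; the rule
    // constant name is assumed for illustration
    let report = anon_lint(
        "alter table books add constraint unique_title unique(title);",
    )
    .unwrap();
    assert!(matched_lint_rule(&report, rules::NEW_UNIQUE_CONSTRAINT.id()));
}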
Results
There’s a lot of code, but I think it’s been worth it. For what it’s worth, a lot of the new code is pretty simple matching against the two different AST representations, and I think writing the rules is easy enough. I know that there are false positives on the lints, and that they won’t pick up everything eugene trace can pick up, but I think they’re useful already. Aside from the table rewrite detection in eugene trace, the lints can pick up all the other dangerous patterns, at least some of the time.
Since I know the lints can trigger false positives, I’ve had to make sure that the user can disable lints, and that they can be enabled or disabled on a per-statement basis. So you can either run eugene lint --ignore E3 if you really disagree with the rule that says you shouldn’t add json columns, or you can add a comment to a single statement where there’s an exception to the rule, like this:
-- eugene: ignore E3
alter table books add column data json;
Since this feature seems useful, I’ve added it to eugene trace as well, and made both commands fail if any lints are triggered (which you can opt out of with -a or --accept-failures).
All the examples in the examples folder now contain both a linted and a traced variant, to show the difference.