Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing SQL for data observability.
Query history of a data warehouse is a rich source of information to glean how data is used in your organization. Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can extract:
- Popular tables and columns
- Unused tables and columns
- Column-level lineage
- Freshness
These statistics also help to automate common data engineering tasks like:
- Backup and Disaster Recovery
- Triage Data Quality issues
- Track sensitive data and how they are used.
Challenges and Approaches Link to heading
SQL language is an ISO/IEC standard and the latest version is SQL2016. However, every database implements the standard differently, uses different function names for the same operation, and has extensions to access specific custom features. Therefore, there isn’t one SQL parser for dialects of all popular databases and data warehouses.
Regular expressions is a popular approach to extract information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, windows clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
An alternate approach is to implement the SQL grammar using parser generators like ANTLR. There are similar open source parser generators in other popular languages.
There are multiple projects that maintain parsers for popular open source databases like MySQL and Postgres. For other open source databases, the grammar can be extracted from the open-source project. For commercial databases, the only option is to reverse engineer the complete grammar. There are SQL parser/optimizer platforms like Apache Calcite that help to reduce the effort to implement the SQL dialect of your choice.
Open Source Parsers Link to heading
Some popular open source databases and data warehouses are:
MySQL/MariaDB Link to heading
- Pingcap parser is a MySQL parser in Go.
- SQL Parser in phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect.
Postgres Link to heading
libpg_query extracts the parser (written in C) from the postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:
- Python: pglast
- Ruby : pg_query
- Golang: pg_query_go
- JS: psql-parser in Node and pg-query-emscripten in the browser
- Rust: pg_query.rs
Multiple Engines Link to heading
- queryparser implements Apache Hive, Presto/Trino and Vertica dialects.
- zetasql implements BigQuery, Spanner, and Dataflow dialects.
Generic Parsers Link to heading
- Python: sqlparse
- Rust: sqlparser-rs
- Python: mo-sql-parsing
Platforms Link to heading
Parser/Optimizer platforms implement the common SQL language features and allow customization as first-class feature of the platform. Two popular open source projects are:
- Apache Calcite is a popular parser/optimizer that is used in popular databases and query engines like Apache Hive, BlazingSQL and many others.
- JSQLParser can parse multiple SQL dialects like MySQL, Postgres and Oracle. The grammar can be modified to support other SQL dialects.
Apache Calcite allows customizations at various points of the parsing process.
- Parser rules can be changed to support custom syntax.
- Conventions such as quotes vs double quotes, case sensitivity.
- Add optimizer rules.
Apache Calcite also provides visitors for traversing the SQL execution plan. A Visitor pattern is an algorithm to traverse a SQL plan.
Practical tips to Getting Started Link to heading
There are many abandoned open source SQL parsers. The first filter is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.
What if there is no parser for your database?
Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries.
Many queries will fail to parse such as UNLOAD in AWS Redshift. If it is important to also parse the variants, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.
Conclusion Link to heading
There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good open-source projects. However, there is a steep learning curve to use these projects and in many cases a project may not fit your specific requirements.