FireSQL Parser

The FireSQL parser consists of two parts: the lexical scanner and the grammar rule module. Python parser generator Lark is used to provide the lexical scanner and grammar rule to parse the FireSQL statement. In the end, the parser execution generates the parse tree, aka. AST (Abstract Syntax Tree). The complexity of the FireSQL syntax requires an equally complex structure that efficiently stores the information needed for executing every possible FireSQL statement.

For example, the AST parse tree for the FireSQL statement

SELECT id, date, email
  FROM Bookings
  WHERE date = '2022-04-04T00:00:00'

![An Example SQL Parse Tree]({{ site.baseurl }}images/firesql-in-python/sql_parse_tree.jpg)

Figure. Illustration of the parse tree generated by lark

This is delightful to use lark due to its design philosophy, which clearly separate the grammar specification from processing. The processing is applied to the parse tree by the Visitor or Transformer components.

Visitor and Transformer

Visitors and Transformer provide a convenient interface to process the parse-trees that Lark returns. lark documentation defines,

Visitors - visit each node of the tree, and run the appropriate method on it according to the node’s data. They work bottom-up, starting with the leaves and ending at the root of the tree.
Transformers - work bottom-up (or depth-first), starting with visiting the leaves and working their way up until ending at the root of the tree.
- For each node visited, the transformer will call the appropriate method (callbacks), according to the node’s data, and use the returned value to replace the node, thereby creating a new tree structure.
- Transformers can be used to implement map & reduce patterns. Because nodes are reduced from leaf to root, at any point the callbacks may assume the children have already been transformed.

Using Visitor is simple at first, but you need to know exactly what you’re fetching, the children chain can be difficult to navigate depending on the grammar which produce the parsed tree.

We decided to use Transformer to transform the parse tree to the corresponding SQL component objects that can be easily consumed by the subsequent processing.

For instance, the former example parse tree is transformed into SQL components as,

SQL_Select(
  columns=[SQL_ColumnRef(table=None, column='id'),
           SQL_ColumnRef(table=None, column='date'),
           SQL_ColumnRef(table=None, column='email')],
  froms=[SQL_SelectFrom(part='Bookings', alias=None)],
  where=SQL_BinaryExpression(operator='==',
                             left=SQL_ColumnRef(table=None, column='date'),
                             right=SQL_ValueString(value='2022-04-04T00:00:00'))
)

With this transformed data structure, we can write the processor walking through the components and produce a execution plan to the corresponding Firestore queries.

Just Enough SQL for FireSQL

To get going, we don’t need the full SQL parser and transformer for the DML (Data Manipulation Language). We define ONLY the SELECT statement, just enough for Firestore collections query to serve our immediate needs.

FireSQL Grammar

A grammar is a formal description of a language that can be used to recognize its structure. The most used format to describe grammars is the Extended Backus-Naur Form (EBNF). A typical rules in a Backus-Naur grammar looks like this:

  where_clause ::= bool_expression
  bool_expression ::= bool_parentheses
                      | bool_expression "AND" bool_parentheses
                      | bool_expression "OR" bool_parentheses
  bool_parentheses ::= comparison_type
                       | "(" bool_expression "AND" comparison_type ")"
                       | "(" bool_expression "OR" comparison_type ")"
  ...
  CNAME ::= ("_"|"/"|LETTER) ("_"|"/"|LETTER|DIGIT)*
  ...

The where_clause is usually nonterminal, which means that it can be replaced by the group of elements on the right, bool_expression. The element bool_expression could contains other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar and capitalized. A typical example of a terminal symbol is a string of characters, like “(”, “)”, “AND”, “OR”, “CNAME”.

Collection Path

The Firestore collection has a set of documents. Each document can be nested with more collections. Firestore identifies a collection by a path, looks like Companies/bennycorp/Users means Companies collection has a document bennycorp, which has Users collection.

If we want to query a nested collection, we can specify the collection name as a path. The paths can be long but we can use AS keyword to define their alias names.

For example, the subcollection Users and Bookings are specified with Companies/bennycorp document.

SELECT u.email, u.state, b.date, b.state
  FROM
    Companies/bennycorp/Users as u JOIN Companies/bennycorp/Bookings as b
    ON u.email = b.email
  WHERE 
      u.state = 'ACTIVE' AND
      b.date >= '2022-03-18T04:00:00'

Interesting Firestore Fact: collection path must have odd number of parts.

Document Field and Sub-field

Since Firestore document field can have nested sub-field, FireSQL statement column reference can reach the document sub-fields by quoted string, using the " to escape the field name with . in it. The quoted string can be used anywhere that a column reference is allowed.

For example, the Users document’s location field, which has a sub-field displayName. The sub-field can be reached by "location.displayName"

  SELECT email, "location.displayName"
  FROM Users
  WHERE "location.displayName" = 'Work From Home'

Document ID

Firestore has a unique “document ID” that associated with each document. The document ID is not part of the document fields that we need to provide special handling to access. FireSQL introduced a special field docid to let any statement to reference to the unique “document ID”.

For example, we can select where the document equals to a specific docid in the Users collection. Even though the document does not have docid field, we can also project the docid value in the output.

  SELECT docid, email
  FROM Users
  WHERE docid = '4LLlLw6tZicB40HrjhDJNmvaTYw1'

Due to Firestore admin API limitations, we can ONLY express = equal or IN operators with docid. For example, the following statement will find documents that in the specified array of docid.

  SELECT docid, email
  FROM Users
  WHERE docid IN ('4LLlLw6tZicB40HrjhDJNmvaTYw1', '74uWntZuVPeYcLVcoS0pFApGPdr2')

More interesting, if we want to project all the fields, including the docid. We can do the select statement like, docid and * are projected in the output.

  SELECT docid, *
  FROM Users
  WHERE "location.displayName" = 'Work From Home'

DateTime Type

Consistent description of date-time is a big topic that we made a practical design choice. We are using ISO-8601 to express the date-time as a string, while Firestore stores the date-time as Timestamp data type in UTC. For example,

writing “March 18, 2022, at time 4 Hours in UTC” date-time string, is “2022-03-18T04:00:00”.
writing “March 18, 2022, at time 0 Hours in Toronto Time EDT (-4 hours)” date-time string, is “2022-03-18T00:00:00-04”.

If in doubt, we are using the following to convert, match and render to the ISO-8601 string for date-time values.

DATETIME_ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"
DATETIME_ISO_FORMAT_REGEX = r'^(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(\.[0-9]+)?(Z|[+-](?:2[0-3]|[01][0-9]):[0-5][0-9])?$'

Pattern Matching by LIKE

The SQL expression LIKE or NOT LIKE can be used for matching string data.

SELECT docid, email, state
  FROM
    Users
  WHERE
    state IN ('ACTIVE') AND
    email LIKE '%benny%'

After the Firebase query, the pattern matching is used as the filtering expression. The SQL processor supports pattern for:

prefix match pattern%
suffix match %pattern
infix match %pattern%

JSON Data

PyFireSQL provides JSON data supports, in particular, for the INSERT and UPDATE statements that must take complex data types. When the field value needs to take the complex data types, such as array or map (aka. Python dict), these complex data types must be encoded within a JSON enclosure. The JSON enclosure can interpret any valid JSON object; subsequently translates into the corresponding Firestore supported data types.

For example,

INSERT INTO Companies/bennycorp/Visits
  ( email, event )
  VALUES
    ( 'btscheung+test1@gmail.com', JSON(["event1","event2","event3"]) )

When the collection Visits has a field event which takes an array of event names, we assign event field by using the JSON enclosure to encode the array ["event1","event2","event3"] with a valid JSON string.

Since we are dealing with Firestore as a document structure without a schema, we can insert all the key pairs from a JSON map into the collection.

For example, the following insert statement - column specification uses * to indicate all fields. We are inserting a list of email, firstName, lastName, groups (as array), roles (as array), vaccination, access (as map).

INSERT INTO Companies/bennycorp/Users
  ( * )
  VALUES (
    JSON(
      {
        "email": "btscheung+twotwo@gmail.com",
        "name": "Benny TwoTwo",
        "groups": [],
        "roles": [
            "ADMIN"
        ],
        "vaccination": null,
        "access": {
          "hasAccess": true
        }
      }
    )
  )