FireSQL Parser
The FireSQL parser consists of two parts: the lexical scanner and the grammar rule module. Python parser generator Lark is used to provide the lexical scanner and grammar rule to parse the FireSQL statement. In the end, the parser execution generates the parse tree, aka. AST (Abstract Syntax Tree). The complexity of the FireSQL syntax requires an equally complex structure that efficiently stores the information needed for executing every possible FireSQL statement.
For example, the AST parse tree for the FireSQL statement
SELECT id, date, email
FROM Bookings
WHERE date = '2022-04-04T00:00:00'
![An Example SQL Parse Tree]({{ site.baseurl }}images/firesql-in-python/sql_parse_tree.jpg)
Figure. Illustration of the parse tree generated by lark
This is delightful to use lark
due to its design philosophy, which clearly separate the grammar specification from processing. The processing is applied to the parse tree by the Visitor or Transformer components.
Visitor and Transformer
Visitors and Transformer provide a convenient interface to process the parse-trees that Lark returns. lark
documentation defines,
Visitors - visit each node of the tree, and run the appropriate method on it according to the node’s data. They work bottom-up, starting with the leaves and ending at the root of the tree.
Transformers - work bottom-up (or depth-first), starting with visiting the leaves and working their way up until ending at the root of the tree.
For each node visited, the transformer will call the appropriate method (callbacks), according to the node’s
data
, and use the returned value to replace the node, thereby creating a new tree structure.Transformers can be used to implement map & reduce patterns. Because nodes are reduced from leaf to root, at any point the callbacks may assume the children have already been transformed.
Using Visitor is simple at first, but you need to know exactly what you’re fetching, the children chain can be difficult to navigate depending on the grammar which produce the parsed tree.
We decided to use Transformer to transform the parse tree to the corresponding SQL component objects that can be easily consumed by the subsequent processing.
For instance, the former example parse tree is transformed into SQL components as,
SQL_Select(
columns=[SQL_ColumnRef(table=None, column='id'),
SQL_ColumnRef(table=None, column='date'),
SQL_ColumnRef(table=None, column='email')],
froms=[SQL_SelectFrom(part='Bookings', alias=None)],
where=SQL_BinaryExpression(operator='==',
left=SQL_ColumnRef(table=None, column='date'),
right=SQL_ValueString(value='2022-04-04T00:00:00'))
)
With this transformed data structure, we can write the processor walking through the components and produce a execution plan to the corresponding Firestore queries.
Just Enough SQL for FireSQL
To get going, we don’t need the full SQL parser and transformer for the DML (Data Manipulation Language). We define ONLY the SELECT
statement, just enough for Firestore collections query to serve our immediate needs.
FireSQL Grammar
A grammar is a formal description of a language that can be used to recognize its structure. The most used format to describe grammars is the Extended Backus-Naur Form (EBNF). A typical rules in a Backus-Naur grammar looks like this:
where_clause ::= bool_expression
bool_expression ::= bool_parentheses
| bool_expression "AND" bool_parentheses
| bool_expression "OR" bool_parentheses
bool_parentheses ::= comparison_type
| "(" bool_expression "AND" comparison_type ")"
| "(" bool_expression "OR" comparison_type ")"
...
CNAME ::= ("_"|"/"|LETTER) ("_"|"/"|LETTER|DIGIT)*
...
The where_clause
is usually nonterminal, which means that it can be replaced by the group of elements on the right, bool_expression
. The element bool_expression
could contains other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol>
anywhere in the grammar and capitalized. A typical example of a terminal symbol is a string of characters, like “(”, “)”, “AND”, “OR”, “CNAME”.
Collection Path
The Firestore collection has a set of documents. Each document can be nested with more collections. Firestore identifies a collection by a path, looks like Companies/bennycorp/Users
means Companies
collection has a document bennycorp
, which has Users
collection.
If we want to query a nested collection, we can specify the collection name as a path.
The paths can be long but we can use AS
keyword to define their alias names.
For example, the subcollection Users
and Bookings
are specified with Companies/bennycorp
document.
SELECT u.email, u.state, b.date, b.state
FROM
Companies/bennycorp/Users as u JOIN Companies/bennycorp/Bookings as b
ON u.email = b.email
WHERE
u.state = 'ACTIVE' AND
b.date >= '2022-03-18T04:00:00'
Interesting Firestore Fact: collection path must have odd number of parts.
Document Field and Sub-field
Since Firestore document field can have nested sub-field, FireSQL statement column reference can reach the document sub-fields by quoted string, using the "
to escape the field name with .
in it. The quoted string can be used anywhere that a column reference is allowed.
For example, the Users
document’s location
field, which has a sub-field displayName
. The sub-field can be reached by "location.displayName"
SELECT email, "location.displayName"
FROM Users
WHERE "location.displayName" = 'Work From Home'
Document ID
Firestore has a unique “document ID” that associated with each document. The document ID is not part of the document fields that we need to provide special handling to access. FireSQL introduced a special field docid
to let any statement to reference to the unique “document ID”.
For example, we can select where the document equals to a specific docid
in the Users
collection. Even though the document does not have docid
field, we can also project the docid
value in the output.
SELECT docid, email
FROM Users
WHERE docid = '4LLlLw6tZicB40HrjhDJNmvaTYw1'
Due to Firestore admin API limitations, we can ONLY express =
equal or IN
operators with docid
.
For example, the following statement will find documents that in the specified array of docid
.
SELECT docid, email
FROM Users
WHERE docid IN ('4LLlLw6tZicB40HrjhDJNmvaTYw1', '74uWntZuVPeYcLVcoS0pFApGPdr2')
More interesting, if we want to project all the fields, including the docid
. We can do the select statement like,
docid
and *
are projected in the output.
SELECT docid, *
FROM Users
WHERE "location.displayName" = 'Work From Home'
DateTime Type
Consistent description of date-time is a big topic that we made a practical design choice.
We are using ISO-8601 to express the date-time as a string,
while Firestore stores the date-time as Timestamp
data type in UTC.
For example,
writing “March 18, 2022, at time 4 Hours in UTC” date-time string, is “2022-03-18T04:00:00”.
writing “March 18, 2022, at time 0 Hours in Toronto Time EDT (-4 hours)” date-time string, is “2022-03-18T00:00:00-04”.
If in doubt, we are using the following to convert, match and render to the ISO-8601 string for date-time values.
DATETIME_ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"
DATETIME_ISO_FORMAT_REGEX = r'^(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(\.[0-9]+)?(Z|[+-](?:2[0-3]|[01][0-9]):[0-5][0-9])?$'
Pattern Matching by LIKE
The SQL expression LIKE
or NOT LIKE
can be used for matching string data.
SELECT docid, email, state
FROM
Users
WHERE
state IN ('ACTIVE') AND
email LIKE '%benny%'
After the Firebase query, the pattern matching is used as the filtering expression. The SQL processor supports pattern for:
prefix match
pattern%
suffix match
%pattern
infix match
%pattern%
JSON Data
PyFireSQL provides JSON data supports, in particular, for the INSERT
and UPDATE
statements that must take complex data types.
When the field value needs to take the complex data types, such as array or map (aka. Python dict),
these complex data types must be encoded within a JSON enclosure. The JSON enclosure can interpret any valid JSON object;
subsequently translates into the corresponding Firestore supported data types.
For example,
INSERT INTO Companies/bennycorp/Visits
( email, event )
VALUES
( 'btscheung+test1@gmail.com', JSON(["event1","event2","event3"]) )
When the collection Visits
has a field event
which takes an array of event names,
we assign event
field by using the JSON
enclosure to encode the array ["event1","event2","event3"]
with a valid JSON string.
Since we are dealing with Firestore as a document structure without a schema, we can insert all the key pairs from a JSON map into the collection.
For example, the following insert statement - column specification uses *
to indicate all fields.
We are inserting a list of
email
, firstName
, lastName
, groups
(as array), roles
(as array), vaccination
, access
(as map).
INSERT INTO Companies/bennycorp/Users
( * )
VALUES (
JSON(
{
"email": "btscheung+twotwo@gmail.com",
"name": "Benny TwoTwo",
"groups": [],
"roles": [
"ADMIN"
],
"vaccination": null,
"access": {
"hasAccess": true
}
}
)
)