Roast is an optimized JSON format for Rego ASTs, as well as some common utilities for working with it.

Roast is used by Regal, where the JSON representation of Rego's AST is used as input for static analysis performed by Rego itself to determine whether policies conform to Regal's linter rules. The format aims to be:

- Fast to traverse and process in Rego
- Usable without having to deal with quirks and inconsistencies
- As easy to read as the original AST JSON format
While this module provides a way to encode an `ast.Module` to an optimized JSON format, it does not provide a decoder. In other words, there's no way to turn optimized AST JSON back into an `ast.Module` (or other AST types). While this would be possible to do, there's no real need for it given our current use case for this format, which is to help work with the AST efficiently in Rego. Roast should not be considered a general-purpose format for serializing the Rego AST.
The following section outlines the differences between the original AST JSON format and the Roast format.
Perhaps the most visually apparent change to the AST JSON format is how `location` attributes are represented. These attributes are everywhere in the AST, so optimizing them for fast traversal has a huge impact on both the size of the format and the speed at which it can be processed.

In the original AST, a location is represented as an object:
```json
{
  "file": "p.rego",
  "row": 5,
  "col": 1,
  "text": "Y29sbGVjdGlvbg=="
}
```
And in the optimized format as a string:

```json
"5:1:5:11"
```
The first two numbers are present in both formats, i.e. `row` and `col`. In the optimized format, the third and fourth numbers are the end location, determined by the length of the decoded `text` attribute. In this case, `Y29sbGVjdGlvbg==` decodes to `collection`, which is 10 characters long. The end location is therefore `5:11`. The text can later be retrieved when needed, using the original source document as a lookup table of sorts.
While this may come with a small cost when the `location` is actually needed, it's a huge win when it's not. Having to split the string and parse the row and column values incurs some overhead, but only a small percentage of `location` attributes are commonly used in practice.
Note that the `file` attribute is omitted entirely in the optimized format, as it would otherwise have to be repeated for each `location` value. The filename can easily be retrieved by other means.
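As an illustration of the parsing cost mentioned above, consuming code can split a compact location string back into its four components. This is a hypothetical sketch; the `position` type and `parseLocation` function are not part of the Roast API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// position holds the components of a compact "row:col:endRow:endCol" string.
type position struct {
	Row, Col, EndRow, EndCol int
}

// parseLocation splits a compact location string into its four numeric
// components. This is the small parsing cost paid only for the few
// locations a consumer actually inspects.
func parseLocation(s string) (position, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 4 {
		return position{}, fmt.Errorf("expected 4 components, got %d", len(parts))
	}
	nums := make([]int, 4)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return position{}, err
		}
		nums[i] = n
	}
	return position{nums[0], nums[1], nums[2], nums[3]}, nil
}

func main() {
	pos, _ := parseLocation("5:1:5:11")
	fmt.Printf("%+v\n", pos) // {Row:5 Col:1 EndRow:5 EndCol:11}
}
```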
Rego rules don't necessarily have a body, or at least not one that's printed. Examples of this include:

```rego
package policy

default rule := "value"

map["key"] := "value"

collection contains "value"
```
OPA represents such rules internally (that is, in the AST) as having a body with a single expression containing the boolean value `true`. This creates a uniform way to represent rules, so a rule like:

```rego
collection contains "value"
```

Would in the AST be identical to:

```rego
collection contains "value" if {
	true
}
```
And in the OPA JSON AST format:

```json
{
  "body": [
    {
      "index": 0,
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 1,
        "text": "Y29sbGVjdGlvbg=="
      },
      "terms": {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "boolean",
        "value": true
      }
    }
  ],
  "head": {
    "name": "collection",
    "key": {
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 21,
        "text": "InZhbHVlIg=="
      },
      "type": "string",
      "value": "value"
    },
    "ref": [
      {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "var",
        "value": "collection"
      }
    ],
    "location": {
      "file": "p.rego",
      "row": 5,
      "col": 1,
      "text": "Y29sbGVjdGlvbiBjb250YWlucyAidmFsdWUi"
    }
  },
  "location": {
    "file": "p.rego",
    "row": 5,
    "col": 1,
    "text": "Y29sbGVjdGlvbg=="
  }
}
```
Notice how there's 20 lines of JSON just to represent the body, even though there isn't really one!
The optimized Rego AST format discards generated bodies entirely, and the same rule would be represented as:
```json
{
  "head": {
    "location": "5:1:5:11",
    "name": "collection",
    "ref": [
      {
        "location": "5:1:5:11",
        "type": "var",
        "value": "collection"
      }
    ],
    "key": {
      "type": "string",
      "value": "value",
      "location": "5:21:5:27"
    }
  },
  "location": "5:1:5:11"
}
```
Note that this applies equally to empty `else` bodies, which are represented the same way in the original AST and omitted entirely in the optimized format.
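To illustrate what an encoder has to recognize before it can drop such a body, here is a minimal heuristic sketch using simplified stand-in types. These are not OPA's actual types, and OPA internally tracks generated expressions more precisely than a value check; this only shows the shape of the synthetic `true` body described above:

```go
package main

import "fmt"

// Simplified stand-ins for the relevant parts of the AST shown above.
type term struct {
	Type  string
	Value any
}

type expr struct {
	Terms term
}

type rule struct {
	Body []expr
}

// looksGenerated reports whether a rule body has the shape of the synthetic
// single-expression `true` body OPA inserts for bodiless rules, which the
// optimized format omits from the serialized output entirely.
func looksGenerated(r rule) bool {
	return len(r.Body) == 1 &&
		r.Body[0].Terms.Type == "boolean" &&
		r.Body[0].Terms.Value == true
}

func main() {
	r := rule{Body: []expr{{Terms: term{Type: "boolean", Value: true}}}}
	fmt.Println(looksGenerated(r)) // true
}
```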
Similarly, Roast discards `location` attributes from nodes that don't have an actual location in the source code. An example of this is the `data` term of a package path, which is present only in the AST.
OPA already attaches annotations to rules. With the Roast format attaching `package`- and `subpackages`-scoped annotations to the `package` as well, there is no need to store annotations at the module level, as that's effectively just duplicating data. Removing this can save a considerable amount of space in well-documented policies, as they should be!
In the original AST, each expression in a body carries a numeric `index` attribute. While this doesn't take much space, it is largely redundant, as the same number can be inferred from the position of the expression in the body array. It has therefore been removed from the Roast format.
The `name` attribute found in the OPA AST for rules is unreliable, as it's not always present. The `ref` attribute, however, always is. While this doesn't come with any real cost in terms of AST size or performance, consistency is key.
A few inconsistencies exist in the original AST JSON format:

- `comments` attributes having a `Text` attribute rather than the expected `text`
- `comments` attributes having a `Location` attribute rather than the expected `location`

Fixing these in the original format would be a breaking change. The Roast format corrects these inconsistencies, and uses `text` and `location` consistently.
While the numbers may vary somewhat, the Roast format is currently about 40-50% smaller than the original AST JSON format, and can be processed (in Rego, using `walk` and so on) about 1.25 times faster.
While it's not known how much of an impact it has on traversal speed and performance, a large chunk of the bytes in the current optimized AST are base64-encoded `text` values in `location` strings. These could be replaced with the end location of any given node. Rather than base64-encoding the bytes of the text during serialization, we could instead count the number of newlines in the text and, from that plus the number of bytes on the last line (if there is more than one), determine the end location. It would then be assumed that the client has the means to translate that into the equivalent of `text` where necessary. Regal, for example, could easily do this from `input.regal.file.lines`.
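The newline-counting scheme could be sketched as follows. This is illustrative, not code from the Roast module, and it assumes the "one past the last character" end-column convention implied by the `5:1:5:11` example earlier:

```go
package main

import (
	"fmt"
	"strings"
)

// endLocation computes the end row and column of a node from its starting
// position and raw source text, by counting newlines rather than carrying
// the (base64-encoded) text itself.
func endLocation(row, col int, text string) (int, int) {
	newlines := strings.Count(text, "\n")
	endRow := row + newlines
	if newlines == 0 {
		// Single-line node: the end column is offset from the start column.
		return endRow, col + len(text)
	}
	// Multi-line node: the end column is derived from the number of bytes
	// on the last line, counted from column 1.
	lastLine := text[strings.LastIndexByte(text, '\n')+1:]
	return endRow, len(lastLine) + 1
}

func main() {
	row, col := endLocation(5, 1, "collection")
	fmt.Println(row, col) // 5 11
}
```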