In Kast there is no builtin syntax. Except for syntax for adding new syntax.
The standard syntax is defined in std/syntax.ks
, but you can ignore it and define your own instead.
When parsing code into AST, Kast treats everything as an expression, and every syntax definition is extending what an expression can be.
For example, you could define a ternary if like in Python like so:
syntax ternary-if -> 20 = true_value "if" condition "else" false_value
This defines syntax for ternary-if
(which should be defined somewhere)
which is right associative (->
), has priority (precedence) of 20
,
and is defined as true_value "if" condition "else" false_value
where
true_value
, condition
and false_value
can be anything defined by
other syntax rules with higher priority (or be a simple single-token value),
and the if
and else
are "keywords".
Keywords can also be punctuation, so a ternary operator from C could be defined as:
syntax ternary-operator -> 20 = condition "?" then_case ":" else_case
The AST parsing algorithm takes a sequence of tokens and a collection of syntax definitions stored in such data structure and constructs an AST.
An example of parsed AST for a if b else c ? d : e
:
ternary-if {
true_value = a
condition = b
false_value = ternary-operator {
condition = c
then_case = d
else_case = e
}
}
Overview
The algorithm starts by trying to parse a smallest
complete expression starting at the beginning.
For example, it would parse just a
from a + b + c
,
or it would parse (a + b)
from (a + b) + c + d
.
After that, it tries to expand the first expression by making it a part of a more complex expression, finding the next smallest one.
For example a
from a + b + c
would be expanded to a + b
,
and (a + b)
from (a + b) + c + d
would be expanded to (a + b) + c
.
Then it tries to expand it again and again until it can make no more progress.
The first step is basically same as expanding from nothing, so what we do is this:
fn parse_expr() -> Option<Ast> {
let mut already_parsed = None;
while let Ok(expanded) = try_expand(already_parsed) {
already_parsed = Some(expanded);
}
already_parsed
}
So, a single expand operation is trying to parse either a simple (single-token) value, or a single syntax definition. In order to do so, we store all the syntax definitions in a helper data structure.
Storing syntax definitions
Syntax definitions are stored in a trie - a tree data structure, where each node represents partially parsed syntax definition.
An edge from parent to a child in this tree represents making progress, which is either following a keyword, or parsing an inner value. Some nodes are terminal, meaning that they represent fully parsed syntax definition.
For example, for storing ternary-if
definition like above
there would be a path from the root of the tree following edges:
- value
if
keyword- value
else
keyword- value
The final node after following this path is marked as terminal,
corresponding to the ternary-if
definition.
If, in addition to this syntax we also had a similar syntax definition but without the else case
(syntax just-if -> 20 = true_value "if" condition
),
and lets say a binary +
operator (defined as syntax plus <- 30 = a "+" b
),
then we would have a tree like this:
root {
-> value {
-> "+" {
-> value {
terminal "plus"
}
}
-> "if" {
-> value {
terminal "just-if"
-> "else" {
-> value {
terminal "ternary-if"
}
}
}
}
}
}
Actually, since we need to expand either starting with an already_parsed value or with nothing, we store two roots - one starting without a value and one with, instead of having a value edge from the root.
Parsing single definition
So, in order to parse a single syntax definition we need to traverse this tree to some terminal node.
We try to make as much progress as we can in a loop.
In every iteration, we first see if the token ahead is a keyword that we can follow with.
If not, and if there exists a value edge from the current node, then we
try to parse a value by recursively calling into parse_expr
.
If it succeeded to parse something, then we follow the value edge.
In other cases we stop and check that we arrived in a terminal node.
If not, then the code has a syntax error.
When parsing inner values it is also important to take
priorities and associativities of syntax definitions into consideration.
This means that both parse_expr
and expand
actually are also taking extra argument
which is the binding power (priority + associativity) of the outer expression, if there is one.
The only exception is if we are parsing an inner value inside parentheses
(like if we are parsing "(" expr ")"
syntax definition).
In such case, inner values are parsed just like the top level expression.
This also means that binding power is stored in the syntax trie in every node (other than root),
and when making progress we check that if we should actually proceed or not.
Here's an incomplete implementation:
fn expand(start_value: Option<Ast>, outer_binding_power: BindingPower) -> Result<Ast> {
// first see if its a simple value
if !peek_token().is_keyword() {
return Ok(Ast::Simple(pop_token()));
}
let mut node = match start_value {
Some(_) => root_with_value,
None => root_without_value,
};
let mut inner_values = Vec::from_iter(start_value);
loop {
if let Some(next_node) = node.next.get(Edge::Keyword(peek_token())) {
if !check_binding_power(outer_binding_power, next_node) {
break;
}
node = next_node;
} else if let Some(next_node) = node.next.get(Edge::Value).is_some() {
if !check_binding_power(outer_binding_power, next_node) {
break;
}
if let Some(value) = parse_expr(binding_power_for_inner_values) {
inner_values.push(value);
node = next_node;
} else {
break;
}
} else {
break;
}
}
if node == start_node {
// made no progress
return Err;
}
let definition = node.terminal.expect("arrived in non-terminal node");
Ok(Ast::Complex(definition, inner_values))
}
There are more missing details in this implementation.
For example, if there is a syntax definition which contains a keyword which is also used in the beginning of another definition, then we shouldn't start parsing a new one while in the process of parsing another.
For example, if we had these definitions:
syntax if <- 20 = "if" condition "(" body ")"
syntax call <- 40 = f arg
syntax parenthesised <- 100 = "(" expr ")"
In this case, when parsing condition of the if
we should treat the "(" as continuation of the if, not as start of the second argument of call
with condition
being the first.