Kast's AST parser

Overview

Parsing single definition

Kast's AST parser

In Kast there is no builtin syntax. Except for syntax for adding new syntax.

The standard syntax is defined in std/syntax.ks, but you can ignore it and define your own instead.

When parsing code into AST, Kast treats everything as an expression, and every syntax definition is extending what an expression can be.

For example, you could define a ternary if like in Python like so:

syntax ternary-if -> 20 = true_value "if" condition "else" false_value

This defines syntax for ternary-if (which should be defined somewhere) which is right associative (->), has priority (precedence) of 20, and is defined as true_value "if" condition "else" false_value where true_value, condition and false_value can be anything defined by other syntax rules with higher priority (or be a simple single-token value), and the if and else are "keywords".

Keywords can also be punctuation, so a ternary operator from C could be defined as:

syntax ternary-operator -> 20 = condition "?" then_case ":" else_case

The AST parsing algorithm takes a sequence of tokens and a collection of syntax definitions stored in such data structure and constructs an AST.

An example of parsed AST for a if b else c ? d : e:

ternary-if {
    true_value = a
    condition = b
    false_value = ternary-operator {
        condition = c
        then_case = d
        else_case = e
    }
}

Overview

The algorithm starts by trying to parse a smallest complete expression starting at the beginning. For example, it would parse just a from a + b + c, or it would parse (a + b) from (a + b) + c + d.

After that, it tries to expand the first expression by making it a part of a more complex expression, finding the next smallest one. For example a from a + b + c would be expanded to a + b, and (a + b) from (a + b) + c + d would be expanded to (a + b) + c.

Then it tries to expand it again and again until it can make no more progress.

The first step is basically same as expanding from nothing, so what we do is this:

fn parse_expr() -> Option<Ast> {
    let mut already_parsed = None;
    while let Ok(expanded) = try_expand(already_parsed) {
        already_parsed = Some(expanded);
    }
    already_parsed
}

So, a single expand operation is trying to parse either a simple (single-token) value, or a single syntax definition. In order to do so, we store all the syntax definitions in a helper data structure.

Storing syntax definitions

Syntax definitions are stored in a trie - a tree data structure, where each node represents partially parsed syntax definition.

An edge from parent to a child in this tree represents making progress, which is either following a keyword, or parsing an inner value. Some nodes are terminal, meaning that they represent fully parsed syntax definition.

For example, for storing ternary-if definition like above there would be a path from the root of the tree following edges:

value
if keyword
value
else keyword
value

The final node after following this path is marked as terminal, corresponding to the ternary-if definition.

If, in addition to this syntax we also had a similar syntax definition but without the else case (syntax just-if -> 20 = true_value "if" condition), and lets say a binary + operator (defined as syntax plus <- 30 = a "+" b), then we would have a tree like this:

root {
    -> value {
        -> "+" {
            -> value {
                terminal "plus"
            }
        }
        -> "if" {
            -> value {
                terminal "just-if"
                -> "else" {
                    -> value {
                        terminal "ternary-if"
                    }
                }
            }
        }
    }
}

Actually, since we need to expand either starting with an already_parsed value or with nothing, we store two roots - one starting without a value and one with, instead of having a value edge from the root.

Parsing single definition

So, in order to parse a single syntax definition we need to traverse this tree to some terminal node.

We try to make as much progress as we can in a loop. In every iteration, we first see if the token ahead is a keyword that we can follow with. If not, and if there exists a value edge from the current node, then we try to parse a value by recursively calling into parse_expr. If it succeeded to parse something, then we follow the value edge. In other cases we stop and check that we arrived in a terminal node. If not, then the code has a syntax error.

When parsing inner values it is also important to take priorities and associativities of syntax definitions into consideration. This means that both parse_expr and expand actually are also taking extra argument which is the binding power (priority + associativity) of the outer expression, if there is one. The only exception is if we are parsing an inner value inside parentheses (like if we are parsing "(" expr ")" syntax definition). In such case, inner values are parsed just like the top level expression. This also means that binding power is stored in the syntax trie in every node (other than root), and when making progress we check that if we should actually proceed or not.

Here's an incomplete implementation:

fn expand(start_value: Option<Ast>, outer_binding_power: BindingPower) -> Result<Ast> {
    // first see if its a simple value
    if !peek_token().is_keyword() {
        return Ok(Ast::Simple(pop_token()));
    }
    let mut node = match start_value {
        Some(_) => root_with_value,
        None => root_without_value,
    };
    let mut inner_values = Vec::from_iter(start_value);
    loop {
        if let Some(next_node) = node.next.get(Edge::Keyword(peek_token())) {
            if !check_binding_power(outer_binding_power, next_node) {
                break;
            }
            node = next_node;
        } else if let Some(next_node) = node.next.get(Edge::Value).is_some() {
            if !check_binding_power(outer_binding_power, next_node) {
                break;
            }
            if let Some(value) = parse_expr(binding_power_for_inner_values) {
                inner_values.push(value);
                node = next_node;
            } else {
                break;
            }
        } else {
            break;
        }
    }
    if node == start_node {
        // made no progress
        return Err;
    }
    let definition = node.terminal.expect("arrived in non-terminal node");
    Ok(Ast::Complex(definition, inner_values))
}

There are more missing details in this implementation.

For example, if there is a syntax definition which contains a keyword which is also used in the beginning of another definition, then we shouldn't start parsing a new one while in the process of parsing another.

For example, if we had these definitions:

syntax if <- 20 = "if" condition "(" body ")"
syntax call <- 40 = f arg
syntax parenthesised <- 100 = "(" expr ")"

In this case, when parsing condition of the if we should treat the "(" as continuation of the if, not as start of the second argument of call with condition being the first.