ConstFuse or Building Parser Combinators using C++ 17

An adventure on parsing and using/abusing C++ 17 features to create a parser combinator

Everything started with an idea to explore the new features introduced in the C++ 17 such as folding expressions and ease of use of some meta template programming. That being said i don’t advocate into using meta programming for everything but it has its uses. Certain things cant be done without having to resolve to some code duplication or a ugly unreadable mess.

After trying at least five different prototypes of the structure and execution of combinator parsers i settled on a pretty simple solution abstracting combinators and parsers on their own. First and foremost lets introduce the concept of combinator parsers. Its a pattern revolving around simple parsers ( functions that recognize input ) and combinators that act like higher order functions accepting parsers and returning a parser. For example lets say we have a parser called Number recognizing a single digit number and a combinator Repeat that repeats a parser one or more times. Then the parser Repeat(Number()) would be recognizing numbers comprised of multiple digits. Most parser combinator libraries are made in/using functional languages because of their nature and the ease of use and flexibility that they offer.

Link to the project

Alternatives

When it comes to parsing there are a lot of options you can explore and it all depends on what the requirements that you have. Two of the great examples would be ANTLR and of course Boost Spirit to give show something from the C++ world. The best example of how a parser combinator library looks like in the functional world is Parsec, a library written in Haskell Parsec.

On an adventure

The end goal of this adventure was to use interesting interactions with the C++ templating type system and create a simple single header library that’s easy to use and extend. Another main focus was making the grammars constructed with the combinators to be readable and maybe even fully constexpr. Every component of the library is basically a Functor or simply put, a class/struct that has overloaded the call operator ( operator() ), so it can be used as a function but still has some context.

struct SampleParser {
    using is_parser_type = std::true_type;
    using return_type = int;

    template<typename Iterator>
    bool operator()(Iterator& it, Iterator end, return_type* r) const{
    return true;
  }
};

This simplification keeps us from needing any sort of inheritance and we can also use/abuse constexpr so we get a simpler binary generated by the compiler. The call operator returns bool because we want to use it in a clever way with fold expressions. Using the traits is_parser_type and return_type we identify that this in fact is a parser and that can later be applied with Concept that are being introduced in C++ 20, but for now we can use the usual SFINAE tricks.

Lets explore a basic set of combinators Sequence, Repeat, Many, Optional, Any .

Sequence

Combining N parsers into a single one that recognized them in the order they are passed in the constructor of the combinator. At the core of it you see we are using a helper so we go over each parser and store its result in the respective position of the tuple.

template<typename ...Parsers>
struct Seq {

    using is_parser_type = std::true_type;
    using return_type = typename std::tuple<typename Parsers::return_type...>;
    using container_type = typename std::tuple<Parsers...>;


    container_type const parsers;

    explicit constexpr Seq(Parsers const& ...p) :parsers(p...) {};

    template<typename Iterator>
    bool operator()(Iterator& it, Iterator end, return_type* result) const{

        return_type tmp;
        Iterator backtrack = it;
        bool res = std::apply(all_applicator_res<Iterator, return_type, Parsers...>, parsers)(it, end, tmp);

        if (res) {
            *result = tmp;
        }
        else {
            it = backtrack;
        }
        return res;
    }
};

Repeat

Does the same as + in Regular Expressions for a single parser, matches once or more times what the parser passed matches

template<typename ...Parsers>
struct Seq {

    using is_parser_type = std::true_type;
    using return_type = typename std::tuple<typename Parsers::return_type...>;
    using container_type = typename std::tuple<Parsers...>;


    container_type const parsers;

    explicit constexpr Seq(Parsers const& ...p) :parsers(p...) {};

    template<typename Iterator>
    bool operator()(Iterator& it, Iterator end, return_type* result) const{

        return_type tmp;
        Iterator backtrack = it;
        bool res = std::apply(all_applicator_res<Iterator, return_type, Parsers...>, parsers)(it, end, tmp);

        if (res) {
            *result = tmp;
        }
        else {
            it = backtrack;
        }
        return res;
    }
};

Many

Similar to repeat its behavior is identical to * (Kleene Star ) in Regular Expressions, so it matches none or more times what the parser matches

template<typename Parser>
struct Many {
    using is_parser_type = std::true_type;
    using return_type = typename std::vector<typename Parser::return_type>;
    using temp_result_type = typename Parser::return_type;


    Parser const p;
    Many(Parser const& p_) :p(p_) {};
    //CAN REPEAT 0 Times
    template<typename Iterator>
    bool operator()(Iterator& it, Iterator end, return_type* result) const {

        Iterator backtrack = it;
        size_t cnt = 0;

        temp_result_type tmpres;

        while (p(it, end, &tmpres)) {
            result->push_back(tmpres);
            backtrack = it;
            ++cnt;
        }

        return true;
    }
};

Optional

Match a single time or none at all, usually people implement an Identity parser that would always succeed, and Optional is just the combination of the passed parser the Identity.

Any

Constructed from multiple parsers it matches at least one of them, this may be one of the more complex pieces of this project because in strictly typed languages its harder to have heterogeneous data stored. One issue that was more complex to solve for this combinator was the use of std::variant because having std::variant creates a different type than std::variant and that’s by standart. What i managed to create was unique_variant that takes template types and reduces those types to a unique list of types. For example unique_variant creates the type std::variant.

As you can see some of the parsers store the iterator state for backtracking in case it fails, this is just to simplify construction of complex parsers.

Usage and Construction

There are a couple of other interesting combinators that will get described later on,but first lets talk about how a parser is constructed. Because each of those combinators works as any other parser that means they behave just like higher order functions. Really simple example would be a having list of numbers, a mix of integers and doubles. If we had a parser called Integer() and another one called Double() making a parser that reads all the values in a string like this:

4 4.2 1 16 2.6 15.123 4 14

Becomes as easy as using this structure Many(Any(Integer(),Double())) and this statement has all of its complexities hidden inside the functors, so our main focus becomes recreating the grammar of what we want to parse. For simple tasks a statement like this might be more readable than regular expressions syntax, but of course it comes to personal preference and preformance. Most of the libraries opt into making a lot of operator overloads, but i felt that the resulting code looks like hot mess. This doesn’t mean i strayed completely from any operator overloads, but they are mainly to provide ease of use and readability. For example the usage of operator && and operator || they are for constructing a sequence or a choice of parsers.

When we want to sequentially use two parsers and they capture the same or similar type you chain them up like this

parser1 && parser2

When we want a choice of parsers and they capture the same or simillar type you chain them up like this

parser1 || parser2

This type of usage allows us to reduce the clutter of words that would happen if we used the conventional class calls. All overloaded operators are outside of classes ( global ) and return the appropriate Combinator so they act like a wrapper and are also chainable.

Execution of a parser

Each parser has a overloaded operator() taking a starting iterator, a end iterator and has an output parameter of the appropriate type, and returns a boolean value. We use that return value with the new fold syntax construct statements similar to this one:

(parsers(it, end, &result_item) && ...)

This unpacks all the parsers and executes them in order of calling ( something guaranteed by the C++17 standard ). This way if a parser in the chain fails, execution stops and we end up with an iterator at the first location it failed parsing. In the C++ language this is called short-circuiting and works on boolean values and implicitly convertible statements to boolean. This is not the only reason i chose this type of implementation, but because in earlier experiments i came to the conclusion that its harder to work with all the different return values and combining them.Using fold expressions also reduces our need to do recursive SFINAE template tricks.

Left recursion in grammars

In the world of parser combinators you could try and use conventional methods of handling left recursion, but that would entail making a construct so we dont end up with an endless loop. Luckily there is another combinator that we can use so we can avoid the recursive nature of a grammar. Most algorithms do some sort of pruning or rewriting to eliminate recursion. The combinator i mentioned earlier is called SepBy(). SepBy matches one parser separated by another usually called delimiter, using the example we made earlier with parsing numbers, if we have a comma separated list of values.

1,2,3,4,54,6,7,8,99

The task of parsing this would become SepBy(Integer(),Comma()) but the real power of this combinator shows up when we have to parse something like a simple mathematical expression.

1+5*6+8/2

This doesn’t mean you cant have nested rules inside the grammar, for example a rule referencing itself, thats why there is a Reference combinator that pushes a reference to a parser and also a generic Rule container that can hold any parser type. Those enable you to create almost anything you desire without rethinking/rewriting the grammar rules that are more intuitive to write/read.

Handling associativity and precedence

Associativity and precedence are one of the more complex parts that you have to create when handling a grammar. Most algorithms get precedence as a number and after creating the parse graph it prunes all the nodes that dont satisfy the precedence, but in a combinator parser something like that is harder to do. This is why i introduce Compound parsers, they are not really a full parser meaning you cant call them directly to recognize input. As an input they get one or more other parsers, that can also be compound but allow construction of nested parsing functors. We introduce 4 compound parsers, LeftBinOper,RightBinOper that handle respectively left and right associative operators and Prefix,Postfix that handle the same thing as their name, prefix and postfix operators. There can be abstracted more Compound parsers N-ary operators but i felt like that was out of the scope of the library at the moment. Combining compound parsers is as easy as making a nested function call but to increase readability we introduce an overload of operator >>=. This operator translates from p1 >>= p2 to just calling p1(p2). All we need and the end of a operator calls like this is just a non-compound parser to make it callable as a recognizer. Why does this increase readability ?

Well this way we model the precedence order based on our compound call order. Usually the precedence of operators in most languages goes like this

Left Associative → Right Associative → Prefix→ Postfix

This way when we resolve the functions it goes and parses from left to right:

Prefix → Postfix → Right Associative → Left Associative

Examples

Lets show some simplistic examples of how a parsing setup works and

Example Math Parser

Its easy to see that all the operators with same precedence are grouped with || operator on each level and then chained. Because of the flexibility of parameters ( mainly most of them can be made constexpr ) means we can increase redability even more.

constexpr auto single_sum = [](auto)->std::function<int(int, int)> { return [](int a, int b) { return a + b; }; };
    constexpr auto single_sub = [](auto)->std::function<int(int, int)> { return [](int a, int b) { return a - b; }; };
    constexpr auto single_mul = [](auto)->std::function<int(int, int)> { return [](int a, int b) { return a * b; }; };
    constexpr auto single_div = [](auto)->std::function<int(int, int)> { return [](int a, int b) { return a / b; }; };
    constexpr auto prefix_double = [](auto)->std::function<int(int)> { return [](int a) { return a * a; }; };
    constexpr auto prefix_neg = [](auto)->std::function<int(int)> { return [](int a) { return -a; }; };



    auto mathParser =
        LeftBinOper('+'_symb % single_sum || '-'_symb % single_sub) >>=
        LeftBinOper('*'_symb % single_mul || '/'_symb % single_div) >>=
        Prefix('-'_symb % prefix_neg || '!'_symb % prefix_double) >>=
        ParseNumber();



    std::string test_expression = "3*5*2+1*2+1+1+1+2+3+3";

    ctxStringIter start = test_expression.begin();
    ctxStringIter end = test_expression.end();

    decltype(mathParser)::return_type final_result;
    bool hasParsed = mathParser(start, end, &final_result);

    if (hasParsed) {
        std::cout << "Parsed Successfully" << std::endl;
        std::cout << "Result:= " << final_result << std::endl;
        std::cout << "(line: " << start.getCurrentLine() << " col: " << start.getCurrentCol() << std::endl;
    }

Example JSON Parser

I don’t claim that this is an efficient implementation nor any speed improvements over existing solutions, its just an exploration on parsing and features of the new standards in C++.

//Transformations    
    constexpr auto to_string = [](auto v) { return std::string(1, v); };
    constexpr auto vec_to_string = [](auto v) {
        return std::accumulate(v.begin(), v.end(), std::string{});
    };

    //Grammar
    constexpr auto ws = monadic::Many(symbs<'\n', '\t', '\r',' '>() % to_string);
    constexpr auto str_wrap = Wrapper('"'_symb, '"'_symb);
    constexpr auto obj_wrap = Wrapper('{'_symb, '}'_symb);
    constexpr auto array_wrap = Wrapper('['_symb, ']'_symb);

    //Handle Scientific notation
    constexpr auto json_int = AcceptString('0'_symb % to_string || "1..9"_range % to_string && monadic::Many(AcceptString("0..9"_range)));
    constexpr auto json_exp = symbs<'E', 'e'>() % to_string && Optional(symbs<'+', '-'>() % to_string) && json_int;
    constexpr auto float_part = AcceptString('.'_symb % to_string && monadic::Many(AcceptString("0..9"_range)));

    constexpr auto json_number = Optional('-'_symb % to_string) && json_int && Optional(float_part) && Optional(json_exp);

    constexpr auto json_string = str_wrap(monadic::Repeat(AcceptString("a..z"_range || "A..Z"_range)));

    constexpr auto json_obj = [&](const auto json_value) {
        return obj_wrap(SepBy(json_value, ','_symb) % vec_to_string || ws);
    };
    constexpr auto json_arr = [&](const auto json_value) {
        return array_wrap(SepBy(json_value, ','_symb) % vec_to_string || ws);
    };

    Rule<ctxStringIter, std::string> json_value =
        json_number ||
        json_string ||
        json_arr(Reference(&json_value)) ||
        json_obj(json_string && AcceptString(':'_symb) && AcceptString(Reference(&json_value))) ||
        ParseLit("true") ||
        ParseLit("false") ||
        ParseLit("null");


    std::string test_string = "[1,-3.45,[],{},5,6,\"asd\",{\"Test\":\"Arr\"},{\"Tester\":[1,2,34,3,5,6]},{\n}]";

    ctxStringIter start = test_string.begin();
    ctxStringIter end = test_string.end();

    decltype(json_value)::return_type var_test;
    bool hasParsed = json_value(start, end, &var_test);

    if (hasParsed) {
        std::cout << "Parsed Successfully" << std::endl;
        std::cout << "Result:= " << var_test << std::endl;
        std::cout << "(line: " << start.getCurrentLine() << " col: " << start.getCurrentCol() << std::endl;
    }