Parsing

In the last lesson we started with an abstract syntax tree and wrote some functions that did different interesting things to it. Now let's back up a step and figure out how to generate abstract syntax trees.

Abstract syntax trees are designed to be easy for computers to manipulate, they're not necessarily designed for human understanding. While you're composing your next opera I doubt you'll want to keep writing tag: 'seq' again and again.

Parsing is the general problem of turning raw text into structured data. In the case of parsing programming languages, parsing converts the text the programmer writes into an abstract syntax tree. Parsers turn this:

into this:

Some interesting things to notice in the example are that the parser was smart and knew the rule for multiply being done first before the addition. Also, the parser is itself a function. It takes an input that is a string and returns an abstract syntax tree.

Creating a parser

How do you create a parser? A parser is just a function, so one way is to just write a function that does what you want. For simple languages this is not too hard, but as you add features to the language this becomes a lot of tedious work.

We'll be using PEG.js to generate parsers from specifications for the language grammar. You give PEG.js a grammar for your language, and it creates a JavaScript function that parses that language. In this series of exercises we'll learn how to specify the grammar for a language in PEG.js.

There is also a very handy online PEG.js editor that you might want to play around with. It lets you interactively write grammars and test them on input quickly.

Our first PEG parser

Here is a little PEG parser.

When you give this grammar to PEG.js, it will create a parsing function parse for you. This function parse takes a string as input. If the input string is a single digit from 0 to 9, the parse function will return the digit. Otherwise the parse function will throw an exception indicating that the input could not be parsed. (In the tests for each exercise we convert the exception into a return value of undefined so that all the tests run.)

A grammar for PEG.js is a list of rules. Each rule has a name and then a body. The start name is special because parsing always starts there. The notation [0-9] indicates a single character from a range of possibilities, in this case digits from 0 through 9.

Interpreting the example above in English, I would say it means, "Start by expecting a digit. A digit is a single character from the set of characters from 0 through 9."

Now it's your turn.

Write a PEG grammar that parses a single uppercase character from A to Z.

1:1



Next