5.1.1 Context-Free Grammars
In order to understand much of the specification moving forward, we will need to understand some basic information on grammar used to describe programming languages. Indeed the spec itself devotes all of section 5 to the basic notation and grammar used in the rest of the documentation. Unfortunately for some, the writers assume familiarity with some aspects of language theory. In order to bridge the gap for those that are not familiar with some of the terminology, and as a refresher for those that are, we will discuss the concepts as they appear.
The first concept to understand is what a context-free grammar is, as it lays the foundation of much of the remaining specification.
Let’s start with an example of a context-free grammar so we have something to reference while we discuss the topic:
A : 0A1
A : B
B : @
Syntactically, each rule here appears as a line. The left-hand side is a single non-terminal (another word for a variable). The right-hand side is a string that consists of any number of terminals (constants) and non-terminals (variables). In our example, 0, 1, and @ are terminals, and A and B are non-terminals. The notation for the non-terminals (variables) is a capital letter or a word that is camel-cased. Terminals (constants) are written in lowercase, or as a number, or a symbol.
Formal grammars have a recursive nature and consist of a collection of substitution rules referred to as productions. Their recursive nature empowers them to be widely expressive as you will see in many real-world examples throughout the specification. To apply the grammar, you start with the top rule, and simply perform substitutions until there are no non-terminals on the right-hand side. The top-left non-terminal is known as the goal terminal because our goal is to define it with everything on its right. In our example, valid syntax that will conform to our grammar include 0@1 00@11 000@111 and so on.
Why the name ‘context-free’?
What makes something ‘context-free’ and why is it called that? Consider the following:
A : a //context-free
AB : a //not context-free
In the first rule you can replace A without worrying about anything else around it, i.e. context-free. However, with the second rule, you can only replace A if B is alongside it. This means that you need to worry about the context of A being next to B. That is why AB : a isn’t context-free! For a formula to be context-free it must have only one non-terminal on the left-hand side. More than one non-terminal and we no longer have a context-free formula.
Armed with all this information we can understand 5.1.1 of the spec.
A context-free grammar consists of a number of productions. Each production has an abstract symbol called a nonterminal as its left-hand side, and a sequence of zero or more nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.
There we go! Exactly like we described albeit less cryptic now that we know what it’s talking about.
A chain production is a production that has exactly one nonterminal symbol on its right-hand side along with zero or more terminal symbols.
It’s called a chain because the rules or productions are linked together like a chain. A chain provides an optimal level of simplicity.
Starting from a sentence consisting of a single distinguished nonterminal, called the goal symbol, a given context-free grammar specifies a language, namely, the (perhaps infinite) set of possible sequences of terminal symbols that can result from repeatedly replacing any nonterminal in the sequence with a right-hand side of a production for which the nonterminal is the left-hand side.
Run-on sentence anyone? It’s not technically a run-on, but it is definitely a mouthful. If you read it chunk by chunk you’ll see that it re-iterates what we already know from what we just discussed.