ES6 Language Specification

The Lexical and RegExp Grammars

5.1.2  The Lexical and RegExp Grammars

What is the lexical grammar?

The lexical grammar is the lowest level of the language's grammar: it describes how raw source text is grouped into the smallest meaningful units. It is specified in full in section 11 of the spec. We will describe it briefly here and leave the more intricate details for when we discuss section 11.

We will first quote the spec as a point of reference and then describe it in simpler language. The first paragraph states:

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

To better understand what the lexical grammar is, we need to cover some of the terminology. The terminal symbols of the lexical grammar are Unicode code points, each of which must match the SourceCharacter production, and SourceCharacter is defined as any Unicode code point. Starting from one of four goal symbols, InputElementDiv, InputElementTemplateTail, InputElementRegExp, or InputElementRegExpOrTemplateTail, the productions describe how a sequence of code points is grouped into input elements. Input elements include tokens, which are words like while or for or characters like ( or +, as well as non-tokens such as line terminators, comments, and white space. There are four goal symbols because a few characters are ambiguous without context: a / may be a division operator or the start of a regular expression literal, and a } may be a closing brace or the continuation of a template literal. Which goal symbol applies depends on the syntactic context in which the next input element is read.
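As a small illustration of why the context matters, the following JavaScript sketch places the same characters in different positions. The variable names are made up for this example, and the comments naming goal symbols are our reading of the spec rather than something an engine reports.

```javascript
const total = 10;
const g = 2;

// After an identifier, a "/" is read as a division punctuator,
// which is the situation InputElementDiv is meant for.
const quotient = total / 2 / g;

// Where an expression is expected, the characters "/2/g" are read
// as one RegularExpressionLiteral token (InputElementRegExp).
const pattern = /2/g;

// Inside a template literal, the "}" that closes a substitution starts
// a TemplateTail token instead of being a right-brace punctuator.
const label = `${total}px`;
```

Here quotient is a number, pattern is a RegExp object with source "2" and the g flag, and label is the string "10px", even though the first two statements contain nearly the same characters.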

We can now better understand the remaining part of 5.1.2:

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.
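To make the split between tokens and other input elements concrete, here is a toy scanner. It is only a sketch: the rule table, the scan function, and the short list of reserved words are hypothetical simplifications of ours, and they cover a tiny fraction of what clause 11 actually defines.

```javascript
// Toy rules, tried in order. Each regex is sticky (y flag) so it can
// only match at the current position. This is NOT the spec's algorithm.
const rules = [
  ["LineTerminator", /[\n\r\u2028\u2029]/y],
  ["WhiteSpace", /[ \t\v\f\u00a0\ufeff]+/y],
  ["Comment", /\/\/[^\n\r\u2028\u2029]*|\/\*[\s\S]*?\*\//y],
  ["ReservedWord", /(?:while|for|if|else|return|var)\b/y],
  ["IdentifierName", /[A-Za-z_$][A-Za-z0-9_$]*/y],
  ["NumericLiteral", /\d+/y],
  ["Punctuator", /[(){};+\-*<>=]/y],
];

// Split source text into a flat list of { kind, text } input elements.
function scan(source) {
  const elements = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const [kind, re] of rules) {
      re.lastIndex = pos;
      const m = re.exec(source);
      if (m) {
        elements.push({ kind, text: m[0] });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at ${pos}`);
  }
  return elements;
}

const elements = scan("while (x) // loop\ny + 1");

// Tokens are the input elements that are not white space, comments,
// or line terminators.
const nonTokens = ["WhiteSpace", "Comment", "LineTerminator"];
const tokens = elements.filter((e) => !nonTokens.includes(e.kind));
```

For this input, the tokens are while, (, x, ), y, +, and 1. The comment and the white space drop out, while the line terminator remains in elements as a non-token.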

Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (11.9). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A MultiLineComment (that is, a comment of the form /*…*/ regardless of whether it spans more than one line) is likewise simply discarded if it contains no line terminator; but if a MultiLineComment contains one or more line terminators, then it is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.
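The rule about multi-line comments has a visible effect when it meets automatic semicolon insertion. The two functions below are hypothetical examples of ours and differ only in whether the comment contains a line break.

```javascript
function commentOnOneLine() {
  // The comment contains no line terminator, so it is simply discarded
  // and the statement reads as "return 1;".
  return /* still on the same line */ 1;
}

function commentAcrossLines() {
  // The comment contains a line terminator, so it counts as one.
  // A line terminator is not allowed between "return" and its expression,
  // so a semicolon is inserted right after "return".
  return /* this comment
            spans two lines */ 1;
}

const first = commentOnOneLine();
const second = commentAcrossLines();
```

Calling commentOnOneLine gives 1, while commentAcrossLines gives undefined, because the trailing 1 is parsed as a separate statement that is never reached.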

A RegExp grammar for ECMAScript is given in 21.2.1. This grammar also has as its terminal symbols the code points as defined by SourceCharacter. It defines a set of productions, starting from the goal symbol Pattern, that describe how sequences of code points are translated into regular expression patterns.

Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.
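One way to see the RegExp grammar at work on its own is to hand pattern text to the RegExp constructor, which throws a SyntaxError when the text cannot be parsed as a pattern. The helper name isValidPattern is ours, chosen only for this sketch.

```javascript
// Returns true if the text can be parsed as a regular expression pattern.
function isValidPattern(patternText) {
  try {
    new RegExp(patternText);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so pass it along
  }
}

const balanced = isValidPattern("a(b)c"); // group is opened and closed
const unbalanced = isValidPattern("a(b"); // group is never closed
```

The first string is accepted and the second is rejected, even though both are perfectly good string literals as far as the lexical grammar is concerned.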

 

Josh Miller


I’m a full-stack web developer who’s especially enthusiastic about the rapid developments in JavaScript. I’ve created this blog as a medium to share with others a journey of knowledge and discovery.