Common Lexical Specification
This Common Lexical Specification defines grammar rules that are reused in multiple language specifications on this website. This document consists of four sections: Section 1 defines how programs are encoded on a byte level. Section 2 provides an introduction to grammars. Section 3 provides the full profile lexical grammar. Section 4 provides information on profiles.
1. Unicode
A program is a sequence of Unicode code points encoded into a sequence of bytes using a Unicode encoding. In this version, only UTF-8 NOBOM with sequences of length 1 is supported. The Unicode encoding of a particular program must be determined by consumers of this specification.
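The decoding step can be sketched as follows. This is a minimal illustration, not a normative implementation: the function name decode_program is hypothetical, and the sketch assumes that "sequences of length 1" means one-byte UTF-8 sequences, that is, code points U+0000 to U+007F.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def decode_program(data: bytes) -> list:
    """Decode program bytes into a list of Unicode code points.

    Assumption: only one-byte UTF-8 sequences (0x00..0x7F) are accepted,
    and a leading byte order mark is rejected (UTF-8 NOBOM).
    """
    if data.startswith(UTF8_BOM):
        raise ValueError("byte order mark is not supported")
    code_points = []
    for offset, byte in enumerate(data):
        # A UTF-8 sequence of length 1 is a single byte below 0x80.
        if byte > 0x7F:
            raise ValueError(f"unsupported multi-byte sequence at offset {offset}")
        code_points.append(byte)
    return code_points
```

The resulting code points are what the lexical grammar of Section 3 consumes as terminal symbols.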
2. Grammars
This section describes context-free grammars used in this specification to define the lexical and syntactical structure of a language.
2.1 Context-free grammars
A context-free grammar consists of a number of productions. Each production has an abstract symbol called a non-terminal as its left-hand side, and a sequence of one or more non-terminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.
Starting from a sequence consisting of a single distinguished non-terminal, called the goal symbol, a given context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can result from repeatedly replacing any non-terminal in the sequence with a right-hand side of a production for which the non-terminal is the left-hand side.
2.3 Lexical grammars
The lexical grammar uses the Unicode code points from the Unicode decoding phase as its terminal symbols.
The non-terminals of the lexical grammar start with the prefix Lexical..
It defines a set of productions, starting from the goal symbol Lexical.Word.
2.4 Syntactical grammars
The syntactical grammar for the Data Definition Language uses words of the lexical grammar as its terminal symbols.
The non-terminals of the syntactical grammar start with the prefix Syntactical..
It defines a set of productions, starting from the goal symbol
2.5 Grammar notation
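Productions are written in a fixed-width font.
A production is defined by its left-hand side, followed by a colon, followed by its right-hand side. A right-hand side may be described informally by a comment, which is opened by /* and closed by */.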
The following production denotes the non-terminal for a digit as used in the definitions of numerals:
Lexical.Digit : /* A single Unicode symbol from the code point range U+0030 to U+0039 */
A terminal is a sequence of Unicode symbols. A Unicode symbol is denoted by a number sign # followed by a hexadecimal number denoting its code point.
The following productions denote the non-terminal for a sign as used in the definitions of numerals:
/* #2b is also known as "PLUS SIGN" */
Lexical.PlusSign : #2b
/* #2d is also known as "MINUS SIGN" */
Lexical.MinusSign : #2d
Lexical.Sign : Lexical.PlusSign | Lexical.MinusSign
The syntax {x} on the right-hand side of a production denotes zero or more occurrences of x.
The following production defines a possibly empty sequence of digits as used in the definitions of numerals:
Lexical.ZeroOrMoreDigits : {Lexical.Digit}
The syntax [x] on the right-hand side of a production denotes zero or one occurrences of x.
The following production denotes a possible definition of an integer numeral. It consists of an optional sign followed by a digit followed by zero or more digits (as defined in the previous example):
Lexical.Integer : [Lexical.Sign] Lexical.Digit Lexical.ZeroOrMoreDigits
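To illustrate how the [x] and {x} operators translate into a recognizer, here is a small sketch for the Lexical.Integer production above. The function names is_digit and match_integer are hypothetical and only serve this example.

```python
from typing import Optional


def is_digit(ch: str) -> bool:
    # Lexical.Digit: a single code point in the range U+0030 to U+0039.
    return "\u0030" <= ch <= "\u0039"


def match_integer(text: str, pos: int = 0) -> Optional[int]:
    """Return the end position of a Lexical.Integer starting at pos, or None."""
    i = pos
    # [Lexical.Sign]: zero or one occurrence of #2b or #2d.
    if i < len(text) and text[i] in "\u002b\u002d":
        i += 1
    # Lexical.Digit: exactly one digit is required.
    if i >= len(text) or not is_digit(text[i]):
        return None
    i += 1
    # Lexical.ZeroOrMoreDigits: zero or more further digits.
    while i < len(text) and is_digit(text[i]):
        i += 1
    return i
```

An optional element becomes an if, a repeated element becomes a while loop, and a mandatory element becomes a check that fails the match.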
The empty string is denoted by ε.
The following productions denote a possibly empty list of integers (with integer as defined in the preceding example).
Note that this list may include a trailing comma, hence the {x} operator cannot be used here.
Syntactical.IntegerList : Lexical.Integer Syntactical.IntegerListRest
Syntactical.IntegerList : ε
Syntactical.IntegerListRest : Lexical.Comma Lexical.Integer Syntactical.IntegerListRest
Syntactical.IntegerListRest : Lexical.Comma
Syntactical.IntegerListRest : ε
/* #2c is also known as "COMMA" */
Lexical.Comma : #2c
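The integer list productions can be sketched as a small hand-written parser. This is an illustration under simplifying assumptions: the input is a plain string without whitespace, integers are matched with a regular expression equivalent to the integer example, and parse_integer_list is a hypothetical name.

```python
import re

# Optional sign followed by one or more digits, as in the integer example.
INTEGER = re.compile(r"[+-]?[0-9]+")


def parse_integer_list(text: str) -> list:
    """Parse text as a possibly empty integer list with optional trailing comma."""
    values = []
    match = INTEGER.match(text, 0)
    if match is None:
        # IntegerList : ε (only valid if there is no input at all)
        if text:
            raise ValueError("expected an integer at offset 0")
        return values
    values.append(int(match.group()))
    pos = match.end()
    # IntegerListRest: each iteration consumes one comma.
    while pos < len(text):
        if text[pos] != ",":
            raise ValueError(f"expected a comma at offset {pos}")
        pos += 1
        match = INTEGER.match(text, pos)
        if match is None:
            # IntegerListRest : Lexical.Comma (trailing comma ends the list)
            if pos != len(text):
                raise ValueError(f"expected an integer at offset {pos}")
            break
        values.append(int(match.group()))
        pos = match.end()
    return values
```

For example, parse_integer_list("1,-2,") accepts the trailing comma and yields the two integers, while "1,,2" is rejected.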
3. Full Profile Lexical Specification
The lexical grammar describes the translation of Unicode code points into words.
The goal non-terminal of the lexical grammar is the Lexical.Word symbol.
3.1 word
The word Lexical.Word is defined by
Lexical.Word : Lexical.Period
Lexical.Word : Lexical.Semicolon
Lexical.Word : Lexical.Boolean
Lexical.Word : Lexical.Number
Lexical.Word : Lexical.String
Lexical.Word : Lexical.Void
Lexical.Word : Lexical.Name
Lexical.Word : Lexical.LeftCurlyBracket
Lexical.Word : Lexical.RightCurlyBracket
Lexical.Word : Lexical.LeftSquareBracket
Lexical.Word : Lexical.RightSquareBracket
Lexical.Word : Lexical.Comma
Lexical.Word : Lexical.Colon
/* whitespace, newline, and comment are not considered by the syntactical grammar */
Lexical.Word : Lexical.Whitespace
Lexical.Word : Lexical.Newline
Lexical.Word : Lexical.Comment
3.2 whitespace
The word Lexical.Whitespace is defined by
/* #9 is also known as "CHARACTER TABULATION" */
Lexical.Whitespace : #9
/* #20 is also known as "SPACE" */
Lexical.Whitespace : #20
3.3 line terminator
The word Lexical.LineTerminator is defined by
/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
Lexical.LineTerminator : #a {#d}
Lexical.LineTerminator : #d {#a}
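Read literally, the two productions above accept a line feed followed by any number of carriage returns, or a carriage return followed by any number of line feeds. A minimal sketch of that literal reading follows; match_line_terminator is a hypothetical name.

```python
from typing import Optional


def match_line_terminator(text: str, pos: int = 0) -> Optional[int]:
    """Return the end position of a Lexical.LineTerminator at pos, or None."""
    if pos >= len(text) or text[pos] not in "\n\r":
        return None
    first = text[pos]
    # After #a only #d may repeat; after #d only #a may repeat.
    other = "\r" if first == "\n" else "\n"
    end = pos + 1
    while end < len(text) and text[end] == other:
        end += 1
    return end
```

Under this reading, two consecutive line feeds are two line terminators, whereas a carriage return followed by a line feed is a single one.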
3.4 comments
The language using the Common Lexical Specification may use both single-line comments and multi-line comments.
A Lexical.Comment is either a Lexical.SingleLineComment or a Lexical.MultiLineComment.
Lexical.Comment is defined by
Lexical.Comment : Lexical.SingleLineComment
Lexical.Comment : Lexical.MultiLineComment
A Lexical.SingleLineComment starts with two solidi.
It extends to the end of the line.
Lexical.SingleLineComment is defined by
/* #2f is also known as "SOLIDUS" */
Lexical.SingleLineComment :
#2f #2f
/* any sequence of characters except for Lexical.LineTerminator */
The Lexical.LineTerminator is not considered as part of the comment text.
A Lexical.MultiLineComment is opened by a solidus and an asterisk and closed by an asterisk and a solidus.
Lexical.MultiLineComment is defined by
/* #2f is also known as "SOLIDUS" */
/* #2a is also known as "ASTERISK" */
Lexical.MultiLineComment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f
The #2f #2a and #2a #2f sequences are not considered as part of the comment text.
This implies:
- #2f #2f has no special meaning in either kind of comment.
- #2f #2a and #2a #2f have no special meaning in single-line comments.
- Multi-line comments do not nest.
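The comment rules can be sketched as a small scanner. This is an illustration only: match_comment is a hypothetical name, and raising an error for an unterminated multi-line comment is an assumption of this sketch, not something the specification states.

```python
from typing import Optional, Tuple

LINE_TERMINATOR_CHARACTERS = "\n\r"  # #a and #d


def match_comment(text: str, pos: int = 0) -> Optional[Tuple[int, str]]:
    """Return (end position, comment text) for a comment at pos, or None."""
    if text.startswith("//", pos):
        end = pos + 2
        # The comment extends to the end of the line.
        while end < len(text) and text[end] not in LINE_TERMINATOR_CHARACTERS:
            end += 1
        # The line terminator is neither consumed nor part of the comment text.
        return end, text[pos + 2:end]
    if text.startswith("/*", pos):
        # The first "*/" closes the comment, so comments do not nest.
        close = text.find("*/", pos + 2)
        if close < 0:
            raise ValueError("unterminated multi-line comment")
        return close + 2, text[pos + 2:close]
    return None
```

Because the multi-line branch stops at the first closing sequence, an inner /* is treated as ordinary comment text.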
3.5 parentheses
The words Lexical.LeftParenthesis and Lexical.RightParenthesis, respectively, are defined by
/* #28 is also known as "LEFT PARENTHESIS" */
Lexical.LeftParenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
Lexical.RightParenthesis : #29
3.6 curly brackets
The words Lexical.LeftCurlyBracket and Lexical.RightCurlyBracket, respectively, are defined by
/* #7b is also known as "LEFT CURLY BRACKET" */
Lexical.LeftCurlyBracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
Lexical.RightCurlyBracket : #7d
3.7 colon
The word Lexical.Colon is defined by
/* #3a is also known as "COLON" */
Lexical.Colon : #3a
3.8 square brackets
The words Lexical.LeftSquareBracket and Lexical.RightSquareBracket, respectively, are defined by
/* #5b is also known as "LEFT SQUARE BRACKET" */
Lexical.LeftSquareBracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
Lexical.RightSquareBracket : #5d
3.9 comma
The word Lexical.Comma is defined by
/* #2c is also known as "COMMA" */
Lexical.Comma : #2c
3.10 name
The word Lexical.Name is defined by
Lexical.Name : {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}
/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
Lexical.NameSuffixCharacter : /* The Unicode characters from #41 to #5a and from #61 to #7a. */
/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
Lexical.NameSuffixCharacter : /* The unicode characters from #30 to #39. */
/* #5f is also known as "LOW LINE" */
Lexical.NameSuffixCharacter : #5f
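The name production can be approximated with a regular expression. Note the assumptions: Lexical.Underscore and Lexical.Alphabetic are not defined in this section, so the sketch takes Lexical.Underscore to be #5f and Lexical.Alphabetic to be the letters #41 to #5a and #61 to #7a. The names NAME and is_name are hypothetical.

```python
import re

# {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}
# Assumption: Underscore is "_" and Alphabetic is A-Z or a-z.
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")


def is_name(text: str) -> bool:
    """Return True if the whole text is a Lexical.Name."""
    return NAME.fullmatch(text) is not None
```

Under these assumptions a name may start with underscores but must contain at least one letter before any digit, so "_" and "1a" are not names while "__x1" is.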
3.10 number literal
The word Lexical.Number is defined by
Lexical.Number : Lexical.IntegerNumber
Lexical.Number : Lexical.RealNumber
Lexical.IntegerNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit}
Lexical.RealNumber : [Lexical.Sign] Lexical.Period Lexical.Digit {Lexical.Digit} [Lexical.Exponent]
Lexical.RealNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit} [Lexical.Period {Lexical.Digit}] [Lexical.Exponent]
Lexical.Exponent : Lexical.ExponentPrefix [Lexical.Sign] Lexical.Digit {Lexical.Digit}
/* #2b is also known as "PLUS SIGN" */
Lexical.Sign : #2b
/* #2d is also known as "MINUS SIGN" */
Lexical.Sign : #2d
/* #65 is also known as "LATIN SMALL LETTER E" */
Lexical.ExponentPrefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
Lexical.ExponentPrefix : #45
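The number productions can be sketched with regular expressions. This is an illustration, not a normative implementation: the names are hypothetical, and because a plain digit sequence such as 42 matches both Lexical.IntegerNumber and the second Lexical.RealNumber production, the sketch assumes the integer reading is preferred.

```python
import re
from typing import Optional

SIGN = r"[+-]?"
DIGITS = r"[0-9]+"
EXPONENT = rf"(?:[eE]{SIGN}{DIGITS})?"

# Lexical.IntegerNumber : [Sign] Digit {Digit}
INTEGER_NUMBER = re.compile(SIGN + DIGITS)
# Lexical.RealNumber : [Sign] Period Digit {Digit} [Exponent]
# Lexical.RealNumber : [Sign] Digit {Digit} [Period {Digit}] [Exponent]
REAL_NUMBER = re.compile(rf"{SIGN}(?:\.{DIGITS}|{DIGITS}(?:\.[0-9]*)?){EXPONENT}")


def classify_number(text: str) -> Optional[str]:
    """Return "integer", "real", or None for the whole text."""
    # Assumption: the integer reading wins when both productions match.
    if INTEGER_NUMBER.fullmatch(text):
        return "integer"
    if REAL_NUMBER.fullmatch(text):
        return "real"
    return None
```

With this sketch, "1." and ".5" are real numbers, while a lone period or an exponent without digits is not a number.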
3.11 string literal
The word Lexical.String is defined by
Lexical.String : Lexical.SingleQuotedString
Lexical.String : Lexical.DoubleQuotedString
Lexical.DoubleQuotedString : Lexical.DoubleQuote {Lexical.DoubleQuotedStringCharacter} Lexical.DoubleQuote
Lexical.DoubleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.DoubleQuote and characters in [0,1F]*/
Lexical.DoubleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.DoubleQuotedStringCharacter : #5c Lexical.DoubleQuote
/* #22 is also known as "QUOTATION MARK" */
Lexical.DoubleQuote : #22
Lexical.SingleQuotedString : Lexical.SingleQuote {Lexical.SingleQuotedStringCharacter} Lexical.SingleQuote
Lexical.SingleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.SingleQuote and characters in [0,1F]*/
Lexical.SingleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.SingleQuotedStringCharacter : #5c Lexical.SingleQuote
/* #27 is also known as "APOSTROPHE" */
Lexical.SingleQuote : #27
/* #5c is also known as "REVERSE SOLIDUS", #75 is also known as "LATIN SMALL LETTER U" */
Lexical.EscapeSequence : #5c #75 Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit
/* #5c is also known as "REVERSE SOLIDUS" */
Lexical.EscapeSequence : #5c #5c
/* #62 is also known as "LATIN SMALL LETTER B" */
Lexical.EscapeSequence : #5c #62
/* #66 is also known as "LATIN SMALL LETTER F" */
Lexical.EscapeSequence : #5c #66
/* #6e is also known as "LATIN SMALL LETTER N" */
Lexical.EscapeSequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
Lexical.EscapeSequence : #5c #72
/* #74 is also known as "LATIN SMALL LETTER T" */
Lexical.EscapeSequence : #5c #74
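A recognizer for string literals might look like the sketch below. It follows the letters named in the comments above (u, b, f, n, r, t) and uses the usual hexadecimal digits 0-9, a-f, and A-F. Two assumptions are made: a reverse solidus always introduces an escape sequence, and an unknown escape makes the literal invalid. The name match_string is hypothetical.

```python
from typing import Optional

HEX_DIGITS = set("0123456789abcdefABCDEF")
# Characters that may follow a reverse solidus in a simple escape sequence.
SIMPLE_ESCAPES = set("\\bfnrt")


def match_string(text: str, pos: int = 0) -> Optional[int]:
    """Return the end position of a string literal starting at pos, or None."""
    if pos >= len(text) or text[pos] not in "\"'":
        return None
    quote = text[pos]
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            return i + 1  # closing quote found
        if ch == "\\":
            nxt = text[i + 1:i + 2]
            if nxt == "u":
                # Exactly four hexadecimal digits must follow.
                digits = text[i + 2:i + 6]
                if len(digits) != 4 or not set(digits) <= HEX_DIGITS:
                    return None
                i += 6
            elif nxt == quote or nxt in SIMPLE_ESCAPES:
                i += 2
            else:
                return None  # assumption: unknown escapes are rejected
        elif ord(ch) <= 0x1F:
            return None  # control characters, including newlines, are excluded
        else:
            i += 1
    return None  # unterminated literal
```

The same loop serves both quoting styles because only the closing quote character differs between them.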
3.12 boolean literal
The word Lexical.Boolean is defined by
Lexical.Boolean : Lexical.True
Lexical.Boolean : Lexical.False
Lexical.True : #74 #72 #75 #65
Lexical.False : #66 #61 #6c #73 #65
Remark: The word Lexical.Boolean is a so-called keyword.
It takes priority over the Lexical.Name.
3.13 void literal
The word Lexical.Void is defined by
Lexical.Void : #76 #6f #69 #64
Remark: The word Lexical.Void is a so-called keyword.
It takes priority over the Lexical.Name.
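Keyword priority can be sketched as a lookup performed after a name-shaped word has been matched. The function classify_word and the name pattern are hypothetical; the pattern assumes that names consist of optional leading underscores, a letter, and then letters, digits, or underscores.

```python
import re
from typing import Optional

# Assumed shape of Lexical.Name for the purpose of this sketch.
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")

KEYWORDS = {
    "true": "Lexical.Boolean",
    "false": "Lexical.Boolean",
    "void": "Lexical.Void",
}


def classify_word(text: str) -> Optional[str]:
    """Classify a name-shaped word, giving keywords priority over names."""
    if NAME.fullmatch(text) is None:
        return None
    # A keyword takes priority over Lexical.Name for the same spelling.
    return KEYWORDS.get(text, "Lexical.Name")
```

Only exact spellings are keywords in this sketch, so "True" and "voided" are classified as names.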
3.14 decimal digit
The word Lexical.DecimalDigit is defined by
Lexical.DecimalDigit : /* A single Unicode character from the code point range U+0030 to U+0039. */
3.15 hexadecimal digit
The word Lexical.HexadecimalDigit is defined by
Lexical.HexadecimalDigit : /* A single Unicode character from the code point range U+0030 to U+0039, U+0061 to U+0066, or U+0041 to U+0046. */
3.16 alphanumeric
The word Lexical.Alphanumeric is reserved for future use.
3.17 period
The word Lexical.Period is defined by
/* #2e is also known as "FULL STOP" */
Lexical.Period : #2e