Common Lexical Specification

This Common Lexical Specification defines grammar rules that are re-used in multiple language specifications on this website. This document consists of four sections: Section 1 defines how programs are encoded on the byte level. Section 2 provides an introduction to grammars. Section 3 provides the full profile lexical grammar. Section 4 provides information on profiles.

1. Unicode

A program is a sequence of Unicode code points encoded into a sequence of bytes using a Unicode encoding. In this version, only UTF-8 without a byte order mark (BOM) and with sequences of length 1 is supported. The Unicode encoding of a particular program must be determined by consumers of this specification.
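The decoding phase can be sketched in Python as follows. This is a minimal sketch, not a normative implementation: it assumes that "sequences of length 1" means only single-byte UTF-8 sequences (code points U+0000 to U+007F) are accepted, and the function name decode_program is hypothetical.

```python
from typing import List


def decode_program(data: bytes) -> List[int]:
    """Decode a program into a list of Unicode code points.

    Assumption: only single-byte UTF-8 sequences are accepted, so every
    byte must be below 0x80. A leading byte order mark is rejected too,
    because its first byte (0xEF) starts a three-byte sequence.
    """
    code_points = []
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            raise ValueError(
                f"unsupported byte 0x{byte:02x} at offset {offset}"
            )
        # For single-byte sequences the byte value is the code point.
        code_points.append(byte)
    return code_points
```

The resulting list of code points is what the lexical grammar of Section 3 would consume as its terminal symbols.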

2. Grammars

This section describes context-free grammars used in this specification to define the lexical and syntactical structure of a language.

2.1 Context-free grammars

A context-free grammar consists of a number of productions. Each production has an abstract symbol called a non-terminal as its left-hand side, and a sequence of one or more non-terminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.

Starting from a sequence consisting of a single distinguished non-terminal, called the goal symbol, a given context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can result from repeatedly replacing any non-terminal in the sequence with a right-hand side of a production for which the non-terminal is the left-hand side.
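The replacement process described above can be illustrated with a small Python sketch. The toy grammar (Pair, Sign, Bit) and the function name language are hypothetical and are not part of this specification. The sketch always replaces the leftmost non-terminal, which is sufficient to reach every terminal sequence of a context-free grammar, and it terminates only when the language is finite.

```python
from collections import deque

# Hypothetical toy grammar: dictionary keys are non-terminals,
# every other symbol is a terminal.
GRAMMAR = {
    "Pair": [["Sign", "Bit"]],
    "Sign": [["+"], ["-"]],
    "Bit": [["0"], ["1"]],
}


def language(goal, grammar):
    """Enumerate all terminal sequences derivable from the goal symbol."""
    results = set()
    seen = {(goal,)}
    pending = deque([(goal,)])
    while pending:
        sequence = pending.popleft()
        # Locate the leftmost non-terminal, if any.
        index = next(
            (i for i, symbol in enumerate(sequence) if symbol in grammar),
            None,
        )
        if index is None:
            # Only terminals remain: this sequence belongs to the language.
            results.add("".join(sequence))
            continue
        # Replace the non-terminal with each of its right-hand sides.
        for rhs in grammar[sequence[index]]:
            successor = sequence[:index] + tuple(rhs) + sequence[index + 1:]
            if successor not in seen:
                seen.add(successor)
                pending.append(successor)
    return results
```

Calling language("Pair", GRAMMAR) yields the four strings that the toy grammar specifies.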

2.3 Lexical grammars

The lexical grammar uses the Unicode code points from the Unicode decoding phase as its terminal symbols. The non-terminals of the lexical grammar start with the prefix Lexical.. It defines a set of productions, starting from the goal symbol Lexical.Word, that describe how sequences of code points are translated into a word.

2.4 Syntactical grammars

The syntactical grammar for the Data Definition Language uses words of the lexical grammar as its terminal symbols. The non-terminals of the syntactical grammar start with the prefix Syntactical.. It defines a set of productions, starting from the goal symbol sentence, that describe how sequences of words are translated into a sentence.

2.5 Grammar notation

Productions are written in a fixed-width font.

A production is defined by its left-hand side, followed by a colon :, followed by its right-hand side. The left-hand side is the name of the non-terminal defined by the production. Multiple alternative definitions of a production may be given. The right-hand side of a production consists of any sequence of terminals and non-terminals. In certain cases the right-hand side is replaced by a comment describing the right-hand side. This comment is opened by /* and closed by */.

Example

The following production denotes the non-terminal for a digit as used in the definitions of numerals:

Lexical.Digit : /* A single Unicode symbol from the code point range U+0030 to U+0039 */

A terminal is a sequence of Unicode symbols. A Unicode symbol is denoted by a number sign # followed by a hexadecimal number denoting its code point.

Example

The following productions denote the non-terminal for a sign as used in the definitions of numerals:

/* #2b is also known as "PLUS SIGN" */
Lexical.PlusSign : #2b
/* #2d is also known as "MINUS SIGN" */
Lexical.MinusSign : #2d
Lexical.Sign : Lexical.PlusSign | Lexical.MinusSign
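To make the terminal notation concrete, the following Python sketch converts a terminal such as #2b into its code point. The function name terminal_to_code_point is hypothetical. The sketch assumes that the hexadecimal number consists only of the digits 0-9, a-f, and A-F.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def terminal_to_code_point(terminal: str) -> int:
    """Convert a terminal in the '#<hex>' notation into a code point."""
    digits = terminal[1:]
    well_formed = (
        terminal.startswith("#")
        and digits != ""
        and all(digit in HEX_DIGITS for digit in digits)
    )
    if not well_formed:
        raise ValueError(f"not a terminal: {terminal!r}")
    return int(digits, 16)
```

For instance, chr(terminal_to_code_point("#2b")) gives the PLUS SIGN character and chr(terminal_to_code_point("#2d")) gives the MINUS SIGN character.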

The syntax {x} on the right-hand side of a production denotes zero or more occurrences of x.

Example

The following production defines a possibly empty sequence of digits as used in the definitions of numerals:

Lexical.ZeroOrMoreDigits : {Lexical.Digit}

The syntax [x] on the right-hand side of a production denotes zero or one occurrences of x.

Example

The following production denotes a possible definition of an integer numeral. It consists of an optional sign followed by a digit followed by zero or more digits (as defined in the previous example):

Lexical.Integer : [Lexical.Sign] Lexical.Digit Lexical.ZeroOrMoreDigits
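The [x] and {x} operators map directly onto an "if" and a "while" in a hand-written recognizer. The following Python sketch matches the Lexical.Integer production from this example. The function name match_integer and its return convention (end index or None) are assumptions made for illustration.

```python
from typing import Optional


def is_digit(ch: str) -> bool:
    # Lexical.Digit: code point range U+0030 to U+0039.
    return "0" <= ch <= "9"


def match_integer(text: str, start: int = 0) -> Optional[int]:
    """Return the index just past a Lexical.Integer at start, or None."""
    pos = start
    # [Lexical.Sign]: zero or one occurrence.
    if pos < len(text) and text[pos] in "+-":
        pos += 1
    # Lexical.Digit: exactly one occurrence is required.
    if pos >= len(text) or not is_digit(text[pos]):
        return None
    pos += 1
    # Lexical.ZeroOrMoreDigits, i.e. {Lexical.Digit}: zero or more.
    while pos < len(text) and is_digit(text[pos]):
        pos += 1
    return pos
```

A lone sign such as "+" is rejected because the mandatory digit is missing, while "-42," matches up to, but not including, the comma.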

The empty string is denoted by ε.

Example

The following productions denote a possibly empty list of integers (with integer as defined in the preceding example). Note that this list may include a trailing comma; hence the {x} operator cannot be used here.

Syntactical.IntegerList : Lexical.Integer Syntactical.IntegerListRest
Syntactical.IntegerList : ε

Syntactical.IntegerListRest : Lexical.Comma Lexical.Integer Syntactical.IntegerListRest
Syntactical.IntegerListRest : Lexical.Comma
Syntactical.IntegerListRest : ε

/* #2c is also known as "COMMA" */
Lexical.Comma : #2c
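The integer list productions above can be checked with a short recognizer. The Python sketch below works on a list of word kinds rather than on raw code points. The tags "integer" and "comma" and the function name is_integer_list are hypothetical.

```python
def is_integer_list(words):
    """Return True if the word kinds form a Syntactical.IntegerList."""
    if not words:
        return True  # Syntactical.IntegerList : ε
    if words[0] != "integer":
        return False
    pos = 1
    while pos < len(words):
        # Every alternative of IntegerListRest except ε starts with a comma.
        if words[pos] != "comma":
            return False
        pos += 1
        if pos == len(words):
            return True  # IntegerListRest : Lexical.Comma (trailing comma)
        if words[pos] != "integer":
            return False
        pos += 1
    return True  # IntegerListRest : ε
```

The sketch accepts the empty list, a list with or without a trailing comma, and rejects a leading comma or two consecutive commas.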

3. Full Profile Lexical Specification

The lexical grammar describes the translation of Unicode code points into words. The goal non-terminal of the lexical grammar is the Lexical.Word symbol.

3.1 word

The word Lexical.Word is defined by

Lexical.Word : Lexical.Period
Lexical.Word : Lexical.Semicolon
Lexical.Word : Lexical.Boolean
Lexical.Word : Lexical.Number
Lexical.Word : Lexical.String
Lexical.Word : Lexical.Void
Lexical.Word : Lexical.Name
Lexical.Word : Lexical.LeftCurlyBracket
Lexical.Word : Lexical.RightCurlyBracket
Lexical.Word : Lexical.LeftSquareBracket
Lexical.Word : Lexical.RightSquareBracket
Lexical.Word : Lexical.Comma
Lexical.Word : Lexical.Colon
/* whitespace, newline, and comment are not considered by the syntactical grammar */
Lexical.Word : Lexical.Whitespace
Lexical.Word : Lexical.Newline
Lexical.Word : Lexical.Comment

3.2 whitespace

The word Lexical.Whitespace is defined by

/* #9 is also known as "CHARACTER TABULATION" */
Lexical.Whitespace : #9
/* #20 is also known as "SPACE" */
Lexical.Whitespace : #20

3.3 line terminator

The word Lexical.LineTerminator is defined by

/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
Lexical.LineTerminator : #a {#d}
Lexical.LineTerminator : #d {#a}
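Read literally, the two productions above accept a line feed followed by any number of carriage returns, or a carriage return followed by any number of line feeds. The following Python sketch implements that literal reading. The function name match_line_terminator is hypothetical.

```python
from typing import Optional


def match_line_terminator(text: str, start: int = 0) -> Optional[int]:
    """Return the index just past a Lexical.LineTerminator, or None."""
    if start >= len(text):
        return None
    first = text[start]
    if first == "\n":      # #a {#d}
        follower = "\r"
    elif first == "\r":    # #d {#a}
        follower = "\n"
    else:
        return None
    pos = start + 1
    # {x}: consume zero or more occurrences of the follower.
    while pos < len(text) and text[pos] == follower:
        pos += 1
    return pos
```

Under this reading the sequence CR LF is a single line terminator, whereas LF LF is two line terminators.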

3.4 comments

A language using the Common Lexical Specification may use both single-line comments and multi-line comments. A Lexical.Comment is either a Lexical.SingleLineComment or a Lexical.MultiLineComment. Lexical.Comment is defined by

Lexical.Comment : Lexical.SingleLineComment
Lexical.Comment : Lexical.MultiLineComment

A Lexical.SingleLineComment starts with two solidi. It extends to the end of the line. Lexical.SingleLineComment is defined by

/* #2f is also known as SOLIDUS */
Lexical.SingleLineComment : #2f #2f /* any sequence of characters except for Lexical.LineTerminator */

The Lexical.LineTerminator is not considered as part of the comment text.

A Lexical.MultiLineComment is opened by a solidus and an asterisk and closed by an asterisk and a solidus. Lexical.MultiLineComment is defined by

/* #2f is also known as SOLIDUS */
/* #2a is also known as ASTERISK */
Lexical.MultiLineComment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f

The #2f #2a and #2a #2f sequences are not considered as part of the comment text.
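Both comment forms can be scanned with a few lines of Python. This is a sketch under stated assumptions: the function name scan_comment and its return value (comment text and end index) are hypothetical, and an unterminated multi-line comment is reported as an error, which this specification does not prescribe.

```python
def scan_comment(text, start=0):
    """Return (comment_text, end_index) for a comment at start, or None."""
    if text.startswith("//", start):
        end = start + 2
        # The comment extends to the end of the line; the line
        # terminator itself is not part of the comment text.
        while end < len(text) and text[end] not in "\n\r":
            end += 1
        return text[start + 2:end], end
    if text.startswith("/*", start):
        # Search for the closing sequence after the opening sequence.
        close = text.find("*/", start + 2)
        if close == -1:
            raise ValueError("unterminated multi-line comment")
        # The opening and closing sequences are not part of the text.
        return text[start + 2:close], close + 2
    return None
```

Because the search for the closing sequence starts after the opening sequence, the input /*/ is not treated as a complete comment.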

This implies:

3.5 parentheses

The words Lexical.LeftParenthesis and Lexical.RightParenthesis, respectively, are defined by

/* #28 is also known as "LEFT PARENTHESIS" */
Lexical.LeftParenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
Lexical.RightParenthesis : #29

3.6 curly brackets

The words Lexical.LeftCurlyBracket and Lexical.RightCurlyBracket, respectively, are defined by

/* #7b is also known as "LEFT CURLY BRACKET" */
Lexical.LeftCurlyBracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
Lexical.RightCurlyBracket : #7d

3.7 colon

The word Lexical.Colon is defined by

/* #3a is also known as "COLON" */
Lexical.Colon : #3a

3.8 square brackets

The words Lexical.LeftSquareBracket and Lexical.RightSquareBracket, respectively, are defined by

/* #5b is also known as "LEFT SQUARE BRACKET" */
Lexical.LeftSquareBracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
Lexical.RightSquareBracket : #5d

3.9 comma

The word Lexical.Comma is defined by

/* #2c is also known as "COMMA" */
Lexical.Comma : #2c

3.10 name

The word Lexical.Name is defined by

Lexical.Name : {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}

/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
Lexical.NameSuffixCharacter : /* The Unicode characters from #41 to #5a and from #61 to #7a. */

/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
Lexical.NameSuffixCharacter : /* The Unicode characters from #30 to #39. */

/* #5f is also known as "LOW LINE" */
Lexical.NameSuffixCharacter : #5f

3.10 number literal

The word Lexical.Number is defined by

Lexical.Number : Lexical.IntegerNumber
Lexical.Number : Lexical.RealNumber
Lexical.IntegerNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit}
Lexical.RealNumber : [Lexical.Sign] Lexical.Period Lexical.Digit {Lexical.Digit} [Lexical.Exponent]
Lexical.RealNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit} [Lexical.Period {Lexical.Digit}] [Lexical.Exponent]
Lexical.Exponent : Lexical.ExponentPrefix [Lexical.Sign] Lexical.Digit {Lexical.Digit}

/* #2b is also known as "PLUS SIGN" */
Lexical.Sign : #2b
/* #2d is also known as "MINUS SIGN" */
Lexical.Sign : #2d
/* #65 is also known as "LATIN SMALL LETTER E" */
Lexical.ExponentPrefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
Lexical.ExponentPrefix : #45
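The number productions translate into a regular expression. The Python sketch below is illustrative only. The names NUMBER, INTEGER, and classify_number are hypothetical. The second Lexical.RealNumber alternative also matches plain digit sequences, so the sketch assumes that a number with neither a period nor an exponent is classified as an integer.

```python
import re

# [Sign] followed by either ".digits" or "digits[.digits]", then [Exponent].
NUMBER = re.compile(
    r"[+-]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][+-]?[0-9]+)?"
)
# Lexical.IntegerNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit}
INTEGER = re.compile(r"[+-]?[0-9]+")


def classify_number(text):
    """Return 'integer', 'real', or None if text is not a Lexical.Number."""
    if NUMBER.fullmatch(text) is None:
        return None
    # Assumption: prefer the integer reading when both readings apply.
    return "integer" if INTEGER.fullmatch(text) else "real"
```

Note that the productions require at least one digit after a leading period, so a lone period is not a number and remains available as Lexical.Period.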

3.11 string literal

The word Lexical.String is defined by

Lexical.String : Lexical.SingleQuotedString
Lexical.String : Lexical.DoubleQuotedString

Lexical.DoubleQuotedString : Lexical.DoubleQuote {Lexical.DoubleQuotedStringCharacter} Lexical.DoubleQuote
Lexical.DoubleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.DoubleQuote and characters in [0,1F]*/
Lexical.DoubleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.DoubleQuotedStringCharacter : #5c Lexical.DoubleQuote
/* #22 is also known as "QUOTATION MARK" */
Lexical.DoubleQuote : #22

Lexical.SingleQuotedString : Lexical.SingleQuote {Lexical.SingleQuotedStringCharacter} Lexical.SingleQuote
Lexical.SingleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.SingleQuote and characters in [0,1F]*/
Lexical.SingleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.SingleQuotedStringCharacter : #5c Lexical.SingleQuote
/* #27 is also known as "APOSTROPHE" */
Lexical.SingleQuote : #27

/* #5c is also known as "REVERSE SOLIDUS", #75 is also known as "LATIN SMALL LETTER U" */
Lexical.EscapeSequence : #5c #75 Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit
/* #5c is also known as "REVERSE SOLIDUS" */
Lexical.EscapeSequence : #5c #5c
/* #62 is also known as "LATIN SMALL LETTER B" */
Lexical.EscapeSequence : #5c #62
/* #66 is also known as "LATIN SMALL LETTER F" */
Lexical.EscapeSequence : #5c #66
/* #6e is also known as "LATIN SMALL LETTER N" */
Lexical.EscapeSequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
Lexical.EscapeSequence : #5c #72
/* #74 is also known as "LATIN SMALL LETTER T" */
Lexical.EscapeSequence : #5c #74
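The escape sequences can be exercised with a small decoder. The grammar above only fixes the lexical form of an escape sequence. The meaning assigned to each sequence in this Python sketch (backspace for the letter b, form feed for f, line feed for n, carriage return for r, character tabulation for t, and a code point for u followed by four hexadecimal digits) is an assumption borrowed from common practice, and the name decode_escapes is hypothetical. The sketch follows the character names given in the comments and also handles the escaped quote characters from the string character productions.

```python
SIMPLE_ESCAPES = {
    "\\": "\\", "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
    '"': '"', "'": "'",
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escapes(body):
    """Decode the text between the quotes of a Lexical.String."""
    out = []
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch != "\\":
            out.append(ch)
            pos += 1
            continue
        if pos + 1 >= len(body):
            raise ValueError("dangling reverse solidus")
        kind = body[pos + 1]
        if kind == "u":
            # Exactly four hexadecimal digits must follow.
            digits = body[pos + 2:pos + 6]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed \\u escape sequence")
            out.append(chr(int(digits, 16)))
            pos += 6
        elif kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            pos += 2
        else:
            raise ValueError(f"unknown escape sequence \\{kind}")
    return "".join(out)
```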

3.12 boolean literal

The word Lexical.Boolean is defined by

Lexical.Boolean : Lexical.True
Lexical.Boolean : Lexical.False
Lexical.True : #74 #72 #75 #65
Lexical.False : #66 #61 #6c #73 #65

Remark: The word Lexical.Boolean is a so-called keyword. It takes priority over the Lexical.Name.

3.13 void literal

The word Lexical.Void is defined by

Lexical.Void : #76 #6f #69 #64

Remark: The word Lexical.Void is a so-called keyword. It takes priority over the Lexical.Name.
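Keyword priority can be realized by checking the keyword table before the name rule. The following Python sketch illustrates this. The names NAME, KEYWORDS, and classify_identifier are hypothetical. The sketch assumes that Lexical.Alphabetic covers the letters A-Z and a-z and that Lexical.Underscore is the LOW LINE character, since neither is defined in this document.

```python
import re

# Lexical.Name : {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}
# Assumption: Alphabetic is A-Z and a-z, Underscore is "_".
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")

KEYWORDS = {
    "true": "Lexical.Boolean",
    "false": "Lexical.Boolean",
    "void": "Lexical.Void",
}


def classify_identifier(text):
    """Classify text as a keyword word, a Lexical.Name, or None."""
    # Keywords take priority over Lexical.Name.
    if text in KEYWORDS:
        return KEYWORDS[text]
    if NAME.fullmatch(text):
        return "Lexical.Name"
    return None
```

Only an exact match is a keyword: a longer word such as truth is still a Lexical.Name.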

3.14 decimal digit

The word Lexical.DecimalDigit is defined by

Lexical.DecimalDigit : /* A single Unicode character from the code point range U+0030 to U+0039. */

3.15 hexadecimal digit

The word Lexical.HexadecimalDigit is defined by

Lexical.HexadecimalDigit : /* A single Unicode character from the code point ranges U+0030 to U+0039, U+0061 to U+0066, and U+0041 to U+0046. */

3.16 alphanumeric

The word Lexical.Alphanumeric is reserved for future use.

3.17 period

The word Lexical.Period is defined by

/* #2e is also known as "FULL STOP" */
Lexical.Period : #2e