Common Lexical Specifications
This Common Lexical Specification provides definitions of grammar rules that are re-used in multiple language specifications on this website. This document consists of four sections: Section 1 defines how programs are encoded on a byte level. Section 2 provides an introduction to grammars. Section 3 provides the full profile lexical grammar. Section 4 provides information on profiles.
1. Unicode
A program is a sequence of Unicode code points encoded into a sequence of bytes using a Unicode encoding. In this version, only UTF-8 NOBOM (UTF-8 without a byte order mark) with sequences of length 1 is supported. The Unicode encoding of a particular program must be determined by consumers of this specification.
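The decoding phase can be sketched as follows. This is a minimal illustration, not a normative algorithm: it assumes that "sequences of length 1" means every code point is encoded as exactly one byte (0x00 to 0x7F) and that a leading byte order mark is rejected. The name decode_program is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def decode_program(data: bytes) -> list:
    """Decode a program into a list of Unicode code points.

    Assumption: only one-byte UTF-8 sequences (0x00..0x7F) are accepted,
    and a leading byte order mark is an error.
    """
    if data.startswith(UTF8_BOM):
        raise ValueError("byte order mark is not supported")
    code_points = []
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            # Lead or continuation byte of a multi-byte sequence.
            raise ValueError(f"multi-byte sequence at offset {offset}")
        code_points.append(byte)
    return code_points
```

The resulting code points are the terminal symbols consumed by the lexical grammar.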
2. Grammars
This section describes context-free grammars used in this specification to define the lexical and syntactical structure of a language.
2.1 Context-free grammars
A context-free grammar consists of a number of productions. Each production has an abstract symbol called a non-terminal as its left-hand side, and a sequence of one or more non-terminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.
Starting from a sequence consisting of a single distinguished non-terminal, called the goal symbol, a given context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can result from repeatedly replacing any non-terminal in the sequence with a right-hand side of a production for which the non-terminal is the left-hand side.
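The replacement process described above can be illustrated with a small sketch. The grammar below (bits, bit) is a hypothetical toy grammar and not part of this specification. The sketch always replaces the leftmost non-terminal, which yields the same language as replacing any non-terminal.

```python
# Hypothetical toy grammar: each non-terminal maps to its right-hand sides.
# Symbols that are not keys of GRAMMAR are terminals.
GRAMMAR = {
    "bits": [["bit"], ["bit", "bits"]],
    "bit": [["0"], ["1"]],
}


def derive(sequence, max_length):
    """Yield every terminal sequence derivable from `sequence`
    whose length does not exceed `max_length`."""
    if len(sequence) > max_length:
        return
    for index, symbol in enumerate(sequence):
        if symbol in GRAMMAR:
            # Replace the leftmost non-terminal by each of its right-hand sides.
            for rhs in GRAMMAR[symbol]:
                replaced = sequence[:index] + rhs + sequence[index + 1:]
                yield from derive(replaced, max_length)
            return
    # No non-terminal is left: the sequence belongs to the language.
    yield "".join(sequence)
```

Starting from the goal symbol, list(derive(["bits"], 2)) enumerates all sequences of at most two terminals in the language of the toy grammar.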
2.3 Lexical grammars
The lexical grammar uses the Unicode code points from the Unicode decoding phase as its terminal symbols.
It defines a set of productions, starting from the goal symbol word.
2.4 Syntactical grammars
The syntactical grammar for the Data Definition Language uses words of the lexical grammar as its terminal symbols.
It defines a set of productions, starting from the goal symbol
2.5 Grammar notation
Productions are written in a fixed-width font.
A production is defined by its left-hand side, followed by a colon and its right-hand side. A right-hand side that is described in prose is opened by /* and closed by */.
The following production denotes the non-terminal for a digit as used in the definitions of numerals:
digit : /* A single Unicode symbol from the code point range U+0030 to U+0039 */
A terminal is a sequence of Unicode symbols. A Unicode symbol is denoted by a number sign # followed by a hexadecimal number denoting its code point.
The following productions denote the non-terminal for a sign as used in the definitions of numerals:
/* #2b is also known as "PLUS SIGN" */
plus_sign : #2b
/* #2d is also known as "MINUS SIGN" */
minus_sign : #2d
sign : plus_sign
sign : minus_sign
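To make the terminal notation concrete, the following sketch translates a terminal written as a sequence of #-prefixed hexadecimal code points into the text it denotes. The function name terminal_to_text is hypothetical, and the sketch assumes that the symbols of a terminal are separated by whitespace.

```python
def terminal_to_text(terminal: str) -> str:
    """Translate a terminal such as '#2f #2a' into the text it denotes."""
    characters = []
    for symbol in terminal.split():
        if not symbol.startswith("#") or len(symbol) == 1:
            raise ValueError(f"not a Unicode symbol: {symbol!r}")
        # The digits after '#' are the code point in hexadecimal.
        characters.append(chr(int(symbol[1:], 16)))
    return "".join(characters)
```

For example, terminal_to_text("#2b") gives the PLUS SIGN character and terminal_to_text("#2d") gives the MINUS SIGN character.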
The syntax {x} on the right-hand side of a production denotes zero or more occurrences of x.
The following production defines a possibly empty sequence of digits as used in the definitions of numerals:
zero-or-more-digits : {digit}
The syntax [x] on the right-hand side of a production denotes zero or one occurrences of x.
The following production denotes a possible definition of an integer numeral. It consists of an optional sign followed by a digit followed by zero-or-more-digits (as defined in the preceding examples):
integer : [sign] digit zero-or-more-digits
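The [x] and {x} notations map directly onto an "if" and a "while" in a hand-written recognizer. The sketch below checks whether a complete text matches the example integer production. The function names are hypothetical and the code is an illustration, not a reference implementation.

```python
def is_digit(character: str) -> bool:
    """digit: a single character from U+0030 to U+0039."""
    return "0" <= character <= "9"


def match_integer(text: str) -> bool:
    """Return True if the whole text matches [sign] digit {digit}."""
    position = 0
    # [sign]: zero or one occurrence.
    if position < len(text) and text[position] in "+-":
        position += 1
    # digit: exactly one occurrence.
    if position >= len(text) or not is_digit(text[position]):
        return False
    position += 1
    # {digit}: zero or more occurrences.
    while position < len(text) and is_digit(text[position]):
        position += 1
    return position == len(text)
```

A lone sign is rejected because the production requires at least one digit.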
The empty string is denoted by ε.
The following productions denote a possibly empty list of integers (with integer as defined in the preceding example). Note that this list may include a trailing comma.
integer-list : integer integer-list-rest
integer-list : ε
integer-list-rest : comma integer integer-list-rest
integer-list-rest : comma
integer-list-rest : ε
/* #2c is also known as "COMMA" */
comma : #2c
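The integer-list productions can be read as a loop: after the first integer, each comma is either followed by another integer or ends the list. The following sketch recognizes exactly that shape. It assumes, as the productions do, that no whitespace may appear between the elements. The names are hypothetical.

```python
import re

# integer : [sign] digit zero-or-more-digits
INTEGER = re.compile(r"[+-]?[0-9]+")


def match_integer_list(text: str) -> bool:
    """Return True if the whole text matches integer-list."""
    first = INTEGER.match(text, 0)
    if first is None:
        # integer-list : ε
        return text == ""
    position = first.end()
    # integer-list-rest : comma integer integer-list-rest
    while position < len(text) and text[position] == ",":
        position += 1
        following = INTEGER.match(text, position)
        if following is None:
            # integer-list-rest : comma (a trailing comma ends the list)
            break
        position = following.end()
    return position == len(text)
```

A single comma on its own is rejected, because a trailing comma is only permitted after at least one integer.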
3. Full Profile Lexical Specification
The lexical grammar describes the translation of Unicode code points into words.
The goal non-terminal of the lexical grammar is the word symbol.
3.1. word
The word word is defined by
word : delimiters
word : boolean
word : number
word : string
word : void
word : name
word : left_curly_bracket
word : right_curly_bracket
word : left_square_bracket
word : right_square_bracket
word : comma
word : colon
/* whitespace, newline, and comment are not considered by the syntactical grammar */
word : whitespace
word : newline
word : comment
3.2. whitespace
The word whitespace is defined by
/* #9 is also known as "CHARACTER TABULATION" */
whitespace : #9
/* #20 is also known as "SPACE" */
whitespace : #20
3.3. line terminator
The word line_terminator is defined by
/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
line_terminator : #a {#d}
line_terminator : #d {#a}
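A literal reading of the two line_terminator productions can be sketched as follows. The sketch assumes that the repetition {x} is matched greedily, so a line feed absorbs all directly following carriage returns and vice versa. The function name match_line_terminator is hypothetical.

```python
def match_line_terminator(text: str, position: int):
    """Return the end position of the line_terminator that starts at
    `position`, or None if no line_terminator starts there."""
    if position >= len(text):
        return None
    first = text[position]
    if first == "\n":       # line_terminator : #a {#d}
        follower = "\r"
    elif first == "\r":     # line_terminator : #d {#a}
        follower = "\n"
    else:
        return None
    end = position + 1
    # {x}: zero or more occurrences, matched greedily (assumption).
    while end < len(text) and text[end] == follower:
        end += 1
    return end
```

Under this reading, a carriage return followed by a line feed forms a single line_terminator.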
3.4. comments
A language using the Common Lexical Specification may use both single-line comments and multi-line comments.
A comment is either a single_line_comment or a multi_line_comment.
comment is defined by
comment : single_line_comment
comment : multi_line_comment
A single_line_comment starts with two solidi.
It extends to the end of the line.
single_line_comment is defined by
/* #2f is also known as "SOLIDUS" */
single_line_comment :
#2f #2f
/* any sequence of characters except for line_terminator */
The line_terminator is not considered as part of the comment text.
A multi_line_comment is opened by a solidus and an asterisk and closed by an asterisk and a solidus.
multi_line_comment is defined by
/* #2f is also known as "SOLIDUS" */
/* #2a is also known as "ASTERISK" */
multi_line_comment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f
The #2f #2a and #2a #2f sequences are not considered as part of the comment text.
This implies:
- #2f #2f has no special meaning in either kind of comment.
- #2f #2a and #2a #2f have no special meaning in single-line comments.
- Multi-line comments do not nest.
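The comment rules of this section can be sketched as a small scanner that returns the comment text and the position after the comment. The function name scan_comment is hypothetical, and raising an error for an unterminated multi-line comment is an assumption, since this section does not specify that case.

```python
def scan_comment(text: str, position: int):
    """Return (comment_text, end_position) for the comment starting at
    `position`, or None if no comment starts there."""
    if text.startswith("//", position):
        end = position + 2
        # The comment extends up to, but does not include, the line terminator.
        while end < len(text) and text[end] not in "\r\n":
            end += 1
        return text[position + 2:end], end
    if text.startswith("/*", position):
        close = text.find("*/", position + 2)
        if close == -1:
            # Assumption: an unterminated comment is reported as an error.
            raise ValueError("unterminated multi-line comment")
        # The first '*/' closes the comment, so comments do not nest.
        return text[position + 2:close], close + 2
    return None
```

Because the scanner only looks for the closing sequence once a comment has been opened, an opening sequence inside a comment is treated as ordinary comment text.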
3.5. parentheses
The words left_parenthesis and right_parenthesis, respectively, are defined by
/* #28 is also known as "LEFT PARENTHESIS" */
left_parenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
right_parenthesis : #29
3.6. curly brackets
The words left_curly_bracket and right_curly_bracket, respectively, are defined by
/* #7b is also known as "LEFT CURLY BRACKET" */
left_curly_bracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
right_curly_bracket : #7d
3.7. colon
The word colon is defined by
/* #3a is also known as "COLON" */
colon : #3a
3.8. square brackets
The words left_square_bracket and right_square_bracket, respectively, are defined by
/* #5b is also known as "LEFT SQUARE BRACKET" */
left_square_bracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
right_square_bracket : #5d
3.9. comma
The word comma is defined by
/* #2c is also known as "COMMA" */
comma : #2c
name
The word name is defined by
name : {underscore} alphabetic {name_suffix_character}
/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
name_suffix_character : /* The Unicode characters from #41 to #5a and from #61 to #7a. */
/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
name_suffix_character : /* The Unicode characters from #30 to #39. */
/* #5f is also known as "LOW LINE" */
name_suffix_character : #5f
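The name production can be approximated by a regular expression. The non-terminals underscore and alphabetic are not defined in this section, so the sketch assumes that underscore is #5f and that alphabetic covers the letters #41 to #5a and #61 to #7a. The names NAME and is_name are hypothetical.

```python
import re

# name : {underscore} alphabetic {name_suffix_character}
# Assumption: underscore is '_' and alphabetic is A-Z or a-z.
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")


def is_name(text: str) -> bool:
    """Return True if the whole text matches the name production."""
    return NAME.fullmatch(text) is not None
```

Under these assumptions a name must contain at least one letter, so a text made only of underscores is not a name.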
3.10. number literal
The word number is defined by
number : integer_number
number : real_number
integer_number : [sign] digit {digit}
real_number : [sign] period digit {digit} [exponent]
real_number : [sign] digit {digit} [period {digit}] [exponent]
exponent : exponent_prefix [sign] digit {digit}
/* #2b is also known as "PLUS SIGN" */
sign : #2b
/* #2d is also known as "MINUS SIGN" */
sign : #2d
/* #2e is also known as "FULL STOP" */
period : #2e
/* #65 is also known as "LATIN SMALL LETTER E" */
exponent_prefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
exponent_prefix : #45
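The number productions translate into two regular expressions. A text such as 12 matches both integer_number and the second real_number production, so the sketch below reports integer_number first. That tie-break is an assumption and is not stated in this section. The names are hypothetical.

```python
import re

# integer_number : [sign] digit {digit}
INTEGER_NUMBER = re.compile(r"[+-]?[0-9]+")

# exponent : exponent_prefix [sign] digit {digit}
EXPONENT = r"(?:[eE][+-]?[0-9]+)"

# real_number : [sign] period digit {digit} [exponent]
# real_number : [sign] digit {digit} [period {digit}] [exponent]
REAL_NUMBER = re.compile(
    r"[+-]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)" + EXPONENT + r"?"
)


def classify_number(text: str):
    """Return 'integer_number', 'real_number', or None for the whole text."""
    if INTEGER_NUMBER.fullmatch(text):
        return "integer_number"  # assumption: integers win the tie
    if REAL_NUMBER.fullmatch(text):
        return "real_number"
    return None
```

Note that the productions accept a trailing period without fraction digits, but not a period on its own.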
3.11. string literal
The word string is defined by
string : single_quoted_string
string : double_quoted_string
double_quoted_string : double_quote {double_quoted_string_character} double_quote
double_quoted_string_character : /* any character except for newline and double_quote */
double_quoted_string_character : escape_sequence
double_quoted_string_character : #5c double_quote
/* #22 is also known as "QUOTATION MARK" */
double_quote : #22
single_quoted_string : single_quote {single_quoted_string_character} single_quote
single_quoted_string_character : /* any character except for newline and single_quote */
single_quoted_string_character : escape_sequence
single_quoted_string_character : #5c single_quote
/* #27 is also known as "APOSTROPHE" */
single_quote : #27
/* #5c is also known as "REVERSE SOLIDUS" */
escape_sequence : #5c #5c
/* #6e is also known as "LATIN SMALL LETTER N" */
escape_sequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
escape_sequence : #5c #72
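The string productions can be sketched as a scanner that finds the end of a string literal. Since a reverse solidus is also covered by the "any character" production, the sketch assumes that an escape_sequence or an escaped quote takes priority over reading the reverse solidus as an ordinary character. Raising an error for an unterminated string is likewise an assumption. The function name scan_string is hypothetical.

```python
def scan_string(text: str, position: int = 0):
    """Return the position after the string literal that starts at
    `position`, or None if no string literal starts there."""
    if position >= len(text) or text[position] not in ('"', "'"):
        return None
    quote = text[position]
    index = position + 1
    # A string literal must not contain a newline.
    while index < len(text) and text[index] not in "\r\n":
        if text[index] == quote:
            return index + 1
        if text[index] == "\\" and text[index + 1:index + 2] in ("\\", "n", "r", quote):
            index += 2  # escape_sequence or escaped quote (takes priority)
        else:
            index += 1  # any other character
    raise ValueError("unterminated string literal")
```

Only the quote character that opened the string can be escaped as a quote; the other quote character is an ordinary character inside the string.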
3.12. boolean literal
The word boolean is defined by
boolean : true
boolean : false
true : #74 #72 #75 #65
false : #66 #61 #6c #73 #65
Remark: The word boolean is a so-called keyword.
It takes priority over the name.
3.13. void literal
The word void is defined by
void : #76 #6f #69 #64
Remark: The word void is a so-called keyword.
It takes priority over the name.
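The priority of the boolean and void keywords over name can be sketched as a lookup that runs after a lexeme has been recognized as a name. The function name classify_name_like and the assumption that its argument already matches the name production are hypothetical.

```python
# Keywords and the word kind they produce.
KEYWORDS = {"true": "boolean", "false": "boolean", "void": "void"}


def classify_name_like(lexeme: str) -> str:
    """Classify a lexeme that already matches the name production.

    Keywords take priority over name; everything else stays a name.
    """
    return KEYWORDS.get(lexeme, "name")
```

Only an exact match is a keyword, so a longer lexeme that merely starts with a keyword remains a name.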
3.14. digit
The word digit is defined by
digit : /* A single Unicode character from the code point range U+0030 to U+0039. */
3.15. alphanumeric
The word alphanumeric is reserved for future use.