Common Lexical Specifications

This Common Lexical Specification defines grammar rules that are reused in multiple language specifications on this website. This document consists of four sections: Section 1 defines how programs are encoded at the byte level. Section 2 provides an introduction to grammars. Section 3 provides the full profile lexical grammar. Section 4 provides information on profiles.

1. Unicode

A program is a sequence of Unicode code points encoded into a sequence of bytes using a Unicode encoding. In this version, only UTF-8 without a byte order mark (NOBOM) and with sequences of length 1 is supported. The Unicode encoding of a particular program must be determined by consumers of this specification.
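The decoding step can be sketched in Python as follows. This is a minimal sketch, not a normative implementation: it assumes that "sequences of length 1" means every code point is encoded as a single byte (U+0000 to U+007F), and the function name decode_program is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def decode_program(data: bytes) -> list:
    """Decode bytes into code points, accepting only BOM-less, single-byte UTF-8.

    Assumption: "sequences of length 1" restricts programs to code points
    that UTF-8 encodes in one byte, that is, bytes below 0x80.
    """
    # The BOM bytes would also fail the range check below; testing for the
    # BOM first only yields a more specific error message.
    if data.startswith(UTF8_BOM):
        raise ValueError("byte order mark is not supported")
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            raise ValueError(f"multi-byte UTF-8 sequence at offset {offset}")
    # For single-byte sequences, each byte value equals its code point.
    return list(data)
```

The resulting list of code points is what the lexical grammar consumes as its terminal symbols.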

2. Grammars

This section describes context-free grammars used in this specification to define the lexical and syntactical structure of a language.

2.1 Context-free grammars

A context-free grammar consists of a number of productions. Each production has an abstract symbol called a non-terminal as its left-hand side, and a sequence of one or more non-terminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.

Starting from a sequence consisting of a single distinguished non-terminal, called the goal symbol, a given context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can result from repeatedly replacing any non-terminal in the sequence with a right-hand side of a production for which the non-terminal is the left-hand side.
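The replacement process described above can be illustrated with a small Python sketch that enumerates the sentences of a toy grammar up to a length bound. The grammar representation (a dictionary from non-terminal to a list of right-hand sides) and the function name language are assumptions made for this illustration. The sketch relies on every right-hand side being non-empty, as stated in this section.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def language(grammar: Dict[str, List[List[str]]], goal: str, max_len: int) -> Set[str]:
    """Enumerate all terminal strings of at most max_len symbols derivable from goal.

    Symbols that are keys of `grammar` are non-terminals; all others are terminals.
    Assumes no production has an empty right-hand side, so a sequence never shrinks.
    """
    results: Set[str] = set()
    start: Tuple[str, ...] = (goal,)
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost non-terminal; if there is none, the form is a sentence.
        index = next((i for i, s in enumerate(form) if s in grammar), None)
        if index is None:
            results.add("".join(form))
            continue
        # Replace the non-terminal with each of its right-hand sides.
        for rhs in grammar[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results


toy = {
    "digits": [["digit"], ["digit", "digits"]],
    "digit": [["0"], ["1"]],
}
print(sorted(language(toy, "digits", 2)))
```

With the toy grammar and a bound of 2, the enumeration yields the six strings 0, 1, 00, 01, 10 and 11.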

2.3 Lexical grammars

The lexical grammar uses the Unicode code points from the Unicode decoding phase as its terminal symbols. It defines a set of productions, starting from the goal symbol word, that describe how sequences of code points are translated into a word.

2.4 Syntactical grammars

The syntactical grammar for the Data Definition Language uses words of the lexical grammar as its terminal symbols. It defines a set of productions, starting from the goal symbol sentence, that describe how sequences of words are translated into a sentence.

2.5 Grammar notation

Productions are written in fixed width fonts.

A production is defined by its left-hand side, followed by a colon, followed by its right-hand side definition. The left-hand side is the name of the non-terminal defined by the production. Multiple alternative definitions of a production may be given. The right-hand side of a production consists of any sequence of terminals and non-terminals. In certain cases the right-hand side is replaced by a comment describing the right-hand side. This comment is opened by /* and closed by */.

Example

The following production denotes the non-terminal for a digit as used in the definitions of numerals:

digit : /* A single Unicode symbol from the code point range U+0030 to U+0039 */

A terminal is a sequence of Unicode symbols. A Unicode symbol is denoted by a number sign # followed by a hexadecimal number denoting its code point.

Example

The following productions denote the non-terminal for a sign as used in the definitions of numerals:

/* #2b is also known as "PLUS SIGN" */
plus_sign : #2b
/* #2d is also known as "MINUS SIGN" */
minus_sign : #2d
sign : plus_sign
sign : minus_sign
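To make the terminal notation concrete, the following Python sketch converts a terminal written as a sequence of # symbols into the string it denotes. The function name terminal_to_text is hypothetical and the sketch is only an illustration of the notation.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def terminal_to_text(terminal: str) -> str:
    """Turn a terminal such as "#2f #2a" into the string of code points it denotes."""
    chars = []
    for token in terminal.split():
        digits = token[1:]
        # Each symbol is a number sign followed by at least one hexadecimal digit.
        if not token.startswith("#") or not digits or not set(digits) <= HEX_DIGITS:
            raise ValueError(f"not a Unicode symbol: {token!r}")
        chars.append(chr(int(digits, 16)))
    return "".join(chars)


print(terminal_to_text("#2b"))  # the PLUS SIGN character
```

For example, the terminal #2b denotes the single character + and the terminal #2d denotes -.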

The syntax {x} on the right-hand side of a production denotes zero or more occurrences of x.

Example

The following production defines a possibly empty sequence of digits as used in the definitions of numerals:

zero-or-more-digits : {digit}

The syntax [x] on the right-hand side of a production denotes zero or one occurrences of x.

Example

The following production denotes a possible definition of an integer numeral. It consists of an optional sign followed by a digit followed by zero-or-more-digits (as defined in the preceding examples):

integer : [sign] digit zero-or-more-digits
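A hand-written recognizer shows how the two notations map to code: [x] becomes an optional step and {x} becomes a loop. The following Python sketch follows the integer production from the example; the function name match_integer and its return convention are assumptions made for this illustration.

```python
from typing import Optional


def is_digit(ch: str) -> bool:
    """True for a code point in the range U+0030 to U+0039."""
    return "0" <= ch <= "9"


def match_integer(text: str, pos: int = 0) -> Optional[int]:
    """Return the position just after the integer starting at pos, or None."""
    i = pos
    # [sign]: zero or one occurrence
    if i < len(text) and text[i] in "+-":
        i += 1
    # digit: exactly one occurrence is required
    if i >= len(text) or not is_digit(text[i]):
        return None
    i += 1
    # zero-or-more-digits, that is {digit}: zero or more occurrences
    while i < len(text) and is_digit(text[i]):
        i += 1
    return i
```

Applied to the text -42, followed by further input, the recognizer stops after the last digit and leaves the comma for the next word.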

The empty string is denoted by ε.

Example

The following productions denote a possibly empty list of integers (with integer as defined in the preceding example). Note that this list may include a trailing comma.

integer-list : integer integer-list-rest
integer-list : ε

integer-list-rest : comma integer integer-list-rest
integer-list-rest : comma
integer-list-rest : ε

/* #2c is also known as "COMMA" */
comma : #2c
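The recursive productions above describe a regular language, so they can be checked with a regular expression. The following Python sketch is one possible translation, with the ε alternatives expressed as optional groups; the names INTEGER_LIST and is_integer_list are hypothetical.

```python
import re

# integer : [sign] digit zero-or-more-digits
INTEGER = r"[+-]?[0-9]+"

# integer-list is either empty, or an integer followed by any number of
# "comma integer" pairs and an optional trailing comma.
INTEGER_LIST = re.compile(rf"(?:{INTEGER}(?:,{INTEGER})*,?)?")


def is_integer_list(text: str) -> bool:
    """True if the whole text is derivable from integer-list."""
    return INTEGER_LIST.fullmatch(text) is not None


print(is_integer_list("1,-2,+3,"))
```

Note that a lone comma and two adjacent commas are rejected, because integer-list-rest only allows a comma after an integer.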

3. Full Profile Lexical Specification

The lexical grammar describes the translation of Unicode code points into words. The goal non-terminal of the lexical grammar is the word symbol.

3.1. word

The word word is defined by

word : delimiters
word : boolean
word : number
word : string
word : void
word : name
word : left_curly_bracket
word : right_curly_bracket
word : left_square_bracket
word : right_square_bracket
word : comma
word : colon
/* whitespace, newline, and comment are not considered by the syntactical grammar */
word : whitespace
word : newline
word : comment

3.2. whitespace

The word whitespace is defined by

/* #9 is also known as "CHARACTER TABULATION" */
whitespace : #9
/* #20 is also known as "SPACE" */
whitespace : #20

3.3. line terminator

The word line_terminator is defined by

/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
line_terminator : #a {#d}
line_terminator : #d {#a}

3.4. comments

A language using the Common Lexical Specification may use both single-line comments and multi-line comments. A comment is either a single_line_comment or a multi_line_comment. comment is defined by

comment : single_line_comment
comment : multi_line_comment

A single_line_comment starts with two solidi. It extends to the end of the line. single_line_comment is defined by:

/* #2f is also known as SOLIDUS */
single_line_comment : #2f #2f /* any sequence of characters except for line_terminator */

The line_terminator is not considered as part of the comment text.

A multi_line_comment is opened by a solidus and an asterisk and closed by an asterisk and a solidus. multi_line_comment is defined by

/* #2f is also known as SOLIDUS */
/* #2a is also known as ASTERISK */
multi_line_comment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f

The #2f #2a and #2a #2f sequences are not considered as part of the comment text.
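The two comment forms can be scanned as in the following Python sketch. It is a sketch under stated assumptions: the function name scan_comment and its return value (comment text and end position) are hypothetical, and raising an error for an unterminated multi-line comment is a choice of this sketch, not something this specification prescribes.

```python
from typing import Optional, Tuple


def scan_comment(text: str, pos: int = 0) -> Optional[Tuple[str, int]]:
    """Scan a comment starting at pos.

    Returns (comment text, position after the comment), or None if no comment
    starts at pos. The returned text excludes the delimiters and, for
    single-line comments, the line terminator.
    """
    if text.startswith("//", pos):
        end = pos + 2
        # The comment extends up to, but not including, the line terminator.
        while end < len(text) and text[end] not in "\n\r":
            end += 1
        return text[pos + 2:end], end
    if text.startswith("/*", pos):
        # The first closing sequence ends the comment.
        close = text.find("*/", pos + 2)
        if close == -1:
            raise ValueError("unterminated multi-line comment")
        return text[pos + 2:close], close + 2
    return None
```

Because the first closing sequence ends the comment, the scanner in this sketch does not treat multi-line comments as nested.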

This implies:

3.5. parentheses

The words left_parenthesis and right_parenthesis, respectively, are defined by

/* #28 is also known as "LEFT PARENTHESIS" */
left_parenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
right_parenthesis : #29

3.6. curly brackets

The words left_curly_bracket and right_curly_bracket, respectively, are defined by

/* #7b is also known as "LEFT CURLY BRACKET" */
left_curly_bracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
right_curly_bracket : #7d

3.7. colon

The word colon is defined by

/* #3a is also known as "COLON" */
colon : #3a

3.8. square brackets

The words left_square_bracket and right_square_bracket, respectively, are defined by

/* #5b is also known as "LEFT SQUARE BRACKET" */
left_square_bracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
right_square_bracket : #5d

3.9. comma

The word comma is defined by

/* #2c is also known as "COMMA" */
comma : #2c

name

The word name is defined by

name : {underscore} alphabetic {name_suffix_character}

/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
name_suffix_character : /* The Unicode characters from #41 to #5a and from #61 to #7a. */

/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
name_suffix_character : /* The unicode characters from #30 to #39. */

/* #5f is also known as "LOW LINE" */
name_suffix_character : #5f
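The name production can be sketched as a regular expression in Python. The non-terminals underscore and alphabetic are not defined in this section, so the sketch assumes that underscore is #5f and that alphabetic covers the ranges #41 to #5a and #61 to #7a; the names NAME and is_name are hypothetical.

```python
import re

# name : {underscore} alphabetic {name_suffix_character}
# Assumptions: underscore is "_" (#5f), alphabetic is A-Z or a-z.
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")


def is_name(text: str) -> bool:
    """True if the whole text is derivable from name under the assumptions above."""
    return NAME.fullmatch(text) is not None


print(is_name("_value_1"))
```

Under these assumptions a name must contain at least one letter, so a sequence consisting only of underscores is not a name.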

3.10. number literal

The word number is defined by

number : integer_number
number : real_number
integer_number : [sign] digit {digit}
real_number : [sign] period digit {digit} [exponent]
real_number : [sign] digit {digit} [period {digit}] [exponent]
exponent : exponent_prefix [sign] digit {digit}

/* #2b is also known as "PLUS SIGN" */
sign : #2b
/* #2d is also known as "MINUS SIGN" */
sign : #2d
/* #2e is also known as "FULL STOP" */
period : #2e
/* #65 is also known as "LATIN SMALL LETTER E" */
exponent_prefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
exponent_prefix : #45
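The number productions can be translated into regular expressions, as the following Python sketch shows. Every integer_number is also derivable from the second real_number alternative, so the sketch has to pick one; classifying such input as an integer is an assumption of this sketch, and the function name classify_number is hypothetical.

```python
import re
from typing import Optional

SIGN = r"[+-]?"
DIGITS = r"[0-9]+"
EXPONENT = rf"(?:[eE]{SIGN}{DIGITS})?"

# integer_number : [sign] digit {digit}
INTEGER_NUMBER = re.compile(rf"{SIGN}{DIGITS}")

# real_number : [sign] period digit {digit} [exponent]
# real_number : [sign] digit {digit} [period {digit}] [exponent]
REAL_NUMBER = re.compile(rf"{SIGN}(?:\.{DIGITS}|{DIGITS}(?:\.[0-9]*)?){EXPONENT}")


def classify_number(text: str) -> Optional[str]:
    """Return "integer", "real", or None if text is not a number."""
    # Assumption: input matching both productions is reported as an integer.
    if INTEGER_NUMBER.fullmatch(text):
        return "integer"
    if REAL_NUMBER.fullmatch(text):
        return "real"
    return None
```

Note that the grammar accepts a real number with a trailing period such as 1. but rejects a lone period, since the first alternative requires at least one digit after the period.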

3.11. string literal

The word string is defined by

string : single_quoted_string
string : double_quoted_string

double_quoted_string : double_quote {double_quoted_string_character} double_quote
double_quoted_string_character : /* any character except for newline and double_quote */
double_quoted_string_character : escape_sequence
double_quoted_string_character : #5c double_quote
/* #22 is also known as "QUOTATION MARK" */
double_quote : #22

single_quoted_string : single_quote {single_quoted_string_character} single_quote
single_quoted_string_character : /* any character except for newline and single quote */
single_quoted_string_character : escape_sequence
single_quoted_string_character : #5c single_quote
/* #27 is also known as "APOSTROPHE" */
single_quote : #27

/* #5c is also known as "REVERSE SOLIDUS" */
escape_sequence : #5c #5c
/* #6e is also known as "LATIN SMALL LETTER N" */
escape_sequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
escape_sequence : #5c #72
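A scanner for string literals can be sketched in Python as follows. The productions are ambiguous for a reverse solidus, which is both an ordinary character and the start of an escape; this sketch assumes the escape reading takes priority. It also assumes that "newline" covers both #a and #d. The function name scan_string is hypothetical, and the sketch returns the raw text between the quotes without interpreting escapes.

```python
from typing import Optional, Tuple


def scan_string(text: str, pos: int = 0) -> Optional[Tuple[str, int]]:
    """Scan a single- or double-quoted string starting at pos.

    Returns (raw text between the quotes, position after the closing quote),
    or None if no string starts at pos.
    """
    if pos >= len(text) or text[pos] not in "\"'":
        return None
    quote = text[pos]
    i = pos + 1
    while i < len(text):
        ch = text[i]
        # escape_sequence, or a reverse solidus followed by the active quote
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ("\\", "n", "r", quote):
            i += 2
            continue
        if ch == quote:
            return text[pos + 1:i], i + 1
        # Assumption: "newline" means LF or CR; neither may appear in a string.
        if ch in "\n\r":
            break
        i += 1
    raise ValueError("unterminated string literal")
```

A reverse solidus followed by any other character is treated as an ordinary string character in this sketch, in line with the first alternative of the character productions.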

3.12. boolean literal

The word boolean is defined by

boolean : true
boolean : false
true : #74 #72 #75 #65
false : #66 #61 #6c #73 #65

Remark: The word boolean is a so-called keyword. It takes priority over the word name.

3.13. void literal

The word void is defined by

void : #76 #6f #69 #64

Remark: The word void is a so-called keyword. It takes priority over the word name.
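The keyword priority can be sketched in Python by checking a complete lexeme against the keywords before testing it against the name production. The name pattern below assumes that underscore is #5f and that alphabetic covers A to Z and a to z, since those non-terminals are not defined in this section; the function name classify_word is hypothetical.

```python
import re
from typing import Optional

# Assumed shape of name: {underscore} alphabetic {name_suffix_character}
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")

# Keywords and the word they produce.
KEYWORDS = {"true": "boolean", "false": "boolean", "void": "void"}


def classify_word(lexeme: str) -> Optional[str]:
    """Classify a complete lexeme as "boolean", "void", "name", or None."""
    # Keywords are checked first, so they take priority over name.
    if lexeme in KEYWORDS:
        return KEYWORDS[lexeme]
    if NAME.fullmatch(lexeme):
        return "name"
    return None


print(classify_word("void"), classify_word("voided"))
```

Only an exact match is a keyword: a longer lexeme such as voided, or a differently cased one such as True, is still a name in this sketch.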

3.14. digit

The word digit is defined by

digit : /* A single Unicode character from the code point range U+0030 to U+0039. */

3.15. alphanumeric

The word alphanumeric is reserved for future use.