Common Lexical Translations
The Common Lexical Translations specification defines a lexical translation, that is, it defines translation rules for how an input sequence of Unicode code points is translated into an output sequence of transformed and classified sequences of Unicode code points called words. The rules provided by the Common Lexical Translations specification are reused by other specifications within Michael Heilmann’s Arcadia.
1. Introduction
The rules laid out here are described in terms of the concepts and notations of the Context-Free Grammars specification (see https://michaelheilmann.com/specifications/context-free-grammars for more information).
A grammar incorporating these rules must have the set of all Unicode code points as its input alphabet. Furthermore, the grammar must ensure that ambiguities are resolved.
2. Standard Profile
The Full Profile contains all possible rules. Other profiles may be added in future versions. Language designers are encouraged to create their own profiles.
2.1 word
The word Lexical.Word is defined by
Lexical.Word : Lexical.Period
Lexical.Word : Lexical.Semicolon
Lexical.Word : Lexical.Boolean
Lexical.Word : Lexical.Number
Lexical.Word : Lexical.String
Lexical.Word : Lexical.Void
Lexical.Word : Lexical.Name
Lexical.Word : Lexical.LeftCurlyBracket
Lexical.Word : Lexical.RightCurlyBracket
Lexical.Word : Lexical.LeftSquareBracket
Lexical.Word : Lexical.RightSquareBracket
Lexical.Word : Lexical.Comma
Lexical.Word : Lexical.Colon
Lexical.Word : Lexical.Whitespace
Lexical.Word : Lexical.Newline
Lexical.Word : Lexical.Comment
2.2 whitespace
The word Lexical.Whitespace is defined by
/* #9 is also known as "CHARACTER TABULATION" */
Lexical.Whitespace : #9
/* #20 is also known as "SPACE" */
Lexical.Whitespace : #20
2.3 line terminator
The word Lexical.LineTerminator is defined by
/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
Lexical.LineTerminator : #a {#d}
Lexical.LineTerminator : #d {#a}
2.4 comments
A language using the Common Lexical Translations specification may use both single-line comments and multi-line comments.
A Lexical.Comment is either a Lexical.SingleLineComment or a Lexical.MultiLineComment.
Lexical.Comment is defined by
Lexical.Comment : Lexical.SingleLineComment
Lexical.Comment : Lexical.MultiLineComment
A Lexical.SingleLineComment starts with two solidi.
It extends to the end of the line.
Lexical.SingleLineComment is defined by
/* #2f is also known as SOLIDUS */
Lexical.SingleLineComment :
#2f #2f
/* any sequence of characters except for Lexical.LineTerminator */
The Lexical.LineTerminator is not considered as part of the comment text.
A Lexical.MultiLineComment is opened by a solidus and an asterisk and closed by an asterisk and a solidus.
Lexical.MultiLineComment is defined by
/* #2f is also known as SOLIDUS */
/* #2a is also known as ASTERISK */
Lexical.MultiLineComment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f
The #2f #2a and #2a #2f sequences are not considered as part of the comment text.
This implies:
- #2f #2f has no special meaning in either kind of comment.
- #2f #2a and #2a #2f have no special meaning in single-line comments.
- Multi-line comments do not nest.
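The comment rules above can be illustrated with a minimal Python sketch. The function name `scan_comment` and the choice to raise an error for an unterminated multi-line comment are assumptions made for this illustration; they are not part of the specification.

```python
def scan_comment(text, pos=0):
    """Scan a comment starting at text[pos].

    Returns (kind, comment_text, end) or None if no comment starts at pos.
    Hypothetical helper; the specification does not define this function.
    """
    if text.startswith("//", pos):
        end = pos + 2
        # A single-line comment extends to the end of the line. The line
        # terminator (#a or #d) is not part of the comment text.
        while end < len(text) and text[end] not in "\n\r":
            end += 1
        return ("single", text[pos + 2:end], end)
    if text.startswith("/*", pos):
        # The first "*/" closes the comment, so multi-line comments do not nest.
        close = text.find("*/", pos + 2)
        if close < 0:
            # Assumption: an unterminated comment is treated as an error.
            raise ValueError("unterminated multi-line comment")
        return ("multi", text[pos + 2:close], close + 2)
    return None
```

For the input `/* a /* b */ c */` the sketch stops at the first `*/`, so the comment text is ` a /* b ` and the remainder ` c */` is left for the following words.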
2.5 parentheses
The words Lexical.LeftParenthesis and Lexical.RightParenthesis, respectively, are defined by
/* #28 is also known as "LEFT PARENTHESIS" */
Lexical.LeftParenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
Lexical.RightParenthesis : #29
2.6 curly brackets
The words Lexical.LeftCurlyBracket and Lexical.RightCurlyBracket, respectively, are defined by
/* #7b is also known as "LEFT CURLY BRACKET" */
Lexical.LeftCurlyBracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
Lexical.RightCurlyBracket : #7d
2.7 colon
The word Lexical.Colon is defined by
/* #3a is also known as "COLON" */
Lexical.Colon : #3a
2.8 square brackets
The words Lexical.LeftSquareBracket and Lexical.RightSquareBracket, respectively, are defined by
/* #5b is also known as "LEFT SQUARE BRACKET" */
Lexical.LeftSquareBracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
Lexical.RightSquareBracket : #5d
2.9 comma
The word Lexical.Comma is defined by
/* #2c is also known as "COMMA" */
Lexical.Comma : #2c
2.10 name
The word Lexical.Name is defined by
Lexical.Name : {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}
/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
Lexical.NameSuffixCharacter : /* The Unicode characters from #41 to #5a and from #61 to #7a. */
/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
Lexical.NameSuffixCharacter : /* The Unicode characters from #30 to #39. */
/* #5f is also known as "LOW LINE" */
Lexical.NameSuffixCharacter : #5f
2.10 number literal
The word Lexical.Number is defined by
Lexical.Number : Lexical.IntegerNumber
Lexical.Number : Lexical.RealNumber
Lexical.IntegerNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit}
Lexical.RealNumber : [Lexical.Sign] Lexical.Period Lexical.Digit {Lexical.Digit} [Lexical.Exponent]
Lexical.RealNumber : [Lexical.Sign] Lexical.Digit {Lexical.Digit} [Lexical.Period {Lexical.Digit}] [Lexical.Exponent]
Lexical.Exponent : Lexical.ExponentPrefix [Lexical.Sign] Lexical.Digit {Lexical.Digit}
/* #2b is also known as "PLUS SIGN" */
Lexical.Sign : #2b
/* #2d is also known as "MINUS SIGN" */
Lexical.Sign : #2d
/* #65 is also known as "LATIN SMALL LETTER E" */
Lexical.ExponentPrefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
Lexical.ExponentPrefix : #45
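As an illustration, the number rules can be approximated with two regular expressions in Python. This is a sketch under assumptions: Lexical.Digit is taken to be the code points #30 to #39, and because a digit sequence such as 12 is derivable from both Lexical.IntegerNumber and Lexical.RealNumber, the sketch resolves that ambiguity in favour of the integer. The name `classify_number` is hypothetical.

```python
import re

# [Lexical.Sign] Lexical.Digit {Lexical.Digit}
INTEGER = re.compile(r"[+-]?[0-9]+")

# Both Lexical.RealNumber alternatives, each with an optional exponent.
REAL = re.compile(
    r"[+-]?"
    r"(?:\.[0-9]+"             # Lexical.Period Lexical.Digit {Lexical.Digit}
    r"|[0-9]+(?:\.[0-9]*)?)"   # Lexical.Digit {Lexical.Digit} [Lexical.Period {Lexical.Digit}]
    r"(?:[eE][+-]?[0-9]+)?"    # [Lexical.Exponent]
)

def classify_number(word):
    """Return 'integer', 'real', or None for a candidate word."""
    # Assumption: the integer reading takes priority over the real reading.
    if INTEGER.fullmatch(word):
        return "integer"
    if REAL.fullmatch(word):
        return "real"
    return None
```

Under these assumptions `42` and `+7` are integers, `-.5`, `1.` and `1.5E-3` are reals, and `.`, `e5` and `1e` are not numbers at all.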
2.11 string literal
The word Lexical.String is defined by
Lexical.String : Lexical.SingleQuotedString
Lexical.String : Lexical.DoubleQuotedString
Lexical.DoubleQuotedString : Lexical.DoubleQuote {Lexical.DoubleQuotedStringCharacter} Lexical.DoubleQuote
Lexical.DoubleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.DoubleQuote and characters in [0,1F]*/
Lexical.DoubleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.DoubleQuotedStringCharacter : #5c Lexical.DoubleQuote
/* #22 is also known as "QUOTATION MARK" */
Lexical.DoubleQuote : #22
Lexical.SingleQuotedString : Lexical.SingleQuote {Lexical.SingleQuotedStringCharacter} Lexical.SingleQuote
Lexical.SingleQuotedStringCharacter : /* any character except for Lexical.Newline and Lexical.SingleQuote and characters in [0,1F]*/
Lexical.SingleQuotedStringCharacter : Lexical.EscapeSequence
Lexical.SingleQuotedStringCharacter : #5c Lexical.SingleQuote
/* #27 is also known as "APOSTROPHE" */
Lexical.SingleQuote : #27
/* #5c is also known as "REVERSE SOLIDUS", #75 is also known as "LATIN SMALL LETTER U" */
Lexical.EscapeSequence : #5c 'u' Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit
/* #5c is also known as "REVERSE SOLIDUS" */
Lexical.EscapeSequence : #5c #5c
/* #62 is also known as "LATIN SMALL LETTER B" */
Lexical.EscapeSequence : #5c #62
/* #66 is also known as "LATIN SMALL LETTER F" */
Lexical.EscapeSequence : #5c #66
/* #6e is also known as "LATIN SMALL LETTER N" */
Lexical.EscapeSequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
Lexical.EscapeSequence : #5c #72
/* #74 is also known as "LATIN SMALL LETTER T" */
Lexical.EscapeSequence : #5c #74
In the lexical translation, several transformations are performed upon the word:
- Lexical.DoubleQuote {Lexical.DoubleQuotedStringCharacter} Lexical.DoubleQuote has the leading double quote and the trailing double quote removed.
- Lexical.SingleQuote {Lexical.SingleQuotedStringCharacter} Lexical.SingleQuote has the leading single quote and the trailing single quote removed.
- #5c #5c (double "REVERSE SOLIDUS") is replaced by #5c ("REVERSE SOLIDUS").
- #5c #62 ("REVERSE SOLIDUS" followed by "LATIN SMALL LETTER B") is replaced by #7 ("BELL").
- #5c #66 ("REVERSE SOLIDUS" followed by "LATIN SMALL LETTER F") is replaced by #c ("FORM FEED").
- #5c #6e ("REVERSE SOLIDUS" followed by "LATIN SMALL LETTER N") is replaced by #a ("LINE FEED").
- #5c #72 ("REVERSE SOLIDUS" followed by "LATIN SMALL LETTER R") is replaced by #d ("CARRIAGE RETURN").
- #5c #74 ("REVERSE SOLIDUS" followed by "LATIN SMALL LETTER T") is replaced by #9 ("CHARACTER TABULATION").
- #5c 'u' Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit Lexical.HexadecimalDigit is replaced by the corresponding Unicode code point. That Unicode code point is computed as follows: Let \(a\), \(b\), \(c\), and \(d\) denote the values of the first, second, third, and fourth hexadecimal digit from left to right following 'u'. Then the Unicode code point is given by \[ x := (a \cdot 16^3) + (b \cdot 16^2) + (c \cdot 16^1) + (d \cdot 16^0) \] If \(x\) is outside of the set of Unicode code points, then the lexical translation shall fail.
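The transformations listed above can be sketched in Python as follows. The names `decode_string_literal`, `SIMPLE_ESCAPES` and `HEX_DIGITS` are hypothetical. The sketch assumes that a reverse solidus followed by the enclosing quote yields that quote (the list above does not state this explicitly), that hexadecimal digits are 0-9, a-f and A-F, and that any other escape is an error. It does not check the string body for unescaped quotes or control characters.

```python
SIMPLE_ESCAPES = {
    "\\": "\\",    # #5c #5c -> #5c
    "b": "\x07",   # BELL, as stated in the list above
    "f": "\x0c",   # FORM FEED
    "n": "\n",     # LINE FEED
    "r": "\r",     # CARRIAGE RETURN
    "t": "\t",     # CHARACTER TABULATION
}
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_string_literal(literal):
    """Strip the enclosing quotes and replace escape sequences."""
    if len(literal) < 2 or literal[0] not in ('"', "'") or literal[-1] != literal[0]:
        raise ValueError("not a quoted string literal")
    quote = literal[0]
    body = literal[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling reverse solidus")
        esc = body[i + 1]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc == quote:
            # Assumption: an escaped quote translates to the quote itself.
            out.append(quote)
            i += 2
        elif esc == "u":
            digits = body[i + 2:i + 6]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("\\u must be followed by four hexadecimal digits")
            # int(digits, 16) computes a*16^3 + b*16^2 + c*16^1 + d*16^0.
            # Four hexadecimal digits yield at most #ffff.
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            raise ValueError(f"unknown escape sequence \\{esc}")
    return "".join(out)
```

For example, the word consisting of an apostrophe, `\u0041` and another apostrophe is translated into the single code point #41 ("LATIN CAPITAL LETTER A").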
2.12 boolean literal
The word Lexical.Boolean is defined by
Lexical.Boolean : Lexical.True
Lexical.Boolean : Lexical.False
Lexical.True : #74 #72 #75 #65
Lexical.False : #66 #61 #6c #73 #65
Remark: The word Lexical.Boolean is a so-called keyword.
It takes priority over the Lexical.Name.
2.13 void literal
The word Lexical.Void is defined by
Lexical.Void : #76 #6f #69 #64
Remark: The word Lexical.Void is a so-called keyword.
It takes priority over the Lexical.Name.
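The keyword priority described in the remarks for Lexical.Boolean and Lexical.Void can be sketched in Python as follows. The function name `classify_identifier` is hypothetical. The sketch assumes that Lexical.Underscore is #5f and that Lexical.Alphabetic covers #41 to #5a and #61 to #7a; neither is defined in this section.

```python
import re

# {Lexical.Underscore} Lexical.Alphabetic {Lexical.NameSuffixCharacter}
# Assumption: Underscore is "_" and Alphabetic is A-Z or a-z.
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")

KEYWORDS = {
    "true": "Lexical.Boolean",
    "false": "Lexical.Boolean",
    "void": "Lexical.Void",
}

def classify_identifier(word):
    """Classify a complete word as keyword, name, or neither."""
    # Keywords are checked first, so they take priority over Lexical.Name.
    if word in KEYWORDS:
        return KEYWORDS[word]
    if NAME.fullmatch(word):
        return "Lexical.Name"
    return None
```

Only a word that is exactly `true`, `false` or `void` is a keyword; a longer word such as `truest` is still classified as a Lexical.Name.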
2.14 decimal digit
The word Lexical.DecimalDigit is defined by
Lexical.DecimalDigit : /* A single Unicode character from the code point range U+0030 to U+0039. */
2.15 hexadecimal digit
The word Lexical.HexadecimalDigit is defined by
Lexical.HexadecimalDigit : /* A single Unicode character from the code point range U+0030 to U+0039, U+0061 to U+0066, U+0041 to U+0046 */
2.16 alphanumeric
The word Lexical.Alphanumeric is reserved for future use.
2.17 period
The word Lexical.Period is defined by
/* #2e is also known as "FULL STOP" */
Lexical.Period : #2e