An Mg program consists of one or more source files, known formally as compilation units. A compilation unit file is an ordered sequence of Unicode characters. Compilation units typically have a one-to-one correspondence with files in a file system, but this correspondence is not required. For maximal portability, it is recommended that files in a file system be encoded with the UTF-8 encoding.
Conceptually speaking, a program is compiled using four steps:
Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens. Lexical analysis evaluates and executes pre-processing directives.
Syntactic analysis, which translates the stream of tokens into an abstract syntax tree.
Semantic analysis, which resolves all symbols in the abstract syntax tree, type checks the structure, and generates a semantic graph.
Code generation, which generates instructions from the semantic graph for some target runtime, producing an image.
Further tools may link images and load them into a runtime.
This specification presents the syntax of the Mg programming language using two grammars. The lexical grammar defines how Unicode characters are combined to form line terminators, white space, comments, tokens, and pre-processing directives. The syntactic grammar defines how the tokens resulting from the lexical grammar are combined to form Mg programs.
The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type, and terminal symbols are shown in a fixed-width font.
The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal given as a sequence of non-terminal or terminal symbols. For example, the production:
IdentifierVerbatim:
    [ IdentifierVerbatimCharacters ]
defines an IdentifierVerbatim to consist of the token “[”, followed by IdentifierVerbatimCharacters, followed by the token “]”.
When there is more than one possible expansion of a non-terminal symbol, the alternatives are listed on separate lines. For example, the production:
DecimalDigits:
    DecimalDigit
    DecimalDigits DecimalDigit
defines DecimalDigits to either consist of a DecimalDigit or consist of DecimalDigits followed by a DecimalDigit. In other words, the definition is recursive and specifies that a decimal-digits list consists of one or more decimal digits.
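The recursive reading of this production can be sketched as a small recognizer. This is a minimal illustration in Python, not part of Mg; the function names are hypothetical, and DecimalDigit is taken to be the ASCII digits 0..9 as the lexical grammar defines it.

```python
def is_decimal_digit(ch: str) -> bool:
    # DecimalDigit: 0..9
    return len(ch) == 1 and "0" <= ch <= "9"


def is_decimal_digits(text: str) -> bool:
    # DecimalDigits:
    #     DecimalDigit
    #     DecimalDigits DecimalDigit
    if len(text) == 1:
        # First alternative: a single DecimalDigit.
        return is_decimal_digit(text)
    if len(text) > 1:
        # Second alternative: a shorter DecimalDigits followed by one DecimalDigit.
        return is_decimal_digits(text[:-1]) and is_decimal_digit(text[-1])
    # The empty string matches neither alternative.
    return False
```

Because neither alternative matches the empty string, the recognizer accepts exactly the strings of one or more decimal digits.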
A subscripted suffix “opt” is used to indicate an optional symbol. The production:
DecimalLiteral:
    IntegerLiteral . DecimalDigit DecimalDigitsopt
is shorthand for:
DecimalLiteral:
    IntegerLiteral . DecimalDigit
    IntegerLiteral . DecimalDigit DecimalDigits
and defines a DecimalLiteral to consist of an IntegerLiteral followed by a '.', a DecimalDigit, and optional DecimalDigits.
Alternatives are normally listed on separate lines, though in cases where there are many alternatives, the phrase “one of” may precede a list of expansions given on a single line. This is simply shorthand for listing each of the alternatives on a separate line. For example, the production:
Sign: one of
    + -
is shorthand for:
Sign:
    +
    -
Conversely, exclusions are designated with the phrase “none of”. For example, the production:
TextSimple: none of
    " \ NewLineCharacter
permits all characters except ‘"’, ‘\’, and new line characters.
The lexical grammar of Mg is presented in Section 8.3. The terminal symbols of the lexical grammar are the characters of the Unicode character set, and the lexical grammar specifies how characters are combined to form tokens, white space, and comments (Section 8.3.2).
Every source file in an Mg program must conform to the Input production of the lexical grammar.
The syntactic grammar of Mg is presented in the chapters that follow this chapter. The terminal symbols of the syntactic grammar are the tokens defined by the lexical grammar, and the syntactic grammar specifies how tokens are combined to form Mg programs.
Every source file in an Mg program must conform to the CompilationUnit production of the syntactic grammar.
The Input production defines the lexical structure of an Mg source file. Each source file in an Mg program must conform to this lexical grammar production.
Input:
    InputSectionopt
InputSection:
    InputSectionPart
    InputSection InputSectionPart
InputSectionPart:
    InputElementsopt NewLine
InputElements:
    InputElement
    InputElements InputElement
InputElement:
    Whitespace
    Comment
    Token
Four basic elements make up the lexical structure of an Mg source file: line terminators, white space, comments, and tokens. Of these basic elements, only tokens are significant in the syntactic grammar of an Mg program.
The lexical processing of an Mg source file consists of reducing the file into a sequence of tokens, which becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, but otherwise these lexical elements have no impact on the syntactic structure of an Mg program.
When several lexical grammar productions match a sequence of characters in a source file, the lexical processing always forms the longest possible lexical element. For example, the character sequence // is processed as the beginning of a single-line comment because that lexical element is longer than a single / token.
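The longest-match rule can be sketched as follows. This is an illustrative Python fragment rather than part of Mg: the candidate list is a deliberately tiny, hypothetical subset of lexical elements, and `longest_match` is an assumed helper name.

```python
from typing import Optional

# Hypothetical subset of lexical elements that can begin with "/".
CANDIDATES = ["/", "//", "/*"]


def longest_match(source: str, pos: int) -> Optional[str]:
    """Return the longest candidate that matches at pos, or None."""
    best = None
    for candidate in CANDIDATES:
        if source.startswith(candidate, pos):
            # Prefer the longer element when several candidates match.
            if best is None or len(candidate) > len(best):
                best = candidate
    return best
```

Under these assumptions, `longest_match("// note", 0)` selects `//` rather than the single `/` operator, while `longest_match("/ 2", 0)` selects `/`.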
Line terminators divide the characters of an Mg source file into lines.
NewLine:
    NewLineCharacter
    U+000D U+000A
NewLineCharacter:
    U+000A // Line Feed
    U+000D // Carriage Return
    U+0085 // Next Line
    U+2028 // Line Separator
    U+2029 // Paragraph Separator
For compatibility with source code editing tools that add end-of-file markers, and to enable a source file to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to every compilation unit:
If the last character of the source file is a Control-Z character (U+001A), this character is deleted.
A carriage-return character (U+000D) is added to the end of the source file if that source file is non-empty and if the last character of the source file is not a carriage return (U+000D), a line feed (U+000A), a line separator (U+2028), or a paragraph separator (U+2029).
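The two transformations can be sketched in a few lines of Python. This is an illustration only; `normalize_source` is a hypothetical name, and the sketch assumes that the "non-empty" test in the second step is evaluated after the first step has run, since the transformations are applied in order.

```python
CONTROL_Z = "\u001A"
CARRIAGE_RETURN = "\u000D"
# The terminators named by the second transformation.
ACCEPTED_ENDINGS = ("\u000D", "\u000A", "\u2028", "\u2029")


def normalize_source(text: str) -> str:
    # Step 1: delete a trailing Control-Z end-of-file marker.
    if text.endswith(CONTROL_Z):
        text = text[:-1]
    # Step 2: terminate the last line of a non-empty file with a carriage return.
    if text and not text.endswith(ACCEPTED_ENDINGS):
        text += CARRIAGE_RETURN
    return text
```

For example, `normalize_source("a\u001A")` yields `"a\r"`, and a file that already ends in a line feed is returned unchanged.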
Two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.
Comment:
    CommentDelimited
    CommentLine
CommentDelimited:
    /* CommentDelimitedContentsopt */
CommentDelimitedContent:
    * none of /
CommentDelimitedContents:
    CommentDelimitedContent
    CommentDelimitedContents CommentDelimitedContent
CommentLine:
    // CommentLineContentsopt
CommentLineContent: none of
    NewLineCharacter
CommentLineContents:
    CommentLineContent
    CommentLineContents CommentLineContent
Comments do not nest. The character sequences /* and */ have no special meaning within a // comment, and the character sequences // and /* have no special meaning within a delimited comment.
Comments are not processed within text literals.
The example
    // This defines a
    // Logical literal
    //
    syntax LogicalLiteral = "true" | "false" ;
shows three single-line comments.
The example
    /* This defines a
       Logical literal
    */
    syntax LogicalLiteral = "true" | "false" ;
includes one delimited comment.
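The two comment forms, and the rule that comments do not nest, can be approximated with a regular expression. This is a rough Python sketch, not the Mg lexer: `COMMENT` and `find_comments` are hypothetical names, and the sketch deliberately ignores text literals, so it is only meaningful on input where no text literal contains `//` or `/*`.

```python
import re

# Characters matched by the NewLineCharacter production.
NEWLINE_CHARS = "\u000A\u000D\u0085\u2028\u2029"

COMMENT = re.compile(
    r"/\*.*?\*/"                        # delimited: ends at the first */
    r"|//[^" + NEWLINE_CHARS + r"]*",   # single-line: runs to the end of the line
    re.DOTALL,
)


def find_comments(source: str) -> list:
    """Return the text of every comment, scanning left to right."""
    return [match.group(0) for match in COMMENT.finditer(source)]
```

Because the delimited form stops at the first `*/`, `find_comments("/* a /* b */ c */")` returns only `["/* a /* b */"]`; the inner `/*` has no special meaning and the trailing `c */` is not part of any comment.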
Whitespace is defined as any character with Unicode class Zs (which includes the space character) as well as the horizontal tab character, the vertical tab character, and the form feed character.
Whitespace:
    WhitespaceCharacters
WhitespaceCharacter:
    U+0009 // Horizontal Tab
    U+000B // Vertical Tab
    U+000C // Form Feed
    U+0020 // Space
    NewLineCharacter
WhitespaceCharacters:
    WhitespaceCharacter
    WhitespaceCharacters WhitespaceCharacter
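A character test for this definition might look like the following Python sketch. It is illustrative only: `is_whitespace_character` is a hypothetical name, and the sketch takes the union of the prose definition (Unicode class Zs plus tab, vertical tab, and form feed) and the production (which also lists NewLineCharacter).

```python
import unicodedata

# Characters listed explicitly by the WhitespaceCharacter production.
EXPLICIT_WHITESPACE = {"\u0009", "\u000B", "\u000C", "\u0020"}
# Characters matched by the NewLineCharacter production.
NEWLINE_CHARACTERS = {"\u000A", "\u000D", "\u0085", "\u2028", "\u2029"}


def is_whitespace_character(ch: str) -> bool:
    """Check a single character against the whitespace definition."""
    return (
        ch in EXPLICIT_WHITESPACE
        or ch in NEWLINE_CHARACTERS
        or unicodedata.category(ch) == "Zs"  # Unicode space separators
    )
```

Under this reading, the no-break space (U+00A0) and the ideographic space (U+3000) count as whitespace because both belong to class Zs.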
There are several kinds of tokens: identifiers, keywords, literals, operators, and punctuators. White space and comments are not tokens, though they act as separators for tokens.
Token:
    Identifier
    Keyword
    Literal
    OperatorOrPunctuator
A regular identifier begins with a letter or underscore, followed by any sequence of letters, underscores, dollar signs, or digits. An escaped identifier is enclosed in square brackets and contains any sequence of Text literal characters.
Identifier:
    IdentifierBegin IdentifierCharactersopt
    IdentifierVerbatim
IdentifierBegin:
    _
    Letter
IdentifierCharacter:
    IdentifierBegin
    $
    DecimalDigit
IdentifierCharacters:
    IdentifierCharacter
    IdentifierCharacters IdentifierCharacter
IdentifierVerbatim:
    [ IdentifierVerbatimCharacters ]
IdentifierVerbatimCharacter:
    none of ]
    IdentifierVerbatimEscape
IdentifierVerbatimCharacters:
    IdentifierVerbatimCharacter
    IdentifierVerbatimCharacters IdentifierVerbatimCharacter
IdentifierVerbatimEscape:
    \ ]
Letter:
    a..z
    A..Z
DecimalDigit:
    0..9
DecimalDigits:
    DecimalDigit
    DecimalDigits DecimalDigit
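The identifier productions map fairly directly onto regular expressions. The following Python sketch is an approximation, not the Mg lexer: the names are hypothetical, IdentifierBegin is taken to be a letter or underscore as the prose states, and the verbatim escape is assumed to be a backslash followed by `]`. It does not reject keywords.

```python
import re

# IdentifierBegin IdentifierCharactersopt
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_$0-9]*")
# [ IdentifierVerbatimCharacters ], where each character is either the
# assumed escape \] or any character other than ].
VERBATIM_IDENTIFIER = re.compile(r"\[(?:\\\]|[^\]])+\]")


def is_identifier(text: str) -> bool:
    """Check whether text matches either identifier form in full."""
    return bool(
        REGULAR_IDENTIFIER.fullmatch(text)
        or VERBATIM_IDENTIFIER.fullmatch(text)
    )
```

With these patterns, `_tmp$1` and `[any]` are accepted, while `1abc`, `$x`, and the empty brackets `[]` are rejected, the last because IdentifierVerbatimCharacters is not optional.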
A keyword is an identifier-like sequence of characters that is reserved and cannot be used as an identifier except when escaped with square brackets [].
Keyword: one of
    any empty error export false final id import interleave language labelof
    left module null precedence right syntax token true valuesof
The following keywords are reserved for future use:
checkpoint identifier nest override new virtual partial
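A lexer could separate keywords from ordinary identifiers with a simple set lookup, as in this Python sketch. The names `KEYWORDS`, `RESERVED_FOR_FUTURE_USE`, and `classify_word` are hypothetical, and the exact-string comparison assumes that keyword matching is case-sensitive, which this section does not state.

```python
KEYWORDS = frozenset(
    "any empty error export false final id import interleave language labelof "
    "left module null precedence right syntax token true valuesof".split()
)
RESERVED_FOR_FUTURE_USE = frozenset(
    "checkpoint identifier nest override new virtual partial".split()
)


def classify_word(word: str) -> str:
    """Classify an identifier-like character sequence."""
    if word in KEYWORDS:
        return "keyword"
    if word in RESERVED_FOR_FUTURE_USE:
        return "reserved"
    # Anything else, including a bracket-escaped form such as [syntax],
    # is treated as an identifier.
    return "identifier"
```

Here `classify_word("syntax")` returns `"keyword"`, whereas the escaped form `classify_word("[syntax]")` returns `"identifier"`.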
A literal is a source code representation of a value.
Literal:
    DecimalLiteral
    IntegerLiteral
    LogicalLiteral
    NullLiteral
    TextLiteral
Literals may be ascribed with a type to override the default type ascription.
Decimal literals are used to write real-number values.
DecimalLiteral:
    DecimalDigits . DecimalDigits
Examples of decimal literals follow:
    0.0
    12.3
    999999999999999.999999999999999
Integer literals are used to write integral values.
IntegerLiteral:
Examples of integer literals follow:
    0
    123
    999999999999999999999999999999
    -42
Logical literals are used to write logical values.
LogicalLiteral: one of true false
Examples of logical literals follow:
    true
    false
Mg supports two forms of Text literals: regular text literals and verbatim text literals. In certain contexts, text literals must be of length one (single characters). However, Mg does not distinguish syntactically between strings and characters.
A regular text literal consists of zero or more characters enclosed in single or double quotes, as in "hello" or 'hello', and may include both simple escape sequences (such as \t for the tab character) and hexadecimal and Unicode escape sequences.
A verbatim Text literal consists of a “commercial at” character (@) followed by a single- or double-quote character (' or "), zero or more characters, and a closing quote character that matches the opening one. A simple example is @"hello". In a verbatim text literal, the characters between the delimiters are interpreted exactly as they occur in the compilation unit, the only exception being a SingleQuoteEscapeSequence or a DoubleQuoteEscapeSequence, depending on the opening quote. In particular, simple escape sequences and hexadecimal and Unicode escape sequences are not processed in verbatim text literals. A verbatim text literal may span multiple lines.
A simple escape sequence represents a Unicode character encoding, as described in the following table.
Escape Sequence | Character Name | Unicode Encoding |
---|---|---|
\' | Single quote | U+0027 |
\" | Double quote | U+0022 |
\\ | Backslash | U+005C |