An M program consists of one or more source files, known formally as compilation units. A compilation unit file is an ordered sequence of Unicode characters. Compilation units typically have a one-to-one correspondence with files in a file system, but this correspondence is not required. For maximal portability, it is recommended that files in a file system be encoded with the UTF-8 encoding.
Conceptually speaking, a program is compiled using four steps:
Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens. Lexical analysis evaluates and executes pre-processing directives.
Syntactic analysis, which translates the stream of tokens into an abstract syntax tree.
Semantic analysis, which resolves all symbols in the abstract syntax tree, type checks the structure, and generates a semantic graph.
Code generation, which generates an image from the semantic graph. An image is a list of executable instructions for some target runtime, for example, SQL Server.
Further tools may link images and load them into a runtime.
This specification presents the syntax of the M programming language using two grammars. The lexical grammar defines how Unicode characters are combined to form line terminators, white space, comments, tokens, and pre-processing directives. The syntactic grammar defines how the tokens resulting from the lexical grammar are combined to form M programs.
The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type, and terminal symbols are shown in a fixed-width font.
The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal given as a sequence of non-terminal or terminal symbols. For example, the production:
IdentifierVerbatim:
    [ IdentifierVerbatimCharacters ]
defines an IdentifierVerbatim to consist of the token “[”, followed by IdentifierVerbatimCharacters, followed by the token “]”.
When there is more than one possible expansion of a non-terminal symbol, the alternatives are listed on separate lines. For example, the production:
DecimalDigits:
    DecimalDigit
    DecimalDigits DecimalDigit
defines DecimalDigits to consist either of a DecimalDigit or of DecimalDigits followed by a DecimalDigit. In other words, the definition is recursive and specifies that a decimal digits list consists of one or more decimal digits.
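The recursive DecimalDigits production can be read as "one or more digits". The following Python sketch is illustrative only and is not part of the M toolchain; the function name `match_decimal_digits` is hypothetical. It unrolls the left-recursive alternative into a loop.

```python
DECIMAL_DIGITS = "0123456789"  # DecimalDigit: 0..9


def match_decimal_digits(text: str, pos: int = 0) -> int:
    """Return the index just past the longest DecimalDigits match at pos, or -1."""
    # First alternative: a single DecimalDigit must be present.
    if pos >= len(text) or text[pos] not in DECIMAL_DIGITS:
        return -1
    # Second alternative (DecimalDigits DecimalDigit), unrolled as a loop:
    # keep consuming digits for as long as they continue.
    end = pos + 1
    while end < len(text) and text[end] in DECIMAL_DIGITS:
        end += 1
    return end
```

For example, `match_decimal_digits("123abc")` stops after the third character, and `match_decimal_digits("abc")` reports no match.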
A subscripted suffix “opt” is used to indicate an optional symbol. The production:
DecimalLiteral:
    IntegerLiteral . DecimalDigit DecimalDigitsopt
is shorthand for:
DecimalLiteral:
    IntegerLiteral . DecimalDigit
    IntegerLiteral . DecimalDigit DecimalDigits
and defines a DecimalLiteral to consist of an IntegerLiteral, followed by a “.”, followed by a DecimalDigit, optionally followed by DecimalDigits.
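As a rough illustration of the “opt” shorthand, the Python sketch below expands a production body into its explicit alternatives. It assumes the plain-text convention used here, in which an optional symbol is spelled with a trailing `opt`; the helper name `expand_optional` is hypothetical, and the sketch would misread any non-terminal whose real name happens to end in "opt".

```python
from itertools import product


def expand_optional(symbols):
    """Expand symbols ending in 'opt' into every present/absent combination."""
    choices = []
    for symbol in symbols:
        if symbol.endswith("opt"):
            # An optional symbol is either omitted (None) or present
            # under its name without the suffix.
            choices.append([None, symbol[:-3]])
        else:
            choices.append([symbol])
    # Each combination is one explicit alternative of the production.
    return [[s for s in combo if s is not None] for combo in product(*choices)]
```

Calling `expand_optional(["IntegerLiteral", ".", "DecimalDigit", "DecimalDigitsopt"])` yields the two alternatives of the longhand DecimalLiteral production, the shorter one first.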
Alternatives are normally listed on separate lines, though in cases where there are many alternatives, the phrase “one of” may precede a list of expansions given on a single line. This is simply shorthand for listing each of the alternatives on a separate line. For example, the production:
Sign: one of
    + -
is shorthand for:
Sign:
    +
    -
Conversely, exclusions are designated with the phrase “none of.” For example, the production:
TextSimple: none of
    " \ NewLineCharacter
permits all characters except ‘"’, ‘\’, and new line characters.
The lexical grammar of M is presented in 2.3. The terminal symbols of the lexical grammar are the characters of the Unicode character set, and the lexical grammar specifies how characters are combined to form tokens, white space, and comments (Section 2.3.2).
Every source file in an M program must conform to the Input production of the lexical grammar.
The syntactic grammar of M is presented in the chapters that follow this chapter. The terminal symbols of the syntactic grammar are the tokens defined by the lexical grammar, and the syntactic grammar specifies how tokens are combined to form M programs.
Every source file in an M program must conform to the CompilationUnit production of the syntactic grammar.
The Input production defines the lexical structure of an M source file. Each source file in an M program must conform to this lexical grammar production.
Input:
    InputSectionopt
InputSection:
    InputSectionPart
    InputSection InputSectionPart
InputSectionPart:
    InputElementsopt NewLine
InputElements:
    InputElement
    InputElements InputElement
InputElement:
    Whitespace
    Comment
    Token
Four basic elements make up the lexical structure of an M source file: line terminators, white space, comments, and tokens. Of these basic elements, only tokens are significant in the syntactic grammar of an M program.
The lexical processing of an M source file consists of reducing the file into a sequence of tokens that becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, but otherwise these lexical elements have no impact on the syntactic structure of an M program.
When several lexical grammar productions match a sequence of characters in a source file, the lexical processing always forms the longest possible lexical element. For example, the character sequence // is processed as the beginning of a single-line comment because that lexical element is longer than a single / token.
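The longest-match rule can be sketched in a few lines of Python. The candidate list below is a hypothetical three-element subset chosen for illustration, not the full set of M lexical elements.

```python
# Hypothetical subset of lexical element prefixes, for illustration only.
CANDIDATES = ["/", "//", "/*"]


def longest_match(text: str, pos: int):
    """Return the longest candidate that matches text at pos, or None."""
    matches = [c for c in CANDIDATES if text.startswith(c, pos)]
    # When several candidates match, always prefer the longest one.
    return max(matches, key=len) if matches else None
```

With this rule, `longest_match("// note", 0)` selects `//` rather than `/`, while `longest_match("/ 2", 0)` falls back to the single `/`.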
Line terminators divide the characters of an M source file into lines.
NewLine:
    NewLineCharacter
    U+000D U+000A
NewLineCharacter:
    U+000A // Line Feed
    U+000D // Carriage Return
    U+0085 // Next Line
    U+2028 // Line Separator
    U+2029 // Paragraph Separator
For compatibility with source code editing tools that add end-of-file markers, and to enable a source file to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to every compilation unit:
If the last character of the source file is a Control-Z character (U+001A), this character is deleted.
A carriage-return character (U+000D) is added to the end of the source file if that source file is nonempty and if the last character of the source file is not a carriage return (U+000D), a line feed (U+000A), a line separator (U+2028), or a paragraph separator (U+2029).
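The two transformations above can be sketched as a small Python function. This is a minimal illustration of the rules as stated, not the actual compiler code; the name `normalize_source` is hypothetical.

```python
def normalize_source(source: str) -> str:
    """Apply the two end-of-file transformations, in order."""
    # 1. Delete a trailing Control-Z character (U+001A).
    if source.endswith("\u001a"):
        source = source[:-1]
    # 2. Append a carriage return (U+000D) if the file is nonempty and does
    #    not already end in CR, LF, line separator, or paragraph separator.
    if source and source[-1] not in "\u000d\u000a\u2028\u2029":
        source += "\u000d"
    return source
```

Because the steps run in order, a file consisting only of Control-Z becomes empty and receives no carriage return.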
Two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.
Comment:
    CommentDelimited
    CommentLine
CommentDelimited:
    /* CommentDelimitedContentsopt */
CommentDelimitedContent:
    * none of /
CommentDelimitedContents:
    CommentDelimitedContent
    CommentDelimitedContents CommentDelimitedContent
CommentLine:
    // CommentLineContentsopt
CommentLineContent:
    none of NewLineCharacter
CommentLineContents:
    CommentLineContent
    CommentLineContents CommentLineContent
Comments do not nest. The character sequences /* and */ have no special meaning within a // comment, and the character sequences // and /* have no special meaning within a delimited comment.
Comments are not processed within Text literals.
The example
// This defines a
// Person entity
//
type Person = { Name : Text; Age : Number; }
shows three single-line comments.
The example
/* This defines a Person entity */
type Person = { Name : Text; Age : Number; }
includes one delimited comment.
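The non-nesting rule is easy to see in a scanner. The following Python sketch, with the hypothetical name `scan_comment`, finds the end of a comment that starts at a given position. It deliberately ignores Text literals, so it is only a partial illustration of the rules in this section.

```python
# NewLineCharacter set from the line terminator grammar.
NEWLINE_CHARACTERS = "\u000a\u000d\u0085\u2028\u2029"


def scan_comment(text: str, pos: int) -> int:
    """Return the index just past the comment starting at pos, or -1."""
    if text.startswith("//", pos):
        # A single-line comment extends to the end of the line; any "/*"
        # or "*/" inside it is ordinary comment text.
        end = pos + 2
        while end < len(text) and text[end] not in NEWLINE_CHARACTERS:
            end += 1
        return end
    if text.startswith("/*", pos):
        # Comments do not nest: the first "*/" closes the comment,
        # regardless of any "/*" or "//" seen in between.
        close = text.find("*/", pos + 2)
        return close + 2 if close != -1 else -1
    return -1
```

For instance, scanning `/* a /* b */ c */` from position 0 ends right after the first `*/`, leaving ` c */` outside the comment.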
Whitespace is defined as any character with Unicode class Zs (which includes the space character) as well as the horizontal tab character, the vertical tab character, and the form feed character.
Whitespace:
    WhitespaceCharacters
WhitespaceCharacter:
    U+0009 // Horizontal Tab
    U+000B // Vertical Tab
    U+000C // Form Feed
    U+0020 // Space
    NewLineCharacter
WhitespaceCharacters:
    WhitespaceCharacter
    WhitespaceCharacters WhitespaceCharacter
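A character test that follows the prose definition of white space might look like the Python sketch below. It relies on the standard `unicodedata` module for the Zs class; the function name `is_whitespace` is hypothetical. Note that the grammar production additionally lists NewLineCharacter, which this sketch leaves out.

```python
import unicodedata


def is_whitespace(ch: str) -> bool:
    """True for a Unicode class Zs character, or HT, VT, or FF."""
    # Horizontal tab, vertical tab, and form feed are named explicitly;
    # the space character (U+0020) is already covered by class Zs.
    return ch in "\u0009\u000b\u000c" or unicodedata.category(ch) == "Zs"
```

Under this definition a no-break space (U+00A0) counts as white space, while a line feed does not.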
There are several kinds of tokens: identifiers, keywords, literals, operators, and punctuators. White space and comments are not tokens, though they act as separators for tokens.
Token:
    Identifier
    Keyword
    Literal
    OperatorOrPunctuator
A regular identifier begins with a letter or underscore, followed by any sequence of letters, underscores, dollar signs, or digits. An escaped identifier is enclosed in square brackets and contains any sequence of Text literal characters.
Identifier:
    IdentifierBegin IdentifierCharactersopt
    IdentifierVerbatim
IdentifierBegin:
    _
    Letter
IdentifierCharacter:
    IdentifierBegin
    $
    DecimalDigit
IdentifierCharacters:
    IdentifierCharacter
    IdentifierCharacters IdentifierCharacter
IdentifierVerbatim:
    [ IdentifierVerbatimCharacters ]
IdentifierVerbatimCharacter:
    none of ]
    IdentifierVerbatimEscape
IdentifierVerbatimCharacters:
    IdentifierVerbatimCharacter
    IdentifierVerbatimCharacters IdentifierVerbatimCharacter
IdentifierVerbatimEscape:
    \ ]
Letter:
    a..z
    A..Z
DecimalDigit:
    0..9
DecimalDigits:
    DecimalDigit
    DecimalDigits DecimalDigit
A keyword is an identifier-like sequence of characters that is reserved and cannot be used as an identifier except when escaped with square brackets [].
Keyword:
accumulate
by
equals
export
from
group
identity
import
in
into
item
join
let
module
null
select
this
type
unique
value
where
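To make the interplay between regular identifiers and keywords concrete, here is a small Python sketch. The regular expression mirrors the Identifier production for regular identifiers only; escaped identifiers in square brackets are not handled. The names `KEYWORDS`, `REGULAR_IDENTIFIER`, and `classify` are hypothetical.

```python
import re

# The reserved words listed in the Keyword production.
KEYWORDS = {
    "accumulate", "by", "equals", "export", "from", "group", "identity",
    "import", "in", "into", "item", "join", "let", "module", "null",
    "select", "this", "type", "unique", "value", "where",
}

# IdentifierBegin is a letter or underscore; later characters may also be
# a dollar sign or a decimal digit.
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def classify(word: str) -> str:
    """Classify word as 'keyword', 'identifier', or 'invalid'."""
    if REGULAR_IDENTIFIER.fullmatch(word) is None:
        return "invalid"
    # A keyword has the shape of an identifier but is reserved.
    return "keyword" if word in KEYWORDS else "identifier"
```

So `classify("type")` reports a keyword, `classify("Person")` an identifier, and `classify("$x")` is rejected because a dollar sign cannot begin an identifier.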
A literal is a source code representation of a value.
Literal:
    DecimalLiteral
    IntegerLiteral
    ScientificLiteral
    DateTimeLiteral
    TimeLiteral
    CharacterLiteral
    TextLiteral
    BinaryLiteral
    GuidLiteral
    LogicalLiteral
    NullLiteral
Literals may be ascribed with a type to override the default type ascription.
Decimal literals are used to write fixed-point or exact number values.
DecimalLiteral: IntegerLiteral . DecimalDigit DecimalDigitsopt
Decimal literals default to the smallest standard library type that can contain the value. Examples of decimal literals follow:
99.999 0.1 1.0
Integer literals are used to write integral values.
IntegerLiteral:
    DecimalDigits
Integer literals default to the smallest precision type that can contain the value, starting with Integer32.
Examples of integer literals follow:
0 123 999999999999999999999999999999
Scientific literals are used to write floating-point or inexact number values.
ScientificLiteral:
    DecimalLiteral e Signopt DecimalDigit DecimalDigitsopt
    DecimalLiteral E Signopt DecimalDigit DecimalDigitsopt
Sign: one of
    + -
Scientific literals default to the smallest precision type that can contain the value, starting with Double.
Examples of scientific literals follow:
0.31416e+1 9.9999e-1 0.0E0
Date literals are used to write a date independent of a specific time of day.
DateLiteral: Signopt DateYear - DateMonth - DateDay
The tokens of a DateLiteral must not be separated by white space.
DateDay: one of
    01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
DateMonth: one of
    01 02 03 04 05 06 07 08 09 10 11 12
DateYear:
    DecimalDigit DecimalDigit DecimalDigit DecimalDigit
The type of a DateLiteral is Date.
0001-01-01 is the representation of January 1st, 1 AD.
There is no year 0; therefore, ‘0000’ is not a valid year.
-0001-01-01 is the representation of January 1st, 1 BC.
Examples of date literals follow:
0001-01-01 2008-08-14 -1184-03-01
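The DateLiteral grammar and the "no year 0" rule can be approximated with a regular expression. The Python sketch below is a syntactic check only: like the grammar, it accepts any day from 01 to 31 for every month. The names `DATE_LITERAL` and `parse_date_literal` are hypothetical.

```python
import re

# Signopt DateYear - DateMonth - DateDay, with no white space between tokens.
DATE_LITERAL = re.compile(
    r"([+-]?)([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
)


def parse_date_literal(text: str):
    """Return (year, month, day) for a valid DateLiteral, else None."""
    match = DATE_LITERAL.fullmatch(text)
    if match is None:
        return None
    sign, year, month, day = match.groups()
    if int(year) == 0:
        return None  # there is no year 0
    # A leading minus sign denotes a BC year.
    signed_year = -int(year) if sign == "-" else int(year)
    return (signed_year, int(month), int(day))
```

With this sketch, `parse_date_literal("-1184-03-01")` produces a negative year, and `parse_date_literal("0000-01-01")` is rejected.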
DateTime literals are used to write a time of day on a specific date independent of time zone.
DateTimeLiteral: DateLiteral T TimeLiteral
The type of a DateTime literal is DateTime.
Examples of date time literals follow:
2008-08-14T13:13:00 0001-01-01T00:00:00 2005-05-19T20:05:00
TimeLiteral:
    TimeHourMinute : TimeSecond
TimeHourMinute:
    TimeHour : TimeMinute
TimeHour: one of
    00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
TimeMinute:
    0 DecimalDigit
    1 DecimalDigit
    2 DecimalDigit
    3 DecimalDigit
    4 DecimalDigit
    5 DecimalDigit
TimeSecond:
    0 DecimalDigit TimeSecondDecimalPartopt
    1 DecimalDigit TimeSecondDecimalPartopt
    2 DecimalDigit TimeSecondDecimalPartopt
    3 DecimalDigit TimeSecondDecimalPartopt
    4 DecimalDigit TimeSecondDecimalPartopt
    5 DecimalDigit TimeSecondDecimalPartopt
    60 TimeSecondDecimalPartopt
TimeSecondDecimalPart:
    . DecimalDigits
Examples of time literals follow:
11:30:00 01:01:01.111 13:13:00
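The TimeLiteral grammar translates fairly directly into a regular expression. The Python sketch below assumes nothing beyond the productions shown; the names `TIME_LITERAL` and `is_time_literal` are hypothetical.

```python
import re

# TimeHour : TimeMinute : TimeSecond, where the seconds field may be 00-59
# or the special value 60, each with an optional decimal part.
TIME_LITERAL = re.compile(
    r"([01][0-9]|2[0-3])"      # TimeHour: 00..23
    r":[0-5][0-9]"             # TimeMinute: 00..59
    r":([0-5][0-9]|60)"        # TimeSecond: 00..59 or 60
    r"(\.[0-9]+)?"             # TimeSecondDecimalPart (optional)
)


def is_time_literal(text: str) -> bool:
    """True if text matches the TimeLiteral grammar in full."""
    return TIME_LITERAL.fullmatch(text) is not None
```

The grammar admits a seconds value of 60 but not a minutes value of 60, so `23:59:60` is accepted while `12:60:00` is not.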
A character literal represents a single character, for example 'a'.
CharacterLiteral:
    ' Character '
Character:
    CharacterSimple
    CharacterEscapeHex
    CharacterEscapeSimple
    CharacterEscapeUnicode
Characters:
    Character
    Characters Character
CharacterEscapeHex:
    CharacterEscapeHexPrefix HexDigit
    CharacterEscapeHexPrefix HexDigit HexDigit
    CharacterEscapeHexPrefix HexDigit HexDigit HexDigit
    CharacterEscapeHexPrefix HexDigit HexDigit HexDigit HexDigit
CharacterEscapeHexPrefix: one of
    \x \X
CharacterEscapeSimple:
    \ CharacterEscapeSimpleCharacter
CharacterEscapeSimpleCharacter: one of
    ' " \ 0 a b f n r t v
CharacterEscapeUnicode:
    \u HexDigit HexDigit HexDigit HexDigit
    \U HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
CharacterSimple: none of
    U+0027 // Single Quote
    U+005C // Backslash
    NewLineCharacter
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following the prefix.
If the value represented by a character literal is greater than U+FFFF, a compile-time error occurs.
A Unicode character escape sequence in a character literal must be in the range U+0000 to U+FFFF.
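As an illustration of how a hexadecimal escape maps to a character, the Python sketch below decodes one. It assumes, as in C-family languages, that the hexadecimal prefix is written with a leading backslash (`\x` or `\X`); the function name `decode_hex_escape` is hypothetical. Because the grammar allows at most four hexadecimal digits, the decoded value can never exceed U+FFFF.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_hex_escape(escape: str) -> str:
    """Decode an escape such as \\x41 into the character it represents."""
    prefix, digits = escape[:2], escape[2:]
    # Assumption: the prefix is a backslash followed by x or X.
    if prefix not in ("\\x", "\\X"):
        raise ValueError("not a hexadecimal escape")
    # The grammar permits between one and four hexadecimal digits.
    if not 1 <= len(digits) <= 4 or any(d not in HEX_DIGITS for d in digits):
        raise ValueError("expected one to four hexadecimal digits")
    # The value is the hexadecimal number following the prefix.
    return chr(int(digits, 16))
```

For example, decoding the six-character string `\x0041` gives the letter A, and a five-digit escape is rejected.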
A simple escape sequence represents a Unicode character encoding, as described in the following table.
Escape Sequence | Character Name | Unicode Encoding |
---|---|---|
`\'` | Single quote | U+0027 |
`\"` | Double quote | U+0022 |
`\\` | Backslash | U+005C |