Parsing Basics¶
Tokens¶
Currently, a Wig program must be contained in a single source file. The bytes of this file are interpreted as a UTF-8 string.
The first step in processing a source file is to separate it into tokens.
These are the building blocks for the grammar below.
In the grammar, the tokens with fixed text are written enclosed in single quotes.
For example, the token ( is written '(' in the grammar.
There is one exception; for clarity, ' is written "'" in the grammar.
To form the token starting at a given position in the source,
Wig usually uses the longest possible token.
For example, process is the identifier process,
not the keyword proc followed by the identifier ess.
Comments and string literals are exceptions;
for these, Wig uses the shortest possible token.
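The longest-token rule can be sketched in Python as follows. This is only an illustration, not Wig's actual lexer: the function name next_token, the token tables, and the restriction of identifiers to ASCII letters are all assumptions, and the tables hold only the keyword and fixed tokens this document lists so far.

```python
import re

# Hypothetical tables: only the tokens named in this document so far.
KEYWORDS = {"proc"}
FIXED_TOKENS = ["(", ")", "'", "\n"]

# Assumption: "letters" is narrowed to ASCII letters for this sketch.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def next_token(source, pos):
    """Return (kind, text) for the longest token starting at pos."""
    match = IDENT.match(source, pos)
    if match:
        # The regex is greedy, so the whole word is consumed before it is
        # compared against the keyword table.
        text = match.group()
        kind = "keyword" if text in KEYWORDS else "identifier"
        return kind, text
    # Try longer fixed tokens first, so that multi-character tokens
    # would win over their own prefixes once they are added.
    for tok in sorted(FIXED_TOKENS, key=len, reverse=True):
        if source.startswith(tok, pos):
            return "fixed", tok
    raise ValueError(f"no token at position {pos}")
```

Because the whole word is matched before the keyword lookup, next_token("process", 0) classifies process as an identifier, while next_token("proc ess", 0) yields the keyword proc.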
Identifiers and keywords¶
Identifiers contain letters, digits, and underscores. The first character may not be a digit.
The following keywords meet the criteria for identifiers, but are considered distinct tokens, not identifiers:
proc
TODO
Add more.
Special tokens¶
The following short sequences are considered to be tokens:
(
)
'
\n (the newline character)
TODO
Add more. Including two and three character sequences.
Whitespace¶
Whitespace consists of characters defined by Unicode to be whitespace, except the newline character. Whitespace is ignored except as it serves to separate tokens and indicate indentation.
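As a rough illustration, the whitespace test could be written in Python as below. The function name is hypothetical, and Python's str.isspace() is used only as an approximation of the Unicode whitespace definition; the two are close but not guaranteed to agree on every character.

```python
def is_wig_whitespace(ch):
    """Approximate test for a whitespace character that is not a newline."""
    # str.isspace() stands in for Unicode's whitespace definition here;
    # the newline character is excluded because it is a token of its own.
    return ch != "\n" and ch.isspace()
```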
Indentation¶
The structure of a Wig program is usually indicated by indentation.
In determining indentation, lines containing only whitespace and comments are ignored.
Consecutive indented lines must start with exactly the same whitespace, except that one may have additional whitespace after that common prefix. When a line with longer whitespace follows a line with shorter whitespace, an indent token is generated; it follows the newline.
When the following line has shorter whitespace, the appropriate number of dedent tokens are generated. These dedents will precede the newline before the following line.
Indentation whitespace should contain only spaces and tabs.
At the end of the file, dedent tokens will be generated if needed to match any outstanding unmatched indent tokens.
For example, this code:
if ok
    if done
        return
work()
generates this sequence of tokens:
'if'
'ok'
'\n'
indent
'if'
'done'
'\n'
indent
'return'
dedent
dedent
'\n'
'work'
'('
')'
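The indentation rules above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not Wig's actual lexer: the name tokenize is hypothetical, identifiers are limited to ASCII, everything from # to the end of a line is treated as a comment (block comments and string literals are not handled), and newline tokens are emitted only between lines that contain code.

```python
import re

# Assumption: only ASCII identifiers and the fixed tokens ( ) ' are recognized.
WORD_OR_SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[()']")

# A real lexer would use distinct token types; plain strings keep this short.
INDENT, DEDENT, NEWLINE = "indent", "dedent", "\n"


def tokenize(source):
    """Return the token list for source, including indent and dedent tokens."""
    tokens = []
    stack = [""]  # leading whitespace of each open indentation level
    first = True
    for line in source.split("\n"):
        body = line.lstrip(" \t")
        code = body.split("#", 1)[0]  # simplification: '#' starts a line comment
        if code.strip() == "":
            continue  # blank and comment-only lines do not affect indentation
        lead = line[: len(line) - len(body)]
        dedents = 0
        if lead.startswith(stack[-1]):
            indented = lead != stack[-1]
        else:
            indented = False
            # Close levels until one matches this line's whitespace exactly.
            while stack[-1] != lead:
                if not stack[-1].startswith(lead):
                    raise SyntaxError("inconsistent indentation")
                stack.pop()
                dedents += 1
        tokens.extend([DEDENT] * dedents)  # dedents precede the newline
        if not first:
            tokens.append(NEWLINE)
        if indented:
            tokens.append(INDENT)  # the indent follows the newline
            stack.append(lead)
        tokens.extend(WORD_OR_SYMBOL.findall(code))
        first = False
    # At end of file, close every indentation level that is still open.
    tokens.extend([DEDENT] * (len(stack) - 1))
    return tokens
```

Calling tokenize("if ok\n    if done\n        return\nwork()") reproduces the token sequence listed above. The stack holds the exact whitespace string of each open level rather than a column count, which is what lets the sketch reject a line whose whitespace is not a prefix of the enclosing level's.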
Comments¶
Comments begin with the character #. There are two kinds.
In a line comment, the opening # is followed by any character other than an opening square bracket [. This type of comment extends to the end of its line.
In a block comment, the opening # is followed immediately by one or more opening square brackets [. The comment extends until the next # preceded by at least as many closing square brackets ].
If a block comment contains a newline character, then it may not be followed by actual code on the line of the closing ]#: