Subject: How to syntax-colour Inform
Date: Wed, 17 Dec 1997 18:46:34 +0000 (GMT)
From: Graham Nelson <
[email protected]>
Newsgroups: rec.arts.int-fiction
[This is going to be a new section in the Inform Technical Manual,
which seems as good a place to keep it as any, but in the mean time
it's been requested several times on the newsgroup, hence this
posting. Comments welcome -- GN.]
How to syntax-colour Inform source code
---------------------------------------
"Syntax colouring" is an automatic process which some text editors apply
to the text being edited: the characters are displayed just as they are,
but with artificial colours added according to what the text editor thinks
they mean. The editor is in the position of someone going through a book
colouring all the verbs in red and all the nouns in green: it can only do
so if it understands how to tell a verb or a noun from other words.
Many good text editors have been programmed to syntax colour for languages
such as C, and a few will allow users to reprogram them to other languages.
One such is the popular Acorn RISC OS text editor "Zap", for which the
author has written an extension mode called "ZapInform". ZapInform
contributes colouring rules for the Inform language and as over a dozen
people have now asked me how it works, while the original is written
in ARM assembly (a language rather less widely spoken than Middle Egyptian)
it seems worth documenting the main algorithm.
(ZapInform does a number of other useful things, including pasting in
template objects and rooms when commanded from a mouse-accessed menu:
for instance, you can create a simple game with two or three mouse
clicks and a few object names typed in to a dialogue box, then click
to save and compile the result. See the ZapInform manual for details.)
(a) State values
ZapInform associates a 32-bit number called the "state" with every
character position.
The "state" is as follows. 11 of the upper 16 bits hold flags, the
rest being unused:
32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
comment
single-quoted text
double-quoted text
statement
after marker
highlight flag
highlight all flag
colour backtrack
after-restart-flag
wait-direct (waiting for a directive)
dont-know-flag
These flags make up the "outer state" while the lower 16 bits holds
a number pompously called the "inner state":
0 after WS (WS = white space or start of line or comma)
1 after WS then "-"
2 after WS then "-" and ">" [terminal]
3 after WS then "*" [terminal]
0xFF after junk
0x100*N + S after WS then an Inform identifier N+1 characters long
itself in state S:
101 w 202 wi 303 wit 404 with
111 h 212 ha 313 has
121 c 222 cl 323 cla 424 clas 525 class
same + 0x8000 when complete [terminal]
In practice it would be madness to try to actually store the state
of every character position in memory (it would occupy four times as
much space as the file itself). Instead, ZapInform caches just one
state value, the one most recently calculated, and uses a process
called "scanning" to determine new states. That is, given that we
know the state at character X and want to know the state at character
Y, we can find out by scanning each character between X and Y,
altering the state according to each one.
It might possible save some time to cache more state values than
this (say, the state values at the start of every screen-visible
line of text, or some such) but the complexity of doing this doesn't
seem worthwhile on my implementation. Scanning is a quick process
because the Zap text editor stores the entire file in almost contiguous
memory, easy to run through, and the state value can be kept in a
single CPU register while this is done.
(b) Scanning text
Let us number the characters in a file 1, 2, 3, ...
The state before character 1 is always 0x02000000: that is, inner
state zero and outer state with only the waiting-for-directive flag set.
(One can think of this as the state of an imaginary "character 0".)
The state at character N+1 is then a function of the state at
character N and what character is actually there. Thus,
State(0) = 0x02000000
and for all N >= 0,
State(N+1) = Scanning_function(State(N), Character(N+1))
And here is what the scanning function does:
1. Is the comment bit set?
Is the character a new-line?
If so, clear the comment bit.
Stop.
2. Is the double-quote bit set?
Is the character a double-quote?
If so, clear the double-quote bit.
Stop.
3. Is the single-quote bit set?
Is the character a single-quote?
If so, clear the single-quote bit.
Stop.
4. Is the character a single quote?
If so, set the single-quote bit and stop.
5. Is the character a double quote?
If so, set the double-quote bit and stop.
6. Is the character an exclamation mark?
If so, set the comment bit and stop.
7. Is the statement bit set?
If so:
Is the character "]"?
If so:
Clear the statement bit.
Stop.
If the after-restart bit is clear, stop.
Run the inner finite state machine.
If it results in a keyword terminal (that is, a terminal
which has inner state 0x100 or above):
Set colour-backtrack (and record the backtrack colour
as "function" colour).
Clear after-restart.
Stop.
If not:
Is the character "["?
If so:
Set the statement bit.
If the after-marker bit is set, set after-restart.
Stop.
Run the inner finite state machine.
If it results in a terminal:
Is the inner state 2 [after "->"] or 3 [after "*"]?
If so:
Set after-marker.
Set colour-backtrack (and record the backtrack
colour as "directive" colour).
Zero the inner state.
[If not, the terminal must be from a keyword.]
Is the inner state 0x404 [after "with"]?
If so:
Set colour-backtrack (and record the backtrack
colour as "directive" colour).
Set after-marker.
Set highlight.
Clear highlight-all.
Is the inner state 0x313 ["has"] or 0x525 ["class"]?
If so:
Set colour-backtrack (and record the backtrack
colour as "directive" colour).
Set after-marker.
Clear highlight.
Set highlight-all.
If the inner state isn't one of these: [so that recent
text has formed some alphanumeric token which might or
might not be a reserved word of some kind]
If waiting for directive is set:
Set colour-backtrack (and record the backtrack
colour as "directive" colour)
If not, but highlight-all is set:
Set colour-backtrack (and record the backtrack
colour as "property" colour)
If not, but highlight is set:
Clear highlight.
Set colour-backtrack (and record the backtrack
colour as "property" colour).
Is the character ";"?
If so:
Set wait-direct.
Clear after-marker.
Clear after-restart.
Clear highlight.
Clear highlight-all.
Is the character ","?
If so:
Set after-marker.
Set highlight.
Stop.
The "inner finite state machine" adjusts only the inner state, and
always preserves the outer state. It not only changes an old inner
state to a new inner state, but sometimes returns a "terminal" flag
to signal that something interesting has been found.
State Condition Go to state Return terminal-flag?
0 if "-" 1
if "*" 3 yes
if space, "#",
newline 0
if "_" 0x100
if "w" 0x101
if "h" 0x111
if "c" 0x121
other letters 0x100
otherwise 0xFF
1 if ">" 2 yes
otherwise 0xFF
2 always 0
3 always 0
0xFF if space,
newline 0
otherwise 0xFF
all 0x100+ states:
if not alphanumeric, add
0x8000 to the state yes
then for the following states:
0x101 if "i" 0x202
otherwise 0x200
0x202 if "t" 0x303
otherwise 0x300
0x303 if "h" 0x404
otherwise 0x400
0x111 if "a" 0x212
otherwise 0x200
0x212 if "s" 0x313
otherwise 0x300
0x121 if "l" 0x222
otherwise 0x200
0x222 if "a" 0x323
otherwise 0x300
0x323 if "s" 0x424
otherwise 0x400
0x424 if "s" 0x525
otherwise 0x500
but for all other 0x100+ states:
if alphanumeric, add
0x100 to the state
0x8000+ always 0
(Note that if your text editor stores tabs as characters in their own
right (usually 0x09) rather than rows of spaces, tab should be included
with space and newline in the above.)
Briefly, the finite state machine can be left running until it returns
a terminal, which means it has found "->", "*" or a completed Inform
identifier: and it detects "with", "has" and "class" as special keywords
amongst these identifiers.
(c) Initial colouring
ZapInform colours one line of visible text at a time. For instance, it
might be faced with this:
Object -> bottle "~Heinz~ bottle"
And it outputs an array of colours for each character position in the
line, which the text editor can then use in actually displaying the text.
It works out the state before the first character of the line (the "O"),
then scans through the line. For each character, it determines the
initial colour as a function of the state at that character:
If single-quote or double-quote is set, then quoted text colour.
If comment is set, then comment colour.
If statement is set:
Use code colour
unless the character is "[" or "]", in which case use
function colour,
or is a single or double quote, in which case use quoted text
colour.
If not:
Use foreground colour
unless the character is "," or ";" or "*" or ">", in which
case use directive colour,
or the character is "[" or "]", in which case use
function colour,
or is a single or double quote, in which case use quoted text
colour.
However, the scanning algorithm sometimes signals that a block of
text must be "backtracked" through and recoloured. For instance,
this happens if the white space after the sequence "c", "l", "a",
"s" and "s" is detected when in a context where the keyword "class"
is legal. The scanning algorithm does this by setting the "colour
backtrack" bit in the outer state. Note that the number of characters
we need to recolour backwards from the current position has been
recorded in bits 9 to 16 of the inner state (which has been counting
up lengths of identifiers), while the scanning algorithm has also
recorded the colour to be used. For instance, in
Object -> bottle "~Heinz~ bottle"
^ ^ ^
backtracks of size 6, 2 and 6 are called for at the three marked
spaces. Note that a backtrack never crosses a new-line.
ZapInform uses the following chart of colours:
name default actual colour
foreground navy blue
quoted text grey
comment light green
directive black
property red
function red
code navy blue
codealpha dark green
assembly gold
escape character red
but note that at this stage, we've only used the following:
function colour [ and ] as function brackets, plus function names
comment colour comments
directive colour initial directive keywords, plus "*",
"->", "with", "has" and "class" when used
in a directive context
quoted text colour singly- or doubly-quoted text
foreground colour code in directives
code colour code in statements
property colour property, attribute and class names when
used within "with", "has" and "class"
For instance,
Object -> bottle "~Heinz~ bottle"
would give us the array
DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQ
(F being foreground colour; it doesn't really matter what colour
values the spaces have).
(d) Colour refinement
The next operation is "colour refinement", which includes a number
of things.
Firstly, any characters with colour Q (quoted-text) which have special
meanings are given "escape-character colour" instead. This applies
to "~", "^", "\" and "@" followed by (possibly) another "@" and a
number of digits.
Next we look for identifiers. An identifier for these purposes includes
a number, for it is just a sequence of:
"_" or "$" or "#" or "0" to "9" or "a" to "z" or "A" to "Z".
The initial colouring of an identifier tells us its context. We're
only interested in those in foreground colour (these must be used
in the body of a directive) or code colour (used in statements).
If an identifier is in code colour, then:
If it follows an "@", recolour the "@" and the identifier in
assembly-language colour.
Otherwise, unless it is one of the following:
"box" "break" "child" "children" "continue" "default"
"do" "elder" "eldest" "else" "false" "font" "for" "give"
"has" "hasnt" "if" "in" "indirect" "inversion" "jump"
"metaclass" "move" "new_line" "nothing" "notin" "objectloop"
"ofclass" "or" "parent" "print" "print_ret" "provides" "quit"
"random" "read" "remove" "restore" "return" "rfalse" "rtrue"
"save" "sibling" "spaces" "string" "style" "switch" "to"
"true" "until" "while" "younger" "youngest"
we recolour the identifier to "codealpha colour".
On the other hand, if an identifier is in foreground colour, then we
check it to see if it's one of the following interesting keywords:
"first" "last" "meta" "only" "private" "replace" "reverse"
"string" "table"
If it is, we recolour it in directive colour.
Thus, after colour refinement we arrive at the final colour scheme:
function colour [ and ] as function brackets, plus function names
comment colour comments
quoted text colour singly- or doubly-quoted text
directive colour initial directive keywords, plus "*",
"->", "with", "has" and "class" when used
in a directive context, plus any of the
reserved directive keywords listed above
property colour property, attribute and class names when
used within "with", "has" and "class"
foreground colour everything else in directives
code colour operators, numerals, brackets and statement
keywords such as "if" or "else" occurring
inside routines
codealpha colour variable and constant names occurring inside
routines
assembly colour @ plus assembly language opcodes
escape char colour special or escape characters in quoted text
(e) An example
Consider the following example stretch of code (which is not meant to
be functional or interesting, just colourful):
! Here's the bottle:
Object -> bottle "bottle marked ~DRINK ME~"
with name "bottle" "jar" "flask",
initial "There is an empty bottle here.",
before
[; LetGo: ! For dealing with water
if (noun in bottle)
"You're holding that already (in the bottle).";
],
has container;
[ ReadableSpell i j k;
if (scope_stage==1)
{ if (action_to_be==##Examine) rfalse;
rtrue;
}
@set_cursor 1 1;
];
Extend "examine" first
* scope=ReadableSpell -> Examine;
Here are the initial colourings:
! Here's the bottle:
CCCCCCCCCCCCCCCCCCCC
Object -> bottle "bottle marked ~DRINK ME~"
DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQQQQQQQQQQQ
with name "bottle" "jar" "flask",
FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
initial "There is an empty bottle here.",
FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
before
FFFFFFFPPPPPP
[; LetGo: ! For dealing with water
FFFFFFFfSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
if (noun in bottle)
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
"You're holding that already (in the bottle).";
SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
],
SSSSSSSfD
has container;
FFDDDDDPPPPPPPPPD
[ ReadableSpell i j k;
fffffffffffffffSSSSSSS
if (scope_stage==1)
SSSSSSSSSSSSSSSSSSSSS
{ if (action_to_be==##Examine) rfalse;
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
rtrue;
SSSSSSSSSSSS
}
SSS
@set_cursor 1 1;
SSSSSSSSSSSSSSSSSS
];
fD
Extend "examine" first
DDDDDDDQQQQQQQQQFFFFFF
* scope=ReadableSpell -> Examine;
FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD
(Here F=foreground, D=directive, f=function, S=code (S for
"statement"), C=comment, P=property, Q=quoted text.) And here is
the refinement:
! Here's the bottle:
CCCCCCCCCCCCCCCCCCCC
Object -> bottle "bottle marked ~DRINK ME~"
DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQEQQQQQQQQEQ
with name "bottle" "jar" "flask",
FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
initial "There is an empty bottle here.",
FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
before
FFFFFFFPPPPPP
[; LetGo: ! For dealing with water
FFFFFFFfSSIIIIISSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
if (noun in bottle)
SSSSSSSSSSSSSSSSSIIIISSSSIIIIIIS
"You're holding that already (in the bottle).";
SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
],
SSSSSSSfD
has container;
FFDDDDDPPPPPPPPPD
[ ReadableSpell i j k;
fffffffffffffffSSSSSSS
if (scope_stage==1)
SSSSSSIIIIIIIIIIISSIS
{ if (action_to_be==##Examine) rfalse;
SSSSSSSSSSIIIIIIIIIIIISSIIIIIIIIISSSSSSSSS
rtrue;
SSSSSSSSSSSS
}
SSS
@set_cursor 1 1;
SSAAAAAAAAAAASISIS
];
fD
Extend "examine" first
DDDDDDDQQQQQQQQQFDDDDD
* scope=ReadableSpell -> Examine;
FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD
(where E = escape characters, A = assembly and I = "codealpha", that
is, identifiers cited in statement code).
--
Graham Nelson |
[email protected] | Oxford, United Kingdom