Subject: How to syntax-colour Inform
Date: Wed, 17 Dec 1997 18:46:34 +0000 (GMT)
From: Graham Nelson <[email protected]>
Newsgroups: rec.arts.int-fiction

[This is going to be a new section in the Inform Technical Manual,
which seems as good a place to keep it as any, but in the mean time
it's been requested several times on the newsgroup, hence this
posting.  Comments welcome -- GN.]

How to syntax-colour Inform source code
---------------------------------------

"Syntax colouring" is an automatic process which some text editors apply
to the text being edited: the characters are displayed just as they are,
but with artificial colours added according to what the text editor thinks
they mean.  The editor is in the position of someone going through a book
colouring all the verbs in red and all the nouns in green: it can only do
so if it understands how to tell a verb or a noun from other words.
Many good text editors have been programmed to syntax colour for languages
such as C, and a few will allow users to reprogram them to other languages.

One such is the popular Acorn RISC OS text editor "Zap", for which the
author has written an extension mode called "ZapInform".  ZapInform
contributes colouring rules for the Inform language and as over a dozen
people have now asked me how it works, while the original is written
in ARM assembly (a language rather less widely spoken than Middle Egyptian)
it seems worth documenting the main algorithm.

(ZapInform does a number of other useful things, including pasting in
template objects and rooms when commanded from a mouse-accessed menu:
for instance, you can create a simple game with two or three mouse
clicks and a few object names typed in to a dialogue box, then click
to save and compile the result.  See the ZapInform manual for details.)

(a)  State values

ZapInform associates a 32-bit number called the "state" with every
character position.

The "state" is as follows.  11 of the upper 16 bits hold flags, the
rest being unused:

  32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
                                               comment
                                            single-quoted text
                                         double-quoted text
                                      statement
                                   after marker
                                highlight flag
                             highlight all flag
                          colour backtrack
                       after-restart-flag
                    wait-direct (waiting for a directive)
                 dont-know-flag

These flags make up the "outer state" while the lower 16 bits holds
a number pompously called the "inner state":

            0    after WS (WS = white space or start of line or comma)
            1    after WS then "-"
            2    after WS then "-" and ">" [terminal]
            3    after WS then "*" [terminal]

         0xFF    after junk
  0x100*N + S    after WS then an Inform identifier N+1 characters long
                 itself in state S:
                 101 w    202 wi   303 wit   404 with
                 111 h    212 ha   313 has
                 121 c    222 cl   323 cla   424 clas   525 class
same + 0x8000    when complete [terminal]

In practice it would be madness to try to actually store the state
of every character position in memory (it would occupy four times as
much space as the file itself).  Instead, ZapInform caches just one
state value, the one most recently calculated, and uses a process
called "scanning" to determine new states.  That is, given that we
know the state at character X and want to know the state at character
Y, we can find out by scanning each character between X and Y,
altering the state according to each one.

It might possible save some time to cache more state values than
this (say, the state values at the start of every screen-visible
line of text, or some such) but the complexity of doing this doesn't
seem worthwhile on my implementation.  Scanning is a quick process
because the Zap text editor stores the entire file in almost contiguous
memory, easy to run through, and the state value can be kept in a
single CPU register while this is done.

(b)  Scanning text

Let us number the characters in a file 1, 2, 3, ...

The state before character 1 is always 0x02000000: that is, inner
state zero and outer state with only the waiting-for-directive flag set.
(One can think of this as the state of an imaginary "character 0".)
The state at character N+1 is then a function of the state at
character N and what character is actually there.  Thus,

      State(0) = 0x02000000

and for all N >= 0,

      State(N+1) = Scanning_function(State(N), Character(N+1))

And here is what the scanning function does:

   1.  Is the comment bit set?
          Is the character a new-line?
             If so, clear the comment bit.
             Stop.

   2.  Is the double-quote bit set?
          Is the character a double-quote?
             If so, clear the double-quote bit.
             Stop.

   3.  Is the single-quote bit set?
          Is the character a single-quote?
             If so, clear the single-quote bit.
             Stop.

   4.  Is the character a single quote?
          If so, set the single-quote bit and stop.

   5.  Is the character a double quote?
          If so, set the double-quote bit and stop.

   6.  Is the character an exclamation mark?
          If so, set the comment bit and stop.

   7.  Is the statement bit set?
          If so:
             Is the character "]"?
                If so:
                   Clear the statement bit.
                   Stop.

             If the after-restart bit is clear, stop.

             Run the inner finite state machine.

             If it results in a keyword terminal (that is, a terminal
             which has inner state 0x100 or above):
                Set colour-backtrack (and record the backtrack colour
                as "function" colour).
                Clear after-restart.

             Stop.

          If not:
             Is the character "["?
                If so:
                   Set the statement bit.
                   If the after-marker bit is set, set after-restart.
                   Stop.

             Run the inner finite state machine.

             If it results in a terminal:
                Is the inner state 2 [after "->"] or 3 [after "*"]?
                   If so:
                      Set after-marker.
                      Set colour-backtrack (and record the backtrack
                      colour as "directive" colour).
                      Zero the inner state.
                [If not, the terminal must be from a keyword.]
                Is the inner state 0x404 [after "with"]?
                   If so:
                      Set colour-backtrack (and record the backtrack
                      colour as "directive" colour).
                      Set after-marker.
                      Set highlight.
                      Clear highlight-all.
                Is the inner state 0x313 ["has"] or 0x525 ["class"]?
                   If so:
                      Set colour-backtrack (and record the backtrack
                      colour as "directive" colour).
                      Set after-marker.
                      Clear highlight.
                      Set highlight-all.
                If the inner state isn't one of these: [so that recent
                text has formed some alphanumeric token which might or
                might not be a reserved word of some kind]
                   If waiting for directive is set:
                         Set colour-backtrack (and record the backtrack
                         colour as "directive" colour)
                   If not, but highlight-all is set:
                         Set colour-backtrack (and record the backtrack
                         colour as "property" colour)
                   If not, but highlight is set:
                         Clear highlight.
                         Set colour-backtrack (and record the backtrack
                         colour as "property" colour).

                Is the character ";"?
                   If so:
                      Set wait-direct.
                      Clear after-marker.
                      Clear after-restart.
                      Clear highlight.
                      Clear highlight-all.
                Is the character ","?
                   If so:
                      Set after-marker.
                      Set highlight.

             Stop.

The "inner finite state machine" adjusts only the inner state, and
always preserves the outer state.  It not only changes an old inner
state to a new inner state, but sometimes returns a "terminal" flag
to signal that something interesting has been found.

         State      Condition      Go to state     Return terminal-flag?
         0          if "-"         1
                    if "*"         3               yes
                    if space, "#",
                       newline     0
                    if "_"         0x100
                    if "w"         0x101
                    if "h"         0x111
                    if "c"         0x121
                    other letters  0x100
                    otherwise      0xFF
         1          if ">"         2               yes
                    otherwise      0xFF
         2          always         0
         3          always         0
         0xFF       if space,
                       newline     0
                    otherwise      0xFF

         all 0x100+ states:
                    if not alphanumeric, add
                       0x8000 to the state         yes
         then for the following states:
         0x101      if "i"         0x202
                    otherwise      0x200
         0x202      if "t"         0x303
                    otherwise      0x300
         0x303      if "h"         0x404
                    otherwise      0x400
         0x111      if "a"         0x212
                    otherwise      0x200
         0x212      if "s"         0x313
                    otherwise      0x300
         0x121      if "l"         0x222
                    otherwise      0x200
         0x222      if "a"         0x323
                    otherwise      0x300
         0x323      if "s"         0x424
                    otherwise      0x400
         0x424      if "s"         0x525
                    otherwise      0x500
         but for all other 0x100+ states:
                    if alphanumeric, add
                       0x100 to the state

         0x8000+    always         0

(Note that if your text editor stores tabs as characters in their own
right (usually 0x09) rather than rows of spaces, tab should be included
with space and newline in the above.)

Briefly, the finite state machine can be left running until it returns
a terminal, which means it has found "->", "*" or a completed Inform
identifier: and it detects "with", "has" and "class" as special keywords
amongst these identifiers.

(c)  Initial colouring

ZapInform colours one line of visible text at a time.  For instance, it
might be faced with this:

    Object -> bottle "~Heinz~ bottle"

And it outputs an array of colours for each character position in the
line, which the text editor can then use in actually displaying the text.

It works out the state before the first character of the line (the "O"),
then scans through the line.  For each character, it determines the
initial colour as a function of the state at that character:

 If single-quote or double-quote is set, then quoted text colour.
 If comment is set, then comment colour.
 If statement is set:
    Use code colour
       unless the character is "[" or "]", in which case use
          function colour,
       or is a single or double quote, in which case use quoted text
          colour.
 If not:
    Use foreground colour
       unless the character is "," or ";" or "*" or ">", in which
          case use directive colour,
       or the character is "[" or "]", in which case use
          function colour,
       or is a single or double quote, in which case use quoted text
          colour.

However, the scanning algorithm sometimes signals that a block of
text must be "backtracked" through and recoloured.  For instance,
this happens if the white space after the sequence "c", "l", "a",
"s" and "s" is detected when in a context where the keyword "class"
is legal.  The scanning algorithm does this by setting the "colour
backtrack" bit in the outer state.  Note that the number of characters
we need to recolour backwards from the current position has been
recorded in bits 9 to 16 of the inner state (which has been counting
up lengths of identifiers), while the scanning algorithm has also
recorded the colour to be used.  For instance, in

    Object -> bottle "~Heinz~ bottle"
          ^  ^      ^

backtracks of size 6, 2 and 6 are called for at the three marked
spaces.  Note that a backtrack never crosses a new-line.

ZapInform uses the following chart of colours:

   name                   default actual colour

   foreground             navy blue
   quoted text            grey
   comment                light green
   directive              black
   property               red
   function               red
   code                   navy blue
   codealpha              dark green
   assembly               gold
   escape character       red

but note that at this stage, we've only used the following:

   function colour        [ and ] as function brackets, plus function names
   comment colour         comments
   directive colour       initial directive keywords, plus "*",
                          "->", "with", "has" and "class" when used
                          in a directive context
   quoted text colour     singly- or doubly-quoted text
   foreground colour      code in directives
   code colour            code in statements
   property colour        property, attribute and class names when
                          used within "with", "has" and "class"

For instance,

    Object -> bottle "~Heinz~ bottle"

would give us the array

    DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQ

(F being foreground colour; it doesn't really matter what colour
values the spaces have).

(d)  Colour refinement

The next operation is "colour refinement", which includes a number
of things.

Firstly, any characters with colour Q (quoted-text) which have special
meanings are given "escape-character colour" instead.  This applies
to "~", "^", "\" and "@" followed by (possibly) another "@" and a
number of digits.

Next we look for identifiers.  An identifier for these purposes includes
a number, for it is just a sequence of:

    "_" or "$" or "#" or "0" to "9" or "a" to "z" or "A" to "Z".

The initial colouring of an identifier tells us its context.  We're
only interested in those in foreground colour (these must be used
in the body of a directive) or code colour (used in statements).

If an identifier is in code colour, then:

   If it follows an "@", recolour the "@" and the identifier in
      assembly-language colour.
   Otherwise, unless it is one of the following:

     "box"  "break"  "child"  "children"  "continue"  "default"
     "do"  "elder"  "eldest"  "else"  "false"  "font"  "for"  "give"
     "has"  "hasnt"  "if"  "in"  "indirect"  "inversion"  "jump"
     "metaclass"  "move"  "new_line"  "nothing"  "notin"  "objectloop"
     "ofclass"  "or"  "parent"  "print"  "print_ret"  "provides"  "quit"
     "random"  "read"  "remove"  "restore"  "return"  "rfalse"  "rtrue"
     "save"  "sibling"  "spaces"  "string"  "style"  "switch"  "to"
     "true"  "until"  "while"  "younger"  "youngest"

   we recolour the identifier to "codealpha colour".

On the other hand, if an identifier is in foreground colour, then we
check it to see if it's one of the following interesting keywords:

     "first"  "last"  "meta"  "only"  "private"  "replace"  "reverse"
     "string"  "table"

If it is, we recolour it in directive colour.

Thus, after colour refinement we arrive at the final colour scheme:

   function colour        [ and ] as function brackets, plus function names
   comment colour         comments
   quoted text colour     singly- or doubly-quoted text
   directive colour       initial directive keywords, plus "*",
                             "->", "with", "has" and "class" when used
                             in a directive context, plus any of the
                             reserved directive keywords listed above
   property colour        property, attribute and class names when
                             used within "with", "has" and "class"
   foreground colour      everything else in directives
   code colour            operators, numerals, brackets and statement
                             keywords such as "if" or "else" occurring
                             inside routines
   codealpha colour       variable and constant names occurring inside
                             routines
   assembly colour        @ plus assembly language opcodes
   escape char colour     special or escape characters in quoted text

(e)  An example

Consider the following example stretch of code (which is not meant to
be functional or interesting, just colourful):

  ! Here's the bottle:

  Object -> bottle "bottle marked ~DRINK ME~"
    with name "bottle" "jar" "flask",
         initial "There is an empty bottle here.",
         before
         [; LetGo:                      ! For dealing with water
               if (noun in bottle)
                   "You're holding that already (in the bottle).";
         ],
    has  container;

  [ ReadableSpell i j k;
    if (scope_stage==1)
    {   if (action_to_be==##Examine) rfalse;
        rtrue;
    }
    @set_cursor 1 1;
  ];

  Extend "examine" first
                  * scope=ReadableSpell            -> Examine;

Here are the initial colourings:

  ! Here's the bottle:
  CCCCCCCCCCCCCCCCCCCC

  Object -> bottle "bottle marked ~DRINK ME~"
  DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQQQQQQQQQQQ
    with name "bottle" "jar" "flask",
  FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
         initial "There is an empty bottle here.",
  FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
         before
  FFFFFFFPPPPPP
         [; LetGo:                      ! For dealing with water
  FFFFFFFfSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
               if (noun in bottle)
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
                   "You're holding that already (in the bottle).";
  SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
         ],
  SSSSSSSfD
    has  container;
  FFDDDDDPPPPPPPPPD

  [ ReadableSpell i j k;
  fffffffffffffffSSSSSSS
    if (scope_stage==1)
  SSSSSSSSSSSSSSSSSSSSS
    {   if (action_to_be==##Examine) rfalse;
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
        rtrue;
  SSSSSSSSSSSS
    }
  SSS
    @set_cursor 1 1;
  SSSSSSSSSSSSSSSSSS
  ];
  fD

  Extend "examine" first
  DDDDDDDQQQQQQQQQFFFFFF
                  * scope=ReadableSpell            -> Examine;
  FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(Here F=foreground, D=directive, f=function, S=code (S for
"statement"), C=comment, P=property, Q=quoted text.)  And here is
the refinement:

  ! Here's the bottle:
  CCCCCCCCCCCCCCCCCCCC

  Object -> bottle "bottle marked ~DRINK ME~"
  DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQEQQQQQQQQEQ
    with name "bottle" "jar" "flask",
  FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
         initial "There is an empty bottle here.",
  FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
         before
  FFFFFFFPPPPPP
         [; LetGo:                      ! For dealing with water
  FFFFFFFfSSIIIIISSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
               if (noun in bottle)
  SSSSSSSSSSSSSSSSSIIIISSSSIIIIIIS
                   "You're holding that already (in the bottle).";
  SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
         ],
  SSSSSSSfD
    has  container;
  FFDDDDDPPPPPPPPPD

  [ ReadableSpell i j k;
  fffffffffffffffSSSSSSS
    if (scope_stage==1)
  SSSSSSIIIIIIIIIIISSIS
    {   if (action_to_be==##Examine) rfalse;
  SSSSSSSSSSIIIIIIIIIIIISSIIIIIIIIISSSSSSSSS
        rtrue;
  SSSSSSSSSSSS
    }
  SSS
    @set_cursor 1 1;
  SSAAAAAAAAAAASISIS
  ];
  fD

  Extend "examine" first
  DDDDDDDQQQQQQQQQFDDDDD
                  * scope=ReadableSpell            -> Examine;
  FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(where E = escape characters, A = assembly and I = "codealpha", that
is, identifiers cited in statement code).

--
Graham Nelson | [email protected] | Oxford, United Kingdom