cxpos is an implementation of this paper:
=========================================
Copyright 2004, D.Sofyan & Adrian Hafizh,
private property of PT SOFTINDO, Jakarta.
All rights reserved.
=========================================
Title: An analysis of Bound-Range in indexed-character-based string matching scan
Version: 1.0.0.1-beta
Authors: D.Sofyan & Adrian Hafizh
The most important thing when doing a substring scan is to know
WHEN the character sequences in the substring is no longer
matched, as early as possible, and if possible, to detect WHAT
sequences does not worth to scan and could be skipped, as much as
possible, to minimalize useless, discarded matching tries.
Based on this postulate, we would analyze the new algorithm of
character based scanning to quickly locate a position of pattern
or smaller substring against the larger one, much more efficiently
compared to that of conventional scan.
On this paper we propose an improved, sub-optimalization of
character-based pattern match technique.
.
..
...
First, we have to build an INDEX table of substring to be matched
against. This table should be initialized in such a way that the
ordinal position of particular character in the INDEX table
contains the index/position of that character in the substring,
or in other words, every characters in the substring, have their
index/position in the substring recorded in their respective
ordinal position in the INDEX table. For any other characters that
are not in the substrings, their ordinal positions in the INDEX
table should be filled with a value greater than the last position
of the substring (an OUT-RANGE index-value). From now on, for
simplicity shake, the ordinal position of particular character's
value in the INDEX table will be called as only the INDEX-VALUE.
Conveniently, we use 0-based position for the index/position.
When a particular character appears more than once in the
substring, then the ordinal position of that character in the
INDEX table must contain the highest value of the index/position
of that particular character in the substring (the last one). On
further descriptions we will call this kind of characters as TWINS.
After filling the INDEX table with the proper values, we have to know
the index-values range by getting the LOWEST and the HIGHEST
index-value of the characters sequence in the substring (also in
the INDEX table); the HIGHEST index-value is already indicated by
the last position of the substring, whereas the LOWEST index-value
could be gotten by scanning either the substring or the INDEX table,
From now on, they would simply be called them as the LOWEST and the
HIGHEST respectively.
It is also worth to know if there are not any duplicated characters
(TWINS) in the substring. The search algorithm will be much stright
and simpler, for example, any out-of-range character catched during
the skim will result on to rejection of the whole skimmed-substring,
because the LOWEST would have never been reached anyway upto until
the last fetch.
// If they are not any duplicated characters in the substring (TWINS):
// Once the INDEX table has been built, we no longer need the
// substring anymore, all of the informations we need are already
// in the INDEX table.
In this document, we analyze the comprehensive state in general,
which also sorrounding/including this unique character sequences,
should the performance is considered to be much critical, we might
have to do a different branch for this case.
The other advantages of using the INDEX table is that no-case
comparison will no longer need different or additional instructions,
as long as the comparison is based on their index-values, not between
the characters by themselves, because neither lower or uppercase is
treated specially, only the index/position of particular character
in the substring that makes different. .
On further searching, we should avoid doing comparison of substring
with string to be matched against by their character per characters
per-se, instead, we mainly using this INDEX table to do the
comparison. The characters which contained/existed in the substring
will have its index-value equal with its position in the substring,
any character whose index-values higher than the length of substring
(OUT-RANGE) does not need any further processing, or in the other
means, only the characters in the substring will be taken into account.
Second, when doing a skim, we do it in the reversed order, to
take advantages of the length of the substring to be matched
against to be accounted in. This reversed way will also lead to
non-measurable advantages which gain effectiveness when scanning
a recognizable pattern (a word) against sequences of another
recognizable pattern (such a document text), because one word has
a relatively low possibility to match another pattern consisted
by to half words (in the same length), thus the non-match will
detected earlier.
On this algorithm, this reversed way is also aimed at the computation
on how much characters sequences could be skipped at comparison's fail
when skimming.
Finally, mainly by using the index-values we will seek the string to
be matched against, try to locate the possible matched substring, and
if possible, skip any sequence of characters that will never match.
The search begin with comparing the fetched character from the
string with the first character of the substring (we called further
the position of the string at this point as ORIGIN), only if they
are equal then we begin the substring skim cycle, otherwise we
simply skip to the next character.
// If they are not any duplicated characters (TWINS) in the substring:
// The index-value of 0 has the special position here, because it is
// should be used as initial substring skim cycle. The index-value of 0
// means that the character is the first character in the substring,
// which we will seek it first on the string, if the character fetched
// in the string has the index-value other than 0, we simply skip it
// to fetch the next character, only if its index-value is 0 then we
// begin the substring skim cycle.
The substring SKIM CYCLE is the process of comparing the CHUNK of
the string (portion of the string with the length of the substring,
--will be called as only LENGTH on further descriptions) against
the substring, on a indexed-character-based comparison. When the
comparison fails, whe should check WHEN is that happens (denoted
by the position COUNTER) and WHAT is the INDEX-VALUE (of the
character fetched from the chunk, at the offset pointed by the
current counter position). Keep the terminologies in mind, for
those will be carried on further descriptions.
On any conditions, the process will keep going on while the
index-value is equal with the counter, until end-of-loop, that is
the counter reaches its lowest value, which also means that the
substring has found. Only when the index-value is different from
the counter then we have to suspend the process, and do several
investigation to determine the exact state.
On any conditions, if the index-value is higher than the HIGHEST
then the cycle will be ended and the original fetch position could
be skipped to that point. This conditions indicates that the
character in question is out of range, the chunk could be safely
cut to the position after this character.
The skimming cycle started from the HIGHEST value to the LOWEST
value (counted-down), this counter should be compared against the
index-value of current fetched character (original fetched position
with the offset of counter). There are no physical accesses to the
substring at this step, the comparisons are merely of index-values
against the counter.
// If they are not any duplicated characters (TWINS) in the substring:
// We never do physical comparison of the characters at all
// We only need a single comparison stage, any index-value
// different from the counter will lead to suspended skim
// If the index-value is lower than the counter we should shift
// forward the original fetch position by (counter - index-value) away
// if the index-value higher than the counter, then we could
// discard the whole string and skip forward the original fetch
// position to the length of the substring
On comparison fails, and the index-value is between the LOWEST and
the HIGHEST, then we have to do the next step, that is manually
comparing the character from chunk with the substring, both at the
same counter position, if they are equal then we continue the
suspended first-step. For the comparison to be carried on further
processing, then is not the characters itself that we had the
comparison against to, but rather indirectly by their index-values.
Note that if both of the characters are equal (by comparing their
index-values), there is an indication of TWINS, notice also that
their values will have to be higher than the counter.
If the comparison fails at this stage, then one way or another,
the skim have to be ended. But beforewise, we might do another
check to determine how much save we could get by skipping useless
further tries.
At this stage, we virtually try to match the character of
substring[?] where ? is the index-value of the chunk[COUNTER].
Any index-values between the LOWEST and COUNTER are considered
to be OUT-RANGE. Note that the values below the LOWEST will never
exist, so what we have to consider are only the COUNTER value, and
(supposed to be) the appearance of the LOWEST, and the values
between them, if any. The convergency of those lower and
upper-bound makes the RANGE-bound in effect as described later.
The RANGE-bound is a virtual range between the LOWEST and the
(counted-down) HIGHEST, consisted by all of acceptable index-
values. At first, it is composed equally with the substring. The
RANGE-bound length is also in correlation with the LENGTH of the
substring (or the chunk), it supposed to be equal with the
counter itself, in fact it is, if there are not any TWINS. Due to
this TWINS problem, the length of RANGE-bound is practically less
than LENGTH, because of the length of several characters is counted
as only 1. What we do know is only the upper-bound of the RANGE-bound,
indicated by the counter (and the expected lower-bound by the LOWEST).
while skimming, the RANGE-bound length might be virtually decreased
towards the lower bound according to the counter, yet it physically
decreases only if the character fetched IS NOT a TWIN. Wen the fetched
character IS a TWIN, the RANGE-bound decrements virtually mapped to
a corresponding TWIN value, this map is always an index-value higher
than the counter NEVER below them, (because of the increasing rules
of the index-values when they're been built), in consequence of that,
the lower-bound, would be the last index-value reached.
The expected to be found: the LOWEST, is the index-value of the
least and last (the first-reversed) particular character class that
expected to be found. She is the lower-bound of the range, also
implies as the assurance of the RANGE-bound length limit, though
we don't know exactly WHERE she is, neither WHAT of her value, we
just expect (or more precisely: assumed) that she might have been
there at one position later, with one particular value. In opposite
to the upper-bound, that is the counter, we do know exactly
where and what the value is.
In conclusion, If the lower-bound has never been reached, then we
ought to be sure that the length of RANGE-bound is keep intact with
the length of the substring. In contrast, once the LOWEST has been
found (reached), then we might not be confident about the integrity
of (the length of) the RANGE-bound anymore, in fact, the RANGE-bound
is no longer exists.
(It's paradoxial, isn't? We are sure about something based on
what we aren't, yet we aren't when conversely we are).
Implementation:
Since any prior index-values higher than the counter have been
previously mapped to their respective characters, the index-value
may not be higher than the counter (as noted above, the index-
value of TWINS, of course would be higher than the counter, but
consequently, both of the index-values will be once ever equal
anyway), if they did, then we could safely truncate the original
fetch position, right after that out-of-range character. Thus
basically, any OUT-RANGE index-value will result in rejection of
some portion of the chunk, or precisely, truncation of the chunk
up to the position of the OUT-RANGE index-value.
(Please bear in mind that when the LOWEST has been reached, then
the RANGE-bound is not in effect anymore).
If an unexpected index-value appears while the LOWEST has not been
reached yet, then the chunk could be discarded, we do know that
any unexpected index-value (any index-values not in the RANGE)
will violate the RANGE-bound. Since we have a confidence that the
length of RANGE-bound is keep intact with the length of substring
while the LOWEST has not been reached, then any appearance of
OUT-RANGE index-values would violate the length-bound. And since
the LOWEST whose characters implies as the firstly characters at
the sequence are has not been appears yet, we might discard all
the characters sequence that have been passed the prior test from
beginning of the skim cycle, whereas the remaining, untested
characters sequence, might also discarded as well because its
must be shorter than LENGTH as intended by the length of the
substring. They are all bring us the conclusion that the whole
chunk could be safely discarded.
addition:
If the LOWEST has NEVER been reached then the whole chunk could
also be discarded, the fetch position could be skipped by the
length of the substring.
If the LOWEST has EVER been reached then we should put the chunk's
character at her ESTIMATED proper place, in other words, shift
accordingly the original fetch position by length indicated by
the difference between the counter and the index-value. (As would
be described later, the shift would be increased by +1). Note that
we could not truncate the chunk because there could be several
characters in low order counter which might be matched the index-
value of TWINS.
Note: Further checks would be taken place ONLY for inequality.
if the LOWEST has not been reached yet, it means that the acceptable
index-value must be fit between LOWEST and counter (an exception
of this is for TWINS which should have been passed the earlier test).
if the LOWEST has been reached, it means that the counter position
(value) is below the lowest posibble index-value, its impossible
in this position that any lower index-values (compared with the
counter value) will be found, is not worth to test, it must be
higher than the counter value, and also, the index-value must be
equal with the LOWEST (since any TWINS should have been catched
in the prior comparison test, and so did with the out-range
characters, --in fact this state would have never been reached in
the last case), and finally it must be the SECOND LOWEST's
character found (since the first-one, would have been already
passed the previous test).
Nevertheless, at this state we could not discard the chunk
neither truncate it, since it is possible for all characters
(any index-values) betwen the LOWEST and HIGHEST might appear,
(although in this state they must be failed the comparison test).
Those will bring us a conclusion to shift the original fetch
position accordingly so the character in question is virtually
placed BEFORE the LOWEST position (recall that the character is
the SECOND LOWEST's character found), the shift distance it is
equal with the distance between counter and the LOWEST plus one.
It fairly possible that some distance exists between this character
position with the first LOWEST character position found, but we do
not have any information about how much of that, the best that we
could do is to shift the original fetch position as stated above.
(P = P + LOWEST - COUNTER +1).
SUMMARY:
There's always a trade between efficiency and completeness.
On comparison fails, we could simply fetch the next character
instead of doing any further test and processing which may or
may not give any advantages for searching.
Below are the summary of the steps and tests that could be taken,
from a simple to a comprehensive ones.
1. If the comparison fails (the index-value is different from the
counter) we could simply step one forward the original fetch
position.
2. If the comparison fails, we could do a test to check whether the
index-value is OUT-RANGE (higher than the HIGHEST) or not, and if it
did, then we could discard the chunk and skip forward that out-range
character), else we step the original fetch position one forward.
(higher than HIGHEST -> discard)
3. If the comparison fails, the index-value higher than the counter,
then we should get the LOWEST and then test whether she has been
reached or not by comparing her with the counter, if she was greater
than the counter, it means that she has been reached, and vice versa.
If she has NOT been reached yet, then we could discard the chunk,
else we step the original fetch position one forward.
(index-value > counter > LOWEST -> discard)
4. If the comparison fails, we might also do another test, by first
getting the index-value2 of the character in the substring at the
counter position, if only both of the index-values are equal then
the skim continued, else, if the chunk's index-value is out of
RANGE-bound (higher than the counter), then we could discard the
chunk and skip the original fetch to position by the LENGTH away,
else we step the original fetch position one forward.
5. If the comparison fails, the index-value2 we've got from the
substring's character at position of the counter is different
from the chunk's index-value, then we might do a final test, that
is checking whether the LOWEST has beeen reached by comparing the
counter with the LOWEST. If the values are different (and it will
--as described above, the LOWEST's value MUST be higher than the
counter, and also it will be EQUAL with the chunk's index-value),
then we should positioned the chunk's character at her ESTIMATED
proper place, ie. shift the ORIGIN by the difference between the
counter and the (assumed) index-value plus 1 additional offset.
Note that the last step is nothing more but an extension of the
previous one, there is no additional state, neither additional
burden to the program except for computing the skippable distance.
Also in practice, we should not wasted any effort to get the LOWEST
value, for she only exists in concept.
Ch = SubStr[1]
P = address(S)
tail = P + length(S) - length(SubStr) -1
@Loop:
inc(P)
if P > tail then goto @NOTFOUND
if S[P] <> Ch goto @Loop
Counter = length(SubStr)
@SkimLoop:
dec(Counter)
if Counter = 0 then goto @FOUND
I = index-value of S[P]
if I == Counter then goto @SkimLoop
(I could be any value excep Counter)
METHOD1:
if S[P + Counter] = SubStr[Counter] then goto @SkimLoop
else inc(P, Counter +1) (equal with @Shift1 below)
(
Here we assumed that the LOWEST has (always) been reached,
then the LOWEST MUST be equal with Counter.
It is no harm to do a thing like that, since if the LOWEST has
been reached we could either DISCARD or SHIFT the chunk.
I > Counter -> DISCARD
I < Counter -> SHIFT
The beauty of this simplification is that the Counter
value will always match the distance that could be skipped.
On the other hand, with this treatment we might only
shifting the chunk, when it could be discarded at all.
)
METHOD2:
if I > Counter then begin
V = Index-value of SubStr[Counter]
if I == V then goto @SkimLoop
if Counter > LOWEST then goto @Discard (P = P + HIGEHST +1)
else begin
if I > HIGHEST then @Truncate (P = P + Counter)
else @Shift (by delta value: P = P + Counter - I)
end
end
else (I must be < Counter)
goto @Shift1 (P = P + Counter +1)