 |
 |
 |
 |
MORE ABOUT SYNTAX OF REGULAR EXPRESSIONS
If you are not already familiar with the topic of regular expression
start with the introduction
to regular expressions. This page is intended as a terse but
comprehensive summary of the Xerox Regular Expression language.
On this page you will find:
SYMBOLS :

Except for epsilon and other, a single symbol
represents a string that consists of 16-bit Unicode characters. The
string is terminated by a pair of null characters.
| a |
Single symbol a. An ordinary letter without double quotes
around it is interpreted as a single character C string of the Latin-1
script. It is expanded to a string of 16-bit characters by adding an
initial zero byte to each character, including the terminating null
character. |
| "a" |
Single symbol a.
Double quotes have no special meaning for symbols composed of ordinary letters. |
| "\n" |
Single symbol consisting of a newline character.
Inside double quotes, \n, \t, \v, \b, \r, \f, and \a are interpreted
according to Unix conventions. \n = newline, \t = tab, \b = backspace,
\r = carriage return, \f = formfeed, \a = alert.
| | "\142" |
Single symbol b.
Inside double quotes, \o, \oo, \ooo are interpreted as octal
character codes.
| | "\x63" |
Single symbol c.
Inside double quotes, \xx is interpreted as a hexadecimal characater code.
| | "\u0064" |
Single symbol d.
Inside double quotes, \uxxxx is interpreted as a two-byte Unicode
character (U+xxxx in official Unicode notation).
| | %+ab%- |
Multi-character symbol +ab-.
The % means that the following symbol will be taken literally.
Otherwise, + and - would be interpreted as
regular-expression operators.
|
| "+ab-" |
Multi-character symbol +ab-. Symbols enclosed in double
quotes can contain characters that otherwise are used as operators
(& | - + * etc.). As shown in the examples above, inside double
quotes any character can be written as an octal or hexadecimal number,
or as a two-byte Unicode character encoded as two hexadecimal numbers.
Standard Unix conventions for newline, tab, etc. are also legal.
|
| 0 or "" |
Epsilon symbol, the empty string. |
| ? |
Any symbol. If interpreted as a pair, ? denotes the
identity relation. Note that in a
displayed network ? only represents unknown symbols.
In a regular expression, ? in addition covers all symbols that
explicitly occur in the regular expression.
|
SYMBOLS PAIRS :

A symbol pair is made of two single symbols: the
upper and the lower symbol. In writing and displaying
symbol pairs, we separate the upper symbol from the lower symbol with a colon.
With the exception of ?:? (any mapped
to any), we treat identity pairs as equivalent to the
corresponding single symbol in the lower language.
| a:b |
Symbol pair with a in the upper language and b
in the lower language. |
| "\n":0 |
Symbol pair consisting of the newline in the upper language and
empty string (= epsilon) in the lower language. |
| a:a |
Symbol pair a:a. Identity pairs can also be represented
by a single symbol. The regular expression compiler regards a:a
and a as equivalent. |
| ?:? |
The relation that consists of any symbol mapped to any symbol.
In contrast, ? designates the identity relation when
interpreted as symbol pair.
|
BRACKETS AND PARENTHESES :

| [A] |
Same language as A. These brackets control the order in which
operations are performed. |
| (A) |
Optionality
(union of the language A with the empty language).
|
OPERATIONS ON REGULAR LANGUAGES :

Some regular expressions operators require that the component expressions
denote a regular language. The result denotes a regular language. They are
not applicable to relations.
| ~A |
Set of all strings that are not in the language A.
(~a contains "aa" and "aaa" but not "a".)
|
| \A |
Set of all single-symbol strings that are not in the
language A.
(\a contains "b" and "c" but not "a" and no
strings with more than one symbol like "aa" or "aaab".)
|
| A & B |
Intersection of the languages A and B.
Set of strings that are in A and in B.
|
| A - B |
A minus B. Set of strings that are in A and not in B.
|
OPERATIONS ON REGULAR LANGUAGES AND RELATIONS :

For some operators, the component expressions may denote either languages
and the relations. If one of the component expressions denotes a relation,
the result also designates a relation.
| $A |
Contains.
$A is language of strings or pairs of strings that each
contain at least one instance of A somewhere.
$A is equivalent to [?* A ?*].
|
| $.A |
Contains one.
Set of all strings containing exactly one occurrence of a string from
the language A.
|
| $?A |
Contains at most one.
Set of all strings containing at most one occurrence of a string from
the language A.
|
| A+ |
Kleene plus.
Concatenation of one or more strings from the language A.
|
| A* |
Kleene star.
Concatenation of zero or more strings from the language A.
|
| A/B |
A ignoring B.
Set of all strings that would be in A if all substrings
of B (regarded as noise) were removed.
(a/x contains "a", "xa", "xaxx", etc.)
|
| A ./. B |
A ignoring internally B.
Definition: A./.B = [A/B] - [B ?* | ?* B]
(e.g. [a a a]/x contains "aaa", "aaxa", "axxxaxa",
etc.) |
| A B |
Concatenation of the languages A and B.
(a b is the concatenation of the two single-symbol strings
"a" and "b". However, ab without a blank inbetween
is a string of the single multi-character symbol "ab".)
|
| A | B |
Union of the languages A and B.
Set of strings that are in A or in B or in both languages.
|
| R .P. Q |
Upper-side priority union.
Union of all string pairs from the relation R with all those
string pairs from Q that have an upper string that is not in R.
Alias: R .|u. Q
(Definition: R .|u. Q = R | [Q .-u. R] ). |
| R .p. Q |
Lower-side priority union.
Union of all string pairs from the relation R with all those
string pairs from Q that have a lower string that is not in R.
Alias: R .|l. Q
(Definition: R .|l. Q = R | [Q .-l. R] ). |
| R .-u. Q |
Upper-side minus.
Set of all string pairs from the relation R minus those that have
an upper string that is also in Q.
(Definition: R .-u. Q = [R.u - Q.u] .o. R). |
| R .-l. Q |
Lower-side minus.
Set of all string pairs from the relation R minus those that have
a lower string that is also in Q.
(Definition: R .-l. Q = R .o. [R.l - Q.l]). | | A < B |
A before B.
Set of strings where no substring from B can precede any substring from A.
|
| A > B |
A after B.
Set of strings where no substring from B can follow any substring from A.
|
| A <> B |
Shuffle of A and B.
(e.g. [a b <> x y ] contains the strings axyb,
axby, abxy, xayb, xaby and
xyab.)
|
| A^n |
n concatenations of A.
(e.g. [a | b b]^3 contains the strings aaa,
aabb, abba, abbbb, bbaa, etc.)
|
| A^{n,k} |
n to k concatenations of A.
(e.g. a^{3,5} contains exactly the strings aaa,
aaaa, aaaaa.)
|
| A^>n |
more than n concatenations of A.
(e.g. a^>3 contains the strings aaaa,
aaaaa, aaaaaa, etc.)
|
| A^<n |
less than n concatenations of A.
(e.g. a^<4 contains exactly the strings aaa,
aa, a, and the empty string.)
|
| A.r |
Reverses the regular language A.
(e.g. [a b c].r equals c b a)
|
| `[[A], s, sl] |
Substitution of all symbols s (being a single symbol or
a constituent of a symbol pair) in the language or relation A
by all symbols of the list sl.
(e.g. `[[a b:0 b:x c], b, p q] gives
[a [p:0 | q:0] [p:x | q:x] c].)
|
OPERATORS FOR DESCRIBING REGULAR RELATIONS :

| A .x. B |
Crossproduct of the regular languages A and B.
It denotes a regular relation that maps every string in A
(upper side) to all strings in B (lower side),
and vice versa.
|
| A .o. B |
Composition of two regular relations A and B.
The upper language of A gets directly related to the lower language of B.
The intersection of the lower language of A and the upper language of B
works like a filter without being present in the result of the composition.
(e.g. [a:i | b:j] .o. [i:x | k:y] gives the result [a:x]
because of a:i and i:x.
Everything else is filtered out by the intersection [i|j] & [i|k]
allowing only i on the (disappearing) intermediate level.)
|
| R.u |
Denotes the upper language of the regular relation R.
Alias: R.1
(e.g. [A .x. B].1 equals A)
|
| R.l |
Denotes the lower language of the regular relation R.
Alias: R.2
(e.g. [A .x. B].2 equals B)
|
| R.i |
Inverts the regular relation R.
(e.g. [A .x. B].i equals [B .x. A])
|
OPERATORS DESCRIBING RESTRICTION, REPLACEMENT AND MARKUP :

| A => L1 _ R1 , L2 _ R2 , ...... , Ln _ Rn
|
| |
Restriction.
Strings of language A are only allowed between a string of language L1
to the left and a string of language R1 to the right or between a
string of L2 and a string of R2 or ......
or between a string of Ln and a string of Rn.
|
| A -> B || L _ R
|
| A -> B // L _ R
|
| A -> B \\ L _ R
|
| A -> B \/ L _ R
|
| |
Simple replacement.
Every string of language A (upper side) is replaced by all
strings of language B (lower side; multiple results possible)
if and only if it occurs between a string of language L to the left
and a string of language R to the right. The four separators,
||, //, \\, \/, indicate
whether the left and right contexts, L and R, are
checked on the upper or on the lower side of the replacement.
- ||
- L on upper side, R on upper side
- //
- L on lower side, R on upper side
- \\
- L on upper side, R on lower side
- \/
- L on lower side, R on lower side
See Karttunen 1995 for
discussion and examples.
|
A11 -> B11 , ... , A1n -> B1n || L11 _ R11 , ... , L1m _ R1m ,,
A21 -> B21 , ... , A2p -> B2p || L21 _ R21 , ... , L2q _ R2q ,,
...... ,,
Ak1 -> Bk1 , ... , Akr -> Bkr || Lk1 _ Rk1 , ... , Lks _ Rks
|
| |
Parallel replacement
(obligatory, input-oriented, upper-to-lower).
Every string of language A11 (upper side) is replaced by all
strings of language B11 (lower side; multiple results possible)
and ......
and every string of A1n by all strings of B1n
if and only if it occurs between a string of L11 to the left
and a string of R11 to the right or ...... or between a string
of L1m and a string of R1m.
At the same time, every string of A21 is replaced by all
strings of language B21, etc., if it occurs between a string of L21
and a string of R21, etc.
All these replacements take place simultaneously without
interfering with each other.
Note the double comma at the end of the first three lines above
separating the groups of related replacements. A single comma
is used as the separator inside the replace and context lists.
All replacements are obligatory ("->" instead of "(->)", so that
they always take place), input-oriented ("||" instead of "//" e.g.,
so that the left and right contexts must exist on the input side)
and upper-to-lower ("->" instead of "<-" e.g., so that
the original strings are in the upper language and the replaced
strings in the lower language).
For an explicit description of all variants of parallel replacement
see the
technical report MLTT-021 (Dec. 1995).
|
|
A11 (->) B11 , ... , A1n (->) B1n || L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(optional, input-oriented, upper-to-lower).
All these replacements take place simultaneously without
interfering with each other.
They are all optional ("(->)" instead of "->", so that
they do not need but may take place, where all possibilities
of local (non-)application generate results).
All replacements are input-oriented and upper-to-lower.
|
|
A11 -> B11 , ... , A1n -> B1n // L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, rightward, upper-to-lower).
All replacements are obligatory, rightward ("//" instead of "||" e.g.,
so that the right context must exist on the input side and the
left context on the output side which gives the impression that
the replacements propagate from left to right)
and upper-to-lower.
|
|
A11 -> B11 , ... , A1n -> B1n \\ L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, leftward, upper-to-lower).
All replacements are obligatory, leftward ("\\" instead of "||" e.g.,
so that the left context must exist on the input side and the
right context on the output side which gives the impression that
the replacements propagate from right to left)
and upper-to-lower.
|
|
A11 -> B11 , ... , A1n -> B1n \/ L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, output-oriented, upper-to-lower).
All replacements are obligatory, output-oriented
("\/" instead of "||" e.g.,
so that both the left and the rigth context must exist on the output side)
and upper-to-lower.
|
|
A11 <- B11 , ... , A1n <- B1n || L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, input-oriented, lower-to-upper).
All replacements are obligatory, input-oriented
and lower-to-upper ("<-" instead of "->" e.g., so that
the original strings are in the lower language and the replaced
string in the upper language).
|
|
A11 <-> B11 , ... , A1n <-> B1n ||
L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, input-oriented, bi-directional).
All replacements are all obligatory, input-oriented
and bi-directional ("<->" instead of "->" e.g., so that
they take place from the upper to the lower side and vice versa).
|
|
[. A11 .] -> B11 , ... , A1n -> B1n ||
L11 _ R11 , ... , L1m _ R1m ,,
etc.
|
| |
Parallel replacement
(obligatory, input-oriented, upper-to-lower).
All strings of language A11 are considered as concatenations of
one-symbol substrings strictly alernating with single
empty strings, i.e. empty strings occur everywhere exactly one time
(marked by brackets [. and .]).
All strings of language A1n are considered as concatenations of
one-symbol substrings alernating with any number
of empty strings, i.e. empty strings occur everywhere any number
of times (including zero times).
NOTE:
All previous features of parallel replacements like
obligatory or optional, input-oriented,
rightward etc.,
bi-directional etc., can be combined in any manner,
however all replacements that take place in parallel
must have the same features.
For an explicit description see the
technical report MLTT-021 (Dec. 1995).
|
A @-> B A @> B
A ->@ B A >@ B |
Directed replacement.
Instances of the language A on the upper side of the
relation are replaced selectively. The selection factors each
upper language string uniquely into A and non-A
substrings. The factoring is based on
a left-to-right (or right-to-left) and longest match (or shortest
match) regimen. For the left-to-right, longest match version,
@->, the starting locations are selected from
left-to-right. If there are overlapping instances of A
starting at the same location, only the longest one is replaced
by all strings of B.
With @> it is the shortest match from the left, with ->@ the
longest match from the right and with >@ the shortest match from
the right.
|
A @-> L ... R
A @> L ... R
A ->@ L ... R
A >@ L ... R |
Directed markup.
Instances of of the language A on the upper side of the
relation are selected for markup under the directionality and
longest match regimen described above. Each selected A
string is marked by inserting all strings of R to its left
and all strings of S to its right. The selected A
strings themselves remain unchanged, along with the non-A
segments.
|
OPERATORS FOR READING NETWORKS FROM FILES :

| @"filename" |
Read a network from a binary file. |
| @bin"filename" |
Read a network from a binary file. |
| @re"filename" |
Read a network from a regular expression file. |
| @txt"filename" |
Read a network from a text file. |
| @stxt"filename" |
Read a network from a spaced text file. |
| @pl"filename" |
Read a network from a prolog file. |
If there are errors in the called file (concerning syntax,
file type, file handling, empty file, etc.)
a null_fsm (i.e. ~[?*]) is generated,
and an error message is printed.
If a binary or regex or prolog file contains more than one network,
only the first one is used and a warning is printed.
ORDER OF PRECEDENCE :

The following list states the order of precedence of all above operators.
Operators of same precedence are executed from left to right,
except the prefix operators ( ~ \ $ $? $. ) that are
executed from right to left.
To enforce another order use angle brackets [ ].
The list begins with the operators of highest precedence,
i.e. with the most tightly binding ones.
Operators of same precedence are on the same line.
:
~ \ $ $? $.
+ * ^ .1 .2 .i .r
/
concatenation
> <
| & -
=> -> (->) @-> etc.
.x. .o.
For more detailed information please look at our
pointers to finite-state literature. We welcome
your comments and suggestions.
Back to Finite State Technology
homepage
|
 |