Home page Site map Contact
  

 

MORE ABOUT SYNTAX OF REGULAR EXPRESSIONS

If you are not already familiar with the topic of regular expression start with the introduction to regular expressions. This page is intended as a terse but comprehensive summary of the Xerox Regular Expression language.

On this page you will find:

 

SYMBOLS :

Except for epsilon and other, a single symbol represents a string that consists of 16-bit Unicode characters. The string is terminated by a pair of null characters.

a Single symbol a. An ordinary letter without double quotes around it is interpreted as a single character C string of the Latin-1 script. It is expanded to a string of 16-bit characters by adding an initial zero byte to each character, including the terminating null character.
"a" Single symbol a. Double quotes have no special meaning for symbols composed of ordinary letters.
"\n" Single symbol consisting of a newline character. Inside double quotes, \n, \t, \v, \b, \r, \f, and \a are interpreted according to Unix conventions. \n = newline, \t = tab, \b = backspace, \r = carriage return, \f = formfeed, \a = alert.
"\142" Single symbol b. Inside double quotes, \o, \oo, \ooo are interpreted as octal character codes.
"\x63" Single symbol c. Inside double quotes, \xx is interpreted as a hexadecimal characater code.
"\u0064" Single symbol d. Inside double quotes, \uxxxx is interpreted as a two-byte Unicode character (U+xxxx in official Unicode notation).
%+ab%- Multi-character symbol +ab-. The % means that the following symbol will be taken literally. Otherwise, + and - would be interpreted as regular-expression operators.
"+ab-" Multi-character symbol +ab-. Symbols enclosed in double quotes can contain characters that otherwise are used as operators (& | - + * etc.). As shown in the examples above, inside double quotes any character can be written as an octal or hexadecimal number, or as a two-byte Unicode character encoded as two hexadecimal numbers. Standard Unix conventions for newline, tab, etc. are also legal.
0 or "" Epsilon symbol, the empty string.
? Any symbol. If interpreted as a pair, ? denotes the identity relation. Note that in a displayed network ? only represents unknown symbols. In a regular expression, ? in addition covers all symbols that explicitly occur in the regular expression.

 

SYMBOLS PAIRS :

A symbol pair is made of two single symbols: the upper and the lower symbol. In writing and displaying symbol pairs, we separate the upper symbol from the lower symbol with a colon. With the exception of ?:? (any mapped to any), we treat identity pairs as equivalent to the corresponding single symbol in the lower language.

a:b Symbol pair with a in the upper language and b in the lower language.
"\n":0 Symbol pair consisting of the newline in the upper language and empty string (= epsilon) in the lower language.
a:a Symbol pair a:a. Identity pairs can also be represented by a single symbol. The regular expression compiler regards a:a and a as equivalent.
?:? The relation that consists of any symbol mapped to any symbol. In contrast, ? designates the identity relation when interpreted as symbol pair.

 

BRACKETS AND PARENTHESES :

[A] Same language as A. These brackets control the order in which operations are performed.
(A) Optionality (union of the language A with the empty language).

 

OPERATIONS ON REGULAR LANGUAGES :

Some regular expressions operators require that the component expressions denote a regular language. The result denotes a regular language. They are not applicable to relations.

~A Set of all strings that are not in the language A. (~a contains "aa" and "aaa" but not "a".)
\A Set of all single-symbol strings that are not in the language A. (\a contains "b" and "c" but not "a" and no strings with more than one symbol like "aa" or "aaab".)
A & B Intersection of the languages A and B. Set of strings that are in A and in B.
A - B A minus B. Set of strings that are in A and not in B.

 

OPERATIONS ON REGULAR LANGUAGES AND RELATIONS :

For some operators, the component expressions may denote either languages and the relations. If one of the component expressions denotes a relation, the result also designates a relation.

$A Contains. $A is language of strings or pairs of strings that each contain at least one instance of A somewhere. $A is equivalent to [?* A ?*].
$.A Contains one. Set of all strings containing exactly one occurrence of a string from the language A.
$?A Contains at most one. Set of all strings containing at most one occurrence of a string from the language A.
A+ Kleene plus. Concatenation of one or more strings from the language A.
A* Kleene star. Concatenation of zero or more strings from the language A.
A/B A ignoring B. Set of all strings that would be in A if all substrings of B (regarded as noise) were removed. (a/x contains "a", "xa", "xaxx", etc.)
A ./. B A ignoring internally B.  Definition:  A./.B = [A/B] - [B ?* | ?* B] (e.g. [a a a]/x contains "aaa", "aaxa", "axxxaxa", etc.)
A B Concatenation of the languages A and B. (a b is the concatenation of the two single-symbol strings "a" and "b". However, ab without a blank inbetween is a string of the single multi-character symbol "ab".)
A | B Union of the languages A and B. Set of strings that are in A or in B or in both languages.
R .P. Q Upper-side priority union. Union of all string pairs from the relation R with all those string pairs from Q that have an upper string that is not in R. Alias: R .|u. Q  (Definition: R .|u. Q = R | [Q .-u. R] ).
R .p. Q Lower-side priority union. Union of all string pairs from the relation R with all those string pairs from Q that have a lower string that is not in R. Alias: R .|l. Q  (Definition: R .|l. Q = R | [Q .-l. R] ).
R .-u. Q Upper-side minus. Set of all string pairs from the relation R minus those that have an upper string that is also in Q.  (Definition: R .-u. Q = [R.u - Q.u] .o. R).
R .-l. Q Lower-side minus. Set of all string pairs from the relation R minus those that have a lower string that is also in Q.  (Definition: R .-l. Q = R .o. [R.l - Q.l]).
A < B A before B. Set of strings where no substring from B can precede any substring from A.
A > B A after B. Set of strings where no substring from B can follow any substring from A.
A <> B Shuffle of A and B. (e.g. [a b <> x y ] contains the strings axyb, axby, abxy, xayb, xaby and xyab.)
A^n n concatenations of A. (e.g. [a | b b]^3 contains the strings aaa, aabb, abba, abbbb, bbaa, etc.)
A^{n,k} n to k concatenations of A. (e.g. a^{3,5} contains exactly the strings aaa, aaaa, aaaaa.)
A^>n more than n concatenations of A. (e.g. a^>3 contains the strings aaaa, aaaaa, aaaaaa, etc.)
A^<n less than n concatenations of A. (e.g. a^<4 contains exactly the strings aaa, aa, a, and the empty string.)
A.r Reverses the regular language A. (e.g. [a b c].r equals c b a)
`[[A], s, sl] Substitution of all symbols s (being a single symbol or a constituent of a symbol pair) in the language or relation A by all symbols of the list sl. (e.g. `[[a b:0 b:x c], b, p q] gives [a [p:0 | q:0] [p:x | q:x] c].)

 

OPERATORS FOR DESCRIBING REGULAR RELATIONS :

A .x. B Crossproduct of the regular languages A and B. It denotes a regular relation that maps every string in A (upper side) to all strings in B (lower side), and vice versa.
A .o. B Composition of two regular relations A and B. The upper language of A gets directly related to the lower language of B. The intersection of the lower language of A and the upper language of B works like a filter without being present in the result of the composition. (e.g. [a:i | b:j] .o. [i:x | k:y] gives the result [a:x] because of a:i and i:x. Everything else is filtered out by the intersection [i|j] & [i|k] allowing only i on the (disappearing) intermediate level.)
R.u Denotes the upper language of the regular relation R. Alias: R.1 (e.g. [A .x. B].1 equals A)
R.l Denotes the lower language of the regular relation R. Alias: R.2 (e.g. [A .x. B].2 equals B)
R.i Inverts the regular relation R. (e.g. [A .x. B].i equals [B .x. A])

 

OPERATORS DESCRIBING RESTRICTION, REPLACEMENT AND MARKUP :

A => L1 _ R1 , L2 _ R2 , ...... , Ln _ Rn
Restriction. Strings of language A are only allowed between a string of language L1 to the left and a string of language R1 to the right or between a string of L2 and a string of R2 or ...... or between a string of Ln and a string of Rn.
A -> B || L _ R
A -> B // L _ R
A -> B \\ L _ R
A -> B \/ L _ R
Simple replacement. Every string of language A (upper side) is replaced by all strings of language B (lower side; multiple results possible) if and only if it occurs between a string of language L to the left and a string of language R to the right. The four separators, ||, //, \\, \/, indicate whether the left and right contexts, L and R, are checked on the upper or on the lower side of the replacement.
||
L on upper side, R on upper side
//
L on lower side, R on upper side
\\
L on upper side, R on lower side
\/
L on lower side, R on lower side
See Karttunen 1995 for discussion and examples.
A11 -> B11 , ... , A1n -> B1n || L11 _ R11 , ... , L1m _ R1m ,,
A21 -> B21 , ... , A2p -> B2p || L21 _ R21 , ... , L2q _ R2q ,,
...... ,,
Ak1 -> Bk1 , ... , Akr -> Bkr || Lk1 _ Rk1 , ... , Lks _ Rks
Parallel replacement (obligatory, input-oriented, upper-to-lower). Every string of language A11 (upper side) is replaced by all strings of language B11 (lower side; multiple results possible) and ...... and every string of A1n by all strings of B1n if and only if it occurs between a string of L11 to the left and a string of R11 to the right or ...... or between a string of L1m and a string of R1m.
At the same time, every string of A21 is replaced by all strings of language B21, etc., if it occurs between a string of L21 and a string of R21, etc.
All these replacements take place simultaneously without interfering with each other.
Note the double comma at the end of the first three lines above separating the groups of related replacements. A single comma is used as the separator inside the replace and context lists.
All replacements are obligatory ("->" instead of "(->)", so that they always take place), input-oriented ("||" instead of "//" e.g., so that the left and right contexts must exist on the input side) and upper-to-lower ("->" instead of "<-" e.g., so that the original strings are in the upper language and the replaced strings in the lower language).
For an explicit description of all variants of parallel replacement see the technical report MLTT-021 (Dec. 1995).
A11 (->) B11 , ... , A1n (->) B1n || L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (optional, input-oriented, upper-to-lower). All these replacements take place simultaneously without interfering with each other.
They are all optional ("(->)" instead of "->", so that they do not need but may take place, where all possibilities of local (non-)application generate results). All replacements are input-oriented and upper-to-lower.
A11 -> B11 , ... , A1n -> B1n // L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, rightward, upper-to-lower). All replacements are obligatory, rightward ("//" instead of "||" e.g., so that the right context must exist on the input side and the left context on the output side which gives the impression that the replacements propagate from left to right) and upper-to-lower.
A11 -> B11 , ... , A1n -> B1n \\ L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, leftward, upper-to-lower). All replacements are obligatory, leftward ("\\" instead of "||" e.g., so that the left context must exist on the input side and the right context on the output side which gives the impression that the replacements propagate from right to left) and upper-to-lower.
A11 -> B11 , ... , A1n -> B1n \/ L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, output-oriented, upper-to-lower). All replacements are obligatory, output-oriented ("\/" instead of "||" e.g., so that both the left and the rigth context must exist on the output side) and upper-to-lower.
A11 <- B11 , ... , A1n <- B1n || L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, input-oriented, lower-to-upper). All replacements are obligatory, input-oriented and lower-to-upper ("<-" instead of "->" e.g., so that the original strings are in the lower language and the replaced string in the upper language).
A11 <-> B11 , ... , A1n <-> B1n || L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, input-oriented, bi-directional). All replacements are all obligatory, input-oriented and bi-directional ("<->" instead of "->" e.g., so that they take place from the upper to the lower side and vice versa).
[. A11 .] -> B11 , ... , A1n -> B1n || L11 _ R11 , ... , L1m _ R1m ,, etc.
Parallel replacement (obligatory, input-oriented, upper-to-lower). All strings of language A11 are considered as concatenations of one-symbol substrings strictly alernating with single empty strings, i.e. empty strings occur everywhere exactly one time (marked by brackets [. and .]).
All strings of language A1n are considered as concatenations of one-symbol substrings alernating with any number of empty strings, i.e. empty strings occur everywhere any number of times (including zero times).
NOTE: All previous features of parallel replacements like obligatory or optional, input-oriented, rightward etc., bi-directional etc., can be combined in any manner, however all replacements that take place in parallel must have the same features.
For an explicit description see the technical report MLTT-021 (Dec. 1995).
A @-> B
A @> B
A ->@ B
A >@ B
Directed replacement. Instances of the language A on the upper side of the relation are replaced selectively. The selection factors each upper language string uniquely into A and non-A substrings. The factoring is based on a left-to-right (or right-to-left) and longest match (or shortest match) regimen. For the left-to-right, longest match version, @->, the starting locations are selected from left-to-right. If there are overlapping instances of A starting at the same location, only the longest one is replaced by all strings of B. With @> it is the shortest match from the left, with ->@ the longest match from the right and with >@ the shortest match from the right.
A @-> L ... R
A @> L ... R
A ->@ L ... R
A >@ L ... R
Directed markup. Instances of of the language A on the upper side of the relation are selected for markup under the directionality and longest match regimen described above. Each selected A string is marked by inserting all strings of R to its left and all strings of S to its right. The selected A strings themselves remain unchanged, along with the non-A segments.

 

OPERATORS FOR READING NETWORKS FROM FILES :

@"filename" Read a network from a binary file.
@bin"filename" Read a network from a binary file.
@re"filename" Read a network from a regular expression file.
@txt"filename" Read a network from a text file.
@stxt"filename" Read a network from a spaced text file.
@pl"filename" Read a network from a prolog file.

If there are errors in the called file (concerning syntax, file type, file handling, empty file, etc.) a null_fsm (i.e. ~[?*]) is generated, and an error message is printed.
If a binary or regex or prolog file contains more than one network, only the first one is used and a warning is printed.

 

ORDER OF PRECEDENCE :

The following list states the order of precedence of all above operators. Operators of same precedence are executed from left to right, except the prefix operators ( ~ \ $ $? $. ) that are executed from right to left.
To enforce another order use angle brackets [ ].
The list begins with the operators of highest precedence, i.e. with the most tightly binding ones. Operators of same precedence are on the same line.

:
~ \ $ $? $.
+ * ^ .1 .2 .i .r
/
concatenation
> <
| & -
=> -> (->) @-> etc.
.x. .o.

 

For more detailed information please look at our pointers to finite-state literature. We welcome your comments and suggestions.

Back to Finite State Technology homepage