![]() |
|
|
|
|
![]() |
|
|
|
|
|
|
|
|
|
|
SYNTAX AND SEMANTICS OF REGULAR EXPRESSIONS If you already know what regular expressions are or you only want to find out some details about our notation or about certain specific operators, you should skip the introduction and go directly to the relevant section. If you are a novice reader who is not completely familiar with the concept of a regular expression, start with the introduction.
This document describes a subset of the available regular expression operators. See More about Regular Expressions for a complete reference
The language of regular expressions is a formal language, similar to formulas of Boolean logic. It has a simple syntax but the expressions can be arbitrarily complex. Like formulas of Boolean logic, regular expressions denote sets. We assume that the reader is familiar with the basic concepts of set theory (membership, union, intersection, ordered pair, etc.).
The simplest kind of regular expression consists of a single symbol. For example, expression
denotes the set that contains the string "a" and nothing else. To distinguish between regular expressions and the sets they denote, we adopt a typographical convention: the expressions appear in fixed typewriter font; strings are enclosed in double quotes. A special kind of string, the empty string, is expressed with an empty pair of double quotes: "". We need to distinguish two kinds of sets. A set of strings is called a language. The expression above, a, denotes the language consisting of just the string "a". As in set theory, a set consisting of pairs of strings is called a relation. The terms regular language and regular relation refer to sets that can be described by a regular expression. The simplest kind of regular expression that denotes a relation is a symbol pair, such as
It denotes the set consisting of the ordered pair <"a", "b">. A regular relation may always be viewed as a mapping between two regular languages. In this case, the relation is simply the crossproduct of the languages denoted by the expressions a and b. In order to distinguish the two languages that are involved in a regular relation, we call the first one the upper and the second one the lower language of the relation. Correspondingly, in the pair a:b, the first symbol, a, is called the upper symbol and the second one, b, the lower symbol. The two components of a symbol pair are separated by a colon, :, without any whitespace before or after.
Complex regular expressions can be built up from simpler ones by means of regular expression operators. Whitespace is used as the concatenation operator. For example,
the concatenation of a and b, denotes the language that contains just the string "ab". In general, the concatenation of two regular languages consists of strings that extend each string of the first language with all the strings of the second language. The union operator, |, constructs a regular language that includes all the strings of the component languages. For example,
denotes the language that contains the strings "a" and "b", but not "ab". A regular expression can be enclosed in square brackets without changing its interpretation. For example, [a | b] and [a] | [b] describe the same language as a | b. Brackets are sometimes necessary to ensure that the expression has the intended meaning. Note that
represents the language of "aba" and "abb", whereas the language described by
contains the strings "aba" and "b". By convention, concatenation takes precedence over union if the ambiguity is not resolved by bracketing. (More about that topic in the section on Operator precedence.) The minus operator, -, subtracts all the strings of one language from another language. Thus
contains just the string "a". This complex expression is equivalent to the simple expression [a], that is, they denote the same language.
Empty string, Null, and Universal languages An empty pair of brackets, [], is interpreted as denoting the empty-string language. The only string in this language is "", the empty string. Concatenating something with the empty-string language does not give rise to any new strings. Thus [a []] denotes the same language as [a]. On the other hand, the empty string makes a difference for the union operator. The language [a | []] contains two strings, "a" and "". Enclosing a regular expression in round parentheses (as opposed to square brackets) represents a union with the empty-string language. Consequently, (a | b) is the same language as [a | b | []]. Less formally, a regular expression like (b) represents an optional occurrence of "b", i.e. either "b" or the empty string. Subtracting the empty-string language (or any other language) from itself, [[] - []], yields a language that contains no strings at all. We call this the null language. Because the null language has no strings, subtracting or adding it to another language obviously has no effect. Concatenation is different. If one of the component expressions in concatenation is the null language, so is the result. Another expression for the null language is mentioned in the discussion of the term complement operator. The opposite of the null language is the universal language. The universal language contains all strings of any length including the empty string. To see examples of expressions that denote the universal language, see the discussion of the complement and Kleene star operators.
To make the notation less cumbersome, we systematically ignore the distinction between a language and the identity relation that maps every string of the language to itself. For example, the expression [a | b] can be used to denote either the language consisting of "a" and "b" or the relation that maps "a" to "a" and "b" to "b". For the same reason, symbol pairs with identical upper and lower symbols may be collapsed into a simple symbol. Thus we can write a:a simply as a. We also do not always carefully distinguish between a regular expression and the language it denotes. Occasionally we speak sloppily of "the language [a | b]" in the sense of "the language denoted by the regular expression [a | b]." Compilation of regular expressions A regular expression can be converted into a finite-state network that encodes the language or the relation denoted by the expression. In general, each symbol or symbol pair in the expression is represented in the network by an arc with the corresponding label. For example, the regular expression [a] compiles into a network that contains a single arc labeled a. On the other hand, the relationship between complex regular expressions and the corresponding networks is typically much less transparent. For instance, there is no arc labeled with a in the network that encodes the language [\a], the union of all single symbol strings other than "a". For more information on this topic, please read the article on finite-state networks.
Epsilon and Any Two regular expression symbols have a special interpretation:
A regular expression consisting of just the epsilon symbol, 0, denotes the same empty-string language as []. The epsilon symbol is also used as the upper or the lower symbol of a symbol pair. The expression 0:a maps "" to "a", and vice versa. The inverse relation is denoted by a:0. In order to understand the special interpretation of the any symbol in regular expressions, it is useful to take note of the fact that the expression [?] represents the same language as [? | a]. The reason is simple. Because ? stands for any symbol whatsoever, the language denoted by [?] contains all single symbol strings, including "a". Consequently, the union [? | a] does not result in the addition of any new strings. Similarly, [a ?], the concatenation of [a] and [?], is equivalent to [a [? | a]]. In the network corresponding to a complex regular expression, an instance of ? in the expression generally does not correspond to a single arc. The compiler interprets ? by constructing one arc labeled with ? and a duplicate arc for every other symbol in the expression. Consequently, in the resulting network the ? arcs only have the role of representing unknown symbols. The difference between ? as a regular expression symbol (any) and ? as an arc label in a network (unknown) is an unfortunate source of confusion for novice users. See the discussion of ? as the unknown symbol in the article on finite-state networks. In a later section we introduce one more special symbol, the boundary marker .#., to refer to the beginning or the end of a string in certain expressions. The escape character, %, eliminates any special meaning of the following character. Thus |, 0, ?, etc. can be used as ordinary symbols by prefixing them with %. For example, %0 represents the string "0" rather than the epsilon symbol; %| is the vertical bar itself, as opposed to the the union operator |. The ordinary percent sign may be expressed as %%. The special meaning of any character may also be turned off by enclosing the symbol in double quotes: "0" and "?" are interpreted as ordinary letters. On the other hand, certain other strings receive a special interpretation within a doubly quoted string. For example, "\n" is interpreted as the newline character, "\t" as a tab, etc. following C conventions. The backslash is also used as a prefix in octal and hexadecimal numbers: "\101" and "\x41" are alternate ways to specify the symbol A. The print name of a symbol can contain more than one character. For example, %+V represents the string "+V". Beware of multicharacter symbols that coincide with a string concatenated from shorter symbols. For example, the expression [a a | aa] denotes a language that contains the string "aa" twice. One is a single symbol and the other is the concatenation of two "a"s. If such ambiguities arise in practice, our system resolves them left-to-right in favor of the longest symbol. Sometimes it is easier to write characters of words together, especially when there is many. For notational convenience, the
Because we intentionally ignore the distinction between a language and its identity relation, a pair of identical symbols may be collapsed into a single symbol. Instead of a:a we may simply write a.
In addition to concatenation, union, and minus, mentioned in the Introduction, there are many other regular expression operators. The list given here is representative but not exhaustive (see More about Regular Expressions). Some of the operators below can be combined with all expressions regardless of whether they denote languages or relations. Concatenation and union are of this kind. Other operators presuppose that the component expression is of a particular type. For example, complementation, intersection, and minus can be used only with expressions that denote a simple language because regular relations are not closed with respect to these operations. Composition presupposes that the component expressions denote relations. For each of the operators described below, we specify what kind of expressions it combines with.
We describe here some more complex regular expressions including additional variants of the replacement expressions discussed above.In all of these cases, the L and R components are optional. The absence of the L or R part in these expressions is interpreted as denoting empty-string language.
In the restriction, =>, and conditional replacement, ->, @->, expressions we can use a special boundary marker, .#., to refer to the beginning or to the end of a string. In the left context, the boundary marker signals the beginning of the string; in the right context it means the end of the string. For example, the restriction expression [a => _ .#.] denotes the language where "a" appears only at the end of a string. The boundary marker is not allowed as a constituent in a symbol pair; otherwise it can be used in a regular expression like any other symbol. For example, in the replacement expression [a -> b || [.#. | a] _ ] the union in the left context indicates that "a" is to be replaced by "b" at the beginning of a string and after another "a". Complex regular expressions can often be simplified by taking advantage of a convention that ranks the operators in a precedence hierarchy. For example, [\a] [b]* may be expressed more concisely as \a b* because the unary operators, term complement and Kleene star, by convention 'bind more tightly' to their component expressions than concatenation. Similarly, the brackets in a | [b c] are unnecessary because concatenation takes precedence over union; a | b c denotes the same language. In turn, union has precedence over replacement, allowing us to simplify a -> [b | c] to a -> b | c. The ranking of all the operators discussed in the previous sections is summarized in the table below. (High to Low.)
Unbracketed expressions with operators that have the same rank are disambiguated from left to right. Consequently, a | b & b* is interpreted as [a | b] & b* and not as a | [b & b*]. For the sake of clarity, we advise using brackets in such cases even if the ambiguity would be resolved in the intended way by the left-to-right ordering.
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||