![]() |
|
|
|
|
![]() |
|
|
|
|
|
|
|
|
|
|
EXAMPLES OF NETWORKS AND REGULAR EXPRESSIONS This page contains examples to illustrate the compilation of finite-state networks from regular expressions. You can copy and paste the formulas directly into the main window of the Finite-State Compiler page. If you can suggest interesting or amusing examples to add to this page, we would like hear from you!
Elision and Contraction Noun Phrase Parser Syllabification Plus/Minus One Soft-drink Machine Better Soft-drink Machine Soundex Roman/Arabic Translator Roman Plus/Minus One Einstein's Puzzle Einstein II A lexical transducer is a lemmatizer that relates inflected surface forms to the corresponding lexical forms. The lexical form consists of the canonical citation form of the word and some number of morphological tags that describe the inflection. The expression in Figure 1.1 compiles into a transducer that encodes a small subset of a lexical transducer for English. The size of the network is 16 states, 22 arcs.
Figure 1.1: A lexical transducer Lexical transducers are typically derived by composing a network of lexical forms with rule transducers that encode the regular morphological alternations of the language (Karttunen 1994). For the sake of illustrating the concept, the expression in Figure 1.1 simply maps each lexical form to the corresponding surface form with the help of the crossproduct (.x.) operator and constructs the union of the pairs. The lexical forms constitute the upper language of the relation, the surface forms are on the lower side. Consequently, applying the transducer 'downwards' to a lexical form generates the corresponding surface form; 'upwards' application to a surface form produces an analysis. Note that some surface forms correspond to more than one lexical form. Figure 1.2 illustrates the application of the transducer in both directions.
Figure 1.2: Generation and Analysis The tags used in this example come from the well-known Brown corpus: +VBD means the past tense of a verb, +JJ is adjective, etc. Note that +VBD and +JJ are single symbols.
A tourist arriving in Paris is often bewildered by the apparent inconsistency in the naming of the numerous local railroad stations. Why is it "Gare du Nord" but "Gare de Lyon"? Why "Gare de l'Est" but "Gare d'Austerlitz"? ("Gare" means 'station'.) The Parisians explain that this variation reflects two productive morphological alternations in French:
Knowing this does not help unless you also know that 1 takes precedence over 2 in cases where both are applicable. That is the reason why there is no "du" in "Gare de l'Est". One also must know that there is no "le" in "Gare de Lyon" and "Gare d'Austerlitz" since "Lyon" and "Austerlitz" are proper names. Both types of alternation can be modeled easily as conditional replacements (Karttunen 1995). We need to formulate the replacements in such a way that they describe the relation correctly both in isolation ("de le") and in the context of a longer string "Gare de le Est". Consequently, we must be careful about the beginning and end of strings in stating the conditions. With suitable definitions for 'Vowel' and 'H', the elision rule could be formalized simply as in Figure 2.1: Here the percent sign causes the following space to be to interpreted as a symbol rather than as whitespace. This expression relates "e " to "'" (a string consisting of a single quote) when it is preceded by a word-initial "c", "d", "l", or "s" and followed by a vowel or "h". The expression [% |.#.] denotes the beginning of a word; that is, a preceding blank space or a string boundary. Consequently, "de ", "le ", etc. do not alternate when they are part of some longer word. The actual statement in Figure 2.3 is a little more complicated because the vowels are spelled out explicitly as small and capital letters. The contraction rule can be expressed as the replacement in Figure 2.2.
Figure 2.2: Contraction of "de le" to "du" (rule 2) As before, the context [% |.#.] indicates that "de" and "le" must be whole words, preceded and followed by a space or a string boundary. Both rule 1 and rule 2 are applicable in cases such as "Gare de le Est". Because the correct result is "Gare de l'Est" and not "Gare du Est", rule 1 has precedence. In other words, the transducer for rule 1 is applied first and the transducer for rule 2 is applied to the output from rule 1.
Figure 2.3: Sequential rule application Another way to get the right effect is to derive a single transducer by composition, as shown in Figure 2.4. By placing rule 1 above rule 2 in the composition, we ensure that the application of the resulting single transducer has exactly the same effect as the sequential application of the original transducers in Figure 2.3.
Figure 2.4: Composition of Elision and Contraction The size of the network is 19 states, 192 arcs. Adding the missing accented vowels would increase the number arcs but not the number of states. The transducer maps the lexicalized forms "Gare de le Est" and "Gare de le Nord" unambiguously to "Gare de l'Est" and "Gare du Nord", But this does not completely solve the problem for the foreign tourist. See Figure 2.5.
Figure 2.5: Unambiguous generation vs. ambiguous analysis The relation is not unambiguous in the other direction. The upward application of the transducer in Figure 2.5 to "Gare de l'Est" gives two results: "Gare de le Est" and "Gare de l'Est". That is, rule 1 in itself does not guarantee that "l'Est" can only arise in French as the result of elision. Similarly, "Gare du Nord" yields both "Gare de le Nord" and "Gare du Nord" because rule 2 does not say that every instance of "du" is a contraction of "de le". In fact there is another "du" in French. And yet another puzzle for the foreigner: why the "de" in "Gare d'Austerlitz" (cf. "Gare de Lyon" vs. "Gare Montparnasse") when there no trains from Gare d'Austerlitz to Austerlitz? The problem of giving Elision priority over Contraction could also be solved in another way. With a slight reformulation of the contraction rule into Contraction2, we can get the correct result regardless of the order of application. We just need to refine the context of the "de le" to "du" rule in such a way that it has no effect in the environment where the "le " elides to "l'". That is, we restrict Contraction2 so that "de le" does not contract to "du" in front of a word that begins with a vowel or h. Figure 2.6 shows the alternative solution to the problem. Just to demonstrate that the reformulation of the contraction rule makes the ordering irrelevant, we now compose the transducers in the opposite order.
Figure 2.6: Composition of Contraction2 and Elision The difference between two formulations of the contraction rule is that in Figure 2.6, the right context, by virtue of the expression [% ~[[A|E|I|O|U|H|a|e|i|o|u|h] ?*] .#.], excludes any sequence where the space after "le" is followed by a string beginning with a vowel or "h". The reader may be interested in verifying that the expressions in Figure 2.4 and 2.6 compile into exactly the same 19 state, 192 arc transducer. Because the order of application of Elision and Contraction2 makes no difference, the two transductions could also be done parallel. To try that alternative, replace the composition operator .o. in Figure 2.6 by a double comma. That is, the parallel replacement should have the form A -> B || X _ Y ,, C -> D || Z _ W . This example illustrates that fact that the interaction of two phonological or orthographical alternations can in general be resolved with or without explicit ordering, although one or the other of the alternative solutions may seem more natural or simpler in a particular case. As phonologists know, much ink has been spilled over this issue. Karttunen 1991 discusses similar problems from the perspective of two-level rules (Koskenniemi 1983).
Finite-state networks have been used since the late 1950s for lightweight syntactic analysis. To illustrate the basic idea, we represent text as strings of part-of-speech tags, using "d" for determiner, "a" for adjective, "n" for noun, and "v" verb, etc. In other words, in this example the regular expression symbols represent whole words rather than single letters in a text. Let us assume that a noun phrase consists of an optional determiner, any number of adjectives, and one or more nouns. This simple definition can be expressed as a regular expression: [(d) a* n+]. This expression denotes an infinite set of strings, such as "n" (cats), "aan" (discriminating aristocratic cats), "nn" (cat food), "dn" (many cats), "dann" (that expensive cat food) etc. A simple noun phrase parser can be thought of as a transducer that inserts markers, say, a pair of braces { }, around noun phrases in a text. The task is not as trivial as it seems at first glance. Consider the expression
Figure 3.1: A simple NP parser The resulting 4-state transducer, shown in Figure 3.2 inserts curly brackets around any expression matching the [(d) a* n+] pattern. Note that ? here matches symbols, such as "v", that are not in the alphabet (sigma) of the network.
Figure 3.2: [(d) a* n+ -> %{ ... %}] Applied to the input "danvn" (many small cats like milk) this transducer yields three alternative bracketings:
Figure 3.3: Three analyses of "danvn" The reason for the ambiguity is that the transducer may loop in the start state over the initial "d" and "da" strings before inserting the first left bracket. This correct, of course, because "n", "an", and "dan" are all instances of the [(d) a* n+] pattern. For certain applications it may be desirable to produce a unique parse, marking the maximal expansion of each NP: "{dan}v{n}". Using the left-to-right, longest-match replace operator @-> instead of the simple replace operator -> in Figure 3.1 yields the desired result.
Figure 3.4: Left-to-right, longest-match NP parser The corresponding network consists of 8 states and 28 arcs. The diagram is shown in Figure 3.5.
Figure 3.5: [(d) a* n+ @-> %{ ... %}] As illustrated in Figure 3.6, the string "danvn" is now unambiguously transduced to "{dan}v{n}", ignoring the embedded smaller noun phrases. The corresponding path in the transducer is <0 0:{ 7 d 3 a 3 n 4 0:} 5 ? 0 0:{ 7 n 4 0:} 5>.
Figure 3.6: Left-to-right, longest-match markup With longer input strings, the difference between the two parsers becomes even more striking. Whereas the left-to-right, longest match parser in Figure 3.5 maps the input "dannvaan" unambiguously to "{dann}v{aan}, the simple NP parser in Figure 3.2 produces 18 alternative analyses for the same input. The brackets inserted by the noun phrase parse can in turn be used to define larger syntactic units, such as prepositional phrases and verb phrases. A composition of two or more such transducers creates a transducer that produces nested bracketings.
A syllable is a unit of speech that consists of a vocalic center, called nucleus, surrounded by a consonantal onset and a consonantal coda, one or both of which may be empty. In languages such as Finnish, where the orthography is closely related to the pronounciation, the general structure of a syllable in written texts can be described by the regular expression [C* V+ C*]. In other words, the onset and the coda consist of zero or more consonants, C, the nucleus of one or more vowels, V. We ignore here the specific constraints on the three components. For instance, if the onset or the coda consist of several consonants, the sonority level tends to diminish in proportion to the distance from the nucleus: "tri" and "irt" are possible syllables but "rti" and "itr" are less acceptable. The nucleus may consist of a single vowel, "a", a double long vowel, "aa", or a diphthong, such as "ai". On the other hand, in quotations of spoken Finnish, one may see words that do not conform to these principles. For instance, the emphatic lengthening of a syllable may be represented by a burst of letters: "jooo" ('yeees'), "perrrkele" ('shiiit'). The general template [C* V+ C*] allows for such exceptions. In the following, we will use a single hyphen to mark the syllable boundaries. Without exception, a single consonant between two vowels is grouped with the following vowel as the onset: V - C V. Any additional intervocalic consonants belong to the coda of the preceding syllable: V C C - C V, as in "myrs-ky" ('storm'), "kilt-ti" ('nice'). Word-internal syllables of native Finnish words have a single initial consonant; the first syllable of the word may begin with a cluster of consonants. For example, the word "strukturalismi" syllabifies into "struk-tu-ra-lis-mi". A simple syllable parser based on these principles can be derived with the expression in Figure 4.1.
Figure 4.1: Schematic syllable parser This formula denotes a relation that inserts a hyphen in front of a consonant vowel sequence after a maximal expansion of the [C* V+ C*] pattern. This expression contains the same left-to-right, longest-match operator as the simple NP parser in the preceding example but here we need an additional constraint on what must follow the inserted marker. The corresponding transducer is shown in Figure 4.2.
Figure 4.2: [[C* V+ C*] @-> ... "-" || _ [C V]] Figure 4.3 illustrates the application of the parser on a skeletal CV structure.
Figure 4.3: Application of the schematic syllable parser. To use the syllable parser on real text, we of course need to replace C by the set of consonant symbols and V by vowels. The most convenient way to accomplish that is to use a substitution expression, that is, a regular expression of the form `[[RE], SYMBOL, LIST-OF-SYMBOLS]. This denotes the language or a relation obtained by replacing the every instance of SYMBOL in RE by the union of the symbols on the list. Figure 4.4 shows the expression from which the final transducer can be compiled.
Figure 4.4: Syllable parser The transducer derived from the expression in 4.4 has the same five states as the network in Figure 4.3 but many more arcs. Figure 4.5 illustrates the application of the transducer to a test word.
Figure 4.5: Application of the syllable parser The hyphenation of words in Finnish texts generally follows the natural syllabification pattern. However, hyphenation rules also take into account other information, in particular, the compound boundaries. For example, in words like "autonajaja" ('car driver'), the first hyphenation point is at the compound boundary, "auton-ajaja", regardless of the actual pronounciation. In many cases correct hyphenation is difficult to achieve automatically because of ambiguities such as "auto-nosto" ('car lift') vs. "auton-osto" ('car purchase'), which cannot be resolved without real understanding of the text.
In this example we construct a simple transducer that performs addition and subtraction by 1 on binary numbers. A downward application of the transducer corresponds to subtraction, upward application is addition. For example, "1100" (twelve) should get mapped to "1101" (thirteen) upwards, to "1011" (eleven) downwards. As the first step towards constructing a description of the relation, let us consider the regular expression in Figure 5.1. It denotes a parallel replacement of 1 by 0 and 0 by 1 when the string to the right is empty or consists of nothing but zeros up to the end, that is, to the end of the string, .#., or to something that is neither zero nor one, [\[%0|1]].
Figure 5.1: Binary adder/subtracter for 1 (first try). This is almost correct. For example, it maps "1100" to "1011" downwards and to "1101" upwards. However, the transducer does not map correctly at the left edge of the number where a new 1 must sometimes be added and a leading 0 can be deleted. It does not include pairs like "10" and "1", "100" and "11", etc., as it should. The problem can be corrected by composing the expression in Figure 5.1 with two auxiliary relations. To make the expressions more perspicuous, let us use END to abbreviate the complex expression [\[%0|1] | .#.]. The first relation, [1 <- [. .] || END _ [%0+ END]], adds an initial 1 to a string of one or more zeros in the upwards direction. (Using [. .] instead of [ ] to represent the empty string has the effect here that only a single 1 is inserted.) The second auxiliary relation, [%0 -> [] || END _ [1+ END]], eliminates the leading zero in strings such as "01" in the downwards direction. The two new relations are composed around the original expression. The result is shown in Figure 5.2 in complete detail.
Figure 5.2: Binary adder/subtracter for 1 (final). Although the expression looks complicated, the size of the network is only 5 states, 12 arcs, Figure 5.3. To clearly distinguish between the digit zero, and the special use of zero as the epsilon symbol, we exceptionally represent the single 1-to-epsilon pair in this diagram as 1:. All instances of 0 in Figure 5.3 represent zero as a digit.
Figure 5.3: Plus/minus 1 transducer. The reader is invited to find a simpler regular expression that compiles into the same transducer. (It is harder than it looks.) As Figure 5.4 demonstrates, the transducer applies correctly to strings that contain binary numbers in the middle of other text. The input equation asserts that 12-7 equals 8-3.
Figure 5.4: Addition and subtraction on binary numbers An interesting feature of this transducer is that we can derive from it an adder/subtracter for any natural number by composition. The composition of the transducer in Figure 5.3 with itself yields a transducer that adds and subtracts 2. The composition of the plus/minus 2 transducer with itself adds and subtracts 4, and so on. Although the transducer in Figure 5.3 is unambiguous in either direction, it is not efficient on large numbers. For example, consider subtraction from "10000000001" (1025 in decimal). The initial "1" on the left must be realized in one of three ways depending on what follows on the right: as "0", as the empty string, or as "1". The first alternative, mapping to "0", is correct only if the initial "1" is also the last digit of the number. This is not the case here. The remaining choice depends on whether there is another "1" somewhere further to the right. Both alternatives have to be explored. In the case at hand, there is another "1" at the very end, so the first "1" stays, yielding "10000000000". The alternative path that eliminates the initial "1" and maps all the immediately following "0"s to "1"s dies out at the last moment. Since the ambiguity in general cannot be resolved within a fixed amount of lookahead, the transducer in Figure 5.3 is yet another example (see Finite-State Networks) of a transducer that is unambiguous in a given direction, but neither sequential or sequentiable. Because the mapping on the left depends on what follows on the right, and not vice versa, it would be more efficient to proceed backwards, starting from the right end of the string. With the help of the reverse operator, .r, we can easily turn the transducer in Figure 5.3 around for right-to-left application. The regular expression for the reverse transducer is of the form [--- the expression in figure 5.2 ---].r . Figure 5.5 shows the resulting right-to-left binary adder/subtracter. (To clearly distinguish between zero and epsilon, we again represent the single 1-to-epsilon label as 1:. All "0"s in Figure 5.5 are digits.)
Figure 5.5: Right-to-left plus/minus 1 transducer. The right-to-left version of the transducer is nearly sequential in both directions as it has only one local ambiguity in state 1. We can now run our original example in Figure 5.4 backwards:
Figure 5.6: Right-to-left addition and subtraction See also the Roman Plus/Minus One example. Let us imagine a machine that dispenses soft drinks for the price of 65 cents per can in US currency. The machine accepts nickels, dimes, and quarters. If you put in the right amount of money in any combination of coins, a can of soft drink drops into a bin; otherwise nothing happens. We can think of the sequences of coins that add up to 65 cents as a language consisting of the letters "n" (nickel, worth 5 cents), "d" (dime, worth 10 cents) and "q" (quarter, worth 25 cents). For example, "qqdn", "ddddddn", and "nnnnnnnnnnnnn" are in this language because the total value of the string is exactly 65 cents. Any string whose value is less or more than 65 cents is excluded, for example, "qdq" (too little) and "qqq" (too much). The task in this example is to construct a regular expression that compiles into a finite-state automaton that implements the behavior of the soft drink machine. In order to avoid unnecessary technical complexities, let us simulate the dropping of the can by just an appropriate sound, say "PLONK". Thus we can think of the soft drink machine as a transducer that converts an appropriate sequence of coins into a PLONK. (Unlike the vending machines we are familiar with, our transducer could also be applied in the opposite direction as a buying machine, converting PLONKs to money.) We can solve this problem in a few simple steps. The first task is to associate the coin symbols, n, d, and q, with appropriate values. We do this by constructing for each coin a transducer that maps it to a string of "c"s (cents) whose length represents the value. The simplest way to do this is to use the construction A^n, which denotes the concatenation of A with itself n times. For example, c^5 denotes the language consisting of the string "ccccc". Thus the expression [n .x. c^5] expresses the fact that a nickel is worth 5 cents. The regular expression in Figure 6.1 denotes a mapping from all possible sequences of the three symbols to the corresponding value.
Figure 6.1: A relation between coin sequences and their values This relation contains infinitely many pairs. For example, the pair consisting of "nd" and "ccccccccccccccc" expresses the fact that a nickle and a dime are worth 15 cents. The next task is to relate the dropping of the soft-drink can to the appropriate amount of money. The expression [c^65 .x. PLONK] does just that. What remains is the problem of allowing the price to be paid by any sequence of coins that has the appropriate value. The solution is given in Figure 6.2.
Figure 6.2: The 65 cent soft drink machine The result of the composition in Figure 6.2 is a transducer whose upper language consists of all coin sequences whose value is exactly 65 cents. This effect comes about because the upper language of the relation denoted by [c^65 .x. PLONK], acts like a filter in the composition. It consists of a single string of 65 "c"s. Consequently, all coin sequences that map into some other value are eliminated. The lower language of the relation denoted by the expression in Figure 6.2 consists of "PLONK". (Note that PLONK is a single symbol.) Because the composition yields a relation between the outermost languages, the result does not contain any "c"s. The size of the network compiled from the regular expression in Figure 6.2 is 14 states, 34 arcs. The upper language consists of 634 strings. The number is larger than one might think because it includes, for each combination of coins, all the possible permutations. That is, "qqdn", "qqnd", "dqqn", etc. are counted separately. If we are interested in knowing just the number of combinations of nickels, dimes, and quarters that add up to 65 cents, ignoring the order of insertion, we can compute that easily by forcing the coins into a canonical order. The solution is shown in Figure 6.3. The effect of the topmost expression [q* d* n*] is to force the coins to be inserted in a strict order: first the quarters, if any; then the dimes, if any; and only then the nickels.
Figure 6.3: A strict soft drink machine We can enumerate the allowed ways of paying by doing a lookup on PLONK in the resulting transducer. See Figure 6.4.
Figure 6.4: Ordered coin combinations This strict machine allows only fourteen sequences of coins.
The clients of the soft drink machine just described cannot be very satisfied. The machine behaves in an outrageous manner. If the client fails to put in exactly the right combination of coins for a soft drink, the machine just keeps the money without delivering anything. Good business practice dictates that the machine should reimburse the client if the sum is too small and return the extra change along with the soft drink if the client has overpaid. We will show how these shortcomings can be corrected. To make things simpler, we will also fix a less important defect. Instead of selling just a single soft drink can, the improved transducer will convert any amount money into the appropriate number of PLONKs and return the extra change if any. In the case where the inserted money is less than the price of one can, the entire sum will be returned. In other words, we wish to modify the lower language of Figure 6.2 to make it a subset of [PLONK* q* d* n*], that is, any number of PLONKs followed by the extra change. To ensure that the extra change is paid out only once, we need to make sure that quarters get paid before dimes and dimes before nickels. Because all of the components need to be associated with the appropriate value in cents, the last line of Figure 6.2 is replaced by [[c^65 .x. PLONK]* [c^25 .x. q]* [c^10 .x. d]* [c^5 .x. n]*]. The last refinement is to ensure that as much money as possible is converted into soft drinks and to remove any ambiguity in how the extra change is to be reimbursed. For example, if the client is entitled to get back 60 cents, we give him two quarters and a dime instead of six dimes or some other combination of coins that could be paid. If there is any ambiguity about how reimbursements are made, the transducer would produce multiple outputs, with the danger that the client would get paid multiple times the same amount in different combinations of coins. To achieve these goals, we add yet another layer of constraints on the bottom of the composition. The expression [~$[q q q | [q q | d] d [d | n] | n n]] has the effect of disallowing the reimbursement of any sum that could be converted to a PLONK, such as "qqq" and "qqdn". It also prohibits sequences of dimes and nickels that add up to a quarter or more, such as "ddd" and "ddn". Finally, it prohibits a sequence of two nickels thus ensuring that ten cents is returned only as a dime. The new solution to the soft drink machine problem is given in Figure 7.1.
Figure 7.1: A better 65 cent soft drink machine The transducer compiled from the expression in Figure 7.1 contains 35 states and 102 arcs. Because there is no limit on how many soft drinks can be bought at the same time, the number of pairs of strings in the relation encoded by the transducer is infinite. For example, it includes the pair consisting of "nnnnnqqqqddddnnn" and "PLONKPLONKqq". Here the sum of the coins is 180 cents, thus the client gets two soft drinks, worth $1.30, and 50 cents back paid as two quarters. Note that the improved soft drink machine can also be used to reduce the amount of small change. If a sequence of coins is not enough to buy a drink, the entire amount is returned using as few coins as possible. For example, in exchange for 60 cents in whatever combination, the client will get back two quarters and a dime. To make the machine completely foolproof, we need one final improvement. Some clients may insert unwanted items into the machine (subway tokens, foreign coins, etc.). These objects should not be accepted; they should passed right back to the client. This goal can be achieved easily by wrapping the entire expression in Figure 7.1 inside an ignore operator, /. To highlight the addition, we omit the common part: [ [ --- the expression in Figure 7.1 --- ]/[\[q | d | n]]]. Here the second operand, [\[q | d | n]], denotes the set that contains every single-symbol string other that "q", "d", or "n". The effect of the ignore operation is to create a transducer that is in other respects equivalent to the original one except that it allows the other symbols to occur freely, without affecting the relation specified in Figure 7.1. Any unknown symbol is simply mapped to itself, as shown in Figure 7.2, and ignored.
Figure 7.2: Ignoring foreign objects Here "quadriphonique" is transduced to "PLONKuariphoiue". The two quarters, the dime and the nickel in the input string are transduced to a PLONK; everything else is passed through unchanged. The size of the final transducer is 35 states, 172 arcs. The number of states is the same as in the network compiled from Figure 7.1. The increase in the number of arcs comes from having two new looping arcs at each state. One is for the unknown symbol; the other is for PLONK, that is, a can of something dropped into the machine by the client himself.
Related names such as "Johnson", "Johnsen", "Johansson", "Johanson", "Johansen", "Jensen", etc. are often confused with one another because they are close in spelling and pronounciation. In 1918 Margaret K. Odell and Robert C. Russel developed a system of mapping such sets of similar surnames into a unique common representation. They called it the "Soundex" method. Odell and Russel obtained two US patents (long since expired) for the idea. Donald Knuth describes the Soundex method in his book on the Art of Computer Programming (Vol 3. pp. 391-392) as follows:
Figure 8.1: Soundex mapping The expression in Figure 8.1 denotes the composition of two relations. The first one, a set of parallel longest-match replacements, maps sequences of adjacent consonants into the corresponding code and deletes the vowels. Thus it does the work of steps 1, 2 and 3 in the original Soundex algorithm. Because the first letter of the name is capitalized, it remains unaffected by these replacements. The second relation in Figure 8.1 corresponds to the step 4 in the original method. It consists of a union of four relations. The first one preserves the first four symbols of a string and eliminates the rest. The remaining three add zeros to the end of short strings to increase the length to four. The Soundex transducer compiled from the expression in Figure 8.1 consists of 26 states and 805 arcs. Figure 8.2 illustrates its effect.
Figure 8.2: Same code, different names Because there is no limit on the length of a name or the number of ignored vowels and consonants, every Soundex code represents an infinite set of strings. Consequently, applying the Soundex transducer in the upward direction produces mostly gibberish. However, if we have a lexicon of names, composing the lexicon with the Soundex relation creates a restricted name-to-code transducer that yields useful results in either direction. Consider the expression in Figure 8.3.
Figure 8.3: Name-to-code relation The transducer compiled from the expression in Figure 8.3 maps the strings in the name lexicon into exactly the same Soundex codes as the general Soundex transducer. It is more limited of course because the upper language of the relation is a finite list. On the other hand, for that very reason, applying it in the upward direction yields useful results, as shown in Figure 8.4.
Figure 8.4: From name to code, from code to name In order to know what names "Johnson" could be confused with, one can first apply the restricted name-to-code transducer downwards to get the corresponding Soundex code J525. By applying the transducer upwards to J525 we obtain all the strings in the name lexicon that are associated with that particular code. There is a way to get that information in just one step. By composing the relation in Figure 8.3 with its own inverse, we can create a transducer that maps every name in the lexicon directly to all the other names that have the same Soundex code. The actual Soundex codes are eliminated in the composition. The resulting transducer simply maps every name to a set of names, possibly just to itself, without any indication of what property it is that the names have in common.
Let us define the language of valid Roman numerals up to MMMCMXCIX (= 3999) and then construct a transducer that maps each Roman numeral to the corresponding Arabic numeral, and vice versa. The method we will use is a combination of techniques introduced in the Soft drink and Soundex examples above. Roman numerals are written with an alphabet of seven letters:
Figure 9.1 Alphabet of Roman numerals. The system is additive: II = 2, VII = 7, and so on. However only certain orders are allowed. The general rule is that letters with a larger value precede letters with a smaller value. Thus MLX is a valid numeral (= 1060) but XML is not. Three of the seven letters, I, X, and C, can occur in front of certain larger numerals and in that case they are interpreted subtractively. For example IV is 4 (= 5 - 1), IX is 9, XL is 40. However the I, X, and C prefixes are mutually exclusive. I may occur only in front of V and X; X can appear in front of L and C, and C may be prefixed only to D and M. The expression in Figure 9.2 encodes the syntactic conventions.
Figure 9.2 Language of Roman numerals. The expression in Figure 9.2 is a concatenation of four subexpressions. The first one, M^<4 ("less than four Ms"), yields MMM, MM, M, and the empty string. Each of the following four lines creates ten strings including the empty string. For example, the expression [I [X | V] | (V) I^<4] yields the strings IX, VIII, VII, VI, V, IV, III, II, I, and the empty string. The concatenation of the four parts yields all the valid Roman numerals from I to MMMCMXCIX. We consider the empty string as the Roman counterpart of 0. Later on we show how to construct a Roman/Arabic transducer directly but we take first a more interesting approach in which unary numerals serves as an "interlingua". That is, we will develop a mapping from Roman to unary numerals, then specify a mapping from Arabic to unary numerals, and finally link each Roman numeral to the corresponding Arabic one by way of their common unary representation. For example, the mapping from IV to 4 is obtained by transducing IV to 1111 and 1111 to 4 and by composing the two relations. Figure 9.3 illustrates the general idea.
Figure 9.3 Relating IV to 4 by 1111 The regular expression in Figure 9.4 specifies the mapping from Roman to unary numerals.
Figure 9.4 Roman to unary numerals The expression in Figure 9.4 consists of two subexpressions that are composed with one another. The first component is a seven part parallel replace expression that maps each of the letters to the appropriate number of 1s. Because I, X, and C have a subtractive interpretation in front of a letter that has a larger value, we need to restrict the context. For example, the third line in Figure 9.4 maps X to 1111111111 just in case it is at the end of a string or followed by something other than L or C. In contrast, V, L, D, and M are additive in any context. The second part of the expression in Figure 9.4, maps the remaining instances of I, X, and C to the empty string taking along the appropriate number of 1s on their right side. In other words, the mapping from IV to 1111 is specified in two steps as illustrated in Figure 9.5
Figure 9.5 IV to 1111 The composition of the two relations in Figure 9.5 of course maps IV to 1111 directly eliminating the intermediate representation. The syntax of Arabic numerals is simpler than the syntax of Roman numerals. The only complication is that we do not allow leading zeros because it would introduce ambiguity. The expression in Figure 9.6 defines the language of Arabic numerals from 0 up to 3999.
Figure 9.6 Language of Arabic numerals. The mapping Arabic numerals to unary is complicated by the fact that the value of every digit depends on its distance from the right edge. The expression in Figure 9.7 defines the relation between Arabic and unary numerals.
Figure 9.7 Arabic to unary numerals. The expression in Figure 9.7 is a set of five replacements that are done in parallel so as not to interfere with each other's contexts. For example, 3 is replaced by three, thirty, or three hundred 1s depending on whether it is zero, one, or two symbols away from the end of the string. Zero is replaced by an epsilon in all positions. Having defined the two languages, Roman and Arabic numerals, and the mapping from both to the same unary notation, we can derive the desired direct relation between Roman and Arabic numerals in the manner shown in Figure 9.8. Because the expression is too large to be reproduced here, we use the names Roman, RomanToUnary, Arabic, and Arabic for the formulas in Figures 9.2, 9.4, 9.6, and 9.7. Note that we need to invert the Arabic/Unary relation in order to get the unary strings to the upper side where they will match the unary strings on the lower side of the Roman/Arabic relation.
Figure 9.8 Indirect mapping from Roman to Arabic numerals The resulting network for the Roman/Arabic relation is quite compact. The size of the transducer is 30 states and 103 arcs. Figure 9.9 shows it in action.
Figure 9.9 Roman/Arabic translation The Roman/Arabic relation can of course also be defined quite directly. The expression in Figure 9.10 encodes simultaneously the syntax of Roman numerals as well as their relation to Arabic numerals. It denotes exactly the same relation as the expression in Figure 9.8.
Figure 9.10 Direct mapping from Roman to Arabic numerals The expression in Figure 9.10 is a composition of two relations. The first part defines the basic Roman/Arabic mapping; the second part eliminates leading zeros on the Arabic side. The Roman/Arabic part is a concatenation of four relations that define for a given position in a four digit Arabic numeral the Roman way of writing 9, 8, 7, ... etc. For example, Arabic 3 corresponds to Roman MMM, CCC, XXX, or III depending on where it occurs. Arabic 0 corresponds to the empty string everywhere on the Roman side. Note that in Figure 9.10 we use // instead of || in the replace expression for zeros. // indicates that the left context for the replacement is specified on the lower side and the right context on the upper side. Consequently all the leading zeros in "0007" are deleted.
In the Plus/Minus One example above, we constructed a transducer that maps any binary number to its successor or predecessor depending on the direction of application. Here we do the same for Roman numerals. This task is straight-forward since we have already defined the language of Roman numerals and the mapping to the unary representation. The unary representation makes the task easy because it is simple to define the successor/predecessor relation. To find the predecessor of a Roman number we can compute the corresponding unary number, subtract one from it, and transduce the result back to the Roman system using the inverse mapping from unary to Roman numerals. Figure 10.2 illustrates the main idea.
Figure 10.2 From V to IV The regular expression in Figure 10.3 is the composition of the formulas in Figures 9.2 and 9.4. It yields a transducer that maps each Roman numeral to its unary counterpart. For example, it relates V to 11111. The inverse of that transducer maps 1111 to IV.
Figure 10.3 Roman to unary translation All that we need now is a transducer that encodes the predecessor relation for binary numbers. The expression in Figure 10.4 yields such a network. It eliminates the rightmost 1 from a unary numeral.
Figure 10.4 Unary subtration In Figure 10.5 the name MinusOne stands for the expression in Figure 10.4 and RomanToUnary for the expression in Figure 10.3 and RomanToUnary.i for the inverse relation.
Figure 10.5 Roman to unary translation The intermediate unary layer is eliminated in the composition. The resulting transducer maps Roman numerals directly to Roman numerals, as illustrated in Figure 10.6. The size of the network is is 44 states and 148 arcs.
Figure 10.6 Roman subtraction/addition A similar construction would of course also work for binary and Arabic numerals within any finite range. It would not work for the unbounded case we considered in the Plus/Minus One example.
Variations of this riddle appear on the net from time to time. It is sometimes attributed to Albert Einstein and it is claimed that 98% of the people are incapable of solving it. Some commentators suggest that Einstein created such puzzles not to test out intelligence but to get rid of all the students who wanted him as an advisor. It is not likely that there is any truth to these stories. Whereever it comes from, this is a nice riddle. Let us assume that there are five houses of different colors next to each other on the same road. In each house lives a man of a different nationality. Every man has his favorite drink, his favorite brand of cigarettes, and keeps pets of a particular kind.
The question to be answered is: Who keeps fish? This problem is al little too elaborate for our simple web demo compiler. Although the solution could be expressed as a single regular expression without any auxiliary notions, it would be very large. It is simpler to define first a few terms that can be used as components in the final solution. In the following, we presuppose that we have access to the xfst application described at http://www.xrce.xerox.com/competencies/content-analysis/fssoft/. We need five basic variables: Color of the house, Nationality of the owner, and his favorite Drink, Cigarette, and Pet. We define each variable as the language of the possible values of the variable.
The next concept to define is that of a House. Let us construe it as a concatenation of the five terms defined above:
With five variables each taking one of five possible values, this gives quite a number of possible households, 5x5x5x5x5 = 3125, to be exact. A road with five houses next to each other, House^5, provides an astronomical number of possible combinations of colors, nationalities, drinks, cigarettes and pets. To solve Einstein's puzzle, we represent each of the fifteen constraints as a regular language and intersect these languages with the initial set of all possibilities. If all goes well at the end we will know who keeps fish. For example, we can interpret The Englishman lives in the red house. as $[red Englishman]. This constraint is trivial to encode because in our representation of a house, the color and the nationality are adjacent. The second costraint, The Swede keeps dogs could be represented as $[Swede Drink Cigarette dogs] but we will choose a less verbose formulation, $[Swede ~$Pet dogs], that does not explicitly list the variables that separate Nationality and Pet. The fifteen constraints are shown below.
All that we need to do to solve the problem is to impose the constraints on the row of five houses by intersection. The solution below is almost correct.
The result is a network with five paths. In four of the solutions nobody keeps fish and the German keeps the same kind of pet as someone else. We need one final constraint, presupposed by the question "Who keeps fish?":
With C16 added, only one solution remains. To make it easier to see, we compose the solution with the transducer that adds some prose around the pieces along the one remaining path:
We can now see the solution.
In short, it's the German who keeps fish. If you would like to try another puzzle of this kind, here is a variant of "Einstein's puzzle" from www.riddleaday.com:
There are five cars parked next to each other in the parking garage. Each car is a different color. Each of the five car owners drink a different beverage, have a different hobby (or participate in a different sport) and own a different breed of dog. No two owners prefer the same beverage, have the same hobby (or sport) or own the same breed of dog. Hints:
The question is: Who owns the Bichon Frise (it's a dog)?
For more detailed information please look at our pointers to finite-state literature. We welcome your comments and suggestions.
|
|