![]() |
|
|
|
|
![]() |
|
|
|
|
|
|
|
|
|
|
FINITE-STATE LEXICON COMPILER
Technical Report. ISTL-NLTT2993-04-02. April 1993. Xerox Palo Alto Research Center. Palo Alto, California. 1993. Copyright (c) 1993 Xerox Corporation. All rights reserved. Finite-State Lexicon Compiler Lauri Karttunen Abstract : The Finite-State Lexicon Compiler is an authoring tool for creating lexicons and lexical transducers. It is designed to be used in conjunction with transducers produced with the Xerox Two-level Rule Compiler. Table of Contents
The Finite-State Lexicon Compiler ("lexc") is an authoring tool for creating lexicons and lexical transducers. A lexical transducer is a specialized finite-state automaton that maps lexical forms and morphological specifications to corresponding inflected forms, and vice versa. For example, a lexical transducer for English might relate the word dining to dine+PresPart and swam to swim+Past. The input to the lexicon compiler is a text file in a format similar to lexicons accepted by Kimmo Koskenniemi's two-l program [1] and Evan Antworth's PC-KIMMO [2] but with several novel features. The most notable innovation is that the entries in the source lexicon may specify a relation between two forms. Idiosyncratic mappings, such as the i~a alternation in swam may be encoded directly in the lexicon whereas regular alternation, such as the loss of the stem-final e in dining may be handled by general rules. The output from the compiler is a simple finite-state automaton or a transducer. The result of the compilation may be composed with a rule transducer or transducers that effect morphological alternations or in other ways modify the initial result. The lexicon compiler is designed to be used with transducers produced with the Xerox two-level rule compiler [3]. The compiler was written in C at Xerox PARC in 1992-93 by Lauri Karttunen and Todd Yampol. It is based on Karttunen's 1990 Common Lisp implementation of a similar utility. The compiler currently works on the Macintosh and UNIX operating systems. The interface to the compiler is the same in both versions except that the Macintosh version allows the user to interact with the file system using mouse clicks and dialogue boxes. Section 2 of this document (Making simple automata) gives a brief tutorial on how the compiler can be used to make finite-state lexicons. Section 3 (Making lexical transducers) introduces more sophisticated techniques. Section 4 (Checking the result) illustrates ways to explore the compiled lexicon. Section 5 (Scripting) is about setting up files that make it easier to update the lexicon. Section 6 (Regular expressions) lists the regular expression operators that can be used to construct lexical entries. On-line help messages are listed in APPENDIX 1. APPENDIX 2 is a detailed example of cascaded composition. When you start the compiler (by double-clicking the icon on a Macintosh or by typing 'lexc' in UNIX), it prints a welcoming banner as shown in (1) and waits for the user to type a command terminated with a carriage return on the lexc> prompt line. The commands are grouped under three general headings Input/Output, Operations, and Display. Each group is further divided into subsections of related commands. The user may review the list at any time by answering the prompt with a question mark. The 'help' command, followed by a name of a particular command, gives advice on what the commands do, 'help all' prints out all help messages. (1)
***************************************
* Two-Level Lexicon Compiler 1.2 *
* created by *
* Lauri Karttunen and Todd Yampol*
* Copyright (c) 1992 by the Xerox Corporation.*
* All Rights Reserved.*
***************************************
Input/Output -------------------------------------------------------------- Source: compile-source, read-source, result-to-source, save-source. Rules: read-rules. Result: read-result, save-result, source-to-result. Operations ---------------------------------------------------------------- Composition: compose-result, extract-surface-forms. Checking: check-all, lookdown, lookup, random. Switches: duplicates, failures, singles, time. Scripts: begin-script, end-script, run-script. Display ------------------------------------------------------------------- Misc: banner, labels, status, storage. Help: help, ?. Type 'quit' to exit. lexc> The three main commands are 'compile-source', 'read-rules', and 'compose-result'. The term source net is used for the automaton (either a simple lexicon or a transducer) compiled from the input lexicon. The command 'compile-source' creates the source net. The term rule nets refers to a set of rule transducers created by the two-level rule compiler (or by other means) and saved in binary form. The command 'read-rules' reads the rule nets from a file. The command 'compose-result' composes the source net with the rule net or nets and produces the result net. We will start with these commands and explain the rest as we go along. To start off with a very simple example, we construct a lexicon that contains the forms of two lexemes, dine and line. Let us assume that line is both a noun and a verb, dine only a verb. The simplest way to define the lexicon is just to list all the forms as shown in (2) under the heading LEXICON Root. In general, there may be any number of sublexicons, but one of them must have the name Root. The other sublexicons, if any, are named according to the needs and tastes of the lexicon writer. Entries are separated by semicolons. Each entry consists of a form, possibly null, and a continuation class. The continuation class must be the name of another sublexicon or #, a terminator symbol. The word END at the end of the file is optional. The exclamation point, !, marks a comment; the rest of the line is ignored. Compilation of this lexicon creates a finite-state machine (fsm) that accepts all only the listed word forms. (2)
To get a more interesting example, let us define another version of the lexicon that distinguishes stems from suffixes. The new version is listed in (3). Some entries have no content other than the continuation class, some have a sublexicon name as the continuation class. (3)
Example (4) shows what happens when this file is processed by the compiler. Note that the name of the text file containing the lexicon specification is ex1-lex.txt. (4)
The resulting lexicon contains the same six words we saw in (2). The form lines actually gets constructed twice, once as a verb, once as a noun. After minimization, only one of them remains. The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized. The form dines, for example, is created by joining its components in the manner shown in (5). The first component is a null string. (5)
The command 'save-source' saves the compiled source net as a binary file. In the case at hand, the source net is the simple six-state automaton shown in Figure 1. Figure 1
To move on to a more challenging example, let us construct a lexicon that contains all valid dates of the form "<month> <date>, <year>", such as March 14, 1993. To keep things simple, we adopt the restriction that dates must conform to this format exactly; that is, a space between the month and the day, a comma and a space between the day and the year. Let us also stipulate that <year> must be a number between 0 and 9999. Every date in the lexicon should of course be valid; March 31, 1993 is included but April 31, 1993 is not because there are only 30 days in April. Because whitespace characters (blank, tab, carriage return, linefeed) are generally ignored by the compiler, we must use a percent sign, %, as an escape character in front of a blank that should not count as whitespace. For reasons that will soon become evident, the numeral 0 is also special. If it is used as an ordinary digit in a word, we must prefix it with %. To construct the date June 10, 2000, for example, we need to specify it as June% 1%0,% 2%0%0%0. Zero is special only in the word, not in the continuation class of an entry. (6)
The complete listing for the date lexicon is given in (6). It is a little more complicated than one might expect because it enforces the constraint that the day numeral must not be larger than the number of days in the month. (February is a special case; we will come to that shortly.) Another reason for the complexity is that we decided to avoid leading zeros. For example, January 1, 100 is included but January 1, 0100 is not considered a valid date. A more compact encoding for the same lexicon is presented in (52), Section 6. Example (7) illustrates the structure of the lexicon by showing how the date March 14, 1993 is assembled by linking the components in thirteen sublexicons. (7)
Sublexicon Substring The listing in (8) traces the compilation of the date lexicon. (8)
As it reads the source file, the compiler prints the name of each sublexicon and the number of entries in it. When the compilation is finished, the command 'random' is a convenient way to examine the contents of the lexicon. It randomly prints 15 words from the lexicon. Any gross errors are likely to show up quickly. Note that there are altogether 3.66 million dates in our 64-state lexicon. A few too many for ten thousand years--we have yet to eliminate unwanted leap days. (9)
The note Using SOURCE means that the examples are generated from the source net. In the case at hand, there is no result net so far (recall that a "result net" is the result of composing a source net with a rule net or nets). The randomization algorithm is useful but not optimal; short words are much more likely to show up than long ones. The command 'lookup' enables the user to query whether a particular word (a date in this case) is in the lexicon. If the word is found, it is echoed; a line of stars indicates that the form was not found in the lexicon. (10)
If the argument to a command is not included on the same line with the command, the program prompts for it, as in the last example. Although the lexicon in (6) is correct (except for leap days), it is more cumbersome than it needs to be. For example, the last sublexicon (0To9Years) says, for each of the ten digits, that the string may terminate with that digit. (11)
It would be simpler to say--just once--that any of the ten digits counts as an acceptable termination. The lexicon compiler also accepts the latter type of specification. The form part of an entry may be not just a simple string but a regular expression describing a set (possibly infinite) of forms that share a common continuation class. In the case at hand, we may combine the ten entries in (11) to the single entry in (12). (12)
The angle brackets indicate that the expression contained within them is to be interpreted as a regular expression and not as an ordinary string; the vertical bar is a symbol for disjunction. Inside a regular expression, all symbols must be surrounded by whitespace or delimiters. (13) illustrates an alternative form of the Root lexicon in which 31 and 30 day months are enumerated by regular expressions. (13)
More about regular expressions in Section 6. 2.4 Composition with a rule transducer Let us now consider the problem of how to eliminate from our lexicon the dates for nonexistent leap days, such as February 29, 1993. February has 29 days only in years that are divisible by four, except that centuries are leap years only if the number is still divisible by four when the last two zeros are left out. For example, 1900 was not a leap year but in the year 2000 there will be 29 days in February. We could solve this problem by constructing a special chain of sublexicons for February alone but it would be complicated. An equivalent but notationally easier solution is to eliminate the unwanted leap days by composing the source lexicon with a finite-state transducer (fst) that expresses the rule stated above. This provides us with a simple example of how the lexicon compiler interacts with transducers produced with the two-level rule compiler [3]. A two-level grammar for prohibiting the occurrence of February 29 with an invalid year is presented in (14). (14)
Please consult [3] for a detailed explanation of the rule formalism used here. Ritchie et al. [4] and Sproat [5] are good general introductions to two-level morphology. For the sake of this example, it is sufficient to know that the relatively straightforward encoding of the leap day constraint can easily be compiled to the transducer shown in (15) in the form of a transition table. The rows in (15) represent numbered states, the columns are labeled with a symbol that represents an "equivalence class," a set of symbols that are equivalent within the fst. The equivalence classes are listed at the bottom of the table; the first symbol in each class is used in the table to represent the class. This transducer is trivial in the sense that it maps every valid date to itself without changing anything. The implicit boundary symbol, #, at the end and at the beginning of a date is mapped to the epsilon symbol (printed as 0). Invalid dates that contain February 29 in the wrong type of year disappear in the transduction. The top half of the table tracks the string February 29, to state 14, the bottom half tracks the years up to the implicit final word boundary #:0 (here zero represents the epsilon symbol). The gaps in the last column indicate states where February 29 is followed by the wrong kind of year or no year at all. Because these states do not allow the date to end there, the unwanted leap dates are eliminated. (15)
Let us now see how the transducer is applied to the lexicon. We assume that the "No February 29" rule has already been compiled by the two-level rule compiler and is stored in the binary file ex3-rule.fst. In (16), we first compile the source lexicon as before, then load the rule transducer with the command 'read-rules'. (16)
Alternatively, we could start by reading the rules. The order does not matter in this case, although it may be relevant if the alphabet of the lexicon contains multicharacter symbols. We will come back to this later. An argument to the 'read-rules' command should be a name of a Xerox format [6] binary file containing one or more transducers. Any file of this type is acceptable regardless of whether it was produced with the two-level rule compiler or by other means. The command 'compose-result' creates the result net by applying the rule transducer (or transducers) to the source net. (17)
The result of the composition looks promising: the result net has 17 states more than the source net, but the word count has decreased by 7575 dates. The 'status' command in (18) gives a quick overview of the situation. (18)
We can now use 'lookup' to verify that the composition had the intended effect. Because we now have a source and a result net, we can choose which one to use. If a result net has been computed, it is the default target for lookup. As we see by the string of asterisks returned, February 29, 1993 has disappeared from the lexicon. (19)
As intended, the year 2000 is correctly included as a leap year whereas the year 1900 has been eliminated as it should be. (20)
Because everything appears to be in order, we save the result with the 'save-result' command. The command 'save-source' would save the source net. More about saving in Section 7. (21)
3.1 A simple lexical transducer A lexical transducer, first discussed in Karttunen, Kaplan, and Zaenen [7], is a lemmatizer that maps inflected surface forms to their canonical dictionary representations and vice versa. In simple cases, such transducers can be compiled directly from a source lexicon. For example, we can modify the minilexicon in (3) to make it a morphological analyzer/generator for the forms of dine and line. The new lexicon is shown in (22). (22)
In addition to the citation form of the word itself, the entries in (22) contain morphological information: part of speech (+N, +V) and type of inflection (+Pl, +Sg, +Base, +Sg3, +Past, +PastPart). The main technical difference between the old and the new version is that some of the entries now contain a pair of forms: a lexical form and a surface form, separated by a colon. The compiler interprets such a pair as a regular relation, a sequence of pairwise correspondences between the symbols. For example, the entry +Pl:s pairs the lexical form +Pl with the surface form s. In other words, in forms constructed with this entry, plural is realized as s. The declaration Multichar_Symbols at the beginning of the file tells the compiler that +Pl and the other morphological tags are to be parsed as a single symbol rather than as a sequence of symbols. Using multicharacter symbols conveniently makes the morphological alphabet disjoint from ordinary letters. In pairs such as +N:0, +Sg:0, the numeral 0 stands for epsilon, the empty string, whereas the real zero is designated by %0. (23) shows the composition of the surface form lines in (22), given the interpretation of zero as the empty string. The single form entries, e.g. line, are viewed as equivalent to identity mappings (line:line). (23)
The command 'compile-source' converts the lexicon in (22) to a transducer, which can be used bidirectionally for analysis ('lookup') as well as generation ('lookdown'): (24)
The 'lookup' command matches its argument against the lower side of the lexical transducer (ignoring epsilons) and produces the corresponding lexical string or strings if the match succeeds; 'lookdown' does the same operation in the opposite direction. (25)
Because entries in the lexicon specification may consist of a pair of forms, it is simple to deal with words that have irregular surface forms in their paradigm. For example, strong verbs like swim have irregular past and past participle forms; in other respects they are regular. To add swim to our minilexicon, we need three new entries in LEXICON Verbs for the verb itself and three new sublexicons. The new sublexicons, shared by irregular forms of other strong verbs, encode the fact that in these verbs the past tense and past participle categories are expressed as a stem alternation and not by a suffix. These additions are listed in (26). (26)
When the lexicon is compiled into a transducer, the pairs swim:swam and swim:swum are combined into a sequence of symbol pairs in which every character of the lexical form is paired with the corresponding symbol of the surface form. If the symbols are identical, we represent the result as a single symbol (s instead of s:s), otherwise as a pair (i:a and i:u). If one of the forms is shorter than the other, epsilons are automatically added to the end to make the lengths equal. The past tense of swim is represented in the transducer as the path of symbols and symbol pairs shown in (27). (27)The commands 'lookup swam' and 'lookdown swim+V+Past' both locate this same path in the transducer by matching the argument against one or the other side of the transducer and printing out the symbols of the opposite side, or the same symbols if there is no difference. Because of epsilons and because one input symbol may correspond to more than one symbol on the opposite side, these operations in a transducer are in general not deterministic. On the other hand, lexical transducers have the advantage that they are bidirectional: analysis and generation make use of the same data structure and the same search algorithm. The simple left-to-right conversion of the lexical and surface forms to a sequence of symbol pairs can lead to nonoptimal results. If the past tense of fight is encoded as fight:fought, the symbols line up in the manner shown in (28). (28)
Because by default the compiler adds an epsilon to the end of the shorter string, the resulting path, <f i:o g:u h:g t:h 0:t>, is very irregular. We can avoid such ill-matched sequences by specifying in the lexical entry directly where epsilons should go. The optimal solution in this case is fi0ght:fought, which yields the path <f i:o 0:u g h t>. Here, as elsewhere, the 0 that represents an epsilon is not preceded by a percent sign. 3.2 Composing with rule transducers Let us consider augmenting our minilexicon (22) with the present participle forms dining, lining, and swimming. We need a new sublexicon for present participle and a way of linking it to the entries for dine, line, and swim. The deletion of the stem final e in dining and lining and the gemination of m in swimming could be handled in the same way as the stem alternation in swim and swum: (30)
But this would be unsatisfactory because the alternations in (30) are not idiosyncratic to these particular words; they apply productively to all words of the same type. Our solution is to write general e-deletion and gemination rules that apply to the canonical lexical form in the environment of a vowel-initial suffix. When the source lexicon is composed with the rule transducers, the result transducer will correctly represent these alternations. The two rules are shown in (31). See [3] for details of the syntax for specifying two-level rules. (31)
After compilation, the present participle form of swim has the structure shown in (33). (33)
The lower side of the source transducer contains "intermediate" forms. The surface form or forms will be determined by the application of the two-level rules in (31). In order to control the environment in which these rules apply, we explicitly mark the boundary between the stem and the present participle ending in the lexicon. The boundary marker, ^, corresponds to an epsilon in the lexical form (see the last entry in LEXICON VCommon). Because the only surface realization of ^ allowed in the rule alphabet is 0 (epsilon), the boundary marker disappears from the result when the source net is composed with the rule transducers. (34) illustrates the effect of the composition. (34)
In the composition the intermediate form is eliminated. The result net maps swim+V+PresPart directly to swimming: (35)
Because we need the e-deletion rule anyway for the present participle forms of dine and line, we changed the past and past participle suffix from d to ed. This enables us to handle all forms of regular verbs ending in a consonant, such as stop. (36) traces the construction of the final result with the compiler. We assume here that the rules in (31) have already been compiled and stored in the binary file ex5-rules.fst. (36)
Let us get an overview of the situation with the 'status' command:
As we see in (36) there are two transducers in the rule file, compiled from the two rules in (31) with the two-level rule compiler [3]. In the previous composition example in Section 2.4, we only had a single rule transducer. The result here is what we get by composing the source net with the intersection of the two rule transducers. To verify that we have the expected result, we can again use 'lookup' and 'lookdown', first on the result net: (37)
If we choose the source net as the target, we can still see the intermediate forms. The source net is left unchanged by the compose operation. (38)
Section 4 discusses other ways of examining the result of the composition. For languages with a more complex morphology than English, it may be impractical to encode all the morphological alternations in a single rule system. One such case is the plural formation of compound nouns in French, discussed in Karttunen, Kaplan and Zaenen [7]. It can best be described by two separate rule systems. The lexicon is composed with the first rule system and the resulting transducer is composed with the second set of rules. Appendix 2 discusses the case of French nouns in more detail. In a cascaded composition, the result net of the first composition becomes the source net for the second step. The command 'result-to-source' makes that switch. When a new set of rules has been read in, the composition can be continued. This process is similar to the traditional derivation with an ordered set of rewrite rules. The difference of course is that the rules are applied to the lexicon as a whole rather than to individual words. At every point in the cascade, the result net maps the lexical forms to the output of the last rule or set of rules in the composition. We have already illustrated three commands that display words in the source and the result net: 'random', 'lookup', and 'lookdown'. The 'check-all' command systematically compares the source net to the result net and displays the effect of the compose operation. It looks at every word in the source net and tries to find all words in the result net that have been derived from it by composition. The results of the search can be written to a file or displayed on the screen as in (39). (39)
The left column shows the lexical (upper) side of all the words in the source net, the right column displays the corresponding surface (lower) forms in the result net. For example, we see that the composition has produced swimming from swim+V+PresPart. The output from 'check-all' is controlled by three variables that can be toggled on and off. If the variable 'singles' (default OFF) is ON, all lexical forms that have only one surface realization are printed. If 'duplicates' (default ON) is ON, lexical forms with more than one realization are printed. If 'failures' (default ON) is ON, entries in the source net that have no surface realization are printed with a row of asterisks in place of the missing output. In a typical case, most words in a source lexicon have one and only one surface realization. It may be unnecessary and time consuming to print them all if the lexicon is large. By turning the 'singles' flag off, the lexicon writer sees only the entries in the source lexicon that have several surface realizations or no realizations at all. This is useful information when the rule set is still in development. Badly constrained alternations may result in invalid multiple outputs, and conflicting rules may completely block some valid forms. If all the flags are turned off, 'check-all' produces no output other than the statistics at the end. A common source of error is a mismatch between the lexicon and the rule alphabets. If some words in the source net contain symbols that are not in the rule alphabet, they disappear in the composition. The 'compose-result' command checks the alphabets and prints a warning if some symbol in the lower side of the source net does not appear on the upper side of the rule transducers. The command 'labels' allows the user to display the pair alphabet on the screen: (40)
Any symbol in the source lexicon that consists of more than one letter must be declared in a Multichar_Symbols declaration (cf. (22) and (32)), otherwise it will be misparsed as a symbol sequence. The lexicon compiler converts strings to symbols from left to right always preferring the longest match. It is important to keep this in mind in designing the alphabet. It might be safe to declare qu as a single symbol in English because it is nearly always pronounced [k]. On the other hand, declaring th a multicharacter symbol would clearly be a bad decision because words like hothouse would be misparsed as h-o-th-o-u-s-e. If the lexicon consists of ordinary alphabetic symbols and multicharacter tags as our examples do, it is best to use some nonalphabetic character such as + as the first letter of all multicharacter symbols. In this way the regular alphabet and the special alphabet are disjoint and every string is guaranteed to parse correctly. The commands 'begin-script', 'end-script', and 'run-script' speed up the development of a lexicon by allowing the user to record a sequence of commands to a file and execute them all at once from inside the lexicon compiler. 'begin-script' starts the recording, 'end-script' closes the script file: (41)
A script file, such as (42) may also be produced with a text editor and executed with a single command as in (43). (42)
The simple script facility is useful on the Macintosh. In UNIX, it is possible to automate the development even better with the help of the 'make' command which is invoked from the UNIX command line. Let us illustrate this by constructing a Makefile for the result net of the current example. The "target" here is a situation in which the file containing the result net is up-to-date. Let us assume that the net was saved as ex5-result.fst, the source net as ex5-source.fst, and the rules as ex5-rules.fst. The source net was compiled from ex5-lex.txt, the rules from ex5-rules.txt. If the rule file gets modified, ex5-rules.fst needs to be recomputed; if ex5-lex.txt changes, ex5-source.fst should be recompiled; if either ex5-source.fst or ex5-rules.fst has changed, ex5-result.fst has to be recomposed. To get 'make' to keep track of the situation and update the files as needed, the dependencies and the update actions have to be specified in a file called Makefile. (44) is an example of a Makefile that keeps ex5-rules.fst up-to-date by using 'source' to run a specific update script when one of the dependent files has changed. (44)
The UNIX command 'make ex5' checks the dependencies and carries out any required update actions in the appropriate order. The scripts for the updates are in three separate files, update-result.txt (45), update-source.txt (46), and update-rules.txt (47). (45)
The update-result script in (45) invokes the lexicon compiler, reads in the source and the rule nets, composes them, turns on the singles flag, produces the check-all file ex5-check.log, saves the result net as ex5-result.fst and quits. The update-source script in (46) is simpler. (46)
Updating the rules requires the two-level rule compiler. The update-rules script in (47) invokes the rule compiler (twolc) and runs the appropriate commands. For details, see [3]. (47)
In Section 2.3 we saw that entries in a source lexicon may consist of a regular expression and a continuation class. In this section we discuss regular expressions in more detail. The lexicon compiler follows the same syntactic conventions as the two-level rule compiler (see Section 3.2 in [3]) but it has a few additional operators. The common operators are listed in (48) in a descending order of precedence. A few comments are added to explain the meaning of some nonstandard operators. (48)
The lexicon compiler requires that regular expressions be enclosed in a pair of angle brackets, as shown in (12) and (13), to distinguish them from simple lexical forms and form pairs of type string1:string2. All operators, special characters, grouping elements, angle brackets, punctuation and whitespace characters must be prefixed with the escape character (%) if they are used as ordinary symbols. In addition to (48), the lexicon compiler recognizes the following regular expression operators:
(49)
The iteration operator has the same precedence as * and +. (50)
(51)
Using the insert and iteration constructs we can abbreviate our date lexicon--cf. (6), (12), (13) in Section 2--to the form shown in (52). We assume here that the files 0to9.fsm and 1to9.fsm have the contents suggested by the name, a net with the digits 0 to 9 and 1 to 9, respectively. (52)
7.1 Saving The two output commands ('save-source' and 'save-result') save the net in the binary format described in [6]. This same format is used by the rule compiler to store transducers. If the user is only interested in the surface forms of the result net, the surface language of the transducer can be extracted with the command 'extract-surface-forms'. This operation converts the result net to a simple automaton. 7.2 Storage The command 'storage' gives statistics of memory allocation: (53)
7.3 Time The command 'time' toggles an ON/OFF switch (default OFF). If the switch is ON, the execution time is printed for all commands. Thanks to Beth Bryson, Ronald M. Kaplan, Hyuk-Chul Kwon, Christian Rohrer and Annie Zaenen for comments and suggestions. Special thanks to Kenneth R. Beesley for editorial assistance. References[1] Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki. 1983. [2] Antworth, Evan L. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16, Summer Institute of Linguistics, Dallas, Texas. 1990. [3] Karttunen, Lauri and Kenneth R Beesley. Two-Level Rule Compiler Technical Report. ISTL-922. Xerox Palo Alto Research Center. Palo Alto, California. 1992. [4] Ritchie, Graeme D., Graham J. Russell, Alan W. Black, and Stephen G. Pulman. Computational Morphology: Practical Mechanisms for the English Lexicon. MIT Press. 1992. [5] Sproat, Richard. Morphology and Computation. MIT Press. Cambridge, Massachusetts. 1992. [6] Karttunen, Lauri. Binary Encoding Format for Finite-State Networks. Technical Report SSL-9027. Xerox Palo Alto Research Center. Palo Alto. March 1990. [7] Karttunen, Lauri, Ronald M. Kaplan and Annie Zaenen. "Two-Level Morphology with Composition". In the Proceedings of the Fifteenth International Conference on Computational Linguistics. COLING-92. Nantes, 2328 August 1992. Vol. I 141-48. ICCL. 1992. On-line help messages
Cascaded composition: French compound plurals This appendix presents a simple example of cascaded composition, based on the 1992 Coling paper by Karttunen, Kaplan, and Zaenen [7]. In cascaded composition, two or more rule transducers are composed with the lexicon in a sequence. Introducing a cascade does not enrich the descriptive power of two-level morphology. Any regular relation than can be described by two rules or rule systems applying in a sequence could also be described by a single two-level grammar. But in the case of French compound plurals a single two-level description is clearly more complicated for the human rule writer than two rule systems that apply in a sequence. Let us first construct a lexicon and a rule system that describes the inflection of simple nouns and adjectives in French. Depending on the type of the stem, nouns have three types of plural endings: -s (chef:chefs), -x (cheval:chevaux), or zero (nez:nez, prorata:prorata). Feminine adjectives in general add an -e to the end of the stem, which can have an effect on how the plural is realized (social:sociaux Masc, sociale:sociales Fem). Some words have alternate forms in the plural, for example, travail:travaux~travails. The sample lexicon in (1) contains a few words that illustrate these phenomena. The lexical form of cheval, for example, is encoded as cheval+N+Masc+Sg. (1)
Words that end in a sibilant (nez) are typically invariant in
the plural; words like prorata 'proportional share' that exceptionally
take no plural ending are marked in the lexicon with the diacritic 0:^.
Note that there are two entries for travail. One of them gets
a diacritic, 0:^s inserted between the stem and the gender
marker. We use this diacritic to control the shape of the plural (travaux or travails).
Because of the presence of the diacritic pairs, the compilation of (1)
creates a transducer. The source net maps all other lexical forms to
themselves but prorata+N+Masc+Pl is mapped to prorata^+N+Masc
+Pl and travail+N+Masc+Pl to both travail+N+Masc+Pl and travail^s+N+Masc+Pl. A simple rule system that describes the realization of these lexical
forms is presented in (2).
(2)
We compile the source lexicon in the usual manner and compose it with
the rule transducers. The output from check-all verifies that the rules
have the intended effect.
The form travail+N+Masc+Pl gets two surface realizations
because there are two intermediate forms travail+N+Masc+Pl and travail+^s+N+Masc+Pl.
The presence of ^s exempts the latter form from the alternations
that regularly take place in XContext. The diacritic itself
disappears in the composition. A similar case is illustrated in Section 3.2
in examples (34) and (35). In compound nouns and adjectives several patterns are possible. Sometimes
only the first part of the compound is marked for the plural (hôtels-dieu),
sometimes the last part (mini-jupes), sometimes both parts (portes-fenêtres).
Some compounds are invariant in the plural (porte-plume). It is not
possible to predict from the form alone (and not always from the meaning either)
which class a given compound belongs to. We assume here that it has to be marked
in the lexical form. The rules in (2) only describe the pattern where the plural is realized at
the end of the second word. It is possible in principle to modify these rules
so that they also correctly account for other types of compounds, but this
is not a simple matter. It would fail to capture the obvious fact that the
plurals portes and fenêtres in portes-fenêtres in
themselves are regular; the only thing that is special about the word is that
plurality is expressed in both parts of the compound.
We avoid these complications by creating a second rule system that determines
whether the plurality is expressed by the first word, by the second word or
both. Because the new rules are triggered by compound diacritics, they have
no effect on simple nouns. The rules are listed in (4).
We use the diacritic ^n to mark compounds like hôtel-dieu that
mark the plural on the first noun. ^nn is used for noun-noun
compounds like porte-fenêtre that have double plurals. ^an is
a marks words like social-démocrate where the first part
of the compound is inflected as an adjective for gender and number. Rules
1-4 insert the appropriate category, gender, and number at the end of
the first word in the compound. Rules 5-7 delete the morphological codes
at the end of the second noun in words like hôtel-dieu.
This new set of rules is ordered before the rules in (2). The correct
result can be produced in either of two ways. (i) We intersect the rules
in each set to a single transducer, compose the results, and then compose
the lexicon with a single rule transducer, as described in [6]. Alternatively,
(ii) we compose the lexicon with the rules in (4) and compose the result
with the rules in (2). Here we take the second option. But let us first
expand our noun lexicon in (1) with the new entries and sublexicons shown
in (5). (5)
In (6) we compile the new source lexicon and compose it with the rules in (4). (6)
The next step is to move the result net to the source position. After reading in the second set of rules, we compose again for the final result. (7)
The check-all listing in (8) verifies that the result is as we expect. (8)
The minor differences in word counts--42 vs. 41 in (7), 41 vs. 40 in (8)--are caused by the diacritic ^s in the intermediate forms of travail. The sequence of diagrams (9)-(11) illustrates the effect of the two compositions on the plural form of cheval-vapeur 'horse-power'. (9) is the mapping in the source lexicon. (9)
The first composition with the rules in (4) copies the morphological tags to the end of the first part of the compound. The resulting structure is shown in (10). Because cheval-vapeur is a compound that marks plurality only on the first word, the tags following vapeur are removed. These changes are triggered by the diacritic ^n, which itself disappears. (10)
The second composition with the rules in (2) realizes the plural marker at the end of cheval in the regular way and the final structure emerges in (11). (11)
As we mentioned above, exactly the same result could be produced by another route of computations. The intersections of the two rule sets, (4) and (2), could be merged to a single rule automaton by composition. The composition of the lexicon with the resulting single rule transducer yields the same result as the cascading composition illustrated here.
|
|