Home page Site map Contact
  

 

FINITE-STATE LEXICON COMPILER

 

LANGUAGE-SPECIFIC TOOLS (tokenization, analysis, disambiguation) :

Technical Report. ISTL-NLTT2993-04-02. April 1993. Xerox Palo Alto Research Center. Palo Alto, California. 1993. Copyright (c) 1993 Xerox Corporation. All rights reserved.

Finite-State Lexicon Compiler

Lauri Karttunen
Xerox Palo Alto Research Center

Abstract :

The Finite-State Lexicon Compiler is an authoring tool for creating lexicons and lexical transducers. It is designed to be used in conjunction with transducers produced with the Xerox Two-level Rule Compiler.

Table of Contents

  1. Introduction
  2. Making simple automata
    1. A few words
    2. A lexicon of dates
    3. Using regular expressions
    4. Composition with a rule transducer
  3. Making lexical transducers
    1. Simple lexical transducer
    2. Composing with rule transducers
    3. Cascade of compositions
  4. Checking the result
  5. Scripting
    1. Simple script
    2. Using make
  6. Regular expressions
  7. Miscellaneous
  8. Acknowledgements, References
  9. Appendix 1: Help, Summary of commands
  10. Appendix 2: Sample description

1. What is it?

The Finite-State Lexicon Compiler ("lexc") is an authoring tool for creating lexicons and lexical transducers. A lexical transducer is a specialized finite-state automaton that maps lexical forms and morphological specifications to corresponding inflected forms, and vice versa. For example, a lexical transducer for English might relate the word dining to dine+PresPart and swam to swim+Past.

The input to the lexicon compiler is a text file in a format similar to lexicons accepted by Kimmo Koskenniemi's two-l program [1] and Evan Antworth's PC-KIMMO [2] but with several novel features. The most notable innovation is that the entries in the source lexicon may specify a relation between two forms. Idiosyncratic mappings, such as the i~a alternation in swam may be encoded directly in the lexicon whereas regular alternation, such as the loss of the stem-final e in dining may be handled by general rules.

The output from the compiler is a simple finite-state automaton or a transducer. The result of the compilation may be composed with a rule transducer or transducers that effect morphological alternations or in other ways modify the initial result. The lexicon compiler is designed to be used with transducers produced with the Xerox two-level rule compiler [3].

The compiler was written in C at Xerox PARC in 1992-93 by Lauri Karttunen and Todd Yampol. It is based on Karttunen's 1990 Common Lisp implementation of a similar utility. The compiler currently works on the Macintosh and UNIX operating systems. The interface to the compiler is the same in both versions except that the Macintosh version allows the user to interact with the file system using mouse clicks and dialogue boxes.

Section 2 of this document (Making simple automata) gives a brief tutorial on how the compiler can be used to make finite-state lexicons. Section 3 (Making lexical transducers) introduces more sophisticated techniques. Section 4 (Checking the result) illustrates ways to explore the compiled lexicon. Section 5 (Scripting) is about setting up files that make it easier to update the lexicon. Section 6 (Regular expressions) lists the regular expression operators that can be used to construct lexical entries. On-line help messages are listed in APPENDIX 1. APPENDIX 2 is a detailed example of cascaded composition.

2. Making simple automata

When you start the compiler (by double-clicking the icon on a Macintosh or by typing 'lexc' in UNIX), it prints a welcoming banner as shown in (1) and waits for the user to type a command terminated with a carriage return on the lexc> prompt line. The commands are grouped under three general headings Input/Output, Operations, and Display. Each group is further divided into subsections of related commands. The user may review the list at any time by answering the prompt with a question mark. The 'help' command, followed by a name of a particular command, gives advice on what the commands do, 'help all' prints out all help messages.

(1)



               ***************************************	
               *        Two-Level Lexicon Compiler  1.2       *	
               *                created by                    *	
               *         Lauri Karttunen and Todd Yampol*	
               *   Copyright (c) 1992 by the Xerox Corporation.*	
               *             All Rights Reserved.*	
               ***************************************
        


Input/Output
--------------------------------------------------------------
Source: compile-source, read-source, result-to-source, save-source.
Rules: read-rules.
Result: read-result, save-result, source-to-result.
Operations
----------------------------------------------------------------
Composition: compose-result, extract-surface-forms.
Checking: check-all, lookdown, lookup, random.
Switches: duplicates, failures, singles, time.
Scripts: begin-script, end-script, run-script.
Display
-------------------------------------------------------------------
Misc: banner, labels, status, storage.
Help: help, ?.
Type 'quit' to exit.

lexc>

The three main commands are 'compile-source', 'read-rules', and 'compose-result'. The term source net is used for the automaton (either a simple lexicon or a transducer) compiled from the input lexicon. The command 'compile-source' creates the source net. The term rule nets refers to a set of rule transducers created by the two-level rule compiler (or by other means) and saved in binary form. The command 'read-rules' reads the rule nets from a file. The command 'compose-result' composes the source net with the rule net or nets and produces the result net. We will start with these commands and explain the rest as we go along.

2.1 A few words

To start off with a very simple example, we construct a lexicon that contains the forms of two lexemes, dine and line. Let us assume that line is both a noun and a verb, dine only a verb. The simplest way to define the lexicon is just to list all the forms as shown in (2) under the heading LEXICON Root. In general, there may be any number of sublexicons, but one of them must have the name Root. The other sublexicons, if any, are named according to the needs and tastes of the lexicon writer.

Entries are separated by semicolons. Each entry consists of a form, possibly null, and a continuation class. The continuation class must be the name of another sublexicon or #, a terminator symbol. The word END at the end of the file is optional. The exclamation point, !, marks a comment; the rest of the line is ignored. Compilation of this lexicon creates a finite-state machine (fsm) that accepts all only the listed word forms.

(2)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex0-lex.txt

LEXICON Root
dine #; dines #; dined #;
line #; lines #; lined #;

END

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

To get a more interesting example, let us define another version of the lexicon that distinguishes stems from suffixes. The new version is listed in (3). Some entries have no content other than the continuation class, some have a sublexicon name as the continuation class.

(3)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex1-lex.txt

LEXICON Root
Noun;
Verb;

LEXICON Noun
line NounSuffix;

LEXICON Verb
dine VerbSuffix;
line VerbSuffix;

LEXICON NounSuffix
s #;
#;

LEXICON VerbSuffix
s #;
d #;
#;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Example (4) shows what happens when this file is processed by the compiler. Note that the name of the text file containing the lexicon specification is ex1-lex.txt.

(4)


lexc> compile-source ex1-lex.txt
Opening 'ex1-lex.txt'...
Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
Building lexicon...Minimizing...Done!
SOURCE: 6 states, 7 arcs, 6 words


The resulting lexicon contains the same six words we saw in (2). The form lines actually gets constructed twice, once as a verb, once as a noun. After minimization, only one of them remains. The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized. The form dines, for example, is created by joining its components in the manner shown in (5). The first component is a null string.

(5)

Sublexicon: Root Verb VerbSuffix
Substring:   dine s

The command 'save-source' saves the compiled source net as a binary file. In the case at hand, the source net is the simple six-state automaton shown in Figure 1.

Figure 1

2.2 A lexicon of dates

To move on to a more challenging example, let us construct a lexicon that contains all valid dates of the form "<month> <date>, <year>", such as March 14, 1993. To keep things simple, we adopt the restriction that dates must conform to this format exactly; that is, a space between the month and the day, a comma and a space between the day and the year. Let us also stipulate that <year> must be a number between 0 and 9999. Every date in the lexicon should of course be valid; March 31, 1993 is included but April 31, 1993 is not because there are only 30 days in April.

Because whitespace characters (blank, tab, carriage return, linefeed) are generally ignored by the compiler, we must use a percent sign, %, as an escape character in front of a blank that should not count as whitespace. For reasons that will soon become evident, the numeral 0 is also special. If it is used as an ordinary digit in a word, we must prefix it with %. To construct the date June 10, 2000, for example, we need to specify it as June% 1%0,% 2%0%0%0. Zero is special only in the word, not in the continuation class of an entry.

(6)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex2-lex.txt

LEXICON Root
January 31days; February 29days; March 31days; April 30days;
May 31days; June 30days; July 31days; August 31days;
September 30days; October 31days; November 30days; December 31days;

LEXICON 31days
% 31 Years; 30days;

LEXICON 30days
% 3%0 Years; 29days;

LEXICON 29days
%1To9Days; % 1 0To9Days; % 2 0To9Days;

LEXICON 0To9Days
%0 Years; 1To9Days;

LEXICON 1To9Days
1 Years; 2 Years; 3 Years; 4 Years; 5 Years;
6 Years; 7 Years; 8 Years; 9 Years;

LEXICON Years
,% ThousandsOfYears; ,% HundredsOfYears; ,% TensOfYears;
,% 0To9Years;

LEXICON ThousandsOfYears
1 0To999Years; 2 0To999Years; 3 0To999Years; 4 0To999Years;
5 0To999Years; 6 0To999Years; 7 0To999Years; 8 0To999Years;
9 0To999Years;

LEXICON 0To999Years
%0 0To99Years; HundredsOfYears;

LEXICON HundredsOfYears
1 0To99Years; 2 0To99Years; 3 0To99Years; 4 0To99Years;
5 0To99Years; 6 0To99Years; 7 0To99Years; 8 0To99Years;
9 0To99Years;

LEXICON 0To99Years
%0 0To9Years; TensOfYears;

LEXICON TensOfYears
1 0To9Years; 2 0To9Years; 3 0To9Years; 4 0To9Years; 5 0To9Years;
6 0To9Years; 7 0To9Years; 8 0To9Years; 9 0To9Years;

LEXICON 0To9Years
%0 #; 1 #; 2 #; 3 #; 4 #; 5 #; 6 #; 7 #; 8 #; 9 #;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


The complete listing for the date lexicon is given in (6). It is a little more complicated than one might expect because it enforces the constraint that the day numeral must not be larger than the number of days in the month. (February is a special case; we will come to that shortly.) Another reason for the complexity is that we decided to avoid leading zeros. For example, January 1, 100 is included but January 1, 0100 is not considered a valid date. A more compact encoding for the same lexicon is presented in (52), Section 6.

Example (7) illustrates the structure of the lexicon by showing how the date March 14, 1993 is assembled by linking the components in thirteen sublexicons.

(7)

Sublexicon Substring

Root March
31days
30days
29days 1
0To9Days
1To9Days 4
Years ,
ThousandsOfYears 1
0To999Years
HundredsOfYears 9
0To99Years
TensOfYears 9
0To9Years 3

The listing in (8) traces the compilation of the date lexicon.

(8)


lexc> compile-source ex2-lex.txt
Opening 'ex2-lex.txt'...
Root...12, 31days...2, 30days...2, 29days...3, 0To9Days...2,
1To9Days...9, Years...4, ThousandsOfYears...9,
0To999Years...2,
HundredsOfYears...9, 0To99Years...2, TensOfYears...9,
0To9Years...10
Building lexicon...Minimizing...Done!
SOURCE: 64 states, 147 arcs, 3660000 words.

lexc> save-source ex2-lex.fsm
Opening 'ex2-lex.fsm'...
Done.

As it reads the source file, the compiler prints the name of each sublexicon and the number of entries in it. When the compilation is finished, the command 'random' is a convenient way to examine the contents of the lexicon. It randomly prints 15 words from the lexicon. Any gross errors are likely to show up quickly. Note that there are altogether 3.66 million dates in our 64-state lexicon. A few too many for ten thousand years--we have yet to eliminate unwanted leap days.

(9)


lexc> random
NOTE: Using SOURCE.
November 18, 2
September 10, 3
June 4, 2985
October 4, 8
December 9, 5253
March 8, 6
February 4, 7
October 9, 7103
November 9, 4805
October 31, 0
August 6, 28
September 4, 63
September 9, 1
September 8, 8767
December 3, 160

The note Using SOURCE means that the examples are generated from the source net. In the case at hand, there is no result net so far (recall that a "result net" is the result of composing a source net with a rule net or nets). The randomization algorithm is useful but not optimal; short words are much more likely to show up than long ones.

The command 'lookup' enables the user to query whether a particular word (a date in this case) is in the lexicon. If the word is found, it is echoed; a line of stars indicates that the form was not found in the lexicon.

(10)


lexc> lookup March 14, 1993
NOTE: Using SOURCE.
March 14, 1993
lexc> lookup February 30, 1993
NOTE: Using SOURCE.
****************
lexc> lookup
NOTE: Using SOURCE.
Word: February 29, 1993
February 29, 1993

If the argument to a command is not included on the same line with the command, the program prompts for it, as in the last example.

2.3 Using regular expressions

Although the lexicon in (6) is correct (except for leap days), it is more cumbersome than it needs to be. For example, the last sublexicon (0To9Years) says, for each of the ten digits, that the string may terminate with that digit.

(11)


LEXICON 0To9Years
%0 #; 1 #; 2 #; 3 #; 4 #; 5 #; 6 #; 7 #; 8 #; 9 #;

It would be simpler to say--just once--that any of the ten digits counts as an acceptable termination. The lexicon compiler also accepts the latter type of specification. The form part of an entry may be not just a simple string but a regular expression describing a set (possibly infinite) of forms that share a common continuation class. In the case at hand, we may combine the ten entries in (11) to the single entry in (12).

(12)


LEXICON 0To9Years
<%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9> # ;

The angle brackets indicate that the expression contained within them is to be interpreted as a regular expression and not as an ordinary string; the vertical bar is a symbol for disjunction.

Inside a regular expression, all symbols must be surrounded by whitespace or delimiters. (13) illustrates an alternative form of the Root lexicon in which 31 and 30 day months are enumerated by regular expressions.

(13)


LEXICON Root
<J a n u a r y | M a r c h | M a y | J u l y |
A u g u s t | [O c t o | D e c e m] b e r > 31days;
<A p r i l | J u n e | [S e p t | N o v ] e m b e r> 30days;
February 29days;

More about regular expressions in Section 6.

2.4 Composition with a rule transducer

Let us now consider the problem of how to eliminate from our lexicon the dates for nonexistent leap days, such as February 29, 1993. February has 29 days only in years that are divisible by four, except that centuries are leap years only if the number is still divisible by four when the last two zeros are left out. For example, 1900 was not a leap year but in the year 2000 there will be 29 days in February.

We could solve this problem by constructing a special chain of sublexicons for February alone but it would be complicated. An equivalent but notationally easier solution is to eliminate the unwanted leap days by composing the source lexicon with a finite-state transducer (fst) that expresses the rule stated above. This provides us with a simple example of how the lexicon compiler interacts with transducers produced with the two-level rule compiler [3].

A two-level grammar for prohibiting the occurrence of February 29 with an invalid year is presented in (14).

(14)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex3-rule.txt : Restricting "February 29" to leap years

Alphabet
A D F J M N O S a b c e g h i l m n o p r s t u v y % %,
%0 1 2 3 4 5 6 7 8 9 #:0;

Sets
Digit = %0 1 2 3 4 5 6 7 8 9 ;
Even = %0 2 4 6 8 ;
Odd = 1 3 5 7 9 ;

Definitions
BadOne = \[%0 | 4 | 8] ; ! Any symbol other than 0, 4, or 8.
BadTwo = Even BadOne | Odd \[2|6] ;
NotDivisibleByFour = BadOne | Digit* BadTwo ;

Rules

"No February 29" ! The string "February 29, " must not be followed
! by a year not divisible by four or
! by a century that is not divisible by four

F e b r u a r y % 2 9 %, % /<= _ NotDivisibleByFour (%0 %0) #: ;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Please consult [3] for a detailed explanation of the rule formalism used here. Ritchie et al. [4] and Sproat [5] are good general introductions to two-level morphology. For the sake of this example, it is sufficient to know that the relatively straightforward encoding of the leap day constraint can easily be compiled to the transducer shown in (15) in the form of a transition table.

The rows in (15) represent numbered states, the columns are labeled with a symbol that represents an "equivalence class," a set of symbols that are equivalent within the fst. The equivalence classes are listed at the bottom of the table; the first symbol in each class is used in the table to represent the class. This transducer is trivial in the sense that it maps every valid date to itself without changing anything. The implicit boundary symbol, #, at the end and at the beginning of a date is mapped to the epsilon symbol (printed as 0). Invalid dates that contain February 29 in the wrong type of year disappear in the transduction.

The top half of the table tracks the string February 29, to state 14, the bottom half tracks the years up to the implicit final word boundary #:0 (here zero represents the epsilon symbol). The gaps in the last column indicate states where February 29 is followed by the wrong kind of year or no year at all. Because these states do not allow the date to end there, the unwanted leap dates are eliminated.

(15)


"No February 29"
A F a b e r u y , 0 1 2 4 6 9 #:0
1: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2: 1 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
3: 1 2 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
4: 1 2 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1
5: 1 2 1 1 1 1 6 1 1 1 1 1 1 1 1 1 1
6: 1 2 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1
7: 1 2 1 1 1 8 1 1 1 1 1 1 1 1 1 1 1
8: 1 2 1 1 1 1 1 9 1 1 1 1 1 1 1 1 1
9: 1 2 1 1 1 1 1 1 10 1 1 1 1 1 1 1 1
10: 1 2 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1
11: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 12 1
12: 1 2 1 1 1 1 1 1 1 13 1 1 1 1 1 1 1
13: 1 2 1 1 1 1 1 1 14 1 1 1 1 1 1 1 1
14: 17 20 17 17 17 17 17 17 17 17 14 15 16 14 16 15 17
15: 17 20 17 17 17 17 17 17 17 17 23 15 14 16 14 15
16: 17 20 17 17 17 17 17 17 17 17 21 15 16 14 16 15
17: 1 2 1 1 1 1 1 1 1 1 18 1 1 1 1 1
18: 1 2 1 1 1 1 1 1 1 1 19 1 1 1 1 1 1
19: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
20: 1 2 1 1 3 1 1 1 1 1 18 1 1 1 1 1
21: 17 20 17 17 17 17 17 17 17 17 22 15 16 14 16 15 17
22: 17 20 17 17 17 17 17 17 17 17 14 15 16 14 16 15
23: 17 20 17 17 17 17 17 17 17 17 24 15 16 14 16 15
24: 17 20 17 17 17 17 17 17 17 17 22 15 16 14 16 15
Equivalence classes:
((A D J M N O S c g h i l m n o p s t v) (F) (a) (b) (e) (r)
(u) (y) ( ) (,) (0) (1 3 5 7) (2) (4 8) (6) (9) (#:0))

Let us now see how the transducer is applied to the lexicon. We assume that the "No February 29" rule has already been compiled by the two-level rule compiler and is stored in the binary file ex3-rule.fst. In (16), we first compile the source lexicon as before, then load the rule transducer with the command 'read-rules'.

(16)


lexc> compile-source ex2-lex.txt
Opening 'ex2-lex.txt'...
Root...12, 31days...2, 30days...2, 29days...3, 0To9Days...2,
1To9Days...9, Years...4, ThousandsOfYears...9,
0To999Years...2,
HundredsOfYears...9, 0To99Years...2, TensOfYears...9,
0To9Years...10
Building lexicon...Minimizing...Done!
SOURCE: 64 states, 147 arcs, 3660000 words.

lexc> read-rules ex3-rule.fst
Opening 'ex3-rule.fst'...
1 net read.

Alternatively, we could start by reading the rules. The order does not matter in this case, although it may be relevant if the alphabet of the lexicon contains multicharacter symbols. We will come back to this later.

An argument to the 'read-rules' command should be a name of a Xerox format [6] binary file containing one or more transducers. Any file of this type is acceptable regardless of whether it was produced with the two-level rule compiler or by other means.

The command 'compose-result' creates the result net by applying the rule transducer (or transducers) to the source net.

(17)


lexc> compose-result
No epenthesis.
Initial and final word boundaries added.
..............Done.
90 states, 320 arcs, 3652425 words.
Minimizing...Done.
81 states, 262 arcs, 3652425 words.

The result of the composition looks promising: the result net has 17 states more than the source net, but the word count has decreased by 7575 dates. The 'status' command in (18) gives a quick overview of the situation.

(18)


lexc> status
SOURCE: ex2-lex.txt = 64 states, 147 arcs, 3660000 words.
RULES: ex3-rule.fst = 1 transducer.
RESULT: ex2-lex.txt+ex3-rule.fst = 81 states, 262 arcs, 3652425 words.
Switches: singles=OFF, duplicates=ON, failures=ON, time=OFF, verbose=OFF

We can now use 'lookup' to verify that the composition had the intended effect. Because we now have a source and a result net, we can choose which one to use. If a result net has been computed, it is the default target for lookup. As we see by the string of asterisks returned, February 29, 1993 has disappeared from the lexicon.

(19)


lexc> lookup February 29, 1993
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
****************

As intended, the year 2000 is correctly included as a leap year whereas the year 1900 has been eliminated as it should be.

(20)


lexc> lookup February 29, 1900
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
****************

lexc> lookup February 29, 2000
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
February 29, 2000

Because everything appears to be in order, we save the result with the 'save-result' command. The command 'save-source' would save the source net. More about saving in Section 7.

(21)


lexc> save-result ex3-lex.fsm Opening 'ex3-lex.fsm'... Done.

3. Making lexical transducers

3.1 A simple lexical transducer

A lexical transducer, first discussed in Karttunen, Kaplan, and Zaenen [7], is a lemmatizer that maps inflected surface forms to their canonical dictionary representations and vice versa. In simple cases, such transducers can be compiled directly from a source lexicon. For example, we can modify the minilexicon in (3) to make it a morphological analyzer/generator for the forms of dine and line. The new lexicon is shown in (22).

(22)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex4-lex.txt

Multichar_Symbols
+N +Sg +Pl +V +Base +Past +PastPart +Sg3

LEXICON Root
Nouns; Verbs;

LEXICON Nouns
line N;

LEXICON Verbs
dine V; line V;

LEXICON N
+N:0 NounInfl;

LEXICON NounInfl
+Sg:0 #; +Pl:s #;

LEXICON V
+V:0 RegVerbConj;

LEXICON RegVerbConj
VCommon; RegPast; RegPastPart;

LEXICON VCommon
+Base:0 #; +Sg3:s #;

LEXICON RegPast
+Past:d #;

LEXICON RegPastPart
+PastPart:d #;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In addition to the citation form of the word itself, the entries in (22) contain morphological information: part of speech (+N, +V) and type of inflection (+Pl, +Sg, +Base, +Sg3, +Past, +PastPart). The main technical difference between the old and the new version is that some of the entries now contain a pair of forms: a lexical form and a surface form, separated by a colon. The compiler interprets such a pair as a regular relation, a sequence of pairwise correspondences between the symbols. For example, the entry +Pl:s pairs the lexical form +Pl with the surface form s. In other words, in forms constructed with this entry, plural is realized as s.

The declaration Multichar_Symbols at the beginning of the file tells the compiler that +Pl and the other morphological tags are to be parsed as a single symbol rather than as a sequence of symbols. Using multicharacter symbols conveniently makes the morphological alphabet disjoint from ordinary letters. In pairs such as +N:0, +Sg:0, the numeral 0 stands for epsilon, the empty string, whereas the real zero is designated by %0.

(23) shows the composition of the surface form lines in (22), given the interpretation of zero as the empty string. The single form entries, e.g. line, are viewed as equivalent to identity mappings (line:line).

(23)


Sublexicon: Root Nouns N NounInfl Lexical form: line +N +Sg Surface form: line s

The command 'compile-source' converts the lexicon in (22) to a transducer, which can be used bidirectionally for analysis ('lookup') as well as generation ('lookdown'):

(24)


lexc> lookup lines
NOTE: Using SOURCE.
line+N+Pl

lexc> lookup dined
NOTE: Using SOURCE.
dine+V+PastPart
dine+V+Past

The 'lookup' command matches its argument against the lower side of the lexical transducer (ignoring epsilons) and produces the corresponding lexical string or strings if the match succeeds; 'lookdown' does the same operation in the opposite direction.

(25)


lexc> lookdown dine+V+Sg3
NOTE: Using SOURCE.
dines

lexc> lookdown line+N+Pl
NOTE: Using SOURCE.
lines

Because entries in the lexicon specification may consist of a pair of forms, it is simple to deal with words that have irregular surface forms in their paradigm. For example, strong verbs like swim have irregular past and past participle forms; in other respects they are regular. To add swim to our minilexicon, we need three new entries in LEXICON Verbs for the verb itself and three new sublexicons. The new sublexicons, shared by irregular forms of other strong verbs, encode the fact that in these verbs the past tense and past participle categories are expressed as a stem alternation and not by a suffix. These additions are listed in (26).

(26)


swim VIrreg;
swim:swam VIrregPast;
swim:swum VIrregPastPart;

LEXICON Virreg
+V:0 VCommon; ! to get "swims"

LEXICON VIrregPast
+V+Past:0 #; ! or <+V:0 +Past:0>

LEXICON VIrregPastPart
+V+PastPart:0 #;

When the lexicon is compiled into a transducer, the pairs swim:swam and swim:swum are combined into a sequence of symbol pairs in which every character of the lexical form is paired with the corresponding symbol of the surface form. If the symbols are identical, we represent the result as a single symbol (s instead of s:s), otherwise as a pair (i:a and i:u). If one of the forms is shorter than the other, epsilons are automatically added to the end to make the lengths equal. The past tense of swim is represented in the transducer as the path of symbols and symbol pairs shown in (27).

(27)

The commands 'lookup swam' and 'lookdown swim+V+Past' both locate this same path in the transducer by matching the argument against one or the other side of the transducer and printing out the symbols of the opposite side, or the same symbols if there is no difference. Because of epsilons and because one input symbol may correspond to more than one symbol on the opposite side, these operations in a transducer are in general not deterministic. On the other hand, lexical transducers have the advantage that they are bidirectional: analysis and generation make use of the same data structure and the same search algorithm.

The simple left-to-right conversion of the lexical and surface forms to a sequence of symbol pairs can lead to nonoptimal results. If the past tense of fight is encoded as fight:fought, the symbols line up in the manner shown in (28).

(28)


Lexical string: f i g h t
Surface string: f o u g h t

Because by default the compiler adds an epsilon to the end of the shorter string, the resulting path, <f i:o g:u h:g t:h 0:t>, is very irregular. We can avoid such ill-matched sequences by specifying in the lexical entry directly where epsilons should go. The optimal solution in this case is fi0ght:fought, which yields the path <f i:o 0:u g h t>. Here, as elsewhere, the 0 that represents an epsilon is not preceded by a percent sign.

3.2 Composing with rule transducers

Let us consider augmenting our minilexicon (22) with the present participle forms dining, lining, and swimming. We need a new sublexicon for present participle and a way of linking it to the entries for dine, line, and swim. The deletion of the stem final e in dining and lining and the gemination of m in swimming could be handled in the same way as the stem alternation in swim and swum:

(30)


dine:din
line:lin
swim:swimm

But this would be unsatisfactory because the alternations in (30) are not idiosyncratic to these particular words; they apply productively to all words of the same type.

Our solution is to write general e-deletion and gemination rules that apply to the canonical lexical form in the environment of a vowel-initial suffix. When the source lexicon is composed with the rule transducers, the result transducer will correctly represent these alternations. The two rules are shown in (31). See [3] for details of the syntax for specifying two-level rules.

(31)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex5-rules.txt

Alphabet
a b c d e f g h i j k l m n o p q r s t u v w x y z %^:0
#:0;

Sets
Cons = b c d f g h j k l m n p q r s t v w x y z ;
Vowel = a e i o u ;

Rules
"E-deletion" ! A stem final e is deleted between a
! consonant and a suffix initial vowel
! Note: ^ marks the suffix boundary.
e:0 <=> Cons _ %^: Vowel;

"Gemination" ! The stem final consonant in a CVC
structure
! is doubled before a suffix initial vowel

0:Cx <=> Cons Vowel Cx %^: _ Vowel ; where Cx in (b d g m n p r s t z);

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

(32) is our minilexicon in its final form.

(32)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex5-lex.txt

Multichar_Symbols
+N +Sg +Pl +V +Base +Past +PastPart +Sg3 +PresPart

LEXICON Root
Nouns; Verbs;

LEXICON Nouns
line N;

LEXICON Verbs
dine V; line V; stop V;
swim VIrreg; swim:swam VIrregPast; swim:swum
VIrregPastPart;

LEXICON N
+N:0 NounInfl;

LEXICON NounInfl
+Sg:0 #; +Pl:s #;

LEXICON V
+V:0 RegVerbConj;

LEXICON RegVerbConj
VCommon; RegPast; RegPastPart;

LEXICON VCommon
+Base:0 #; +Sg3:s #; 0+PresPart:^ing #;

LEXICON RegPast
0+Past:^ed #;

LEXICON RegPastPart
0+PastPart:^ed #;

LEXICON VIrreg
+V:0 VCommon;

LEXICON VIrregPast
+V+Past:0 #;

LEXICON VIrregPastPart
+V+PastPart:0 #;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

After compilation, the present participle form of swim has the structure shown in (33).

(33)


Lexical Form: s w i m +V +PresPart
Intermediate Form: s w i m ^ i n g

The lower side of the source transducer contains "intermediate" forms. The surface form or forms will be determined by the application of the two-level rules in (31). In order to control the environment in which these rules apply, we explicitly mark the boundary between the stem and the present participle ending in the lexicon. The boundary marker, ^, corresponds to an epsilon in the lexical form (see the last entry in LEXICON VCommon). Because the only surface realization of ^ allowed in the rule alphabet is 0 (epsilon), the boundary marker disappears from the result when the source net is composed with the rule transducers.

(34) illustrates the effect of the composition.

(34)


Lexical Form: s w i m +V +PresPart
Intermediate Form: s w i m ^ i n g
Surface Form: s w i m m i n g

In the composition the intermediate form is eliminated. The result net maps swim+V+PresPart directly to swimming:

(35)


Lexical Form: s w i m +V +PresPart
Surface Form: s w i m m i n g

Because we need the e-deletion rule anyway for the present participle forms of dine and line, we changed the past and past participle suffix from d to ed. This enables us to handle all forms of regular verbs ending in a consonant, such as stop.

(36) traces the construction of the final result with the compiler. We assume here that the rules in (31) have already been compiled and stored in the binary file ex5-rules.fst.

(36)


lexc> read-rules ex5-rules.fst
Opening 'ex5-rules.fst'...
2 nets read.

lexc> compile-source ex5-lex.txt
Opening 'ex5-lex.txt'...
Root...2, Nouns...1, Verbs...6, N...1, NounInfl...2, V...1,
RegVerbConj...3, VCommon...3, RegPast...1, RegPastPart...1,
VIrreg...1, VIrregPast...1, VIrregPastPart...1
Building lexicon...Minimizing...Done!
SOURCE: 30 states, 42 arcs, 22 words.

lexc> compose-result
Epenthetical forms are possible.
Initial and final word boundaries added.
.....Done.
45 states, 59 arcs, 22 words.
Minimizing...Done.
33 states, 48 arcs, 22 words

Let us get an overview of the situation with the 'status' command:


lexc> status
SOURCE: ex5-lex.txt = 30 states, 42 arcs, 22 words.
RULES: ex5-rules.fst = 2 transducers.
RESULT: ex5-lex.txt+ex5-rules.fst = 33 states, 48 arcs, 22 words.
Switches: singles=OFF, duplicates=ON, failures=ON, time=OFF, verbose=OFF

As we see in (36) there are two transducers in the rule file, compiled from the two rules in (31) with the two-level rule compiler [3]. In the previous composition example in Section 2.4, we only had a single rule transducer. The result here is what we get by composing the source net with the intersection of the two rule transducers.

To verify that we have the expected result, we can again use 'lookup' and 'lookdown', first on the result net:

(37)


lexc> lookup swimming
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
swim+V+PresPart

lexc> lookup dining
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
dine+V+PresPart

lexc> lookdown line+V+PresPart
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
lining

If we choose the source net as the target, we can still see the intermediate forms. The source net is left unchanged by the compose operation.

(38)


lexc> lookdown dine+V+PresPart
Use (s)ource or (r)esult? [r]: s
NOTE: Using SOURCE.
dine^ing

Section 4 discusses other ways of examining the result of the composition.

3.3 Cascade of compositions

For languages with a more complex morphology than English, it may be impractical to encode all the morphological alternations in a single rule system. One such case is the plural formation of compound nouns in French, discussed in Karttunen, Kaplan and Zaenen [7]. It can best be described by two separate rule systems. The lexicon is composed with the first rule system and the resulting transducer is composed with the second set of rules. Appendix 2 discusses the case of French nouns in more detail.

In a cascaded composition, the result net of the first composition becomes the source net for the second step. The command 'result-to-source' makes that switch. When a new set of rules has been read in, the composition can be continued. This process is similar to the traditional derivation with an ordered set of rewrite rules. The difference of course is that the rules are applied to the lexicon as a whole rather than to individual words. At every point in the cascade, the result net maps the lexical forms to the output of the last rule or set of rules in the composition.

4. Checking the result

We have already illustrated three commands that display words in the source and the result net: 'random', 'lookup', and 'lookdown'. The 'check-all' command systematically compares the source net to the result net and displays the effect of the compose operation. It looks at every word in the source net and tries to find all words in the result net that have been derived from it by composition. The results of the search can be written to a file or displayed on the screen as in (39).

(39)

 


lexc> check-all
FLAGS: singles=ON, duplicates=ON, failures=ON.
Output file (- = stdout, ; = dialog box) [cancel]: -
==================================
Mon Mar 22 17:11:52 1993
SOURCE: ex5-lex.txt
RESULT: ex5-lex.txt+ex5-rules.fst
FLAGS: singles=ON, duplicates=ON, failures=ON.
==================================
dine+V+Base dine
dine+V+Sg3 dines
dine+V+PastPart dined
dine+V+Past dined
dine+V+PresPart dining
stop+V+Base stop
stop+V+Sg3 stops
stop+V+PastPart stopped
stop+V+Past stopped
stop+V+PresPart stopping
swim+V+Base swim
swim+V+Sg3 swims
swim+V+PresPart swimming
swim+V+Past swam
swim+V+PastPart swum
line+V+Base line
line+V+Sg3 lines
line+V+PastPart lined
line+V+Past lined
line+V+PresPart lining
line+N+Sg line
line+N+Pl lines

Source word count: 22
Single: 22, Multiple: 0, Failure: 0
Output word count: 22

The left column shows the lexical (upper) side of all the words in the source net, the right column displays the corresponding surface (lower) forms in the result net. For example, we see that the composition has produced swimming from swim+V+PresPart.

The output from 'check-all' is controlled by three variables that can be toggled on and off. If the variable 'singles' (default OFF) is ON, all lexical forms that have only one surface realization are printed. If 'duplicates' (default ON) is ON, lexical forms with more than one realization are printed. If 'failures' (default ON) is ON, entries in the source net that have no surface realization are printed with a row of asterisks in place of the missing output. In a typical case, most words in a source lexicon have one and only one surface realization. It may be unnecessary and time consuming to print them all if the lexicon is large. By turning the 'singles' flag off, the lexicon writer sees only the entries in the source lexicon that have several surface realizations or no realizations at all. This is useful information when the rule set is still in development. Badly constrained alternations may result in invalid multiple outputs, and conflicting rules may completely block some valid forms. If all the flags are turned off, 'check-all' produces no output other than the statistics at the end.

A common source of error is a mismatch between the lexicon and the rule alphabets. If some words in the source net contain symbols that are not in the rule alphabet, they disappear in the composition. The 'compose-result' command checks the alphabets and prints a warning if some symbol in the lower side of the source net does not appear on the upper side of the rule transducers. The command 'labels' allows the user to display the pair alphabet on the screen:

(40)


lexc> labels
1 = SOURCE, 2 = RULES, 3 = RESULT.
Choose 1, 2, or 3 [Cancel]: 1
d e i l m n o p s t w i:a i:u 0:d 0:g 0:n 0:^ +Base:0 +N:0 +Past:e
+Past:0 +PastPart:e +PastPart:0 +Pl:s +PresPart:i +Sg:0 +Sg3:s +V:0
Size: 28

lexc> labels
1 = SOURCE, 2 = RULES, 3 = RESULT.
Choose 1, 2, or 3 [Cancel]: 2
a b c d e f g h i j k l m n o p q r s t u v w x y z e:0 #:0 0:b 0:d
0:g 0:m 0:n 0:p 0:r 0:s 0:t 0:z ^:0
Size: 39

lexc> labels 3
d e i l m n o p s t w e:0 i:a i:u 0:d 0:g 0:m 0:n 0:p +Base:0
+N:0 +Past:e +Past:0 +PastPart:e +PastPart:0 +Pl:s +PresPart:i
+Sg:0 +Sg3:s +V:0
Size: 30

Any symbol in the source lexicon that consists of more than one letter must be declared in a Multichar_Symbols declaration (cf. (22) and (32)), otherwise it will be misparsed as a symbol sequence.

The lexicon compiler converts strings to symbols from left to right always preferring the longest match. It is important to keep this in mind in designing the alphabet. It might be safe to declare qu as a single symbol in English because it is nearly always pronounced [k]. On the other hand, declaring th a multicharacter symbol would clearly be a bad decision because words like hothouse would be misparsed as h-o-th-o-u-s-e. If the lexicon consists of ordinary alphabetic symbols and multicharacter tags as our examples do, it is best to use some nonalphabetic character such as + as the first letter of all multicharacter symbols. In this way the regular alphabet and the special alphabet are disjoint and every string is guaranteed to parse correctly.

5. Scripting

5.1 Simple script

The commands 'begin-script', 'end-script', and 'run-script' speed up the development of a lexicon by allowing the user to record a sequence of commands to a file and execute them all at once from inside the lexicon compiler. 'begin-script' starts the recording, 'end-script' closes the script file:

(41)


lexc> begin-script update
Recording commands in 'update'. Type 'end-script' to stop.

lexc> compile-source ex5-lex.txt
....
lexc> end-script
End of script. Closing 'update'

A script file, such as (42) may also be produced with a text editor and executed with a single command as in (43).

(42)


compile-source ex5-lex.txt
compose-result
check-all -
save-result ex5-result.fst

(43)


lexc> run-script update

5.2 Using make

The simple script facility is useful on the Macintosh. In UNIX, it is possible to automate the development even better with the help of the 'make' command which is invoked from the UNIX command line. Let us illustrate this by constructing a Makefile for the result net of the current example.

The "target" here is a situation in which the file containing the result net is up-to-date. Let us assume that the net was saved as ex5-result.fst, the source net as ex5-source.fst, and the rules as ex5-rules.fst. The source net was compiled from ex5-lex.txt, the rules from ex5-rules.txt. If the rule file gets modified, ex5-rules.fst needs to be recomputed; if ex5-lex.txt changes, ex5-source.fst should be recompiled; if either ex5-source.fst or ex5-rules.fst has changed, ex5-result.fst has to be recomposed.

To get 'make' to keep track of the situation and update the files as needed, the dependencies and the update actions have to be specified in a file called Makefile. (44) is an example of a Makefile that keeps ex5-rules.fst up-to-date by using 'source' to run a specific update script when one of the dependent files has changed.

(44)


##############################################
# Makefile for ex5

ex5: ex5-result.fst

ex5-result.fst: ex5-source.fsm ex5-rules.fst

    csh<<!source update-result.txt
    !


ex5-source.fsm: ex5-lex.txt
    csh<<!source update-source.txt
    !


ex5-rules.fst: ex5-rules.txt
    csh<<!source update-rules.txt
    !

##############################################

The UNIX command 'make ex5' checks the dependencies and carries out any required update actions in the appropriate order. The scripts for the updates are in three separate files, update-result.txt (45), update-source.txt (46), and update-rules.txt (47).

(45)


lexc<<!
read-source ex5-source.fst
read-rules ex5-rules.fst
compose-result
singles
check-all ex5-check.log
save-result ex5-result.fst
quit
!

The update-result script in (45) invokes the lexicon compiler, reads in the source and the rule nets, composes them, turns on the singles flag, produces the check-all file ex5-check.log, saves the result net as ex5-result.fst and quits.

The update-source script in (46) is simpler.

(46)


lexc<<!
compile-source ex5-lex.txt
save-source ex5-source.fst
quit
!

Updating the rules requires the two-level rule compiler. The update-rules script in (47) invokes the rule compiler (twolc) and runs the appropriate commands. For details, see [3].

(47)


twolc<<!
read-grammar ex5-rules.txt
compile
save-binary ex5-rules.fst
quit
!

6. Regular expressions

In Section 2.3 we saw that entries in a source lexicon may consist of a regular expression and a continuation class. In this section we discuss regular expressions in more detail. The lexicon compiler follows the same syntactic conventions as the two-level rule compiler (see Section 3.2 in [3]) but it has a few additional operators. The common operators are listed in (48) in a descending order of precedence. A few comments are added to explain the meaning of some nonstandard operators.

(48)

Unary operators
~ complement  
\ term complement  \a is any symbol other than a.
$ containment  $ais any string that contains at least one a.
* Kleene star
+ Kleene plus
() option  (A) B is equivalent to A B | B.
Binary operators
/ ignore  [a b c]/x is equivalent to x* a x* b x* c x*.
  concatenation
| union
& intersection
- minus  A - B is equivalent to A & ~B.
Special symbols
0 epsilon
? other
% escape
" multiple escape  "a b c" is equivalent to a% b% c.
! begin comment  Rest of the line is ignored.
Grouping elements
[ ] [A] is equivalent to A.
{ } can only be used around a union: {A | B}.

The lexicon compiler requires that regular expressions be enclosed in a pair of angle brackets, as shown in (12) and (13), to distinguish them from simple lexical forms and form pairs of type string1:string2. All operators, special characters, grouping elements, angle brackets, punctuation and whitespace characters must be prefixed with the escape character (%) if they are used as ordinary symbols.

In addition to (48), the lexicon compiler recognizes the following regular expression operators:

(49)

iteration(^)
A^n is equivalent to exactly n concatenations of A with itself.
A^n,m is the language consisting of at least n and at most m concatenations of A. Here m must be larger than n.
A^<n is less than n concatenations of A with itself, including the empty string.
A^>n more than n concatenations of A with itself.

The iteration operator has the same precedence as * and +.

(50)

crossproduct (.x.)
A .x. B  is the regular relation consisting of the Cartesian product of A and B. For example, if A is [a b | c] and B is [y z], [A .x. B] is the relation [a:y b:z | c:y 0:z]. The crossproduct [s w i m .x. s w u m] is equivalent to the expression [s w i:u m]. The crossproduct operator has lower precedence than any of the operators in (48).

(51)

insert (@)

@"file"  is the regular relation saved in file in binary format by a 'save-source' or 'save-result' command, or by other means. For example, if the file noun.fsm contains a lexicon of nouns, the regular expression [@"noun.fsm" [# @"noun.fsm"]*] constructs an infinite lexicon of nouns and noun compounds: noun1#noun2, noun1#noun2#noun3, and so on. The insert operator has the same precedence as ~, \, and $.

Using the insert and iteration constructs we can abbreviate our date lexicon--cf. (6), (12), (13) in Section 2--to the form shown in (52). We assume here that the files 0to9.fsm and 1to9.fsm have the contents suggested by the name, a net with the digits 0 to 9 and 1 to 9, respectively.

(52)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex6-lex.txt

LEXICON Root
<J a n u a r y | M a r c h | M a y | J u l y |
A u g u s t | [O c t o | D e c e m] b e r > 31days;
<A p r i l | J u n e | {S e p t | N o v} e m b e r> 30days;
February 29days;

LEXICON 31days
% 31 Years; 30days;

LEXICON 30days
% 3%0 Years; 29days;

LEXICON 29days
<% {@"1to9.fsm" | {1|2} @"0to9.fsm"}> Years;

LEXICON Years
<%,% {@"0to9.fsm" | @"1to9.fsm" @"0to9.fsm"^1,3}> #;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

7. Miscellaneous

7.1 Saving

The two output commands ('save-source' and 'save-result') save the net in the binary format described in [6]. This same format is used by the rule compiler to store transducers. If the user is only interested in the surface forms of the result net, the surface language of the transducer can be extracted with the command 'extract-surface-forms'. This operation converts the result net to a simple automaton.

7.2 Storage

The command 'storage' gives statistics of memory allocation:

(53)


lexc> storage
States: 64 in use, 20 free, 2340 allocated (32k)
Arcs: 147 in use, 27 free, 2048 allocated (32k)
Nets: 1 in use, 11 free, 327 allocated (16k)
Alphs: 2 in use, 22 free, 1638 allocated (16k)

These heaps: 96k
Other heaps: 64k

7.3 Time

The command 'time' toggles an ON/OFF switch (default OFF). If the switch is ON, the execution time is printed for all commands.

Acknowledgments

Thanks to Beth Bryson, Ronald M. Kaplan, Hyuk-Chul Kwon, Christian Rohrer and Annie Zaenen for comments and suggestions. Special thanks to Kenneth R. Beesley for editorial assistance.

References

[1] Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki. 1983.

[2] Antworth, Evan L. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16, Summer Institute of Linguistics, Dallas, Texas. 1990.

[3] Karttunen, Lauri and Kenneth R Beesley. Two-Level Rule Compiler Technical Report. ISTL-922. Xerox Palo Alto Research Center. Palo Alto, California. 1992.

[4] Ritchie, Graeme D., Graham J. Russell, Alan W. Black, and Stephen G. Pulman. Computational Morphology: Practical Mechanisms for the English Lexicon. MIT Press. 1992.

[5] Sproat, Richard. Morphology and Computation. MIT Press. Cambridge, Massachusetts. 1992.

[6] Karttunen, Lauri. Binary Encoding Format for Finite-State Networks. Technical Report SSL-9027. Xerox Palo Alto Research Center. Palo Alto. March 1990.

[7] Karttunen, Lauri, Ronald M. Kaplan and Annie Zaenen. "Two-Level Morphology with Composition". In the Proceedings of the Fifteenth International Conference on Computational Linguistics. COLING-92. Nantes, 2328 August 1992. Vol. I 141-48. ICCL. 1992.

APPENDIX 1

On-line help messages

'banner'
gives information about the version and the authors of the Xerox Finite-State Lexicon Compiler.

'begin-script <file>'
records all commands in <file> until 'end-script'. Run with 'run-script'.
'check-all'
helps you verify the RESULT of composing the SOURCE with the RULES. 'check-all' compares the upper side of SOURCE to RESULT and tabulates the number or SOURCE strings which have one, none, or several realizations in the result net. If the 'singles' flag is ON, all SOURCE strings with one output are printed alongside the result. If the 'failures' flag is on, every source strings that yields no output in RESULT is printed with a row of stars '***********'. If the 'duplicates' flag is on, each SOURCE string with multiple outputs is printed with all of its output forms. The default settings are: singles = OFF, failures = ON, duplicates = ON.
'compile-source <textfile>'
reads the lexicon contained in <textfile>, compiles it, and stores it as the current SOURCE. You can save this lexicon with the command 'save-source'. To read in a precompiled SOURCE, use the command 'read-source'. To compose the SOURCE with a set of rule transducers, previously read in by 'read-rules', use the command 'compose-result'.
'compose-result'
applies the RULES to the SOURCE, creating the RESULT transducer. This command requires SOURCE and RULES.
'duplicates'
is a switch that affects the output of the CHECK-ALL command. If DUPLICATES is ON, check-all prints forms with multiple outputs. If DUPLICATES is OFF, check-all does not print such forms. Default is ON. Use 'status' to see the current setting.
'end-script'
stops recording commands into a script file. Begin recording with 'begin-script', run with 'run-script'.
'extract-surface-forms'
replaces the RESULT transducer with a simple fsm that contains only surface forms in the old RESULT.
'failures'
is a switch that affects the output of the CHECK-ALL command. If FAILURES is ON, check-all prints input forms that yield no output, paired with a string of asterisks. If FAILURES is OFF, check-all does not print such forms. Default is ON. Use 'status' to see the current setting.
'labels'
displays the label set of the SOURCE, RESULT, or RULES. The lower side of the source labels should match the upper side of the rule labels. If a symbol on the lower side of some source label does not appear on the upper side of at least one rule label, every string that contains that source label will disappear in the composition.
'lookdown <string>'
searches the SOURCE or the RESULT for <string> by matching <string> against the upper side of the transducer, and printing out all corresponding strings on the lower side of the transducer.
'lookup <string>'
searches the SOURCE or the RESULT for <string> by matching <string> against the lower side of the transducer, and printing out all corresponding strings on the upper side of the transducer.
'?'
prints a list of available commands.
'quit'
exits lexc.
'random'
generates 15 random words from either the SOURCE or the output language of the RESULT, depending upon the user's choice.
'read-result <fsmfile>'
loads the transducer in <fsmfile>, and stores it as the current RESULT. This is most useful for checking a previously composed RESULT via the 'check-all' command. The previous RESULT is replaced.
'read-rules <fsmfile>'
loads the transducers in <fsmfile>, and stores them as the current RULES. Previous RULES are replaced if the user so chooses.
'read-source <fsmfile>'
loads the compiled lexicon in <fsmfile>, and stores it as the current SOURCE. The previous SOURCE is replaced if the user so chooses.
'result-to-source'
makes RESULT the SOURCE, and clears RESULT.
'run-script <textfile>'
executes the commands in <textfile>. Record a script with 'begin-script' and 'end-script', or create in a text editor.
'save-result <filename>'
stores the RESULT in a binary format. Use 'read-result' to read files created in this way.
'save-source <filename>'
stores the current SOURCE in a binary format. Use 'read-source' to read files created in this way.
'singles'
is a switch that affects the output of the CHECK-ALL command. If SINGLES is ON, check-all prints input forms that yield exactly one output If SINGLES is OFF, check-all does not print such forms. Default is OFF. Use 'status' to see the current setting.
'source-to-result'
copies SOURCE to RESULT. This makes it possible to use 'check-all' to examine a source transducer.
'status'
displays information about the current SOURCE, RULES, and RESULT and the current settings of all the switches.
'storage'
displays statistics about current memory usage.
'time'
is an ON/OFF switch that provides timing information for some commands. Default is OFF. Use 'status' to see the current setting.

APPENDIX 2

Cascaded composition: French compound plurals

This appendix presents a simple example of cascaded composition, based on the 1992 Coling paper by Karttunen, Kaplan, and Zaenen [7]. In cascaded composition, two or more rule transducers are composed with the lexicon in a sequence. Introducing a cascade does not enrich the descriptive power of two-level morphology. Any regular relation than can be described by two rules or rule systems applying in a sequence could also be described by a single two-level grammar. But in the case of French compound plurals a single two-level description is clearly more complicated for the human rule writer than two rule systems that apply in a sequence.

Let us first construct a lexicon and a rule system that describes the inflection of simple nouns and adjectives in French. Depending on the type of the stem, nouns have three types of plural endings: -s (chef:chefs), -x (cheval:chevaux), or zero (nez:nez, prorata:prorata). Feminine adjectives in general add an -e to the end of the stem, which can have an effect on how the plural is realized (social:sociaux Masc, sociale:sociales Fem). Some words have alternate forms in the plural, for example, travail:travaux~travails.

The sample lexicon in (1) contains a few words that illustrate these phenomena. The lexical form of cheval, for example, is encoded as cheval+N+Masc+Sg.

(1)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex7-lex.txt

Multichar_Symbols
+N +Adj +Masc +Fem +Sg +Pl
^s

LEXICON Root
Nouns;

LEXICON Nouns
chef Nm ; cheval Nm ; démocrate Nmf;
dieu Nm ; eau Nf ; hôtel Nm;
nez Nm ; social Adj ; travail Nms;
travail Nm ; prorata Nmi ;

LEXICON Nmi 0:^ Nm;
LEXICON Nm +N Masc ;
LEXICON Nf +N Fem ;
LEXICON Nmf +N Gender ;
LEXICON Nms 0:^s Nm ;
LEXICON Adj +Adj Gender ;
LEXICON Gender Masc; Fem ;
LEXICON Masc +Masc Number ;
LEXICON Fem +Fem Number ;
LEXICON Number +Sg # ; +Pl # ;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Words that end in a sibilant (nez) are typically invariant in the plural; words like prorata 'proportional share' that exceptionally take no plural ending are marked in the lexicon with the diacritic 0:^. Note that there are two entries for travail. One of them gets a diacritic, 0:^s inserted between the stem and the gender marker. We use this diacritic to control the shape of the plural (travaux or travails). Because of the presence of the diacritic pairs, the compilation of (1) creates a transducer. The source net maps all other lexical forms to themselves but prorata+N+Masc+Pl is mapped to prorata^+N+Masc +Pl and travail+N+Masc+Pl to both travail+N+Masc+Pl and travail^s+N+Masc+Pl.

A simple rule system that describes the realization of these lexical forms is presented in (2).

(2)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex7-b-rules.txt

Alphabet
a à â ä b c ç d e é è ê ë f g h i î ï j k l m n ñ o ô ö p q r s t
u ù û v w x y z %- ' %+Sg:0 %+Pl:s %+Masc:0 %+Fem:0 %+N:0 %+Adj:0
%^:0 %^s:0 #:0 ;

Sets
Sibilant = s z x;

Definitions
Cat = %+N: | %+Adj: ;
Gender = %+Masc: | %+Fem: ;
XContext = Cat Gender - %+Adj: %+Fem: ;

Rules
"1. Zero plural" ! No plural marker after a sibilant or diacritic ^.
%+Pl:0 <=> {Sibilant | %^: } Cat Gender _ ;

"2. Plural x" ! Plural is realized as x in words ending in
! -al, -ail, -au, -eu (except if ^s intervenes).
%+Pl:x <=> {a | e} (i:) :u XContext _ ;

"3. l to u" ! Final l is realized as u in the plural of words
! ending in -al and -ail (unless blocked by ^s).
l:u <=> a (i:) _ XContext %+Pl: ;

"4. i deletion" ! i is deleted in the plural of words ending
! in -ail (unless blocked by ^s).
i:0 <=> a _ l: XContext %+Pl: ;

"5. Feminine e" ! Feminine is realized as e in adjectives.
%+Fem:e <=> %+Adj: _ ;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

We compile the source lexicon in the usual manner and compose it with the rule transducers. The output from check-all verifies that the rules have the intended effect.

(3)


lexc> check-all
FLAGS: singles=ON, duplicates=ON, failures=ON.
Output file (- = stdout, ; = dialog box) [cancel]: -
===================================
Thu Apr 1 13:06:17 1993
SOURCE: ex7-lex.txt
RESULT: ex7-lex.txt+ex7-b-rules.fst
FLAGS: singles=ON, duplicates=ON, failures=ON.
===================================
chef+N+Masc+Pl chefs
chef+N+Masc+Sg chef
cheval+N+Masc+Pl chevaux
cheval+N+Masc+Sg cheval
dieu+N+Masc+Pl dieux
dieu+N+Masc+Sg dieu
démocrate+N+Fem+Pl démocrates
démocrate+N+Fem+Sg démocrate
démocrate+N+Masc+Pl démocrates
démocrate+N+Masc+Sg démocrate
eau+N+Fem+Pl eaux
eau+N+Fem+Sg eau
hôtel+N+Masc+Pl hôtels
hôtel+N+Masc+Sg hôtel
nez+N+Masc+Pl nez
nez+N+Masc+Sg nez
prorata+N+Masc+Pl prorata
prorata+N+Masc+Sg prorata
social+Adj+Fem+Pl sociales
social+Adj+Fem+Sg sociale
social+Adj+Masc+Pl sociaux
social+Adj+Masc+Sg social
travail+N+Masc+Pl travails
travaux
travail+N+Masc+Sg travail

Source word count: 24
Single: 23, Multiple: 1, Failure: 0
Output word count: 2


The form travail+N+Masc+Pl gets two surface realizations because there are two intermediate forms travail+N+Masc+Pl and travail+^s+N+Masc+Pl. The presence of ^s exempts the latter form from the alternations that regularly take place in XContext. The diacritic itself disappears in the composition. A similar case is illustrated in Section 3.2 in examples (34) and (35).

In compound nouns and adjectives several patterns are possible. Sometimes only the first part of the compound is marked for the plural (hôtels-dieu), sometimes the last part (mini-jupes), sometimes both parts (portes-fenêtres). Some compounds are invariant in the plural (porte-plume). It is not possible to predict from the form alone (and not always from the meaning either) which class a given compound belongs to. We assume here that it has to be marked in the lexical form.

The rules in (2) only describe the pattern where the plural is realized at the end of the second word. It is possible in principle to modify these rules so that they also correctly account for other types of compounds, but this is not a simple matter. It would fail to capture the obvious fact that the plurals portes and fenêtres in portes-fenêtres in themselves are regular; the only thing that is special about the word is that plurality is expressed in both parts of the compound.

We avoid these complications by creating a second rule system that determines whether the plurality is expressed by the first word, by the second word or both. Because the new rules are triggered by compound diacritics, they have no effect on simple nouns. The rules are listed in (4).

(4)


Alphabet
a à â ä b c ç d e é è ê ë f g h i î ï j k l m n ñ o ô ö p q r s t
u ù û v w x y z
%- '
%+Sg %+Pl %+Sg:0 %+Pl:0
%+Masc %+Fem %+Masc:0 %+Fem:0
%+N %+Adj %+N:0 %+Adj:0 #:0 %^ %^s %^n:0 %^nn:0 %^an:0;
Sets
Letter = a à â ä b c ç d e é è ê ë f g h i î ï j k l m n ñ o ô ö
p q r s t u ù û v w x y z ' %- ;
CopyNMark = %^n %^nn ;
CopyMark = CopyNMark %^an ;
DeleteMark = %^n ;
Cat = %+Adj %+N ;
Gender = %+Masc %+Fem ;
Number = %+Sg %+Pl ;
Definitions
FirstWord = #: [Letter - %-]+ ;
LastWord = %- Letter+ ;
InsertContext = :Gender :Number LastWord ;
CopyContext = LastWord CopyMark:0 ;
DeleteContext = DeleteMark:0 (Cat:) (Gender:);

Rules
"1. Insert +N"
0:%+N <=> FirstWord _ InsertContext CopyNMark: ;
"2. Insert +Adj"
0:%+Adj <=> FirstWord _ InsertContext %^an: ;
"3. Copy gender"
0:Gx <=> FirstWord (:Cat) _ :Number CopyContext Cat: Gx: ;
where Gx in Gender;
"4. Copy number"
0:Nx <=> FirstWord (:Cat :Gender) _ CopyContext Cat: Gender: Nx: ;
where Nx in Number;
"5. Delete category"
Cat:0 <=> DeleteContext _ ;
"6. Delete gender"
Gender:0 <=> DeleteContext _ ;
"7. Delete number"
Number:0 <=> DeleteContext _ ;

We use the diacritic ^n to mark compounds like hôtel-dieu that mark the plural on the first noun. ^nn is used for noun-noun compounds like porte-fenêtre that have double plurals. ^an is a marks words like social-démocrate where the first part of the compound is inflected as an adjective for gender and number. Rules 1-4 insert the appropriate category, gender, and number at the end of the first word in the compound. Rules 5-7 delete the morphological codes at the end of the second noun in words like hôtel-dieu.

This new set of rules is ordered before the rules in (2). The correct result can be produced in either of two ways. (i) We intersect the rules in each set to a single transducer, compose the results, and then compose the lexicon with a single rule transducer, as described in [6]. Alternatively, (ii) we compose the lexicon with the rules in (4) and compose the result with the rules in (2). Here we take the second option. But let us first expand our noun lexicon in (1) with the new entries and sublexicons shown in (5).

(5)


! Additions to Multichar_Symbols
^n ^nn ^an

! Additions to LEXICON Nouns

cheval-vapeur NN1m ;
eau-de-vie NN1f ;
hôtel-dieu NN1m ;
plain-chant NN2m ;
porte-fenêtre NN2f ;
social-démocrate AN2mf ;
porte-plume Nmi ;

! New sublexicons
LEXICON NN1m 0:^n Nm;
LEXICON NN1f 0:^n Nf;
LEXICON NN2m 0:^nn Nm;
LEXICON NN2f 0:^nn Nf;
LEXICON AN2mf 0:^an Nm; 0:^an Nf;

In (6) we compile the new source lexicon and compose it with the rules in (4).

(6)


lexc> compile-source ex8-lex.txt
Opening 'ex8-lex.txt'...
Root...1, Nouns...18, Nm...1, Nf...1, Nmf...1, Nms...1,
Adj...1,
Gender...2, Masc...1, Fem...1, Number...2, NN1m...1,
NN1f...1,
NN2m...1, NN2f...1, AN2mf...2
Building lexicon...Minimizing...Done!
SOURCE: 109 states, 127 arcs, 42 words.

lexc> read-rules ex7-a-rules.fst
Opening 'ex7-a-rules.fst'...
7 nets read.

lexc> compose-result
Epenthetical forms are possible.
Initial and final word boundaries added.
.Done.
1686 states, 1726 arcs, 42 words.
Minimizing...Done.
204 states, 230 arcs, 42 words

The next step is to move the result net to the source position. After reading in the second set of rules, we compose again for the final result.

(7)


lexc> result-to-source
Done.

lexc> read-rules ex7-b-rules.fst
Opening 'ex7-b-rules.fst'...
5 nets read.
lexc> compose-result
No epenthesis.
Initial and final word boundaries added.
...Done.
248 states, 278 arcs, 42 words.
Minimizing...Done.
189 states, 220 arcs, 41 words.

The check-all listing in (8) verifies that the result is as we expect.

(8)


==================================
Thu Apr 1 17:35:12 1993
SOURCE: ex8-lex.txt+ex7-a-rules.fst
RESULT: ex8-lex.txt+ex7-a-rules.fst+ex7-b-rules.fst
FLAGS: singles=ON, duplicates=ON, failures=ON.
=================================

--- Omitting lines that duplicate the listing in (3). ---
cheval-vapeur+N+Masc+Pl chevaux-vapeur
cheval-vapeur+N+Masc+Sg cheval-vapeur
eau-de-vie+N+Fem+Pl eaux-de-vie
eau-de-vie+N+Fem+Sg eau-de-vie
hôtel-dieu+N+Masc+Pl hôtels-dieu
hôtel-dieu+N+Masc+Sg hôtel-dieu
plain-chant+N+Masc+Pl plains-chants
plain-chant+N+Masc+Sg plain-chant
porte-fenêtre+N+Fem+Pl portes-fenêtres
porte-fenêtre+N+Fem+Sg porte-fenêtre
porte-plume+N+Masc+Pl porte-plume
porte-plume+N+Masc+Sg porte-plume
social-démocrate+N+Fem+Pl sociales-démocrates
social-démocrate+N+Fem+Sg sociale-démocrate
social-démocrate+N+Masc+Pl sociaux-démocrates
social-démocrate+N+Masc+Sg social-démocrate

Source word count: 40
Single: 39, Multiple: 1, Failure: 0
Output word count: 41

The minor differences in word counts--42 vs. 41 in (7), 41 vs. 40 in (8)--are caused by the diacritic ^s in the intermediate forms of travail. The sequence of diagrams (9)-(11) illustrates the effect of the two compositions on the plural form of cheval-vapeur 'horse-power'.

(9) is the mapping in the source lexicon.

(9)


Lexical Form: c h e v a l - v a p e u r +N +Masc +Pl
1st Intermediate Form: c h e v a l - v a p e u r ^n +N +Masc +Pl

The first composition with the rules in (4) copies the morphological tags to the end of the first part of the compound. The resulting structure is shown in (10). Because cheval-vapeur is a compound that marks plurality only on the first word, the tags following vapeur are removed. These changes are triggered by the diacritic ^n, which itself disappears.

(10)


Lexical Form: c h e v a l - v a p e u r +N +Masc +Pl
2nd Intermediate Form: c h e v a l +N +Masc +Pl - v a p e u r

The second composition with the rules in (2) realizes the plural marker at the end of cheval in the regular way and the final structure emerges in (11).

(11)


Lexical Form: c h e v a l - v a p e u r +N +Masc +Pl
Surface Form: c h e v a u x - v a p e u r

As we mentioned above, exactly the same result could be produced by another route of computations. The intersections of the two rule sets, (4) and (2), could be merged to a single rule automaton by composition. The composition of the lexicon with the resulting single rule transducer yields the same result as the cascading composition illustrated here.