PART-OF-SPEECH TAGGING
The general purpose of a part-of-speech tagger is to associate each word
in a text with its morphosyntactic category (represented by a tag).
EXAMPLE:

This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
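The tagged output above can be read back into (word, tag) pairs with a few lines of code. This is a minimal sketch, assuming that every item carries exactly one tag attached after the last `+`; the function name `parse_tagged` is hypothetical and not part of any tagger distribution.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word+TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST '+' so that a word containing '+' keeps it.
        # Assumption: every item contains at least one '+'.
        word, _, tag = item.rpartition("+")
        pairs.append((word, tag))
    return pairs


example = "This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT"
pairs = parse_tagged(example)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Splitting on the last `+` rather than the first is a design choice for this sketch: it keeps a token such as `C++` intact when it is followed by its tag.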
The tagging process consists of three steps:
- tokenization: break the text into tokens
- lexical lookup: provide all potential tags for each token
- disambiguation: assign a single tag to each token
Each step is performed by an application program
that uses language-specific data:
- The tokenization step uses a finite-state transducer to insert token
  boundaries around simple words (or multi-word expressions), punctuation
  marks, numbers, etc.
- Lexical lookup requires a morphological analyser to associate each token
  with one or more readings. Unknown words are handled by a guesser,
  which proposes potential part-of-speech categories based on affix patterns.
- Disambiguation is done with statistical methods (a Hidden Markov Model).
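The three steps can be illustrated with a toy pipeline. This is only a sketch under loud assumptions, not the Xerox implementation: a regular expression stands in for the finite-state transducer, a small dictionary stands in for the morphological analyser, the suffix table is a crude guesser, and all probabilities are invented rather than trained. The tags NOUN_PL, VERB and ADV, and every function name, are hypothetical.

```python
import math
import re

# Toy lexicon (stand-in for a morphological analyser); entries are invented.
LEXICON = {
    "this": ["PRON", "DET"],
    "is": ["VAUX_3SG"],
    "a": ["DET"],
    "sentence": ["NOUN_SG", "VERB"],
    ".": ["SENT"],
}
# Toy guesser: candidate tags proposed from a word's suffix.
SUFFIX_GUESSES = [("ing", ["VERB", "NOUN_SG"]), ("ly", ["ADV"]), ("s", ["NOUN_PL", "VERB"])]

# Invented HMM parameters: P(tag | previous tag) and P(word | tag).
TRANS = {
    ("<s>", "PRON"): 0.4, ("<s>", "DET"): 0.4,
    ("PRON", "VAUX_3SG"): 0.6, ("DET", "VAUX_3SG"): 0.01,
    ("VAUX_3SG", "DET"): 0.5,
    ("DET", "NOUN_SG"): 0.7, ("DET", "VERB"): 0.01,
    ("NOUN_SG", "SENT"): 0.4,
}
EMIT = {
    ("this", "PRON"): 0.3, ("this", "DET"): 0.2, ("is", "VAUX_3SG"): 0.5,
    ("a", "DET"): 0.4, ("sentence", "NOUN_SG"): 0.01,
    ("sentence", "VERB"): 0.001, (".", "SENT"): 0.9,
}
FLOOR = 1e-4  # probability assumed for any pair missing from the tables


def tokenize(text):
    """Step 1: split text into numbers, words, and single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]", text)


def lookup(token):
    """Step 2: return all candidate tags, guessing from suffixes if unknown."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix, guesses in SUFFIX_GUESSES:
        if word.endswith(suffix):
            return guesses
    return ["NOUN_SG"]  # last-resort default, an assumption of this sketch


def viterbi(tokens):
    """Step 3: pick the most probable tag sequence (log-space Viterbi)."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for token in tokens:
        new_best = {}
        for tag in lookup(token):
            emit = math.log(EMIT.get((token.lower(), tag), FLOOR))
            score, path = max(
                (prev_score + math.log(TRANS.get((prev, tag), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score + emit, path + [tag])
        best = new_best
    return max(best.values())[1]


def tag(text):
    tokens = tokenize(text)
    return " ".join(f"{tok}+{t}" for tok, t in zip(tokens, viterbi(tokens)))


print(tag("This is a sentence."))
```

With these invented parameters, "This" is resolved to PRON rather than DET because a determiner is very unlikely to be followed by an auxiliary verb. A trained model would estimate both tables from data instead of hard-coding them.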
Using the Xerox HMM training tools, we have
developed part-of-speech disambiguators for various languages,
including Dutch, English, French, German, Italian, Portuguese,
and Spanish.
Have a look at our demo part-of-speech tagger!