  

 

PART-OF-SPEECH TAGGING

The general purpose of a part-of-speech tagger is to associate each word in a text with its morphosyntactic category (represented by a tag).

 

EXAMPLE:

This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
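As an illustration of the output format above, here is a minimal Python sketch that reads one tagged line back into (word, tag) pairs. The function name parse_tagged is hypothetical and not part of any Xerox tool; the sketch only assumes that tokens are separated by spaces and that the tag follows the last "+" in each item.

```python
def parse_tagged(line):
    """Split a 'word+TAG' line into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST '+', so a word that itself
        # contains '+' keeps it in the word part.
        word, _, tag = item.rpartition("+")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT"))
# [('This', 'PRON'), ('is', 'VAUX_3SG'), ('a', 'DET'), ('sentence', 'NOUN_SG'), ('.', 'SENT')]
```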

The process of tagging consists of three steps:

  1. tokenization: break a text into tokens

  2. lexical lookup: provide all potential tags for each token

  3. disambiguation: assign to each token a single tag
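The three steps above can be sketched as a small Python pipeline. This is a toy illustration under stated assumptions, not the actual tagger: a regular expression stands in for a real tokenizer, LEXICON is a hypothetical five-entry dictionary (the DET reading of "this" and the VERB tag are invented for the example), and disambiguation is a placeholder that simply keeps the first listed reading.

```python
import re

# Hypothetical toy lexicon: lower-cased surface form -> candidate tags.
LEXICON = {
    "this": ["PRON", "DET"],
    "is": ["VAUX_3SG"],
    "a": ["DET"],
    "sentence": ["NOUN_SG", "VERB"],
    ".": ["SENT"],
}


def tokenize(text):
    # Step 1: a token is a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


def lookup(tokens):
    # Step 2: attach every candidate tag; unknown words get a dummy tag.
    return [(tok, LEXICON.get(tok.lower(), ["UNKNOWN"])) for tok in tokens]


def disambiguate(readings):
    # Step 3 (placeholder): keep the first reading for each token.
    return [(tok, tags[0]) for tok, tags in readings]


def tag(text):
    tagged = disambiguate(lookup(tokenize(text)))
    return " ".join(f"{tok}+{t}" for tok, t in tagged)


print(tag("This is a sentence."))
# This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
```

Keeping the steps as separate functions mirrors the description: each stage can be replaced on its own without touching the others.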

Each step is performed by an application program that uses language-specific data:

  • The tokenization step uses a finite-state transducer to insert token boundaries around simple words (or multi-word expressions), punctuation marks, numbers, etc.

  • Lexical lookup requires a morphological analyser to associate each token with one or more readings. Unknown words are handled by a guesser which provides potential part-of-speech categories based on affix patterns.

  • Disambiguation is statistical: a Hidden Markov Model (HMM) selects a single tag for each token from the candidates supplied by lexical lookup.
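To make the disambiguation step concrete, here is a minimal Python sketch of Viterbi decoding over a bigram HMM. The page does not say which decoding algorithm or model structure the Xerox tools use, so Viterbi, the word-level emission table, and every probability below are assumptions chosen for illustration; the VERB tag and the DET reading of "this" are likewise hypothetical.

```python
import math


def viterbi(tokens, candidates, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a bigram HMM.

    candidates[i] lists the tags allowed for tokens[i] (from lexical lookup).
    Probabilities missing from a table fall back to `floor`.
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log-prob of the best path ending in tag, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, tokens[0])), None)
             for t in candidates[0]}]
    for i in range(1, len(tokens)):
        column = {}
        for t in candidates[i]:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p, (p, t)), p)
                for p in candidates[i - 1]
            )
            column[t] = (score + lp(emit_p, (t, tokens[i])), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with invented probabilities.
tokens = ["this", "is", "a", "sentence", "."]
candidates = [["PRON", "DET"], ["VAUX_3SG"], ["DET"], ["NOUN_SG", "VERB"], ["SENT"]]
start_p = {"PRON": 0.5, "DET": 0.5}
trans_p = {
    ("PRON", "VAUX_3SG"): 0.6, ("DET", "VAUX_3SG"): 0.05,
    ("VAUX_3SG", "DET"): 0.5,
    ("DET", "NOUN_SG"): 0.8, ("DET", "VERB"): 0.01,
    ("NOUN_SG", "SENT"): 0.3, ("VERB", "SENT"): 0.3,
}
emit_p = {
    ("PRON", "this"): 0.2, ("DET", "this"): 0.2, ("VAUX_3SG", "is"): 0.9,
    ("DET", "a"): 0.5, ("NOUN_SG", "sentence"): 0.01,
    ("VERB", "sentence"): 0.01, ("SENT", "."): 0.9,
}

print(viterbi(tokens, candidates, start_p, trans_p, emit_p))
# ['PRON', 'VAUX_3SG', 'DET', 'NOUN_SG', 'SENT']
```

In this toy model the emission probabilities for the two ambiguous words are equal, so the choice is driven entirely by the tag-to-tag transitions: "this" becomes PRON because a pronoun is more likely than a determiner before an auxiliary, and "sentence" becomes NOUN_SG because a noun is more likely than a verb after a determiner.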

Using the Xerox HMM training tools, we have developed part-of-speech disambiguators for various languages, including Dutch, English, French, German, Italian, Portuguese, and Spanish.

Have a look at our demo part-of-speech tagger!


 
