How to do POS-tagging and lemmatization in languages other than English
While it is fairly easy to do POS-tagging and lemmatization in English using Python and the NLTK or TextBlob modules, building applications that handle other languages is not always as straightforward.
TreeTagger has been around for a while (since 1995!) and still fills an important role as a freely downloadable, multilingual POS-tagging and lemmatization command-line utility. While the algorithms behind it are not as fancy as today’s state of the art, it has decent accuracy (around 97% for German) and it covers all of the following languages: German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish and Old French.
Some better resources might have been developed for some of these languages since TreeTagger came out.
That being said, this is the only piece of software that acts as a one-stop shop for all of this.
Here are the installation instructions for a Linux machine.
Make sure to read the license terms.
Follow the instructions in the download section.
Don’t forget to save every file you download from the site with its original name, all in the same folder, before running the install-tagger.sh script.
Set a TREETAGGER environment variable on your system to the place where install-tagger.sh put the tagging scripts.
For example, I downloaded every file to /opt/treetagger and executed the script from there, so I set TREETAGGER to “/opt/treetagger/cmd”.
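To do so, open the ~/.bashrc file (or create it if it does not exist) by entering this into the terminal:
$ nano ~/.bashrc
and add this line to the file
export TREETAGGER="/opt/treetagger/cmd"
before quitting and saving with CTRL+X, then Y and Enter. Open a new terminal (or run source ~/.bashrc) for the change to take effect.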
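Before moving on, it can be worth checking that Python actually sees the variable. The snippet below is only a minimal sketch: find_tagging_scripts is a hypothetical helper name, and it assumes the install script names its per-language scripts "tree-tagger-<language>", which may not hold for every release.

```python
import os


def find_tagging_scripts(env=None):
    """Return the tagging scripts found in the directory named by TREETAGGER."""
    env = os.environ if env is None else env
    cmd_dir = env.get("TREETAGGER")
    if not cmd_dir:
        raise EnvironmentError("TREETAGGER is not set in this shell")
    if not os.path.isdir(cmd_dir):
        raise EnvironmentError("TREETAGGER points to a missing directory: %s" % cmd_dir)
    # Assumption: per-language scripts are named "tree-tagger-<language>".
    return sorted(name for name in os.listdir(cmd_dir)
                  if name.startswith("tree-tagger-"))


if __name__ == "__main__":
    try:
        print(find_tagging_scripts())
    except EnvironmentError as error:
        print(error)
```

If the list comes back empty or an error is printed, the variable most likely points to the wrong folder or the terminal was not restarted after editing ~/.bashrc.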
You have to get the treetagger-python wrapper from miotto’s GitHub repo
$ git clone https://github.com/miotto/treetagger-python.git
then navigate to the folder you just downloaded and launch the Python setup script:
$ cd treetagger-python
$ python setup.py install
Test that it works
Launch the Python interpreter and input the following to test the installation:
>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> from pprint import pprint
>>> pprint(tt_en.tag('Does this thing even work?'))
[[u'Does', u'VBZ', u'do'],
 [u'this', u'DT', u'this'],
 [u'thing', u'NN', u'thing'],
 [u'even', u'RB', u'even'],
 [u'work', u'VB', u'work'],
 [u'?', u'SENT', u'?']]
>>> pprint(tt_fr.tag(u'Mon Dieu, faites que ça marche!'))
[[u'Mon', u'DET:POS', u'mon'],
 [u'Dieu', u'NOM', u'Dieu'],
 [u',', u'PUN', u','],
 [u'faites', u'VER:pres', u'faire'],
 [u'que', u'KON', u'que'],
 [u'\xe7a', u'PRO:DEM', u'cela'],
 [u'marche', u'NOM', u'marche'],
 [u'!', u'SENT', u'!']]
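Since each token comes back as a [word, tag, lemma] triple, getting a lemmatized version of a sentence is mostly a matter of picking the third field. Here is a rough sketch of how that could look. The lemmas helper and its skip_tags default are my own illustration, not part of the wrapper: the SENT and PUN tags are simply the punctuation tags seen in the output above, and other languages may use different ones. The "<unknown>" fallback assumes that this is how a missing lemma is reported.

```python
def lemmas(tagged, skip_tags=("SENT", "PUN")):
    """Return one lemma per token, skipping tokens whose tag is in skip_tags."""
    result = []
    for entry in tagged:
        if len(entry) != 3:
            # Assumption: anything that is not [word, tag, lemma] can be ignored.
            continue
        word, tag, lemma = entry
        if tag in skip_tags:
            continue
        # Assumption: a missing lemma is reported as "<unknown>"; keep the word instead.
        result.append(word if lemma == "<unknown>" else lemma)
    return result


# Tagger output copied from the English session above.
tagged_en = [[u'Does', u'VBZ', u'do'], [u'this', u'DT', u'this'],
             [u'thing', u'NN', u'thing'], [u'even', u'RB', u'even'],
             [u'work', u'VB', u'work'], [u'?', u'SENT', u'?']]

print(lemmas(tagged_en))
```

With the English example, this keeps the lemmas do, this, thing, even and work and drops the question mark. In a real script you would pass the result of tt_en.tag(...) directly instead of a hard-coded list.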
That’s it! Enjoy your quick and easy multilingual POS-tagger and lemmatizer.