How to call TreeTagger from Python

How to do POS-tagging and lemmatization in languages other than English

While it is fairly easy to do POS-tagging and lemmatization in English using Python and the NLTK or TextBlob modules, building applications that handle other languages is not always as straightforward.

Here I show you what I consider to be the simplest solution to this problem, using Python, TreeTagger and a wrapper module named… treetagger-python.


About TreeTagger

TreeTagger has been around for a while (since 1995!) and still fills an important role as a freely available, multilingual POS-tagging and lemmatization command-line utility. While the algorithms behind it are not as fancy as the ones used in today’s state of the art, it has decent accuracy (around 97% for German) and it covers all the following languages: German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish and Old French.

Some better resources might have been developed for some of these languages since TreeTagger came out.
That being said, this is the only piece of software that acts as a one-stop shop for all of this.


Here are the installation instructions for a Linux machine.

Install TreeTagger

Make sure to read the license terms.

Follow the instructions in the download section.
Don’t forget to save every file you download from the site with its original name, all in the same folder, before running the install-tagger.sh script.

IMPORTANT STEP!

Set a TREETAGGER environment variable on your system to the place where install-tagger.sh put the tagging scripts.
For example, I downloaded every file to /opt/treetagger and executed the script from there, so I set TREETAGGER to “/opt/treetagger/cmd”.

To do so, open the ~/.bashrc file (or create it if it does not exist) by entering this into the terminal:

$ nano ~/.bashrc

and add this line to the file

export TREETAGGER="/opt/treetagger/cmd"

before saving and quitting with CTRL+X, then Y, then Enter. Open a new terminal (or run source ~/.bashrc) so that the variable takes effect.
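
To double-check that the variable is visible from Python, a quick sanity check along these lines can help. This is only a sketch: the helper name check_treetagger_dir is made up for illustration and is not part of TreeTagger or treetagger-python. It only verifies that TREETAGGER is set and points to an existing directory, not that the tagging scripts inside it are usable.

```python
import os


def check_treetagger_dir(environ=None):
    """Return the directory named by TREETAGGER, or raise if it is unusable.

    Hypothetical helper for illustration; not part of treetagger-python.
    """
    # Default to the real environment, but allow a plain dict for testing.
    environ = os.environ if environ is None else environ
    path = environ.get("TREETAGGER")
    if not path:
        raise RuntimeError("TREETAGGER is not set in this shell")
    if not os.path.isdir(path):
        raise RuntimeError("TREETAGGER points to a missing directory: " + path)
    return path
```

Calling check_treetagger_dir() from a fresh interpreter should return the path you exported; a RuntimeError means the variable did not make it into the shell that launched Python.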

Install treetagger-python

You have to get the treetagger-python wrapper from miotto’s GitHub repo:

$ git clone https://github.com/miotto/treetagger-python.git

then navigate to the folder you just downloaded and launch the Python setup script:

$ cd treetagger-python
$ python setup.py install

Test that it works

Launch the Python interpreter and input the following to test the installation:

>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> from pprint import pprint
>>> pprint(tt_en.tag('Does this thing even work?'))
[[u'Does', u'VBZ', u'do'],
 [u'this', u'DT', u'this'],
 [u'thing', u'NN', u'thing'],
 [u'even', u'RB', u'even'],
 [u'work', u'VB', u'work'],
 [u'?', u'SENT', u'?']]
>>> pprint(tt_fr.tag(u'Mon Dieu, faites que ça marche!'))
[[u'Mon', u'DET:POS', u'mon'],
 [u'Dieu', u'NOM', u'Dieu'],
 [u',', u'PUN', u','],
 [u'faites', u'VER:pres', u'faire'],
 [u'que', u'KON', u'que'],
 [u'\xe7a', u'PRO:DEM', u'cela'],
 [u'marche', u'NOM', u'marche'],
 [u'!', u'SENT', u'!']]
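
As the output shows, each entry is a [token, POS tag, lemma] triple. If all you need is the lemmas, for instance to write them to a text file, something like the following sketch should do. The helper names lemmas_from_tags and write_lemmas are hypothetical, and the code assumes the list-of-triples structure shown in the test output.

```python
import io


def lemmas_from_tags(tagged):
    """Pull the lemma (third field) out of each [token, tag, lemma] entry."""
    # Skip malformed entries that do not have all three fields.
    return [entry[2] for entry in tagged if len(entry) >= 3]


def write_lemmas(tagged, path):
    """Write one lemma per line to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for lemma in lemmas_from_tags(tagged):
            handle.write(lemma + u"\n")


# Sample data copied from the English test output.
sample = [[u'Does', u'VBZ', u'do'],
          [u'this', u'DT', u'this'],
          [u'thing', u'NN', u'thing']]
print(lemmas_from_tags(sample))
```

With a real tagger, the tagged argument would simply be the return value of a call such as tt_en.tag(...).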

That’s it! Enjoy your quick and easy multilingual POS-tagger and lemmatizer.

4 thoughts on “How to call TreeTagger from Python”

  1. Hello, thank you very much for your article. I found it very good and liked it a lot. I would like to know how I can use TreeTagger to parse a txt file and write only the lemmas to another txt file?

    1. Hi Carlos,

      If you are using TreeTagger from Python, the data structure that is returned by the tag method will contain the lemmas for every token, as you can see in the post. It is then a matter of extracting a list of these lemmas and writing them to a file. If you are using TreeTagger as a command-line tool, the output will be a tab-separated file where the third column will contain the lemmas of your input. Finally, if you are a Windows user, there exists a graphical interface from which you can select to output only the lemmas. Here is the link to get the Windows GUI working: http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
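
For the command-line route, the lemma column can be pulled out of the tab-separated output with a few lines of Python. This sketch assumes every line has the form token, tab, tag, tab, lemma, as described in the reply; the function name lemmas_from_tsv and the sample string are illustrative only.

```python
def lemmas_from_tsv(text):
    """Return the third tab-separated column of each well-formed line."""
    lemmas = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Ignore blank or incomplete lines rather than guessing at them.
        if len(fields) >= 3:
            lemmas.append(fields[2])
    return lemmas


# Hand-written sample in the token/tag/lemma layout, not real tool output.
sample_output = "Does\tVBZ\tdo\nthis\tDT\tthis\nthing\tNN\tthing\n"
print(lemmas_from_tsv(sample_output))
```

Joining the returned list with newlines and writing it to a file gives a lemma-only text file.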

  2. Thank you for this article. I was using TreeTaggerWrapper before. I tried to follow your explanation, but when I run it, the import works and then I get “NLTK was unable to find the TreeTagger bin!” Any idea what I could have done wrong?

    1. Hey there, sorry for the long delay in answering, I’m on vacation and away from computers 🙂

      NLTK is looking for the TreeTagger binary, but the path it is given does not lead to that binary.
      This path should be set somewhere in the configuration. A first candidate would be the TREETAGGER_HOME environment variable, which points to the top folder of the TreeTagger installation.

      Other people having the same issue discussed this on the treetagger-python GitHub repo: https://github.com/miotto/treetagger-python/issues/10
