Setting up POS tagging for less common languages usually takes a bit of effort. Luckily, for Croatian, Željko Agić has created a very good POS tagger model licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.

According to my research, Agić is the most important POS tagging researcher for the Croatian language. Another very important person in the field of Croatian NLP is Marko Tadić, but he seems to be more involved in corpus creation as a whole.

## Create hunpos binary

To use the Croatian POS tagger model with hunpos, you need to compile the latest version of hunpos from source. The model does not seem to work with the precompiled hunpos Linux binaries. Compiling is quite simple: download the package, unpack it, and then call the build script.

You now have the hunpos binaries in the current directory as tagger.native and trainer.native.

The next step is using the Croatian model. Just download the model from Agić’s website and rename it to something more convenient like croatian-ffzg.hunpos.

The input to the POS tagger is one token per line. Empty lines are used as sentence separators. So a simple test file based on the Wikipedia article about Lučka kapetanija (it was on the Croatian Wikipedia's main page on 2016-05-26) might look like this:

Lučka
kapetanija
(engl.
Harbour
Master's
Office)
je
glavna
institucija
s
provedbenom
funkcijom
u
području
pomorstva
u
Republici
Hrvatskoj

Na
čelu
lučke
kapetanije
nalazi
se
lučki
kapetan
po
kome
je
institucija
kroz
povijest
i
dobila
ime


According to my tests, you should remove punctuation marks, because they lead to wrong classifications. For example, when I had Hrvatskoj. and ime. in the file, the tagger classified Hrvatskoj. as a masculine noun (it is feminine) and ime. as a number (it is a noun). Without the dots, classification was correct: as far as I can tell, Hrvatskoj and ime are then tagged correctly as Npfsl (noun, proper, feminine, singular, locative) and Ncnsa (noun, common, neuter, singular, accusative). This degree of detail is really impressive.
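A test file in this format can also be generated instead of typed by hand. The following is a minimal sketch, not part of hunpos: the function name `to_hunpos_input` is made up for illustration, and it assumes that plain whitespace splitting plus stripping trailing punctuation is good enough for the text at hand.

```python
def to_hunpos_input(sentences):
    """Return one token per line, with an empty line after each sentence."""
    lines = []
    for sentence in sentences:
        for raw in sentence.split():
            # Drop trailing punctuation, which confused the tagger in my tests.
            token = raw.rstrip(".,;:!?")
            if token:
                lines.append(token)
        # An empty line marks the end of the sentence.
        lines.append("")
    return "\n".join(lines) + "\n"
```

Only trailing punctuation is stripped, so a token like Office) keeps its bracket, as in the test file above.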

You can run the POS tagging on this test file yourself with:
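One way to do this from Python is sketched below. It rests on an assumption about the command line that you should check against the hunpos documentation: that the tagger binary takes the model file as its only argument, reads tokens from standard input, and writes the tagged tokens to standard output. The helper name `run_tagger` is hypothetical.

```python
import subprocess


def run_tagger(command, hunpos_input):
    """Pipe hunpos_input into the tagger process and return what it prints."""
    result = subprocess.run(
        command,
        input=hunpos_input.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tagger exits with an error
    )
    return result.stdout.decode("utf-8")
```

With the binary and the renamed model in the current directory, the call would be `run_tagger(["./tagger.native", "croatian-ffzg.hunpos"], text)`, where `text` is the content of the test file.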

Classification for this whole file is:

Lučka	Agpfsn
kapetanija	Ncfsn
(engl	Ncmsn
Harbour	Npmsn
Master's	Npmsn
Office)	Npmsn
je	Vcr3s
glavna	Agpfsn
institucija	Ncfsn
s	Si
provedbenom	Agpfsi
funkcijom	Ncfsi
u	Sl
području	Ncnsl
pomorstva	Ncnsg
u	Sl
Republici	Ncfsl
Hrvatskoj	Npfsl
Vmr3s
Na	Sl
čelu	Ncnsl
lučke	Agpfsg
kapetanije	Ncfsg
nalazi	Vmr3s
se	Px--sa--ypn
lučki	Agpmsn
kapetan	Agpmsn
po	Sl
kome	Pp3fsi--n-n
je	Vcr3s
institucija	Ncfsn
kroz	Sa
povijest	Ncfsa
i	Cc
dobila	Vmp-sf
ime	Ncnsa


There seems to be a bug in hunpos's handling of the input format: even though the documentation states that "Empty lines are sentence separators", my empty lines are tagged as verbs (Vmr3s). However, in my test, this does not influence the tagging of the second sentence.
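When post-processing the output, these stray lines can be used as sentence boundaries. The sketch below assumes that every real result line has the form token, tab, tag, and treats any line lacking either part as a separator. The function name `parse_tagged` is made up for this example.

```python
def parse_tagged(output):
    """Split tagger output into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in output.splitlines():
        token, _, tag = line.partition("\t")
        token, tag = token.strip(), tag.strip()
        if token and tag:
            current.append((token, tag))
        elif current:
            # A line without a token or without a tag ends the sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```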

The POS tags follow the revised MULTEXT-East version 4 tagset.
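These tags are positional: the first letter is the part of speech and each following letter encodes one attribute. The sketch below decodes noun tags only. The attribute tables are my reading of the MULTEXT-East noun positions (type, gender, number, case), and the category table covers only letters that occur in the output above, so check both against the tagset specification before relying on them.

```python
# Only the categories that appear in the sample output above.
CATEGORIES = {
    "N": "noun", "V": "verb", "A": "adjective",
    "P": "pronoun", "S": "adposition", "C": "conjunction",
}

# Positions 2 to 5 of a noun tag, in order (assumed from MULTEXT-East v4).
NOUN_FIELDS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]


def pos_only(msd):
    """Reduce a full tag to its part-of-speech name (or the bare letter)."""
    return CATEGORIES.get(msd[0], msd[0])


def decode_noun(msd):
    """Decode a noun tag such as Npfsl into named attributes."""
    if not msd.startswith("N"):
        raise ValueError("not a noun tag: " + msd)
    decoded = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_FIELDS):
        if code != "-":  # a hyphen means the attribute is not specified
            decoded[name] = values.get(code, code)
    return decoded
```

For example, `decode_noun("Npfsl")` gives proper, feminine, singular, locative, which matches the reading of Hrvatskoj described earlier.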

According to Agić, the model achieves an accuracy of 87% on full MSD-HR and a “POS-only accuracy” of 97%. The worst numbers in Agić’s paper are 80% on full MSD-HR and 94% for POS-only. It is also worth mentioning that this POS tagger can be applied to both Croatian and Serbian (in Latin script). If you feed it Cyrillic characters, every token is tagged as a noun:

Први   Npmsn
потпредседник   Npmsn
Владе   Npmsn
Републике   Npmsn
Србије   Npmsn


However, you can just transliterate the characters to Latin script and then it works:

Prvi   Agpmsn
potpredsednik   Ncmsn
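The transliteration itself is a simple character mapping, because each Serbian Cyrillic letter corresponds to one Latin letter or digraph. A minimal sketch (the function name `transliterate` is made up here, and capitals of digraph letters are always rendered in title case, e.g. Љ becomes Lj, which is not right for text written in all caps):

```python
LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add capital letters: "Љ" -> "Lj", "Џ" -> "Dž", and so on.
CYR_TO_LAT = dict(LOWER)
CYR_TO_LAT.update({cyr.upper(): lat.capitalize() for cyr, lat in LOWER.items()})


def transliterate(text):
    """Replace Serbian Cyrillic letters with Latin ones, keeping all else."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)
```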

Agić also maintains a repository on GitHub for the corpora used to train his POS tagger (SETimes.HR). All of them are licensed under CC, but at the time of writing the news and web corpora include the NC (non-commercial) requirement.