Pseudo-lemmatization in Croatian-English SMT

Archive - Central European Conference on Information and Intelligent Systems, CECIIS - 2014

Marija Brkić, Maja Matetić, Sanja Seljan

Last modified: 2014-07-16

Abstract

One of the first difficulties in conducting a thorough analysis of statistical machine translation involving Croatian, a morphologically rich and resource-poor language, is the lack of quality language resources. This paper presents the results of two standard fourteen-feature Croatian-English phrase-based statistical machine translation systems. Prior to building the second system, a partial pseudo-lemmatization of the Croatian parts of the training and test sets is performed in an attempt to simplify the translation process. Besides automatic evaluation, manual evaluation is conducted in order to gain insight into the nature of the translation differences between the two systems.