Archive - Central European Conference on Information and Intelligent Systems, CECIIS - 2014

Searching for Semantically Correct Postal Addresses on the Croatian Web
Ivo Ugrina, Mislav Žigo

Last modified: 2014-07-13

Abstract


This article presents a method for extracting and simultaneously verifying postal addresses in web pages written in a highly inflected language (Croatian). The method combines direct city name extraction, a string similarity measure (Jaro-Winkler) for street names, an algorithm for resolving overlapping addresses, and a machine learning classifier (decision trees) to derive Semantically Correct Postal Addresses. A Semantically Correct Postal Address is defined as one that the author of the text intended to write, rather than one that arises from a chance ordering of words. The presented method performs geoparsing and geocoding jointly. For the initial search for cities and streets, the method relies on a database containing most of the streets and cities in Croatia. The method was evaluated on a data set of 13,000,000 documents (from 35,000 web domains) and found 4,000,000 addresses in 2,750,000 documents. The quality of the classifiers was tested on a hand-annotated set, giving F1 scores greater than 0.9.
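
To illustrate why a string similarity measure helps with inflected street names, the following is a minimal sketch of the standard Jaro-Winkler similarity, not the authors' implementation. The helper best_street_match, the 0.85 threshold, and the sample street list are hypothetical and chosen only for illustration; the abstract does not state which threshold or matching procedure was used.

```python
from typing import Optional


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def best_street_match(candidate: str, streets: list,
                      threshold: float = 0.85) -> Optional[str]:
    """Hypothetical helper: return the most similar known street name,
    or None if no score reaches the (assumed) threshold."""
    best_name, best_score = None, 0.0
    for name in streets:
        score = jaro_winkler(candidate.lower(), name.lower())
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Because Croatian inflection mostly changes word endings, an inflected form such as "Ilici" (locative) shares a long prefix with the base form "Ilica", and the prefix bonus in Jaro-Winkler rewards exactly that. With the assumed threshold, best_street_match("Ilici", ["Ilica", "Vukovarska"]) returns "Ilica".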