Please use this identifier to cite or link to this item: https://elib.bsu.by/handle/123456789/160177
Title: Robust multilingual document identification
Other Titles: Устойчивая идентификация многоязычного документа
Authors: Stefanovitch, N.
Keywords: BSU Digital Library::SOCIAL SCIENCES::Informatics
Issue Date: 25-Oct-2016
Publisher: Minsk: BSU
Abstract: In this paper we consider the problem of detecting the language of a document when no assumptions are made about it: the document can be of any size and contain zero, one, or several languages. Language identification is often considered a solved task, but among other shortcomings, existing methods do not accurately detect the presence or absence of several languages in arbitrary documents. To tackle these problems, we propose an approach based on word dictionaries, using Bayesian statistics and ad hoc features. We show on two datasets that, with sufficient statistics, our approach gives very satisfactory results on both unsolved tasks: detecting documents that contain no language and identifying the languages in multilingual documents.
URI: http://elib.bsu.by/handle/123456789/160177
ISBN: 978-985-566-369-1
Appears in Collections: Section 6. INTELLIGENT INFORMATION SYSTEMS

Files in This Item:
File: Stefanovitch.pdf
Size: 471.34 kB
Format: Adobe PDF