Semi-automatic bilingual corpus creation with zero entropy alignments

Laukaitis, Algirdas; Vasilecas, Olegas; Laukaitis, Ričardas; Plikynas, Darius

dc.contributor.author	Laukaitis, Algirdas
dc.contributor.author	Vasilecas, Olegas
dc.contributor.author	Laukaitis, Ričardas
dc.contributor.author	Plikynas, Darius
dc.date.accessioned	2023-09-18T18:55:37Z
dc.date.available	2023-09-18T18:55:37Z
dc.date.issued	2011
dc.identifier.issn	0868-4952
dc.identifier.other	(BIS)VGT02-000024041
dc.identifier.uri	https://etalpykla.vilniustech.lt/handle/123456789/133516
dc.description.abstract	In this paper, we describe a model for aligning books and documents from a bilingual corpus, with the goal of creating a "perfectly" aligned bilingual corpus at the word-to-word level. The presented algorithms differ from existing ones in that they account for the presence of a human translator, whose involvement we try to minimize. We treat the human translator as an oracle who knows the exact alignments, and the goal of the system is to minimize the use of this oracle. The effectiveness of the oracle is measured by the speed at which the oracle can create a "perfectly" aligned bilingual corpus. By a "perfectly" aligned corpus we mean a zero-entropy corpus, because the oracle makes alignments without any probabilistic interpretation, i.e., with 100% confidence. Sentence-level and word-to-word alignments, although treated separately in this paper, are integrated in a single framework. For sentence-level alignments we provide a dynamic programming algorithm that achieves low precision and recall error rates. For word-to-word alignments, an Expectation Maximization algorithm that integrates linguistic dictionaries is suggested as the main tool for the oracle to build a "perfectly" aligned bilingual corpus. We show empirically that the suggested pre-aligned corpus requires little interaction from the oracle and that a perfectly aligned corpus can be created at almost the speed of human reading. The presented algorithms are language independent, but in this paper we verify them on the English-Lithuanian language pair with two types of text: legal documents and fiction.	eng
dc.description.abstract	Šiame straipsnyje pristatomas metodas leidžiantis sukurti tikslius lygiagrečius dvikalbius tekstynus su mažomis žmogaus rankinio darbo sąnaudomis. Straipsnyje žmogus-vertėjas yra traktuojamas kaip orakulas, kuris žino tiksliai, kaip reikia anotuoti tekstyną, o sistemos tikslas – minimizuoti šio orakulo panaudojimą. Tai mes išmatuojame matuodami greitį su kuriuo žmogus sukuria tikslius lygiagrečius tekstynus. Žodis „tikslus“ tekstyno kūrimo kontekste naudojamas norint pabrėžti, kad orakulas anotuoja tekstyną su tikimybe lygia 1, t.y. be klaidų. Šiame straipsnyje pateikiami anotavimo algoritmai tiek sakinio lygmenyje, tiek žodžio lygmenyje, be to pasiūlytas metodas leidžia integruoti šiuos du algoritmų tipus į vieningą sistemą. Pasiūlytas metodas nepriklauso nuo kalbų pasirinkimo, tačiau šiame straipsnyje mes pateikiame eksperimentus, kurie buvo atlikti su anglų–lietuvių kalbų tekstynais. Straipsnyje parodome, kad pasiūlytas metodas ypač naudingas, kai mėginama anotuoti verstas grožinės literatūros knygas.	lit
dc.format	PDF
dc.format.extent	p. 203-224
dc.format.medium	tekstas / txt
dc.language.iso	eng
dc.relation.isreferencedby	Scopus
dc.relation.isreferencedby	INSPEC
dc.relation.isreferencedby	Science Citation Index Expanded (Web of Science)
dc.source.uri	https://doi.org/10.15388/Informatica.2011.323
dc.title	Semi-automatic bilingual corpus creation with zero entropy alignments
dc.title.alternative	Pusiau automatinis lygiagretaus tekstyno sudarymas su entropija, lygia nuliui
dc.type	Straipsnis Web of Science DB / Article in Web of Science DB
dcterms.references	20
dc.type.pubtype	S1 - Straipsnis Web of Science DB / Web of Science DB article
dc.contributor.institution	Vilniaus Gedimino technikos universitetas
dc.contributor.institution	Verslo ir vadybos akademija
dc.contributor.faculty	Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences
dc.subject.researchfield	N 009 - Informatika / Computer science
dc.subject.researchfield	T 007 - Informatikos inžinerija / Informatics engineering
dc.subject.en	Viterbi alignments
dc.subject.en	Dynamic programming
dc.subject.en	String alignments
dc.subject.en	Machine translation
dc.subject.en	Natural language processing
dc.subject.en	Rapid development
dc.subject.en	Low-density languages
dcterms.sourcetitle	Informatica
dc.description.issue	no. 2
dc.description.volume	Vol. 22
dc.publisher.name	Matematikos ir informatikos institutas
dc.publisher.city	Vilnius
dc.identifier.doi	10.15388/Informatica.2011.323
dc.identifier.elaba	3969580

Files in this item

Name: inf_Vol22No2_203-224_Laukaitis.pdf
Size: 983.3Kb
Format: PDF

This item appears in the following Collection(s)

Straipsniai Web of Science ir/ar Scopus referuojamuose leidiniuose / Articles in Web of Science and/or Scopus indexed sources [7946]
