The Impact of Statistical Word Alignment Quality and Structure in Phrase Based Statistical Machine Translation

Hdl Handle:
http://hdl.handle.net/11285/572561
Title:
The Impact of Statistical Word Alignment Quality and Structure in Phrase Based Statistical Machine Translation
Authors:
Guzmán Herrera, Francisco J.
Issue Date:
01/12/2011
Abstract:
Statistical word alignments represent lexical word-to-word translations between source and target language sentences. They are considered the starting point for many state-of-the-art Statistical Machine Translation (SMT) systems. In phrase-based systems, word alignments are loosely linked to the translation model. Despite the improvements reached in word alignment quality, there has been only a modest improvement in end-to-end translation. Until recently, little or no attention was paid to the structural characteristics of word alignments (e.g. unaligned words) and their impact on later stages of the phrase-based SMT pipeline. A better understanding of the relationship between word alignment and the subsequent processes will help to identify the variables across the pipeline that most influence translation performance and that can be controlled by modifying the characteristics of the word alignment. In this dissertation, we perform an in-depth study of the impact of word alignments at different stages of the phrase-based statistical machine translation pipeline, namely word alignment, phrase extraction, phrase scoring and decoding. Moreover, we establish a multivariate prediction model for different variables of word alignments, phrase tables and translation hypotheses. Based on those models, we identify the most important alignment variables and propose two alternatives to provide more control over alignment structure and thus improve SMT. Our results show that incorporating alignment structure into decoding via alignment gap features yields significant improvements, especially in situations where translation data is limited. During the development of this dissertation we discovered how different characteristics of the alignment impact machine translation. We observed that while good-quality alignments yield good phrase pairs, the consolidation of a translation model depends on the alignment structure, not on its quality.
Human alignments are denser than their computer-generated counterparts, which tend to be sparser and precision-oriented. Trying to emulate human-like alignment structure resulted in poorer systems, because the resulting translation models tend to be more compact and lack translation options. On the other hand, more translation options, even if they are noisier, help to improve the quality of the translation. This is because translation does not rely only on the translation model, but also on other factors that help to filter out the noise from bad translations (e.g. the language model). Lastly, when we provide the decoder with features that help it to make "more informed decisions", we observe a clear improvement in translation quality. This was especially true for the discriminative alignments, which inherently leave more unaligned words. The result is more evident in low-resource settings, where having larger translation lexicons represents more translation options. Using simple features to help the decoder discriminate among translation hypotheses showed clear and consistent improvements.
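As a rough illustration of the structural characteristics the abstract mentions (link count and unaligned words), the following sketch compares a dense and a sparse alignment of the same sentence pair. It is not the dissertation's method: the function name `alignment_structure`, the representation of an alignment as a set of (source index, target index) links, and the toy sentence pair are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: an alignment is a set of 0-based (source index, target index) links.
Link = Tuple[int, int]


def unaligned_positions(length: int, aligned: Iterable[int]) -> List[int]:
    """Return the positions in range(length) that no link touches."""
    covered: Set[int] = set(aligned)
    return [i for i in range(length) if i not in covered]


def alignment_structure(src_len: int, tgt_len: int,
                        links: Iterable[Link]) -> Dict[str, object]:
    """Summarize simple structural properties of one sentence-pair alignment."""
    link_set = set(links)
    return {
        "links": len(link_set),
        "src_unaligned": unaligned_positions(src_len, (s for s, _ in link_set)),
        "tgt_unaligned": unaligned_positions(tgt_len, (t for _, t in link_set)),
    }


if __name__ == "__main__":
    # Hypothetical pair: "la casa verde" / "the green house".
    dense = alignment_structure(3, 3, {(0, 0), (1, 2), (2, 1)})
    sparse = alignment_structure(3, 3, {(1, 2)})
    print("dense :", dense)
    print("sparse:", sparse)
```

In this toy case the sparse alignment leaves two words unaligned on each side, which is the kind of structural difference the abstract contrasts with alignment quality.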
Keywords:
Statistical Machine; Hidden Markov Models; Empirical Methods
Degree Program:
Doctoral Program Information Technologies and Communications
Advisors:
Dr. Leonardo Garrido Luna
Committee Member / Sinodal:
Dr. Stephan Vogel
Degree Level:
Doctor of Philosophy in Information Technologies and Communications
School:
School Of Engineering And Information Technologies
Campus Program:
Campus Monterrey
Discipline:
Ingeniería y Ciencias Aplicadas / Engineering & Applied Sciences
Appears in Collections:
Ciencias Exactas

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Dr. Leonardo Garrido Luna | es
dc.contributor.author | Guzmán Herrera, Francisco J. | es
dc.date.accessioned | 2015-08-17T11:35:18Z | en
dc.date.available | 2015-08-17T11:35:18Z | en
dc.date.issued | 01/12/2011 | -
dc.identifier.uri | http://hdl.handle.net/11285/572561 | en
dc.language.iso | en | en
dc.rights | Open Access | en
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | *
dc.title | The Impact of Statistical Word Alignment Quality and Structure in Phrase Based Statistical Machine Translation | en
dc.type | Tesis de Doctorado | es
thesis.degree.grantor | Instituto Tecnológico y de Estudios Superiores de Monterrey | es
thesis.degree.level | Doctor of Philosophy in Information Technologies and Communications | en
dc.contributor.committeemember | Dr. Stephan Vogel | es
thesis.degree.discipline | School Of Engineering And Information Technologies | en
thesis.degree.name | Doctoral Program Information Technologies and Communications | en
dc.subject.keyword | Statistical Machine | en
dc.subject.keyword | Hidden Markov Models | en
dc.subject.keyword | Empirical Methods | en
thesis.degree.program | Campus Monterrey | es
dc.subject.discipline | Ingeniería y Ciencias Aplicadas / Engineering & Applied Sciences | es