This thesis presents efficient word alignment algorithms for sentence-aligned parallel corpora. For the most part, the procedures employed are statistical translation-modelling techniques that use measures of term distribution as the basis for finding correlations between terms. Linguistic rules of morphology and syntax have also been used to complement the statistical methods. Two models have been developed, briefly described as follows:
Alignment Model I.
For this first model a statistical global alignment method has been designed in which the entire target document is searched for the translation of a source word. The term in the target language whose distribution is most similar to that of the source term is taken as the best translation. The output of this algorithm is a 1:1 alignment of complex Amharic words with simple English words.
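The following is a minimal sketch of the kind of global, distribution-based search Model I performs, assuming a simple cosine similarity over binary sentence-occurrence vectors; the similarity measure, data structures and function names are illustrative assumptions, not the thesis's actual implementation.

```python
import math
from collections import defaultdict

def occurrence_vectors(sentences):
    """Map each word to the set of sentence indices it occurs in."""
    vectors = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            vectors[word].add(i)
    return vectors

def similarity(occ_a, occ_b):
    """Cosine similarity of two binary occurrence vectors."""
    shared = len(occ_a & occ_b)
    return shared / math.sqrt(len(occ_a) * len(occ_b))

def align_1_to_1(amharic_sentences, english_sentences):
    """For each Amharic word, pick the English word whose distribution
    over the whole sentence-aligned document is most similar."""
    am_vecs = occurrence_vectors(amharic_sentences)
    en_vecs = occurrence_vectors(english_sentences)
    return {am_word: max(en_vecs, key=lambda w: similarity(am_occ, en_vecs[w]))
            for am_word, am_occ in am_vecs.items()}
```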
In reality, one word in one language is not necessarily translated into a single word in the other, and vice versa. This phenomenon is even more pronounced in disparate languages such as English and Amharic. Therefore, an enhancement method, a relaxing routine, has been devised that scales the 1:1 alignments up into 1:m alignments. This approach, which synthesises English chunks equivalent to Amharic words from parallel corpora, is also described in this study. The procedure allows several words in the simpler language to be brought together to form a chunk equivalent to the complex word in the other language.
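A hedged sketch of how such a relaxing step might look, reusing the `similarity` helper and occurrence vectors from the previous sketch: instead of keeping only the single best English word, every candidate whose distributional similarity falls within a relaxation margin of the best score is retained, yielding a 1:m chunk. The margin value and selection rule are illustrative assumptions.

```python
def relax_alignment(am_occ, en_vecs, margin=0.8):
    """Keep every English candidate whose similarity is within a relaxation
    margin of the best score, turning a 1:1 link into a 1:m chunk."""
    scored = [(similarity(am_occ, occ), word) for word, occ in en_vecs.items()]
    best_score = max(score for score, _ in scored)
    return [word for score, word in scored if score >= margin * best_score]
```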
The relaxing procedure may resolve the shortcomings of a 1:1 alignment, but it does not remove the distortion in word statistics created by morphological variants; hence finite-state shallow stemmers that strip salient affixes in both languages have also been developed.
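A minimal sketch of a shallow, affix-stripping stemmer of this kind is given below for English; the affix inventories actually used in the thesis (for both English and Amharic) are not reproduced here, and the suffix list shown is illustrative only.

```python
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s", "ly")   # illustrative, not exhaustive

def shallow_stem(word, suffixes=ENGLISH_SUFFIXES, min_stem=3):
    """Strip the longest matching suffix, leaving at least `min_stem` characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# e.g. shallow_stem("aligned") -> "align", shallow_stem("words") -> "word"
```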
Alignment Model II.
Model II performs local alignment: a word in a source-language sentence is aligned to a word in the corresponding target-language sentence. The search for the translation of a word is thus limited to the corresponding sentence rather than the entire document. This is a step towards achieving increased recall, which is vital when dealing with languages for which translation texts are scarce. The procedure, however, results in a drop in precision. To recover the lost precision, two procedures have been integrated into it (a sketch follows the list below):
1. Reuse of the lexicon from Model I: known translations are excluded from the search space, leaving a limited number of words from which to choose the most likely translation; and
2. A pattern-recognition approach that exploits morphological and syntactic features to guess translations within sentences.
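The sketch below illustrates the sentence-local search with lexicon reuse: words whose translations are already known from Model I are removed from both sides of each sentence pair, and the remaining source words are matched against the shrunken candidate set. The co-occurrence scoring used here is an illustrative stand-in for the thesis's actual selection criterion.

```python
from collections import Counter

def local_align(sentence_pairs, lexicon):
    """sentence_pairs: iterable of (amharic_sentence, english_sentence) strings;
    lexicon: Amharic-to-English translations already known from Model I."""
    cooc = Counter()
    for am_sent, en_sent in sentence_pairs:
        am_left = [w for w in am_sent.split() if w not in lexicon]
        known = {lexicon[w] for w in am_sent.split() if w in lexicon}
        en_left = [w for w in en_sent.split() if w not in known]
        for am_word in am_left:
            for en_word in en_left:
                cooc[(am_word, en_word)] += 1
    # for each unknown Amharic word, pick the residual English word it co-occurs with most
    best = {}
    for (am_word, en_word), count in cooc.items():
        if count > best.get(am_word, (0, None))[0]:
            best[am_word] = (count, en_word)
    return {am: en for am, (_, en) in best.items()}
```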
A comparative study of the performance of Model I across Amharic, English, Hebrew and German was also part of the study. The impact of the languages' complexities and typological disparities on the performance of the alignment method has been observed.
A further attempt to exploit translation texts in the course of this research was the recognition of Amharic nouns by transfer from German translations. Since nouns in German are marked by their initial capital letter, aligning nouns leads to the recognition of nouns in Amharic, which lack surface features distinguishing them from words of other word classes.
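A minimal sketch of this noun-transfer idea, assuming an Amharic-to-German word alignment is already available (for example from one of the models above); sentence-initial capitalisation in German and any other refinements used in the thesis are glossed over, and the example words are illustrative transliterations.

```python
def transfer_nouns(alignment):
    """alignment: dict mapping Amharic words to their aligned German words.
    Returns the Amharic words whose German counterpart is capitalised,
    i.e. is heuristically a noun."""
    return {am for am, de in alignment.items() if de[:1].isupper()}

# e.g. transfer_nouns({"bet": "Haus", "hede": "ging"}) -> {"bet"}
```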
All the components of the system have been evaluated on text aligned at sentence level. On the same data, a comparison has also been made with GIZA++, an implementation of the IBM alignment models.
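For completeness, a minimal sketch of how alignment precision and recall might be computed against a gold standard, assuming both the system output and the reference are given as sets of (source word, target word) links; the thesis's exact evaluation protocol and the GIZA++ comparison setup are not reproduced here.

```python
def precision_recall(predicted_links, gold_links):
    """Both arguments are sets of (source_word, target_word) alignment links."""
    correct = len(predicted_links & gold_links)
    precision = correct / len(predicted_links) if predicted_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    return precision, recall
```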