The specific aim of this project is to develop techniques for extracting pairs of equivalent terms from Spanish-Basque translation memories, and to develop a tool that implements these techniques. This aim falls within a broader field of research: the extraction of lexical equivalences from parallel corpora.
Nowadays, many translation-aid systems based on translation memories incorporate a module for extracting bilingual terminology, which makes it possible to enrich the system’s terminological database semi-automatically. These systems have no tools for processing Basque automatically and, as a result, rely on statistical techniques alone for the language. As Basque is an agglutinative language, a single lemma appears in many different inflected forms, so the values of statistical measures based on the co-occurrence of text forms are dispersed and are therefore not significant. Consequently, the results achieved using Basque-Spanish translation memories are poor.
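The dispersion effect can be illustrated with a minimal sketch. It assumes the Dice coefficient as the co-occurrence measure (one common choice, not necessarily the one used by any particular system), a handful of invented translation units, and a hypothetical lookup table `TOY_LEMMAS` standing in for a real Basque lemmatizer.

```python
def dice(units, src_term, tgt_term):
    """Dice coefficient of two terms over aligned translation units.

    Each unit is a (source_tokens, target_tokens) pair of sets.
    """
    src_n = sum(1 for s, _ in units if src_term in s)
    tgt_n = sum(1 for _, t in units if tgt_term in t)
    both = sum(1 for s, t in units if src_term in s and tgt_term in t)
    return 2 * both / (src_n + tgt_n) if src_n + tgt_n else 0.0


# Invented toy data: Spanish "casa" aligned with four inflected forms of Basque "etxe".
units = [
    ({"casa"}, {"etxea"}),
    ({"casa"}, {"etxean"}),
    ({"casa"}, {"etxera"}),
    ({"casa"}, {"etxetik"}),
]

# Hypothetical stand-in for a lemmatizer: surface form -> lemma.
TOY_LEMMAS = {"etxea": "etxe", "etxean": "etxe", "etxera": "etxe", "etxetik": "etxe"}

# Same units, with the Basque side reduced to lemmas.
lemmatized = [(s, {TOY_LEMMAS.get(w, w) for w in t}) for s, t in units]

# Scores on surface forms: the evidence is split across four forms.
form_scores = {form: dice(units, "casa", form) for form in TOY_LEMMAS}
# Score on the lemma: all four co-occurrences count towards one pair.
lemma_score = dice(lemmatized, "casa", "etxe")

print(form_scores)
print(lemma_score)
```

In this toy case each surface form co-occurs with "casa" only once out of four units, whereas the lemma co-occurs with it every time, which is why measures computed on text forms alone tend to look weak for Basque.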
Through ELexBI we seek to offer an extraction tool that takes the nature of Basque into account. To this end, techniques for extracting lexical and terminological resources from parallel corpora are currently being examined. Statistical techniques form the basis of most of the systems proposed and developed (in particular those developed in the context of statistical machine translation); alongside them there are linguistic techniques, in particular those developed in the context of terminology extraction. Recently there has been a clear trend towards combining the two kinds of techniques. In our trials, too, the best extraction results have been obtained by combining linguistic and statistical techniques, and the use of a number of heuristics has improved the results even further.
Overall, the process is structured as follows: first, candidate terms (both single-word and multi-word) are identified in each language by means of linguistic techniques; then the candidates are paired, mainly through statistical techniques.
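The pairing stage can be sketched as follows. This is only an illustration under stated assumptions, not ELexBI's actual algorithm: it assumes the candidate terms of each translation unit have already been identified and lemmatized, scores every co-occurring pair with the Dice coefficient, and links pairs greedily from the best score down so that each candidate is used at most once. The function name `pair_candidates`, the `min_score` threshold and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import product


def pair_candidates(units, min_score=0.5):
    """Greedy one-to-one pairing of candidate terms.

    units: list of (source_candidates, target_candidates) sets,
    one pair of sets per translation unit.
    Returns a list of (source_term, target_term, score) tuples.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in units:
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        joint.update(product(src_terms, tgt_terms))

    # Dice coefficient for every pair seen together, best first.
    scored = sorted(
        ((2 * n / (src_freq[s] + tgt_freq[t]), s, t) for (s, t), n in joint.items()),
        reverse=True,
    )

    linked_src, linked_tgt, pairs = set(), set(), []
    for score, s, t in scored:
        if score < min_score:
            break  # remaining pairs score even lower
        if s not in linked_src and t not in linked_tgt:
            pairs.append((s, t, score))
            linked_src.add(s)
            linked_tgt.add(t)
    return pairs


# Invented candidate sets (Spanish, Basque) for three translation units.
units = [
    ({"memoria de traducción", "término"}, {"itzulpen-memoria", "termino"}),
    ({"memoria de traducción"}, {"itzulpen-memoria"}),
    ({"término", "base de datos"}, {"termino", "datu-base"}),
]

pairs = pair_candidates(units)
for s, t, score in pairs:
    print(f"{s} <-> {t} ({score:.2f})")
```

The greedy one-to-one constraint is a simple way of stopping a frequent candidate from absorbing several partners; a real system would likely add further filters and heuristics on top of the statistical score.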
To identify the candidate terms of each language, ELexBI uses the following tools: for Basque, the Erauzterm term extractor, developed by the Elhuyar Foundation in collaboration with the IXA group; for Spanish, the free software package Freeling, developed by the Centre de Tecnologies i Aplicacions del Llenguatge i la Parla (TALP) of the UPC and the Centre de Llenguatge i Computació of the University of Barcelona.
General diagram of ELexBI:
To select equivalents, ELexBI uses a step-by-step algorithm:
Copyright © 2007 Elhuyar Fundazioa