
Elhuyar Fundazioa - Language Services 


R+D+I

Lexicon and terminology extraction

ELexBI

Aims and general description

The specific aim of this project is to develop techniques for extracting pairs of equivalent terms from Spanish-Basque translation memories, and to build a tool that implements them. This aim belongs to a broader field of research: the extraction of lexical equivalences from parallel corpora.

Nowadays, many translation aid systems based on translation memories incorporate a module for extracting bilingual terminology, which makes it possible to enrich the system’s terminological database semi-automatically. These systems have no tools for processing Basque automatically, so only statistical techniques are applied to the language. Because Basque is agglutinative, a single lemma appears under many different text forms; statistical measures based on the co-occurrence of text forms are therefore dispersed across those forms and are not significant. Consequently, the results achieved with Basque-Spanish translation memories are poor.
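The dispersion effect can be illustrated with a small sketch. The aligned segments and the lemma table below are invented for illustration (the table merely stands in for a real Basque morphological analyser), and `cooccurrence_counts` is a hypothetical helper, not part of any tool mentioned here.

```python
from collections import Counter

# Toy aligned segments (Spanish tokens, Basque tokens), invented for illustration.
segments = [
    (["casa"], ["etxea"]),     # the house
    (["casa"], ["etxean"]),    # in the house
    (["casa"], ["etxetik"]),   # from the house
    (["casa"], ["etxera"]),    # to the house
    (["perro"], ["txakurra"]), # the dog
]

# Hypothetical lemma table standing in for a real morphological analyser.
LEMMAS = {
    "etxea": "etxe", "etxean": "etxe", "etxetik": "etxe", "etxera": "etxe",
    "txakurra": "txakur",
}

def cooccurrence_counts(segments, normalise=lambda token: token):
    """Count in how many aligned segments each (Spanish, Basque) pair co-occurs."""
    counts = Counter()
    for es_tokens, eu_tokens in segments:
        for es in set(es_tokens):
            for eu in {normalise(token) for token in eu_tokens}:
                counts[(es, eu)] += 1
    return counts

# Counting raw text forms splits the evidence for "casa" over four Basque forms.
surface = cooccurrence_counts(segments)
# Counting lemmas concentrates the same evidence on a single pair.
lemmatised = cooccurrence_counts(segments, normalise=lambda t: LEMMAS.get(t, t))

print(surface[("casa", "etxea")], lemmatised[("casa", "etxe")])
```

With text forms, each of the four Basque variants co-occurs with "casa" only once; with lemmas, the pair ("casa", "etxe") collects all four co-occurrences, which is what gives an association measure something to work with.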

Through ELexBI we are seeking to offer an extraction tool that takes the nature of Basque into consideration. For this purpose, techniques for extracting lexical and terminological resources from parallel corpora are currently being examined. Statistical techniques form the basis of most of the systems proposed and developed, in particular those developed in the context of statistical machine translation; alongside them there are linguistic techniques, in particular those developed in the context of terminology extraction. Recently there has been a clear trend towards combining the two kinds of techniques. In our trials, too, the best extraction results have been obtained by combining linguistic and statistical techniques, and the use of a number of heuristics has improved the results even further.

Overall, the process is structured as follows: first, the candidate terms of each language (single-word as well as multiword) are identified by means of linguistic techniques; then the candidates are paired, mainly by means of statistical techniques.

To identify the candidate terms of each language, ELexBI uses the following tools. For Basque, it uses the Erauzterm term extractor, developed by the Elhuyar Foundation in collaboration with the IXA group. For Spanish, it uses the free software package Freeling, developed by the UPC Centre de Tecnologies i Aplicacions del Llenguatge i la Parla (TALP) and the Centre de Llenguatge i Computació of the University of Barcelona.

ELexBI tool

[Figure: general diagram of ELexBI]

 

To select equivalents, ELexBI uses a step-by-step algorithm:

  • In the first step, the ‘most reliable’ pairs are selected: headings, chapter titles and similar segments yield a pair directly when the candidate term in each language spans the whole of the corresponding heading
  • After that, the similarity between candidates is taken into account: pairs whose longest common subsequence ratio (LCSR) is greater than 0.8 are selected
  • Finally, the co-occurrence of candidates in the same segment is taken into consideration, using statistical association measures (AM): mutual information (MI), likelihood ratio (LR), chi-square (χ2), t-score and the Dice coefficient
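The second and third steps can be sketched as follows. This is a minimal illustration, not ELexBI's actual code: it assumes the usual definition of LCSR (length of the longest common subsequence divided by the length of the longer string) and the standard 2×2 contingency-table formulas for the association measures. The function names and the sample counts are hypothetical, and the likelihood ratio is left out for brevity.

```python
import math

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Contingency counts over aligned segments (assumed layout):
#   both   = segments containing both candidates
#   only_s = segments containing only the source candidate
#   only_t = segments containing only the target candidate
#   none   = segments containing neither
# The functions assume non-zero marginal totals.

def dice(both, only_s, only_t, none):
    return 2 * both / ((both + only_s) + (both + only_t))

def mutual_information(both, only_s, only_t, none):
    n = both + only_s + only_t + none
    if both == 0:
        return float("-inf")
    return math.log2(both * n / ((both + only_s) * (both + only_t)))

def chi_square(both, only_s, only_t, none):
    n = both + only_s + only_t + none
    num = n * (both * none - only_s * only_t) ** 2
    den = (both + only_s) * (only_t + none) * (both + only_t) * (only_s + none)
    return num / den

def t_score(both, only_s, only_t, none):
    n = both + only_s + only_t + none
    if both == 0:
        return 0.0
    expected = (both + only_s) * (both + only_t) / n
    return (both - expected) / math.sqrt(both)

# Step 2: a cognate-like pair passes the 0.8 threshold, an unrelated pair does not.
print(lcsr("parlamento", "parlamentu") > 0.8, lcsr("casa", "etxe") > 0.8)

# Step 3: hypothetical counts for a pair seen together in 8 of 100 segments.
counts = (8, 2, 1, 89)
print(dice(*counts), mutual_information(*counts), chi_square(*counts), t_score(*counts))
```

The string-similarity step is cheap and catches cognates and borrowings, which is presumably why it comes before the co-occurrence step; the association measures are then needed only for the pairs that orthography cannot decide.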

 


Copyright © 2007 Elhuyar Fundazioa
