Elhuyar Fundazioa - Language Services 

R&D

Lexicon and terminology extraction

Erauzterm 

Aims and general description
  • To carry out research into and to develop techniques for automatically extracting terms from texts and corpora in Basque
  • To implement a tool using these techniques

Many approaches to automatic term extraction have been tried, but they rest mainly on two kinds of technique: linguistic and statistical. Because Basque is an agglutinative language, an exclusively statistical method would not give acceptable results: inflection spreads the occurrences of a single term across many surface forms. This is why we have opted for a hybrid method. In the first step, linguistic techniques are used to extract candidate terms; in the second, statistical techniques are used to rank them.
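The data-spread problem can be illustrated with a minimal sketch. The lemma table below is a hand-written, hypothetical stand-in for a real Basque morphological analyser, and the function names are illustrative only; nothing here is Erauzterm's actual implementation.

```python
from collections import Counter

# Hypothetical lemma lookup (assumption): a real system would call a
# Basque morphological analyser instead of a hand-written table.
TOY_LEMMAS = {
    "etxea": "etxe",
    "etxeak": "etxe",
    "etxean": "etxe",
    "etxeko": "etxe",
    "etxera": "etxe",
}


def count_surface_forms(tokens):
    """Count tokens exactly as they appear in the text."""
    return Counter(tokens)


def count_lemmas(tokens, lemmas=TOY_LEMMAS):
    """Count tokens after mapping each inflected form to its lemma."""
    return Counter(lemmas.get(tok, tok) for tok in tokens)


tokens = ["etxea", "etxean", "etxeko", "etxera", "etxeak", "etxea"]
surface = count_surface_forms(tokens)
by_lemma = count_lemmas(tokens)

print(surface.most_common(1))   # [('etxea', 2)]
print(by_lemma.most_common(1))  # [('etxe', 6)]
```

Counting surface forms splits six occurrences of one word over five entries, whereas counting lemmas gives a single entry with frequency 6, which is the figure a statistical measure needs.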

The fields researched in order to develop linguistic techniques are:

  • Analysis of the morpho-syntactic structures of Basque terms and specification of the main (most productive) models (N, NN, NApos, AprepN, NNApos, AprepNApos, etc.).
  • A grammar and transducer to automatically mark syntagms on the basis of the chosen morpho-syntactic models (extraction of the ‘long’ candidate terms).
  • Analysis of the ‘nested’ terms found within the long candidate terms, in order to extract new candidates from among them.
  • Examination of term variants in Basque: techniques to automatically process the variant types, or at least some of them, so that variants can be linked with each other.
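
The extraction of long candidates and of the terms nested inside them can be sketched as follows. This is a simplified illustration, not the grammar and transducer mentioned above: it assumes the text has already been lemmatised and tagged, reduces the tag set to "N" (noun) and "A" (adjective), treats an adjective after a noun as postposed (Apos) and one before a noun as preposed (Aprep), and uses a greedy longest-match scan. All names and the sample input are hypothetical.

```python
# Simplified tag sequences standing in for the morpho-syntactic models
# named in the text (assumption: adjective position decides Apos/Aprep).
PATTERNS = {
    ("N",): "N",
    ("N", "N"): "NN",
    ("N", "A"): "NApos",
    ("A", "N"): "AprepN",
    ("N", "N", "A"): "NNApos",
    ("A", "N", "A"): "AprepNApos",
}
MAX_LEN = max(len(p) for p in PATTERNS)


def long_candidates(tagged):
    """Greedy left-to-right scan keeping the longest match at each position.

    `tagged` is a list of (lemma, tag) pairs; returns (window, model) pairs.
    """
    found = []
    i = 0
    while i < len(tagged):
        for n in range(min(MAX_LEN, len(tagged) - i), 0, -1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                found.append((window, PATTERNS[tags]))
                i += n
                break
        else:
            i += 1  # no model starts here; move on
    return found


def nested_candidates(window):
    """Proper sub-spans of a long candidate that match a model themselves."""
    nested = []
    for n in range(1, len(window)):
        for start in range(len(window) - n + 1):
            sub = window[start:start + n]
            tags = tuple(tag for _, tag in sub)
            if tags in PATTERNS:
                nested.append((" ".join(lemma for lemma, _ in sub), PATTERNS[tags]))
    return nested


# Toy tagged input (hypothetical tagger output).
tagged = [("energia", "N"), ("iturri", "N"), ("berriztagarri", "A"), ("izan", "V")]
longs = long_candidates(tagged)
nested = nested_candidates(longs[0][0])
```

On this input the scan finds one long candidate of model NNApos, and the nested step yields "energia" (N), "iturri" (N), "energia iturri" (NN) and "iturri berriztagarri" (NApos) as further candidates.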

Areas in which research has been conducted to develop statistical techniques:

  • Measures to classify the termhood of individual candidates.
  • Measures to classify the termhood of multi-word candidates. Statistical association measures (AM): mutual information (MI, MI3), likelihood ratio (LR), chi-square (χ²), t-score, Dice coefficient, etc.
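
As an illustration of the association measures listed above, the sketch below computes them for a two-word candidate from its 2×2 contingency table, using the common textbook formulations. It is an assumption that these formulations match the ones studied in the project; the function name and parameters are hypothetical.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Association measures for a word pair (x, y).

    f_xy: frequency of the pair, f_x / f_y: frequencies of each word,
    n: total number of pairs in the corpus.
    """
    # Observed 2x2 contingency table.
    o11 = f_xy
    o12 = f_x - f_xy
    o21 = f_y - f_xy
    o22 = n - f_x - f_y + f_xy
    if f_xy <= 0 or f_x >= n or f_y >= n or min(o12, o21, o22) < 0:
        raise ValueError("inconsistent frequencies")
    observed = [o11, o12, o21, o22]

    # Expected frequencies if x and y were independent.
    e11 = f_x * f_y / n
    e12 = f_x * (n - f_y) / n
    e21 = (n - f_x) * f_y / n
    e22 = (n - f_x) * (n - f_y) / n
    expected = [e11, e12, e21, e22]

    return {
        "MI": math.log2(o11 / e11),
        "MI3": math.log2(o11 ** 3 / e11),
        # Cells with zero observations contribute nothing to the sum.
        "LLR": 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0),
        "chi2": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
        "t": (o11 - e11) / math.sqrt(o11),
        "Dice": 2 * o11 / (f_x + f_y),
    }


scores = association_scores(f_xy=20, f_x=40, f_y=50, n=10000)
independent = association_scores(f_xy=1, f_x=10, f_y=10, n=100)
```

In the second call the pair occurs exactly as often as independence predicts, so MI, LLR, chi-square and t-score are all zero; Dice is not, because it does not compare against an expected frequency.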

Erauzterm tool

Erauzterm’s modules:

  • Corpus builder
  • Automatic term extraction
    • Linguistic Module: to tag the corpus, to extract candidate terms → Candidate Term Database
    • Statistical Module: to calculate statistical measures of the candidate terms → Candidate Term Ranking
  • Corpus browser
    • Options for ordering the candidate list (alphabetically, by frequency, by association measure)
    • Information on the candidates: canonical form, morpho-syntactic model, frequency, selected association measure (LLR, log-likelihood ratio, in the illustration)
    • Display of candidate contexts: concordances (KWIC) and broader context
    • Display of candidate variants (T: ortho-typographical; M: morpho-syntactic; S: syntactic)
    • Field specification
    • Term assessment
    • Term export
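
The KWIC (keyword in context) concordance view mentioned in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the corpus is a flat list of tokens, the candidate is matched on exact tokens rather than on its canonical form, and the function name and `width` parameter are hypothetical.

```python
def kwic(tokens, candidate, width=3):
    """Return (left, keyword, right) triples for each occurrence of candidate.

    tokens and candidate are lists of strings; width is the number of
    context tokens kept on each side.
    """
    n = len(candidate)
    lines = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == candidate:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, " ".join(candidate), right))
    return lines


# Placeholder tokens (w1, w2, ...) stand in for the surrounding text.
tokens = "w1 w2 energia iturri w3 w4 w5 energia iturri w6".split()
concordance = kwic(tokens, ["energia", "iturri"], width=2)
for left, keyword, right in concordance:
    print(f"{left:>10} | {keyword} | {right}")
```

A broader-context view would follow the same idea with a larger window, for example the whole sentence or paragraph around each hit.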

User interface

[Figure: Erauzterm interface]

Funding
  • Supported by funding from the 2002 Saiotek grant call of the Department for Industry, Tourism and Trade of the Basque Autonomous Community Government
  • Supported by funding from the 2003 Saiotek grant call of the Department for Industry, Tourism and Trade of the Basque Autonomous Community Government
  • Supported by funding from the 2003 Network Agents grant call of the Provincial Council of Gipuzkoa
Copyright © 2007 Elhuyar Fundazioa