Erauzterm
Aims and general description
- To research and develop techniques for automatically extracting terms from Basque texts and corpora
- To implement a tool using these techniques
Many different approaches have been used to extract terms automatically, but they nearly all rest on two families of techniques: linguistic and statistical. Because Basque is an agglutinative language, an exclusively statistical method would not give acceptable results: inflection spreads the occurrences of a single term across many surface forms. This is why we have opted for a hybrid method. In the first step, linguistic techniques are used to extract candidate terms. In the second step, statistical techniques are used to rank those candidates.
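The data-spread problem can be illustrated with a minimal Python sketch. The lemma table below is a hand-written toy lookup assumed purely for illustration; it is not Erauzterm's linguistic module, and a real system would need a morphological analyser or lemmatiser for Basque.

```python
from collections import Counter

# Toy lemma lookup (an assumption for illustration only): four inflected
# forms of the Basque noun "etxe" (house) mapped to their shared lemma.
TOY_LEMMAS = {
    "etxea": "etxe",
    "etxeak": "etxe",
    "etxean": "etxe",
    "etxearen": "etxe",
}


def count_surface_forms(tokens):
    """Count tokens exactly as they appear in the text."""
    return Counter(tokens)


def count_lemmas(tokens, lemmas=TOY_LEMMAS):
    """Count tokens after mapping each one to its lemma (if known)."""
    return Counter(lemmas.get(tok, tok) for tok in tokens)


tokens = ["etxea", "etxean", "etxeak", "etxearen", "etxea"]
surface = count_surface_forms(tokens)  # counts split over four forms
by_lemma = count_lemmas(tokens)        # all five occurrences under "etxe"
```

Counting surface forms splits five occurrences of one word over four entries, which weakens any frequency-based statistic; counting lemmas brings them back together.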
Areas in which research has been conducted to develop linguistic techniques:
- Analysis of the morpho-syntactic structures of Basque terms and specification of the main (most productive) models (N, NN, NApos, AprepN, NNApos, AprepNApos, etc.).
- A grammar and transducer to automatically mark the phrases that match the chosen morpho-syntactic models (extraction of the ‘long’ candidate terms).
- Analysis of the ‘nested’ terms found within the long candidate terms, in order to extract new candidates from them.
- Examination of term variants in Basque: identification of the variant types and techniques to automatically process some of them (linking the variants with each other).
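As a rough illustration of the pattern-matching and nested-term steps, the following Python sketch matches tag sequences against the models listed above. It is a plain sequence match, not the grammar and transducer developed in the project; the tag names (N, Apos, Aprep, V) are borrowed from the model labels, and the example tokens are tagged by hand as an assumption.

```python
# Morpho-syntactic models taken from the list above, written as tag tuples.
PATTERNS = {
    ("N",), ("N", "N"), ("N", "Apos"), ("Aprep", "N"),
    ("N", "N", "Apos"), ("Aprep", "N", "Apos"),
}
MAX_LEN = max(len(p) for p in PATTERNS)


def tags_of(span):
    """Return the tag sequence of a list of (lemma, tag) pairs."""
    return tuple(tag for _, tag in span)


def longest_candidates(tagged):
    """Scan left to right, keeping the longest pattern match at each position."""
    out, i = [], 0
    while i < len(tagged):
        match = None
        for n in range(min(MAX_LEN, len(tagged) - i), 0, -1):
            window = tagged[i:i + n]
            if tags_of(window) in PATTERNS:
                match = window
                break
        if match:
            out.append(match)
            i += len(match)
        else:
            i += 1
    return out


def nested_candidates(candidate):
    """Return shorter contiguous sub-sequences that also match a pattern."""
    found = []
    for length in range(1, len(candidate)):
        for start in range(len(candidate) - length + 1):
            sub = candidate[start:start + length]
            if tags_of(sub) in PATTERNS:
                found.append(sub)
    return found


def text(span):
    return " ".join(lemma for lemma, _ in span)


# Hand-tagged toy input (tags are an assumption, not tagger output).
tagged = [("sistema", "N"), ("eragile", "N"), ("berri", "Apos"), ("izan", "V")]
long_terms = [text(c) for c in longest_candidates(tagged)]
nested = [text(s) for s in nested_candidates(longest_candidates(tagged)[0])]
```

Here the long candidate is the full NNApos sequence, and the nested candidates are the shorter N, NN and NApos sequences contained in it. Whether each nested sequence is a genuine term is left to the statistical ranking step.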
Areas in which research has been conducted to develop statistical techniques:
- Measures to estimate the termhood of single-word candidates.
- Measures to estimate the termhood of multi-word candidates. Statistical association measures (AM): mutual information (MI, MI3), likelihood ratio (LR), chi-square (χ²), t-score, Dice coefficient, etc.
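The association measures named above can be sketched for a two-word candidate using the usual textbook definitions over a 2×2 contingency table. This is a minimal sketch, not the project's statistical module: the function name, its parameters and the counts in the usage example are all invented for illustration.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Textbook association measures for a word pair (x, y).

    f_xy: frequency of the pair, f_x / f_y: frequencies of x as first
    word and y as second word, n: total number of pairs in the corpus.
    """
    o22 = n - f_x - f_y + f_xy
    if not (0 < f_xy <= min(f_x, f_y) and max(f_x, f_y) < n and o22 >= 0):
        raise ValueError("inconsistent counts")
    # Observed cells of the 2x2 table and their row/column totals.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, o22]
    rows = [f_x, f_x, n - f_x, n - f_x]
    cols = [f_y, n - f_y, f_y, n - f_y]
    # Expected cell counts under independence of x and y.
    expected = [r * c / n for r, c in zip(rows, cols)]
    e11 = expected[0]
    return {
        "mi": math.log2(f_xy / e11),
        "mi3": math.log2(f_xy ** 3 / e11),
        "t_score": (f_xy - e11) / math.sqrt(f_xy),
        "dice": 2 * f_xy / (f_x + f_y),
        "chi2": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
        # Log-likelihood ratio; empty cells contribute nothing.
        "llr": 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0),
    }


# Invented counts (f_xy, f_x, f_y) for two candidates in a corpus of n pairs.
n = 1000
counts = {"sistema eragile": (10, 20, 30), "etxe berri": (10, 100, 100)}
ranked = sorted(counts,
                key=lambda c: association_measures(*counts[c], n)["llr"],
                reverse=True)
```

With these invented counts, the second pair occurs exactly as often as independence would predict, so its log-likelihood ratio is zero and it is ranked below the first pair.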
Erauzterm tool
Erauzterm’s modules:
- Automatic term extraction
  - Linguistic module: tags the corpus and extracts candidate terms → Candidate Term Database
  - Statistical module: calculates statistical measures for the candidate terms → Candidate Term Ranking
- Corpus browser
  - Options for ordering the candidate list (alphabetically, by frequency, by association measure)
  - Information on each candidate: canonical form, morpho-syntactic model, frequency, selected association measure (LLR, log-likelihood ratio, in the illustration)
  - Display of candidate contexts: concordances (KWIC) and broader context
  - Display of candidate variants (T: ortho-typographical; M: morpho-syntactic; S: syntactic)
  - Field specification
  - Term assessment
  - Term export
User interface
[Screenshot of the Erauzterm user interface]
Funding
- Funded under the 2002 Saiotek grant call of the Department for Industry, Tourism and Trade of the Basque Autonomous Community Government
- Funded under the 2003 Saiotek grant call of the Department for Industry, Tourism and Trade of the Basque Autonomous Community Government
- Funded under the 2003 Network Agents’ grant call of the Provincial Council of Gipuzkoa
Dissemination
- In the media
- Articles
- Gurrutxaga A., Saralegi X., Ugartetxea S., Alegria I. 2005. "Erauzterm: euskarazko terminoak erauzteko tresna erdiautomatikoa". Mendebalde Kultur Alkartea, IX. Jardunaldiak: Euskera zientifiko-teknikoa. Bilbao.
- Alegria I., Gurrutxaga A., Lizaso P., Saralegi X., Ugartetxea S., Urizar R. 2004. "Linguistic and Statistical Approaches to Basque Term Extraction". GLAT 2004: The production of specialized texts. Barcelona.
- Alegria I., Gurrutxaga A., Lizaso P., Saralegi X., Ugartetxea S., Urizar R. 2004. "An XML-Based Term Extraction Tool for Basque". 4th International Conference on Language Resources and Evaluation (LREC). Lisbon.