Elhuyar Fundazioa - Language Services 

R&D

Corpus tools

Aims and general description

Elhuyar launched the Corpusgile project to develop a tool that supports the building of the Zientzia eta Teknologiaren Corpusa (Science and Technology Corpus) and of any other corpora it may build in the future. The specific aims of the project are as follows:

  • To offer methodology and technology for organising corpus building in a systematic way.
  • To ensure that teams involved in corpus building will be using the same working model and methodology at all times.
  • To offer an agreed and tested methodology that can later be used to build a general reference corpus, which is the long-term goal of this corpus-building work.
  • To bring corpus building in line with today’s international standards.

The reasons for developing this tool are as follows:

  • In the field of language technologies, there is a great need for corpora, and it is very important that the corpora being built are reusable.
  • Corpus building is a complex process that requires the integration of many resources and tools. Control is essential both over the steps that make up the process and over the flow of information and documents between those resources and tools.
  • Because corpus building requires large amounts of funding, processes that minimise cost are of great interest.
  • The few corpus building tools available on the market do not include the tools and resources needed to process Basque automatically, and are therefore unsuitable for building corpora of Basque texts.

Corpusgile consists of a number of modules:

  • Text collecting module: provides a system for selecting, gathering and storing the texts to be included in the corpus.
  • Structure tagging module: converts each text sample delivered by the collecting module from its original format into the format required by the structural tagging scheme that has been chosen. The tagged structure is then enriched using an automatic or semi-automatic tool (sampling, typographical errors, standardisation, meaning tags, etc.).
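
The hand-off between the two modules can be pictured with a minimal Python sketch. It is an illustration only, not Corpusgile's actual implementation: the names TextSample, collect_texts, tag_structure and build_corpus, the word-count selection criterion, and the text/body/p element names are all hypothetical and do not come from the project.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TextSample:
    """A text selected for the corpus, still in its original plain-text form."""
    source_id: str
    raw_text: str


def collect_texts(candidates, min_words=5):
    """Text collecting step: select and store candidate texts.

    `candidates` maps a source identifier to raw text. The word-count
    threshold is an invented selection criterion, used only for illustration.
    """
    return [
        TextSample(source_id, raw)
        for source_id, raw in candidates.items()
        if len(raw.split()) >= min_words
    ]


def tag_structure(sample):
    """Structure tagging step: turn raw text into structural markup.

    Paragraphs are split on blank lines; the element names are hypothetical.
    """
    root = ET.Element("text", attrib={"id": sample.source_id})
    body = ET.SubElement(root, "body")
    for block in sample.raw_text.split("\n\n"):
        block = " ".join(block.split())  # normalise internal whitespace
        if block:
            ET.SubElement(body, "p").text = block
    return root


def build_corpus(candidates):
    """Run both steps in order: collect first, then tag each collected sample."""
    return [tag_structure(sample) for sample in collect_texts(candidates)]
```

Calling build_corpus with a dictionary of raw texts returns one XML element per accepted text, which can be serialised with ET.tostring(element, encoding="unicode"). The enrichment step mentioned above (error handling, standardisation, meaning tags) is deliberately left out of this sketch.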

 

Dissemination

  • N. Areta, A. Gurrutxaga, I. Leturia, Z. Polin, R. Saiz, I. Alegria, X. Artola, A. Diaz de Ilarraza, N. Ezeiza, A. Sologaistoa, A. Soroa, A. Valverde. 2006. "Structure, Annotation and Tools in the Basque ZT Corpus". International Conference on Language Resources and Evaluation (LREC 2006). Genoa.

Copyright © 2007 Elhuyar Fundazioa
