Elhuyar Fundazioa - Language Services 

ZT Corpusa 

Aims and general description

The Science and Technology Corpus (Zientzia eta Teknologiaren Corpusa, or ZT corpusa) is a tagged and structured collection of texts in Basque from the fields of science and technology, and its main aim is to be a resource for conducting research into the use of Basque in these areas. It is a specific or specialised corpus and was built by the IXA Group of the UPV/EHU University of the Basque Country in collaboration with the Elhuyar Foundation.

The version presented on ztcorpusa.net, which can be consulted there, is the first result of a broad-based project. Detailed information on the corpus and on the corpus-building methodology designed in this project is available in the section entitled ZT corpusaren diseinua eta metodologia (Design and methodology of the ZT Corpus).

  • As regards its composition, the Corpus draws on works in the field of science and technology published between 1990 and 2002.
  • The Corpus is divided into sections according to field (sphere of knowledge) and genre (text type).
  • The Corpus is tagged both for text structure and format and linguistically. The linguistic tagging was carried out with advanced technology for the automatic processing of Basque (the Eustagger tagger of the IXA Group).
  • The first version of the Corpus contains 7.6 million words, of which 1.6 million have been manually revised, disambiguated and corrected. The second version has 8.5 million words, of which 1.9 million have been processed manually.
  • We have developed the Corpusgile tool to perform and manage the corpus-building work; this tool incorporates the steps for building a corpus, and works in conjunction with the Eustagger and Eulia tools designed to conduct the linguistic tagging and developed by the IXA Group.

The version presented on the Internet is part of a broad-based project; this corpus project comprises two parts:

  • A balanced section designed to be representative of Basque-language text production in the field of science and technology.
  • An open section made up of works or parts of works that are added to the corpus as they become available.

We have set up a powerful interface for consulting the corpus. It lets the user run a whole range of simple and compound searches using a broad set of parameters: lemma, text form, category, field, genre, corpus section (manually corrected/full corpus, etc.).
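The difference between a simple and a compound search can be sketched as follows. This is only an illustration under stated assumptions: the `Token` record, the `search` function, and the sample field and genre values are hypothetical and do not reflect the actual ztcorpusa.net interface or the Eustagger output format. A simple search sets one parameter; a compound search sets several, and all of them must match.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical record for one tagged corpus token (not the real format)."""
    form: str      # text form as it appears in the source
    lemma: str     # dictionary form
    category: str  # grammatical category
    field: str     # sphere of knowledge
    genre: str     # text type
    revised: bool  # True if in the manually corrected section


def search(
    tokens: Iterable[Token],
    *,
    lemma: Optional[str] = None,
    form: Optional[str] = None,
    category: Optional[str] = None,
    field: Optional[str] = None,
    genre: Optional[str] = None,
    revised_only: bool = False,
) -> List[Token]:
    """Return the tokens that match every parameter that was supplied."""
    criteria = {
        "lemma": lemma,
        "form": form,
        "category": category,
        "field": field,
        "genre": genre,
    }
    # Parameters left as None are not used as filters.
    active = {name: value for name, value in criteria.items() if value is not None}
    return [
        tok
        for tok in tokens
        if all(getattr(tok, name) == value for name, value in active.items())
        and (tok.revised or not revised_only)
    ]


# Illustrative data only; the field and genre labels are invented.
SAMPLE = [
    Token("zelulak", "zelula", "noun", "biology", "textbook", True),
    Token("zelularen", "zelula", "noun", "biology", "article", False),
    Token("energia", "energia", "noun", "physics", "article", True),
]

simple = search(SAMPLE, lemma="zelula")                     # one parameter
compound = search(SAMPLE, lemma="zelula", genre="article")  # several parameters
corrected = search(SAMPLE, lemma="zelula", revised_only=True)
```

Keeping every parameter optional and combining the supplied ones with a logical AND is one simple way to let the same function serve both kinds of search.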

Participants:

The Science and Technology Corpus is a joint project run by the IXA Group of the UPV/EHU University of the Basque Country and the Elhuyar Foundation.

Funding:
  • It was started as part of the Hizking21 strategic research project. The Hizking21 project has received the following assistance: the Etortek programme (2002-2004) of the Department for Industry of the Basque Autonomous Community Government; and the Science, Technology and Innovation Network Programme for Gipuzkoa (2004) of the Provincial Council of Gipuzkoa.
  • It has received assistance from the Basque and New Technologies Programme for 2005 of the Department for Culture of the Basque Autonomous Community Government.
  • It has also received assistance from the IKT (CIT) Programme for 2006 of the Department for Culture of the Basque Autonomous Community Government.
  • In addition, it has received assistance from the IKT (CIT) Programme for 2007 of the Department for Culture of the Basque Autonomous Community Government.

Dissemination:

Three different means have been set up to make the ZT Corpus available over the Internet:

  • Free on-line consultation: this website has been set up for this express purpose.
  • Freely available for research through an agreement: institutions and research teams can avail themselves of it free of charge through an agreement and for research purposes alone.
  • Distribution by means of a licence for commercial use: we are keen to offer the option of using the corpus data in dictionary compilation, language technology applications, etc.; since 2007, the corpus has formed part of the resources of ELDA, which handles its distribution.

These last two options are ground-breaking in the field of Basque corpora, because the corpora built so far offer no means of use beyond individual on-line queries.


Copyright © 2007 Elhuyar Fundazioa