ZT Corpusa
Aims and general description
The Science and Technology Corpus (Zientzia eta Teknologiaren Corpusa, or ZT corpusa) is a tagged and structured collection of texts in Basque from the fields of science and technology, and its main aim is to be a resource for conducting research into the use of Basque in these areas. It is a specific or specialised corpus and was built by the IXA Group of the UPV/EHU University of the Basque Country in collaboration with the Elhuyar Foundation.
The version presented on ztcorpusa.net, which can be consulted there, is the first result of a broad-based project. Detailed information on the corpus and on the corpus-building methodology designed in this project is available in the section entitled ZT corpusaren diseinua eta metodologia (Design and methodology of the ZT Corpus).
- As regards its composition, the Corpus is made up of works in the field of science and technology published between 1990 and 2002.
- The Corpus is divided into sections according to field (sphere of knowledge) and genre (text type).
- The Corpus is tagged both structurally (text structure and format) and linguistically. The linguistic tagging was carried out with advanced technology for the automatic processing of Basque (the Eustagger tagger of the IXA Group).
- The first version of the Corpus contains 7.6 million words, of which 1.6 million have been manually revised, disambiguated and corrected. The second version has 8.5 million words, of which 1.9 million have been processed manually.
- We have developed the Corpusgile tool to perform and manage the corpus-building work; this tool incorporates the steps for building a corpus and works in conjunction with Eustagger and Eulia, the linguistic tagging tools developed by the IXA Group.
The version presented on the Internet is part of a broad-based project; this corpus project comprises two parts:
- A balanced section designed to be representative of Basque-language text production in the field of science and technology.
- An open section made up of works or parts of works that are added to the corpus depending on availability.
We have set up a powerful interface for consulting the corpus. It lets the user run a whole range of simple and compound searches using a broad set of parameters: lemma, text form, category, field, genre, corpus section (manually corrected/full corpus, etc.).
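As an illustration only, the idea of a compound search over these parameters can be sketched as below. This is not the code behind the ztcorpusa.net interface: the `Token` record, the `search` function, the category and field labels and the three sample tokens are all hypothetical, chosen just to show how several criteria might be combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical record for one tagged corpus token (not the real schema)."""
    form: str                 # text form as it appears in the work
    lemma: str                # dictionary form
    category: str             # grammatical category
    field: str                # sphere of knowledge
    genre: str                # text type
    manually_corrected: bool  # True if in the manually revised section


def search(tokens, *, lemma=None, form=None, category=None,
           field=None, genre=None, corrected_only=False):
    """Return the tokens that satisfy every criterion that was supplied."""
    criteria = {"lemma": lemma, "form": form, "category": category,
                "field": field, "genre": genre}
    # Criteria left as None are ignored, so a simple search is just
    # a compound search with a single active criterion.
    active = {name: value for name, value in criteria.items()
              if value is not None}
    return [
        t for t in tokens
        if (t.manually_corrected or not corrected_only)
        and all(getattr(t, name) == value for name, value in active.items())
    ]


# Invented sample data, for demonstration only.
SAMPLE = [
    Token("zelulak", "zelula", "noun", "life sciences", "textbook", True),
    Token("zelularen", "zelula", "noun", "life sciences", "article", False),
    Token("energia", "energia", "noun", "physics", "article", True),
]

# Compound search: a lemma, restricted to the manually corrected section.
hits = search(SAMPLE, lemma="zelula", corrected_only=True)
print([t.form for t in hits])
```

Searching by lemma rather than by text form matters for Basque in particular, since a single lemma appears in many inflected forms.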
Participants:
The Science and Technology Corpus is a joint project run by the IXA Group of the UPV/EHU-University of the Basque Country and the Elhuyar Foundation.
Funding:
- It was started as part of the Hizking21 strategic research project. The Hizking21 project has received the following assistance: the Etortek programme (2002-2004) of the Department for Industry of the Basque Autonomous Community Government; and the Science, Technology and Innovation Network Programme for Gipuzkoa (2004) of the Provincial Council of Gipuzkoa.
- It has received assistance from the Basque and New Technologies Programme for 2005 of the Department for Culture of the Basque Autonomous Community Government.
- It has also received assistance from the IKT (CIT) Programme for 2006 of the Department for Culture of the Basque Autonomous Community Government.
- In addition, it has received assistance from the IKT (CIT) Programme for 2007 of the Department for Culture of the Basque Autonomous Community Government.
Distribution:
Three different means have been set up to make the ZT Corpus available over the Internet:
- Free on-line consultation: this website has been set up for this express purpose.
- Freely available for research through an agreement: institutions and research teams can avail themselves of it free of charge through an agreement and for research purposes alone.
- Distribution by means of a licence for commercial use: we are keen to offer the option of using the corpus data in dictionary compilation, technological language applications, etc.; since 2007, the corpus has formed part of the resources of the ELDA, and it is they who handle its distribution.
These last two options are ground-breaking in the field of Basque corpora, because the corpora built so far have offered no options of use beyond isolated consultations.
Publications:
- Areta N., Gurrutxaga A., Leturia I., Alegria I., Artola X., Díaz de Ilarraza A., Ezeiza N., Sologaistoa A. "ZT Corpus: Annotation and tools for Basque corpora". Corpus Linguistics 2007. Birmingham.
- N. Areta, A. Gurrutxaga, I. Leturia, Z. Polin, R. Saiz, I. Alegria, X. Artola, A. Diaz de Ilarraza, N. Ezeiza, A. Sologaistoa, A. Soroa, A. Valverde. 2006. "Structure, Annotation and Tools in the Basque ZT Corpus". International Conference on Language Resources and Evaluation (LREC 2006). Genoa.
- N. Areta, A. Gurrutxaga, I. Leturia, Z. Polin, R. Saiz, I. Alegria, X. Artola, A. Diaz de Ilarraza, N. Ezeiza, A. Sologaistoa, A. Soroa, A. Valverde. 2005. "Zientzia eta teknologiaren corpusa. Diseinua eta metodologia". EHU/UPV. Bilbao.
- Alegria I., Artola X., Díaz de Ilarraza A., Ezeiza N., Sologaistoa A., Soroa A., Valverde A., N. Areta, A. Gurrutxaga, I. Leturia, R. Saiz. 2005. "Zientzia eta teknologiaren corpusa". Euskera zientifiko-teknikoa: Normalizaziotik homologazinora. Mendebalde Kultura Alkartea. Bilbao.
News in the media: