Sharing Corpus Resources in Language Learning

Angela Chambers (University of Limerick, Ireland) and Martin Wynne (University of Oxford, UK)
DOI: 10.4018/978-1-59904-895-6.ch025
Since the early 1990s, researchers have been investigating the effectiveness of corpora as a resource in language learning, mostly creating their own small corpora. As it is neither feasible nor desirable to envisage a future in which all teachers create their own corpora, and as the content of language courses is similar in many universities throughout the world, the sharing of resources is clearly necessary if corpus data are to be made available to language teachers and learners on a large scale. Taking one small corpus as an example, this chapter aims to investigate the issues arising if corpus consultation is to become an integral part of the language-learning environment. The chapter firstly deals with fundamental questions concerning the creation and reusability of corpora, namely planning, construction, documentation, and also legal, moral and technical issues. It then explores the issues arising from the use of a corpus of familiar texts, in this case a French journalistic corpus, with advanced learners. In conclusion we propose a framework for the optimal use of corpora with language learners in the context of higher education.

Key Terms in this Chapter

Annotation: (see Markup)

Collocation: The tendency of certain words to occur more frequently in the vicinity of particular words in texts. For example, ‘rancid’ tends to occur with ‘butter.’
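
As a rough illustration of this definition, the sketch below counts which words appear within a few tokens of a chosen node word. The function name `collocates`, the `window` parameter and the simple tokenisation are assumptions made for the example, not part of the chapter. Raw co-occurrence counts are only a starting point: the window ignores sentence boundaries, and the counts are not a statistical measure of association.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    # Naive tokenisation: lower-case the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]       # words before the node
            right = tokens[i + 1:i + 1 + window]      # words after the node
            counts.update(left + right)
    return counts

sample = "rancid butter and rancid butter"
print(collocates(sample, "rancid", window=1).most_common())
```

On a real corpus, the most frequent neighbours will usually be function words, so some filtering or an association measure would normally be applied afterwards.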

Corpus: A collection of naturally occurring data collected for the purpose of a linguistic investigation. A corpus may include materials representing various modes, registers and text types, and it may be possible to isolate these subsets of data, and analyze them separately or contrast them. Such a subdivision of a corpus is known as a subcorpus. A parallel corpus contains texts and translations of those texts, and is compiled in order to analyze and study translations.

Markup: Markup, in the form of tags in a text, is used to add information about the structure of a text and about its linguistic properties. It may indicate structural features such as titles and headings, paragraph boundaries and highlighted text, as well as linguistic features such as lemmas and word classes. Linguistic information which has been added to a corpus in the form of tags is often known as annotation.

Concordance: A list of the occurrences of a word (or other search term), presented one per line along with the immediate surrounding text, in order to display for the analyst a set of examples of the usage of a word, and to enable patterns of usage surrounding the word to be observed. Concordances may be produced by a piece of software known as a concordancer.
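
The one-occurrence-per-line display described above can be sketched in a few lines of Python. This is a minimal illustration, not the concordancer used in the chapter; the function name `concordance`, the `width` parameter and the bracketed display of the search term are assumptions made for the example.

```python
import re

def concordance(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, with `width` characters of context."""
    # Collapse line breaks and repeated spaces so each result fits on one line.
    text = " ".join(text.split())
    # Whole-word, case-insensitive match; re.escape guards against special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

sample = "La réforme est contestée. Cette réforme divise les syndicats."
for line in concordance(sample, "réforme", width=20):
    print(line)
```

Aligning the search term in a central column is what allows recurring patterns to the left and right of the word to be seen at a glance.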

Text Encoding: Text may be captured in electronic form in various ways. Electronic texts are stored in the form of binary data, and will make use of some form of mapping from the binary codes to characters in the language. In the past, various competing standards have existed, with different mappings for different languages and on different computer systems. There is now an international standard, Unicode, which aims to represent all characters in all languages, and be usable on all computer systems. Not all corpora use Unicode, and not all software applications currently make use of it, so difficulties may arise when attempting to share language data.
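
The difficulty mentioned here can be made concrete with a short Python sketch. It shows the same French word stored under two different mappings, what happens when the bytes are read with the wrong one, and a conversion to UTF-8 (one of the standard encodings of Unicode). The helper name `reencode_to_utf8` is an assumption made for the example, and the source encoding of a legacy corpus file has to be known or established separately; it cannot be reliably read off the bytes themselves.

```python
def reencode_to_utf8(raw, source_encoding):
    """Convert bytes stored in a known legacy encoding into UTF-8 bytes."""
    return raw.decode(source_encoding).encode("utf-8")

word = "réforme"
utf8_bytes = word.encode("utf-8")      # 'é' is stored as two bytes
latin1_bytes = word.encode("latin-1")  # 'é' is stored as one byte

# Reading UTF-8 bytes with the wrong mapping silently corrupts the text.
print(utf8_bytes.decode("latin-1"))    # rÃ©forme

# Reading Latin-1 bytes as UTF-8 fails outright for this word.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("cannot decode:", error.reason)

# Converting with the correct source encoding yields identical UTF-8 data.
print(reencode_to_utf8(latin1_bytes, "latin-1") == utf8_bytes)  # True
```

Recording the encoding in a corpus's documentation is therefore a simple way of avoiding this kind of problem when the corpus is shared.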

Archive: A repository where materials which are considered to be of potential future value are deposited in a secure environment, where their ongoing viability may be monitored. In the case of electronic resources, such as language corpora, a digital archive is required. Digital archives need to ensure the physical security of the data, which may be held on a variety of media such as magnetic tape, removable disks and computer disk drives, and need to provide robust backup and disaster recovery facilities. Curation of the data must also ensure that it is stored in formats which are usable with current software.

Metadata: In corpus linguistics, the information about a corpus and about the constituent texts is known as metadata. Metadata will typically include information about when and by whom a corpus was created, the sampling strategy which was applied to compile the corpus, and information about the texts in the corpus, such as title, author and date of publication. Metadata may be in separate documentation files, or may be inserted in the corpus text files in the form of headers.

