This last observation is less surprising when we consider that text and record structures are the primary domains for the two subfields of computer science that focus on data management, namely text retrieval and databases.

TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.

It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.

Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.

TIMIT illustrates several key features of corpus design.

It could also be a phrasal lexicon, where the key field is a phrase rather than a single word.
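To make the record view concrete, here is a minimal sketch of a lexicon whose key field may be a single word or a phrase. The field names (`headword`, `pos`, `topic`) and the three entries are hypothetical, chosen only for illustration. The sketch also builds a secondary index over a non-key field, which is the kind of lookup a topic-organized resource relies on.

```python
from collections import defaultdict

# Hypothetical records: "headword" is the key field; "pos" and "topic"
# are non-key fields. A headword may be a word or a phrase.
LEXICON = [
    {"headword": "walk", "pos": "V", "topic": "motion"},
    {"headword": "give up", "pos": "V", "topic": "cessation"},
    {"headword": "stroll", "pos": "V", "topic": "motion"},
]

def build_key_index(records):
    """Map each headword (word or phrase) to its full record."""
    return {rec["headword"]: rec for rec in records}

def build_topic_index(records):
    """Map each topic (a non-key field) to the headwords filed under it."""
    index = defaultdict(list)
    for rec in records:
        index[rec["topic"]].append(rec["headword"])
    return dict(index)

by_headword = build_key_index(LEXICON)
by_topic = build_topic_index(LEXICON)

print(by_headword["give up"]["pos"])  # lookup via the key field
print(by_topic["motion"])             # lookup via a non-key field
```

The same list of records supports both access paths; only the index differs.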

A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics.

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.

The goal of this chapter is to answer the following questions:

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus.

Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.

Therefore, many of the computational methods described in this book are applicable.

It may come with annotations such as part-of-speech tags, morphological analysis, discourse structure, and so forth.
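Returning to the point that TIMIT's transcriptions are just text, the following sketch parses word-alignment lines using ordinary string processing. It assumes the three-column "start-sample end-sample word" layout of TIMIT's word files and a 16 kHz sampling rate; the sample values and the function name `parse_word_times` are illustrative rather than taken from any particular file or library.

```python
SAMPLE_RATE = 16000  # assumption: audio sampled at 16 kHz

# Illustrative alignment text in "start end word" layout (offsets in samples).
WRD_TEXT = """\
7812 10610 she
10610 14496 had
14496 15791 your
15791 20720 dark
"""

def parse_word_times(text, sample_rate=SAMPLE_RATE):
    """Return (word, start_seconds, end_seconds) for each non-blank line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, word = line.split()
        rows.append((word, int(start) / sample_rate, int(end) / sample_rate))
    return rows

word_times = parse_word_times(WRD_TEXT)

# Find the word with the longest duration, as with any tabular text data.
longest = max(word_times, key=lambda row: row[2] - row[1])

for word, start, end in word_times:
    print(f"{word}\t{start:.3f}\t{end:.3f}")
print("longest:", longest[0])
```

Nothing here is specific to speech: once the offsets are read as numbers, the data can be filtered, sorted, and counted like any other text-derived table.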

