What are some useful text corpora and lexical resources, and how can we access them with Python? Which Python constructs are most helpful for this work? How do we avoid repeating ourselves when writing Python code?
This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically.

1 Accessing Text Corpora

As just mentioned, a text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in Chapter 1. However, since we want to be able to work with other texts, this section examines a variety of text corpora. We'll see how to select individual texts, and how to work with them.
We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus. Now that you have started examining data from nltk.corpus, you can compute simple statistics over whole texts. Average word length appears to be a general property of English; by contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.

2 Web and Chat Text

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes, for example, the script of Monty Python and the Holy Grail, which opens: SCENE 1: KING ARTHUR: Whoa there!
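The two author-characteristic statistics mentioned above can be computed with plain Python. A minimal sketch, run on a small made-up sample string rather than an NLTK corpus, with deliberately crude tokenization:

```python
# Sketch: average sentence length and lexical diversity, computed on a
# hypothetical inline sample (not an NLTK corpus) with crude tokenization.

sample = ("I called him. He did not answer. "
          "I called again and he answered at last.")

# Split sentences on '.', words on whitespace; lowercase the words.
sentences = [s.split() for s in sample.split(".") if s.strip()]
words = [w.lower() for sent in sentences for w in sent]

# Average sentence length: words per sentence.
avg_sentence_len = len(words) / len(sentences)

# Lexical diversity: how many times each vocabulary item is used on
# average (tokens per distinct word type).
lexical_diversity = len(words) / len(set(words))

print(avg_sentence_len)
print(lexical_diversity)
```

With a real corpus reader such as nltk.corpus.gutenberg, the same ratios would be computed over its words() and sents() lists instead of this toy tokenizer.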
3 Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in 2, where we also unpick the following code line by line.
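The two steps above (counts for one genre, then counts across genres) can be sketched without NLTK, using a dictionary of Counters as a stand-in for nltk.ConditionalFreqDist. The tiny two-genre "corpus" below is made up for illustration; the real Brown Corpus would be accessed via nltk.corpus.brown:

```python
from collections import Counter

# Hypothetical stand-in for brown.words(categories=...): a tiny
# hand-made "corpus" keyed by genre.
corpus = {
    "news": "the jury said it can and will act".split(),
    "romance": "she said she could and would and might".split(),
}

modals = ["can", "could", "may", "might", "will", "would"]

# Step 1 for each genre: a frequency distribution of the modals.
# Collecting one Counter per condition (genre) mirrors what
# nltk.ConditionalFreqDist builds from (genre, word) pairs.
cfd = {genre: Counter(w for w in words if w in modals)
       for genre, words in corpus.items()}

# A small tabulation, loosely analogous to cfd.tabulate().
for genre, counts in sorted(cfd.items()):
    print(genre, [(m, counts[m]) for m in modals])
```

As in the real conditional frequency distribution, a modal absent from a genre simply counts as zero, so the per-genre rows line up into a comparable table.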
For the moment, you can ignore the details and just concentrate on the output. The idea that word counts might distinguish genres will be taken up again in chap-data-intensive.

4 Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; this split is for training and testing algorithms that automatically detect the topic of a document, as we will see in chap-data-intensive. Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.
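Because categories overlap, the two queries just described are a many-to-many mapping and its inverse. A stdlib sketch with made-up document identifiers and topics (the real corpus is queried with nltk.corpus.reuters.categories() and reuters.fileids()):

```python
from collections import defaultdict

# Hypothetical document -> topics mapping; the ids and topics below are
# made up, though real Reuters ids do look like 'training/9865'.
doc_topics = {
    "training/1": ["grain", "wheat"],
    "training/2": ["trade"],
    "test/3": ["grain", "corn"],
}

# Invert the mapping to answer the opposite question:
# which documents belong to a given category?
topic_docs = defaultdict(list)
for doc, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].append(doc)

# Topics covered by one document (like reuters.categories(fileid)):
print(doc_topics["training/1"])
# Documents included in one category (like reuters.fileids(category)):
print(sorted(topic_docs["grain"]))
```

A document listed under several topics appears once in each topic's list, which is exactly the overlap that distinguishes Reuters from the Brown Corpus's disjoint genres.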