Constructing corpora from plain text files

We’ll start in the ideal situation: you have a folder of nicely titled text files (in a single text encoding, ideally UTF-8) and you would like to make them into a corpus for processing.

One such folder is texts/uk-election-manifestos, a collection of party platforms for the main UK political parties in the post-war period.

The most useful tool here is readtext from the package of the same name:

library(tidyverse)

library(quanteda) # for working with corpora
Package version: 3.0.0
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 16 of 16 threads used.
See https://quanteda.io for tutorials and examples.
library(quanteda.textplots) # for plotting 'keyness'
library(readtext) # for getting documents and their info into a data frame 
manifestos <- readtext("texts/uk-election-manifestos/")
head(manifestos)
readtext object consisting of 6 documents and 0 docvars.
# Description: df [6 × 2]
  doc_id                        text               
  <chr>                         <chr>              
1 UK_natl_2010_en_BNP.txt       "\"DEMOCRACY,\"..."
2 UK_natl_2010_en_Coalition.txt "\"The Coalit\"..."
3 UK_natl_2010_en_Con.txt       "\"INVITATION\"..."
4 UK_natl_2010_en_Green.txt     "\"Green Part\"..."
5 UK_natl_2010_en_Lab.txt       "\"The Labour\"..."
6 UK_natl_2010_en_LD.txt        "\"Liberal De\"..."

There’s a lot of useful information about these manifestos encoded in their file names, so it’s worth grabbing it as we read everything in. Let’s try it again:

manifestos <- readtext("texts/uk-election-manifestos/",
                       docvarsfrom = "filenames",
                       docvarnames = c("country", "national", "year", "language", "party"))
head(manifestos)
readtext object consisting of 6 documents and 5 docvars.
# Description: df [6 × 7]
  doc_id                 text            country national  year language party  
  <chr>                  <chr>           <chr>   <chr>    <int> <chr>    <chr>  
1 UK_natl_2010_en_BNP.t… "\"DEMOCRACY,\… UK      natl      2010 en       BNP    
2 UK_natl_2010_en_Coali… "\"The Coalit\… UK      natl      2010 en       Coalit…
3 UK_natl_2010_en_Con.t… "\"INVITATION\… UK      natl      2010 en       Con    
4 UK_natl_2010_en_Green… "\"Green Part\… UK      natl      2010 en       Green  
5 UK_natl_2010_en_Lab.t… "\"The Labour\… UK      natl      2010 en       Lab    
6 UK_natl_2010_en_LD.txt "\"Liberal De\… UK      natl      2010 en       LD     

Here we’ve specified that there are non-text fields encoded in the file names (using some heuristics about what counts as a separator, which happily include our _) and provided our own names for them. Had we not provided names, they would have arrived in columns labelled docvar1 through docvar5.

manifestos is a data frame, so you can do anything to it that you’d normally do with a data frame, e.g. add or remove columns, merge in information from another source, and extract subsets. Here we will do the bare minimum: remove the variable fields that are uninformative. country is always “UK”, national is always “natl”, and the language is always “en” (English).

# easiest to use dplyr here
manifestos <- select(manifestos, doc_id, text, year, party) 
manifestos
readtext object consisting of 23 documents and 2 docvars.
# Description: df [23 × 4]
  doc_id                        text                 year party    
  <chr>                         <chr>               <int> <chr>    
1 UK_natl_2010_en_BNP.txt       "\"DEMOCRACY,\"..."  2010 BNP      
2 UK_natl_2010_en_Coalition.txt "\"The Coalit\"..."  2010 Coalition
3 UK_natl_2010_en_Con.txt       "\"INVITATION\"..."  2010 Con      
4 UK_natl_2010_en_Green.txt     "\"Green Part\"..."  2010 Green    
5 UK_natl_2010_en_Lab.txt       "\"The Labour\"..."  2010 Lab      
6 UK_natl_2010_en_LD.txt        "\"Liberal De\"..."  2010 LD       
# … with 17 more rows

Aside on base R

If you prefer base R, that would have been

manifestos <- manifestos[, c("doc_id", "text", "year", "party")]

Note: you can go a long way with select, filter, group_by, and summarize even if you don’t want to get deep into the tidyverse stuff. I’ll use select a few times later, so just translate it in your head into things like the line above.

Corpus time

The next step is to get this into a corpus.

manif_corp <- corpus(manifestos) 
manif_corp
Corpus consisting of 23 documents and 2 docvars.
UK_natl_2010_en_BNP.txt :
"DEMOCRACY, FREEDOM, CULTURE AND IDENTITY.  BRITISH NATIONAL ..."

UK_natl_2010_en_Coalition.txt :
"The Coalition: our programme for government.  Freedom Fairne..."

UK_natl_2010_en_Con.txt :
"INVITATION TO JOIN THE GOVERNMENT OF BRITAIN THE CONSERVATIV..."

UK_natl_2010_en_Green.txt :
"Green Party general election manifesto 2010.  Fair is worth ..."

UK_natl_2010_en_Lab.txt :
"The Labour Party Manifesto 2010 A future fair for all labour..."

UK_natl_2010_en_LD.txt :
"Liberal Democrat Manifesto 2010 fair taxes that put money ba..."

[ reached max_ndoc ... 17 more documents ]

Corpus structure

A corpus has two important components: the texts themselves, and information about them, called ‘docvars’.

docvars

# the top few docvars
head(docvars(manif_corp))
  year     party
1 2010       BNP
2 2010 Coalition
3 2010       Con
4 2010     Green
5 2010       Lab
6 2010        LD

The docvars function returns an ordinary R data frame. To get vectors of specific docvars, add the column name as a second argument. Here are the party names of the last few manifestos in the corpus

tail(docvars(manif_corp, "party"))
[1] "Green" "Lab"   "LD"    "PCy"   "SNP"   "UKIP" 

If you want to add a document variable to a corpus, pretend the corpus is a data.frame (it isn’t) and use $, e.g. like this

manif_corp$my_new_doc_var <- vector_of_the_right_length
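For example, using the year docvar we already have (post_2010 is just an illustrative name):

manif_corp$post_2010 <- docvars(manif_corp, "year") > 2010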

Texts

To extract just the document texts from a corpus, use as.character. (In older quanteda code you may see a texts function doing this job; it has since been deprecated.)

But you should be super careful here. Typing the name of the resulting object will make R try to print it all to screen, which may take a loooong time.

txts <- as.character(manif_corp) # don't type txts !
# show the first 100 characters of the first text
substr(txts[1], 1, 100)
                                                                               UK_natl_2010_en_BNP.txt 
"DEMOCRACY, FREEDOM, CULTURE AND IDENTITY.  BRITISH NATIONAL PARTY GENERAL ELECTION MANIFESTO 2010.  " 

Honestly the raw texts aren’t so useful. We’ll usually want them as tokens.

Regrettably, working with raw texts is sometimes necessary. For this there is a pile of functions with names beginning char_. We’ll try not to need them, though.
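For the record, here’s one of that family at work:

char_tolower("The Medical Termination of Pregnancy Bill")
[1] "the medical termination of pregnancy bill"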

Corpus functions

We can start by getting a bit more information about the corpus using summary

# summary restricted to the first 10 documents
summary(manif_corp, n = 10)
Corpus consisting of 23 documents, showing 10 documents:

                          Text Types Tokens Sentences year     party
       UK_natl_2010_en_BNP.txt  5162  31997      1431 2010       BNP
 UK_natl_2010_en_Coalition.txt  2867  14692       660 2010 Coalition
       UK_natl_2010_en_Con.txt  4458  31267      1094 2010       Con
     UK_natl_2010_en_Green.txt  3856  20126       961 2010     Green
       UK_natl_2010_en_Lab.txt  4406  33178      1309 2010       Lab
        UK_natl_2010_en_LD.txt  3955  32355       846 2010        LD
       UK_natl_2010_en_PCy.txt  1819   7554       372 2010       PCy
       UK_natl_2010_en_SNP.txt  1880   9226       337 2010       SNP
      UK_natl_2010_en_UKIP.txt  2601   9262       439 2010      UKIP
       UK_natl_2015_en_Con.txt  4485  40301      1160 2015       Con

Note that, by default - that is, without specifying n above - summary will report just the first 100 documents. You can make this longer as well as shorter by setting n as above. (Sometimes it’s useful to know that the output of summary is also a data frame, in case you want to keep the contents for later.)
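For instance, to keep a summary of every document around as a data frame:

manif_summary <- summary(manif_corp, n = ndoc(manif_corp))
manif_summary$Tokens # token counts as an ordinary column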

The information in summary is available in separate functions if you need them:

ndoc(manif_corp) # document count

docnames(manif_corp) # unique document identifiers
ntype(manif_corp) # types in each document
ntoken(manif_corp) # tokens in each document
nsentence(manif_corp) # sentences in each document

Subsets of corpora

Most often we will want to subset on the basis of the docvars, e.g.  here we make another corpus containing just the three main parties in elections since 2000.

main_parties <- c("Lab", "Con", "LD")
manif_subcorp <- corpus_subset(manif_corp, 
                               year > 2000 & party %in% main_parties)
summary(manif_subcorp)
Corpus consisting of 9 documents, showing 9 documents:

                    Text Types Tokens Sentences year party
 UK_natl_2010_en_Con.txt  4458  31267      1094 2010   Con
 UK_natl_2010_en_Lab.txt  4406  33178      1309 2010   Lab
  UK_natl_2010_en_LD.txt  3955  32355       846 2010    LD
 UK_natl_2015_en_Con.txt  4485  40301      1160 2015   Con
 UK_natl_2015_en_Lab.txt  3230  20684       821 2015   Lab
  UK_natl_2015_en_LD.txt  5086  37244      1301 2015    LD
 UK_natl_2017_en_Con.txt  4227  34842      1231 2017   Con
 UK_natl_2017_en_Lab.txt  4427  28702      1074 2017   Lab
  UK_natl_2017_en_LD.txt  4244  24511       940 2017    LD

Two other useful corpus functions are corpus_sample, for when your corpus is very large and you want to work with a smaller random set of documents, and corpus_trim, which can remove sentences (or paragraphs or documents, if you prefer) that have fewer or more than a specified number of tokens in them. Note that if there are no sentences left after removing those deemed too short, the document is removed from the corpus. Usually this is what you want.
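For example (the sample size and token threshold here are arbitrary):

set.seed(123) # make the random sample reproducible
manif_sample <- corpus_sample(manif_corp, size = 10) # 10 random manifestos
manif_trimmed <- corpus_trim(manif_corp, what = "sentences", min_ntoken = 3)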

Collapsing into groups

Sometimes it’s useful to collapse together all the documents that share a docvar value. For example, here’s everything each of the three main parties said in the elections since 2000, pushed into one document each and returned as a three-document corpus:

partycorp <- corpus_group(manif_subcorp, groups = party)
summary(partycorp)
Corpus consisting of 3 documents, showing 3 documents:

 Text Types Tokens Sentences party
  Con  7828 106410      3483   Con
  Lab  7360  82564      3202   Lab
   LD  7660  94110      3085    LD

This probably doesn’t make a lot of sense for manifestos, but it’s often useful for pulling together all a speaker’s contributions over a debate. If you wanted to collapse the parties within each election year, set groups to year.
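For example:

manif_year <- corpus_group(manif_subcorp, groups = year)
summary(manif_year) # now one document per election year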

More corpus editing

We can do other things with a corpus by using the functions with names starting corpus_, e.g. sample, trim, and reshape. corpus_reshape does the opposite of corpus_group, expanding or contracting the number of documents by redefining the document unit. We’ll see an example shortly.

Looking into Bara et al.

The debate “Medical Termination of Pregnancy Bill” is on Hansard, but I’ve scraped it and concatenated each speaker’s contributions into a single document in a corpus object for you. This is certainly not the only way to think about analyzing this data, but it’s what Bara et al. did.

load("data/corpus_bara_speaker.rda")
summary(corpus_bara_speaker)
Corpus consisting of 27 documents, showing 27 documents:

                     Text Types Tokens Sentences                  speaker vote
            Dr David Owen   640   1948        89            Dr David Owen  yes
           Dr Horace King   106    197        17           Dr Horace King  abs
         Dr John Dunwoody   585   2106        85         Dr John Dunwoody  yes
    Dr Michael Winstanley    52     72         4    Dr Michael Winstanley  yes
          Hon. Sam Silkin    38     45         4          Hon. Sam Silkin  yes
        Miss Joan Vickers   732   2453       110        Miss Joan Vickers  yes
             Mr Alex Lyon    59     69         4             Mr Alex Lyon  yes
           Mr Angus Maude   717   2551        95           Mr Angus Maude  yes
       Mr Charles Pannell   105    194        13       Mr Charles Pannell  yes
           Mr David Steel  1310   5792       207           Mr David Steel  yes
          Mr Edward Lyons   419    878        42          Mr Edward Lyons  yes
        Mr John Mendelson    55     74         3        Mr John Mendelson  yes
        Mr Kevin McNamara  1068   3896       153        Mr Kevin McNamara   no
              Mr Leo Abse   760   2433        95              Mr Leo Abse  yes
 Mr Norman St John-Stevas   739   2553       124 Mr Norman St John-Stevas   no
         Mr Peter Jackson    20     22         2         Mr Peter Jackson  yes
           Mr Peter Mahon    65     99         6           Mr Peter Mahon   no
           Mr Roy Jenkins   761   2800       102           Mr Roy Jenkins  yes
           Mr Roy Roebuck    18     23         2           Mr Roy Roebuck  yes
        Mr William Deedes   543   1706        82        Mr William Deedes  abs
         Mr William Wells   827   3113       144         Mr William Wells   no
            Mrs Anne Kerr    12     12         1            Mrs Anne Kerr  yes
     Mrs Gwyneth Dunwoody    67     99         4     Mrs Gwyneth Dunwoody  yes
          Mrs Jill Knight   860   2999       128          Mrs Jill Knight   no
          Mrs Renée Short   736   2501        94          Mrs Renée Short  yes
   Sir Henry Legge-Bourke    76    113         5   Sir Henry Legge-Bourke  yes
          Sir John Hobson   734   3261       135          Sir John Hobson  abs

A quick reminder: here are all the speakers who voted ‘no’ at the end of the debate and produced more than 100 tokens during it

no_corp <- corpus_subset(corpus_bara_speaker, 
                         vote == "no" & ntoken(corpus_bara_speaker) > 100)
no_corp
Corpus consisting of 4 documents and 2 docvars.
Mr Kevin McNamara :
"                               §           Mr. Kevin McNamar..."

Mr Norman St John-Stevas :
"                               §           Mr. Norman St. Jo..."

Mr William Wells :
"                               §           Mr. William Wells..."

Mrs Jill Knight :
"                               §           Mrs. Jill Knight ..."

(Goodbye Mr Mahon)

Finally, it’s sometimes convenient to be able to switch between thinking of a corpus as a set of documents and as a set of paragraphs, or even sentences.

para_corp <- corpus_reshape(corpus_bara_speaker, 
                            to = "paragraphs") # or "sentences"
head(summary(para_corp)) # Just the top few lines
              Text Types Tokens Sentences        speaker vote
1  Dr David Owen.1   433   1000        41  Dr David Owen  yes
2  Dr David Owen.2    81    122         8  Dr David Owen  yes
3  Dr David Owen.3    67     96        10  Dr David Owen  yes
4  Dr David Owen.4   297    730        31  Dr David Owen  yes
5 Dr Horace King.1    75    118         6 Dr Horace King  abs
6 Dr Horace King.2    21     26         4 Dr Horace King  abs

Happily, we can always reverse this process by changing the to argument back to “documents”.
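For example:

doc_corp <- corpus_reshape(para_corp, to = "documents")
ndoc(doc_corp) # back to one document per speaker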

Let’s explore a little more by looking for the key terms in play. One way to do this is to look for collocations. The collocation finder operates on the tokens of the corpus, so let’s talk a bit about those first.

Tokens

We can extract tokens from the corpus directly

toks <- tokens(corpus_bara_speaker)

Structurally, a tokens object is a list of character vectors, i.e. one list element per document and one character string per ‘token’, so to get the tokens for the 10th document

toks[[10]]

and to get a shorter list containing only the first through third documents’ tokens

toks[1:3]

If we want more control over the tokenization process (pro tip: we do), e.g.  perhaps we don’t want to count punctuation and do want to remove numbers, then take a look at the help page for this function and practice adding the relevant extra parameters to the function call.
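For example, a minimal version of the tokenization we’ll use later:

toks_clean <- tokens(corpus_bara_speaker,
                     remove_punct = TRUE,
                     remove_numbers = TRUE)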

Processing tokens

There are a lot of useful dedicated tokens-processing functions in {quanteda} all beginning with tokens_:

chunk compound keep ngrams remove replace sample 
segment select split subset tolower toupper
wordstem lookup

Handily, tokens objects carry their docvars around with them too, so you can use tokens_subset to get the tokens for e.g. just the speakers who voted “yes”, just the way you’d use corpus_subset.
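For example:

yes_toks <- tokens_subset(toks, vote == "yes") # tokens for the yes voters only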

But let’s get to the collocation finding tools. We’ll need a companion package for this sort of thing

library(quanteda.textstats)

colls <- textstat_collocations(toks)
head(colls, 20)
   collocation count count_nested length   lambda        z
1        it is   199            0      2 3.529352 35.73172
2       of the   410            0      2 1.993056 31.11606
3     the bill   208            0      2 3.648663 26.55903
4       do not    70            0      2 5.127010 26.41491
5   member for    88            0      2 6.569652 24.35931
6     there is    82            0      2 3.653461 24.01364
7    should be    62            0      2 3.853810 23.34994
8    those who    37            0      2 5.387605 23.05370
9       my hon    41            0      2 4.412487 22.05236
10    has been    37            0      2 4.818490 22.03920
11    would be    58            0      2 3.562342 21.81443
12   there are    45            0      2 3.927881 21.35060
13   have been    40            0      2 4.348728 20.90335
14     i think    59            0      2 4.135198 20.60385
15  think that    64            0      2 4.332629 20.06994
16      may be    42            0      2 4.032614 19.70798
17      i have    71            0      2 2.753093 19.44622
18     that it    92            0      2 2.376724 19.36427
19 an abortion    28            0      2 4.387547 18.93842
20   right hon    28            0      2 4.574635 18.86004

This is disappointingly unsubstantive, but if we work a bit harder we can get better results. First we’ll grab some stopwords

stps <- stopwords() 
head(stps)
[1] "i"      "me"     "my"     "myself" "we"     "our"   

and do the whole thing again, but removing them all and leaving an empty space where they were

toks2 <- tokens_remove(toks, stps, padding = TRUE)

Back to processing tokens

Let’s see what we did

toks2[[1]][1:20] # first 20 tokens of document 1
 [1] "§"           "Dr"          "."           "David"       "Owen"       
 [6] "("           "Plymouth"    ","           "Sutton"      ")"          
[11] ""            "gives"       ""            "great"       "pleasure"   
[16] ""            "speak"       "immediately" ""            ""           

Now rerun the function, maintaining the capitalization

coll2 <- textstat_collocations(toks2, tolower = FALSE, size = 2)
head(coll2, 20)
            collocation count count_nested length    lambda        z
1             right hon    28            0      2  4.574635 18.86004
2    medical profession    37            0      2  8.080010 18.18837
3            human life    18            0      2  5.976166 17.78455
4     illegal abortions    17            0      2  7.641377 17.24541
5        learned Member    12            0      2  6.170088 15.80828
6           put forward     9            0      2  7.120416 14.71733
7         Royal College    15            0      2 10.199146 13.98221
8       Private Members     8            0      2  8.669238 13.62481
9        public opinion     7            0      2  7.490535 12.96645
10           many women     9            0      2  4.921211 12.78258
11            years ago     7            0      2  8.400914 12.72887
12             give way     8            0      2  5.123612 12.54151
13        mental health     6            0      2  6.248463 12.45540
14     Committee points     6            0      2  6.228301 12.29201
15 general practitioner     6            0      2  8.768819 12.28565
16             30 years     7            0      2  7.727610 12.27375
17     substantial risk     7            0      2  7.727610 12.27375
18         serious risk     6            0      2  5.980019 12.21676
19          present law    10            0      2  4.532178 12.15809
20      Catholic Church     5            0      2  7.152340 12.11036

Much better, I think.

We can also ask for three-word collocations

coll3 <- textstat_collocations(toks2, tolower = FALSE, size = 3)
head(coll3, 30)
                              collocation count count_nested length     lambda
1                     abortion law reform     3            0      3  2.9744919
2                        present case law     2            0      3  1.8848477
3                  give qualified support     2            0      3  1.7554198
4                         Friend give way     2            0      3  1.4239933
5                   position beyond doubt     2            0      3 -0.3902018
6                 points extremely fairly     2            0      3 -0.5831442
7                      Lord Silkin's Bill     2            0      3 -0.7000055
8                   Private Member's Bill     6            0      3 -1.0589782
9                give drafting assistance     2            0      3 -1.4338460
10                   healthy human beings     3            0      3 -2.1689239
11 Royal Medico-Psychological Association     2            0      3 -2.8474063
12                   involve serious risk     2            0      3 -2.4552517
13              pregnant woman's capacity     2            0      3 -3.0693678
14                     across party lines     2            0      3 -3.8476748
15                    learned Friend give     2            0      3 -3.2998966
16                 Law Reform Association     5            0      3 -4.2829594
17        registered medical practitioner     2            0      3 -3.6766671
18       Government's collective attitude     3            0      3 -5.2018835
19                termed Committee points     2            0      3 -4.2116169
20                    Abortion Law Reform     5            0      3 -5.0452333
21                     Kingston upon Hull     2            0      3 -5.5383892
22            British Medical Association     2            0      3 -4.7348107
23                National Health Service     4            0      3 -5.5822424
24                        accept Clause 1     2            0      3 -5.6965570
25                       thirty years ago     2            0      3 -6.3703986
26              potentially healthy human     2            0      3 -5.9134771
27                      Dame Joan Vickers     3            0      3 -9.0528006
28                          last 30 years     3            0      3 -4.7341412
29                  Roman Catholic Church     2            0      3 -6.2529756
30                  National Opinion Poll     2            0      3 -8.1651964
            z
1   1.5478204
2   0.8762667
3   0.6797675
4   0.6621731
5  -0.1494450
6  -0.2247319
7  -0.2389001
8  -0.3564888
9  -0.5325926
10 -0.7454156
11 -0.9734579
12 -1.0855574
13 -1.3223867
14 -1.3971378
15 -1.4417678
16 -1.4507937
17 -1.5165221
18 -1.6186247
19 -1.6234315
20 -1.6942560
21 -1.7138761
22 -1.9840171
23 -2.0489942
24 -2.2256479
25 -2.4279730
26 -2.5423975
27 -2.5827487
28 -2.6438300
29 -2.6579283
30 -2.7342655

Keywords in context

Since this is an abortion law debate, let’s see how the honourable members talk about mothers and babies. We’ll use the ‘keyword in context’ function kwic, which wants to be given a bunch of tokens, some pattern to match, and a window:

toks <- tokens(corpus_bara_speaker)
kw_mother <- kwic(toks, "mother*", window = 10)
head(kw_mother)
Keyword-in-context with 6 matches.                         
  [Dr John Dunwoody, 941]
  [Dr John Dunwoody, 986]
 [Dr John Dunwoody, 1177]
 [Dr John Dunwoody, 1204]
 [Dr John Dunwoody, 1655]
 [Dr John Dunwoody, 1702]
                                                                     
     of that survival. I am thinking particularly of the | mothers  |
                to think more of the family unit, of the |  mother  |
            In numerical terms, so far as the numbers of | mothers  |
                 c ), which lays down the grounds of the | mother's |
              As I understand it, it means capacity as a |  mother  |
 family together, who knits the various children and the |  mother  |
                                                                 
 with large families and the burdens of large families very      
 and father and the children. I take it further                  
 are concerned, they are comparatively unimportant. The important
 capacity being severely overstrained by the care of a child     
 in the fullest sense. I think it means something                
 and father together, so that the mother can play                

KWICs can get quite large, but if you want to see it all

View(kw_mother)

will open a browser with the whole thing. This object is a data frame underneath, which can be helpful for various tasks, e.g. we could use filter or equivalent to find all instances spoken by a particular participant.

In any case, one thing we learn is that there is much less talk of babies than of mothers. In this debate, the other major actors are doctors and their professional association, and a small amount of religious content. We can investigate this the same way.
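For example (the patterns here are just illustrative guesses):

kw_doctor <- kwic(toks, "doctor*", window = 10)
kw_church <- kwic(toks, "church*", window = 10)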

Working with phrases

If we want to look at phrases, e.g. the word pairs and triples the collocation finder turned up earlier, or things we’re sure are there like “medical profession” or “human life”, in their contexts, there are two approaches:

Either we can use tokens_compound to force them into a single token and look for that (fiddly), or we can use phrase inside the kwic function.

medprof <- kwic(toks, phrase("medical profession"))

You can view this one for yourself.

For reference, here’s how to take the other route

phrases <- phrase(c("medical profession", "human life")) # make a phrase
toks <- tokens_compound(toks, phrases) # make the tokens show it as _ connected

and then

kwic(toks, "medical_profession")

Since the output of kwic is simply a data frame, one thing that’s often useful is to treat the left and right sides of the KWIC as a single document, e.g. a document of everything said around mothers.
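A minimal sketch, pasting together the pre and post columns that kwic returns (mother_doc is just an illustrative name):

mother_doc <- paste(kw_mother$pre, kw_mother$post, collapse = " ")
mother_corp <- corpus(mother_doc) # a one-document corpus of contexts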

Constructing a document feature matrix

Returning to the full corpus, we will often want to construct a document term matrix. {quanteda} calls this a ‘dfm’ (document feature matrix), reflecting the fact that we will often count things other than words.

corpdfm <- dfm(toks) # lowercases by default, but not much more
dim(corpdfm)
[1]   27 4039
featnames(corpdfm)[1:40] # really just colnames
 [1] "§"            "dr"           "."            "david"        "owen"        
 [6] "("            "plymouth"     ","            "sutton"       ")"           
[11] "it"           "gives"        "me"           "great"        "pleasure"    
[16] "to"           "speak"        "immediately"  "after"        "the"         
[21] "hon"          "lady"         "member"       "for"          "devonport"   
[26] "dame"         "joan"         "vickers"      "she"          "and"         
[31] "i"            "shared"       "same"         "political"    "platform"    
[36] "at"           "an"           "inter-church" "meeting"      "declared"    
docnames(corpdfm)
 [1] "Dr David Owen"            "Dr Horace King"          
 [3] "Dr John Dunwoody"         "Dr Michael Winstanley"   
 [5] "Hon. Sam Silkin"          "Miss Joan Vickers"       
 [7] "Mr Alex Lyon"             "Mr Angus Maude"          
 [9] "Mr Charles Pannell"       "Mr David Steel"          
[11] "Mr Edward Lyons"          "Mr John Mendelson"       
[13] "Mr Kevin McNamara"        "Mr Leo Abse"             
[15] "Mr Norman St John-Stevas" "Mr Peter Jackson"        
[17] "Mr Peter Mahon"           "Mr Roy Jenkins"          
[19] "Mr Roy Roebuck"           "Mr William Deedes"       
[21] "Mr William Wells"         "Mrs Anne Kerr"           
[23] "Mrs Gwyneth Dunwoody"     "Mrs Jill Knight"         
[25] "Mrs Renée Short"          "Sir Henry Legge-Bourke"  
[27] "Sir John Hobson"         

But let’s remove some things that aren’t (currently) of interest to us

toks <- tokens(corpus_bara_speaker, ## yes, yes
               remove_punct = TRUE, 
               remove_numbers = TRUE)
toks <- tokens_remove(toks, stps) # those stopwords we saw earlier

corpdfm <- dfm(toks)
dim(corpdfm) # a bit smaller
[1]   27 3755
featnames(corpdfm)[1:40]
 [1] "dr"           "david"        "owen"         "plymouth"     "sutton"      
 [6] "gives"        "great"        "pleasure"     "speak"        "immediately" 
[11] "hon"          "lady"         "member"       "devonport"    "dame"        
[16] "joan"         "vickers"      "shared"       "political"    "platform"    
[21] "inter-church" "meeting"      "declared"     "intention"    "support"     
[26] "measure"      "abortion"     "law"          "reform"       "one"         
[31] "come"         "house"        "plea"         "given"        "government"  
[36] "time"         "obviously"    "free"         "vote"         "leaving"     

We could also stem

stoks <- tokens(corpus_bara_speaker, ## yes, yes
                remove_punct = TRUE, 
                remove_numbers = TRUE)
stoks <- tokens_wordstem(stoks)

scorpdfm <- dfm(stoks)
dim(scorpdfm) # a bit smaller
[1]   27 2765
featnames(scorpdfm)[1:40]
 [1] "dr"           "david"        "owen"         "plymouth"     "sutton"      
 [6] "it"           "give"         "me"           "great"        "pleasur"     
[11] "to"           "speak"        "immedi"       "after"        "the"         
[16] "hon"          "ladi"         "member"       "for"          "devonport"   
[21] "dame"         "joan"         "vicker"       "she"          "and"         
[26] "i"            "share"        "same"         "polit"        "platform"    
[31] "at"           "an"           "inter-church" "meet"         "declar"      
[36] "our"          "intent"       "support"      "a"            "measur"      

Be careful if you’re planning on applying a dictionary, since its entries aren’t stemmed; we’d confuse it by stemming the source material first.
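For reference, applying a dictionary to the unstemmed dfm looks like this; the categories and patterns below are made up purely for illustration:

dict <- dictionary(list(medical = c("doctor*", "medical"),
                        religion = c("church*", "catholic*")))
dfm_lookup(corpdfm, dict) # counts for each category, per speaker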

Trimming dfms

For modeling, we’ll often want to remove the low frequency and idiosyncratic words

smallcorpdfm <- dfm_trim(corpdfm, min_termfreq = 5, min_docfreq = 5)
dim(smallcorpdfm) # this might have been a bit drastic...
[1]  27 530

where min_termfreq removes any word that occurs fewer than 5 times, and min_docfreq removes any word, however frequent, that occurs in fewer than 5 different documents. That makes things a lot smaller.

Making use of the docvars

One very convenient feature of dfm, tokens, and corpus objects is that they keep our docvars squirreled away inside themselves, so we can subset in the same way as we did with the corpus object

dfm_subset(corpdfm, vote != "abs") # remove abstentions
Document-feature matrix of: 24 documents, 3,755 features (90.71% sparse) and 2 docvars.
                       features
docs                    dr david owen plymouth sutton gives great pleasure
  Dr David Owen          4     1    4        2      1     1     5        1
  Dr John Dunwoody       4     1    0        0      0     0     3        0
  Dr Michael Winstanley  1     0    0        0      0     0     0        0
  Hon. Sam Silkin        0     0    0        0      0     0     0        0
  Miss Joan Vickers      3     1    0        1      0     0     3        1
  Mr Alex Lyon           0     0    0        0      0     0     0        0
                       features
docs                    speak immediately
  Dr David Owen             2           1
  Dr John Dunwoody          0           0
  Dr Michael Winstanley     0           0
  Hon. Sam Silkin           0           0
  Miss Joan Vickers         0           0
  Mr Alex Lyon              0           0
[ reached max_ndoc ... 18 more documents, reached max_nfeat ... 3,745 more features ]

And just like the corpus objects we made earlier we can ‘group’ to collapse the dfm counts across documents (here speakers)

dfm_votes <- dfm_group(corpdfm, vote)
dfm_votes
Document-feature matrix of: 3 documents, 3,755 features (50.63% sparse) and 1 docvar.
     features
docs  dr david owen plymouth sutton gives great pleasure speak immediately
  abs  1     3    0        0      0     2    16        0     6           0
  no   2     4    0        0      0     1     7        0    12           1
  yes 16    15    6        5      7     1    22        3     8           3
[ reached max_nfeat ... 3,745 more features ]

This might be useful if we want to examine language in the light of subsequent voting. Let’s generalize this kind of comparison next.

Comparing groups of documents

If we are interested in comparing the usage of groups of speakers, we can use the textstat_frequency function. There are a lot of textstat_ functions, e.g. 

select dist simil entropy frequency keyness lexdiv readability

so we can compare in several different ways.

Here we’ll examine what sorts of words eventual yes and no voters used, removing the abstainers before we get going. We’ll use textstat_frequency to set groups on the fly, so it won’t matter whether we added any when we called dfm:

corpdfm_yesno <- dfm_subset(corpdfm, vote != "abs")

textstat_frequency(corpdfm_yesno, 
                   n = 20, groups = vote)
     feature frequency rank docfreq group
1       bill        77    1       5    no
2        hon        63    2       5    no
3   abortion        55    3       5    no
4        one        53    4       4    no
5      child        49    5       4    no
6       life        43    6       4    no
7        law        39    7       4    no
8        may        37    8       5    no
9         mr        36    9       5    no
10     right        31   10       5    no
11    member        30   11       3    no
12   medical        29   12       4    no
13    mother        29   12       4    no
14     house        26   14       4    no
15       can        26   14       4    no
16       say        25   16       4    no
17   members        24   17       4    no
18     human        23   18       4    no
19   support        21   19       4    no
20      many        21   19       4    no
21      bill       162    1      12   yes
22       hon       118    2      15   yes
23       law       100    3      10   yes
24  abortion        97    4       9   yes
25       one        86    5       9   yes
26    member        68    6      11   yes
27     think        66    7      10   yes
28     house        63    8      11   yes
29 pregnancy        61    9       9   yes
30       can        60   10       9   yes
31        mr        55   11      15   yes
32       may        55   11      10   yes
33   medical        51   13       8   yes
34      many        50   14      10   yes
35      said        49   15       9   yes
36    social        46   16      11   yes
37     child        45   17      11   yes
38    friend        44   18      11   yes
39     woman        43   19       9   yes
40     right        43   19      12   yes

As is often the case, raw counts are not so informative, so we can instead ask for terms that differ statistically between the yes and no voters. For this we’ll make use of “keyness”.

Keyness in document comparisons

Here we’ll actually need to add groups to our dfm first. We’ll also always be comparing one document to the rest and asking what makes it distinctive. That means we will need to define documents so that we get the comparison we want.

dfm_yesno <- dfm_group(corpdfm_yesno, vote)

Here’s the no voters’ distinctive vocabulary

no_terms <- textstat_keyness(dfm_yesno, "no")
head(no_terms, 25)
        feature      chi2            p n_target n_reference
1      argument 18.076446 2.122105e-05       19           7
2          baby 18.032510 2.171645e-05       14           3
3          life 16.116036 5.957741e-05       43          35
4         child 14.379226 1.494419e-04       49          45
5           bad 13.935827 1.891585e-04       13           4
6           let 13.665578 2.184217e-04       10           1
7      evidence 13.263193 2.706682e-04       16           7
8         human 11.180932 8.264226e-04       23          16
9         wells 10.698829 1.072034e-03       11           3
10       reject  9.027651 2.659260e-03        6           0
11      society  8.782368 3.041561e-03       13           7
12       rights  8.100676 4.424875e-03        7           1
13          yet  7.611673 5.799165e-03       14           9
14       demand  7.412482 6.477289e-03        9           3
15       knight  7.412482 6.477289e-03        9           3
16   birmingham  7.323926 6.804260e-03       11           6
17    principle  7.323926 6.804260e-03       11           6
18       unborn  7.318244 6.825806e-03       10           4
19        argue  7.096196 7.724774e-03        5           0
20  independent  7.096196 7.724774e-03        5           0
21 intelligence  7.096196 7.724774e-03        5           0
22       spoken  7.096196 7.724774e-03        5           0
23    tradition  7.096196 7.724774e-03        5           0
24     contains  6.311625 1.199489e-02        6           1
25       embryo  6.311625 1.199489e-02        6           1

and here’s the yes voters’ distinctive vocabulary

yes_terms <-  textstat_keyness(dfm_yesno, "yes")
head(yes_terms, 25)
       feature      chi2            p n_target n_reference
1  termination 12.405046 0.0004281753       29           1
2      subject 10.388204 0.0012682293       25           1
3        think  9.445068 0.0021171767       66          14
4         kind  9.383059 0.0021899984       23           1
5       reform  8.853983 0.0029245006       26           2
6       social  8.649324 0.0032718279       46           8
7        among  7.631432 0.0057359754       15           0
8      believe  7.461937 0.0063017160       40           7
9      perhaps  7.092561 0.0077404537       26           3
10   pregnancy  6.664001 0.0098379793       61          15
11       years  6.632589 0.0100129633       25           3
12       means  6.388376 0.0114870092       17           1
13   committee  6.251787 0.0124068067       40           8
14    children  5.917113 0.0149945083       30           5
15       women  5.868895 0.0154107242       39           8
16     carried  5.276889 0.0216103217       22           3
17     opinion  5.091765 0.0240397070       28           5
18      church  4.910597 0.0266924159       14           1
19        find  4.731488 0.0296152564       24           4
20    national  4.687791 0.0303776753       12           0
21      friend  4.636050 0.0313069760       44          11
22   attention  4.187336 0.0407270519       11           0
23          dr  4.115111 0.0425017411       16           2
24     reading  3.909668 0.0480090853       22           4
25        said  3.729155 0.0534704448       49          14

or as a picture

textplot_keyness(yes_terms)

Keyness without grouping

If we had been interested in the personal linguistic style of one of our speakers, we would not have had to group the dfm. For example, here are terms preferentially used by Mr. Norman St John-Stevas

nsjs_terms <- textstat_keyness(corpdfm, "Mr Norman St John-Stevas")
head(nsjs_terms, 10)
      feature     chi2            p n_target n_reference
1   principle 74.95131 0.000000e+00       10           7
2   tradition 36.26554 1.721812e-09        4           1
3       rests 31.76786 1.737445e-08        3           0
4    theology 31.76786 1.737445e-08        3           0
5       vital 22.60896 1.985563e-06        3           1
6  difference 20.18761 7.020622e-06        4           4
7         rid 20.18761 7.020622e-06        4           4
8      namely 17.13909 3.473993e-05        3           2
9       value 17.13909 3.473993e-05        3           2
10     unborn 16.94578 3.846281e-05        5           9

from which I think it’s pretty clear what he wants to talk about.