We’ll start in the ideal situation: you have a folder of nicely titled text files (in a single text encoding, ideally UTF-8) and you would like to make them into a corpus for processing.
One such folder is texts/uk-election-manifestos
, a collection of party platforms for the main UK political parties in the post-war period.
The most useful tool here is readtext
from the package of the same name:
library(tidyverse)
library(quanteda) # for working with corpora
Package version: 3.0.0
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 16 of 16 threads used.
See https://quanteda.io for tutorials and examples.
library(quanteda.textplots) # for plotting 'keyness'
library(readtext) # for getting documents and their info into a data frame
manifestos <- readtext("texts/uk-election-manifestos/")
head(manifestos)
readtext object consisting of 6 documents and 0 docvars.
# Description: df [6 × 2]
doc_id text
<chr> <chr>
1 UK_natl_2010_en_BNP.txt "\"DEMOCRACY,\"..."
2 UK_natl_2010_en_Coalition.txt "\"The Coalit\"..."
3 UK_natl_2010_en_Con.txt "\"INVITATION\"..."
4 UK_natl_2010_en_Green.txt "\"Green Part\"..."
5 UK_natl_2010_en_Lab.txt "\"The Labour\"..."
6 UK_natl_2010_en_LD.txt "\"Liberal De\"..."
There’s a lot of useful information about these manifesto files that is encoded in their file names, so it’s worth grabbing it as we read everything in. Let’s try it again:
manifestos <- readtext("texts/uk-election-manifestos/",
docvarsfrom = "filenames",
docvarnames = c("country", "national", "year", "language", "party"))
head(manifestos)
readtext object consisting of 6 documents and 5 docvars.
# Description: df [6 × 7]
doc_id text country national year language party
<chr> <chr> <chr> <chr> <int> <chr> <chr>
1 UK_natl_2010_en_BNP.t… "\"DEMOCRACY,\… UK natl 2010 en BNP
2 UK_natl_2010_en_Coali… "\"The Coalit\… UK natl 2010 en Coalit…
3 UK_natl_2010_en_Con.t… "\"INVITATION\… UK natl 2010 en Con
4 UK_natl_2010_en_Green… "\"Green Part\… UK natl 2010 en Green
5 UK_natl_2010_en_Lab.t… "\"The Labour\… UK natl 2010 en Lab
6 UK_natl_2010_en_LD.txt "\"Liberal De\… UK natl 2010 en LD
Here we’ve specified that there are non-text fields encoded in the file name (using some heuristics about what counts as a separator, which happily include the underscores in our file names), and provided our own names for them. Had we not provided names they would have arrived in columns labelled docvar1
through docvar5
.
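For comparison, here’s a quick sketch of what the unnamed version would look like (manifestos_default is just a throwaway name):
manifestos_default <- readtext("texts/uk-election-manifestos/",
                               docvarsfrom = "filenames")
names(manifestos_default)  # doc_id, text, docvar1, ..., docvar5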
manifestos
is a data frame, so you can do anything to it that you’d normally do with a data frame, e.g. add or remove columns, merge in information from another source, and extract subsets. Here we will do the bare minimum: remove the variable fields that are uninformative. country
is always “UK”, national
is always “natl”, and the language is always “en” (English).
# easiest to use dplyr here
manifestos <- select(manifestos, doc_id, text, year, party)
manifestos
readtext object consisting of 23 documents and 2 docvars.
# Description: df [23 × 4]
doc_id text year party
<chr> <chr> <int> <chr>
1 UK_natl_2010_en_BNP.txt "\"DEMOCRACY,\"..." 2010 BNP
2 UK_natl_2010_en_Coalition.txt "\"The Coalit\"..." 2010 Coalition
3 UK_natl_2010_en_Con.txt "\"INVITATION\"..." 2010 Con
4 UK_natl_2010_en_Green.txt "\"Green Part\"..." 2010 Green
5 UK_natl_2010_en_Lab.txt "\"The Labour\"..." 2010 Lab
6 UK_natl_2010_en_LD.txt "\"Liberal De\"..." 2010 LD
# … with 17 more rows
If you prefer base R, that would have been
manifestos <- manifestos[, c("doc_id", "text", "year", "party")]
Note: you can go a long way with select
, filter
, group_by
, and summarize
even if you don’t want to get deep into the tidyverse stuff. I’ll use select
a few times later, so just translate it in your head into things like the line above.
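For example, these two lines do the same thing (a quick illustration, not code we’ll need later):
filter(manifestos, year >= 2015)         # dplyr
manifestos[manifestos$year >= 2015, ]    # base R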
The next step is to get this into a corpus
.
manif_corp <- corpus(manifestos)
manif_corp
Corpus consisting of 23 documents and 2 docvars.
UK_natl_2010_en_BNP.txt :
"DEMOCRACY, FREEDOM, CULTURE AND IDENTITY. BRITISH NATIONAL ..."
UK_natl_2010_en_Coalition.txt :
"The Coalition: our programme for government. Freedom Fairne..."
UK_natl_2010_en_Con.txt :
"INVITATION TO JOIN THE GOVERNMENT OF BRITAIN THE CONSERVATIV..."
UK_natl_2010_en_Green.txt :
"Green Party general election manifesto 2010. Fair is worth ..."
UK_natl_2010_en_Lab.txt :
"The Labour Party Manifesto 2010 A future fair for all labour..."
UK_natl_2010_en_LD.txt :
"Liberal Democrat Manifesto 2010 fair taxes that put money ba..."
[ reached max_ndoc ... 17 more documents ]
A corpus has two important components: the texts themselves, and information about them, called ‘docvars’.
# the top few docvars
head(docvars(manif_corp))
year party
1 2010 BNP
2 2010 Coalition
3 2010 Con
4 2010 Green
5 2010 Lab
6 2010 LD
The docvars
function returns an ordinary R data frame. To get vectors of specific docvars, add the column name as a second argument. Here are the party names of the last few manifestos in the corpus
tail(docvars(manif_corp, "party"))
[1] "Green" "Lab" "LD" "PCy" "SNP" "UKIP"
If you want to add a document variable to a corpus, pretend the corpus is a data.frame (it isn’t) and use $
, e.g. like this
manif_corp$my_new_doc_var <- vector_of_the_right_length
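For instance, here’s a made-up illustration that creates a hypothetical docvar called recent (not a variable we’ll use again):
manif_corp$recent <- docvars(manif_corp, "year") >= 2015  # TRUE for 2015 onwards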
To extract just the document texts from a corpus, use as.character. But you should be super careful here: typing the name of the resulting text object will make R try to print it all to the screen, which may take a loooong time.
txts <- as.character(manif_corp) # don't type txts !
# show the first 100 characters of the first text
substr(txts[1], 1, 100)
UK_natl_2010_en_BNP.txt
"DEMOCRACY, FREEDOM, CULTURE AND IDENTITY. BRITISH NATIONAL PARTY GENERAL ELECTION MANIFESTO 2010. "
Honestly the raw texts aren’t so useful. We’ll usually want them as tokens.
Regrettably, working with raw texts is sometimes necessary. For this, there is a pile of functions with names beginning char_. We’ll try not to need them though.
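For example, char_tolower works directly on a character vector:
char_tolower("DEMOCRACY, FREEDOM, CULTURE AND IDENTITY")
[1] "democracy, freedom, culture and identity"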
We can start by getting a bit more information about the corpus using summary
# summary restricted to the first 10 documents
summary(manif_corp, n = 10)
Corpus consisting of 23 documents, showing 10 documents:
Text Types Tokens Sentences year party
UK_natl_2010_en_BNP.txt 5162 31997 1431 2010 BNP
UK_natl_2010_en_Coalition.txt 2867 14692 660 2010 Coalition
UK_natl_2010_en_Con.txt 4458 31267 1094 2010 Con
UK_natl_2010_en_Green.txt 3856 20126 961 2010 Green
UK_natl_2010_en_Lab.txt 4406 33178 1309 2010 Lab
UK_natl_2010_en_LD.txt 3955 32355 846 2010 LD
UK_natl_2010_en_PCy.txt 1819 7554 372 2010 PCy
UK_natl_2010_en_SNP.txt 1880 9226 337 2010 SNP
UK_natl_2010_en_UKIP.txt 2601 9262 439 2010 UKIP
UK_natl_2015_en_Con.txt 4485 40301 1160 2015 Con
Note that, by default, that is without specifying n above, summary will report just the first 100 documents. You can make this longer as well as shorter by setting n as above. (Sometimes it’s useful to know that the output of summary is also a data frame, in case you want to keep the contents for later.)
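For instance (a quick sketch; manif_info is just a name I’ve picked), we could keep the summary and compute a rough tokens-per-sentence figure for each manifesto:
manif_info <- summary(manif_corp, n = ndoc(manif_corp))
manif_info$Tokens / manif_info$Sentences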
The information in summary is available in separate functions if you need them:
ndoc(manif_corp) # document count
docnames(manif_corp) # unique document identifiers
ntype(manif_corp) # types in each document
ntoken(manif_corp) # tokens in each document
nsentence(manif_corp) # sentences in each document
Most often we will want to subset on the basis of the docvars, e.g. here we make another corpus containing just the three main parties in elections since 2000.
main_parties <- c("Lab", "Con", "LD")
manif_subcorp <- corpus_subset(manif_corp,
year > 2000 & party %in% main_parties)
summary(manif_subcorp)
Corpus consisting of 9 documents, showing 9 documents:
Text Types Tokens Sentences year party
UK_natl_2010_en_Con.txt 4458 31267 1094 2010 Con
UK_natl_2010_en_Lab.txt 4406 33178 1309 2010 Lab
UK_natl_2010_en_LD.txt 3955 32355 846 2010 LD
UK_natl_2015_en_Con.txt 4485 40301 1160 2015 Con
UK_natl_2015_en_Lab.txt 3230 20684 821 2015 Lab
UK_natl_2015_en_LD.txt 5086 37244 1301 2015 LD
UK_natl_2017_en_Con.txt 4227 34842 1231 2017 Con
UK_natl_2017_en_Lab.txt 4427 28702 1074 2017 Lab
UK_natl_2017_en_LD.txt 4244 24511 940 2017 LD
Two other useful corpus functions are corpus_sample
for when your corpus is very large and you want to work with a random smaller set of documents, and corpus_trim
which can remove sentences (or paragraphs or documents if you prefer) that have fewer or more than a specified number of tokens in them. Note that if there are no sentences left after removing those deemed too short, then the document is removed from the corpus. Usually this is what you want.
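A minimal sketch of each (the particular numbers here are arbitrary):
corpus_sample(manif_corp, size = 5)                            # five manifestos at random
corpus_trim(manif_corp, what = "sentences", min_ntoken = 5)    # drop very short sentences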
Sometimes it’s useful to collapse together all the documents that share a docvar value. For example, here’s everything each of the three main parties said across these elections, pushed into one document each and returned as a three-document corpus:
partycorp <- corpus_group(manif_subcorp, groups = party)
summary(partycorp)
Corpus consisting of 3 documents, showing 3 documents:
Text Types Tokens Sentences party
Con 7828 106410 3483 Con
Lab 7360 82564 3202 Lab
LD 7660 94110 3085 LD
This probably doesn’t make a lot of sense for manifestos, but it’s often useful for pulling together all of a speaker’s contributions over a debate. If you wanted instead to collapse the parties within each election year, set the groups to be year.
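That would look like:
yearcorp <- corpus_group(manif_subcorp, groups = year)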
We can do other things with a corpus by using the functions with names starting corpus_
, e.g. sample, trim, and reshape. corpus_reshape
will do the opposite of corpus_group
by expanding or contracting the number of documents by redefining the document unit. We’ll see an example shortly.
The debate “Medical Termination of Pregnancy Bill” is on Hansard, but I’ve scraped it and concatenated each speaker’s contributions into a single document in a corpus object for you. This is certainly not the only way to think about analyzing this data, but it’s what Bara et al. did.
load("data/corpus_bara_speaker.rda")
summary(corpus_bara_speaker)
Corpus consisting of 27 documents, showing 27 documents:
Text Types Tokens Sentences speaker vote
Dr David Owen 640 1948 89 Dr David Owen yes
Dr Horace King 106 197 17 Dr Horace King abs
Dr John Dunwoody 585 2106 85 Dr John Dunwoody yes
Dr Michael Winstanley 52 72 4 Dr Michael Winstanley yes
Hon. Sam Silkin 38 45 4 Hon. Sam Silkin yes
Miss Joan Vickers 732 2453 110 Miss Joan Vickers yes
Mr Alex Lyon 59 69 4 Mr Alex Lyon yes
Mr Angus Maude 717 2551 95 Mr Angus Maude yes
Mr Charles Pannell 105 194 13 Mr Charles Pannell yes
Mr David Steel 1310 5792 207 Mr David Steel yes
Mr Edward Lyons 419 878 42 Mr Edward Lyons yes
Mr John Mendelson 55 74 3 Mr John Mendelson yes
Mr Kevin McNamara 1068 3896 153 Mr Kevin McNamara no
Mr Leo Abse 760 2433 95 Mr Leo Abse yes
Mr Norman St John-Stevas 739 2553 124 Mr Norman St John-Stevas no
Mr Peter Jackson 20 22 2 Mr Peter Jackson yes
Mr Peter Mahon 65 99 6 Mr Peter Mahon no
Mr Roy Jenkins 761 2800 102 Mr Roy Jenkins yes
Mr Roy Roebuck 18 23 2 Mr Roy Roebuck yes
Mr William Deedes 543 1706 82 Mr William Deedes abs
Mr William Wells 827 3113 144 Mr William Wells no
Mrs Anne Kerr 12 12 1 Mrs Anne Kerr yes
Mrs Gwyneth Dunwoody 67 99 4 Mrs Gwyneth Dunwoody yes
Mrs Jill Knight 860 2999 128 Mrs Jill Knight no
Mrs Renée Short 736 2501 94 Mrs Renée Short yes
Sir Henry Legge-Bourke 76 113 5 Sir Henry Legge-Bourke yes
Sir John Hobson 734 3261 135 Sir John Hobson abs
A quick reminder: Here are all the speakers that voted ‘no’ at the end of the debate and said more than 100 words during it
no_corp <- corpus_subset(corpus_bara_speaker,
vote == "no" & ntoken(corpus_bara_speaker) > 100)
no_corp
Corpus consisting of 4 documents and 2 docvars.
Mr Kevin McNamara :
" § Mr. Kevin McNamar..."
Mr Norman St John-Stevas :
" § Mr. Norman St. Jo..."
Mr William Wells :
" § Mr. William Wells..."
Mrs Jill Knight :
" § Mrs. Jill Knight ..."
(Goodbye Mr Mahon)
Finally, it’s sometimes convenient to be able to switch from thinking in terms of sets of documents to sets of paragraphs, or even sentences.
para_corp <- corpus_reshape(corpus_bara_speaker,
to = "paragraphs") # or "sentences"
head(summary(para_corp)) # Just the top few lines
Text Types Tokens Sentences speaker vote
1 Dr David Owen.1 433 1000 41 Dr David Owen yes
2 Dr David Owen.2 81 122 8 Dr David Owen yes
3 Dr David Owen.3 67 96 10 Dr David Owen yes
4 Dr David Owen.4 297 730 31 Dr David Owen yes
5 Dr Horace King.1 75 118 6 Dr Horace King abs
6 Dr Horace King.2 21 26 4 Dr Horace King abs
Happily we can always reverse this process by changing to
back to “documents”.
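For example:
corpus_reshape(para_corp, to = "documents") # back to one document per speaker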
Let’s explore a little more by looking for the key terms in play.
One way to do this is to look for collocations. The collocation finder operates on the tokens of the corpus, so let’s talk a bit about those first.
We can extract tokens from the corpus directly
toks <- tokens(corpus_bara_speaker)
Structurally, a tokens object is a list
of character
vectors, i.e. one list element per document and one character vector element per ‘token’. So, to get the tokens for the 10th document
toks[[10]]
and to get a shorter list containing only the first through third documents’ tokens
toks[1:3]
If we want more control over the tokenization process (pro tip: we do), e.g. perhaps we don’t want to count punctuation and do want to remove numbers, then take a look at the help page for this function and practice adding the relevant extra parameters to the function call.
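For example, one plausible combination (we’ll use something very like this later on):
tokens(corpus_bara_speaker, remove_punct = TRUE, remove_numbers = TRUE)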
There are a lot of useful dedicated tokens-processing functions in {quanteda}
all beginning with tokens_
:
chunk compound keep ngrams remove replace sample
segment select split subset tolower toupper
wordstem lookup
Handily, tokens objects carry their docvars around with them too, so you can use tokens_subset
to get the tokens for e.g. just the speakers who voted “yes”, just the way you’d use corpus_subset
.
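For example:
yes_toks <- tokens_subset(toks, vote == "yes")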
But let’s get to the collocation finding tools. We’ll need a companion package for this sort of thing
library(quanteda.textstats)
colls <- textstat_collocations(toks)
head(colls, 20)
collocation count count_nested length lambda z
1 it is 199 0 2 3.529352 35.73172
2 of the 410 0 2 1.993056 31.11606
3 the bill 208 0 2 3.648663 26.55903
4 do not 70 0 2 5.127010 26.41491
5 member for 88 0 2 6.569652 24.35931
6 there is 82 0 2 3.653461 24.01364
7 should be 62 0 2 3.853810 23.34994
8 those who 37 0 2 5.387605 23.05370
9 my hon 41 0 2 4.412487 22.05236
10 has been 37 0 2 4.818490 22.03920
11 would be 58 0 2 3.562342 21.81443
12 there are 45 0 2 3.927881 21.35060
13 have been 40 0 2 4.348728 20.90335
14 i think 59 0 2 4.135198 20.60385
15 think that 64 0 2 4.332629 20.06994
16 may be 42 0 2 4.032614 19.70798
17 i have 71 0 2 2.753093 19.44622
18 that it 92 0 2 2.376724 19.36427
19 an abortion 28 0 2 4.387547 18.93842
20 right hon 28 0 2 4.574635 18.86004
This is disappointingly unsubstantive, but if we work a bit harder we can get better results. First we’ll grab some stopwords
stps <- stopwords()
head(stps)
[1] "i" "me" "my" "myself" "we" "our"
and do the whole thing again, but removing them all and leaving an empty space where they were
toks2 <- tokens_remove(toks, stps, padding = TRUE)
Let’s see what we did
toks2[[1]][1:20] # first 20 tokens of document 1
[1] "§" "Dr" "." "David" "Owen"
[6] "(" "Plymouth" "," "Sutton" ")"
[11] "" "gives" "" "great" "pleasure"
[16] "" "speak" "immediately" "" ""
Now rerun the function, maintaining the capitalization
coll2 <- textstat_collocations(toks2, tolower = FALSE, size = 2)
head(coll2, 20)
collocation count count_nested length lambda z
1 right hon 28 0 2 4.574635 18.86004
2 medical profession 37 0 2 8.080010 18.18837
3 human life 18 0 2 5.976166 17.78455
4 illegal abortions 17 0 2 7.641377 17.24541
5 learned Member 12 0 2 6.170088 15.80828
6 put forward 9 0 2 7.120416 14.71733
7 Royal College 15 0 2 10.199146 13.98221
8 Private Members 8 0 2 8.669238 13.62481
9 public opinion 7 0 2 7.490535 12.96645
10 many women 9 0 2 4.921211 12.78258
11 years ago 7 0 2 8.400914 12.72887
12 give way 8 0 2 5.123612 12.54151
13 mental health 6 0 2 6.248463 12.45540
14 Committee points 6 0 2 6.228301 12.29201
15 general practitioner 6 0 2 8.768819 12.28565
16 30 years 7 0 2 7.727610 12.27375
17 substantial risk 7 0 2 7.727610 12.27375
18 serious risk 6 0 2 5.980019 12.21676
19 present law 10 0 2 4.532178 12.15809
20 Catholic Church 5 0 2 7.152340 12.11036
Much better, I think.
We can also ask for three word collocations
coll3 <- textstat_collocations(toks2, tolower = FALSE, size = 3)
head(coll3, 30)
   collocation                              count count_nested length     lambda          z
1  abortion law reform                          3            0      3  2.9744919  1.5478204
2  present case law                             2            0      3  1.8848477  0.8762667
3  give qualified support                       2            0      3  1.7554198  0.6797675
4  Friend give way                              2            0      3  1.4239933  0.6621731
5  position beyond doubt                        2            0      3 -0.3902018 -0.1494450
6  points extremely fairly                      2            0      3 -0.5831442 -0.2247319
7  Lord Silkin's Bill                           2            0      3 -0.7000055 -0.2389001
8  Private Member's Bill                        6            0      3 -1.0589782 -0.3564888
9  give drafting assistance                     2            0      3 -1.4338460 -0.5325926
10 healthy human beings                         3            0      3 -2.1689239 -0.7454156
11 Royal Medico-Psychological Association       2            0      3 -2.8474063 -0.9734579
12 involve serious risk                         2            0      3 -2.4552517 -1.0855574
13 pregnant woman's capacity                    2            0      3 -3.0693678 -1.3223867
14 across party lines                           2            0      3 -3.8476748 -1.3971378
15 learned Friend give                          2            0      3 -3.2998966 -1.4417678
16 Law Reform Association                       5            0      3 -4.2829594 -1.4507937
17 registered medical practitioner              2            0      3 -3.6766671 -1.5165221
18 Government's collective attitude             3            0      3 -5.2018835 -1.6186247
19 termed Committee points                      2            0      3 -4.2116169 -1.6234315
20 Abortion Law Reform                          5            0      3 -5.0452333 -1.6942560
21 Kingston upon Hull                           2            0      3 -5.5383892 -1.7138761
22 British Medical Association                  2            0      3 -4.7348107 -1.9840171
23 National Health Service                      4            0      3 -5.5822424 -2.0489942
24 accept Clause 1                              2            0      3 -5.6965570 -2.2256479
25 thirty years ago                             2            0      3 -6.3703986 -2.4279730
26 potentially healthy human                    2            0      3 -5.9134771 -2.5423975
27 Dame Joan Vickers                            3            0      3 -9.0528006 -2.5827487
28 last 30 years                                3            0      3 -4.7341412 -2.6438300
29 Roman Catholic Church                        2            0      3 -6.2529756 -2.6579283
30 National Opinion Poll                        2            0      3 -8.1651964 -2.7342655
Since this is an abortion law debate, let’s see how the honourable members talk about mothers and babies. We’ll use the ‘keyword in context’ function kwic
, which wants to be given a bunch of tokens, some pattern to match, and a window:
toks <- tokens(corpus_bara_speaker)
kw_mother <- kwic(toks, "mother*", window = 10)
head(kw_mother)
Keyword-in-context with 6 matches.
[Dr John Dunwoody, 941] of that survival. I am thinking particularly of the | mothers | with large families and the burdens of large families very
[Dr John Dunwoody, 986] to think more of the family unit, of the | mother | and father and the children. I take it further
[Dr John Dunwoody, 1177] In numerical terms, so far as the numbers of | mothers | are concerned, they are comparatively unimportant. The important
[Dr John Dunwoody, 1204] c ), which lays down the grounds of the | mother's | capacity being severely overstrained by the care of a child
[Dr John Dunwoody, 1655] As I understand it, it means capacity as a | mother | in the fullest sense. I think it means something
[Dr John Dunwoody, 1702] family together, who knits the various children and the | mother | and father together, so that the mother can play
KWICs can get quite large, but if you want to see it all
View(kw_mother)
will open a browser with the whole thing. This object is a data frame underneath, which can be helpful for various tasks, e.g. we could use filter
or equivalent to find all instances spoken by a particular participant.
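Something like this (a quick sketch using dplyr; the speaker names live in the docname column):
filter(as.data.frame(kw_mother), docname == "Dr John Dunwoody")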
In any case, one thing we learn is that there is much less talk of babies than of mothers. In this debate, the other major actors are doctors and their professional association, and a small amount of religious content. We can investigate this the same way.
If we want to look at phrases in their contexts, e.g. the word pairs and triples we found in the collocations earlier, or things we’re sure are there like “medical profession” or “human life”, there are two approaches:
Either we can use tokens_compound
to force them into a single token and look for that (fiddly) or we can use phrase
inside the kwic
function.
medprof <- kwic(toks, phrase("medical profession"))
You can view this one for yourself.
For reference, here’s how to take the other route
phrases <- phrase(c("medical profession", "human life")) # make a phrase
toks <- tokens_compound(toks, phrases) # make the tokens show it as _ connected
and then
kwic(toks, "medical_profession")
Since the output of kwic
is simply a data frame, one thing that’s often useful is to stitch the left and right sides of the KWIC together and treat each as a small document (here, say, a document about babies).
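A minimal sketch of that idea, with object names of my own choosing:
kw_baby <- kwic(toks, "bab*", window = 10)
baby_corp <- corpus(paste(kw_baby$pre, kw_baby$post),
                    docvars = data.frame(speaker = kw_baby$docname))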
Returning to the full corpus, we will often want to construct a document term matrix. {quanteda}
calls this a ‘dfm’ (document feature matrix) to reflect the fact that we will often count things other than words.
corpdfm <- dfm(toks) # lowercases by default, but not much more
dim(corpdfm)
[1] 27 4039
featnames(corpdfm)[1:40] # really just colnames
[1] "§" "dr" "." "david" "owen"
[6] "(" "plymouth" "," "sutton" ")"
[11] "it" "gives" "me" "great" "pleasure"
[16] "to" "speak" "immediately" "after" "the"
[21] "hon" "lady" "member" "for" "devonport"
[26] "dame" "joan" "vickers" "she" "and"
[31] "i" "shared" "same" "political" "platform"
[36] "at" "an" "inter-church" "meeting" "declared"
docnames(corpdfm)
[1] "Dr David Owen" "Dr Horace King"
[3] "Dr John Dunwoody" "Dr Michael Winstanley"
[5] "Hon. Sam Silkin" "Miss Joan Vickers"
[7] "Mr Alex Lyon" "Mr Angus Maude"
[9] "Mr Charles Pannell" "Mr David Steel"
[11] "Mr Edward Lyons" "Mr John Mendelson"
[13] "Mr Kevin McNamara" "Mr Leo Abse"
[15] "Mr Norman St John-Stevas" "Mr Peter Jackson"
[17] "Mr Peter Mahon" "Mr Roy Jenkins"
[19] "Mr Roy Roebuck" "Mr William Deedes"
[21] "Mr William Wells" "Mrs Anne Kerr"
[23] "Mrs Gwyneth Dunwoody" "Mrs Jill Knight"
[25] "Mrs Renée Short" "Sir Henry Legge-Bourke"
[27] "Sir John Hobson"
But let’s remove some things that aren’t (currently) of interest to us
toks <- tokens(corpus_bara_speaker, ## yes, yes
remove_punct = TRUE,
remove_numbers = TRUE)
toks <- tokens_remove(toks, stps) # those stopwords we saw earlier
corpdfm <- dfm(toks)
dim(corpdfm) # a bit smaller
[1] 27 3755
featnames(corpdfm)[1:40]
[1] "dr" "david" "owen" "plymouth" "sutton"
[6] "gives" "great" "pleasure" "speak" "immediately"
[11] "hon" "lady" "member" "devonport" "dame"
[16] "joan" "vickers" "shared" "political" "platform"
[21] "inter-church" "meeting" "declared" "intention" "support"
[26] "measure" "abortion" "law" "reform" "one"
[31] "come" "house" "plea" "given" "government"
[36] "time" "obviously" "free" "vote" "leaving"
We could also stem
stoks <- tokens(corpus_bara_speaker, ## yes, yes
remove_punct = TRUE,
remove_numbers = TRUE)
stoks <- tokens_wordstem(stoks)
scorpdfm <- dfm(stoks)
dim(scorpdfm) # a bit smaller
[1] 27 2765
featnames(scorpdfm)[1:40]
[1] "dr" "david" "owen" "plymouth" "sutton"
[6] "it" "give" "me" "great" "pleasur"
[11] "to" "speak" "immedi" "after" "the"
[16] "hon" "ladi" "member" "for" "devonport"
[21] "dame" "joan" "vicker" "she" "and"
[26] "i" "share" "same" "polit" "platform"
[31] "at" "an" "inter-church" "meet" "declar"
[36] "our" "intent" "support" "a" "measur"
Be careful if you’re planning on applying a dictionary, since its entries aren’t stemmed; we’d confuse it by stemming the source material first.
For modeling, we’ll often want to remove the low frequency and idiosyncratic words
smallcorpdfm <- dfm_trim(corpdfm, min_termfreq = 5, min_docfreq = 5)
dim(smallcorpdfm) # this might have been a bit drastic...
[1] 27 530
where min_termfreq removes any word that occurs fewer than 5 times in total, and min_docfreq removes any word, however frequent, that occurs in fewer than 5 different documents. That makes things a lot smaller.
One very convenient feature of the dfm
, tokens
, and corpus
is that they keep our docvars squirreled away inside themselves, so we can subset in the same way as we did with the corpus object
dfm_subset(corpdfm, vote != "abs") # remove abstentions
Document-feature matrix of: 24 documents, 3,755 features (90.71% sparse) and 2 docvars.
features
docs dr david owen plymouth sutton gives great pleasure
Dr David Owen 4 1 4 2 1 1 5 1
Dr John Dunwoody 4 1 0 0 0 0 3 0
Dr Michael Winstanley 1 0 0 0 0 0 0 0
Hon. Sam Silkin 0 0 0 0 0 0 0 0
Miss Joan Vickers 3 1 0 1 0 0 3 1
Mr Alex Lyon 0 0 0 0 0 0 0 0
features
docs speak immediately
Dr David Owen 2 1
Dr John Dunwoody 0 0
Dr Michael Winstanley 0 0
Hon. Sam Silkin 0 0
Miss Joan Vickers 0 0
Mr Alex Lyon 0 0
[ reached max_ndoc ... 18 more documents, reached max_nfeat ... 3,745 more features ]
And just like the corpus objects we made earlier we can ‘group’ to collapse the dfm counts across documents (here speakers)
dfm_votes <- dfm_group(corpdfm, vote)
dfm_votes
Document-feature matrix of: 3 documents, 3,755 features (50.63% sparse) and 1 docvar.
features
docs dr david owen plymouth sutton gives great pleasure speak immediately
abs 1 3 0 0 0 2 16 0 6 0
no 2 4 0 0 0 1 7 0 12 1
yes 16 15 6 5 7 1 22 3 8 3
[ reached max_nfeat ... 3,745 more features ]
This might be useful if we want to examine language in the light of subsequent voting. Let’s generalize this kind of comparison next.
If we are interested in comparing the usage of groups of speakers, we can use the textstat_frequency
function. There are a lot of textstat_
functions, e.g.
select dist simil entropy frequency keyness lexdiv readability
so we can compare documents in several different ways.
Here we’ll examine what sorts of words eventual yes and no voters used, removing the abstainers before we get going. We’ll use textstat_frequency
to set groups on the fly, so it won’t matter whether we added any when we called dfm
:
corpdfm_yesno <- dfm_subset(corpdfm, vote != "abs")
textstat_frequency(corpdfm_yesno,
n = 20, groups = vote)
feature frequency rank docfreq group
1 bill 77 1 5 no
2 hon 63 2 5 no
3 abortion 55 3 5 no
4 one 53 4 4 no
5 child 49 5 4 no
6 life 43 6 4 no
7 law 39 7 4 no
8 may 37 8 5 no
9 mr 36 9 5 no
10 right 31 10 5 no
11 member 30 11 3 no
12 medical 29 12 4 no
13 mother 29 12 4 no
14 house 26 14 4 no
15 can 26 14 4 no
16 say 25 16 4 no
17 members 24 17 4 no
18 human 23 18 4 no
19 support 21 19 4 no
20 many 21 19 4 no
21 bill 162 1 12 yes
22 hon 118 2 15 yes
23 law 100 3 10 yes
24 abortion 97 4 9 yes
25 one 86 5 9 yes
26 member 68 6 11 yes
27 think 66 7 10 yes
28 house 63 8 11 yes
29 pregnancy 61 9 9 yes
30 can 60 10 9 yes
31 mr 55 11 15 yes
32 may 55 11 10 yes
33 medical 51 13 8 yes
34 many 50 14 10 yes
35 said 49 15 9 yes
36 social 46 16 11 yes
37 child 45 17 11 yes
38 friend 44 18 11 yes
39 woman 43 19 9 yes
40 right 43 19 12 yes
As is often the case, raw counts are not so informative, so we can instead ask which terms differ systematically across yes and no voters. For this we’ll make use of “keyness”
Here we’ll actually need to add groups to our dfm first. We’ll also always be comparing one document to the rest and asking what makes it distinctive. That means we will need to define documents so that we get the comparison we want.
dfm_yesno <- dfm_group(corpdfm_yesno, vote)
Here’s the ‘no’ voters’ distinctive vocabulary
no_terms <- textstat_keyness(dfm_yesno, "no")
head(no_terms, 25)
feature chi2 p n_target n_reference
1 argument 18.076446 2.122105e-05 19 7
2 baby 18.032510 2.171645e-05 14 3
3 life 16.116036 5.957741e-05 43 35
4 child 14.379226 1.494419e-04 49 45
5 bad 13.935827 1.891585e-04 13 4
6 let 13.665578 2.184217e-04 10 1
7 evidence 13.263193 2.706682e-04 16 7
8 human 11.180932 8.264226e-04 23 16
9 wells 10.698829 1.072034e-03 11 3
10 reject 9.027651 2.659260e-03 6 0
11 society 8.782368 3.041561e-03 13 7
12 rights 8.100676 4.424875e-03 7 1
13 yet 7.611673 5.799165e-03 14 9
14 demand 7.412482 6.477289e-03 9 3
15 knight 7.412482 6.477289e-03 9 3
16 birmingham 7.323926 6.804260e-03 11 6
17 principle 7.323926 6.804260e-03 11 6
18 unborn 7.318244 6.825806e-03 10 4
19 argue 7.096196 7.724774e-03 5 0
20 independent 7.096196 7.724774e-03 5 0
21 intelligence 7.096196 7.724774e-03 5 0
22 spoken 7.096196 7.724774e-03 5 0
23 tradition 7.096196 7.724774e-03 5 0
24 contains 6.311625 1.199489e-02 6 1
25 embryo 6.311625 1.199489e-02 6 1
and here’s the ‘yes’ voters’ distinctive vocabulary
yes_terms <- textstat_keyness(dfm_yesno, "yes")
head(yes_terms, 25)
feature chi2 p n_target n_reference
1 termination 12.405046 0.0004281753 29 1
2 subject 10.388204 0.0012682293 25 1
3 think 9.445068 0.0021171767 66 14
4 kind 9.383059 0.0021899984 23 1
5 reform 8.853983 0.0029245006 26 2
6 social 8.649324 0.0032718279 46 8
7 among 7.631432 0.0057359754 15 0
8 believe 7.461937 0.0063017160 40 7
9 perhaps 7.092561 0.0077404537 26 3
10 pregnancy 6.664001 0.0098379793 61 15
11 years 6.632589 0.0100129633 25 3
12 means 6.388376 0.0114870092 17 1
13 committee 6.251787 0.0124068067 40 8
14 children 5.917113 0.0149945083 30 5
15 women 5.868895 0.0154107242 39 8
16 carried 5.276889 0.0216103217 22 3
17 opinion 5.091765 0.0240397070 28 5
18 church 4.910597 0.0266924159 14 1
19 find 4.731488 0.0296152564 24 4
20 national 4.687791 0.0303776753 12 0
21 friend 4.636050 0.0313069760 44 11
22 attention 4.187336 0.0407270519 11 0
23 dr 4.115111 0.0425017411 16 2
24 reading 3.909668 0.0480090853 22 4
25 said 3.729155 0.0534704448 49 14
or as a picture
textplot_keyness(yes_terms)
If we had been interested in the personal linguistic style of one of our speakers, we would not have had to group the dfm. For example, here are terms preferentially used by Mr. Norman St John-Stevas
nsjs_terms <- textstat_keyness(corpdfm, "Mr Norman St John-Stevas")
head(nsjs_terms, 10)
feature chi2 p n_target n_reference
1 principle 74.95131 0.000000e+00 10 7
2 tradition 36.26554 1.721812e-09 4 1
3 rests 31.76786 1.737445e-08 3 0
4 theology 31.76786 1.737445e-08 3 0
5 vital 22.60896 1.985563e-06 3 1
6 difference 20.18761 7.020622e-06 4 4
7 rid 20.18761 7.020622e-06 4 4
8 namely 17.13909 3.473993e-05 3 2
9 value 17.13909 3.473993e-05 3 2
10 unborn 16.94578 3.846281e-05 5 9
from which I think it’s pretty clear what he wants to talk about.