In this lab we’re going to investigate a different debate on abortion law, this time in the United States. The debate occurred on October 21, 2003 as the last stage of the passage of the ‘Partial-Birth Abortion Ban Act of 2003’ that banned specific forms of abortion.

There are 23 speakers and we will be interested in simultaneously discovering the main topics of the debate and decomposing each speaker’s (and implicitly their party’s) patterns of emphasis over those topics. We’ll use a representation of the debate that bundles each speaker’s contributions together into a single document, scraped from the plain-ish text versions offered by congress.gov. The process of extracting this information is a bit involved, so we’ll pass over the details.

However, we will usually need more than 23 documents to get a good topic model going. Accordingly we’ll treat a smaller unit as our document - it will be whatever the Congressional Record thinks should be a paragraph.

Here are the raw materials

library(tidyverse)
library(quanteda)
library(stm) # the topic model package we'll use

load("data/corpus_us_debate.rda")
summary(corpus_us_debate, n = 5)
## Corpus consisting of 23 documents, showing 5 documents:
## 
##       Text Types Tokens Sentences party   speaker
##     ALLARD   400   1165        53     R    ALLARD
##       BOND   129    232         9     R      BOND
##      BOXER  2231  18527       886     D     BOXER
##  BROWNBACK   646   2884       168     R BROWNBACK
##    BUNNING   281    593        32     R   BUNNING

Now we’ll do the switcheroo to paragraphs

para_corp <- corpus_reshape(corpus_us_debate, to = "paragraphs") 
head(summary(para_corp))
##       Text Types Tokens Sentences party speaker
## 1 ALLARD.1    32     48         3     R  ALLARD
## 2 ALLARD.2    41     64         4     R  ALLARD
## 3 ALLARD.3    25     29         1     R  ALLARD
## 4 ALLARD.4    22     24         1     R  ALLARD
## 5 ALLARD.5    68    144         3     R  ALLARD
## 6 ALLARD.6    55     89         3     R  ALLARD

Notice that the texts now have a numerical identifier added and the other document variables have been duplicated appropriately.

Although contributions are probably too big, ironically paragraphs run the risk of being too small (or even disappearing entirely if the paragraph-spotting algorithm is imperfect). Let’s check

table(ntoken(para_corp)) # 15 documents with, err, no words
## 
##   0   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##  15  53  24  57  54  43  62 129 210 211 214 138  77  35  14  12  11   5  13   6 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   6   3   5   5   4   3   4   9   5   8   8   6  10   4   2   7   6   2   4   6 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   8   8   5   3   5  11   1   3   9   4   4   5   4   6   8   3   9   9   7   6 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   5   9   3   8   6   3  10   7   4   6  13  11   5   3   4   3   9   2  11   8 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  99 100 101 
##   4   2   5   2   3   6   6   3  11   4   1   5   4   3   4   4   5   6   1   1 
## 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 
##   3   3   2   3   3  10   3   2   4   2   3   2   5   1   2   3   3   3   2   1 
## 122 123 124 125 126 127 128 129 131 132 133 134 135 137 138 139 140 141 142 143 
##   4   1   3   4   1   2   2   5   2   2   1   2   1   4   4   3   1   1   2   3 
## 144 145 146 148 151 155 157 158 160 161 165 166 168 169 174 175 182 190 191 197 
##   2   1   1   2   1   1   1   2   1   1   1   1   1   1   1   1   1   2   1   1 
## 200 206 209 211 235 259 276 
##   1   1   1   1   1   1   1

so we’ll trim these tiny documents. They appear because the paragraph splitter is heuristic, and therefore sometimes just wrong.

Then we’ll make a document term matrix for the topic model to work with from a lightly edited corpus that leaves out the empty documents and also the ones with fewer than 3 words in them.

para_corp <- corpus_subset(para_corp, ntoken(para_corp) > 2)

toks <- tokens(para_corp, 
               remove_punct = TRUE, 
               remove_numbers = TRUE) |>
  tokens_remove(stopwords()) # living dangerously!

para_dfm <- dfm(toks)

To start with we’ll fit a 10-topic model using the stm package

mod <- stm(para_dfm, K = 10, seed = 12345)

Don’t forget to set the seed so others can replicate your work.

Model structure

We can extract the topic proportions directly from the model

topic_props <- mod$theta
dim(topic_props) # a documents by K matrix
## [1] 1889   10

head(topic_props)
##             [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] 0.002025542 0.005055413 0.002025542 0.969221789 0.002025542 0.002024922
## [2,] 0.024850691 0.022996467 0.024850691 0.017263099 0.024850691 0.024976595
## [3,] 0.057541536 0.053512314 0.057541536 0.027687428 0.057541536 0.058104665
## [4,] 0.008369383 0.032210700 0.008369383 0.020745106 0.008369383 0.008362421
## [5,] 0.020288536 0.109409476 0.020288536 0.005016331 0.020288536 0.020453474
## [6,] 0.037574160 0.510241171 0.037574160 0.013238199 0.037574160 0.038452853
##             [,7]        [,8]        [,9]       [,10]
## [1,] 0.003238480 0.007778911 0.002025542 0.004578318
## [2,] 0.152153110 0.113430063 0.024850691 0.569777904
## [3,] 0.141812680 0.040076215 0.057541536 0.448640554
## [4,] 0.008968545 0.881099070 0.008369383 0.015136627
## [5,] 0.088519471 0.662662540 0.020288536 0.032784564
## [6,] 0.022002999 0.175019903 0.037574160 0.090748238

(If for some reason you want uncertainty around these proportions, you can get a sample from the posterior distribution of the thetas using the thetaPosterior function.)
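For example, a minimal sketch (the 95% interval is my own choice of summary):

draws <- thetaPosterior(mod, nsims = 100) # one nsims x K matrix per document
# a rough 95% interval for document 1's topic 4 proportion
quantile(draws[[1]][, 4], c(0.025, 0.975))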

How about the word probabilities? The package really doesn’t want you to manipulate the betas by yourself, so they appear in a list within a list and in log form. Let’s find, just once, the most frequently generated words for topic 4

word_logprobs <- mod$beta[[1]][[1]] # !!
dim(word_logprobs) # K by vocabulary size
## [1]   10 3941

# sort from high to low log beta
topic4_highest <- order(word_logprobs[4,], decreasing = TRUE)
# look at the top 20
mod$vocab[topic4_highest[1:20]]
##  [1] "health"    "senator"   "roe"       "v"         "bill"      "court"    
##  [7] "women"     "procedure" "senate"    "wade"      "abortion"  "right"    
## [13] "decision"  "exception" "supreme"   "want"      "necessary" "life"     
## [19] "said"      "time"

There must be an easier way to do this. And there is:

Identifying topics

Here’s a function to find representative words in a topic. Check out topic 4

labelTopics(mod)
## Topic 1 Top Words:
##       Highest Prob: procedure, abortion, health, procedures, bill, life, women 
##       FREX: procedures, procedure, human, women's, right, doctors, us 
##       Lift: antonia, imposes, unconstitutionally, ailment, mechanism, tries, agonizing 
##       Score: procedure, procedures, abortion, health, life, human, bill 
## Topic 2 Top Words:
##       Highest Prob: abortion, d, physicians, induction, techniques, baby's, intact 
##       FREX: dc, evacuation, rate, hysterotomy, methods, dilation, extraction 
##       Lift: 10-point, 12-point, 19th, al, cunningham, mcdonald, kidney 
##       Score: abortion, d, induction, dc, physicians, labor, washington 
## Topic 3 Top Words:
##       Highest Prob: procedure, abortion, health, procedures, bill, life, women 
##       FREX: procedures, procedure, human, women's, right, doctors, us 
##       Lift: ed, imposes, unconstitutionally, ailment, mechanism, tries, agonizing 
##       Score: procedure, procedures, abortion, health, life, human, bill 
## Topic 4 Top Words:
##       Highest Prob: health, senator, roe, v, bill, court, women 
##       FREX: roe, v, wade, supreme, house, harkin, court 
##       Lift: henriquez, harkin's, striping, aside, grown, inures, mistaken 
##       Score: roe, v, wade, senator, senate, court, supreme 
## Topic 5 Top Words:
##       Highest Prob: procedure, abortion, health, procedures, bill, life, women 
##       FREX: procedures, procedure, human, women's, right, doctors, us 
##       Lift: hernandez, imposes, unconstitutionally, ailment, mechanism, tries, agonizing 
##       Score: procedure, procedures, abortion, health, life, human, bill 
## Topic 6 Top Words:
##       Highest Prob: procedure, health, abortion, women, life, procedures, legislation 
##       FREX: save, lives, preserve, legislation, living, follows, attempt 
##       Lift: inquiry, thin, essence, notes, ripped, thread, vicious 
##       Score: procedure, procedures, health, legislation, women, life, abortion 
## Topic 7 Top Words:
##       Highest Prob: partial-birth, mother, birth, act, called, around, yield 
##       FREX: birth, act, partial-birth, latinas, second-trimester, hon, mention 
##       Lift: restricting, cairo, development, egypt, icpd, affairs, deputy 
##       Score: partial-birth, birth, act, mother, around, latinas, hon 
## Topic 8 Top Words:
##       Highest Prob: baby, child, health, can, bill, going, think 
##       FREX: hand, see, little, samuel, age, womb, nurse 
##       Lift: n.a.r.a.l, deaf, operate, resulting, spinal, abundant, affirmative 
##       Score: baby, child, going, can, see, think, hand 
## Topic 9 Top Words:
##       Highest Prob: procedure, abortion, health, procedures, bill, life, women 
##       FREX: procedures, procedure, human, women's, right, doctors, us 
##       Lift: p, imposes, unconstitutionally, ailment, mechanism, tries, agonizing 
##       Score: procedure, procedures, abortion, health, life, human, bill 
## Topic 10 Top Words:
##       Highest Prob: abortion, medical, procedure, health, life, women, abortions 
##       FREX: physical, term, partial, stories, endangered, medical, administration 
##       Lift: parliamentary, embryos, insurance, requirement, contingency, separated, clerk 
##       Score: abortion, procedure, medical, life, abortions, women, health

Despite how it looks, the output is actually a list, so you can grab the summary you like and use it elsewhere. Here’s the x-ray view of the object using str

topic_labs <- labelTopics(mod)
str(topic_labs)
## List of 5
##  $ prob     : chr [1:10, 1:7] "procedure" "abortion" "procedure" "health" ...
##  $ frex     : chr [1:10, 1:7] "procedures" "dc" "procedures" "roe" ...
##  $ lift     : chr [1:10, 1:7] "antonia" "10-point" "ed" "henriquez" ...
##  $ score    : chr [1:10, 1:7] "procedure" "abortion" "procedure" "roe" ...
##  $ topicnums: int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr "labelTopics"

Add an argument to labelTopics if you want fewer words returned. Let’s get the highest FREX words and melt them into 10 long strings. This little bit of code might be useful

topic_labs <- apply(topic_labs$frex, 1, paste, collapse = "-")
topic_labs
##  [1] "procedures-procedure-human-women's-right-doctors-us"            
##  [2] "dc-evacuation-rate-hysterotomy-methods-dilation-extraction"     
##  [3] "procedures-procedure-human-women's-right-doctors-us"            
##  [4] "roe-v-wade-supreme-house-harkin-court"                          
##  [5] "procedures-procedure-human-women's-right-doctors-us"            
##  [6] "save-lives-preserve-legislation-living-follows-attempt"         
##  [7] "birth-act-partial-birth-latinas-second-trimester-hon-mention"   
##  [8] "hand-see-little-samuel-age-womb-nurse"                          
##  [9] "procedures-procedure-human-women's-right-doctors-us"            
## [10] "physical-term-partial-stories-endangered-medical-administration"

We might use those as column headers to remind ourselves what each topic is probably about.
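For example, something like this (theta_df is a name I’m introducing here):

theta_df <- as.data.frame(mod$theta) # documents by topics
colnames(theta_df) <- topic_labs     # FREX strings as reminders
head(theta_df[, 4, drop = FALSE])    # e.g. the roe-v-wade topic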

For putting in online appendices and showing to your co-authors, a plot is often more compelling

plot(mod, type = "labels", labeltype = "frex") # or prob, lift, score

You may need to pop out the window to make this big enough to view.

Identifying topicful documents

To get a better sense of a topic it’s sometimes helpful to look at a sample of documents that have a lot of their tokens assigned to it. Here we find documents (for us, paragraphs) that have a lot of topic 4.

findThoughts(mod, texts = as.character(para_corp), topics = 4)
## 
##  Topic 4: 
##       About a month ago we had a very enlightening debate on the Senate 
## floor over an important amendment offered to S. 3 by our colleague, 
## Senator Harkin. The amendment reaffirmed support for the Supreme 
## Court's decision in Roe v. Wade. The House Republican leadership 
## decided that the Senate did not have the wisdom, and their leadership 
## and their anti-choice friends removed Senator Harkin's language in 
## conference. Striping this bill of the Harkin amendment that reaffirms 
## Roe v. Wade shows us what the President and his anti-choice allies are 
## really after. They want to overturn Roe v. Wade. It has been said many 
## times. Unfortunately, this bill puts them on that path.
##      We all come here and say we know what Americans want. It is 
## interesting because, of course, we are trying to determine that. 
## Senator Sessions had a poll that said women in this country no longer 
## want the right to choose. That is what he said. I have a poll that 
## shows everyone in this country believes Roe is a fair balance and 
## should continue. But let me tell you what I think Americans want. Let 
## me tell you what I know Californians want. I don't speak for every 
## Californian. I couldn't. There are 35 million of us. But the vast 
## majority of us -- and we have had amazing polls on this point--want 
## American women protected. They want children protected. They want 
## privacy protected. They want women respected. They trust women more 
## than they trust Senators. They want us to do the right thing, and they 
## know what the right thing is.
##      This bill was stripped of the supportive language of Roe v. Wade that 
## this Senate passed twice -- not once but twice--saying that Roe v. Wade 
## should remain the law of the land. Oh, no, they were so radical in that conference committee, they kicked out that 
## very simple statement where most Americans agree that Roe v. Wade, 
## making this decision in the early stages of a pregnancy in private--
## Government stay out of it; Senator Boxer, I might think you are really 
## a good gal, but stay out of my private life. They are right. I don't 
## deserve to be in it.

This should be mostly about issues of constitutionality (though your random seed may disagree with mine, which is regrettably a thing that happens). Try a few others, but be aware that some of these are deliberately gruesome.
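For example, to pull out the two paragraphs with the most topic 8 (one of the more graphic ones):

findThoughts(mod, texts = as.character(para_corp), topics = 8, n = 2)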

Model checking

We would be remiss not to run some basic model diagnostics. Here are three, conveniently built into the package.

First, we’d prefer it if words were not exclusively generated by one topic (we’re trying to abstract up from the words after all)

checkBeta(mod)$problemWords # we'll look at the 'problemWords' field
## [[1]]
##      Word       Topic                          
## [1,] "No Words" "Load Exclusively Onto 1 Topic"

On the other hand we’d like a certain amount of exclusivity to the topics.
Exclusivity here measures the extent to which each topic has its own vocabulary, not used by the other topics

dotchart(exclusivity(mod), labels = 1:10)

We’d also like the topics to be coherent. If the topics are in fact semantically coherent, one thing we’d expect is that words a topic generates should co-occur often in the same document. One measure of this is

cohere <- semanticCoherence(mod, para_dfm)
dotchart(cohere, labels = 1:10)

and it looks like we might want to check topics 2 and 7. Typically, semantically ‘incoherent’ vocabulary consists of stop words (which we removed) and procedural vocabulary. Here it seems to be just the more general terms of the debate that everyone is happy to use.
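A compact way to eyeball both diagnostics at once is to plot them against each other; topics towards the bottom left deserve the most scrutiny. A quick sketch:

excl <- exclusivity(mod)
plot(cohere, excl, type = "n",
     xlab = "Semantic coherence", ylab = "Exclusivity")
text(cohere, excl, labels = 1:10) # label each point with its topic number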

Predicting topic prevalences

We may reasonably expect that different parties and different speakers would prefer to emphasise different sets of topics. We have two options for connecting speaker and party identifiers to topics.

First, we can manually extract the topic proportions (called \(\theta\) in the lectures) from the model and use parts of them as a dependent variable. The topic proportions for each document are inside the model object

head(mod$theta)
##             [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] 0.002025542 0.005055413 0.002025542 0.969221789 0.002025542 0.002024922
## [2,] 0.024850691 0.022996467 0.024850691 0.017263099 0.024850691 0.024976595
## [3,] 0.057541536 0.053512314 0.057541536 0.027687428 0.057541536 0.058104665
## [4,] 0.008369383 0.032210700 0.008369383 0.020745106 0.008369383 0.008362421
## [5,] 0.020288536 0.109409476 0.020288536 0.005016331 0.020288536 0.020453474
## [6,] 0.037574160 0.510241171 0.037574160 0.013238199 0.037574160 0.038452853
##             [,7]        [,8]        [,9]       [,10]
## [1,] 0.003238480 0.007778911 0.002025542 0.004578318
## [2,] 0.152153110 0.113430063 0.024850691 0.569777904
## [3,] 0.141812680 0.040076215 0.057541536 0.448640554
## [4,] 0.008968545 0.881099070 0.008369383 0.015136627
## [5,] 0.088519471 0.662662540 0.020288536 0.032784564
## [6,] 0.022002999 0.175019903 0.037574160 0.090748238

We can then tie these to the document variables and get a high level view of who talks about what. Who tends to use topic 4?

df <- data.frame(topic4 = mod$theta[,4], docvars(para_dfm))
head(df)
##        topic4 party speaker
## 1 0.969221789     R  ALLARD
## 2 0.017263099     R  ALLARD
## 3 0.027687428     R  ALLARD
## 4 0.020745106     R  ALLARD
## 5 0.005016331     R  ALLARD
## 6 0.013238199     R  ALLARD
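To actually answer the question, aggregate by speaker. A quick sketch using dplyr (loaded with the tidyverse):

df |>
  group_by(party, speaker) |>
  summarise(mean_topic4 = mean(topic4)) |>
  arrange(desc(mean_topic4))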

Adding covariates to a topic model

We can also fit the topic model taking into account that we think we know what factors might predict different topic prevalence. This time, let’s look at how party relates to topic prevalence.

We need to add two bits of information to the model specification: the document variables we think are predictive, and the functional form of the relationship between them and topics. The document variables come in as a data argument and the functional form is specified as the right hand side of a formula. To say that party affiliation predicts topic prevalence (‘prevalence’ is just the topic proportion) we say prevalence = ~ party, and fit the model as before

mod2 <- stm(para_dfm, K = 10,
            prevalence = ~ party, data = docvars(para_dfm),
            seed = 12345)

If we are interested in topics (constitutional) 4 and (gruesome) 8, we can focus in on their relationship to party like this:

gg <- estimateEffect(c(4,8) ~ party, mod2, docvars(para_dfm))
summary(gg)
## 
## Call:
## estimateEffect(formula = c(4, 8) ~ party, stmobj = mod2, metadata = docvars(para_dfm))
## 
## 
## Topic 4:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.118058   0.005841  20.212  < 2e-16 ***
## partyR      -0.065531   0.012214  -5.365 9.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 8:
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.044114   0.004981   8.857   <2e-16 ***
## partyR      0.150461   0.013139  11.452   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By changing the numbers in the ‘dependent variable’ part of the formula (the first argument) we can look at as many of the topics as we like.
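For example, to fit the regressions for all ten topics in one go:

gg_all <- estimateEffect(1:10 ~ party, mod2, docvars(para_dfm))
summary(gg_all, topics = c(4, 8)) # summarise whichever topics you like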

As always, a plot is nicer

plot(gg, "party", labeltype = "prob", model = mod2)

From this we can see that Republican speakers use topic 4 somewhat less than Democrats, but topic 8 much more.

If you’d like to know whether this is a general Republican thing or just a quirk of one or two speakers, adjust the model above to use speaker as the predictor and follow the steps above.
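If you do, here’s a minimal sketch of that adjustment (mod3 and gg_spk are names I’m introducing; with 23 speakers the regression output will be long, so the plot is probably the more digestible view):

mod3 <- stm(para_dfm, K = 10,
            prevalence = ~ speaker, data = docvars(para_dfm),
            seed = 12345)
gg_spk <- estimateEffect(c(4, 8) ~ speaker, mod3, docvars(para_dfm))
plot(gg_spk, "speaker", model = mod3) # point estimates per speaker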

Aside: controlling for things

In general you can use any specification you like in the formula, so you aren’t limited to a single covariate as we used here. In particular it’s often nice to ‘control’ for time in a smooth fashion. For example if you have a year variable and enough years you might use party + s(year) as the formula, which would allow topic prevalences to move smoothly over time, and the party ‘effect’ to be estimated against this background.
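Purely as a template (our debate happened on a single day, so there is no year variable here; some_dfm and its year docvar are hypothetical):

mod_time <- stm(some_dfm, K = 10,
                prevalence = ~ party + s(year), # smooth trend plus party shift
                data = docvars(some_dfm), seed = 12345)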

Note also that formulae need to know about nesting in data, so party/speaker would be the way to express that speakers are in one party only.
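With our data that specification would look something like this (a sketch; mod_nest is a name I’m introducing):

mod_nest <- stm(para_dfm, K = 10,
                prevalence = ~ party/speaker, # speakers nested within party
                data = docvars(para_dfm), seed = 12345)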