US Senate Speeches

Let’s take a look at a US Senate debate on partial birth abortion. As always, we’ll load the texts, make a corpus, then a document term matrix to start off

library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textstats)

library(ggrepel) # for labelling ggplots 

In this exercise we’ll continue to look at the US Debate, but this time from a scaling perspective. If these speakers were expressing their positions on a latent dimension using words, where would they all be?

We’ll use the pre-constructed data frame of speaker contributions and make a corpus out of it. Because we’ll be interested in speaker ideal points we will assume that all their contributions to the debate are always expressing the same point, so it’s reasonable bundle them all into one document, i.e. documents are speakers

load("data/corpus_us_debate_speaker.rda")
summary(corpus_us_debate_speaker)
## Corpus consisting of 23 documents, showing 23 documents:
## 
##        Text Types Tokens Sentences party    speaker
##      ALLARD   400   1165        53     R     ALLARD
##        BOND   129    232         9     R       BOND
##       BOXER  2231  18527       886     D      BOXER
##   BROWNBACK   646   2884       168     R  BROWNBACK
##     BUNNING   281    593        32     R    BUNNING
##    CANTWELL   395   1114        55     D   CANTWELL
##      DeWINE   463   1438        75     R     DeWINE
##    DOMENICI   203    479        27     R   DOMENICI
##      DURBIN   520   1874        89     D     DURBIN
##      ENSIGN   410   1235        66     R     ENSIGN
##    FEINGOLD   213    414        19     D   FEINGOLD
##   FEINSTEIN  1348   6112       284     D  FEINSTEIN
##       FRIST   222    501        25     R      FRIST
##      HARKIN   655   2612       145     D     HARKIN
##       HATCH   451   1173        60     R      HATCH
##  LAUTENBERG   678   2131       123     D LAUTENBERG
##    MIKULSKI   260    631        42     D   MIKULSKI
##       NELSO   155    328        17     D      NELSO
##     NICKLES   280    568        26     R    NICKLES
##        REID    38     59         4     D       REID
##    SANTORUM  1593  10507       458     R   SANTORUM
##    SESSIONS   653   2174       111     R   SESSIONS
##   VOINOVICH   186    363        19     R  VOINOVICH

corp <- corpus_us_debate_speaker # give it a shorter name!

Now we’ll make a document term matrix

# add 2 but remove 6
stops <- c(setdiff(stopwords(),  
                   c("her", "she", "hers", "he", "his", "him")),
           "bill", "can")

toks <- tokens(corp,  
               remove_punct = TRUE, 
               remove_symbols = TRUE,  
               remove_numbers = TRUE) |>
  tokens_remove(stops) |>
  tokens_tolower()
 
corpdfm <- dfm(toks)

Before we took care to remove a lot of things we though would be confusing.
Here let’s live dangerously and use all the words. (The models being fit are a lot more constrained than before). You should check these decisions don’t affect much.

One dimensional scaling

We’ll run a simple one dimensional scaling model, known to ‘wordfish’ to political scientists, and an association model to everyone else

wfish <- textmodel_wordfish(corpdfm, dir = c(3, 21))
summary(wfish)
## 
## Call:
## textmodel_wordfish.dfm(x = corpdfm, dir = c(3, 21))
## 
## Estimated Document Positions:
##               theta      se
## ALLARD      0.02165 0.06304
## BOND       -0.08693 0.12951
## BOXER      -1.05777 0.01098
## BROWNBACK   2.38270 0.05180
## BUNNING     0.81777 0.09908
## CANTWELL   -1.25635 0.03847
## DeWINE      1.38729 0.06900
## DOMENICI    0.32865 0.10463
## DURBIN     -0.25970 0.04686
## ENSIGN      0.83659 0.07612
## FEINGOLD   -1.20438 0.06369
## FEINSTEIN  -1.59354 0.01318
## FRIST       0.83881 0.11169
## HARKIN     -0.79095 0.03345
## HATCH       0.97740 0.07396
## LAUTENBERG -0.65712 0.03848
## MIKULSKI   -1.22626 0.05053
## NELSO      -1.10229 0.07946
## NICKLES     0.13604 0.08712
## REID        0.20834 0.27792
## SANTORUM    0.48882 0.02367
## SESSIONS    0.62232 0.05214
## VOINOVICH   0.18892 0.11350
## 
## Estimated Feature Scores:
##           mr president  commend senator pennsylvania santorum   frist
## beta 0.06468   0.05462 -0.02596 0.06741      -0.1341   0.1254  0.4459
## psi  0.87009   1.50382 -2.04487 1.85843      -0.6193  -0.4688 -1.7535
##      leadership particular  issue  worked extremely    hard    also presiding
## beta    0.01234    -0.3105 0.2177  0.1999  -0.04553  0.3747 -0.1087   -0.2618
## psi    -0.93473    -0.3830 1.0103 -3.1011  -1.76368 -2.0193  0.6330   -2.1476
##      officer    his rights  unborn  pleased cosponsor partial-birth abortion
## beta -0.2618 0.5470 0.3719  0.3750 -0.06108  0.003728        0.2335  -0.1562
## psi  -2.1476 0.9393 0.8146 -0.6331 -1.20952 -1.748114        1.6869   2.7850
##          act       s legislation designed    help protect unnecessary
## beta -0.4283 -0.8522     -0.5748 -0.07312  0.5454 -0.2535     -0.2491
## psi   0.4152 -0.1480      1.1859 -0.67491 -0.8843  0.7095     -0.8415

the dir argument sets the meaning of left and right. We are going to assert that the 3rd speaker Barbara Boxer, is to the left of the 21st speaker Rick Santorum. We don’t make any claims about how far to the left, or about any of the other speakers. Why to we do this?

A note on identification

The model we are fitting says that each association is modelled by the multiplication of a speaker position \(\theta\) and a word position \(\beta\). So if they are both positive or both negative we see the speaker using that word more. However, we don’t specify either \(\theta\) or \(\beta\) values so there are always two model that fit identically: the one with \(\theta_i\beta_j\) and the one with \((-\theta_i)(-\beta_j)\). The second has all the sorts the speakers the opposite way to the first, but we can’t tell because it also sorts the word positions the other way to make up for it. Since we’d like to be interpretable in substantive terms, we can set a left and right speaker to orient the model, without adding any real information to it.

It doesn’t really matter if you do this step or not. We could always do a bit of multiplication after the model is fit. Just don’t be surprised when people you expect on one side end up on the other. For non-ideological contexts i.e. when the dimension doesn’t have a natural direction, this whole thing will matter even less.

The fitted model

The fitted model contains all the components you saw in the lectures, organized in two blocks

wf_coef <- coef(wfish)
names(wf_coef)
## [1] "documents" "features"
head(wf_coef$documents) # speaker parameters 
##                 theta       alpha
## ALLARD     0.02165274  0.21348639
## BOND      -0.08693357 -1.29227632
## BOXER     -1.05777079  2.76206193
## BROWNBACK  2.38269597  0.12975044
## BUNNING    0.81776695 -0.48868033
## CANTWELL  -1.25634814 -0.07911048
head(wf_coef$features) # word parameters
##                     beta        psi
## mr            0.06468449  0.8700898
## president     0.05462191  1.5038171
## commend      -0.02595502 -2.0448676
## senator       0.06740927  1.8584309
## pennsylvania -0.13413832 -0.6192956
## santorum      0.12538330 -0.4688482

Positions for speakers

The estimated speaker positions can be extracted as if this were a regular fitted regression model. It’s nice to get some uncertainty around these positions too

preds  <- predict(wfish, interval = "confidence")
preds
## $fit
##                    fit         lwr        upr
## ALLARD      0.02165274 -0.10190180  0.1452073
## BOND       -0.08693357 -0.34076934  0.1669022
## BOXER      -1.05777079 -1.07929641 -1.0362452
## BROWNBACK   2.38269597  2.28117382  2.4842181
## BUNNING     0.81776695  0.62356627  1.0119676
## CANTWELL   -1.25634814 -1.33174776 -1.1809485
## DeWINE      1.38728909  1.25204737  1.5225308
## DOMENICI    0.32865152  0.12357508  0.5337280
## DURBIN     -0.25969664 -0.35154658 -0.1678467
## ENSIGN      0.83658945  0.68738863  0.9857903
## FEINGOLD   -1.20437761 -1.32919832 -1.0795569
## FEINSTEIN  -1.59354288 -1.61938317 -1.5677026
## FRIST       0.83880779  0.61988997  1.0577256
## HARKIN     -0.79094571 -0.85650252 -0.7253889
## HATCH       0.97739715  0.83244576  1.1223485
## LAUTENBERG -0.65712119 -0.73253786 -0.5817045
## MIKULSKI   -1.22625771 -1.32530022 -1.1272152
## NELSO      -1.10229306 -1.25803199 -0.9465541
## NICKLES     0.13603646 -0.03471560  0.3067885
## REID        0.20834388 -0.33636489  0.7530527
## SANTORUM    0.48881739  0.44242432  0.5352105
## SESSIONS    0.62231611  0.52012750  0.7245047
## VOINOVICH   0.18892279 -0.03352536  0.4113709

As in the topic model case, we need to work a bit to extract the (only) element in preds called ‘fit’ to get actual table of positions and associated uncertainty. When we’ve dug it out, we’ll attach the docvars to keep everything together

# make a data frame from fit matrix and the document variables
ests <- data.frame(preds$fit)
speaker_pos <- data.frame(docvars(corpdfm), ests) |>
  arrange(fit) # sort left (negative) to right (positive)
head(speaker_pos)
##           party   speaker       fit       lwr        upr
## FEINSTEIN     D FEINSTEIN -1.593543 -1.619383 -1.5677026
## CANTWELL      D  CANTWELL -1.256348 -1.331748 -1.1809485
## MIKULSKI      D  MIKULSKI -1.226258 -1.325300 -1.1272152
## FEINGOLD      D  FEINGOLD -1.204378 -1.329198 -1.0795569
## NELSO         D     NELSO -1.102293 -1.258032 -0.9465541
## BOXER         D     BOXER -1.057771 -1.079296 -1.0362452

Let’s plot these positions. I’ll use ggplot2

N <- nrow(speaker_pos)
speaker_names <- speaker_pos$speaker

ggplot(speaker_pos, aes(x = fit, y = 1:N, col = party)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = speaker_names, breaks = 1:N) +
  labs(x = "Position", y = "Speaker") +
  ggtitle("Estimated Positions from Senate Partial-Birth Abortion Debate",
          subtitle = "October 21st, 2003")

Positions for words

Let’s extract the word positions for later:

word_pos <- data.frame(word = wfish$features,
                       position = wfish$beta,
                       offset = wfish$psi)

We can get some idea of the nature of the debate language by looking for words we think ought to be charged. Take a moment to make some predictions about the scores of these:

testwords <- c("life", "choice", "womb", "her", "woman", "health",
               "born", "baby", "little", "gruesome", "kill", "roe",
               "wade", "medical", "her", "his", "child", "religion",
               "catholic", "doctor", "nurse")

Now let’s see

testscores <- word_pos |>
  filter(word %in% testwords) |>
  arrange(position)

testscores[,1:2] # just word and score columns
##        word    position
## 1  catholic -0.87235748
## 2  religion -0.87235748
## 3    health -0.67988781
## 4     woman -0.67535054
## 5    choice -0.46990160
## 6   medical -0.45263277
## 7       roe -0.43295110
## 8      wade -0.31416752
## 9       her -0.18386806
## 10   doctor -0.01346807
## 11     life  0.18469327
## 12     baby  0.48621181
## 13     born  0.50502110
## 14      his  0.54704842
## 15   little  0.57050720
## 16     kill  0.63760060
## 17 gruesome  0.66857723
## 18    child  0.68572689
## 19    nurse  0.71239098
## 20     womb  1.10874972

If we were Slapin and Proksch we might put all this on one of their Eiffel Tower diagrams…

ggplot(word_pos, aes(position, offset, label = word)) +
  geom_point(color = "grey", alpha = 0.2) +
  geom_text_repel(data = testscores, col = "black") +
  geom_point(data = testscores) +
  labs(x = "Word position", y = "Offset parameter") +
  ggtitle("Estimated word positions for selected debate vocabulary",
          subtitle = "Note: Offset parameter is roughly proportional to word frequency")

One thing to take away from this is that the more frequent the word (higher on the y axis) the less likely it is to express an extreme position. None of our test words were particularly extreme, so you might want to take a look at those unlabelled extremes. You can do that bby puling all the word positions and sorting them in order of beta, then looking at the lowest and highest, e.g. using head or tail.

Model checking

If we were being thorough about these words we’d check they do what we think they do by looking by looking at them in all their contexts, as we did before. This one is somewhat gruesome, so I’ll let you run it yourself, or you can choose another word of interest.

kwic(tokens(corp), "baby", window = 15)

Multidimensional models

To fit a more than one dimensional scaling model we need to switch models. We’ll use correspondence analysis (CA) for efficiency, which quanteda wraps up into the function textmodel_ca. And just as textmodel_lr actually called functions from {glmnet}, textmodel_ca is a light wrapper around the {ca}, the venerable correspondence analysis package. Let’s take it for a spin with two dimensions

mod2 <- textmodel_ca(corpdfm, nd = 2)

Speaker and word positions with CA

Let’s see what we get from this model

ca_coefs <- coef(mod2)
names(ca_coefs)
## [1] "coef_feature"     "coef_feature_se"  "coef_document"    "coef_document_se"

This looks promising, until we realize that we’re only getting the first dimension from coef_document and coef_feature

ca_coefs$coef_document
##      ALLARD        BOND       BOXER   BROWNBACK     BUNNING    CANTWELL 
## -0.26237636 -0.18086402  0.60798719 -3.05908611 -0.90092468  0.84585647 
##      DeWINE    DOMENICI      DURBIN      ENSIGN    FEINGOLD   FEINSTEIN 
## -1.46656126 -0.47811240 -0.03301386 -0.94315418  0.77460379  1.19487689 
##       FRIST      HARKIN       HATCH  LAUTENBERG    MIKULSKI       NELSO 
## -0.95364604  0.38041003 -1.13512695  0.24535845  0.82459942  0.69317035 
##     NICKLES        REID    SANTORUM    SESSIONS   VOINOVICH 
## -0.35322982 -0.42473104 -0.65283838 -0.76297152 -0.37723043

But since we’re here, let’s compare these positions to the association model ones we saw earlier. Happily the speaker order in exactly the same way (you can check that), so we can simply correlate them

cor(ca_coefs$coef_document, ests$fit)
## [1] -0.9887326

And here you’ll see that a) they agree very closely, but b) they’re in the opposite direction, because we didn’t force a direction for theta and the fitting process chose the other way. Hence if we want to interpret these then we should either multiple theta and beta by -1 or just practice our lateral thinking skills…

The other thing that’s noticeable on inspection is that there are no measures of uncertainty around anything.

ca_coefs$coef_document_se # seems rather rude 
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

These are possible to compute in CA (although not in this package), but a bit more expensive. Ask if you want to know how.

No really. The positions, please

Let’s instead make use of the fact that the fitted model is a {ca} model and use the package’s own functions. The speakers are rows and the words are columns, so the package offers us a general purpose way to get row coordinates and column coordinates for arbitrary dimensions. We fit 2 so let’s get speaker positions in the first two dimensions

library(ca) # this comes with quanteda.textmodels

speaker_pos2 <- cacoord(mod2, dim = 1:2, rows = TRUE)
speaker_pos2
##                   Dim1       Dim2
## ALLARD     -0.26237636 -1.3915416
## BOND       -0.18086402 -1.3675989
## BOXER       0.60798719  0.5297288
## BROWNBACK  -3.05908611  2.8875633
## BUNNING    -0.90092468 -1.6842824
## CANTWELL    0.84585647  0.3283115
## DeWINE     -1.46656126 -1.3136832
## DOMENICI   -0.47811240 -1.1858946
## DURBIN     -0.03301386  0.1726151
## ENSIGN     -0.94315418 -0.6092284
## FEINGOLD    0.77460379  0.1559779
## FEINSTEIN   1.19487689  0.1530177
## FRIST      -0.95364604 -3.2332382
## HARKIN      0.38041003  0.3028870
## HATCH      -1.13512695 -1.6493573
## LAUTENBERG  0.24535845  0.2934741
## MIKULSKI    0.82459942  0.2156859
## NELSO       0.69317035 -0.1825432
## NICKLES    -0.35322982 -2.1983291
## REID       -0.42473104 -0.6159025
## SANTORUM   -0.65283838 -0.6556323
## SESSIONS   -0.76297152 -1.3123759
## VOINOVICH  -0.37723043 -1.3371385

These are, of course, better plotted. We can either extract do this ourselves:

sp_pos <- data.frame(speaker_pos2, 
                     speaker = rownames(speaker_pos2)) # add the speaker as a column
ggplot(sp_pos, aes(Dim1, Dim2)) + 
  geom_point() + 
  geom_text_repel(aes(label = speaker))

or just use the rather comprehensive plot function provided by {ca}. Since there are a lot of words, we could use this function to plot the speakers only

plot(mod2, what = c("all", "none"))

(Remembering that this model has dimension 1 flipped).

If you want the words and not the speakers, change what to be c("none", "all"), and get ready to make a really busy plot…

If you want both then leave out what altogether. The only downside to this function is that we’d probably like to flip the sign of at least the first axis when we explain this model to people, but there isn’t an easy way to do it. If we take back control of plotting, we can, but then it’s more work.

Interpretation

But already we can see that from this plot that Brownback is not just extreme on the first dimension but talks quite differently to all the other speakers. What’s happening here? (Maybe we should take a look at Frist too)

Let’s see what Brownback talks about

bb <- textstat_keyness(corpdfm, target = "BROWNBACK")
head(bb, 20)
##             feature      chi2            p n_target n_reference
## 1            samuel 396.36357 0.000000e+00       22           1
## 2              hand 367.35657 0.000000e+00       26           8
## 3              womb 280.15762 0.000000e+00       26          17
## 4            person 188.46923 0.000000e+00       18          12
## 5                he 151.74913 0.000000e+00       33          78
## 6          property 140.87980 0.000000e+00        9           1
## 7             child 127.78792 0.000000e+00       32          87
## 8           surgery  99.71526 0.000000e+00       11           9
## 9             utero  83.43004 0.000000e+00        6           1
## 10        surgeries  79.48003 0.000000e+00        5           0
## 11        beautiful  59.90493 9.992007e-15        4           0
## 12     photographer  59.90493 9.992007e-15        4           0
## 13          testify  59.90493 9.992007e-15        4           0
## 14              usa  59.90493 9.992007e-15        4           0
## 15           unique  46.54580 8.950174e-12        4           1
## 16          amongst  40.51390 1.952226e-10        3           0
## 17           famous  40.51390 1.952226e-10        3           0
## 18           needle  40.51390 1.952226e-10        3           0
## 19 responsibilities  40.51390 1.952226e-10        3           0
## 20           sacred  40.51390 1.952226e-10        3           0

These are the distinctive parts of his vocabulary, and we can check how they land in the second dimension

# extract word positions
word_pos2 <- cacoord(mod2, dim = 1:2, cols = TRUE)
bb_words <- data.frame(word_pos2, word = rownames(word_pos2)) |>
  filter(word %in% bb$feature[1:20]) |>
  arrange(Dim2)
bb_words
##                       Dim1      Dim2             word
## child            -1.800534 0.5388316            child
## he               -1.629173 1.1061740               he
## womb             -3.244048 2.5033597             womb
## person           -3.061756 2.8073656           person
## surgery          -2.125780 3.1464641          surgery
## hand             -3.920159 3.7798270             hand
## utero            -4.480888 3.9808858            utero
## unique           -3.689256 4.1618744           unique
## property         -4.535972 4.5031753         property
## samuel           -4.666632 4.9507920           samuel
## surgeries        -4.923228 5.1330251        surgeries
## testify          -4.923228 5.1330251          testify
## beautiful        -4.923228 5.1330251        beautiful
## photographer     -4.923228 5.1330251     photographer
## usa              -4.923228 5.1330251              usa
## amongst          -4.923228 5.1330251          amongst
## needle           -4.923228 5.1330251           needle
## famous           -4.923228 5.1330251           famous
## responsibilities -4.923228 5.1330251 responsibilities
## sacred           -4.923228 5.1330251           sacred

After this cursory glance it seems that Brownback is his own ‘dimension’, or rather, his distinctive speech has been interpreted as one pole of a dimension by this model that is looking for dimensions.

How do things look if we take him out and refit the model?

corp_nobb <- corpus_subset(corp, speaker != "BROWNBACK")

toks_nobb <- tokens(corp_nobb,  
                    remove_punct = TRUE, 
                    remove_symbols = TRUE,  
                    remove_numbers = TRUE) |>
  tokens_remove(stops) 

mod2_nobb <- textmodel_ca(dfm(toks_nobb), nd = 2)

Let’s do a quick comparison

pos <- data.frame(cacoord(mod2, dim = 1:2, rows = TRUE))
pos_nobb <- data.frame(cacoord(mod2_nobb, dim = 1:2, rows = TRUE))

# connect the positions from both runs of the model together
pos_both <- merge(pos, pos_nobb, by = "row.names", 
                  suffixes = c("",".nobb"))
head(pos_both)
##   Row.names       Dim1       Dim2  Dim1.nobb  Dim2.nobb
## 1    ALLARD -0.2623764 -1.3915416 -0.8893293 -0.9578887
## 2      BOND -0.1808640 -1.3675989 -0.7678698 -1.4052542
## 3     BOXER  0.6079872  0.5297288  0.6945660  0.5502065
## 4   BUNNING -0.9009247 -1.6842824 -1.6493941 -0.4302980
## 5  CANTWELL  0.8458565  0.3283115  0.8732419 -0.3398072
## 6    DeWINE -1.4665613 -1.3136832 -2.1651482  1.7980709

# does Brownback affect the first dimension?
cor(pos_both$Dim1, pos_both$Dim1.nobb)
## [1] 0.9775541

# how about the second?
cor(pos_both$Dim2, pos_both$Dim2.nobb)
## [1] 0.6380843

These correlation suggest that removing Brownback doesn’t change the first dimension positions much at all (which is kind of reassuring), but does rearrange speakers on the second.

It’s quite possible that this second ‘dimension’ is in general just idiosyncratic topics that speakers like and which generate variation in word counts without being really ‘dimensional’. If that’s true then Brownback is not really a special case, and we should put him back in and stop trying to figure out what the second dimension ‘really’ is. But we’d need to dig in further than this to establish that conclusion.