library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(plotROC) # for evaluating models
To show how document classification works, we’ll use a corpus of movie reviews. Not super interesting from a social science perspective perhaps, but intuitive enough to get the hang of things.
corp_movies <- data_corpus_moviereviews
head(docvars(corp_movies))
## sentiment id1 id2
## 1 neg cv000 29416
## 2 neg cv001 19502
## 3 neg cv002 17424
## 4 neg cv003 12683
## 5 neg cv004 12641
## 6 neg cv005 29357
Most reviews are fairly long, but here’s a pithy one:
as.character(corp_movies)[507]
## cv506_17521.txt
## "this film is extraordinarily horrendous and i'm not going to waste any more words on it . "
The corpus contains 2000 reviews: the first 1000 have ‘topic’ negative and the remaining 1000 ‘topic’ positive. In realistic situations we won’t know all the topic labels, so we’ll train some classifiers on a sample and see how well they do on the rest.
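A quick check of that claim:
table(docvars(corp_movies, "sentiment")) # 1000 negative, 1000 positive reviews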
First we pull out a random half of the negative reviews and a random half of the positive reviews to make a balanced training corpus, and then we make a test corpus containing all the other documents:
corp_movies$id_numeric <- 1:ndoc(corp_movies) # add a docvar to the main corpus
# randomize the training set
set.seed(800)
id_train <- c(sample(1:1000, 500), sample(1001:2000, 500))
id_test <- setdiff(1:2000, id_train)
# make two corpora
train_corp <- corpus_subset(corp_movies, id_numeric %in% id_train)
test_corp <- corpus_subset(corp_movies, id_numeric %in% id_test)
One programming feature here worth noting: if we want to assign to or get the value of a single docvar we can (in this latest version of quanteda) either
docvars(corp_movies, "id_numeric") # get a docvar
docvars(corp_movies, "id_numeric") <- 1:ndoc(corp_movies) # assign/overwrite a docvar
or kind of pretend that our dfm or corpus is a list and use $, like this:
corp_movies$id_numeric # get a docvar
corp_movies$id_numeric <- 1:ndoc(corp_movies) # assign/overwrite a docvar
This is sometimes convenient and feels natural for corpora, but don’t forget that the feature names of some dfm d are featnames(d).
Next we turn the training corpus into a document-feature matrix: tokenize, drop stopwords, lowercase, count, and trim rare features. I’ll do this in a pipe:
train_dfm <- tokens(train_corp, remove_punct = TRUE, remove_symbols = TRUE,
remove_numbers = TRUE, remove_url = TRUE) |>
# take out stopwords
tokens_remove(stopwords()) |>
# make sure it's all lowercase
tokens_tolower() |>
# count those tokens
dfm() |>
# remove columns of tokens that didn't occur 'often enough'
dfm_trim(min_termfreq = 5)
Now we need the test data, which is just every document not in the training set. It’s very much the same process, but we won’t need to worry too much about stripping things out, as we’ll force it to have the same vocabulary set as the training data (else how would we do prediction?)
test_dfm <- tokens(test_corp, remove_punct = TRUE, remove_symbols = TRUE,
remove_numbers = TRUE, remove_url = TRUE) |>
# make sure it's all lowercase
tokens_tolower() |>
# count everything, which is fine because
dfm() |>
# this drops out everything new and zeros everything we didn't see
# in the training data
dfm_match(featnames(train_dfm))
Don’t skip the dfm_match stage. Zipf’s Law makes it almost inevitable that there will be words in the test documents that never turned up in the training documents, and vice versa.
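To see the scale of the problem, here’s a quick, purely illustrative count of how many word types in the test documents never appear among the training features (the exact number depends on your preprocessing choices):
# word types in the test corpus, processed roughly as the training data was
test_types <- featnames(dfm(tokens_tolower(tokens(test_corp, remove_punct = TRUE))))
# how many of them are missing from the training dfm's vocabulary?
length(setdiff(test_types, featnames(train_dfm)))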
Let’s fit the model. Inside textmodel_lr, functions from {glmnet} are doing their thing, so if you want to tweak the settings those are the help pages to look at. We’ll run on the defaults. But first, some brief background on the model underneath.
In the slides we talked about the importance of regularization in logistic regression approaches to classification when there are many more words (variables) than documents (cases). We noted that we need to constrain the coefficients somehow to make it possible to fit the model. Specifically, we can ask either for ‘small’ coefficients (an L2 penalty) or for ‘few non-zero’ coefficients (an L1 penalty). The glmnet model can combine the two, penalizing a weighted mixture of the ‘small’ constraint and the ‘few’ constraint (the ‘elastic net’). The overall strength of that penalty is the lambda that will turn up in the summary (tl;dr: the larger the lambda, the more heavily the coefficients are shrunk and the more of them end up exactly zero); the mix between ‘small’ and ‘few’ is controlled by a separate parameter, alpha. You can read more about it here.
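For a feel of the machinery underneath, here is a minimal sketch (an illustration, not a reproduction of textmodel_lr’s internals) that calls {glmnet} directly on the dfm, which is already a sparse matrix; alpha is the L1/L2 mixing parameter and lambda is chosen by cross-validation:
library(glmnet)
# a dfm is already a sparse matrix, so cv.glmnet can work with it directly
x <- as(train_dfm, "dgCMatrix")
y <- as.factor(train_dfm$sentiment)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1) # alpha = 1: pure lasso penalty
cv_fit$lambda.min # penalty strength that minimizes cross-validated error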
Since textmodel_lr
is a regression model we have to give it a matrix of covariates - here our document term matrix - and a corresponding vector of topic labels. (R would normally offer a formula interface for regression models, but that would be very tiring to type out). We’ll pull the dependent variable - the topic labels - directly from the dfm using $
and the fiction that it is a list of its own docvars:
tmod1 <- textmodel_lr(train_dfm, train_dfm$sentiment)
summary(tmod1)
##
## Call:
## textmodel_lr.dfm(x = train_dfm, y = train_dfm$sentiment)
##
## Lambda Min:
## [1] 0.01284
##
## Lambda 1se:
## [1] 0.01698
##
## Estimated Feature Scores:
## (Intercept) plot two teen couples go church party drink drive get
## pos -0.3038 -0.07566 0 0 0 0 0 0 0 0 0
## accident one guys dies girlfriend continues see life nightmares deal watch
## pos 0 0 0 0 0 0 0 0.111 0 0 0
## movie find critique generation touches cool idea presents
## pos 0 0 0 0 0 0 0 0
You’ll see immediately that a lot of these coefficients are set to zero. Let’s take a look at the ones that aren’t, since they are the ones that define the decision boundary.
Confusingly, but understandably, when we ask for the coefficients we get back a sparse matrix; there may be a lot of them, but we expect many or most to be zero. So we have to work a bit to get them out. Since we’re doing a two-way classification, I’ll sort them in order of size, so we can see which words make a review more likely to be predicted negative and which more likely to be predicted positive.
tmod1_coefs <- coef(tmod1)
dim(tmod1_coefs) # hmm, it's basically one skinny matrix / vector
## [1] 9685 1
# so we'll make it something easier to deal with
tmp <- data.frame(word = rownames(tmod1_coefs),
coef = as.vector(tmod1_coefs)) |>
filter(coef != 0) |>
arrange(coef)
head(tmp)
## word coef
## 1 wasted -1.0866183
## 2 mess -0.8813956
## 3 ridiculous -0.8778457
## 4 beard -0.7762680
## 5 bland -0.6597092
## 6 potentially -0.6257308
tail(tmp)
## word coef
## 319 horrific 0.6116408
## 320 pantoliano 0.6716004
## 321 bet 0.6979838
## 322 spoil 0.7386362
## 323 luckily 0.8747745
## 324 gently 0.8764630
One thing to take away from this is that ‘bad’ words predict a negative review, but may also predict a positive one (horror movies, I guess…) so a sentiment dictionary approach might not be as good as we expect in this domain. Also, someone please tell me what a “pantoliano” is.
Is this thing any good at distinguishing positive from negative? A minimal requirement is that it should be reasonable at recovering the topic labels it was trained on. A more useful requirement is that when given the test set, it should be reasonable at recovering those topics too. In real applications we won’t know them, but here we can check.
First let’s construct the confusion matrix for the training sample. This is just a cross-tabulation of the model’s classifications against the correct ones. We’ll get the model’s classifications ‘in sample’ by running the predict function without any extra arguments:
training_preds <- predict(tmod1)
table(model = training_preds, true = train_dfm$sentiment)
## true
## model neg pos
## neg 494 8
## pos 6 492
Pretty good! But not yet reassuring about how it will behave on new documents. Let’s check that now. We’ll add a newdata argument to predict to offer it the test_dfm:
test_preds <- predict(tmod1, newdata = test_dfm)
testconf <- table(model = test_preds, true = test_dfm$sentiment)
testconf
## true
## model neg pos
## neg 395 89
## pos 105 411
Eh, not quite so impressive. ‘Accuracy’ - the sum of the diagonal over the total number of cases - is generally not so useful (it’s 0.806 here), although with our artificially balanced classes it’s not too misleading. In general we’ll want to compute precision and recall.
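For the record, that accuracy figure is just:
sum(diag(testconf)) / sum(testconf) # correct classifications over all classifications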
Remember that precision for, say, the positive topic is the probability that the model is correct when it says a review is positive. And recall for the positive topic is the probability that a truly positive review is predicted to be positive. The easy way to compute all this at once is to normalize either the rows or the columns of the confusion matrix:
precision_tab <- prop.table(testconf, 1) # make rows (1) sum to 1 => precision
precision_tab
## true
## model neg pos
## neg 0.8161157 0.1838843
## pos 0.2034884 0.7965116
recall_tab <- prop.table(testconf, 2) # make columns (2) sum to 1 => recall
recall_tab
## true
## model neg pos
## neg 0.790 0.178
## pos 0.210 0.822
# precision for each category
diag(precision_tab)
## neg pos
## 0.8161157 0.7965116
# recall for each category
diag(recall_tab)
## neg pos
## 0.790 0.822
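As a sanity check, the same numbers for the positive topic can be computed directly from the confusion matrix cells:
# precision for 'pos': of the reviews the model called positive, how many really were?
testconf["pos", "pos"] / sum(testconf["pos", ])
# recall for 'pos': of the truly positive reviews, how many did the model find?
testconf["pos", "pos"] / sum(testconf[, "pos"])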
The classifications you see here are the result of thresholding the model’s estimated probability that a review is positive at 0.5. That might be fine if precision is as important as recall, or if we didn’t really care how confident the model was. But we might, for example, be less happy to call a review positive if the model thought the probability of it being positive was 0.51 than if it was 0.96.
Let’s pull out the underlying probabilities using the predict function, so we can work with them directly
test_probs <- predict(tmod1, newdata = test_dfm, type = "probability")
test_probs <- as.data.frame(test_probs) # traditionally but annoyingly predict returns a matrix
head(test_probs)
## pos neg
## cv001_19502.txt 0.21555467 0.7844453
## cv002_17424.txt 0.05961360 0.9403864
## cv004_12641.txt 0.18710670 0.8128933
## cv005_29357.txt 0.22498904 0.7750110
## cv008_29326.txt 0.75046682 0.2495332
## cv009_29417.txt 0.03531478 0.9646852
What sorts of predictions does this model make? Let’s look at the positive ones (the negative ones have the same distribution but flipped from left to right, obvs)
ggplot(test_probs, aes(x = pos)) +
geom_density(fill = "grey") +
labs(x = "Predicted probability of being a positive review")
So quite some variation in confidence. Could we do ‘better’ by withholding judgment on the more uncertain calls? For example, we might choose those to look at manually. Let’s check:
# for which reviews are we more than 80% sure one way or the other?
surer_ones <- which(test_probs[,"pos"] > 0.8 | test_probs$neg > 0.8)
# new confident results by subsetting the predictions and the true topics
surer_testconf <- table(model = test_preds[surer_ones],
true = test_dfm$sentiment[surer_ones])
surer_testconf
## true
## model neg pos
## neg 193 11
## pos 9 170
diag(prop.table(surer_testconf, 1)) # precision
## neg pos
## 0.9460784 0.9497207
diag(prop.table(surer_testconf, 2)) # recall
## neg pos
## 0.9554455 0.9392265
Cool. Unfortunately that’s only using 0.383 of the predictions.
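That figure is just the share of test documents that cleared the 0.8 bar:
length(surer_ones) / nrow(test_probs) # proportion of test reviews we were 'sure' about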
But maybe being uncertain is ok? A little caution from our machines is always welcome. This introduces a second evaluation consideration (closely related to discussions of fairness in the algorithmic bias literature) which is calibration. This model is ‘calibrated’ if, of all the reviews that the model assigns probability 0.7 of being positive, about 70% are in fact positive.
Our predicted probabilities go down to several decimal places, so we’ll need to bundle them up a bit to compare. Let’s do a quick check by putting the model’s predictions into 10 bins of width 0.1, e.g. one bin for reviews whose ‘positive’ prediction is between 0.2 and 0.3, and checking each such bin.
# these are the *midpoints* of the intervals
labels <- seq(0.05, 1, by = 0.1)
cal <- data.frame(pred_prob = test_probs$pos, # predicted probability of being positive
truth = test_dfm$sentiment) |> # whether it was actually positive
# divide the reviews into groups according to their predictive probabilities
# using 'cut' and providing our own labels
mutate(bin = cut(pred_prob, breaks = 10, labels = labels)) |>
# bundle together similar predictions
# .drop = FALSE means even if there are no predictions in a bin, keep the bin
group_by(bin, .drop = FALSE) |>
# within each bin, what proportion were really positive?
summarize(prop_pos = mean(truth == "pos"))
cal
## # A tibble: 10 x 2
## bin prop_pos
## <fct> <dbl>
## 1 0.05 0.0202
## 2 0.15 0.0857
## 3 0.25 0.147
## 4 0.35 0.252
## 5 0.45 0.402
## 6 0.55 0.587
## 7 0.65 0.713
## 8 0.75 0.853
## 9 0.85 0.917
## 10 0.95 0.988
Let’s visualize how we did
# make sure we can plot the midpoints properly (bin is now a factor, so we'll
# use the midpoints we made)
cal <- mutate(cal, midpoint = labels)
ggplot(cal, aes(midpoint, prop_pos)) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
geom_point() +
geom_line() +
lims(x = c(0,1), y = c(0,1)) +
labs(x = "Midpoint of predicted probabilities (positive)",
y = "Proportion of reviews positive")
Looks like this classifier is a little bit overconfident, but not too bad. With perfect calibration every point would lie on the diagonal. A more neurotic classifier would flip this pattern over the diagonal and put predictions too close to 0.5.
Classifiers vary according to their precision, recall, and calibration, and not all of these may matter equally for your applications.
How might we improve this classifier? Well, there are basically three strategies:
We’ll consider the first option in class, and we’re mostly stuck with the words we’ve got. However, we can try to remove useless words or identify ones that do the same work, e.g. by stemming. Slightly less crudely, though still heuristically, we can downweight words we think won’t be useful. That leads to the cryptically named dfm preprocessing step called “tfidf”, which weights each term count (tf = term frequency) by the log of the inverse of the proportion of reviews that term occurs in (idf = inverse document frequency).
You can play with all those knobs and dials for yourself now…
Since quanteda has a function for that (dfm_tfidf), let’s try it and compare results.
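Here is a minimal sketch of how that comparison might go, reusing train_dfm and test_dfm from above with dfm_tfidf’s default settings (strictly, the idf weights for the test set should come from the training documents; weighting each dfm separately is a convenient shortcut):
# tf-idf weight both dfms (shortcut: each computes its own idf weights)
train_dfm_tfidf <- dfm_tfidf(train_dfm)
test_dfm_tfidf <- dfm_tfidf(test_dfm)
# refit the regularized logistic regression on the weighted counts
tmod_tfidf <- textmodel_lr(train_dfm_tfidf, train_dfm_tfidf$sentiment)
# and see whether the test set confusion matrix improves
table(model = predict(tmod_tfidf, newdata = test_dfm_tfidf),
true = test_dfm_tfidf$sentiment)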
In the lectures we talked more about Naive Bayes, so let’s see how it fares on this task. Happily, the quanteda functions work more or less the same way as textmodel_lr. I’ll suffix all of this model’s objects with _nb so we don’t get them muddled with the output of the previous model. Most of the code below is the same as above, just without the comment elaboration.
train_dfm_nb <- tokens(train_corp, remove_punct = TRUE, remove_symbols = TRUE,
remove_numbers = TRUE, remove_url = TRUE) |>
tokens_remove(stopwords()) |>
tokens_tolower() |>
dfm() |>
dfm_trim(min_termfreq = 5)
test_dfm_nb <- tokens(test_corp, remove_punct = TRUE, remove_symbols = TRUE,
remove_numbers = TRUE, remove_url = TRUE) |>
tokens_tolower() |>
dfm() |>
dfm_match(featnames(train_dfm_nb))
tmod2 <- textmodel_nb(train_dfm_nb, train_dfm_nb$sentiment)
Let’s take a look at the coefficients. We know from the lecture how this model works, so it won’t be surprising to find that the coefficients are just the likelihoods of each word being generated by the positive and the negative topic. As before, let’s pull them out and sort them to see which words make for good and bad reviews.
nb_coefs <- coef(tmod2) |>
as.data.frame() |>
# construct the ubiquitous log ratios to sort with
mutate(valence = log(pos/neg)) |>
arrange(valence)
head(nb_coefs) # the worst (I lol-ed)
## neg pos valence
## nbsp 0.0003841646 5.882422e-06 -4.179103
## seagal 0.0002734731 5.882422e-06 -3.839235
## schumacher 0.0001888267 5.882422e-06 -3.468861
## jackal 0.0001758041 5.882422e-06 -3.397402
## bats 0.0001367366 5.882422e-06 -3.146088
## bye 0.0001367366 5.882422e-06 -3.146088
tail(nb_coefs) # the best
## neg pos valence
## ordell 6.511264e-06 0.0001470606 3.117311
## maximus 6.511264e-06 0.0001529430 3.156531
## taran 6.511264e-06 0.0001588254 3.194272
## sweetback 6.511264e-06 0.0001647078 3.230639
## cauldron 6.511264e-06 0.0002294145 3.561997
## shrek 6.511264e-06 0.0002588266 3.682625
One thing you’ll probably notice is that this model picks up particular films and characters as predictive, whereas the previous model picked up more general evaluative language. Also, here we use all the words, and don’t get to pick and choose.
Is this model any good as a classifier? Let’s check.
test_preds_nb <- predict(tmod2, newdata = test_dfm_nb)
# confusion matrix
testconf_nb <- table(model = test_preds_nb, true = test_dfm_nb$sentiment)
testconf_nb
## true
## model neg pos
## neg 415 116
## pos 85 384
This doesn’t seem quite as good, though you can see it makes errors in different places. Here are the two models’ predictions compared with each other (not to the right answer):
table(test_preds, test_preds_nb)
## test_preds_nb
## test_preds neg pos
## neg 404 80
## pos 127 389
One consequence of these differences is that our new model is a bit more unbalanced in its positive and negative performance
diag(prop.table(testconf_nb, 1)) # precision
## neg pos
## 0.7815443 0.8187633
diag(prop.table(testconf_nb, 2)) # recall
## neg pos
## 0.830 0.768
Let’s check the calibration of this new model. This is very much the same code as above, except that everything is suffixed with _nb so we can tell things apart.
# calibration
test_probs_nb <- predict(tmod2, newdata = test_dfm_nb, type = "probability")
test_probs_nb <- as.data.frame(test_probs_nb)
cal_nb <- data.frame(pred_prob = test_probs_nb$pos,
truth = test_dfm_nb$sentiment) |>
mutate(bin = cut(pred_prob, breaks = 10, labels = labels)) |>
group_by(bin, .drop = FALSE) |>
summarize(prop_pos = mean(truth == "pos"))
cal_nb
## # A tibble: 10 x 2
## bin prop_pos
## <fct> <dbl>
## 1 0.05 0.197
## 2 0.15 0.429
## 3 0.25 0.7
## 4 0.35 0.333
## 5 0.45 0.6
## 6 0.55 0.625
## 7 0.65 NaN
## 8 0.75 0.4
## 9 0.85 0.556
## 10 0.95 0.832
cal_nb <- mutate(cal_nb, midpoint = labels)
ggplot(cal_nb, aes(midpoint, prop_pos)) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
geom_point() +
geom_line() +
lims(x = c(0,1), y = c(0,1)) +
labs(x = "Midpoint of predicted probabilities (positive)",
y = "Proportion of reviews positive")
## Warning: Removed 1 rows containing missing values (geom_point).
Calibration is pretty terrible, no? That’s because this classifier is overconfident, as we can see from the sorts of probabilities it predicts
ggplot(test_probs_nb, aes(x = pos)) +
geom_density(fill = "grey") +
labs(x = "Predicted probability of being a positive review")
Which model to use then? Well, why not both? We can easily make a mega-uber-model out of the predictions of both of them. This is a simple form of ensemble learning, although our ensemble will be a bit, err, small.
Here we will try to learn the appropriate combination of predicted probabilities from each model in a third model. It’ll be a logistic regression, but the details need not detain us. If we had a lot of models, we could just have them vote, or something equally straightforward.
We’ll try to learn the right linear combination from the training data and then use this and the fitted models to generate predictions for the test data.
# some stuff we need now but didn't create earlier
train_probs <- predict(tmod1, type = "probability")
train_probs <- as.data.frame(train_probs)
train_probs_nb <- predict(tmod2, type = "probability")
train_probs_nb <- as.data.frame(train_probs_nb)
# the 'training data' (the previous models' in-sample predictions)
# with positive -> 1 and negative -> 0
ens_train <- data.frame(lr = train_probs$pos,
nb = train_probs_nb$pos,
is_pos = as.numeric(train_dfm$sentiment == "pos"))
# the 'test data' (our previous models test data predictions)
ens_test <- data.frame(lr = test_probs$pos,
nb = test_probs_nb$pos)
# fit a model
ens_model <- glm(is_pos ~ lr + nb, data = ens_train, family = binomial)
summary(ens_model)
##
## Call:
## glm(formula = is_pos ~ lr + nb, family = binomial, data = ens_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.52884 -0.00757 -0.00001 0.00688 2.29148
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.5615 2.5632 -6.851 7.31e-12 ***
## lr 32.6426 4.9339 6.616 3.69e-11 ***
## nb 2.5827 0.7849 3.290 0.001 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386.294 on 999 degrees of freedom
## Residual deviance: 55.573 on 997 degrees of freedom
## AIC: 61.573
##
## Number of Fisher Scoring iterations: 10
# predicted probabilities from the uber-model
ens_probs <- predict(ens_model, newdata = ens_test, type = "response")
# classification
ens_topic <- ifelse(ens_probs > 0.5, "pos", "neg")
# how confused are we this time?
testconf_ens <- table(model = ens_topic, true = test_dfm_nb$sentiment)
testconf_ens
## true
## model neg pos
## neg 408 84
## pos 92 416
So are we “better together”, as the anti-Scottish-independence campaigners used to say? On almost all fronts, yes: better accuracy, better overall precision, and better overall recall, although not by a great margin. Combining more models would probably help. And the calibration is… well, you can check if you like.
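If you do want to check, the binning code from above carries over directly; a quick sketch:
# calibration check for the ensemble, using the same 10 bins as before
cal_ens <- data.frame(pred_prob = ens_probs,
truth = test_dfm_nb$sentiment) |>
mutate(bin = cut(pred_prob, breaks = 10, labels = labels)) |>
group_by(bin, .drop = FALSE) |>
summarize(prop_pos = mean(truth == "pos"))
cal_ens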
You might remember from the lectures that we can change the recall and precision of a classifier by changing the probability threshold. Make it lower than 0.5 if missing a positive review is worse than missing a negative one, and higher than 0.5 if the opposite is true. So if you are using classifiers to find interesting but rare things in a big body of text so that you can then read them, then you probably have such an asymmetric situation and should probably adjust the threshold for calling the topic accordingly.
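For example, here’s a sketch of lowering the ensemble’s threshold to 0.3, so that we call reviews ‘positive’ more readily:
# a more permissive threshold for calling a review positive
ens_topic_03 <- ifelse(ens_probs > 0.3, "pos", "neg")
testconf_03 <- table(model = ens_topic_03, true = test_dfm_nb$sentiment)
diag(prop.table(testconf_03, 1)) # precision: will typically fall for 'pos'
diag(prop.table(testconf_03, 2)) # recall: will typically rise for 'pos'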
But what if we don’t have any particular idea of which kind of error is more important? In that case we can look at the trade-off between the true positive rate (recall) and the false positive rate, traced out by checking every possible classification threshold. This is the ROC curve. If one model’s curve is uniformly lower than another’s, then the second model is better whatever you want to do with it. And that’s a nice way to choose a model, if you want to do that.
It’s not hard to program an ROC curve, but it’s easier to use a package. This one, {plotROC}, wants a column of individual predictions, a column of true values coded 0/1 (I set positive to 1 here), and, if we are comparing more than one model, a column of model indicators. Then plotting is straightforward.
roc_data <- data.frame(pred = c(test_probs$pos, test_probs_nb$pos, ens_probs),
true = rep(ifelse(test_dfm$sentiment == "pos", 1, 0), 3),
model = rep(c("logit", "naive-bayes", "combined"), each = 1000))
ggplot(roc_data, aes(d = true, m = pred, color = model)) +
geom_roc(n.cuts = 0) +
labs(x = "False positive rate (1 - specificity)",
y = "True positive rate (recall)")
The package insists on its own labeling conventions, so I’ve just overridden them.
The combined model is better than either of its components at pretty much any probability cutoff, so we should probably just use that.
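If you’d like a single-number summary of each curve, {plotROC} can compute the area under it (AUC) from the plot object; a quick sketch:
roc_plot <- ggplot(roc_data, aes(d = true, m = pred, color = model)) +
geom_roc(n.cuts = 0)
calc_auc(roc_plot) # area under each model's ROC curve; closer to 1 is better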