There are multiple ways to measure which words (or bigrams, or other
units of text) are important in a text. You can count words, or measure tf-idf.
This package implements a different approach for measuring which words
are important to a text: the **weighted log odds**.

A log odds ratio is a way of expressing how much more or less likely a
word is in one group of text compared to another, and we can weight a
log odds ratio so that the estimate deals better with words and
documents that have very different counts. In particular, we use the
method outlined in Monroe, Colaresi, and Quinn
(2008) for posterior log odds ratios, assuming a multinomial model
with a Dirichlet prior. The default prior is **estimated from the
data itself**, an empirical Bayesian approach, but an
uninformative prior is also available.

What does this mean? It means that by weighting using empirical Bayes estimation, we take into account the sampling error in our measurements and acknowledge that we are more certain when we’ve counted something a lot of times and less certain when we’ve counted something only a few times. When weighting by a prior in this way, we focus on differences that are more likely to be real, given the evidence that we have.
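Concretely, the weighted log odds can be sketched as follows (a simplified rendering of the Monroe et al. approach, so take the notation as illustrative: $y_{wi}$ is the count of feature $w$ in set $i$, $n_i$ is the total count in set $i$, $y_w$ and $n$ are the corresponding totals over all sets, and the $\alpha$ terms are the Dirichlet pseudocounts and their sums):

$$
\hat{\delta}_{wi} = \log \frac{y_{wi} + \alpha_{wi}}{n_i + \alpha_i - (y_{wi} + \alpha_{wi})} - \log \frac{y_{w} + \alpha_{w}}{n + \alpha - (y_{w} + \alpha_{w})}
$$

$$
\sigma^2(\hat{\delta}_{wi}) \approx \frac{1}{y_{wi} + \alpha_{wi}} + \frac{1}{y_{w} + \alpha_{w}},
\qquad
\zeta_{wi} = \frac{\hat{\delta}_{wi}}{\sqrt{\sigma^2(\hat{\delta}_{wi})}}
$$

The z-score $\zeta_{wi}$ is the weighted log odds reported for feature $w$ in set $i$: the posterior log odds of the feature in that set relative to the whole collection, divided by its approximate standard error.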

Let’s look at an example. We’ll explore the six published,
completed novels of Jane Austen and use the tidytext package to
count up the bigrams (sequences of two adjacent words) in each novel,
focusing on bigrams that contain *no upper case letters*. (This way we can
look at differences other than those involving proper nouns.)
The weighted log odds approach would work equally well for single
words, or other kinds of n-grams.

```
library(dplyr)
library(janeaustenr)
library(tidytext)
library(stringr)
tidy_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2, to_lower = FALSE) %>%
  filter(!str_detect(bigram, "[A-Z]"))

bigram_counts <- tidy_bigrams %>%
  count(book, bigram, sort = TRUE)

bigram_counts
#> # A tibble: 256,348 × 3
#>    book                bigram     n
#>    <fct>               <chr>  <int>
#>  1 Mansfield Park      of the   708
#>  2 Mansfield Park      to be    582
#>  3 Emma                to be    574
#>  4 Emma                of the   524
#>  5 Mansfield Park      in the   520
#>  6 Pride & Prejudice   of the   437
#>  7 Pride & Prejudice   to be    416
#>  8 Sense & Sensibility to be    410
#>  9 Persuasion          of the   409
#> 10 Emma                in the   405
#> # … with 256,338 more rows
```

Notice that we haven’t removed stop words or filtered out rarely used words; we have done very little pre-processing of this text data.

Now let’s use the `bind_log_odds()` function from the
tidylo package to find the weighted log odds for each bigram. The
weighted log odds computed by this function are also z-scores for the
log odds; this quantity is useful for comparing frequencies across
categories or sets, but its relationship to an odds ratio is not
straightforward after the weighting.

What are the bigrams with the highest weighted log odds for these books?

```
library(tidylo)
bigram_log_odds <- bigram_counts %>%
  bind_log_odds(book, bigram, n)

bigram_log_odds %>%
  arrange(-log_odds_weighted)
#> # A tibble: 256,348 × 4
#>    book                bigram            n log_odds_weighted
#>    <fct>               <chr>         <int>             <dbl>
#>  1 Emma                any thing       150             13.3
#>  2 Northanger Abbey    pump room        23              9.32
#>  3 Sense & Sensibility any thing        58              9.17
#>  4 Northanger Abbey    the pump         22              9.11
#>  5 Northanger Abbey    the abbey        20              8.69
#>  6 Northanger Abbey    the general's    15              7.53
#>  7 Persuasion          any thing         5              7.36
#>  8 Emma                any body         61              6.72
#>  9 Emma                every body       67              6.61
#> 10 Sense & Sensibility the cottage      26              6.28
#> # … with 256,338 more rows
```

The highest log odds bigrams (bigrams more likely to come from each
book, compared to the others) involve concepts specific to each novel,
such as the pump room in *Northanger Abbey* and sister/mother
figures in *Sense & Sensibility*. We can make a visualization
as well.

```
library(ggplot2)
bigram_log_odds %>%
  group_by(book) %>%
  slice_max(log_odds_weighted, n = 10) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
  ggplot(aes(log_odds_weighted, bigram, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(book), scales = "free") +
  labs(y = NULL)
```

These bigrams have the highest weighted log odds for each book.

The proper names of characters tend to be unique from book to book,
unless you’re looking at a series like *Harry Potter*. When it
comes to identifying words or phrases that are unique to a text, a measure
like tf-idf does a good job. But because of the way tf-idf is
calculated, it cannot distinguish between words that are used in
**all** texts.

For example, the phrase “had been” is used over 100 times in all six
Austen novels. The “idf” in tf-idf stands for “inverse document
frequency”. When a word is in all the texts, tf-idf is zero for all of
them. In weighted log odds, however, each book has a different value; in
this case the weighted log odds ranges from 5.42 in *Persuasion*
(very high) down to -2.27 in *Sense & Sensibility* (which
suggests something about the style or content is suppressing the
phrase).
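To see the arithmetic, here is a toy base R illustration (the term frequencies are made up): when a term appears in all N documents, its inverse document frequency is log(N / N) = log(1) = 0, so tf-idf is zero no matter what the term frequencies are.

```
# Hypothetical term frequencies for a phrase like "had been" in two documents
tf <- c(doc1 = 120 / 50000, doc2 = 100 / 60000)

# The phrase occurs in both of the 2 documents, so idf = log(2 / 2) = 0
idf <- log(2 / 2)

tf * idf
#> doc1 doc2
#>    0    0
```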

Text analysis is the main motivating use case for this implementation of weighted log odds, but it is a general approach: it measures how much more likely one feature (any kind of feature, not just a word or bigram) is to be associated with one set or group (any kind of set, not just a document or book) than with the others.

To demonstrate this, let’s look at everybody’s favorite data about
cars. What do we know about the relationship between number of gears and
engine shape `vs`?

```
gear_counts <- mtcars %>%
  count(vs, gear)

gear_counts
#>   vs gear  n
#> 1  0    3 12
#> 2  0    4  2
#> 3  0    5  4
#> 4  1    3  3
#> 5  1    4 10
#> 6  1    5  1
```

Now we can use `bind_log_odds()` to find the weighted log
odds for each number of gears and engine shape. First, let’s use the
default empirical Bayes prior. It regularizes the values.

```
regularized <- gear_counts %>%
  bind_log_odds(vs, gear, n)

regularized
#>   vs gear  n log_odds_weighted
#> 1  0    3 12         1.1728347
#> 2  0    4  2        -1.3767516
#> 3  0    5  4         0.4033125
#> 4  1    3  3        -1.1354777
#> 5  1    4 10         1.5661168
#> 6  1    5  1        -0.4362340
```

For engine shape `vs = 0`, having three gears has the
highest weighted log odds, while for engine shape `vs = 1`,
having four gears has the highest weighted log odds. This dataset is
small enough that you can look at the count data and see how this is
working.

Now, let’s use the uninformative prior, and compare to the unweighted log odds. These log odds will be farther from zero than the regularized estimates.

```
unregularized <- gear_counts %>%
  bind_log_odds(vs, gear, n, uninformative = TRUE, unweighted = TRUE)

unregularized
#>   vs gear  n   log_odds log_odds_weighted
#> 1  0    3 12  0.6968169         1.8912729
#> 2  0    4  2 -1.2527630        -1.9691060
#> 3  0    5  4  0.3249262         0.5549172
#> 4  1    3  3 -0.9673459        -1.7407107
#> 5  1    4 10  1.1451323         2.8421436
#> 6  1    5  1 -0.5268260        -0.6570674
```
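To make the calculation transparent, here is a base R sketch that reproduces these numbers by hand. It mirrors the Monroe et al. posterior log odds as a z-score, but it is an illustration rather than the package internals: the empirical Bayes prior adds each feature's overall count as a pseudocount, while the uninformative prior adds 1 to every cell.

```
gear_counts <- data.frame(
  vs   = c(0, 0, 0, 1, 1, 1),
  gear = c(3, 4, 5, 3, 4, 5),
  n    = c(12, 2, 4, 3, 10, 1)
)

# Weighted log odds of each feature (gear) in each set (vs),
# given a vector of Dirichlet pseudocounts `alpha`
weighted_log_odds <- function(counts, alpha) {
  y_wi <- counts$n + alpha                                          # posterior count per cell
  n_i  <- tapply(y_wi, counts$vs, sum)[as.character(counts$vs)]     # per-set totals
  y_w  <- tapply(y_wi, counts$gear, sum)[as.character(counts$gear)] # per-feature totals
  n    <- sum(y_wi)                                                 # grand total
  # log odds in the set minus log odds in the whole collection, as a z-score
  delta  <- log(y_wi / (n_i - y_wi)) - log(y_w / (n - y_w))
  sigma2 <- 1 / y_wi + 1 / y_w
  unname(delta / sqrt(sigma2))
}

# Empirical Bayes prior: pseudocount = the feature's total count across sets
feature_totals <- tapply(gear_counts$n, gear_counts$gear, sum)
round(weighted_log_odds(gear_counts, feature_totals[as.character(gear_counts$gear)]), 4)
#> [1]  1.1728 -1.3768  0.4033 -1.1355  1.5661 -0.4362

# Uninformative prior: pseudocount = 1 for every cell
round(weighted_log_odds(gear_counts, 1), 4)
#> [1]  1.8913 -1.9691  0.5549 -1.7407  2.8421 -0.6571
```

The first call matches the `log_odds_weighted` column of `regularized` above, and the second matches that of `unregularized`.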

Most importantly, notice that this approach is useful not only for the initial motivating example of text data but also more generally: whenever you have counts of features within groups or sets, you can find which features are more likely to come from a group, compared to the other groups.