sentixr with tidytext and quanteda

sentixr is designed to work alongside other R packages for text analysis. If you already use the tidytext ecosystem or the quanteda framework, you can export the sentixr lexicons and plug them into your existing pipelines.

Setup

library(sentixr)

We’ll use the example data provided by the package:

data(recensioni_tv)
recensioni_tv
#>   doc_id
#> 1   doc1
#> 2   doc2
#> 3   doc3
#> 4   doc4
#> 5   doc5
#>                                                                                             text
#> 1                          Ottimo prodotto, la qualità dell'immagine è buona, colori molto vivi.
#> 2 Ho riscontrato subito problemi; mi sono ostinato a fare delle prove, purtroppo senza risultati
#> 3                               La tv è molto bella, ma la qualità dell'audio ha delle mancanze.
#> 4     Il prodotto va benissimo. C'è da dire che il costo irrisorio corrisponde ad alcuni limiti.
#> 5           I colori sono eccessivamente saturi, per non parlare dell'audio, a dir poco pessimo!

sentixr with tidytext

For the tidytext ecosystem, sentixr provides access to its lexicons in a tidy format, ready for joining.

library(tidytext)
library(dplyr) # provides left_join(), group_by(), summarise(), mutate()

Get the Lexicon

Use get_sentix() to retrieve the lexicon as a tidy tibble with only two columns (here, word and score).

For tidytext workflows on raw text (without lemmatization), the MAL lexicon is preferred because it contains inflected forms, matching the output of standard tokenizers.

# Get the MAL lexicon (inflected forms)
mal_dict <- get_sentix("MAL")
head(mal_dict)
#> # A tibble: 6 × 2
#>   word            score
#>   <chr>           <dbl>
#> 1 genere_lucilia -0.25 
#> 2 gotta          -0.375
#> 3 pianississimo  -0.125
#> 4 posse           0.125
#> 5 siboglinidae    0.125
#> 6 cacaphony       0.777

Tokenize and Join

Use tidytext::unnest_tokens() to split the text into words.

# Tokenize
tidy_text <- recensioni_tv |> 
  unnest_tokens(word, text)
# Join with lexicon
tidy_sent <- tidy_text |>
  left_join(mal_dict, by = "word")

head(tidy_sent)
#>   doc_id          word     score
#> 1   doc1        ottimo 0.5625000
#> 2   doc1      prodotto 0.1250000
#> 3   doc1            la 0.0000000
#> 4   doc1       qualità 0.3631757
#> 5   doc1 dell'immagine        NA
#> 6   doc1             è 0.1256011

Here, left_join() keeps all words, including those without a score, so that the token count (n_tokens in the summaries below) remains accurate; score is NA for words not found in the lexicon.

Alternatively, an inner_join() can be used to keep only the words present in the lexicon.
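The difference between the two joins is easiest to see on a toy example. The sketch below uses hand-written tokens (`toy_tokens`) and a three-word lexicon (`toy_lexicon`); both are invented for illustration and are not sentixr data:

```r
library(dplyr)

# Hypothetical tokens and lexicon, invented for this illustration
toy_tokens <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  word   = c("ottimo", "prodotto", "xyz", "pessimo", "audio")
)
toy_lexicon <- tibble(
  word  = c("ottimo", "prodotto", "pessimo"),
  score = c(0.5, 0.1, -0.6)
)

# left_join() keeps every token; unmatched words get score = NA
kept_all <- left_join(toy_tokens, toy_lexicon, by = "word")

# inner_join() keeps only the tokens found in the lexicon
kept_matched <- inner_join(toy_tokens, toy_lexicon, by = "word")

nrow(kept_all)      # 5 rows: per-document token counts are preserved
nrow(kept_matched)  # 3 rows: unmatched tokens are dropped
```

A mean computed with na.rm = TRUE is the same in both cases; only the token count differs.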

Analyze

To get the sentiment scores, you can use the native sentix_summarize() function for a quick analysis on the joined data:

# Calculate average sentiment per document
sentix_summarize(tidy_sent, simplify = FALSE)
#> # A tibble: 5 × 4
#>   doc_id   score n_tokens n_scored
#>   <chr>    <dbl>    <int>    <int>
#> 1 doc1    0.239        10        9
#> 2 doc2   -0.160        14       11
#> 3 doc3    0.195        12       10
#> 4 doc4    0.115        15        9
#> 5 doc5   -0.0208       13        8

Alternatively, if you prefer custom metrics (e.g., standard deviation or median), you can manually group and summarize:

# Manual summary with dplyr
tidy_sent |>
  group_by(doc_id) |>
  summarise(
    sentiment = mean(score, na.rm = TRUE),
    n_tokens = n(),
    n_scored = sum(!is.na(score))
  )
#> # A tibble: 5 × 4
#>   doc_id sentiment n_tokens n_scored
#>   <chr>      <dbl>    <int>    <int>
#> 1 doc1      0.239        10        9
#> 2 doc2     -0.160        14       11
#> 3 doc3      0.195        12       10
#> 4 doc4      0.115        15        9
#> 5 doc5     -0.0208       13        8
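As a sketch of the custom metrics mentioned above, the same grouping pattern extends to a median and a standard deviation. The data frame `toy_sent` is a hand-made stand-in for `tidy_sent` with invented scores, so the numbers are illustrative only:

```r
library(dplyr)

# Hypothetical joined data: one row per token, NA where the lexicon has no entry
toy_sent <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  word   = c("a", "b", "c", "d", "e", "f"),
  score  = c(0.5, 0.1, NA, -0.6, -0.2, 0.2)
)

toy_summary <- toy_sent |>
  group_by(doc_id) |>
  summarise(
    median_score = median(score, na.rm = TRUE),
    sd_score     = sd(score, na.rm = TRUE),
    n_scored     = sum(!is.na(score))
  )
toy_summary
```

Note that sd() returns NA for a document with fewer than two scored tokens, so n_scored is worth keeping next to it.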

Polarity Analysis

To perform polarity analysis (counting positive vs. negative words), retrieve the lexicon with polarity = TRUE, which replaces the numeric score with a “positive”, “negative”, or “neutral” label, and join as before.

# Get MAL with polarity labels
polar_dict <- get_sentix("MAL", polarity = TRUE)
head(polar_dict)
#> # A tibble: 6 × 2
#>   word           polarity
#>   <chr>          <chr>   
#> 1 genere_lucilia negative
#> 2 gotta          negative
#> 3 pianississimo  negative
#> 4 posse          positive
#> 5 siboglinidae   positive
#> 6 cacaphony      positive
# Join with tokenized text
tidy_text |>
  left_join(polar_dict, by = "word") |>
  head()
#>   doc_id          word polarity
#> 1   doc1        ottimo positive
#> 2   doc1      prodotto positive
#> 3   doc1            la  neutral
#> 4   doc1       qualità positive
#> 5   doc1 dell'immagine     <NA>
#> 6   doc1             è positive
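Once the labels are joined, the counting step itself can be done with dplyr::count(). The following is a minimal sketch on invented data (`toy_polar` is hypothetical and stands in for the joined tokens shown above):

```r
library(dplyr)

# Hypothetical tokens already joined with polarity labels
toy_polar <- tibble(
  doc_id   = c("d1", "d1", "d1", "d2", "d2", "d2"),
  polarity = c("positive", "positive", NA, "negative", "neutral", "negative")
)

polarity_counts <- toy_polar |>
  filter(!is.na(polarity)) |>  # drop tokens missing from the lexicon
  count(doc_id, polarity)      # one row per document/label pair
polarity_counts
```

Labels that never occur in a document simply have no row, so a missing "negative" row for a document means a count of zero.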

It is also possible to derive polarity labels from continuous scores yourself, using the make_polarity() function. Here the threshold is set to 0.125: as the output shows, scores of 0.125 or higher are labeled “positive”, scores of -0.125 or lower are labeled “negative”, and scores in between are “neutral”:

mal_dict |> 
  mutate(polarity = make_polarity(score, 
                                  threshold = 0.125)) |> 
  head()
#> # A tibble: 6 × 3
#>   word            score polarity
#>   <chr>           <dbl> <chr>   
#> 1 genere_lucilia -0.25  negative
#> 2 gotta          -0.375 negative
#> 3 pianississimo  -0.125 negative
#> 4 posse           0.125 positive
#> 5 siboglinidae    0.125 positive
#> 6 cacaphony       0.777 positive

Or, to convert every numeric column into polarity labels at once, here using the lexicon returned by get_elita():

get_elita() |> 
  mutate(across(where(is.numeric), 
                ~ make_polarity(.x))) |> 
  tail()
#> # A tibble: 6 × 4
#>   lemma         valenza  attivazione dominanza
#>   <chr>         <chr>    <chr>       <chr>    
#> 1 spropositato  positive positive    positive 
#> 2 strascico     negative negative    negative 
#> 3 suggestivo    positive positive    positive 
#> 4 ufficialmente neutral  neutral     negative 
#> 5 verificare    negative positive    negative 
#> 6 vorticoso     positive positive    negative

It is also possible to set different thresholds for positive and negative classifications, by providing a vector of two values, such as c(0.125, -0.135).
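To make the threshold logic concrete, here is a base-R sketch of how such labeling could work. `label_by_threshold()` is a hypothetical helper written for this illustration, not sentixr's make_polarity(); its argument names are invented, and the inclusive comparisons are an assumption inferred from the output shown earlier, where scores of exactly ±0.125 receive a label:

```r
# Hypothetical helper, NOT sentixr's make_polarity(): labels scores using
# an upper cut-off for "positive" and a lower cut-off for "negative".
label_by_threshold <- function(score, pos = 0.125, neg = -pos) {
  out <- rep("neutral", length(score))
  out[which(score >= pos)] <- "positive"  # inclusive (assumed)
  out[which(score <= neg)] <- "negative"  # inclusive (assumed)
  out[is.na(score)] <- NA_character_      # missing scores stay missing
  out
}

scores <- c(-0.25, -0.13, -0.125, 0, 0.125, 0.777, NA)

# Symmetric threshold: -0.13 and -0.125 both count as negative
label_by_threshold(scores)

# Asymmetric thresholds: only scores at or below -0.135 are negative,
# so -0.13 and -0.125 become neutral
label_by_threshold(scores, pos = 0.125, neg = -0.135)
```

Separate cut-offs are useful when a lexicon's scores are skewed toward one side and a symmetric band would label too many words on that side.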

sentixr with quanteda

sentixr also includes helper functions to convert its lexicons into quanteda::dictionary objects, facilitating integration with the quanteda framework.

library(quanteda)
data(recensioni_tv)
sentix_toks <- corpus(recensioni_tv) |>
  tokens(remove_punct = TRUE)

Creating a quanteda Dictionary

The df_to_dict() function converts a data frame lexicon into a quanteda dictionary that can be used, for example, with tokens_lookup() or dfm_lookup().

If the package quanteda.sentiment is installed, the function will automatically assign the appropriate polarity or valence attributes, making the dictionary compatible with textstat_valence() or textstat_polarity().

Other helper functions include df_to_valence() and df_to_polar() for explicit control.

# Convert MAL to a valence dictionary
my_dict <- df_to_dict(mal_dict)

which is equivalent to:

df_to_valence(mal_dict)

With quanteda.sentiment installed, the scores are stored in the dictionary’s “valence” attribute, so the dictionary can be passed directly to textstat_valence():

# Compute valence
quanteda.sentiment::textstat_valence(sentix_toks, dictionary = my_dict)
#>   doc_id  sentiment
#> 1   doc1  0.2689482
#> 2   doc2 -0.1755017
#> 3   doc3  0.2788701
#> 4   doc4  0.1295423
#> 5   doc5 -0.0208181

If quanteda.sentiment is not installed, df_to_dict() creates a standard quanteda dictionary.
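A standard quanteda dictionary can still be used for lookups. As a sketch, the dictionary below is built by hand with quanteda::dictionary() from invented word lists (it is not produced by df_to_dict()) and applied with tokens_lookup():

```r
library(quanteda)

# Hand-made dictionary with invented entries, for illustration only
toy_dict <- dictionary(list(
  positive = c("ottimo", "buona", "bella"),
  negative = c("pessimo", "problemi")
))

toy_toks <- tokens(
  c(d1 = "Ottimo prodotto, immagine buona.",
    d2 = "Audio pessimo, solo problemi!"),
  remove_punct = TRUE
)

# Replace each matching token with its dictionary key, then count the keys
toy_counts <- dfm(tokens_lookup(toy_toks, dictionary = toy_dict))
toy_counts
```

Matching is case-insensitive by default, so "Ottimo" at the start of a sentence is still counted under the positive key.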

To get a polarity dictionary:

my_dict2 <- get_sentix("MAL", polarity = TRUE) |> 
  # df_to_polar() gives explicit control, e.g. when the data frame has numeric columns besides 'polarity'
  df_to_polar()

which is equivalent to applying df_to_dict() to the polarity version of the lexicon:

my_dict2 <- df_to_dict(polar_dict)
# Compute polarity scores
quanteda.sentiment::textstat_polarity(sentix_toks, 
                                      dictionary = my_dict2)
#>   doc_id sentiment
#> 1   doc1 2.8332133
#> 2   doc2 0.0000000
#> 3   doc3 1.4663371
#> 4   doc4 0.9555114
#> 5   doc5 0.0000000
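These values appear consistent with the default scoring function of textstat_polarity() in quanteda.sentiment, a smoothed logit scale, log((pos + 0.5) / (neg + 0.5)). The base-R sketch below reproduces that formula; the function name `logit_polarity` is hypothetical, and the match counts passed to it are back-calculated from the output above rather than reported by the package:

```r
# Hypothetical re-implementation of a smoothed logit polarity score
logit_polarity <- function(pos, neg, smooth = 0.5) {
  log(pos + smooth) - log(neg + smooth)
}

# 8 positive and 0 negative matches (assumed counts) reproduce doc1's score
logit_polarity(pos = 8, neg = 0)  # log(17), about 2.8332133

# Equal positive and negative counts always give 0, as for doc2 and doc5
logit_polarity(pos = 3, neg = 3)
```

The smoothing constant keeps the score finite when a document has no negative (or no positive) matches, which is why doc1 gets a large but finite value.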