Introduction to sentixr

sentixr is a package designed to simplify sentiment analysis in Italian using a variety of lexicons: Sentix, MAL, ELIta VAD and ELIta basic.

Sentix includes 68,190 Italian lemmas (field lemma) with associated affective scores and an index of polypathy. MAL expands Sentix with inflected forms from Morph-it!, and can be used without lemmatization.

ELIta VAD includes scores for 6,905 Italian lexical entries (lemmas and emojis) on the VAD dimensions (Valence, Arousal, and Dominance), while ELIta basic focuses on the eight basic emotions of Plutchik’s wheel together with the dyad love.

This vignette illustrates the core workflow using the functions sentix_annotate() and sentix_summarize(), and some of the main features of sentix_annotate().

See also the vignette on using the package with tidytext and quanteda.

Basic Workflow

The typical workflow consists of two main steps: annotation and summarization.

library(sentixr)
testo <- "Oggi è una bella giornata. Esco a fare una passeggiata"
frase_ann <- sentix_annotate(testo, 
                             # set the document ID
                             docid_field = "frase")
head(frase_ann)
#>   doc_id sentence_id token_id    token    lemma  upos     score
#> 1  frase           1        1     Oggi     oggi   ADV 0.1988304
#> 2  frase           1        2        è   essere   AUX 0.1256011
#> 3  frase           1        3      una      uno   DET 0.0000000
#> 4  frase           1        4    bella    bello   ADJ 0.7413988
#> 5  frase           1        5 giornata giornata  NOUN 0.0000000
#> 6  frase           1        6        .        . PUNCT        NA
sentix_summarize(frase_ann)
#> # A tibble: 1 × 4
#>   doc_id score n_tokens n_scored
#>   <chr>  <dbl>    <int>    <int>
#> 1 frase  0.176       10        9

sentix_summarize()

sentix_summarize() computes overall sentiment scores and auxiliary metrics from the annotated data frame. By default it summarizes by document; other segments can be selected with the by argument.

By default, sentix_summarize() returns, for each document, the columns shown in the output above: score, n_tokens and n_scored.

To obtain only sentiment scores, set simplify = TRUE:

sentix_summarize(frase_ann, 
                 simplify = TRUE)
#> # A tibble: 1 × 2
#>   doc_id score
#>   <chr>  <dbl>
#> 1 frase  0.176

To get scores by sentences (or other segments) within each document, set the by argument:

sentix_summarize(frase_ann,
                 by = c("doc_id", "sentence_id"))
#> # A tibble: 2 × 5
#>   doc_id sentence_id score n_tokens n_scored
#>   <chr>        <int> <dbl>    <int>    <int>
#> 1 frase            1 0.213        5        5
#> 2 frase            2 0.131        5        4
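
The sentence-level score printed above appears consistent with a plain mean of the token scores in that sentence. The base R sketch below checks this for sentence 1, using the five scores shown by head(frase_ann). That the score is an average over scored tokens is an inference from the printed output, not a documented definition.

```r
# Token scores of sentence 1, copied from the head(frase_ann) output above
# (the final "." has no score and is left out).
scores_s1 <- c(0.1988304, 0.1256011, 0.0000000, 0.7413988, 0.0000000)

# Assumption: the summary score is the mean of the non-missing token scores.
mean_s1 <- mean(scores_s1, na.rm = TRUE)
round(mean_s1, 3)
#> [1] 0.213
```

The result matches the 0.213 reported for sentence 1, and words scored 0 evidently count toward n_scored.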

Note that udpipe assigns sentence IDs starting from 1 for each document, so sentence IDs will repeat across documents.
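
The effect of repeated sentence IDs can be illustrated with a toy table in base R. The data below are made up and only mimic the shape of the annotated output; aggregate() stands in for the grouping step and is not necessarily how sentix_summarize() works internally.

```r
# Toy token table in the shape of the annotated output (values are made up).
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1L, 2L, 1L, 1L),
  score       = c(0.5, -0.5, 1.0, 0.0)
)

# Grouping by sentence_id alone pools sentence 1 of doc1 with sentence 1 of doc2.
by_sentence <- aggregate(score ~ sentence_id, data = toy, FUN = mean)

# Grouping by both keys keeps the documents apart.
by_doc_sentence <- aggregate(score ~ doc_id + sentence_id, data = toy, FUN = mean)

nrow(by_sentence)      # 2 groups: documents are mixed
nrow(by_doc_sentence)  # 3 groups: one per sentence within each document
```

This is why the by argument in the example above lists doc_id together with sentence_id.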

sentix_annotate()

The sentix_annotate() function performs tokenization and lemmatization (via udpipe), and then joins the result with one of the available sentiment lexicons. By default, it uses the Sentix lexicon.

The output is a dataframe where each row is a token. It is a simplified version of the full udpipe output, plus the sentiment score(s).

For large corpora, the number of cores can optionally be set with the parallel.cores argument, which is passed on to udpipe::udpipe().
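
One possible way to choose a value for parallel.cores is sketched below, using parallel::detectCores() from base R. Leaving one core free is just a common convention, and the commented call is hypothetical (testi stands for any character vector of texts).

```r
# Leave one core free; fall back to 1 if the core count cannot be detected
# (detectCores() may return NA on some platforms).
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)
n_cores

# Hypothetical call, not run here:
# ann <- sentix_annotate(testi, parallel.cores = n_cores)
```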

Managing udpipe model

If no model is given, the function automatically downloads and uses the default Italian udpipe model. After the first run, the downloaded model can be passed with model = "local":

sentix_annotate(recensioni_tv, model = "local")

To load the downloaded model (or any other udpipe model) manually, use udpipe::udpipe_load_model():

# Load the model manually
model <- udpipe::udpipe_load_model("italian-isdt-ud-2.5-191206.udpipe")
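
If the exact file name is not known, a small helper can look for a previously downloaded model file. The sketch below is base R only; find_udpipe_model() is a hypothetical helper, not part of sentixr or udpipe, and it assumes the model was saved in the directory being searched (by default the working directory).

```r
# Hypothetical helper: return the most recently modified *.udpipe file
# in `dir`, or NA if there is none.
find_udpipe_model <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.udpipe$", full.names = TRUE)
  if (length(files) == 0) return(NA_character_)
  files[which.max(file.mtime(files))]
}

model_path <- find_udpipe_model()

# If a file was found, it could then be loaded with:
# model <- udpipe::udpipe_load_model(model_path)
```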

With multiple texts

The function, like udpipe, accepts as input single texts, multiple texts (a character vector, a list, or a list of tokens), or data frames.

sentix_annotate() simplifies the document ID management that is normally required by udpipe. In particular, the user can explicitly pass a vector of IDs using the docid_field argument, which safely processes them before passing the data to udpipe::udpipe().

# Multiple texts
testi <- c("Oggi è una bella giornata. Esco a fare una passeggiata", 
           "Non mi piace la pioggia, mi rende triste.")
sentix_annotate(testi,
                # loaded model
                model = model) |> head()
#>   doc_id sentence_id token_id    token    lemma  upos     score
#> 1   doc1           1        1     Oggi     oggi   ADV 0.1988304
#> 2   doc1           1        2        è   essere   AUX 0.1256011
#> 3   doc1           1        3      una      uno   DET 0.0000000
#> 4   doc1           1        4    bella    bello   ADJ 0.7413988
#> 5   doc1           1        5 giornata giornata  NOUN 0.0000000
#> 6   doc1           1        6        .        . PUNCT        NA
sentix_annotate(testi,
                model = model,
                # to specify document IDs
                docid_field = paste0("doc_", seq_along(testi))
) |> head()
#>   doc_id sentence_id token_id    token    lemma  upos     score
#> 1  doc_1           1        1     Oggi     oggi   ADV 0.1988304
#> 2  doc_1           1        2        è   essere   AUX 0.1256011
#> 3  doc_1           1        3      una      uno   DET 0.0000000
#> 4  doc_1           1        4    bella    bello   ADJ 0.7413988
#> 5  doc_1           1        5 giornata giornata  NOUN 0.0000000
#> 6  doc_1           1        6        .        . PUNCT        NA

To get the full udpipe output, set simplify = FALSE.

With data frames

While udpipe expects data frames to have columns named text and doc_id, sentix_annotate() also allows specifying the input column names, using the text_field and docid_field arguments.

Note that the function extracts and processes only these two columns, ignoring other metadata, and that they will be renamed to text and doc_id in the output.
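
To make the column handling concrete, the base R sketch below builds a hypothetical data frame with non-standard column names and reduces it to the two columns udpipe expects. The names recensioni, id, testo and stelle are invented for illustration, and the commented sentix_annotate() call assumes that text_field and docid_field take column names as strings.

```r
# Hypothetical input with non-standard column names and extra metadata.
recensioni <- data.frame(
  id     = c("r1", "r2"),
  testo  = c("Ottimo prodotto.", "Audio pessimo."),
  stelle = c(5L, 1L),
  stringsAsFactors = FALSE
)

# What udpipe expects: only the ID and text columns, named doc_id and text.
udpipe_input <- data.frame(
  doc_id = recensioni$id,
  text   = recensioni$testo,
  stringsAsFactors = FALSE
)
names(udpipe_input)

# Assumed equivalent call, not run here:
# sentix_annotate(recensioni, text_field = "testo", docid_field = "id")
```

Since other columns are ignored, metadata such as stelle would have to be joined back to the results by document ID afterwards.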

data(recensioni_tv)
recensioni_tv
#>   doc_id
#> 1   doc1
#> 2   doc2
#> 3   doc3
#> 4   doc4
#> 5   doc5
#>                                                                                             text
#> 1                          Ottimo prodotto, la qualità dell'immagine è buona, colori molto vivi.
#> 2 Ho riscontrato subito problemi; mi sono ostinato a fare delle prove, purtroppo senza risultati
#> 3                               La tv è molto bella, ma la qualità dell'audio ha delle mancanze.
#> 4     Il prodotto va benissimo. C'è da dire che il costo irrisorio corrisponde ad alcuni limiti.
#> 5           I colori sono eccessivamente saturi, per non parlare dell'audio, a dir poco pessimo!
# Annotate the dataframe
sentix_res <- sentix_annotate(recensioni_tv, 
                              model = model)
head(sentix_res)
#>   doc_id sentence_id token_id    token    lemma  upos     score
#> 1   doc1           1        1   Ottimo   ottimo   ADJ 1.0000000
#> 2   doc1           1        2 prodotto prodotto  NOUN 0.0000000
#> 3   doc1           1        3        ,        , PUNCT        NA
#> 4   doc1           1        4       la       il   DET        NA
#> 5   doc1           1        5  qualità  qualità  NOUN 0.3631757
#> 6   doc1           1      6-7    dell'     <NA>  <NA>        NA
# Summarize sentiment per document
sentix_summarize(sentix_res)
#> # A tibble: 5 × 4
#>   doc_id   score n_tokens n_scored
#>   <chr>    <dbl>    <int>    <int>
#> 1 doc1    0.274        12        9
#> 2 doc2   -0.139        15        9
#> 3 doc3    0.244        15        9
#> 4 doc4    0.117        16       12
#> 5 doc5   -0.0187       15        9

Using Different Lexicons

Other lexicons available in sentixr can be used with the dict argument.

The MAL lexicon contains inflected forms rather than lemmas. The function automatically handles this by joining on the token column.

# Use MAL lexicon
anno_mal <- sentix_annotate(recensioni_tv, 
                            model = model, dict = "MAL")
head(anno_mal)
#>   doc_id sentence_id token_id    token    lemma  upos     score
#> 1   doc1           1        1   Ottimo   ottimo   ADJ 0.5625000
#> 2   doc1           1        2 prodotto prodotto  NOUN 0.1250000
#> 3   doc1           1        3        ,        , PUNCT        NA
#> 4   doc1           1        4       la       il   DET 0.0000000
#> 5   doc1           1        5  qualità  qualità  NOUN 0.3631757
#> 6   doc1           1      6-7    dell'     <NA>  <NA>        NA
# Summarize
summary_mal <- sentix_summarize(anno_mal)

summary_mal
#> # A tibble: 5 × 4
#>   doc_id   score n_tokens n_scored
#>   <chr>    <dbl>    <int>    <int>
#> 1 doc1    0.215        12       10
#> 2 doc2   -0.146        15       12
#> 3 doc3    0.163        15       12
#> 4 doc4    0.116        16       10
#> 5 doc5   -0.0185       15        9
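
One simple way to compare the two lexicons is the share of tokens that received a score in each summary. The sketch below copies the counts from the Sentix and MAL summaries printed above; the column names cov_sentix and cov_mal are introduced here for illustration only.

```r
# Counts copied from the two summaries printed above.
cmp <- data.frame(
  doc_id          = paste0("doc", 1:5),
  n_tokens        = c(12, 15, 15, 16, 15),
  n_scored_sentix = c(9, 9, 9, 12, 9),
  n_scored_mal    = c(10, 12, 12, 10, 9)
)

# Share of tokens that received a score under each lexicon.
cmp$cov_sentix <- cmp$n_scored_sentix / cmp$n_tokens
cmp$cov_mal    <- cmp$n_scored_mal / cmp$n_tokens
cmp[, c("doc_id", "cov_sentix", "cov_mal")]
```

In this small sample neither lexicon covers more tokens in every document: MAL scores more tokens in doc1 to doc3, Sentix in doc4.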

When using the ELIta family lexicons, the functions will produce scores and statistics for each dimension.

# Use ELIta VAD lexicon
anno_vad <- sentix_annotate(recensioni_tv, 
                            model = model,
                            dict = "elita_VAD")
head(anno_vad)
#>   doc_id sentence_id token_id    token    lemma  upos valenza attivazione
#> 1   doc1           1        1   Ottimo   ottimo   ADJ   0.875      -0.250
#> 2   doc1           1        2 prodotto prodotto  NOUN   0.125      -0.625
#> 3   doc1           1        3        ,        , PUNCT      NA          NA
#> 4   doc1           1        4       la       il   DET      NA          NA
#> 5   doc1           1        5  qualità  qualità  NOUN      NA          NA
#> 6   doc1           1      6-7    dell'     <NA>  <NA>      NA          NA
#>   dominanza
#> 1    0.2925
#> 2    0.0425
#> 3        NA
#> 4        NA
#> 5        NA
#> 6        NA
# Summarize 
sentix_summarize(anno_vad)
#> # A tibble: 5 × 6
#>   doc_id  valenza attivazione dominanza n_tokens n_scored
#>   <chr>     <dbl>       <dbl>     <dbl>    <int>    <int>
#> 1 doc1    0.500       -0.0971    0.285        12        6
#> 2 doc2    0.0414       0.387     0.0957       15        7
#> 3 doc3    0.402        0.208     0.125        15        3
#> 4 doc4    0.0714      -0.0296    0.0954       16        7
#> 5 doc5   -0.00667      0.0554    0.0342       15        6

elita_VAD scores are automatically rescaled to the -1 to 1 range for consistency with other lexicons. It is also possible to use the original -4/+4 scale by setting rescale = "none".

sentix_annotate(recensioni_tv, 
                model = model,
                dict = "elita_VAD",
                rescale = "none") |> head()
#>   doc_id sentence_id token_id    token    lemma  upos valenza attivazione
#> 1   doc1           1        1   Ottimo   ottimo   ADJ     3.5        -1.0
#> 2   doc1           1        2 prodotto prodotto  NOUN     0.5        -2.5
#> 3   doc1           1        3        ,        , PUNCT      NA          NA
#> 4   doc1           1        4       la       il   DET      NA          NA
#> 5   doc1           1        5  qualità  qualità  NOUN      NA          NA
#> 6   doc1           1      6-7    dell'     <NA>  <NA>      NA          NA
#>   dominanza
#> 1      1.17
#> 2      0.17
#> 3        NA
#> 4        NA
#> 5        NA
#> 6        NA
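
The relation between the two scales can be checked by hand. The sketch below takes the raw values for "Ottimo" from the output above and divides them by 4; that the default rescaling is a plain division by the scale maximum is an assumption, although it reproduces the rescaled values printed earlier.

```r
# Raw ELIta VAD values for "Ottimo", copied from the rescale = "none" output.
raw <- c(valenza = 3.5, attivazione = -1.0, dominanza = 1.17)

# Assumption: the default rescaling divides by the scale maximum (4),
# mapping the -4/+4 range onto -1/+1.
rescaled <- raw / 4
rescaled
```

The results (0.875, -0.25 and 0.2925) agree with the first row of the default elita_VAD output.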

Polypathy Handling

Sentix and MAL include words that were originally polypathic (with multiple sentiment scores derived from SentiWordNet synsets), and that have been reduced to a single score (see Basile and Nissim 2013; Basile et al. 2025).

The polypathy index (an ordered factor) indicates the level of variation among the original scores.

You can enable polypathy handling by setting polypathy = TRUE. This will return another column in the output dataframe:

anno_poly <- sentix_annotate(recensioni_tv, 
                            model = model, polypathy = TRUE)

head(anno_poly)
#>   doc_id sentence_id token_id    token    lemma  upos     score polypathy_index
#> 1   doc1           1        1   Ottimo   ottimo   ADJ 1.0000000               0
#> 2   doc1           1        2 prodotto prodotto  NOUN 0.0000000               0
#> 3   doc1           1        3        ,        , PUNCT        NA            <NA>
#> 4   doc1           1        4       la       il   DET        NA            <NA>
#> 5   doc1           1        5  qualità  qualità  NOUN 0.3631757               2
#> 6   doc1           1      6-7    dell'     <NA>  <NA>        NA            <NA>

The index is summarized as an ambiguity score, calculated as n_poly / n_scored; both counts are reported in the output:

sentix_summarize(anno_poly,
                 # the default value
                 ambiguity = 3)
#> # A tibble: 5 × 6
#>   doc_id   score ambiguity n_tokens n_scored n_poly
#>   <chr>    <dbl>     <dbl>    <int>    <int>  <int>
#> 1 doc1    0.274     0.333        12        9      3
#> 2 doc2   -0.139     0.333        15        9      3
#> 3 doc3    0.244     0.111        15        9      1
#> 4 doc4    0.117     0.0833       16       12      1
#> 5 doc5   -0.0187    0.333        15        9      3

A higher ambiguity score in the summarized output indicates that a document relies heavily on words historically associated with mixed or contrasting sentiments, suggesting a more nuanced or complex overall polarity.
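
The ambiguity column can be reproduced from the two counts in the summary above. The sketch below only recomputes the ratio; how n_poly itself is counted from the polypathy index is not shown here.

```r
# Counts copied from the summary printed above.
n_scored <- c(doc1 = 9, doc2 = 9, doc3 = 9, doc4 = 12, doc5 = 9)
n_poly   <- c(doc1 = 3, doc2 = 3, doc3 = 1, doc4 = 1,  doc5 = 3)

# Ambiguity score as defined in the text: n_poly / n_scored.
ambiguity <- n_poly / n_scored
signif(ambiguity, 3)
```

Rounded to three significant digits, the values are 0.333, 0.333, 0.111, 0.0833 and 0.333, matching the ambiguity column.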

References

Basile, Valerio, and Malvina Nissim. 2013. “Sentiment Analysis on Italian Tweets.” In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 100–107. https://aclanthology.org/W13-1614/.
Basile, Valerio, Malvina Nissim, Cristina Bosco, Marco Vassallo, and Giuliano Gabrieli. 2025. “Sentix.” https://github.com/valeriobasile/sentix.