sentixr is a package designed to simplify sentiment
analysis in Italian using a variety of lexicons: Sentix, MAL, ELIta VAD
and ELIta basic.
Sentix includes 68,190 Italian lemmas (field lemma) with
associated affective scores and an index of polypathy. MAL expands
Sentix with inflected forms from Morph-it!, and can be used
without lemmatization.
ELIta VAD includes scores for 6,905 Italian lexical entries (lemmas and emojis) on the VAD dimensions (Valence, Arousal, and Dominance), while ELIta basic focuses on the eight basic emotions of Plutchik’s wheel together with the dyad love.
This vignette illustrates the core workflow using the functions
sentix_annotate() and sentix_summarize(), and
some of the main features of sentix_annotate().
See also the vignette on using the
package with tidytext and quanteda.
The typical workflow consists of two main steps: annotation and summarization.
frase_ann <- sentix_annotate(testo,
# set the document ID
docid_field = "frase")
head(frase_ann)
#> doc_id sentence_id token_id token lemma upos score
#> 1 frase 1 1 Oggi oggi ADV 0.1988304
#> 2 frase 1 2 è essere AUX 0.1256011
#> 3 frase 1 3 una uno DET 0.0000000
#> 4 frase 1 4 bella bello ADJ 0.7413988
#> 5 frase 1 5 giornata giornata NOUN 0.0000000
#> 6 frase 1 6 . . PUNCT NA

sentix_summarize()

sentix_summarize() computes overall sentiment scores and
auxiliary metrics per document (or other segments, via the argument
by) from the annotated dataframe. The default behavior is
to summarize by document.
By default, sentix_summarize() returns:
score: The average sentiment score for the document.
n_tokens: Total number of tokens (excluding punctuation).
n_scored: Number of tokens found in the lexicon.
To obtain only sentiment scores, set
simplify = TRUE:
sentix_summarize(frase_ann,
simplify = TRUE)
#> # A tibble: 1 × 2
#> doc_id score
#> <chr> <dbl>
#> 1 frase 0.176
To get scores by sentence (or other segments) within each document,
set the by argument:
sentix_summarize(frase_ann,
by = c("doc_id", "sentence_id"))
#> # A tibble: 2 × 5
#> doc_id sentence_id score n_tokens n_scored
#> <chr> <int> <dbl> <int> <int>
#> 1 frase 1 0.213 5 5
#> 2 frase 2 0.131 5 4
Note that udpipe assigns sentence IDs starting from 1 for each document, so sentence IDs will repeat across documents.
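As a rough check of what these columns mean, the first sentence can be summarized by hand in base R. This is only a sketch under assumptions inferred from the printed results, not from the package source: score is taken to be the mean of the non-missing token scores, n_tokens the count of non-punctuation tokens, and n_scored the count of tokens with a score. The object names are illustrative.

```r
# Token-level rows for sentence 1, copied from the annotation output
ann <- data.frame(
  doc_id      = "frase",
  sentence_id = 1L,
  token       = c("Oggi", "è", "una", "bella", "giornata", "."),
  upos        = c("ADV", "AUX", "DET", "ADJ", "NOUN", "PUNCT"),
  score       = c(0.1988304, 0.1256011, 0, 0.7413988, 0, NA),
  stringsAsFactors = FALSE
)

# Assumed definitions (inferred from the printed summary)
words    <- ann[ann$upos != "PUNCT", ]        # drop punctuation
n_tokens <- nrow(words)                       # tokens excluding punctuation
n_scored <- sum(!is.na(words$score))          # tokens found in the lexicon
score    <- mean(words$score, na.rm = TRUE)   # average over scored tokens

round(score, 3)
```

Under these assumptions the hand computation agrees with the first row of the by-sentence summary (score 0.213, with 5 tokens and 5 scored).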

sentix_annotate()

The sentix_annotate() function performs tokenization and
lemmatization (via udpipe), and then joins the result with
one of the available sentiment lexicons. By default, it uses the Sentix
lexicon.
The output is a dataframe where each row is a token. It is a
simplified version of the full udpipe output, plus the
sentiment score(s).
For large corpora, the user may optionally specify the number of
cores to use, via the argument parallel.cores, which is
inherited from udpipe and passed to
udpipe::udpipe().
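One way to pick a value for parallel.cores is to ask the machine how many cores it has and leave one free. The sketch below uses parallel::detectCores() from base R; the variable names and the commented-out call on a character vector named testi are illustrative, not a recommendation from the package.

```r
# detectCores() may return NA on some platforms, so fall back to one core
detected <- parallel::detectCores()
n_cores  <- if (is.na(detected)) 1L else max(1L, detected - 1L)
n_cores

# Hypothetical use with a character vector of documents:
# ann <- sentix_annotate(testi, parallel.cores = n_cores)
```

For short inputs the overhead of starting workers can outweigh the gain, so a single core is usually adequate there.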
If no model is given, the function automatically downloads and uses
the default Italian udpipe model. After the first run, the downloaded
model can be reused by passing model = "local".
To load the downloaded model, or any other udpipe model, manually,
use udpipe::udpipe_load_model() and pass the resulting object as model.
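Loading a model by hand could look like the following sketch. It relies on udpipe::udpipe_download_model() and udpipe::udpipe_load_model(); the helper name load_italian_model, the choice of tempdir() as the model directory, and the "italian-isdt" model are assumptions made for illustration, since the vignette does not say which Italian model is the default.

```r
# Hypothetical helper (not part of sentixr): download the model once,
# then reuse the file found on disk in later calls.
load_italian_model <- function(model_dir = tempdir(),
                               language = "italian-isdt") {  # assumed model name
  existing <- list.files(
    model_dir,
    pattern = paste0("^", language, ".*\\.udpipe$"),
    full.names = TRUE
  )
  if (length(existing) == 0) {
    # Requires the udpipe package and a network connection
    info <- udpipe::udpipe_download_model(language = language,
                                          model_dir = model_dir)
    existing <- info$file_model
  }
  udpipe::udpipe_load_model(file = existing[1])
}

# model <- load_italian_model()
```

The object returned by the helper is what the examples in this vignette pass as model = model.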
The function, like udpipe, accepts as input single texts, multiple texts (a character vector, a list, or a list of tokens), or data frames.
sentix_annotate() simplifies the document ID management
that is normally required by udpipe. In particular, the user
can explicitly pass a vector of IDs using the docid_field
argument, which safely processes them before passing the data to
udpipe::udpipe().
# Multiple texts
testi <- c("Oggi è una bella giornata. Esco a fare una passeggiata",
"Non mi piace la pioggia, mi rende triste.")
sentix_annotate(testi,
# loaded model
model = model) |> head()
#> doc_id sentence_id token_id token lemma upos score
#> 1 doc1 1 1 Oggi oggi ADV 0.1988304
#> 2 doc1 1 2 è essere AUX 0.1256011
#> 3 doc1 1 3 una uno DET 0.0000000
#> 4 doc1 1 4 bella bello ADJ 0.7413988
#> 5 doc1 1 5 giornata giornata NOUN 0.0000000
#> 6 doc1 1 6 . . PUNCT NA
sentix_annotate(testi,
model = model,
# to specify document IDs
docid_field = paste0("doc_", seq_along(testi))
) |> head()
#> doc_id sentence_id token_id token lemma upos score
#> 1 doc_1 1 1 Oggi oggi ADV 0.1988304
#> 2 doc_1 1 2 è essere AUX 0.1256011
#> 3 doc_1 1 3 una uno DET 0.0000000
#> 4 doc_1 1 4 bella bello ADJ 0.7413988
#> 5 doc_1 1 5 giornata giornata NOUN 0.0000000
#> 6 doc_1 1 6 . . PUNCT NA
To get the full udpipe output, set
simplify = FALSE.
While udpipe expects data frames to have columns named
text and doc_id, sentix_annotate() also allows
specifying the input column names, using the text_field and
docid_field arguments.
Note that the function extracts and processes only these two columns,
ignoring other metadata, and that they will be renamed to
text and doc_id in the output.
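Because other columns are dropped, metadata has to be joined back by document ID after annotation. The base R sketch below illustrates the idea on a toy data frame. The column names id, review and stars, the star ratings, and the stand-in tokens table are all hypothetical, and the renaming step only mimics the behavior described above rather than reproducing the package's internals.

```r
# Hypothetical input with non-standard column names and extra metadata
reviews <- data.frame(
  id     = c("r1", "r2"),
  review = c("La tv è molto bella, ma la qualità dell'audio ha delle mancanze.",
             "Il prodotto va benissimo."),
  stars  = c(4, 5),
  stringsAsFactors = FALSE
)

# Keep only the ID and text columns, under the names udpipe expects
prepared <- data.frame(
  doc_id = as.character(reviews$id),
  text   = reviews$review,
  stringsAsFactors = FALSE
)

# Toy token-level table standing in for the annotation output
tokens <- data.frame(
  doc_id = c("r1", "r1", "r2"),
  token  = c("La", "tv", "Il"),
  stringsAsFactors = FALSE
)

# Re-attach the metadata by document ID
with_meta <- merge(tokens, reviews[, c("id", "stars")],
                   by.x = "doc_id", by.y = "id")
with_meta
```

With real data, the call would instead be along the lines of sentix_annotate(reviews, text_field = "review", docid_field = "id"), followed by the same merge.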
data(recensioni_tv)
recensioni_tv
#> doc_id
#> 1 doc1
#> 2 doc2
#> 3 doc3
#> 4 doc4
#> 5 doc5
#> text
#> 1 Ottimo prodotto, la qualità dell'immagine è buona, colori molto vivi.
#> 2 Ho riscontrato subito problemi; mi sono ostinato a fare delle prove, purtroppo senza risultati
#> 3 La tv è molto bella, ma la qualità dell'audio ha delle mancanze.
#> 4 Il prodotto va benissimo. C'è da dire che il costo irrisorio corrisponde ad alcuni limiti.
#> 5 I colori sono eccessivamente saturi, per non parlare dell'audio, a dir poco pessimo!
# Annotate the dataframe
sentix_res <- sentix_annotate(recensioni_tv,
model = model)
head(sentix_res)
#> doc_id sentence_id token_id token lemma upos score
#> 1 doc1 1 1 Ottimo ottimo ADJ 1.0000000
#> 2 doc1 1 2 prodotto prodotto NOUN 0.0000000
#> 3 doc1 1 3 , , PUNCT NA
#> 4 doc1 1 4 la il DET NA
#> 5 doc1 1 5 qualità qualità NOUN 0.3631757
#> 6 doc1 1 6-7 dell' <NA> <NA> NA
Other lexicons available in sentixr can be used with the
dict argument.
The MAL lexicon contains inflected forms rather than lemmas. The function automatically handles this by joining on the token column.
# Use MAL lexicon
anno_mal <- sentix_annotate(recensioni_tv,
model = model, dict = "MAL")
head(anno_mal)
#> doc_id sentence_id token_id token lemma upos score
#> 1 doc1 1 1 Ottimo ottimo ADJ 0.5625000
#> 2 doc1 1 2 prodotto prodotto NOUN 0.1250000
#> 3 doc1 1 3 , , PUNCT NA
#> 4 doc1 1 4 la il DET 0.0000000
#> 5 doc1 1 5 qualità qualità NOUN 0.3631757
#> 6 doc1 1 6-7 dell' <NA> <NA> NA
# Summarize
summary_mal <- sentix_summarize(anno_mal)
summary_mal
#> # A tibble: 5 × 4
#> doc_id score n_tokens n_scored
#> <chr> <dbl> <int> <int>
#> 1 doc1 0.215 12 10
#> 2 doc2 -0.146 15 12
#> 3 doc3 0.163 15 12
#> 4 doc4 0.116 16 10
#> 5 doc5 -0.0185 15 9
When using the ELIta family lexicons, the functions produce scores and statistics for each dimension.
# Use ELIta VAD lexicon
anno_vad <- sentix_annotate(recensioni_tv,
model = model,
dict = "elita_VAD")
head(anno_vad)
#> doc_id sentence_id token_id token lemma upos valenza attivazione
#> 1 doc1 1 1 Ottimo ottimo ADJ 0.875 -0.250
#> 2 doc1 1 2 prodotto prodotto NOUN 0.125 -0.625
#> 3 doc1 1 3 , , PUNCT NA NA
#> 4 doc1 1 4 la il DET NA NA
#> 5 doc1 1 5 qualità qualità NOUN NA NA
#> 6 doc1 1 6-7 dell' <NA> <NA> NA NA
#> dominanza
#> 1 0.2925
#> 2 0.0425
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
# Summarize
sentix_summarize(anno_vad)
#> # A tibble: 5 × 6
#> doc_id valenza attivazione dominanza n_tokens n_scored
#> <chr> <dbl> <dbl> <dbl> <int> <int>
#> 1 doc1 0.500 -0.0971 0.285 12 6
#> 2 doc2 0.0414 0.387 0.0957 15 7
#> 3 doc3 0.402 0.208 0.125 15 3
#> 4 doc4 0.0714 -0.0296 0.0954 16 7
#> 5 doc5 -0.00667 0.0554 0.0342 15 6
elita_VAD scores are automatically rescaled to the -1 to
1 range for consistency with other lexicons. It is also possible to use
the original -4/+4 scale by setting rescale = "none".
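The values printed for "Ottimo" with and without rescaling suggest that the default is a plain linear map from [-4, 4] to [-1, 1], that is, a division by 4. The check below is a sketch based on that assumption and on the printed numbers only; it does not reproduce the package's own rescaling code.

```r
# Scores for "Ottimo" on the original -4/+4 scale, as printed with rescale = "none"
original <- c(valenza = 3.5, attivazione = -1.0, dominanza = 1.17)

# Assumed rescaling: divide by the scale maximum
rescaled <- original / 4
rescaled
```

The result (0.875, -0.25, 0.2925) matches the first row of the default elita_VAD output.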
sentix_annotate(recensioni_tv,
model = model,
dict = "elita_VAD",
rescale = "none") |> head()
#> doc_id sentence_id token_id token lemma upos valenza attivazione
#> 1 doc1 1 1 Ottimo ottimo ADJ 3.5 -1.0
#> 2 doc1 1 2 prodotto prodotto NOUN 0.5 -2.5
#> 3 doc1 1 3 , , PUNCT NA NA
#> 4 doc1 1 4 la il DET NA NA
#> 5 doc1 1 5 qualità qualità NOUN NA NA
#> 6 doc1 1 6-7 dell' <NA> <NA> NA NA
#> dominanza
#> 1 1.17
#> 2 0.17
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
Sentix and MAL include words that were originally polypathic (with multiple sentiment scores derived from SentiWordNet synsets), and that have been reduced to a single score (see Basile and Nissim 2013; Basile et al. 2025).
The polypathy index (an ordered factor) indicates the level of variation among the original scores.
You can enable polypathy handling by setting
polypathy = TRUE. This adds a polypathy_index column to the
output dataframe:
anno_poly <- sentix_annotate(recensioni_tv,
model = model, polypathy = TRUE)
head(anno_poly)
#> doc_id sentence_id token_id token lemma upos score polypathy_index
#> 1 doc1 1 1 Ottimo ottimo ADJ 1.0000000 0
#> 2 doc1 1 2 prodotto prodotto NOUN 0.0000000 0
#> 3 doc1 1 3 , , PUNCT NA <NA>
#> 4 doc1 1 4 la il DET NA <NA>
#> 5 doc1 1 5 qualità qualità NOUN 0.3631757 2
#> 6 doc1 1 6-7 dell' <NA> <NA> NA <NA>
sentix_summarize() condenses the index into an ambiguity score,
calculated as n_poly / n_scored, where:
n_poly: Number of ambiguous tokens, based on the ambiguity level
setting (default = 3, the highest ambiguity level);
n_scored: Number of tokens found in the lexicon.
sentix_summarize(anno_poly,
# the default value
ambiguity = 3)
#> # A tibble: 5 × 6
#> doc_id score ambiguity n_tokens n_scored n_poly
#> <chr> <dbl> <dbl> <int> <int> <int>
#> 1 doc1 0.274 0.333 12 9 3
#> 2 doc2 -0.139 0.333 15 9 3
#> 3 doc3 0.244 0.111 15 9 1
#> 4 doc4 0.117 0.0833 16 12 1
#> 5 doc5 -0.0187 0.333 15 9 3
A higher ambiguity score in the summarized output
indicates that a document relies heavily on words historically
associated with mixed or contrasting sentiments, suggesting a more
nuanced or complex overall polarity.
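The ratio n_poly / n_scored can be reproduced by hand on a toy annotation. The sketch below makes two assumptions that the vignette does not state explicitly: the polypathy index has the levels "0" to "3", and a token counts as ambiguous when its index is at or above the chosen level. The scores and index values are invented for illustration.

```r
# Toy annotation: 11 tokens, two of them missing from the lexicon (hypothetical values)
ann <- data.frame(
  score = c(1.0, 0.0, NA, 0.36, 0.5, -0.2, NA, 0.1, 0.3, -0.4, 0.2)
)
ann$polypathy_index <- factor(
  c("0", "0", NA, "2", "3", "3", NA, "1", "3", "0", "0"),
  levels = c("0", "1", "2", "3"),   # assumed level set
  ordered = TRUE
)

ambiguity_level <- "3"

n_scored  <- sum(!is.na(ann$score))                                   # tokens with a score
n_poly    <- sum(ann$polypathy_index >= ambiguity_level, na.rm = TRUE) # ambiguous tokens
ambiguity <- n_poly / n_scored
ambiguity
```

Here 3 of the 9 scored tokens reach level 3, giving an ambiguity of about 0.333, the same ratio reported for doc1 in the summary table.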