topiclabels topiclabels logo

CRAN R build status

Automated Topic Labeling with Language Models

topiclabels leverages (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios (see also our Vignette).

topiclabels animation

References

Related work: * Grootendorst (2023). Topic Modeling with Llama 2. Create easily interpretable topics with Large Language Models * Li et al. (2023). Can Large Language Models (LLM) label topics from a topic model? * Wanna et al. (2024). TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs

Topic models (selection): * BERTopic * Structural Topic Model * topicmodels * RollingLDA * ldaPrototype

Contribution

This R package is licensed under the GPLv3. For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, …) and feature requests please use the issue tracker. Pull requests are welcome and will be included at the discretion of the author.

Installation

You can install the recent CRAN version using

install.packages("topiclabels")

For installation of the development version use devtools:

devtools::install_github("PetersFritz/topiclabels")

(Quick Start) Example

library("topiclabels")

First of all, you should store your Huggingface token in the variable token. If you do not have a token, create a Huggingface account and generate a token based on this guideline.

token = "" # set your hf token here

We would now like to label two topics, one with the three top terms zidane, figo, kroos and the other with the three top terms gas, power, wind. Here, we show three typical variants of topic representation:

topwords_matrix = matrix(c("zidane", "figo", "kroos", "gas", "power", "wind"), ncol = 2)
topwords_list = list(c("zidane", "figo", "kroos"), c("gas", "power", "wind"))
topwords_vector = c("zidane, figo, kroos", "gas, power, wind")

A common way to represent top terms is a matrix structure.

topwords_matrix
     [,1]     [,2]   
[1,] "zidane" "gas"  
[2,] "figo"   "power"
[3,] "kroos"  "wind" 

For our package, it is not necessary that all topics are characterized by the same number of top terms. For this case, the input must be given via the following list format:

topwords_list
[[1]]
[1] "zidane" "figo"   "kroos" 

[[2]]
[1] "gas"   "power" "wind" 

If you have stored your top terms as vectors (e.g., in a data table), it may look like this:

topwords_vector
[1] "zidane, figo, kroos" "gas, power, wind" 

Using one of the following three calls

label_topics(topwords_matrix, token = token)
label_topics(topwords_list, token = token)
label_topics(as.list(topwords_vector), token = token)

the labels for the two topics can then be generated, which yields

lm_topic_labels object generated using mistralai/Mixtral-8x7B-Instruct-v0.1
 1: Real Madrid Midfielders [zidane, figo, kroos]
 2: Renewable Energy [gas, power, wind]

Beyond this, it is also possible to display the actual generated outputs of the language models, which might be helpful if our default postprocessing function did not generate proper labels for single topics.

obj = label_topics(topwords_matrix, token = token)
names(obj)
# [1] "terms"       
# [2] "prompts"     
# [3] "model"       
# [4] "params"      
# [5] "with_token"  
# [6] "time"        
# [7] "model_output"
# [8] "labels"      
obj$model_output
# [1] "\n\n{\n\"label\": \"Real Madrid Midfielders\"\n}"
# [2] "\n\n{\n\"label\": \"Renewable Energy\"\n}"
obj$labels
# [1] "Real Madrid Midfielders"
# [2] "Renewable Energy"

Feel free to also check the following examples and check our Vignette of the package for further reading.

label_topics(list(c("zidane", "figo", "ronaldo"), c("gas", "power", "wind")), token = token)
label_topics(list("wind", "greta", "hambach"), token = token)
label_topics(list("wind", "fire", "air"), token = token)
label_topics(list("wind", "feuer", "luft"), token = token)
label_topics(list("wind", "feuer", "luft"), context = "Elements of the Earth", token = token)