LDABiplots is an extraction, analysis, and visualization tool for the exploratory analysis of news published on the web by digital newspapers. By extracting data from the web (Bradley et al. 2019), it allows fitting the Latent Dirichlet Allocation (LDA) probabilistic model (Blei, Ng and Jordan, 2003) and generating Biplot (Gabriel, 1971) and HJ-Biplot (Galindo-Villardón, 1986) visualizations of the main topics in the headlines of that news.
LDABiplots helps optimize the data extraction from the web, the LDA modeling routine, and the generation of Biplot visualizations in an interactive way, for users who are not familiar with R.
To install the stable version from the Comprehensive R Archive Network (CRAN):
# install.packages("LDABiplots")
# library(LDABiplots)
Once the library is loaded, to use the web interface, type in the R console
LDABiplots allows us to extract data from the web page www.google.com; the data come from the news section of the Google search engine. For users working with a different extraction page, LDABiplots also allows loading files in Excel format.
When using LDABiplots, set your computer's Google search engine to Advanced Search, and set the region and preferred search language in the news section for better extraction results (Video 2).
In LDABiplots, select the Load web data option in the Import or Load Data tab and write the search keywords (use a maximum of 4 keywords for better search performance). Select the search language in Choose Language and the number of pages to extract (by default Google shows 10 news items per page), then select Run to execute the search and extraction (Video 3).
To exemplify the operation of LDABiplots, we will extract from the web the news related to “covid, coronavirus, France”, as shown in video 3. This extraction produces two tables: one lists the newspapers with their respective number of news items, and the other shows each newspaper with its news headlines. These tables can be downloaded in various formats from the application. To process this data we proceed as follows:
Selection of Digital Newspapers to Analyze: it is recommended to select the newspapers with the largest number of digital news items.
Inclusion of n-grams: an n-gram is a contiguous sequence of n words. Depending on your study, select unigrams, bigrams, or trigrams; bigrams are selected for the example.
Remove numbers: this option removes the numbers, in case they are not informative.
Select language for stopwords: stopwords are words that have no lexical meaning and that appear with high frequency in the news, such as articles or pronouns. Select the language in which the news was extracted; for the example, English.
Add stopword words: this eliminates from the study words that are highly frequent but do not add value to it; in the example, the words covid and France were added.
Select Lemmatization: this reduces words to their base form. It should be used with caution; lemmatization was not selected for the example.
Selecting Sparcity: this eliminates terms that appear in very few news items before the models are generated, which improves computational performance because it removes information that does not contribute to the model. In our case a sparsity of 0.985 (98.5%) was used, that is, the DTM is generated with the terms that appear in at least 1.5% of the news headlines.
Create DTM: finally, the document-term matrix (DTM) is generated. Once the process is finished, the dimensions of the DTM are shown in a summary table; see video 4.
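The preprocessing steps above can be sketched in base R. This is only an illustration under stated assumptions, not the code LDABiplots runs internally: the headlines, the tiny stopword list, the helper names `tokenize` and `ngrams`, and the sparsity value of 0.70 (chosen so that the toy data shows an effect) are all hypothetical.

```r
# Toy headlines standing in for extracted news (hypothetical data).
headlines <- c(
  "covid cases rise in france",
  "france extends covid lockdown",
  "vaccine rollout begins in paris",
  "paris hospitals report 120 new cases"
)

stopwords_en    <- c("in", "the", "of", "a")  # tiny illustrative stopword list
extra_stopwords <- c("covid", "france")       # user-added stopwords, as in the example

# Lowercase, split into tokens, then drop numbers and stopwords.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens <- tokens[!grepl("^[0-9]+$", tokens)]
  tokens[!tokens %in% c(stopwords_en, extra_stopwords)]
}

# Contiguous sequences of n tokens, joined with "_".
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Each document keeps its unigrams plus its bigrams.
docs <- lapply(headlines, function(h) {
  tk <- tokenize(h)
  c(tk, ngrams(tk, 2))
})

# Document-term matrix: one row per headline, one column per term.
vocab <- sort(unique(unlist(docs)))
dtm <- t(vapply(docs,
                function(d) as.numeric(table(factor(d, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Sparsity filter: keep terms present in at least (1 - sparsity) of the documents.
sparsity <- 0.70
doc_freq <- colSums(dtm > 0) / nrow(dtm)
dtm_filtered <- dtm[, doc_freq >= (1 - sparsity), drop = FALSE]

dim(dtm)           # dimensions before filtering
dim(dtm_filtered)  # dimensions after filtering
```

With a sparsity of 0.985, as in the example of this document, the same filter would keep every term that appears in at least 1.5% of the headlines.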
After processing the original data of the selected newspapers, a corpus matrix of 402 unique terms was obtained, out of the 444 terms of the original corpus; this reduction gives better computational performance in the following analyses. The DTM obtained can be viewed in tabular form by selecting Document Term Matrix Visualization in the Data tab, and can be downloaded in different formats, such as Excel, CSV, and pdf. This table shows the frequency of each term, the number of documents in which each term appears, and the IDF (inverse document frequency), which is a measure of the importance of the term.
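As a rough illustration of the columns in that table, the sketch below computes term frequency, document frequency, and IDF for a toy DTM. It assumes the common definition IDF = log(N / document frequency); the exact formula used by LDABiplots is not stated here, and the matrix values are hypothetical.

```r
# Toy DTM: 4 documents (rows) by 3 terms (columns), hypothetical counts.
dtm <- matrix(c(2, 0, 1,
                1, 1, 0,
                0, 3, 0,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("cases", "vaccine", "paris")))

term_summary <- data.frame(
  term      = colnames(dtm),
  term_freq = colSums(dtm),      # total occurrences of the term
  doc_freq  = colSums(dtm > 0),  # number of documents containing the term
  row.names = NULL
)

# Assumed definition: natural log of (number of documents / document frequency).
term_summary$idf <- log(nrow(dtm) / term_summary$doc_freq)

term_summary
```

Under this definition, a term that appears in few documents gets a higher IDF than a term that appears in most of them.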
The Barplot option generates an ordered bar graph in which the words are displayed according to their frequency; the color of the bars can be changed, and the graph can be downloaded in various formats with the export button. The Worcloud option shows a word cloud, which can be modified by selecting the number of words to show; with the export button, the graph can be downloaded in various formats. Co-occurrence displays a word co-occurrence plot, which plots the sparse term correlations as a graph structure based on the glasso procedure (Lasso Plot), reducing the correlation matrix to keep only the relevant correlations between terms. The Select Number option selects the number of terms for the correlation graph, and Download the plot downloads the graph in png or pdf format. See video 5.
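The quantities behind the co-occurrence plot can be sketched as follows. This only computes raw co-occurrence counts and the dense term correlation matrix for a hypothetical toy DTM; the glasso step that sparsifies the correlations is not reproduced here.

```r
# Toy DTM: 4 documents by 3 terms (hypothetical counts).
dtm <- matrix(c(1, 1, 0,
                1, 1, 0,
                0, 0, 1,
                1, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("cases", "rise", "paris")))

# Co-occurrence: number of documents in which two terms appear together.
# The diagonal holds each term's document frequency.
presence <- (dtm > 0) * 1
cooc <- crossprod(presence)

# Dense term-term correlations; a glasso-type procedure would shrink the
# weak ones to zero before drawing the graph.
term_cor <- cor(dtm)

cooc
round(term_cor, 2)
```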
For the inference and selection of the optimal number of topics for the LDA model, we start from the DTM, taking into account that a small K can generate broad and heterogeneous topics, while a high K will produce specific topics. LDABiplots infers this optimal K from topic coherence, a measure of topic quality from the point of view of human interpretability. It is based on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts. The best number of topics is the one that offers the highest coherence; this is done based on probability theory and consists of fitting several models with different numbers of topics and calculating the coherence of each of them. To choose this number, specify the models that you want to check in the Inference section under Candidate number of topics K, giving the range of topics to test. In the Parameters section Gibbs sampling control, select the number of iterations of the Gibbs sampler (Iteration) and the number N of initial samples to discard (Burn-in), choosing an N that is big enough.
LDABiplots uses the value of 0.1 for the calculation (the example below uses the same value for Alpha); see video 6.
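To make the coherence criterion more concrete, here is a minimal sketch of one probabilistic coherence measure: for the top terms of a topic, it averages P(term j | term i) - P(term j) over ordered pairs, with probabilities estimated from document co-occurrence. That LDABiplots uses exactly this formula is an assumption; the helper `prob_coherence`, the toy DTM, and all coherence values other than the 0.069 reported in this document are hypothetical.

```r
# Toy DTM: 6 documents by 4 terms (hypothetical counts).
dtm <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 0,
                1, 0, 0, 0,
                0, 0, 1, 1,
                0, 0, 1, 1,
                0, 1, 0, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(NULL, c("cases", "hospital", "vaccine", "dose")))

# Coherence of one topic, given its top terms ordered by probability.
prob_coherence <- function(top_terms, dtm) {
  presence <- dtm[, top_terms, drop = FALSE] > 0
  n_docs <- nrow(presence)
  diffs <- c()
  for (i in seq_len(length(top_terms) - 1)) {
    for (j in (i + 1):length(top_terms)) {
      p_j         <- sum(presence[, j]) / n_docs
      p_j_given_i <- sum(presence[, i] & presence[, j]) / sum(presence[, i])
      diffs <- c(diffs, p_j_given_i - p_j)
    }
  }
  mean(diffs)
}

coherent_topic   <- prob_coherence(c("cases", "hospital"), dtm)  # terms that co-occur
incoherent_topic <- prob_coherence(c("cases", "vaccine"), dtm)   # terms that never co-occur

# Choosing K: fit one model per candidate K, average the coherence of its
# topics, and keep the K with the highest value. Only the 0.069 for K = 4
# comes from this document; the other values are hypothetical placeholders.
coherence_by_k <- c("2" = 0.031, "3" = 0.052, "4" = 0.069, "5" = 0.047)
best_k <- as.integer(names(which.max(coherence_by_k)))
```

Terms that tend to appear in the same headlines give a positive coherence, while terms that never co-occur give a negative one, which is why the candidate K with the highest average coherence is preferred.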
Once the candidate numbers of topics have been evaluated, the highest coherence obtained, 0.069, indicates that the best number of topics is 4. With this optimal K, the LDA model is generated from the DTM. You must define the parameters as in the inference step; for the example, 100 iterations and a Burn-in of 5 were selected, as well as an Alpha of 0.1. LDABiplots returns two matrices. The first is the Theta matrix, which shows in the columns an identifier of the news of the analyzed newspapers and in the rows a distribution of topics in the analyzed documents. The other is the phi matrix, whose rows represent a distribution of words over the topics.
Both matrices can be downloaded in the Tabular result section. Before downloading them, you can set the number of terms in Select number of term, the number of labels in Select number of label, and the value of the assignments in Select Assignments, to control the number of words and labels that you want to observe and download.
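Once downloaded, the two matrices can be summarized with a few lines of base R. The sketch below uses hypothetical toy values and assumes a layout with topics in the rows of phi and documents in the rows of theta; transpose the matrices first if your download is oriented differently.

```r
# phi: one row per topic, each row a probability distribution over terms.
phi <- matrix(c(0.50, 0.30, 0.15, 0.05,
                0.05, 0.10, 0.35, 0.50),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"),
                              c("cases", "hospital", "vaccine", "dose")))

# theta: one row per news item, each row a probability distribution over topics.
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("news_1", "news_2", "news_3"),
                                c("t_1", "t_2")))

# Two most probable terms per topic (one column per topic).
top_terms <- apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])

# Most probable topic for each news item.
dominant_topic <- colnames(theta)[max.col(theta, ties.method = "first")]

top_terms
dominant_topic
```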
In Worcloud we can observe, through a word cloud, which words have the greatest weight in each of the topics. In Heatmap we observe, through a heat map, the probabilities of each topic belonging to each of the newspapers; the color scale shows which topics are found more in a particular digital newspaper. In the Cluster tab, you can see the grouping of the topics found. Select the grouping method in Agglomeration method; the methods included in LDABiplots are complete, single, ward.D, ward.D2, average, mcquitty, median, and centroid. Ward's minimum variance method aims to find compact, spherical groups. The complete method finds similar groups. The single method, which is closely related to the minimal spanning tree, adopts a friend-of-friends grouping strategy. The other methods can be thought of as targeting groups with features somewhere between the single and complete methods. The median and centroid methods do not lead to a monotonic distance measure or, equivalently, the resulting dendrograms can have inversions that are hard to interpret. In the type of plot section, you can select the type of graph to display: rectangle draws rectangles around the branches of a dendrogram, highlighting the corresponding groups, and circular generates a circular, phylogenetic-style tree that shows how the topics are related to each other. A scroll bar selects the number of clusters to form among the topics, and the plot can be downloaded in pdf or png format. See video 7.
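The agglomeration methods listed above are the ones accepted by `hclust()` in base R, so the grouping of topics can be sketched as follows. The toy phi matrix and the use of Euclidean distance between topic-word distributions are assumptions made for illustration; the distance LDABiplots uses is not stated here.

```r
# Toy phi: 4 topics (rows) described by probabilities over 4 terms.
phi <- matrix(c(0.50, 0.30, 0.15, 0.05,
                0.45, 0.35, 0.10, 0.10,
                0.05, 0.10, 0.35, 0.50,
                0.10, 0.05, 0.45, 0.40),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("t_", 1:4),
                              c("cases", "hospital", "vaccine", "dose")))

# Distance between topics (Euclidean here, as an assumption).
topic_dist <- dist(phi)

# Hierarchical clustering with one of the agglomeration methods.
tree <- hclust(topic_dist, method = "ward.D2")

# Cut the dendrogram into the chosen number of clusters.
groups <- cutree(tree, k = 2)
groups

# In an interactive session, the "rectangle" view corresponds to:
# plot(tree); rect.hclust(tree, k = 2)
```

Changing `method` to `"single"`, `"complete"`, `"average"`, `"mcquitty"`, `"median"`, `"centroid"`, or `"ward.D"` reproduces the other options.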
Biplot graphs approximate the distribution of a multivariate sample in a reduced-dimension space and superimpose on it representations of the variables on which the sample is measured. This graph displays the information of the rows (represented by points, the row markers) and of the columns (represented by vectors, the column markers). LDABiplots allows us to display, graphically and in tabular form, the results obtained when processing the Biplots. We select the desired Biplot from three types. In the JK-Biplot, the coordinates of the rows are the coordinates on the principal components and the coordinates of the columns are the eigenvectors of the covariance or correlation matrix; the Euclidean distances between row points in the Biplot approximate the Euclidean distances between rows in the multidimensional space. In the GH-Biplot, the coordinates of the rows are standardized, and the distance between rows approximates the Mahalanobis distance in the multidimensional space. The HJ-Biplot achieves a high quality of representation for both rows and columns; because both have identical goodness of fit, the row-column relationship can be interpreted.
The distances between row markers are interpreted as an inverse function of their similarities, so that neighboring markers are more similar.
The length of the vectors (column markers) approximates the standard deviation of the newspapers.
The cosines of the angles between the column markers approximate the correlations between the newspapers: acute angles indicate a high positive correlation, obtuse angles indicate a negative correlation, and right angles indicate uncorrelated variables.
The order of the orthogonal projections of the points (row markers) onto a vector (column marker) approximates the order of the row elements (centers) in that column. The greater the projection of a point on a vector, the more the center deviates from the mean of that newspaper.
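The three Biplot types and the interpretation rules above can be illustrated with a singular value decomposition in base R. This is a sketch under assumptions, not the internal code of LDABiplots: the toy matrix (topics in rows, newspapers in columns) is hypothetical, the matrix is only column-centered, and no additional scaling constants are applied to the markers.

```r
# Toy data: 5 topics (rows) by 3 newspapers (columns), hypothetical values.
X <- matrix(c(0.40, 0.10, 0.30,
              0.20, 0.35, 0.25,
              0.10, 0.30, 0.05,
              0.25, 0.05, 0.30,
              0.05, 0.20, 0.10),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("t_", 1:5),
                            c("paper_A", "paper_B", "paper_C")))

# Column-center, then decompose Xc = U D V'.
Xc <- scale(X, center = TRUE, scale = FALSE)
s  <- svd(Xc)
U  <- s$u
D  <- diag(s$d)
V  <- s$v

# Row and column markers of each Biplot type.
jk <- list(rows = U %*% D, cols = V)        # rows in principal coordinates
gh <- list(rows = U,       cols = V %*% D)  # columns in principal coordinates
hj <- list(rows = U %*% D, cols = V %*% D)  # both in principal coordinates

# Cosines of the angles between column markers (all dimensions retained)
# reproduce the correlations between the columns of X.
cosines <- cov2cor(tcrossprod(hj$cols))

# A two-dimensional plot uses the first two columns of each set of markers.
hj_rows_2d <- hj$rows[, 1:2]
hj_cols_2d <- hj$cols[, 1:2]
```

In two dimensions these relationships hold only approximately, which is why the rules above speak of markers that approximate distances, standard deviations, and correlations.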
Before generating the Biplot representation, it is necessary to choose how the data matrix will be centered and scaled. LDABiplots gives us 4 options, namely: scale, center, center_scale, or none.
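One plausible reading of these four options is sketched below. The helper `transform_matrix` is hypothetical, and the interpretation of each option (in particular, that scale divides each column by its standard deviation without centering) is an assumption rather than a description of the package's internals.

```r
# Hypothetical helper illustrating the four centering/scaling options.
transform_matrix <- function(X, option = c("center_scale", "center", "scale", "none")) {
  option <- match.arg(option)
  switch(option,
         center       = scale(X, center = TRUE,  scale = FALSE),            # subtract column means
         scale        = scale(X, center = FALSE, scale = apply(X, 2, sd)),  # divide by column sd
         center_scale = scale(X, center = TRUE,  scale = TRUE),             # standardize columns
         none         = X)                                                  # leave unchanged
}

# Toy matrix with two columns on very different scales (hypothetical values).
X <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 50),
            ncol = 2)

centered     <- transform_matrix(X, "center")
standardized <- transform_matrix(X, "center_scale")

colMeans(centered)          # approximately 0 for every column
apply(standardized, 2, sd)  # 1 for every column
```

Standardizing is usually preferable when the columns are measured on different scales, since otherwise the column with the largest variance dominates the first axes.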
Clicking Run generates the selected Biplots and the results in tabular form, which can be downloaded in different formats. The tabular results shown are:
Eigenvalues: the vector of eigenvalues.
Variance explained: a vector containing the proportion of variance explained by the first 1, 2, ..., K principal components.
Loadings: the loadings of the principal components.
Coordinates of individuals: a matrix with the coordinates of the individuals.
Coordinates of variables: a matrix with the coordinates of the variables.
In the Biplot tab, you will find the graphical representation generated according to the previously selected parameters. In the Options to Customize the Biplot section you can change the theme, the axes to display in Axis-X and Axis-Y, the color of the column markers, and the color of the row markers; you can also change the size of the markers and add labels for both sets of markers in different sizes. The representation can be downloaded in png or pdf format. See video 8.
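The eigenvalues and the explained variance reported in those tables can be related to each other as follows. This is a sketch with a hypothetical toy matrix, assuming a column-centered matrix and eigenvalues of the sample covariance matrix (division by n - 1); the package may use a different normalization.

```r
# Toy data: 5 rows by 3 columns (hypothetical values).
X <- matrix(c(0.40, 0.10, 0.30,
              0.20, 0.35, 0.25,
              0.10, 0.30, 0.05,
              0.25, 0.05, 0.30,
              0.05, 0.20, 0.10),
            nrow = 5, byrow = TRUE)

n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)

# Eigenvalues of the covariance matrix, obtained from the singular values.
eigenvalues <- svd(Xc)$d^2 / (n - 1)

# Proportion of variance explained by each component, and cumulatively
# by the first 1, 2, ..., K components.
explained  <- eigenvalues / sum(eigenvalues)
cumulative <- cumsum(explained)

round(rbind(eigenvalues, explained, cumulative), 4)
```

The cumulative value for the first two components indicates how much of the total variability a two-dimensional Biplot retains.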
If you use LDABiplots, please cite it in your work.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Galindo-Villardón, P. (1986). Una alternativa de representación simultánea: HJ-Biplot (An alternative of simultaneous representation: HJ-Biplot). Qüestiió, 10, 13-23.
Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453-467.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.