`LDABiplots` is an extraction, analysis, and visualization tool for the
exploratory analysis of news published on the web by digital newspapers.
By extracting data from the web (Bradley et al., 2019), it allows the
implementation of the Latent Dirichlet Allocation (LDA) probabilistic
model (Blei, Ng and Jordan, 2003) and the generation of Biplot
(Gabriel, 1971) and HJ-Biplot (Galindo-Villardón, 1986) visualizations
of the main topics of the headlines of the news published on the web.
`LDABiplots` makes the data extraction from the web, the LDA modeling
routine, and the generation of Biplot visualizations available in an
interactive way for users who are not familiar with R.

To install the stable version from the Comprehensive R Archive Network (CRAN):

```
install.packages("LDABiplots")  # install the stable release from CRAN
library(LDABiplots)             # load the package
```

Once the library is loaded, launch the web interface by typing in the R console:

`runLDABiplots()`

`LDABiplots` allows us to extract data from the web and also allows the
loading of files in Excel format.
The data can be imported from a file in the directory by selecting, in
the *Import or Load Data* tab, the *Import excel file* option, then
choosing the file to upload with *Browse* and the worksheet where the
data is located in *Worksheet Name*. The data to be uploaded must have
the header and format shown in Figure 1.

Video 1. Importing Data


Before performing news extraction through *Advanced Search*, it is
recommended that you set the region and preferred search language of
your computer's Google search engine, in the news section, for better
extraction results (Video 2).

Video 2. Browser Configuration

The data can be extracted by web scraping directly from the
*Import or Load Data* tab with the *Load web data* option: write the
search keywords (use a maximum of 4 keywords for better search
performance), select the search language in *Choose Language*, and set
the number of result pages to extract (by default Google shows 10 news
items per page). Select *Run* to execute the search and extraction
(Video 3).

Video 3. Webscraping News from the WEB

To exemplify the operation of `LDABiplots`, we will extract from the web
the news related to “*covid, coronavirus, France*”, as shown in Video 3.
This extraction lists in two tables the number of newspapers with their
respective frequency of news, as well as each of the newspapers with the
headlines of its news. These tables can be downloaded in various formats
from the application. To process this data we proceed as follows:

**Selection of Digital Newspapers to Analyze**: it is recommended to
select the newspapers that published the most news.

**Inclusion of n-grams**: an n-gram is a contiguous sequence of n words.
According to your study, you must select between *unigrams, bigrams, or
trigrams*; for the example, bigrams are selected.

**Remove numbers**: this option allows us to remove the numbers, in case
they are not informative.

**Select language for stopwords**: stopwords are words that have no
lexical meaning and that appear with high frequency in the news, such as
articles or pronouns. We select them according to the language in which
the news was extracted, in this case English.

**Add stopword words**: this allows us to eliminate from the study words
that can be considered highly frequent and that do not add value to the
study. In the example, the words *covid* and *France* were added.

**Select Lemmatization**: this allows us to reduce the words to their
basic form. It should be used with caution; in the example,
lemmatization was not selected.

**Select Sparsity**: this allows us to eliminate terms that appear in
very few news items before generating the models. It improves
computational performance, since it eliminates information that does not
contribute to the model. In our case a *sparsity* of 0.985 (98.5%) was
used, that is, the DTM is generated with the terms that appear in at
least 1.5% of the news headlines.
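The preprocessing steps above can be sketched in base R. This is only an illustration under stated assumptions, not the code `LDABiplots` runs internally: the headlines are invented, the helper names `tokenize` and `bigrams` are hypothetical, bigrams are built after stopword removal, and the sparsity threshold is set to 0.5 instead of 0.985 because the toy corpus has only four headlines.

```r
# Toy headlines (invented for illustration only)
headlines <- c(
  "France extends covid restrictions in Paris",
  "Coronavirus cases rise again in France",
  "New covid vaccine rollout begins",
  "Paris hospitals report coronavirus cases rise"
)

# Stopword list, including the two user-added words from the example
stop_words <- c("in", "again", "new", "covid", "france")

# Lowercase, drop digits and punctuation, split into words, remove stopwords
tokenize <- function(x) {
  x <- gsub("[^a-z ]", " ", tolower(x))
  words <- strsplit(x, " +")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stop_words]
}

# Contiguous pairs of words joined with "_"
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1), sep = "_")
}

# Unigrams plus bigrams for every headline
tokens <- lapply(headlines, function(h) {
  w <- tokenize(h)
  c(w, bigrams(w))
})

# Document-term matrix of raw counts (documents in rows, terms in columns)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(dtm) <- list(paste0("doc", seq_along(headlines)), vocab)

# Keep only the terms present in at least (1 - sparsity) of the documents
sparsity <- 0.5
doc_share <- colMeans(dtm > 0)
dtm_reduced <- dtm[, doc_share >= 1 - sparsity, drop = FALSE]
```

Comparing `ncol(dtm)` with `ncol(dtm_reduced)` shows how many terms the sparsity filter removes.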

After processing the original data of the selected newspapers, a corpus
matrix of 402 unique terms has been obtained, out of the 444 terms of
the original corpus, which gives better computational performance in the
following analyses. The resulting DTM can be viewed in tabular form by
selecting *Document Term Matrix Visualization* in the *Data* tab, and
can be downloaded in different formats, such as Excel, CSV, and PDF.
This DTM shows the frequencies of the terms, the number of documents in
which each term appears, and the inverse document frequency (IDF), which
is a measure of the importance of the term.
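The three quantities shown in the DTM view can be computed from a count matrix as sketched below. The counts are invented, and the IDF formula used, `log(N / document frequency)`, is one common definition; it is an assumption that the application uses exactly this variant.

```r
# Toy document-term matrix: 3 documents (rows) x 3 terms (columns), invented counts
dtm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("vaccine", "lockdown", "hospital"))
)

term_freq <- colSums(dtm)                # total occurrences of each term
doc_freq  <- colSums(dtm > 0)            # number of documents containing the term
idf       <- log(nrow(dtm) / doc_freq)   # rarer terms receive a larger IDF

summary_table <- data.frame(term_freq, doc_freq, idf)
```

In this toy matrix *lockdown* appears in a single document, so it has the largest IDF of the three terms.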

The *Barplot* option generates an ordered bar graph in which the words
are displayed according to their frequency. The color of the bars can be
changed, and the graph can be downloaded in various formats with the
*export* button. The *Wordcloud* option shows a word cloud, which can be
modified by selecting the number of words to show; with the *export*
button, the graph can be downloaded in various formats.

For the inference and selection of the optimal number of topics for the
LDA model, we start from the DTM, taking into account that a small *K*
can generate broad and heterogeneous topics, while a large *K* will
produce very specific topics. `LDABiplots` obtains this optimal *K* from
the topic coherence, a measure of the quality of a topic from the point
of view of human interpretability. It is based on the distributional
hypothesis, which states that words with similar meanings tend to
co-occur in similar contexts. The best number of topics is the one that
offers the highest coherence; this is done on the basis of probability
theory and consists of fitting several models with different numbers of
topics and calculating the coherence of each of them. To obtain this
number, the models to be checked must be parameterized in the
*Inference* section: in *Candidate number of topics K*, set the range of
topics to test; in the *Parameters section Gibbs sampling* control,
select the number of iterations of the Gibbs sampler (*Iteration*) and
the number of initial samples N to discard (*Burn-in*), choosing an N
that is large enough.
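The selection rule described above (compute a coherence value per candidate model and keep the *K* with the highest one) can be sketched as follows. The coherence function is a simplified probabilistic coherence based on document co-occurrence; whether `LDABiplots` computes coherence in exactly this way is an assumption. The document-term matrix and the top terms per topic are invented stand-ins for fitted models, and `prob_coherence` is a hypothetical helper name.

```r
# Toy document-term matrix (invented): rows are headlines, columns are terms
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("vaccine", "dose", "lockdown", "curfew"))
)

# Coherence of one topic: mean of P(w_j | w_i) - P(w_j) over the pairs
# (i before j) of the topic's top terms, using document co-occurrence.
# Assumes every top term occurs in at least one document.
prob_coherence <- function(top_terms, dtm) {
  present <- dtm[, top_terms, drop = FALSE] > 0
  diffs <- numeric(0)
  for (i in seq_len(length(top_terms) - 1)) {
    for (j in (i + 1):length(top_terms)) {
      p_j <- mean(present[, j])
      p_j_given_i <- sum(present[, i] & present[, j]) / sum(present[, i])
      diffs <- c(diffs, p_j_given_i - p_j)
    }
  }
  mean(diffs)
}

# Hypothetical top terms per topic for two candidate values of K
candidates <- list(
  "2" = list(c("vaccine", "dose"), c("lockdown", "curfew")),
  "3" = list(c("vaccine", "lockdown"), c("dose", "curfew"), c("vaccine", "curfew"))
)

# Average coherence of each candidate model, then keep the K with the maximum
model_coherence <- sapply(candidates, function(topics) {
  mean(sapply(topics, prob_coherence, dtm = dtm))
})
best_k <- as.integer(names(which.max(model_coherence)))
```

Topics whose top terms appear together in the same headlines score higher than topics that mix terms from unrelated headlines, which is why the two-topic candidate is preferred here.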

Based on the coherence obtained, 0.069, the best number of topics was
inferred to be 4. With this optimal *K*, the LDA model is generated from
the DTM. You must define the parameters in a similar way to the
inference step; for the example, 100 iterations and a Burn-in of 5 were
selected, as well as an Alpha of 0.1, after evaluating the optimal *K*
according to the determined rules. The result obtained with `LDABiplots`
is two matrices. The first is the Theta matrix, which shows in the
columns an identifier of the news items of the analyzed newspapers and
in the rows the distribution of topics in the analyzed documents. The
second is the Phi matrix, whose rows represent the distribution of words
over the topics.

Both matrices can be downloaded in the *Tabular result* section. Before
downloading, you can set the number of terms in *Select number of term*,
the number of labels in *Select number of label*, and the value of the
assignments in *Select Assignments*, to parameterize the number of words
and labels that you want to observe and download.

In *Wordcloud* a word cloud shows which words have the greatest weight
in each of the topics. In *Heatmap* a heat map shows the probability of
each topic belonging to each of the newspapers; according to the color
scale shown, it can be seen which topics are found more in a particular
digital newspaper.

Biplot graphs approximate the distribution of a multivariate sample in a
reduced-dimension space and superimpose on it representations of the
variables on which the sample is measured. This graph displays the
information of the rows (represented by points, row markers) and of the
columns (represented by vectors, column markers). `LDABiplots` allows us
to display, graphically and in tables, the results obtained when
processing the Biplots. We select the desired Biplot among the
following.

The *JK-Biplot*, where the coordinates of the rows are the coordinates
on the principal components and the coordinates of the columns are the
eigenvectors of the covariance or correlation matrix. The Euclidean
distances between row points in the Biplot approximate the Euclidean
distances between rows in multidimensional space.

The *GH-Biplot*, where the coordinates of the rows are standardized and
the distance between rows approximates the Mahalanobis distance in
multidimensional space.

The *HJ-Biplot*, which generates a high quality of representation for
both rows and columns; because both have identical goodness of fit, it
is possible to interpret the row-column relationship.
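The three variants differ only in how the singular values are shared between row and column markers. A minimal base R sketch under stated assumptions follows: the data matrix is random, the markers come from the singular value decomposition of the column-centered matrix, and no further rescaling is applied. The internal implementation of `LDABiplots` may scale the markers differently.

```r
set.seed(1)
# Toy data matrix (random values): 6 row elements x 4 columns
X <- matrix(rnorm(24), nrow = 6, ncol = 4,
            dimnames = list(paste0("row", 1:6), paste0("paper", 1:4)))

Xc  <- scale(X, center = TRUE, scale = FALSE)  # column-centered matrix
dec <- svd(Xc)                                 # Xc = U D V'

# Keep the first two dimensions for a planar biplot
U <- dec$u[, 1:2]
V <- dec$v[, 1:2]
D <- diag(dec$d[1:2])

jk <- list(rows = U %*% D, cols = V)        # JK: rows in principal coordinates
gh <- list(rows = U,       cols = V %*% D)  # GH: columns in principal coordinates
hj <- list(rows = U %*% D, cols = V %*% D)  # HJ: both in principal coordinates
```

In the JK and GH variants the inner product of row and column markers gives the same rank-2 approximation of the centered matrix. The HJ-Biplot gives up that inner-product property in exchange for the same quality of representation for rows and columns.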

The distances between row markers are interpreted as an inverse function of their similarities, so that neighboring markers are more similar.

The length of the vectors (column markers) approximates the standard deviation of the newspapers.

The cosines of the angles between the column markers approximate the correlations between the newspapers: acute angles indicate a high positive correlation, obtuse angles indicate a negative correlation, and right angles indicate uncorrelated variables.

The order of the orthogonal projections of the points (row markers) onto a vector (column marker) approximates the order of the row elements in that column. The greater the projection of a point on a vector, the more the row element deviates from the mean of that newspaper.
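The length and angle rules for column markers can be checked numerically. The sketch below uses a random matrix and assumes column markers in principal coordinates divided by `sqrt(n - 1)`; that scaling is an assumption made here so that the lengths are on the scale of standard deviations. When all dimensions are kept the relations are exact; in a two-dimensional Biplot they hold only approximately.

```r
set.seed(42)
# Toy data matrix (random values): 10 rows x 4 columns
X  <- matrix(rnorm(40), nrow = 10, ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)
n  <- nrow(Xc)

dec <- svd(Xc)

# Column markers in principal coordinates, using every dimension
col_markers <- dec$v %*% diag(dec$d) / sqrt(n - 1)

# Length of each column marker and cosine of the angle between each pair
marker_len <- sqrt(rowSums(col_markers^2))
cosines    <- (col_markers %*% t(col_markers)) / (marker_len %o% marker_len)

# Quantities the markers are expected to reproduce
column_sd  <- apply(X, 2, sd)
column_cor <- cor(X)
```

Comparing `marker_len` with `column_sd`, and `cosines` with `column_cor`, using `all.equal()` shows the agreement in the full-dimensional case.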

Before generating the selected Biplot representation, you must indicate
how the data matrix will be centered. `LDABiplots` gives 4 options for
centering and scaling the matrix: *scale*, *center*, *center_scale*, or
*none*.
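A plausible reading of the four options can be expressed with the base R function `scale()`. The helper `transform_matrix` is hypothetical, and the mapping of each option name to a transformation is an assumption: *center* subtracts the column means, *scale* divides by the column standard deviations, *center_scale* does both, and *none* leaves the matrix untouched.

```r
# Hypothetical helper: applies one of the four transformations to a numeric matrix
transform_matrix <- function(X, method = c("center_scale", "center", "scale", "none")) {
  method <- match.arg(method)
  switch(method,
    center_scale = scale(X, center = TRUE,  scale = TRUE),
    center       = scale(X, center = TRUE,  scale = FALSE),
    # scale = TRUE without centering would divide by the root mean square,
    # so the column standard deviations are passed explicitly
    scale        = scale(X, center = FALSE, scale = apply(X, 2, sd)),
    none         = X
  )
}

set.seed(3)
# Toy matrix (random values) whose columns have different means and spreads
X <- cbind(rnorm(8, mean = 5, sd = 2), rnorm(8, mean = -1, sd = 0.5))

centered     <- transform_matrix(X, "center")
standardized <- transform_matrix(X, "center_scale")
```

After `center_scale`, every column has mean 0 and standard deviation 1, so columns measured on different scales contribute equally to the Biplot.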

By clicking on *run*, the selected Biplot is generated and the results
are presented in tabular form, which can be downloaded in different
formats. The tabular results shown are: *Eigenvalues*, a vector with the
eigenvalues; *Variance explained*, a vector containing the proportion of
variance explained by the first 1, 2, ..., K principal components;
*loadings*, the loadings of the principal components; *Coordinates of
individuals*, a matrix with the coordinates of the individuals; and
*Coordinates of variables*, a matrix with the coordinates of the
variables.
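The first three tabular results can be reproduced from a singular value decomposition, as sketched below for a random matrix. Taking the eigenvalues as squared singular values divided by `n - 1` (the eigenvalues of the covariance matrix) is an assumption; the application may report them on a different scale.

```r
set.seed(7)
# Toy data matrix (random values): 10 rows x 3 columns
X  <- matrix(rnorm(30), nrow = 10, ncol = 3)
Xc <- scale(X, center = TRUE, scale = FALSE)

dec <- svd(Xc)

eigenvalues <- dec$d^2 / (nrow(Xc) - 1)               # variance along each component
explained   <- cumsum(eigenvalues) / sum(eigenvalues) # share explained by the first 1, 2, ..., K
loadings    <- dec$v                                  # one column of loadings per component
```

The last entry of `explained` is always 1, since all components together account for the total variance.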

If you use `LDABiplots`, please cite it in your work as:

*Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent
Dirichlet allocation. Journal of Machine Learning Research, 3(Jan),
993-1022.*

*Galindo-Villardón, P. (1986). Una alternativa de representación
simultánea: HJ-Biplot (An alternative of simultaneous representation:
HJ-Biplot). Qüestiió, 10, 13–23.*

*Gabriel, K. R. (1971). The biplot graphic display of matrices
with application to principal component analysis. Biometrika, 58(3),
453-467.*