rebib parser mechanics

At the core of rebib is a parser that reads embedded bibliographic data and sorts it into fields using regular expressions.

Stage 1: Read the Embedded Bibliography Block

This is a minor stage that reads the embedded bibliography from the LaTeX document. Commented-out lines are filtered out first so that unintended entries are not read.

Lastly, the bibliography is split into entries, using the LaTeX macro \bibitem as the marker for the start of each new entry, and the resulting list is stored in a variable.

file_name <- rebib:::get_texfile_name(your_article_path)
bib_items <- rebib:::extract_embeded_bib_items(your_article_path,file_name)
bib_items[[1]]
#> [1] "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}"                          
#> [2] "Ihaka, Ross and Gentleman, Robert"                                                 
#> [3] "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}"                  
#> [4] "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0"
#> [5] "299--314, 1996."                                                                   
#> [6] "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"
bib_items[[2]]
#> [1] "\\bibitem[R Core Team]{R}"                                                                  
#> [2] "R Core Team"                                                                                
#> [3] "\\newblock R: A Language and Environment for Statistical Computing"                         
#> [4] "\\newblock \\emph{R Foundation for Statistical Computing}, Vienna, Austria \\penalty0 2016."
#> [5] "\\newblock URL : \\url{https://www.R-project.org/}, ISBN 3-900051-07-0"
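The splitting step can be sketched in a few lines of base R. This is only an illustration of the idea, not rebib's actual implementation: `split_bib_items()` is a hypothetical helper, and it assumes the bibliography sits in a single `thebibliography` environment and that comments occupy whole lines.

```r
# Hypothetical helper (not part of rebib): split raw LaTeX lines into
# one character vector per \bibitem entry.
split_bib_items <- function(tex_lines) {
  # Drop lines that are entirely LaTeX comments (first non-blank char is %).
  tex_lines <- tex_lines[!grepl("^\\s*%", tex_lines)]
  # Keep only the lines inside the thebibliography environment (assumed present).
  start <- grep("\\\\begin\\{thebibliography\\}", tex_lines)[1]
  end <- grep("\\\\end\\{thebibliography\\}", tex_lines)[1]
  body <- tex_lines[(start + 1):(end - 1)]
  body <- body[nzchar(trimws(body))]
  # Each \bibitem starts a new entry; cumsum() turns the markers into group ids.
  groups <- cumsum(grepl("^\\s*\\\\bibitem", body))
  unname(split(body[groups > 0], groups[groups > 0]))
}

tex <- c(
  "\\begin{thebibliography}{2}",
  "% \\bibitem{old} a commented-out entry",
  "\\bibitem[R Core Team]{R}",
  "R Core Team",
  "\\newblock R: A Language and Environment for Statistical Computing",
  "",
  "\\bibitem{ihaka:1996}",
  "Ihaka, Ross and Gentleman, Robert",
  "\\end{thebibliography}"
)
items <- split_bib_items(tex)
```

Filtering comments before splitting matters here: otherwise the commented-out `\bibitem{old}` line would start a spurious third entry.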

Stage 2: Regex Powered Parser

Each chunk is now passed to a parser, which breaks it down using regular expressions. The logic is to use the LaTeX macro \newblock as a marker and to identify text elements by their position relative to it.

The first value to be parsed is the unique_id, also called the citation reference, which is used to cite the entry inside the article. It is usually on the first or second line of the entry, and its position determines where the author names are expected.

bib_items[[1]][1]
#> [1] "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}"
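As a rough sketch of this step, the key can be captured with a single regular expression that skips the optional `[label]` and keeps the text inside the braces. `extract_unique_id()` is a hypothetical name used for illustration, and the pattern is an assumption about the input shape rather than the expression rebib uses.

```r
# Hypothetical helper (not part of rebib): pull the citation key
# out of a \bibitem line.
extract_unique_id <- function(line) {
  # Optional [label], then the key captured from inside {...}.
  pattern <- "^\\s*\\\\bibitem\\s*(\\[[^]]*\\])?\\s*\\{([^}]*)\\}.*$"
  if (!grepl(pattern, line)) {
    return(NULL)  # not a \bibitem line
  }
  sub(pattern, "\\2", line)
}

id_with_label <- extract_unique_id(
  "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}"
)
id_plain <- extract_unique_id("\\bibitem{R}")
```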

After reading the unique_id, the parser attempts to read the author name(s), which may span up to two lines (this covers most articles).

bib_items[[1]][2]
#> [1] "Ihaka, Ross and Gentleman, Robert"
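One way to picture the author step: take the lines that follow the \bibitem line, stop at the first \newblock, and keep at most two of them. The helper below is a hypothetical sketch under the assumption that the unique_id is on the first line of the entry; rebib's own handling of the second-line case is not reproduced here.

```r
# Hypothetical helper (not part of rebib): read the author line(s)
# that sit between the \bibitem line and the first \newblock.
extract_author <- function(item) {
  rest <- item[-1]  # assumption: line 1 holds \bibitem and the key
  first_block <- grep("\\\\newblock", rest)[1]
  n_author <- if (is.na(first_block)) length(rest) else first_block - 1
  n_author <- min(n_author, 2)  # read at most two author lines
  trimws(paste(rest[seq_len(n_author)], collapse = " "))
}

item <- c(
  "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}",
  "Ihaka, Ross and Gentleman, Robert",
  "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}"
)
author <- extract_author(item)
```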

Next, the title is extracted based on the position of the \newblock markers or the end of the bib chunk.

bib_items[[1]][3]
#> [1] "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}"
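Turning such a line into a plain title mostly means removing LaTeX markup. The sketch below is a hypothetical `clean_title()` helper, not rebib's own cleaning code. It assumes the title occupies a single \newblock line and is at most wrapped in one `\emph{...}`.

```r
# Hypothetical helper (not part of rebib): strip LaTeX markup from a title line.
clean_title <- function(line) {
  out <- sub("^\\s*\\\\newblock\\s*", "", line)        # drop the leading \newblock
  out <- sub("^\\\\emph\\{(.*)\\}\\s*$", "\\1", out)   # unwrap \emph{...}
  out <- sub("[.,]\\s*$", "", out)                     # drop a trailing period or comma
  trimws(out)
}

title <- clean_title(
  "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}"
)
```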

This way the crucial elements of the bibliographic entry (unique_id, author names and title) are parsed out.

The remaining data is stored internally as journal and is written out as the publisher field in the new BibTeX file.

bib_items[[1]][4:6]
#> [1] "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0"
#> [2] "299--314, 1996."                                                                   
#> [3] "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"

A series of filters for the ISBN, URL, pages and year fields is then applied to the remaining data. A field that cannot be found is set to NULL and skipped when the BibTeX file is written. Common LaTeX elements are also filtered out, and what remains is stored in a structured format, ready to be written to a file.
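The filters can be thought of as small regular expressions, each returning either a match or NULL. The following is a minimal sketch under stated assumptions, not the patterns rebib applies: `first_match()` and `extract_fields()` are hypothetical helpers, years are assumed to be four digits starting with 19 or 20, page ranges are assumed to use `--`, and an ISBN is assumed to follow the literal text "ISBN ".

```r
# Hypothetical helpers (not part of rebib).
# Return the first match of `pattern` in `text`, or NULL when there is none.
first_match <- function(pattern, text) {
  m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(m) == 0) NULL else m
}

extract_fields <- function(lines) {
  text <- paste(lines, collapse = " ")
  url <- first_match("(?<=\\\\url\\{)[^}]+", text)
  # Remove the URL first so digits inside it are not mistaken for a year or pages.
  rest <- gsub("\\\\url\\{[^}]*\\}", "", text, perl = TRUE)
  list(
    year  = first_match("\\b(19|20)[0-9]{2}\\b", rest),
    pages = first_match("[0-9]+--[0-9]+", rest),
    isbn  = first_match("(?<=ISBN )[0-9Xx-]+", rest),
    URL   = url
  )
}

fields <- extract_fields(c(
  "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0",
  "299--314, 1996.",
  "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"
))
```

Stripping the URL before looking for a year is a deliberate choice in this sketch: the DOI above contains "1996", which would otherwise compete with the real publication year.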

bib_entry <- rebib:::bib_handler(bib_items)
bib_entry
#> $book
#> $book[[1]]
#> $book[[1]]$unique_id
#> [1] "ihaka:1996"
#> 
#> $book[[1]]$author
#> [1] "Ihaka, Ross and Gentleman, Robert"
#> 
#> $book[[1]]$title
#> [1] "R: A Language for Data Analysis and Graphics"
#> 
#> $book[[1]]$journal
#> [1] "Journal of Computational and Graphical Statistics 3:   :"
#> 
#> $book[[1]]$year
#> [1] "1996"
#> 
#> $book[[1]]$URL
#> [1] "https://doi.org/10.1080/10618600.1996.10474713"
#> 
#> $book[[1]]$isbn
#> NULL
#> 
#> $book[[1]]$pages
#> [1] "299--314"
#> 
#> 
#> $book[[2]]
#> $book[[2]]$unique_id
#> [1] "R"
#> 
#> $book[[2]]$author
#> [1] "R Core Team"
#> 
#> $book[[2]]$title
#> [1] "R: A Language and Environment for Statistical Computing"
#> 
#> $book[[2]]$journal
#> [1] "R Foundation for Statistical Computing Vienna Austria    :"
#> 
#> $book[[2]]$year
#> [1] "2016"
#> 
#> $book[[2]]$URL
#> [1] "https://www.R-project.org/"
#> 
#> $book[[2]]$isbn
#> [1] "3-900051-07-0"
#> 
#> $book[[2]]$pages
#> NULL
#> 
#> 
#> $book[[3]]
#> $book[[3]]$unique_id
#> [1] "Tremblay:2012"
#> 
#> $book[[3]]$author
#> [1] "A.~Tremblay"
#> 
#> $book[[3]]$title
#> [1] "LMERConvenienceFunctions: A suite of functions to back-fit fixed effects and forward-fit random effects, as well as other miscellaneous functions., "
#> 
#> $book[[3]]$journal
#> [1] "R package version 1.6.8.2"
#> 
#> $book[[3]]$year
#> [1] "2012"
#> 
#> $book[[3]]$URL
#> [1] "http://CRAN.R-project.org/package=LMERConvenienceFunctions"
#> 
#> $book[[3]]$isbn
#> NULL
#> 
#> $book[[3]]$pages
#> NULL

Stage 3: BibTeX writer

After reading the bibliographic entries and splitting out meaningful values from them, we can finally write a structured file in the BibTeX format.

The writer processes the parsed entries one at a time and writes whichever fields were extracted for each. Every entry is written with the default type, @book, which should not cause any problems for the web articles.

rebib:::bibtex_writer(bib_entry,file_path)
cat(readLines(paste(your_article_path,"example.bib",sep="/")),sep = "\n")
#> @book{ihaka:1996,
#> author = {{Ihaka, Ross and Gentleman, Robert}},
#> title = {{R: A Language for Data Analysis and Graphics}},
#> publisher = {Journal of Computational and Graphical Statistics 3:   :},
#> pages = {299--314},
#> year = {1996},
#> url = {https://doi.org/10.1080/10618600.1996.10474713}
#> }
#> @book{R,
#> author = {R {Core Team}},
#> title = {{R: A Language and Environment for Statistical Computing}},
#> publisher = {R Foundation for Statistical Computing Vienna Austria    :},
#> year = {2016},
#> url = {https://www.R-project.org/},
#> isbn = {3-900051-07-0}
#> }
#> @book{Tremblay:2012,
#> author = {A.~{Tremblay}},
#> title = {{LMERConvenienceFunctions: A suite of functions to back-fit fixed effects and forward-fit random effects, as well as other miscellaneous functions., }},
#> publisher = {R package version 1.6.8.2},
#> year = {2012},
#> url = {http://CRAN.R-project.org/package=LMERConvenienceFunctions}
#> }
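The behaviour of the writer, skipping NULL fields and renaming journal to publisher, can be illustrated with a short sketch. `format_bibtex_entry()` is a hypothetical function written for this illustration; it does not reproduce rebib's extra brace protection of author and title values seen in the output above.

```r
# Hypothetical helper (not part of rebib): format one parsed entry as BibTeX.
format_bibtex_entry <- function(entry, type = "book") {
  # Internal field name -> BibTeX field name; journal is written as publisher.
  field_map <- c(author = "author", title = "title", journal = "publisher",
                 pages = "pages", year = "year", URL = "url", isbn = "isbn")
  fields <- character(0)
  for (name in names(field_map)) {
    value <- entry[[name]]
    if (is.null(value)) next  # fields that were not found are skipped
    fields <- c(fields, sprintf("%s = {%s}", field_map[[name]], value))
  }
  c(sprintf("@%s{%s,", type, entry$unique_id),
    paste(fields, collapse = ",\n"),
    "}")
}

entry <- list(
  unique_id = "R",
  author = "R Core Team",
  title = "R: A Language and Environment for Statistical Computing",
  journal = "R Foundation for Statistical Computing",
  year = "2016",
  URL = "https://www.R-project.org/",
  isbn = "3-900051-07-0",
  pages = NULL
)
writeLines(format_bibtex_entry(entry))
```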

I expect authors who are converting a document to review the generated BibTeX file and check it for errors or missing values.