pubtatordb

Zachary Colburn

2019-11-22

Overview

PubTator is an NCBI resource that provides detailed annotations of abstracts found on PubMed, which makes it a useful research tool. PubTator does provide an API, but an API is inconvenient for high-throughput analyses and requires a reliable internet connection. Querying a local PubTator database is better suited to that kind of work. The pubtatordb package makes it easy to set up and start using a local copy of PubTator’s data.

Installation

You can install the released version of pubtatordb from CRAN with:

install.packages("pubtatordb")

The version on GitHub can be downloaded using the devtools package with:

install.packages("devtools")
devtools::install_github("mamc-dci/pubtatordb")

Example

Load the package.

# Load the package.
library(pubtatordb)

After loading the package, database setup and querying can be accomplished in four steps.

First, manually create a folder to store the data. Then define the path to that folder and download the data to that location:

# Download the data.
# In real use, give the full path to a persistent folder. tempdir() is used
# here only so the example runs; it is removed when the R session ends.
download_dir <- tempdir()
download_pt(download_dir)
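Because the temp directory is not recommended, a persistent folder can be prepared first. The following is a minimal sketch using only base R; the folder name pubtator_data and its location in the home directory are assumptions chosen for illustration, and the download call is left commented out because it needs an internet connection.

```r
# Hypothetical location; any persistent, writable folder works.
download_dir <- file.path(path.expand("~"), "pubtator_data")

# Create the folder if it does not exist yet.
if (!dir.exists(download_dir)) {
  dir.create(download_dir, recursive = TRUE)
}

# Resolve to a full path, as advised above.
download_dir <- normalizePath(download_dir)

# The download itself is commented out here because it requires an
# internet connection.
# pubtatordb::download_pt(download_dir)
```

Keeping the data outside the temp directory means the download does not have to be repeated in every new R session.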

Next, define the path to the data directory, which sits inside the download directory used above, and create the database:

# Define the data directory, a subdirectory of the above directory.
pubtator_path <- file.path(download_dir, "PubTator")

# Create the database.
pt_to_sql(
  pubtator_path,
  skip_behavior = FALSE,
  remove_behavior = TRUE
)

If the .gz files from PubTator have already been extracted, extraction can be skipped with the skip_behavior argument. Once the data have been inserted into the database, both the .gz and uncompressed files can be removed with the remove_behavior argument.

A connection to the database can be created using pt_connector. Note that this is a wrapper for the dbConnect function of the DBI package.

# Create a connection to the database.
db_con <- pt_connector(pubtator_path)
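Since pt_connector wraps dbConnect, the resulting connection can presumably be checked and closed with the usual DBI functions. The sketch below illustrates that lifecycle on a stand-in in-memory SQLite connection; the use of RSQLite here is an assumption made so the example runs without the PubTator data.

```r
library(DBI)

# Stand-in connection: an in-memory SQLite database (assumption). In real
# use the connection would come from pt_connector(pubtator_path).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A live connection reports TRUE.
is_open <- dbIsValid(con)

# Close the connection once querying is finished.
dbDisconnect(con)

# After disconnecting, the handle is no longer valid.
is_closed <- !dbIsValid(con)
```

Closing the connection when the analysis is done releases the database file.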

Querying the data is accomplished using the pt_select function. The first five rows of the gene table can be selected with:

# Query the data.
pt_select(
  db_con,
  "gene",
  columns = NULL,
  keys = NULL,
  keytype = NULL,
  limit = 5
)

The first five records for articles (PMIDs) that mention the genes with Entrez IDs 7356 or 4199 can be selected with:

# Query the data.
pt_select(
  db_con,
  "gene",
  columns = c("PMID", "ENTREZID"),
  keys = c("7356", "4199"),
  keytype = "ENTREZID",
  limit = 5
)
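For readers who think in SQL, the selection above is roughly a filtered, limited SELECT. The following sketch runs such a query directly through DBI against a toy in-memory SQLite table. The rows are made up for illustration and are not real PubTator records, and this is not necessarily how pt_select builds its query internally.

```r
library(DBI)

# Toy in-memory database standing in for the real one (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows for illustration only; these are not real PubTator records.
toy_gene <- data.frame(
  PMID = c("1", "2", "3", "4"),
  ENTREZID = c("7356", "4199", "1234", "7356"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gene", toy_gene)

# Parameterised query: keep rows whose ENTREZID matches either key and
# return at most five rows.
res <- dbGetQuery(
  con,
  "SELECT PMID, ENTREZID FROM gene WHERE ENTREZID IN (?, ?) LIMIT 5",
  params = list("7356", "4199")
)

dbDisconnect(con)
res
```

Passing the keys as parameters rather than pasting them into the SQL string avoids quoting mistakes when the keys come from user input.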

Other tables

PubTator has several datasets. The names of tables in the database can be obtained with:

pt_tables(db_con)

The column names for a particular table can be accessed with:

pt_columns(db_con, "species")

Note

The citation information for PubTator can be found on the PubTator website or with:

pubtator_citations()
#> Please cite PubTator in any publications:
#> 1. Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
#> 2. Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
#> 3. Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012

Disclaimer

The views expressed are those of the author(s) and do not reflect the official policy of the Department of the Army, the Department of Defense or the U.S. Government.