HGNChelper Introduction

Levi Waldron and Markus Riester

2019-10-24

Why HGNChelper?

Physicians and biologists like gene symbols and bioinformaticians hate 'em. Why? For one thing, they change constantly and are given new names or aliases. For another, some get munged into dates when imported into spreadsheet programs, and not only Excel (thank you @karawoo for the picture!).

Myself (Levi speaking), I don’t mind them. It’s way easier to remember TP53 than to remember 7157 or ENSG00000141510. They’re a fact of life. So Markus Riester and I wrote HGNChelper to make them a little more pleasant to bioinformaticians.

HGNChelper functionality

HGNChelper has several functions that seemed useful back in the day when we first wrote it, but really only one has withstood the test of time and remained useful:

checkGeneSymbols(x, unmapped.as.na = TRUE, map = NULL, species = "human")

checkGeneSymbols identifies HGNC human or MGI mouse gene symbols which are outdated or may have been mogrified by Excel or other spreadsheet programs. It returns a data.frame with the same number of rows as the input: the first column holds the input symbols, the second indicates whether each symbol is valid, and the third gives the suggested corrected symbol.

library(HGNChelper)
human = c("FN1", "tp53", "UNKNOWNGENE","7-Sep", "9/7", "1-Mar", "Oct4", "4-Oct",
      "OCT4-PG4", "C19ORF71", "C19orf71")
checkGeneSymbols(human)
#> Maps last updated on: Thu Oct 24 12:31:05 2019
#> Warning in checkGeneSymbols(human): Human gene symbols should be all upper-
#> case except for the 'orf' in open reading frames. The case of some letters
#> was corrected.
#> Warning in checkGeneSymbols(human): x contains non-approved gene symbols
#>              x Approved   Suggested.Symbol
#> 1          FN1     TRUE                FN1
#> 2         tp53    FALSE               TP53
#> 3  UNKNOWNGENE    FALSE               <NA>
#> 4        7-Sep    FALSE            SEPTIN7
#> 5          9/7    FALSE            SEPTIN7
#> 6        1-Mar    FALSE MTARC1 /// MARCHF1
#> 7         Oct4    FALSE             POU5F1
#> 8        4-Oct    FALSE             POU5F1
#> 9     OCT4-PG4    FALSE           POU5F1P4
#> 10    C19ORF71    FALSE           C19orf71
#> 11    C19orf71     TRUE           C19orf71

As you see, it even helps fix capitalization. How does it fix those Excel dates? I imported a column of all human gene symbols into Excel, then exported using a whole bunch of available date formats. Then I kept any that differed from the originals for HGNChelper’s map.
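The idea behind the date map can be sketched in base R. This is only a rough illustration, not the code used to build HGNChelper's map: the real map came from round-tripping symbols through Excel, whereas the prefix table and the three output formats below are assumptions chosen for the example, and `mogrify` is a hypothetical helper name.

```r
# Hypothetical table of symbol prefixes that a spreadsheet may read as a month.
month_prefix <- c(MARCH = 3L, MARC = 3L, SEPT = 9L, OCT = 10L, DEC = 12L)

# Return date-like strings a spreadsheet might produce for one symbol,
# or character(0) if the symbol does not look like a date.
mogrify <- function(symbol) {
  # Split into leading letters and a trailing 1-2 digit number, e.g. "SEPT7".
  m <- regmatches(symbol, regexec("^([A-Z]+)([0-9]{1,2})$", symbol))[[1]]
  if (length(m) != 3 || !(m[2] %in% names(month_prefix))) return(character(0))
  month <- month_prefix[[m[2]]]
  day <- as.integer(m[3])
  if (day < 1L || day > 31L) return(character(0))
  c(
    sprintf("%d-%s", day, month.abb[month]),    # e.g. "7-Sep"
    sprintf("%s-%02d", month.abb[month], day),  # e.g. "Sep-07"
    sprintf("%d/%d", month, day)                # e.g. "9/7"
  )
}

symbols <- c("SEPT7", "MARCH1", "MARC1", "OCT4", "TP53")
variants <- lapply(symbols, mogrify)

# Keep only the strings that differ from the original symbol,
# paired with the symbol they came from.
date_map <- data.frame(
  Symbol = unlist(variants),
  Original.Symbol = rep(symbols, lengths(variants)),
  stringsAsFactors = FALSE
)

# Two different symbols can collapse to the same date string.
date_map$Original.Symbol[date_map$Symbol == "1-Mar"]
```

Note that both MARCH1 and MARC1 end up as 1-Mar in this sketch, which is presumably why 1-Mar gets two suggested symbols in the output above.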

Mouse gene symbols

Warning: the list of valid mouse symbols seems to be incomplete, see below. Mouse genes work the same way, but you need to specify the argument species = "mouse":

checkGeneSymbols(c("1-Feb", "Pzp", "A2m"), species="mouse")
#> Maps last updated on: Thu Oct 24 12:31:05 2019
#> Warning in checkGeneSymbols(c("1-Feb", "Pzp", "A2m"), species = "mouse"): x
#> contains non-approved gene symbols
#>       x Approved Suggested.Symbol
#> 1 1-Feb    FALSE             Feb1
#> 2   Pzp    FALSE             <NA>
#> 3   A2m    FALSE         AI893533

I don’t work with mouse data, so use this functionality with care and please let me know if you have any suggestions. For one thing, Pzp in the above example is a valid gene symbol, but it is not in the MGI_EntrezGene.rpt file I used to build a map. Suggestions welcome about how to build a more complete map of valid symbols. To be on the safe side, you could set unmapped.as.na = FALSE to keep unrecognized symbols as-is and only correct ones that have a definitive correction.
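A minimal sketch of that safer call, reusing the three mouse symbols from the example above. It assumes HGNChelper is installed; the warnings are suppressed here only to keep the output short, and the variable names are illustrative.

```r
library(HGNChelper)

mouse <- c("1-Feb", "Pzp", "A2m")

# unmapped.as.na = FALSE keeps symbols that are not in the map unchanged
# instead of returning NA for them in Suggested.Symbol.
res <- suppressWarnings(
  checkGeneSymbols(mouse, unmapped.as.na = FALSE, species = "mouse")
)

# Use the third column as the cleaned-up symbol list.
cleaned <- res$Suggested.Symbol
cleaned
```

With this setting an unrecognized but possibly valid symbol such as Pzp is carried through as-is rather than being dropped to NA.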

What exactly checkGeneSymbols does

checkGeneSymbols makes the following corrections:

  1. fix capitalization (for human only). Only orf genes are allowed to have lower-case letters.
  2. fix Excel-mogrified symbols
  3. fix symbols that are listed as aliases to a more recent symbol in the HGNC or MGI (MGI_EntrezGene.rpt) database.

Numbers 2 and 3 are done by comparing to a complete map of both valid and invalid but mappable symbols, shipped with HGNChelper:

dim(mouse.table)
#> [1] 135053      2
dim(hgnc.table)
#> [1] 98431     2

These are a combination of manually generated Excel mogrifications that remain constant, and aliases that can become out of date with time.
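The three corrections can be sketched with a toy lookup table in base R. This is a simplified illustration under stated assumptions, not HGNChelper's actual implementation: `toy_map` is a hypothetical seven-row table built from the example output above, `toy_check` is a made-up helper, and the regular expression for orf genes is a guess at a reasonable pattern.

```r
# Hypothetical map: every mappable symbol (valid or not) and its approved symbol.
toy_map <- data.frame(
  Symbol = c("FN1", "TP53", "7-Sep", "9/7", "OCT4", "4-Oct", "C19orf71"),
  Approved.Symbol = c("FN1", "TP53", "SEPTIN7", "SEPTIN7", "POU5F1", "POU5F1",
                      "C19orf71"),
  stringsAsFactors = FALSE
)

toy_check <- function(x, map, unmapped.as.na = TRUE) {
  # Correction 1: upper-case everything, then restore the lower-case 'orf'.
  fixed <- toupper(x)
  fixed <- sub("^(C[0-9XY]+)ORF([0-9]+)$", "\\1orf\\2", fixed)

  # Corrections 2 and 3: look up the symbol as given, then fall back to the
  # case-corrected form if the original is not in the map.
  idx <- match(x, map$Symbol)
  miss <- is.na(idx)
  idx[miss] <- match(fixed[miss], map$Symbol)
  suggested <- map$Approved.Symbol[idx]

  # A symbol counts as approved only if it already equals its suggestion.
  approved <- !is.na(suggested) & x == suggested

  if (!unmapped.as.na) {
    suggested[is.na(suggested)] <- x[is.na(suggested)]
  }
  data.frame(x = x, Approved = approved, Suggested.Symbol = suggested,
             stringsAsFactors = FALSE)
}

toy_result <- toy_check(
  c("FN1", "tp53", "UNKNOWNGENE", "7-Sep", "Oct4", "C19ORF71"),
  toy_map
)
toy_result
```

Date strings and outdated aliases are handled by the same lookup, which is why both kinds of entry can live in a single two-column table.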

Updating maps of aliased gene symbols

Gene symbols are aliased much more frequently than I can update this package, like every day. If you want the most current maps of aliases, you can either:

  1. use the getCurrentHumanMap() or getCurrentMouseMap() function, and provide the returned result through the map= argument of checkGeneSymbols(), or
  2. see the instructions for updating your package locally at https://github.com/waldronlab/HGNChelper (it’s just one command-line command as long as you have R and roxygen2 installed)
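The first option might look like the sketch below. It assumes HGNChelper is installed and that an internet connection is available, since the current map has to be fetched; `current_map` is just an illustrative variable name.

```r
library(HGNChelper)

# Fetch the current human map once and reuse it for every call in the session.
current_map <- getCurrentHumanMap()

# Pass the fresh map through the map= argument instead of the shipped table.
res <- suppressWarnings(
  checkGeneSymbols(c("FN1", "tp53", "7-Sep"), map = current_map)
)
res
```

For mouse data the same pattern should apply with getCurrentMouseMap() and species = "mouse".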

Where do I find HGNChelper?

Please report any issues at https://github.com/waldronlab/HGNChelper/issues