**aphid** is an R package for the development and
application of hidden Markov models and profile HMMs for biological
sequence analysis. It contains functions for multiple and pairwise
sequence alignment, model construction and parameter optimization, file
import/export, implementation of the forward, backward and Viterbi
algorithms for conditional sequence probabilities, tree-based sequence
weighting, and sequence simulation. The package has a wide variety of
uses including database searching, gene-finding and annotation,
phylogenetic analysis and sequence classification.

Hidden Markov models (HMMs) underlie many of the most important tasks in computational biology, including multiple sequence alignment, genome annotation, and increasingly, sequence database searching. Originally developed for speech recognition algorithms, their application to the field of molecular biology has increased dramatically since advances in computational capacity have enabled full probabilistic analysis in place of heuristic approximations. Pioneering this transition are two groups lead by Anders Krogh and Sean Eddy, whose respective software packages SAM and HMMER have underpinned HMM-based bioinformatic analysis for over two decades.

Here, we present the **aphid** package for analysis with
profile hidden Markov models in the R environment (R Core Team 2017). The package contains
functions for developing, plotting, importing and exporting both
standard and profile HMMs, as well as implementations of the forward,
backward and Viterbi algorithms for computing full and optimal
conditional sequence probabilities. The package also features a multiple
sequence alignment tool that produces high quality alignments
*via* profile HMM training.

###Dependencies The **aphid** package is designed to
work in conjunction with the “DNAbin” and “AAbin” object types produced
using the **ape** package (Paradis,
Claude, and Strimmer 2004; Paradis 2012). These object types, in
which sequences are represented in a bit-level coding scheme, are
preferred over standard character-type sequences for maximizing memory
and speed efficiency. While we recommend using **ape**
alongside **aphid**, it is not a requisite and as such is
listed in the “Suggests” rather than “Imports” section of the package
description. Indeed, any sequence of standard ASCII characters is
supported, making **aphid** suitable for other applications
outside of biological sequence analysis. However, it should be noted
that if DNA and/or amino acid sequences are input as character vectors,
the functions may not recognize the ambiguity codes and therefore are
not guaranteed to treat them appropriately.

To maximize speed, the low-level dynamic programming functions
(including the forward, backward, Viterbi, and maximum *a
posteriori* algorithms) are written in C++ linking to the Rcpp
package (Eddelbuettel and Francois 2011).
R versions of these functions are also maintained for the purposes of
debugging, experimentation and code interpretation. This package also
relies on the **openssl** package (Ooms 2016) for sequence and alignment
comparisons using the MD5 hash algorithm.

###Classes Two primary object classes, “HMM” (hidden Markov model)
and “PHMM” (profile hidden Markov model) are generated using the
**aphid** functions deriveHMM and derivePHMM, respectively.
These objects are lists consisting of emission and transition
probability matrices (elements named “E” and “A”), and non-mandatory
elements that may include vectors of background emission and transition
probabilities (“qe” and “qa”, respectively) and other model metadata
including “name”, “description”, “size” (the number of modules in the
model), and “alphabet” (the set of symbols/residues emitted by the
model). Objects of class “DPA” (dynamic programming array) are also
generated by the Viterbi and forward/backward functions. These are
predominantly created for the purposes of succinct console-printing.

###Functions HMMs and PHMMs are explained in more detail throughout
the following sections using **aphid** package functions to
demonstrate their utility. The examples are borrowed from Durbin et al
(1998), to which users are encouraged to
refer for a more in-depth explanation on the theory and application of
these models. Book chapter numbers are provided wherever possible for
ease of reference.

The **aphid** package can be used to produce
high-quality multiple sequence alignments using the iterative model
training method outlined above. The function `align`

takes as
its primary argument a list of sequences either as a “DNAbin” object, an
“AAbin” object, or a list of character sequences. An object of class
“PHMM” can be passed to the function as an optional secondary argument
(“model”), in which case the sequences are simply aligned to the model
to produce the alignment matrix. If “model” is NULL, a preliminary model
is first derived using the ‘seed’ sequence method outlined above, after
which the model is trained using either the Baum Welch or Viterbi
training algorithm (specified *via* the “method” argument). The
sequences are then aligned to the model in the usual fashion to produce
the alignment. Note that if only two sequences are present inthe input
list, the `align`

function will perform a pairwise alignment
without a profile HMM (Smith-Waterman or Needleman-Wunch alignment).

In this final example, we will deconstruct the original globin alignment and re-align the sequences using the original PHMM as a guide.

```
<- unalign(globins)
globins align(globins, model = globins.PHMM, seqweights = NULL, residues = "AMINO")
```

```
## 1 2 3 I I 4 5 6 7 8
## HBA_HUMAN "V" "G" "A" "-" "-" "H" "A" "G" "E" "Y"
## HBB_HUMAN "V" "-" "-" "-" "-" "N" "V" "D" "E" "V"
## MYG_PHYCA "V" "E" "A" "-" "-" "D" "V" "A" "G" "H"
## GLB3_CHITP "V" "K" "G" "-" "-" "-" "-" "-" "-" "D"
## GLB5_PETMA "V" "Y" "S" "-" "-" "T" "Y" "E" "T" "S"
## LGB2_LUPLU "F" "N" "A" "-" "-" "N" "I" "P" "K" "H"
## GLB1_GLYDI "I" "A" "G" "A" "D" "N" "G" "A" "G" "V"
```

Note that the column names show the progressive positions along the model and where residues were predicted to have been emitted by insert states (e.g. the 4th and 5th residues of sequence 7).

This package was written based on the algorithms described in Durbin et al (1998). This book offers an in depth explanation of hidden Markov models and profile HMMs for users of all levels of familiarity. Many of the examples and datasets in the package are directly derived from the text, which serves as a useful primer for this package. There are also excellent resources available for those wishing to use profile HMMs outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM. Those interested are further encouraged to check out the SAM software package, which also features a comprehensive suite of functions and tutorials.

This software was developed at Victoria University of Wellington, NZ, with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998.
*Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids*. Cambridge: Cambridge
University Press.

Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: seamless R and C++ integration.”
*Journal of Statistical Software* 40 (8): 1–18. https://www.jstatsoft.org/v40/i08/.

Ooms, Jeroen. 2016. *openssl: toolkit for
encryption, signatures and certificates based on OpenSSL*. R
package. https://cran.r-project.org/package=openssl.

Paradis, Emmanuel. 2012. *Analysis of
Phylogenetics and Evolution with R*. Second Edi. New York:
Springer.

Paradis, Emmanuel, Julien Claude, and Korbinian Strimmer. 2004.
“APE: analyses of phylogenetics and evolution
in R language.” *Bioinformatics* 20: 289–90. https://doi.org/10.1093/bioinformatics/btg412.

R Core Team. 2017. “R: A Language and
Environment for Statistical Computing.” Vienna, Austria: R
Foundation for Statistical Computing. https://cran.r-project.org/.