The goal of protHMM is to help integrate profile hidden markov model (HMM) representations of proteins into the machine learning and bioinformatics workflow. protHMM ports a number of features from use in Position Specific Scoring Matrices (PSSMs) to HMMs, along with implementing features used with HMMs specifically, which to our knowledge has not been done before. The adoption of HMM representations of proteins derived from HHblits and HMMer also presents an opportunity for innovation; it has been shown that HMMs can benefit from better multiple sequence alignment than PSSMs and thus get better results than corresponding HMMs using similar feature extraction techniques (Lyons et al. 2015). protHMM implements 20 different feature extraction techniques to provide a comprehensive list of feature sets for use in bioinformatics tasks ranging from protein fold classification to protein-protein interaction.

`library(protHMM)`

protHMM implements normalized Moreau-Broto, Moran and Geary autocorrelation descriptors, each of which measure the correlation between two amino acid residues separated by a lag value, (lg) along the sequence. Each of these feature vectors return a vector of \(20 \times lg\). Liang, Liu, and Zhang (2015) provide a mathematical representation for each of these features, having used them to predict protein structural class:

\({N_{lg}^j = \frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} H_{i, j} \times H_{i+lg, j}, \space (j = 1,2..20, 0 < d < L)}\)

In which \({N_{lg}^j}\) is the Moreau-Broto correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence and \(H\) represents the HMM matrix.

\({N_{lg}^j = \frac{\frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} (H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{\frac {1} {L} \sum_{i = 1} {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, 0 < d < L)}\)

In which \({N_{lg}^j}\) is the Moran correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.

\({N_{lg}^j = \frac{\frac{1} {2(L-lg)} \sum_{i = 1} ^ {L-lg} (H_{i, j} - H_{i+d, j})^2}{\frac {1} {L-1} \sum_{i = 1} {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, 0 < d < L)}\)

In which \({N_{lg}^j}\) is the Geary correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence and \(H\) represents the HMM matrix.

```
library(protHMM)
## basic example code
<- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_MB<- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_MA<- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_GAhead(h_MB, 20)
#> [1] 0.0027399767 0.0028318487 0.0030151714 0.0027021756 0.0029396909
#> [6] 0.0028727712 0.0030186241 0.0028945067 0.0029009803 0.0006026105
#> [11] 0.0003812977 0.0004823361 0.0003340628 0.0005073259 0.0003092559
#> [16] 0.0002997833 0.0010281865 0.0004959552 0.0023601628 0.0021576905
head(h_MA, 20)
#> [1] -0.10731521 -0.08432884 0.03573254 -0.15878664 0.03660993 -0.02897298
#> [7] 0.05583864 -0.02260616 -0.01659614 -0.00317749 -0.02603402 -0.01744969
#> [13] -0.03281501 -0.01579394 -0.03647018 -0.03876510 0.03093880 -0.02196930
#> [19] 0.31531271 0.17506839
head(h_GA, 20)
#> [1] 1.0866849 1.0541761 0.9300136 1.1177010 0.9217161 0.9864679 0.9015713
#> [8] 0.9811766 0.9780934 1.0025380 1.0347499 1.0360329 1.0615057 1.0546791
#> [15] 1.0855622 1.0984717 1.0403336 1.1041239 0.6773725 0.8157369
```

protHMM implements two covariance-based features, autocovariance (AC) and cross covariance (CC). These features calculate the covariance between two residues separated by a lag value lg along the sequence. AC calculates this for positions in the same column, and CC for positions that are not in the same column of the HMM. It should also be noted that for these features, the HMM matrix is not converted to probabilities, unlike other features. Both features were used for protein fold recognition.

Dong, Zhou, and Guan (2009) provide a mathematical representation of autocovariance:

\(AC(j, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{L-lg}, \space j = 1,2..20\)

In which \(AC(j,\space lg)\) is the autocovariance factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.

Dong, Zhou, and Guan (2009) also provide a mathematical representation of cross covariance:

\(AC(j_1, j_2, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j_1} - \bar{H_{j_1}}) \times (H_{i+lg, j_2} - \bar{H_{j_2})}}{L-lg}, \space j_1,j_2 = 1, 2, 3...20, \space j_1 \neq j_2\)

In which \(AC(j_1,\space j_2,\space lg)\) is the cross correlation factor for columns \(j_1, \space j_2\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_{j_i}}\) represents the average of column \(j_i\) in matrix \(H\) and \(H\) represents the HMM matrix.

```
library(protHMM)
## basic example code
<- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_AC<- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_CChead(h_AC, 20)
#> [1] 2027588 2048277 2069394 2090950 8161072 8244348 8329341 8416105 3815863
#> [10] 3854800 3894540 3935108 2055989 2076968 2098380 2120238 4671700 4719371
#> [19] 4768024 4817691
head(h_CC, 20)
#> [1] 432613.90 607445.71 679287.41 527806.56 446405.28 989099.80 188119.73
#> [8] 479781.58 547044.82 29788.07 892089.22 946319.61 520382.72 820507.43
#> [15] 297221.83 194817.35 309185.70 286636.41 873835.20 -23142.80
```

The bigrams and trigrams features are outlined by Lyons et al. (2015), and they interpret the likelihood of 2 (bi) or 3 (tri) amino acids occurring consecutively in the sequence. Thus, they take shape as a 20 x 20 matrix for bigrams and a 20 x 20 x 20 array for trigrams, which are then flattened to create vectors of length 400 and 8000. These features were used by Lyons et al. (2015) for protein fold recognition.

Lyons et al. (2015) outlines bigrams mathematically as \(B[i, j]\), where:

\({B[i, j] = \sum_{a = 1}^{L-1} H_{a, i}H_{a+1, j}}\)

And \({H}\) corresponds to the original HMM matrix, and \({L}\) is the number of rows in \({H}\).

Lyons et al. (2015) outlines bigrams mathematically as \(B[i, j, k]\), where:

\({B[i, j, k] = \sum_{a = 1}^{L-2} H_{a, i}H_{a+1, j}H_{a+2, k}}\)

And \({H}\) corresponds to the original HMM matrix, and \({L}\) is the number of rows in \({H}\).

```
library(protHMM)
## basic example code
<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_tri<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_bihead(h_tri, 20)
#> [1] 0.005352191 0.011975355 0.019514992 0.008968399 0.018890547 0.008317675
#> [7] 0.013992176 0.016166307 0.022135295 0.004365804 0.013618584 0.020090544
#> [13] 0.008747969 0.012091065 0.024533920 0.019144562 0.017654663 0.006009293
#> [19] 0.006042909 0.011424255
head(h_bi, 20)
#> [1] 0.27125769 0.11864240 0.22349757 0.35406729 0.17926196 0.31284113
#> [7] 0.15635251 0.24893749 0.31070932 0.43189681 0.08388694 0.26433238
#> [13] 0.39767028 0.19981994 0.24877376 0.51622764 0.38255083 0.36514895
#> [19] 0.12440151 0.13735055
```

The CHMM feature begins by creating a CHMM, which is created by constructing 4 matrices, \({A, B, C, D}\) from the original HMM \({H}\). \({A}\) contains the first 75 percent of the original matrix \({H}\) row-wise, \({B}\) the last 75 percent, \({C}\) the middle 75 percent and \({D}\) the entire original matrix (An et al. 2019). These are then merged to create the new CHMM \({Z}\). From there, the bigrams feature is calculated with a flattened 20 x 20 matrix \({B}\), in which

\({B[i, j] = \sum_{a = 1}^{L-1} Z_{a, i} \times Z_{a+1, j}}\)

\({H}\) corresponds to the original HMM matrix, and \({L}\) is the number of rows in \({Z}\) (An et al. 2019). Local Average Group, or LAG is calculated by first splitting the CHMM into \(j\) groups and then the following:

\(LAG(k) = \frac{20}{L}\sum_{p = 1}^{N/20} Mt[p+(i-1)\times(N/20),\space j]\)

Where \({L}\) is the number of rows in \({Z}\) and \(Mt[p+(i-1)\times(N/20),\space j]\) represents the row vector of the CHMM and the \(i\)th position in the \(j\)th group (An et al. 2019). These features are then fused, and a vector of 800 along the the original length 400 bigrams and LAG vectors are returned. The CHMM features were used to predict protein-protein interactions.

```
library(protHMM)
<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]]
h_fused<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]]
h_lag<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[3]]
h_bigramshead(h_fused, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_lag, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_bigrams, 20)
#> [1] 0.6644992 0.2932670 0.5686021 0.8736529 0.4484777 0.7703030 0.3734919
#> [8] 0.6135072 0.7469687 1.0671193 0.2097486 0.6524548 0.9846256 0.5010583
#> [15] 0.6275425 1.2841556 0.9755920 0.8980086 0.3038199 0.3414741
```

The distance feature calculates the cosine distance matrix between two HMMs \({A}, \space{B}\) before dynamic time warp is applied to the distance matrix calculate the cumulative distance between the HMMs, which acts as a measure of similarity (Lyons et al. 2016). The cosine distance matrix \({D}\) is calculated with:

\({D[a_i, b_j] = 1 - \frac{a_ib_j^{T}}{a_ia_i^Tb_jb_j^T}}\)

in which \({a_i}\) and \({a_i}\) refer to row vectors of \({A}\) and \({B}\) respectively (Lyons et al. 2016). This in turn means that \(D\) is of dimensions \({nrow(A), nrow(b)}\). Dynamic time warp then calculates the cumulative distance by calculating matrix:

\(C[i, j] = min(C[i-1, j], C[i, j-1], C[i-1, j-1]) + D[i, j]\)

where \({C_{i,j}}\) is 0 when \({i}\) or \({j}\) are less than 1 (Lyons et al. 2016). The lower rightmost point of the matrix \({C}\) is then returned as the cumulative distance between proteins. The distance feature was used by Lyons et al. (2016) for protein fold recognition.

```
library(protHMM)
# basic example code
<- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"), system.file("extdata", "1TEN-7",
hpackage="protHMM"))
h#> [1] 99.90334
```

FP_HMM consists of two vectors, \({d, s}\). Vector \({d}\) corresponds to the sums across the sequence for each of the 20 amino acid columns (Zahiri et al. 2013). Vector \({s}\) corresponds to a flattened matrix of \(S\) where:

\({S[i, j] = \sum_{k = 1}^{L} H[k, j] \times \delta[k, i]}\)

in which \({\delta[k, i] = 1}\) when \({A_i = H[k, j]}\) and 0 otherwise (Zahiri et al. 2013). \({A}\) refers to a list of all possible amino acids, \({i, j}\) span from \({1:20}\). This feature has been used for the prediction of protein-protein interactions.

```
library(protHMM)
# basic example code
<- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1]]
h[[#> [1] 5.327650 2.495163 4.226204 6.557727 3.430360 6.392750 2.638661 5.230106
#> [9] 5.887156 7.882747 1.510770 4.522352 6.408192 3.615053 4.486816 9.137523
#> [17] 7.110800 7.647033 1.971206 2.521788
head(h[[2]], 20)
#> [1] 0.00000000 0.06906144 0.25848195 0.61607168 0.21594299 0.21617140
#> [7] 0.24383966 0.22567252 0.18595546 0.51564480 0.00000000 0.15273093
#> [13] 0.63770076 0.00000000 0.24438102 0.21709957 0.46775998 0.89909902
#> [19] 0.10940189 0.05263446
```

This feature was created by Jin and Zhu (2021) and begins by creates a grouping matrix \(G\) by assigning each position a number \(1,2, 3\) based on the value at each position of HMM matrix \(H\); \(1\) represents the low probability group, \(2\) the medium and \(3\) the high probability group. The number of total points in each group for each column is then calculated, and the sequence is then split based upon the the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) positioned points for each of the three groups, in each of the 20 columns of the grouping matrix. Thus for column \(j\):

\({S(k, \space j, \space z) = \sum_{i = 1}^{(z) \times.25 \times N} |G[i,\space j] = k|}\)

where \({k}\) is the group number, \({z = 1,2,3,4}\) and \({N}\) corresponds to number of rows in matrix \({G}\) (Jin and Zhu 2021).

```
library(protHMM)
# basic example code
<- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1:40]
h[#> [1] 1.11117557 0.83022094 -0.43407491 -1.50170251 -1.67027529 0.91450733
#> [7] 0.54926631 0.26831167 -0.01264296 -1.58598890 1.02688918 0.94260279
#> [13] 0.71783909 0.43688445 -1.61408437 1.11117557 0.32450260 -0.34978852
#> [19] -0.99598417 -1.67027529 0.83022094 0.71783909 0.52117084 0.24021621
#> [25] -1.38932066 0.63355270 0.63355270 0.63355270 0.63355270 -1.55789344
#> [31] 1.11117557 0.46497992 -0.34978852 -1.13646149 -1.67027529 1.05498465
#> [37] 0.85831640 0.52117084 0.18402528 -1.50170251
```

Both of the pseudo HMM features, pse_hmm and IM_psehmm, are based off of Chou and Shen (2007)’s psuedo PSSM concept. This was engineered in order to create non-variable length representations of proteins, while keeping sequence order information. IM_psehmm stands for improved pseudo hmm, which is based off of Ruan et al. (2020)’s improvements to the Chou and Shen (2007)’s initial offering; Ruan et al. (2020)’s offering is still based on the PSSM matrix. As such, this feature is a port for use with HMMs.

Mathematically, pseudo_hmm is made of the fusion of 2 vectors. The first is of the means across the 20 amino acid emission frequency columns of the HMM. The second can be found in the following equation (Chou and Shen 2007):

\(V_{j}^{i} = \frac{1}{L-i}\sum_{k = 1}^{L-i}H[k, j] \times H[k+i, j], \space j = 1,2..20, \space i < L\)

\(H\) represents the HMM matrix, \(L\) represents the row number of \(H\), \(i\) represents the amount of contiguous amino acids that correlation is measured across and \(V_{j}^{i}\) represents the value of the vector for the \(j\)th column and \(i\)th most contiguous amino acids. This results in a vector of \(20 + g \times 20\).

IM_pse_hmm is also the fusion of two vectors, one being the same means vector described above. The second can be found as (Ruan et al. 2020):

\(V_{j}^{d} = \frac{1}{20-d}\sum_{k = 1}^{L}H[k, j] \times H[k, j+d], \space j = 1,2..20, \space d < 20\)

Where \(H\) the HMM matrix, \(L\) represents the number of rows in \(H\) and \(V_{j}^i\) represents the value in the feature vector for column \(j\) and columnal distance parameter \(d\). This results in a vector of length \({20+20\times d-d\times\frac{d+1}{2}}\).

This feature is based off of Li et al. (2019)’s work using local binary pattern to extract features from PSSMs to use in protein-protein interaction prediction. The local binary pattern extraction used here can be defined as \(B = b(s(H_{m, n+1}-H_{m, n}), s(H_{m+1, n+1}-H_{m, n}), s(H_{m+1, n}-H_{m, n})...s(H_{m-1, n+1}-H_{m, n})))\), where \(H[m,n]\) signifies the center of a local binary pattern neighborhood (\(2<m<L-1,2<n<20, L = nrow(H)\)), \(H\) refers to the HMM matrix, \(s(x) = 1\) if \(x \ge 0\), \(s(x) = 0\) otherwise and \(b\) refers to a function that converts a binary number to a decimal number. The local binary pattern features are then put into a histogram with 256 bins, one for each possible binary value. This is returned as a 256 length vector.

```
library(protHMM)
## basic example code
<- hmm_LBP(system.file("extdata", "1TEN-7", package="protHMM"))
h1:20]
h[#> [1] 166 31 16 18 35 8 4 7 12 6 3 1 7 0 2 0 19 4 5
#> [20] 2
```

In this feature, linear predictive coding (LPC) is used to extract a vector \(d\) of length \(g\) from each of the 20 amino acid emission frequency columns of the HMM matrix. linear predictive coding considers that position \(H[i, j]\) in the HMM matrix can be approximated by \(\sum_{k = 1}^{g} d_{k, j}a_{i,j}\). The feature vector extracted is thus the vector of d; this is done through the phonTools package (Barreda 2015). This assumes \(g\) of 14, and thus the feature vector generated if \(14\times20 = 280\). This feature is based off of Qin et al. (2015)’s work, in which they used an LPC feature set to predict protein structural class.

```
library(protHMM)
## basic example code
<- hmm_LPC(system.file("extdata", "1TEN-7", package="protHMM"))
h1:20]
h[#> [1] 1.00000000 0.74635054 0.59126153 0.36331602 0.36010191 0.30277057
#> [7] 0.47978532 0.41120113 0.29511047 0.28022645 0.20143373 0.24002696
#> [13] 0.15302503 0.04457104 1.00000000 0.71845887 0.84727202 0.92128401
#> [19] 1.02748498 0.83287042
```

This feature is documented by Mohammadi et al. (2022) and was used for protein-protein interaction prediction. The SCSH feature returns the k-mer composition between a protein’s consensus sequence given by the HMM and the protein’s actual sequence, for k = 2,3. First, all possible k-mers for all of the 20 possible amino acids are calculated (\({20^2}\) and \({20^3}\) permutations for 2 and 3-mers respectively). With those permutations, different vectors of length 400 and 8000 are created, vectors \(v_2, \space v_3\). Each position on the vectors corresponds to a specific k-mer, i.e \(v_2[1] = AA\) and \(v_2[1] = AAA\). Then, the protein sequence that corresponds to the HMM scores is extracted, and put into a bipartite graph with the actual protein sequence. Each path of length 1 or 2 is found on the graph, and the corresponding vertices on the graph are noted as possible 2 and 3-mers. For each 2 or 3-mer found from these paths, 1 is added to the position that corresponds to that 2/3-mer in the 2-mer and 3-mer vectors, which are the length 400 and 8000 vectors created previously. The vectors are then returned.

```
library(protHMM)
## basic example code
<- hmm_SCSH(system.file("extdata", "1TEN-7", package="protHMM"))
h## 2-mers
1]][1:20]
h[[#> [1] 0 0 0 1 0 2 0 0 1 1 1 0 2 0 1 0 2 1 0 0
## 3-mers: specific indexes as most of the vector is 0
2]][313:333]
h[[#> [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0
```

Separated Dimers refers the the probability that there will be an amino acid dimer between amino acid residues separated by a distance of \(l\). Saini et al. (2015) conceived of this feature and applied it to protein fold recognition; mathematically, separated dimers calculates matrix \(F\):

\({F[m, n] = \sum_{i = 1}^{L-l} H_{i, m}H_{i+l, n}}\)

\({H}\) corresponds to the original HMM matrix, and \(L\) is the number of rows in \(H\). Matrix \(F\) is then flattened to a feature vector of length 400, and returned.

```
library(protHMM)
## basic example code
<- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1:40]
h[#> [1] 0.28073204 0.13127205 0.22465162 0.30993504 0.17705352 0.30286268
#> [7] 0.13611039 0.27110569 0.26843594 0.42275865 0.08623105 0.23523189
#> [13] 0.28262058 0.17449219 0.23466631 0.45744566 0.38400417 0.41489217
#> [19] 0.06656876 0.12644926 0.12468473 0.02787985 0.17961578 0.17639461
#> [25] 0.04536077 0.33523934 0.07878719 0.08306272 0.13034049 0.14246328
#> [31] 0.04005923 0.13693382 0.09324879 0.10231668 0.09851687 0.29441110
#> [37] 0.16511460 0.15335638 0.03221440 0.03884258
```

Nanni, Lumini, and Brahnam (2014) pioneer the Single Average Group feature and use it for protein classification. This feature groups together rows that are related to the same amino acid, using a vector \({SA(k)}\), in which \({k}\) spans \({1:400}\) and:

\({SA(k) = avg_{i \space = \space 1, 2... L}\space H[i, j] \times \delta(P(i), A(z))}\)

in which \({H}\) is the HMM matrix, \({P}\) in the protein sequence, \({A}\) is an ordered set of amino acids, the variables \({j, \space z = 1, 2, 3...20}\), the variable \({k = j + 20 \times (z-1)}\) when creating the vector. \({\delta()}\) represents Kronecker’s delta.

```
library(protHMM)
## basic example code
<- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1:40]
h[#> [1] 5.327649515 2.495163167 4.226203618 6.557727068 3.430359915 6.392750239
#> [7] 2.638660734 5.230106250 5.887155542 7.882747181 1.510770387 4.522351690
#> [13] 6.408192096 3.615052651 4.486815933 9.137522926 7.110799532 7.647033087
#> [19] 1.971206424 2.521787664 0.069061438 1.448076555 0.000000000 0.000000000
#> [25] 0.040982786 0.037528025 0.034250848 0.033810905 0.000000000 0.033966298
#> [31] 0.018141963 0.012174447 0.061500560 0.009964409 0.031380236 0.049529900
#> [37] 0.045531222 0.055012065 0.008144264 0.011430244
```

This feature extraction technique is found in Fang, Noguchi, and Yamana (2013), being used in conjunction with another technique for ligand binding site prediction. This feature smooths the HMM matrix \(H\) by using sliding window of length \(sw\) to incorporate information from up and downstream residues into each row of the HMM matrix. Foreach HMM row :

\(r_i = \sum_{x \space = \space i-\frac{sw}{2}}^{i+\frac{sw}{2}}{r_x}\)

for \({i = 1, 2, 3...L}\), where \({L}\) is the number of rows in \({H}\). For rows such as the beginning and ending rows, zero matrices of dimensions \(sw/2, 20\) are appended to \({H}\).

```
library(protHMM)
## basic example code
<- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1,]
h[#> [1] 0.11776641 0.01631112 0.05565213 0.11811762 0.02074633 0.18715081
#> [7] 0.11375485 0.18747835 0.25055389 0.12211269 0.05342971 0.12007436
#> [13] 0.07683993 0.16693724 0.01869764 0.24692156 0.30136984 0.67903030
#> [19] 0.04331468 0.10385260
```

This feature extraction method is found in Song et al. (2018), and uses singular value decomposition to extract a feature vector from a HMM matrix. SVD factorizes the matrix \(H\) of dimensions \({i, j}\) to

\({U[i, r] \times \Sigma[r, r] \times V[r, j]}\)

The diagonal values of \({\Sigma}\) are known as the singular values of matrix \(H\), and are what are returned with this function. This feature was used to predict protein-protein interactions.

```
library(protHMM)
# basic example code
<- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
h
h#> [1] 2.4328414 1.1318859 1.0331369 0.9046638 0.8595964 0.6696056 0.6425755
#> [8] 0.5865227 0.5189927 0.5155567 0.4701584 0.4118091 0.3627740 0.3249188
#> [15] 0.2858293 0.2615811 0.2576213 0.2251676 0.2074293 0.1242763
```

This function reads in the 20 amino acid emission frequency columns used in the feature extraction methods discussed previously, and converts the columns into probabilities.

```
library(protHMM)
## basic example code
<- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
h1,]
h[#> [1] 0.0000000 0.0000000 0.0000000 0.3410370 0.0000000 0.0000000 0.0000000
#> [8] 0.3382121 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.3208565 0.0000000 0.0000000
```

An, Ji-Yong, Yong Zhou, Yu-Jun Zhao, and Zi-Ji Yan. 2019. “An Efficient Feature Extraction Technique Based on Local
Coding PSSM and Multifeatures Fusion for Predicting Protein-Protein
Interactions.” *Evolutionary Bioinformatics* 15
(January): 117693431987992.

Barreda, Santiago. 2015. *phonTools: Functions for Phonetics in
r.*

Chou, Kuo-Chen, and Hong Shen. 2007. “MemType-2L: A Web server for predicting membrane proteins
and their types by incorporating evolution information through
Pse-PSSM.” *Biochemical and Biophysical Research
Communications* 360 (2): 339–45. https://doi.org/10.1016/j.bbrc.2007.06.027.

Dong, Qiwen, Shuigeng Zhou, and Jihong Guan. 2009. “A new taxonomy-based protein fold recognition approach
based on autocross-covariance transformation.”
*Bioinformatics* 25 (20): 2655–62. https://doi.org/10.1093/bioinformatics/btp500.

Fang, Chun, Tamotsu Noguchi, and Hayato Yamana. 2013. “SCPSSMpred: A General Sequence-based Method for
Ligand-binding Site Prediction.” *IPSJ Transactions on
Bioinformatics* 6 (0): 35–42. https://doi.org/10.2197/ipsjtbio.6.35.

Jin, Danyu, and Ping Zhu. 2021. “Protein
Subcellular Localization Based on Evolutionary Information and Segmented
Distribution.” *Mathematical Problems in
Engineering* 2021 (December): 1–14. https://doi.org/10.1155/2021/8629776.

Li, Yan, Liping Li, Lei Wang, Changqing Yu, Zheng Wang, and Zhu-Hong
You. 2019. “An Ensemble Classifier to Predict
Protein–Protein Interactions by Combining PSSM-based Evolutionary
Information with Local Binary Pattern Model.”
*International Journal of Molecular Sciences* 20 (14): 3511. https://doi.org/10.3390/ijms20143511.

Liang, Yunyun, Sanyang Liu, and Zhang. 2015. “Prediction of Protein Structural Class Based on Different
Autocorrelation Descriptors of Position–Specific Scoring
Matrix.” *MATCH: Communications in Mathematical and in
Computer Chemistry* 73 (3): 765–84.

Lyons, James, Abdollah Dehzangi, Rhys Heffernan, Yuedong Yang, Yaoqi
Zhou, Alok Sharma, and Kuldip K. Paliwal. 2015. “Advancing the Accuracy of Protein Fold Recognition by
Utilizing Profiles From Hidden Markov Models.” *IEEE
Transactions on Nanobioscience* 14 (7): 761–72. https://doi.org/10.1109/tnb.2015.2457906.

Lyons, James, Kuldip K. Paliwal, Abdollah Dehzangi, Rhys Heffernan,
Tatsuhiko Tsunoda, and Alok Sharma. 2016. “Protein fold recognition using HMM–HMM alignment and
dynamic programming.” *Journal of Theoretical
Biology* 393 (March): 67–74. https://doi.org/10.1016/j.jtbi.2015.12.018.

Mohammadi, Alireza M., Javad Zahiri, Saber Mohammadi, Mohsen Khodarahmi,
and Seyed Shahriar Arab. 2022. “PSSMCOOL: a
comprehensive R package for generating evolutionary-based descriptors of
protein sequences from PSSM profiles.” *Biology Methods
and Protocols* 7 (1). https://doi.org/10.1093/biomethods/bpac008.

Nanni, Loris, Alessandra Lumini, and Sheryl Brahnam. 2014. “An Empirical Study of Different Approaches for Protein
Classification.” *The Scientific World Journal*
2014 (January): 1–17. https://doi.org/10.1155/2014/236717.

Qin, Yufang, Xiaoqi Zheng, Jun Wang, Ming Chen, and Changjie Zhou. 2015.
“Prediction of protein structural class based
on Linear Predictive Coding of PSI-BLAST profiles.”
*Central European Journal of Biology* 10 (1). https://doi.org/10.1515/biol-2015-0055.

Ruan, Xiaoli, Dongming Zhou, Rencan Nie, and Yanbu Guo. 2020.
“Predictions of Apoptosis Proteins by
Integrating Different Features Based on Improving
Pseudo-Position-Specific Scoring Matrix.” *BioMed
Research International* 2020 (January): 1–13. https://doi.org/10.1155/2020/4071508.

Saini, Harsh, Gaurav Raicar, Alok Sharma, Sunil K. Lal, Abdollah
Dehzangi, James Lyons, Kuldip K. Paliwal, Seiya Imoto, and Satoru
Miyano. 2015. “Probabilistic expression of
spatially varied amino acid dimers into general form of Chou’s pseudo
amino acid composition for protein fold recognition.”
*Journal of Theoretical Biology* 380 (September): 291–98. https://doi.org/10.1016/j.jtbi.2015.05.030.

Song, Xiaoyu, Zhan-Heng Chen, Xiangyang Sun, Zhu-Hong You, Liping Li,
and Yang Zhao. 2018. “An Ensemble Classifier
with Random Projection for Predicting Protein–Protein Interactions Using
Sequence and Evolutionary Information.” *Applied
Sciences* 8 (1): 89. https://doi.org/10.3390/app8010089.

Zahiri, Javad, Omid Yaghoubi, Morteza Mohammad-Noori, Reza Ebrahimpour,
and Ali Masoudi-Nejad. 2013. “PPIevo :
Protein–protein interaction prediction from PSSM based evolutionary
information.” *Genomics* 102 (4): 237–42. https://doi.org/10.1016/j.ygeno.2013.05.006.