Datasets
The datasets were obtained from the public protein databases SCOP [7] and the extended SCOPe [8]. These databases provide a hierarchical structural classification of protein domains with solved structure. From top to bottom, the hierarchical levels are structural class, fold, superfamily and family, which group protein domains with increasing sequence similarity at each level.
Training dataset
We trained our neural network models using the SCOPe 2.06 training set from [43]. This training set was obtained by filtering out protein domains with significant sequence similarity to those in the test set. To do so, the following similarity-reduction methods were applied: MMseqs2 [59] (sequence identity 25%, e-value \(10^{-4}\)), CD-HIT-2D [60] (sequence identity 40%) and BLAST+ [61] (e-value \(10^{-4}\)). The final dataset contains 16,133 protein domains sharing at most 95% pairwise sequence identity, classified into \(K=1154\) folds. For hyperparameter tuning, we performed a 5-stage cross-validation over the entire training set. That is, we split the 16,133 protein domains into 5 groups, keeping domains from different families in each one (Additional file 1: S1). This prevents proteins in a validation subset from having amino acid sequences similar to those in the corresponding training subset.
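The key property of this split is that all domains of a family land in the same cross-validation group. As a minimal sketch of how such a family-aware split could be implemented (the `family_aware_split` helper, its greedy balancing strategy and the toy family labels are illustrative assumptions, not the paper's actual procedure from Additional file 1: S1):

```python
import random
from collections import defaultdict

def family_aware_split(domain_families, n_groups=5, seed=0):
    """Split domain ids into n_groups so that all domains of a SCOPe
    family end up in the same group (hypothetical helper)."""
    by_family = defaultdict(list)
    for domain, family in domain_families.items():
        by_family[family].append(domain)
    families = sorted(by_family)
    random.Random(seed).shuffle(families)
    groups = [[] for _ in range(n_groups)]
    for fam in families:
        # greedily assign the whole family to the currently smallest group
        smallest = min(groups, key=len)
        smallest.extend(by_family[fam])
    return groups

# toy example: 6 domains from 3 families, split into 2 groups
toy = {"d1": "a.1.1", "d2": "a.1.1", "d3": "b.2.1",
       "d4": "b.2.1", "d5": "c.3.1", "d6": "c.3.1"}
groups = family_aware_split(toy, n_groups=2)
```

Because families are assigned as whole units, no validation subset can contain a domain whose family (and hence close sequence relatives) appears in the corresponding training subset.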
Benchmark datasets
We tested the effectiveness of our hyperspherical embeddings using both the well-known LINDAHL dataset [3] and the updated LINDAHL_1.75 dataset we recently proposed in [48]. The original LINDAHL dataset includes 976 domains from SCOP 1.37 covering 330 folds. Updated to SCOP 1.75, the LINDAHL_1.75 dataset contains the same number of proteins (976), now classified into 323 folds. Protein domains within both test sets share a maximum sequence identity of 40%, both with each other and with respect to the training domains. Each dataset is paired and evaluated independently at three different levels: family, superfamily and fold. Thus, while the numbers of individual protein domains evaluated within the LINDAHL dataset are 555, 434 and 321 for the family, superfamily and fold levels, in LINDAHL_1.75 we evaluate 547, 431 and 356 domains, respectively.
Protein residue-level feature representation
In order to represent the protein amino acid sequence of variable length L, we considered 45 features per amino acid, as in previous works [42, 48]. These 45 residue-level features encode the following information:

Amino acid encoding: one-hot vector of size 20 representing the amino acid type.

Position-specific scoring matrix (PSSM): 20 elements containing the evolutionary profile information obtained from the multiple sequence alignment (MSA). We computed the PSSM using PSI-BLAST [10] and the non-redundant database 'nr90' for sequence homology searching.

Secondary structure: one-hot vector of size 3 encoding the helix, strand and loop secondary structure elements. To predict the secondary structure we used the SSpro method from the SCRATCH suite [62].

Solvent accessibility: one-hot vector of size 2 encoding the exposed and buried states. As before, we used the ACCpro method from SCRATCH to predict the solvent accessibility states.
These \(L \times 45\) features are used as input to our neural network models, which are trained to predict the fold class of each protein domain.
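The four feature groups above can be assembled as follows. This is a minimal sketch: the residue, secondary-structure and accessibility alphabets and their ordering are assumptions for illustration, not the exact encodings used by the tools cited above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed order)
SS_STATES = "HEC"                      # helix, strand, loop (assumed order)
ACC_STATES = "eb"                      # exposed, buried (assumed order)

def residue_features(aa, pssm_row, ss, acc):
    """Build the 45-dim vector for one residue: 20 (one-hot amino acid)
    + 20 (PSSM profile) + 3 (one-hot secondary structure)
    + 2 (one-hot solvent accessibility)."""
    onehot_aa = np.zeros(20)
    onehot_aa[AMINO_ACIDS.index(aa)] = 1.0
    onehot_ss = np.zeros(3)
    onehot_ss[SS_STATES.index(ss)] = 1.0
    onehot_acc = np.zeros(2)
    onehot_acc[ACC_STATES.index(acc)] = 1.0
    return np.concatenate([onehot_aa, pssm_row, onehot_ss, onehot_acc])

def sequence_features(seq, pssm, ss, acc):
    """Stack per-residue features into the L x 45 input matrix."""
    return np.stack([residue_features(a, p, s, c)
                     for a, p, s, c in zip(seq, pssm, ss, acc)])

# toy 3-residue domain with a zero PSSM profile
X = sequence_features("ACD", np.zeros((3, 20)), "HEC", "ebe")
```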
Residual-convolutional and recurrent neural network
In this study, we improve our previously proposed neural network models, CNN-GRU and CNN-BGRU [48], with blocks of residual convolutions [52]. As a result, the model architecture comprises three main parts, as depicted in Fig. 2: residual-convolutional (ResCNN), recurrent (RNN) and fully-connected (FC). We named these new models ResCNN-GRU and ResCNN-BGRU, depending on the use of unidirectional or bidirectional layers of gated recurrent units (GRU) in the recurrent part.
Residual-convolutional part
The convolutional neural network (CNN) aims to capture the local context of each residue in the protein domain and discover short-term patterns within the amino acid sequence. At each CNN layer, we apply a 1D-convolution operation along the sequence dimension, with several convolutional filters of specific length to be learned. Given an input of size \(L \times 45\), the output of each 1D-convolutional layer is of size \(L \times N_l\), where \(N_l\) is the number of learned filters in the l-th layer. In our model, the 1D-convolutional layers are grouped into residual blocks [52]. The output \(\mathcal{R}(x_b, \mathcal{W}_b)\) of each residual block is combined with its input \(x_b\) as \(x_{b+1} = x_b + \mathcal{R}(x_b, \mathcal{W}_b)\), where \(\mathcal{W}_b\) are the weights and biases associated with the b-th residual block, and \(\mathcal{R}(\cdot)\) is the mapping function performed by the block.
Figure 2a presents the ResCNN part of our model. We first apply an initial 1D-convolution to transform the \(L \times 45\) input features into \(L \times 256\) outputs by using 256 filters of length 1. These are then processed by two residual blocks, each formed by two layers with 64 and 256 filters of length 5. After each convolution, ReLU activation and batch normalization [63] are applied.
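The residual-block computation \(x_{b+1} = x_b + \mathcal{R}(x_b, \mathcal{W}_b)\) can be sketched in plain numpy. This is a shape-level illustration only: batch normalization is omitted for brevity, the weights are random placeholders, and the loop-based convolution is not how a deep learning framework would implement it.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1D convolution along the sequence with 'same' zero-padding.
    x: (L, C_in), w: (k, C_in, C_out), b: (C_out,)."""
    L, _ = x.shape
    k, _, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((L, c_out))
    for i in range(L):
        out[i] = np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1])) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, b1, w2, b2):
    """Two 1D-conv layers (64 then 256 filters of length 5 in the paper)
    whose output is added back to the block input: x + R(x, W)."""
    h = relu(conv1d_same(x, w1, b1))   # (L, 64)
    h = relu(conv1d_same(h, w2, b2))   # (L, 256)
    return x + h                       # skip connection

rng = np.random.default_rng(0)
L = 10
x = rng.normal(size=(L, 256))          # after the initial length-1 convolution
w1, b1 = rng.normal(size=(5, 256, 64)) * 0.01, np.zeros(64)
w2, b2 = rng.normal(size=(5, 64, 256)) * 0.01, np.zeros(256)
y = residual_block(x, w1, b1, w2, b2)  # (L, 256), same shape as the input
```

Note how the 256-channel input and output of the block match, which is what makes the element-wise skip connection possible.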
Recurrent part
The purpose of the recurrent neural network (RNN) is to exploit long-distance relations across the whole amino acid sequence and generate a summary of the protein domain at its output. Here, the \(L \times 256\) outputs from the ResCNN are fed into a layer based on gated recurrent units (GRU) [50] with 1024 state units.
As shown in Fig. 2b, instead of saving all the \(L \times 1024\) states of the GRU, we only keep the last state \(\overrightarrow{\mathbf{h}}_L\) as a summary vector of 1024 elements. In this way, our model architecture can process amino acid sequences of arbitrary length and extract a fixed-size vector representing the whole protein domain. We refer to this model as ResCNN-GRU. An alternative architecture is based on a bidirectional GRU [64], which also processes the sequence in reverse order. In this case, the last states of the forward (\(\overrightarrow{\mathbf{h}}_L\)) and backward (\(\overleftarrow{\mathbf{h}}_L\)) GRU layers are concatenated into a vector of 2048 elements. We denote this model as ResCNN-BGRU.
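When sequences are batched with zero-padding (as described under model optimization below), "the last state" means the state at each domain's true last residue, not at the padded end. A small sketch of that gather operation, assuming states are stored as a (batch, max-length, hidden) array:

```python
import numpy as np

def last_states(h, lengths):
    """Select, for each sequence in the batch, the GRU state at its last
    real residue (i.e. just before the zero-padding).
    h: (B, L_max, H) state sequence, lengths: (B,) true sequence lengths."""
    batch = np.arange(h.shape[0])
    return h[batch, np.asarray(lengths) - 1]  # (B, H) fixed-size summaries

# toy batch: 2 sequences of lengths 3 and 5, hidden size 4
rng = np.random.default_rng(1)
h = rng.normal(size=(2, 5, 4))
summary = last_states(h, [3, 5])
```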
Fully-connected part
Finally, the fully-connected (FC) part combines the recurrent output to create a fold-related embedding for the whole protein domain, which is then used to perform a preliminary fold classification. This classification step guides the model during training to learn a meaningful embedding space related to the protein folds. The learned embeddings are then used for pairwise fold recognition in the test phase.
In particular, the FC part (Fig. 2c) consists of two dense layers. The first one, with 512 units, learns a non-linear combination of the GRU output vector (of size 1024 or 2048 for the unidirectional and bidirectional architectures, respectively), which shapes the fold-related embedding. As non-linearity, both the sigmoid and the hyperbolic tangent (tanh) activation functions were tested in our experiments. The last layer performs a linear classification of the 512-dimensional embeddings using K output units, where K is the number of fold classes into which the input proteins are classified during training. In the following subsections we detail how this last classification layer can be modified to learn more discriminative embedding vectors by distributing the fold class vectors in hyperspherical space.
Neural network model optimization
We trained our neural network models with mini-batches of 64 protein domains. To process variable-length sequences, we applied zero-padding up to the maximum length within each mini-batch. After the GRU layer, we kept the last state vector of each domain sample before the zero-padding, which corresponds to the last amino acid residue of each domain in the mini-batch. In the bidirectional GRU, the same GRU layers are used, but the amino acid sequences are first reversed for the backward layer, so the last state (before zero-padding) corresponds to the first residue of each domain. The optimization process was performed in two stages by comparing the model predictions with the true fold classes (ground truth). In the first stage (Fig. 1a), we optimized the models by minimizing the well-known softmax cross-entropy loss, while in the second stage (Fig. 1c) we used the large margin cosine loss (LMCL) [56], a normalized and margin-discriminative version of the softmax loss. In this case, we also used a fixed (i.e. non-trainable) weight matrix in the classification layer (\(\mathbf{W}\) in Fig. 2c) which maximally separates fold class vectors in hyperspherical space (Fig. 1b). We used the Adam optimizer [65] with an initial learning rate of \(10^{-3}\), reduced by a factor of 10 at epoch 40, and completed the whole optimization process in 80 epochs. In order to prevent overfitting to the most populated fold classes, we applied an \(L_2\) penalty with a small weight decay of \(5\times 10^{-4}\) and dropout [66] with a drop probability of 0.2 in the convolutional and first FC layers.
Large margin cosine loss
The softmax cross-entropy loss (softmax loss for simplicity) is one of the most common loss functions for multi-class classification problems. It is defined as:
$$\begin{aligned} L_{softmax} = - \frac{1}{N} \sum _{i=1}^N \log p_i = - \frac{1}{N} \sum _{i=1}^N \log \frac{ e^{f_{y_i}} }{ \sum _{k=1}^K e^{f_k} }, \end{aligned}$$
(1)
where \(p_i\) is the posterior probability of the embedding sample \(\mathbf{x}_i\) being classified into its ground-truth class \(y_i\), N is the number of training samples in the mini-batch (\(i=1, \dots, N\)), K is the number of classes (\(k=1, \dots, K\)), and \(f_k\) is the output of the last linear classification layer with weight matrix \(\mathbf{W} \in \mathbb{R}^{K\times d}\) (the bias is set to zero for simplicity). For each input \(\mathbf{x}_i\), the output corresponding to class k is computed as:
$$\begin{aligned} f_k = \mathbf{w}_k^T \mathbf{x}_i = \left\| \mathbf{w}_k \right\| \left\| \mathbf{x}_i \right\| \cos (\theta _{k,i}), \end{aligned}$$
(2)
with \(\theta _{k,i}\) being the angle between the vectors \(\mathbf{w}_k\) and \(\mathbf{x}_i\). If we enforce \(\left\| \mathbf{w}_k \right\| = 1\) through \(L_2\) normalization, and \(\left\| \mathbf{x}_i \right\| = s\) by using a tunable scale hyperparameter, the posterior probability depends only on the cosine of the angle \(\theta _{k,i}\). This results in the normalized softmax loss (NSL), defined as:
$$\begin{aligned} L_{ns} = - \frac{1}{N} \sum _{i=1}^N \log \frac{ e^{s \cos (\theta _{y_i,i})} }{ \sum _{k=1}^K e^{s \cos (\theta _{k,i})} }. \end{aligned}$$
(3)
The feature embeddings learned with NSL are angularly distributed, but not necessarily more discriminative than those learned with the softmax loss. In order to control the classification boundaries, two variants of the NSL, the angular softmax (A-Softmax) loss [55] and the large margin cosine loss (LMCL) [56], introduce a margin hyperparameter \(m \ge 0\). The decision margin in LMCL is defined in cosine space rather than in angle space, which proved more beneficial when learning the classification boundaries [56]. This is therefore the loss function we adopted to optimize our neural network models, formally defined as:
$$\begin{aligned} L_{lmc} = - \frac{1}{N} \sum _{i=1}^N \log \frac{ e^{s (\cos (\theta _{y_i,i}) - m)} }{ e^{s (\cos (\theta _{y_i,i}) - m)} + \sum _{k \ne y_i} e^{s \cos (\theta _{k,i})} }, \end{aligned}$$
(4)
subject to \(\cos (\theta _{k,i}) = \hat{\mathbf{w}}_k^T \hat{\mathbf{x}}_i\), where \(\hat{\mathbf{w}}_k\) and \(\hat{\mathbf{x}}_i\) are the \(L_2\)-normalized vectors (\(\hat{\mathbf{w}}_k = \mathbf{w}_k / \left\| \mathbf{w}_k \right\|\) and \(\hat{\mathbf{x}}_i = \mathbf{x}_i / \left\| \mathbf{x}_i \right\|\)).
As stated in the original paper [56], by \(L_2\)-normalizing the embedding vectors \(\mathbf{x}_i\) we enforce them to lie on the surface of a d-dimensional hypersphere. The scaling hyperparameter s thus controls the radius of this hypersphere, and its value increases with the number of classes. The margin hyperparameter m relates to the capacity to learn more discriminative embeddings. Possible values are in the range \(m \in [0, \frac{K}{K-1})\), although values close to the upper bound may cause convergence failures. With this in mind, we tuned the scale s and margin m hyperparameters for each neural network model through cross-validation.
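The LMCL of Eq. (4) is straightforward to compute from a batch of embeddings and the class weight matrix. A minimal numpy sketch (the scale and margin values below are placeholders, not the tuned hyperparameters):

```python
import numpy as np

def lmcl(X, W, y, s=30.0, m=0.35):
    """Large margin cosine loss (Eq. 4) for a mini-batch.
    X: (N, d) embeddings, W: (K, d) class weight vectors,
    y: (N,) ground-truth class labels."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize embeddings
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # L2-normalize class vectors
    cos = Xn @ Wn.T                                    # (N, K) cosine similarities
    rows = np.arange(len(y))
    logits = s * cos
    logits[rows, y] = s * (cos[rows, y] - m)           # margin on the true class
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[rows, y]))

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 16))
W = rng.normal(size=(5, 16))
y = rng.integers(0, 5, size=8)
loss = lmcl(X, W, y)
```

Setting m = 0 recovers the NSL of Eq. (3), and a larger margin always makes the same batch harder, i.e. increases the loss.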
Thomson-derived hyperspherical prototypes
We hypothesize that providing a non-trainable matrix \(\mathbf{W} \in \mathbb{R}^{K\times d}\) to the classification layer can ease the training process. This matrix contains K predefined prototype vectors representing the fold classes, \(\mathbf{W} = \{ \mathbf{w}_1, \dots, \mathbf{w}_K \}\). Thus, we can shape the embedding space to be representative of the protein folds, and so extract more meaningful fold-related embeddings for each protein during the training stage (Fig. 1c). The use of such prototype networks was first proposed in [58].
Optimal distribution of prototypes
We argue that the optimal configuration of the K prototype vectors is the one providing maximal separation in angular space. This can be achieved by placing the K points equidistantly on the surface of a d-dimensional hypersphere, so that \(\mathbf{w}_k \in \mathbb{S}^{d-1}\), as shown in Fig. 1b. The Thomson problem [57] addresses this by studying the distribution of K charged particles on the surface of a unit 3D sphere. The minimum-energy configuration can be found by minimizing the Coulomb energy. Using simplified units for the electron charges and Coulomb's constant, the energy for a pair of electrons reduces to \(E_{ij} = 1 / r_{ij}\), depending only on the distance \(r_{ij}\) between the two points.
This can be extended to points on the surface of a d-dimensional hypersphere, computing the energy over all possible pairs of points [67]. We can therefore optimize the distribution of our \(\mathbf{w}_k\) prototype vectors by minimizing the generalized Thomson loss (THL), defined as:
$$\begin{aligned} L_{th} = \sum _{k=1}^K \sum _{j=1}^{k-1} \frac{1}{\left\| \mathbf{w}_k - \mathbf{w}_j \right\| _2^2} + \frac{\lambda }{2} \sum _{k=1}^K \left( \left\| \mathbf{w}_k \right\| ^2 - 1 \right) ^2. \end{aligned}$$
(5)
The hyperparameter \(\lambda\) controls the weight of the norm constraint. Note that the Thomson loss uses the Euclidean distance between points, which is affected by the norm of each vector, whereas the cosine similarity is more adequate to measure angular separation (independent of the norm). In order to remove the norm constraint from the loss function, we propose to directly maximize the Euclidean distance between the projected (\(L_2\)-normalized) vectors. Thus, we can remove the hyperparameter \(\lambda\) from Eq. (5), obtaining the following Thomson loss (THL-sum):
$$\begin{aligned} L_{th\_sum} = \sum _{k=1}^K \sum _{j=1}^{k-1} \left\| \frac{\mathbf{w}_k}{\left\| \mathbf{w}_k \right\| } - \frac{\mathbf{w}_j}{\left\| \mathbf{w}_j \right\| } \right\| _2^{-2}. \end{aligned}$$
(6)
Alternatively, we can minimize the maximum cosine similarity computed for each prototype vector [58], using the following loss function (THL-maxcos):
$$\begin{aligned} L_{th\_maxcos} = \frac{1}{K} \sum _{k=1}^K \max _{j \ne k} \left( \frac{\mathbf{w}_k \cdot \mathbf{w}_j}{\left\| \mathbf{w}_k \right\| \left\| \mathbf{w}_j \right\| } \right). \end{aligned}$$
(7)
Maximally separated prototype vectors are then obtained by gradient descent over the proposed loss function (either THL-sum or THL-maxcos), noting that all possible pairs of points are considered in each single iteration step.
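Both objectives can be written directly from Eqs. (6) and (7). A minimal numpy sketch of the two losses follows; as a sanity check, for K = 4 prototypes in d = 3 dimensions the maximally separated configuration is a regular tetrahedron, whose pairwise cosine similarity is exactly -1/3 (any random configuration scores worse under THL-maxcos).

```python
import numpy as np

def thl_maxcos(W):
    """THL-maxcos (Eq. 7): mean over prototypes of the maximum cosine
    similarity to any other prototype."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = Wn @ Wn.T
    np.fill_diagonal(cos, -np.inf)     # exclude j == k from the max
    return cos.max(axis=1).mean()

def thl_sum(W):
    """THL-sum (Eq. 6): sum of inverse squared Euclidean distances
    between the L2-normalized (projected) prototypes."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    total = 0.0
    for k in range(len(Wn)):
        for j in range(k):             # all pairs j < k
            total += 1.0 / np.sum((Wn[k] - Wn[j]) ** 2)
    return total

# regular tetrahedron: 4 maximally separated points on the unit 3D sphere
tetra = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]],
                 dtype=float)
rng = np.random.default_rng(3)
random_W = rng.normal(size=(4, 3))     # random prototypes for comparison
```

In practice either loss would be minimized over \(\mathbf{W}\) by gradient descent (e.g. with an automatic-differentiation framework); the functions above only evaluate the objectives.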
Initial prototype vectors
As the initial matrix of prototypes we could take a set of K Gaussian random vectors of dimension d, \(\mathbf{W}^{random}\). However, we found that the learned classification matrix from a model previously trained with the softmax cross-entropy loss (Fig. 1a), \(\mathbf{W}^{softmax}\), provides better results. Unlike \(\mathbf{W}^{random}\), the matrix \(\mathbf{W}^{softmax}\) has been trained to classify protein domains into folds, thereby partially preserving the arrangement of the structural classes within the learned space. To show this, we measured the intra- and inter-structural-class prototype separation, as well as the angular Fisher score (AFS) [55]. Further details can be found in Additional file 1: S2.
Pairwise similarity scores
Cosine similarity measures
The FoldHSphere method (Fig. 1d) uses the hyperspherical embeddings extracted from our neural network model to compute a fold similarity measure between each pair of protein domains. Following previous works [43, 48], we used as metric the cosine similarity between two embedding vectors \(\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^{d}\), computed as:
$$\begin{aligned} \cos (\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\left\| \mathbf{x}_i \right\| \left\| \mathbf{x}_j \right\| }, \end{aligned}$$
(8)
which measures angular separation (in the range \([-1, 1]\)) independently of the norm of each embedding vector.
Random forest enhanced scores
To obtain an improved fold similarity score (FoldHSpherePro in Fig. 1d), we trained a random forest (RF) model using our cosine similarity score along with the 84 pairwise similarity measures from [33, 34] and the DeepFR cosine similarity [43]. Thus, each input vector is of size 86 and corresponds to a pair of protein domains. The RF model uses this information to determine whether the two domains in the pair share the same fold class (binary classification). We trained and evaluated the RF models in a 10-stage cross-validation setting, independently for the LINDAHL and LINDAHL_1.75 test sets. Each random forest model used 500 decision trees, as in [43, 48].
Evaluation
Three-level rank performance accuracy
As originally proposed in [3], we evaluated the test protein domains at three levels of increasing difficulty: family, superfamily and fold. At each level, we differentiated between positive and negative pairs of domains. A negative pair contains two protein domains from different fold classes, while in a positive pair both domains belong to the same fold class. Each level includes all the negative pairs, while positive pairs are selected according to the SCOP hierarchy [7]. That is, the family level contains pairs of domains that share the same family, and therefore the same superfamily and fold. At the superfamily level, the domains in each pair share the same superfamily (and therefore the same fold) but not the same family. Finally, domains in positive pairs at the fold level share only the same fold, but neither the same family nor superfamily.
At each of these levels, for every individual protein domain (query) we ranked the remaining domains (templates) according to their similarity scores, either cosine similarities or random forest output scores. Then, we assigned the fold class of the most similar template to the query and computed the ratio of hits (top-1 accuracy). We also computed the ratio of queries whose correct fold class appears within the 5 first-ranked templates (top-5 accuracy). Note that, instead of using the training set as the search database, in this evaluation we search for template domains inside the test set itself (either LINDAHL or LINDAHL_1.75).
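The ranking step above can be sketched as follows. This is a simplified illustration on a toy similarity matrix; it ignores the per-level selection of positive pairs described earlier and simply asks whether any of the k best-ranked templates shares the query's fold.

```python
import numpy as np

def topk_accuracy(sim, labels, k=1):
    """For each query domain, rank all other domains (templates) by
    similarity and count a hit if any of the k best-ranked templates
    shares the query's fold class.
    sim: (N, N) similarity matrix, labels: (N,) fold class ids."""
    N = len(labels)
    hits = 0
    for q in range(N):
        scores = sim[q].copy()
        scores[q] = -np.inf                  # exclude the query itself
        top = np.argsort(scores)[::-1][:k]   # k most similar templates
        hits += any(labels[t] == labels[q] for t in top)
    return hits / N

# toy example: 4 domains, two folds, block-structured similarities
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
top1 = topk_accuracy(sim, labels, k=1)
```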
To measure the statistical significance of our top-1 and top-5 accuracy results, we also provide standard errors estimated as the standard deviation over 1000 bootstrap samples. To do so, we sampled with replacement from the set of individual protein domains tested at each level (555, 434 and 321 domains, respectively, in the LINDAHL dataset). Then, for each sampled set we selected all negative pairs and the positive pairs corresponding to the specific level, and proceeded with the evaluation as before.
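A simplified sketch of this bootstrap estimate: the paper resamples domains and then re-forms the pairs, whereas the sketch below resamples per-query hit indicators directly, which is an assumption made for brevity.

```python
import numpy as np

def bootstrap_se(correct, n_boot=1000, seed=0):
    """Standard error of an accuracy estimate: standard deviation of the
    accuracy over n_boot bootstrap resamples of the per-query hit
    indicators (1 = correct fold found, 0 = miss)."""
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = [correct[rng.integers(0, n, size=n)].mean()
            for _ in range(n_boot)]
    return float(np.std(accs))

# e.g. 321 fold-level queries with roughly 80% hits
hits = np.array([1] * 257 + [0] * 64)
se = bootstrap_se(hits)
```

For a binary hit/miss vector this estimate approaches the analytic standard error \(\sqrt{p(1-p)/n}\) as the number of resamples grows.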
Fold-level LINDAHL cross-validation evaluation
In order to compare with some recent methods [35,36,37,38,39,40,41, 44,45,46], we also provide results in a fold-level 2-stage cross-validation setting on the LINDAHL test set [22]. Here, the 321 protein domains forming positive pairs at the fold level are separated into two subsets, LE_a and LE_b, with 159 and 162 domains respectively. Note that the remaining domains within LINDAHL (up to 976) are not considered during this evaluation. When evaluating the protein domains in one subset (e.g. LE_a), the domains in the other subset (LE_b) act as templates for ranking. Thus, the random forest models are trained using pairs of protein domains from one subset, whereas the evaluation is performed on the other. We report the performance accuracy averaged over both cross-validation subsets.