Extended ProtVec

We present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows a sequence to be segmented in multiple ways. PPE can be inferred over a large set of protein sequences (Swiss-Prot) and then applied to a set of unseen sequences. The resulting representation can be used as the input to any downstream machine learning task in protein bioinformatics. Here, we introduce it through two applications: protein motif mining and protein sequence embedding.
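To make the idea concrete, here is a minimal sketch of BPE-style segmentation applied to amino-acid strings. It is not the PPE implementation from the paper: the function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentation`), the toy sequences, and especially the sampling rule (drawing the number of merge steps uniformly at random) are assumptions made for illustration only.

```python
import random
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merge operations from protein sequences."""
    # Start with every sequence split into single amino acids.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically
        # so that the result is deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges, num_steps=None):
    """Segment a (possibly unseen) sequence with the first `num_steps` merges."""
    symbols = list(sequence)
    for pair in merges[:num_steps]:  # num_steps=None applies all merges
        symbols = apply_merge(symbols, pair)
    return symbols


def sample_segmentation(sequence, merges, rng):
    """Hypothetical sampling rule: draw the number of merge steps uniformly."""
    num_steps = rng.randint(0, len(merges))
    return segment(sequence, merges, num_steps)


# Toy training set (made-up sequences, standing in for Swiss-Prot).
training = ["MKVLA", "MKVLG", "AMKVL"]
merges = learn_merges(training, num_merges=3)

print(segment("MKVLA", merges))  # segmentation using all learned merges
rng = random.Random(0)
for _ in range(3):
    print(sample_segmentation("MKVLA", merges, rng))  # alternative segmentations
```

Applying fewer merge steps yields finer-grained segments and applying more yields longer sub-sequences, so varying the number of steps is one simple way to obtain several segmentations of the same sequence.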

ProtVecX: we extend the k-mer-based protein vector (ProtVec) embedding to a variable-length protein embedding built on PPE sub-sequences. We show that the new embedding can marginally outperform ProtVec in both enzyme prediction and toxin prediction tasks.


The paper is under review, but a preprint is available on bioRxiv. The embedding and example IPython notebooks will be available on GitHub.

@article {Asgari345843,
author = {Asgari, Ehsaneddin and McHardy, Alice and Mofrad, Mohammad R. K.},
title = {Probabilistic variable-length segmentation of protein sequences for discriminative motif mining (DiMotif) and sequence embedding (ProtVecX)},
year = {2018},
doi = {10.1101/345843},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2018/07/12/345843},
eprint = {https://www.biorxiv.org/content/early/2018/07/12/345843.full.pdf},
journal = {bioRxiv}
}

See also the DiMotif implementation on GitHub: