After a full year of software development, from June 2011 to March 2012, spent turning the PLAST research prototype into a professional release, we conducted the following benchmark.
PLAST was compared with BLAST and SSearch to evaluate the speedup and the quality of the results produced by the new algorithm. Because SSearch was included in this test, we chose reduced data sets to keep its long running times manageable.
The PLASTp benchmark compared the first 2,327 proteins of the black cottonwood (Populus trichocarpa) proteome against the first 2.9 million sequences of the NCBI RefSeq databank. All computations were conducted on an Apple MacPro computer.
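A reduced data set of this kind can be produced by keeping only the first N records of a Fasta file. The following is a minimal Python sketch, not the procedure actually used for the benchmark; the function names `fasta_records` and `first_records` are hypothetical, and the demo data is made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each Fasta record as a list of lines, header line first."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if record:
                yield record
            record = [line]
        elif record and line:
            # Sequence line belonging to the current record.
            record.append(line)
    if record:
        yield record


def first_records(lines: Iterable[str], count: int) -> List[str]:
    """Return the lines of the first `count` Fasta records."""
    out: List[str] = []
    for record in islice(fasta_records(lines), count):
        out.extend(record)
    return out


# Toy input standing in for a real proteome file.
demo = [">a\n", "MK\n", "LV\n", ">b\n", "MT\n", ">c\n", "MA\n"]
subset = first_records(demo, 2)
print("\n".join(subset))
```

For a compressed file such as Ptrichocarpa_156_peptide.fa.gz, the same function could be fed from `gzip.open(path, "rt")`, with `count` set to 2327 for the query set.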
Software
PLAST: release 2.2.0
BLAST: release 2.2.26+ from NCBI
SSearch: release 36 from University of Virginia
Datasets
Data sets retrieved on April 25th, 2012:
- Query databank: Populus trichocarpa, Fasta file Ptrichocarpa_156_peptide.fa.gz from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Ptrichocarpa/annotation/.
- Subject databank: NCBI RefSeq pre-formatted databank, volume 00. The file refseq_protein.00.tar.gz from ftp://ftp.ncbi.nih.gov/blast/db/ was processed with the blastdbcmd tool to extract the Fasta file.
Computer
All tests were conducted on an Apple MacPro computer running OS X Lion (10.7.3) with two 2.66 GHz 6-core Intel Xeon “Westmere” processors, 32 GB of RAM and a 1 TB HDD.
Results
Software | Accuracy vs. SSearch (%) | Running time (s), 1 core | Running time (s), 4 cores | Running time (s), 8 cores | Running time (s), 12 cores
---|---|---|---|---|---
BLASTp | 74.4 | 37,762 | 10,253 | 6,229 | 4,891
PLASTp | 74.7 | 1,302 | 394 | 262 | 214
Speedup(*) | | 29x | 96x | 144x | 176x
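The speedup row can be reproduced from the running times in the table. The short Python sketch below divides the single-core BLASTp time by each PLASTp time; the running times are taken from the table, while the variable and function names are illustrative only.

```python
# Running times in seconds, keyed by number of cores (from the table).
BLAST_SECONDS = {1: 37762, 4: 10253, 8: 6229, 12: 4891}
PLAST_SECONDS = {1: 1302, 4: 394, 8: 262, 12: 214}


def speedup_vs_single_core_blast(plast_seconds, blast_single_core):
    """Ratio of single-core BLAST time to PLAST time, per core count."""
    return {cores: blast_single_core / t for cores, t in plast_seconds.items()}


def speedup_same_cores(plast_seconds, blast_seconds):
    """Ratio of BLAST time to PLAST time at the same core count."""
    return {cores: blast_seconds[cores] / t for cores, t in plast_seconds.items()}


speedups = speedup_vs_single_core_blast(PLAST_SECONDS, BLAST_SECONDS[1])
same_core = speedup_same_cores(PLAST_SECONDS, BLAST_SECONDS)

for cores in sorted(speedups):
    print(f"{cores:>2} cores: {speedups[cores]:.0f}x vs. single-core BLAST, "
          f"{same_core[cores]:.0f}x vs. BLAST on the same cores")
```

The second function is an extra comparison not reported in the table: it shows how the ratio behaves when both programs are given the same number of cores.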
Comments
- All three programs were run with an increasing number of computing cores, the BLOSUM62 matrix and an E-value threshold of 1e-3. Results were written to tabular files so that the output of BLAST, PLAST and SSearch could be compared.
- Accuracy was evaluated as the fraction (%) of sequence alignments produced by each algorithm that are also found by a reference algorithm, SSearch. Results from BLAST and PLAST were compared with those of SSearch as follows: for each query sequence, we checked that the hit sequence IDs and the sequence alignment locations were equal.
- It is worth noting that PLAST is faster than BLAST even on a single computing core: 1,302 s versus 37,762 s.
- (*) Speedup of PLAST, running on the given number of cores, over BLAST running on a single computing core.
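The accuracy measure described in the comments above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for the benchmark: it assumes the tabular files follow the 12-column BLAST-style layout (query ID, hit ID, identity, length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), and that two alignments match only when IDs and all four coordinates are strictly equal. The function names and the demo rows are hypothetical.

```python
from typing import Iterable, Set, Tuple

# (query ID, hit ID, query start, query end, subject start, subject end)
AlignmentKey = Tuple[str, str, int, int, int, int]


def alignment_keys(lines: Iterable[str]) -> Set[AlignmentKey]:
    """Extract one comparison key per alignment from tabular output lines."""
    keys: Set[AlignmentKey] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        # Assumed column layout: IDs in columns 1-2, coordinates in 7-10.
        keys.add((f[0], f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return keys


def accuracy_percent(tool_lines: Iterable[str],
                     reference_lines: Iterable[str]) -> float:
    """Percentage of the tool's alignments that the reference also reports."""
    tool = alignment_keys(tool_lines)
    reference = alignment_keys(reference_lines)
    if not tool:
        return 0.0
    return 100.0 * len(tool & reference) / len(tool)


def row(q, s, qs, qe, ss, se):
    """Build a made-up 12-column tabular line for the demo."""
    cols = [q, s, "90.0", "100", "10", "0", qs, qe, ss, se, "1e-20", "200"]
    return "\t".join(str(c) for c in cols)


reference_demo = [row("q1", "s1", 1, 100, 5, 104),
                  row("q1", "s2", 10, 90, 1, 81),
                  row("q2", "s3", 1, 50, 1, 50)]
tool_demo = [row("q1", "s1", 1, 100, 5, 104),   # same hit, same location
             row("q1", "s2", 12, 90, 3, 81),    # same hit, other location
             row("q2", "s3", 1, 50, 1, 50),     # same hit, same location
             row("q2", "s9", 1, 40, 1, 40)]     # hit absent from reference

print(accuracy_percent(tool_demo, reference_demo))
```

Strict coordinate equality is the simplest reading of "equality between hit sequence IDs and sequence alignment locations"; a tolerance on the coordinates would give a more lenient measure.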