PLAST command-line arguments

Contents

1 Mandatory arguments
2 Controlling results size
3 Sorting results by query IDs
4 Controlling speed
5 Optimizing quality/speed ratio using PLAST specific arguments
6 Optimizing PLAST: sample recipes
7 Optimizing PLAST at runtime: using seed-ratio
8 Optimizing PLAST at runtime: using threshold score
  - 8.0.1 Use case on 16s RNA
  - 8.0.2 Use case on protein and nucleotide comparisons
9 Optimizing PLAST at runtime: using max-database-size
10 Score matrix, gap costs and match/mismatch costs
11 Proteic-based PLAST search
  - 11.0.1 BLOSUM62
  - 11.0.2 BLOSUM50
12 Nucleic-based PLAST search (plastn only)
13 Monitoring job

Mandatory arguments

Each PLAST job requires at least the following arguments:

	Argument	Description
	-p	comparison method. One of: plastn, plastp, plastx, tplastn or tplastx
	-i	the query file provided as a Fasta formatted sequence file
	-d	the reference databank. Either a Fasta file or a BLAST databank
	-o	the results file

As mentioned in the table, PLAST can work directly with Fasta files. For the reference databank, you can also provide a databank name if you use KDMS (Korilog Databank Manager System), which is provided with KlastRunner. The databank name is shown in the KDMS graphical frontend, in the “Name” column of the right panel.
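
A minimal invocation using only the four mandatory arguments might look like the sketch below. The file names are hypothetical placeholders, and the command is printed rather than executed so that it can be reviewed first.

```shell
# Hypothetical file names; replace them with your own data.
query="query.fasta"        # Fasta file holding the query sequences
bank="reference.fasta"     # Fasta file (or BLAST databank) used as reference
results="results.out"      # file that will receive the results

# Assemble the command from the four mandatory arguments.
set -- plast -p plastp -i "$query" -d "$bank" -o "$results"
cmd="$*"

# Print the command for review.
echo "$cmd"
# "$@"   # uncomment to actually run it once PLAST is installed
```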

Controlling results size

You can control how many hits are reported in a result file by using these arguments:

	Argument	Description
	-e	E-value threshold. Default value is 10
	-max-hit-per-query	Maximum number of hits aligned to a query. Default value is 10. Requires the additional argument -force-query-order 1000
	-max-hsp-per-hit	Maximum number of HSPs reported for a hit. Default value is 1. Requires the additional argument -force-query-order 1000

If you want to get all possible hits/HSPs, simply pass 0 (zero) to -max-hit-per-query and -max-hsp-per-hit.
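
As an illustration, the three arguments can be combined as sketched below. The numeric values and file names are arbitrary examples, and the E-value notation (1e-3) is an assumption about what the command-line parser accepts.

```shell
# Keep at most 5 hits per query and 2 HSPs per hit, with a stricter E-value.
# -force-query-order 1000 is added because both limits require it.
set -- plast -p plastp -i query.fasta -d reference.fasta -o results.out \
    -e 1e-3 -max-hit-per-query 5 -max-hsp-per-hit 2 -force-query-order 1000
cmd="$*"

# Print the command for review (hypothetical file names).
echo "$cmd"
```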

Sorting results by query IDs

Being a bank-to-bank sequence comparison tool, PLAST does not care about query ordering when producing results: query/hit matches are reported in no particular order. If you prefer PLAST to produce results as BLAST does, i.e. with hits sorted by query IDs, simply add the following argument to your PLAST command:

plast ... -force-query-order 1000 ...

Controlling speed

PLAST speed is of course controlled by the number of available cores:

	Argument	Description
	-a	Number of cores. Default is the maximum number of cores available on the computer running PLAST.

[Figure: PLAST speedup over Blast]

PLAST provides additional parameters to fine-tune the speed/quality ratio, as explained in the coming sections.

Optimizing quality/speed ratio using PLAST specific arguments

PLAST’s default configuration has been set up to provide an optimal ratio between speed and quality, in order to produce results with quality similar to Blast. Even in this configuration, you will get significant speedup factors.

Depending on your needs, you can further increase speed with little loss of quality in your results.

PLAST specific arguments for optimizing search jobs are:

	Argument	Description
	-seeds-use-ratio	Percentage of seeds to be used (see below). Range is 1..100, default is 100. Decrease the value to speed up the algorithm with little loss of quality.
	-s	Ungapped score threshold that triggers a small gapped extension (see below). Range is 25..127; default is 38 for protein-based and 55 for nucleic-based comparisons. Increase the value to speed up the algorithm with little loss of quality.
	-max-database-size	Maximum allowed size (in bytes) for a database. If a database is larger, it is segmented (see below).

Fine-tuning seed-ratio, threshold score and max-database-size may significantly accelerate the KLAST comparison engine, with little loss of quality in the results. Read the following sections carefully.

Optimizing PLAST: sample recipes

In order to tune PLAST correctly, we always invite our users to try the software with sample data sets. When you need to compare a large set of sequences, always start by comparing a small subset of your data. This way, you can check the parameters, the results and the speed of the software.

As an example, if you have to compare 300,000 sequences against NCBI nt, start your work by comparing 300 query sequences against NCBI nt using default PLASTn parameters. Then, fine tune it (see below the use of seed-ratio, max-database-size and threshold score) and check the results. As soon as your parameters are fine, go ahead with 3,000 and/or 10,000 query sequences, and check results and speed. If everything is fine, then run the full comparison.
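
One possible way to build such a subset is sketched below with standard awk. The helper name fasta_head and the file names are hypothetical; the sketch only assumes that each sequence starts with a ">" header line.

```shell
# fasta_head N FILE: print the first N sequences of a Fasta file.
fasta_head() {
    awk -v n="$1" '
        /^>/ { count++ }      # each header line starts a new sequence
        count > n { exit }    # stop once header number n+1 is reached
        { print }             # copy header and sequence lines
    ' "$2"
}

# Usage (hypothetical file and databank names):
#   fasta_head 300 all_queries.fasta > subset_300.fasta
#   plast -p plastn -i subset_300.fasta -d nt -o subset_300.out
```

The same helper can then be reused with 3,000 or 10,000 to produce the larger test sets.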

Optimizing PLAST at runtime: using seed-ratio

When using PLAST for protein-based sequence comparisons, the algorithm can be sped up using the seed-ratio parameter. The PLASTp algorithm relies on a finite table of seeds; there are about 6,200 seeds for the BLOSUM50 and BLOSUM62 matrices, whatever the input sequence databanks (for more information, see Reference [1]). During the comparison, PLAST orders seeds by number of occurrences and first processes the seeds producing the highest number of hits. It is therefore possible to ask PLAST to use either the entire set of seeds or only a subset. This is done with the seed-ratio parameter, ranging from 1% to 100%. The higher the seed-ratio, the higher the sensitivity; the lower the seed-ratio, the higher the speed, with little loss in quality, as illustrated in this example:

[Figure: Using seed ratio to speed up PLAST]

Reducing the number of seeds used during a comparison still provides high-quality results while dramatically reducing search time.

The seed-ratio parameter is available for PLASTp, PLASTx, tPLASTx and tPLASTn. When using PLAST from the command-line, use argument “-seeds-use-ratio value”, where “value” is in the range 1..100 (default is 100).
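
To compare several seed ratios on a small test set, a sweep such as the following sketch could be used. The ratios 100, 50 and 10, the helper name and the file names are arbitrary illustrations; the commands are printed so that they can be reviewed or timed one by one.

```shell
# seed_ratio_commands: print one plastp command per seed ratio to try.
seed_ratio_commands() {
    for ratio in 100 50 10; do
        echo "plast -p plastp -i subset.fasta -d reference.fasta -o results_ratio_${ratio}.out -seeds-use-ratio ${ratio}"
    done
}

seed_ratio_commands
```

Writing each run to its own results file makes it easy to compare the hits obtained with the full seed set against those obtained with a reduced one.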

Optimizing PLAST at runtime: using threshold score

A second way to fine-tune PLAST and speed up the search is to use the ‘threshold score’ parameter. During a search, PLAST computes a score for each ungapped sequence alignment matching a query and a hit. As soon as this score is above the threshold, that alignment is retained for further processing. By default, this ‘threshold score’ (‘-s’ argument) is set to a small value (38 for protein comparisons, 55 for nucleotide comparisons) to let PLAST be as sensitive as possible. However, if you suspect that your query sequences may be closely related to the reference databank, you can increase the ‘threshold score’: PLAST can still produce high-quality results, but with an additional speedup.

Use case on 16s RNA

As an example, when comparing 900 reads (500 nucleotides on average) against the Silva SSU databank (740,000 sequences) on an 8-core Intel Xeon based computer, search time was 73 hours using “-s 55”, but only 8 minutes using “-s 127”; results were the same in terms of quality, i.e. we got the same best hit for each query in both results.

Use case on protein and nucleotide comparisons

[Figure: Use threshold score to speed up PLAST]

Read these case studies about the optimal use of Threshold Score parameter

The threshold score parameter is available for PLASTn, PLASTp, PLASTx, tPLASTx and tPLASTn. When using PLAST from the command-line, use argument “-s value”, where “value” is in the range 25..127 (default is 38 and 55 for protein-based and nucleic-based comparisons, respectively).
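
Since the accepted range is 25..127, a wrapper script may want to check the value before launching a job. The sketch below shows one way to do so; check_threshold and the file names are hypothetical, and the command is only printed.

```shell
# check_threshold VALUE: succeed only if VALUE is an integer in 25..127.
check_threshold() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 25 ] && [ "$1" -le 127 ]
}

score=127
if check_threshold "$score"; then
    echo "plast -p plastn -i reads.fasta -d reference.fasta -o results.out -s $score"
else
    echo "threshold score must be in the range 25..127" >&2
fi
```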

Optimizing PLAST at runtime: using max-database-size

Another way to fine-tune PLAST, and again speed up the search, is to use the max-database-size parameter. It sets the number of bytes to reserve in RAM in order to load databank pages into memory. During the comparison of query vs. subject databanks, PLAST automatically paginates databanks if they do not fit entirely into RAM. For that purpose, PLAST relies on the max-database-size parameter; when setting that parameter, compare the amount of RAM in your computer with ‘max-database-size x 8 x 2’ (each databank index requires ‘8 x max-database-size’ bytes, and there are two databanks). For instance, when using PLAST on a 32 GB computer, increasing max-database-size from 20M (default value) to 100M may produce an additional speedup of 4x.

We advise you to set the max-database-size value so that the whole query databank can be loaded into memory. For instance, if your query file is 12 MB in size, set the max-database-size parameter to 15000000 (15 MB). Also, we do not recommend setting max-database-size to a value above 100000000 (100 MB); in such a case, if your query file is very big, let PLAST paginate the query, or run several PLAST jobs, each processing one partition of your query file.

The max-database-size parameter is available for PLASTn, PLASTp, PLASTx, tPLASTx and tPLASTn. When using PLAST from the command-line, use argument “-max-database-size value”, where “value” is a number of bytes (default is 5000000).
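
The ‘max-database-size x 8 x 2’ rule of thumb given above can be checked with simple shell arithmetic. The sketch below uses 100000000 as an example value; variable names are illustrative only.

```shell
max_db_size=100000000                    # value passed to -max-database-size (bytes)

# Each databank index needs 8 x max-database-size bytes, and there are two databanks.
index_bytes=$(( max_db_size * 8 * 2 ))
index_mb=$(( index_bytes / 1000000 ))

echo "estimated index memory: ${index_mb} MB"
```

The resulting figure is the one to compare with the amount of RAM available on the computer.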

Score matrix, gap costs and match/mismatch costs

PLAST arguments related to score matrix, gap costs and match/mismatch are presented in the following table.

	Argument	Description
	-m	matrix
	-G	cost to open a gap
	-E	cost to extend a gap
	-r	reward for a nucleotide match (plastn only)
	-q	penalty for a nucleotide mismatch (plastn only)

Argument -m (matrix) is only available for proteic-based comparison methods: plastp, plastx, tplastn and tplastx.

Arguments -r and -q (match/mismatch) are only available for plastn method.

Arguments -G and -E (gap costs) are available for all comparison methods: plastp, plastn, plastx, tplastn and tplastx.

Notice: in the following tables, (*) denotes the default value used by PLAST when you do not use a particular argument.

Proteic-based PLAST search

Valid score matrices and gap costs are as follows:

BLOSUM62

Gap open	Gap extend
11	2
11	1	(*)
9	2
8	2
7	2
6	2
12	1
11	1
10	1
9	1

BLOSUM50

Gap open	Gap extend
13	3
13	2	(*)
12	3
11	3
10	3
15	2
14	2
13	2
19	1
18	1
17	1
16	1

Nucleic-based PLAST search (plastn only)

Valid match/mismatch (arguments -r and -q) are as follows:

Match	Mismatch
1	-1
1	-2
1	-3
1	-4
2	-3	(*)
4	-5

Given match/mismatch of 1,-1, valid gap costs are as follows:

Gap open	Gap extend
3	2	(*)
2	2
1	2
0	2
4	1
3	1
2	1

Given match/mismatch of 1,-2, valid gap costs are as follows:

Gap open	Gap extend
5	2	(*)
2	2
1	2
0	2
3	1
2	1
1	1

Given match/mismatch of 1,-3, valid gap costs are as follows:

Gap open	Gap extend
5	2	(*)
2	2
1	2
0	2
2	1
1	1

Given match/mismatch of 1,-4, valid gap costs are as follows:

Gap open	Gap extend
5	2	(*)
1	2
0	2
2	1
1	1

Given match/mismatch of 2,-3, valid gap costs are as follows:

Gap open	Gap extend
4	4
2	4
0	4
3	3
6	2
5	2	(*)
4	2
2	2

Given match/mismatch of 4,-5, valid gap costs are as follows:

Gap open	Gap extend
12	8	(*)
6	5
5	5
4	5
3	5
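
As an illustration of the plastn scoring arguments, the sketch below checks a match/mismatch pair against the table above before assembling a command. The helper name and file names are hypothetical, and passing the mismatch penalty as a negative number to -q is an assumption based on how the values are written in the table.

```shell
# is_valid_match_mismatch R Q: succeed if the pair appears in the plastn table.
is_valid_match_mismatch() {
    case "$1,$2" in
        1,-1|1,-2|1,-3|1,-4|2,-3|4,-5) return 0 ;;
        *) return 1 ;;
    esac
}

match=1; mismatch=-3        # pair listed in the match/mismatch table
gap_open=5; gap_extend=2    # gap costs marked (*) for 1,-3

if is_valid_match_mismatch "$match" "$mismatch"; then
    cmd="plast -p plastn -i reads.fasta -d reference.fasta -o results.out -r $match -q $mismatch -G $gap_open -E $gap_extend"
    echo "$cmd"
else
    echo "unsupported match/mismatch pair: $match,$mismatch" >&2
fi
```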

Monitoring job

PLAST enables you to monitor job execution:

plast ... -bargraph

When using argument -bargraph, PLAST displays a progression bar such as this one:

plastp [1/1] 100.0% align=16960 time [00:00:08 - 00:00:00 - 00:00:08] mem=298.7Mo (max=298.7Mo tot=0.3Go) seeds [5082:5082] [====================] 100%

Several pieces of information are provided, as follows:

Element	Description
plastp	Name of comparison method
[1/1]	Pagination of reference databank. If reference databank fits in RAM at once, a single page is used to compare query and reference banks.
100.0%	Progression of execution
align=16960	Number of matches found
time […]	Three execution times are provided: elapsed, remaining and total. It is worth noting that “remaining time” is always a predicted value.
mem (…)	Memory usages: currently used, maximum (peak) and cumulative (grand total over pagination of reference bank)
seeds […]	Number of seeds processed, in the form [current seeds:total seeds]
[====….]x%	A text-based progression bar
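
If the progression line is captured in a log, individual fields can be extracted with standard tools. The sketch below applies sed to the sample line shown above; how the line is captured from a running job is left out and would depend on your setup.

```shell
line='plastp [1/1] 100.0% align=16960 time [00:00:08 - 00:00:00 - 00:00:08] mem=298.7Mo (max=298.7Mo tot=0.3Go) seeds [5082:5082] [====================] 100%'

# Number of matches: the digits following "align=".
aligns=$(printf '%s\n' "$line" | sed -n 's/.*align=\([0-9][0-9]*\).*/\1/p')

# Processed and total seeds, from "seeds [current:total]".
seeds_done=$(printf '%s\n' "$line" | sed -n 's/.*seeds \[\([0-9]*\):\([0-9]*\)].*/\1/p')
seeds_total=$(printf '%s\n' "$line" | sed -n 's/.*seeds \[\([0-9]*\):\([0-9]*\)].*/\2/p')

echo "matches=$aligns seeds=$seeds_done/$seeds_total"
```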
