Running jobs with BLAST using the Lancium CLI
This tutorial shows the basics of running a BLAST search on the Lancium Compute Infrastructure in a few different ways. For simplicity's sake, it assumes that the sequence databases have already been processed with makeblastdb
and packaged into .tar.gz files containing the database files.
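If you need to build such archives yourself, the packaging might look like the following sketch (file names match this tutorial; makeblastdb ships with NCBI BLAST+, and -dbtype prot generates the .phr, .pin, and .psq index files):
$ mkdir human_db && mv human.1.protein.faa human_db/
$ makeblastdb -in human_db/human.1.protein.faa -dbtype prot   # writes .phr/.pin/.psq next to the FASTA
$ tar czf human_db.tar.gz human_db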
$ ls
cow_db_small.tar.gz human_db.tar.gz
$ tar ztf cow_db_small.tar.gz
cow_db_small/cow.1000.protein.faa
cow_db_small/cow.1000.protein.faa.phr
cow_db_small/cow.1000.protein.faa.pin
cow_db_small/cow.1000.protein.faa.psq
$ tar ztf human_db.tar.gz
human_db/human.1.protein.faa
human_db/human.1.protein.faa.phr
human_db/human.1.protein.faa.pin
human_db/human.1.protein.faa.psq
Prerequisites
- A copy of the CLI binary downloaded into the user's $PATH and given executable permissions
- An API key generated from a user account on the Lancium Compute portal
Authentication
The user’s API key can be passed to the CLI either via an environment variable or as part of the command line.
Environment Variable
LANCIUM_API_KEY=<API_KEY>
Command Line
--api-key <API_KEY>
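For example, both of the following invocations authenticate the same request (a sketch; <API_KEY> is a placeholder, and flag placement may vary slightly between CLI versions):
$ export LANCIUM_API_KEY=<API_KEY>
$ lcli job show 323

$ lcli job show 323 --api-key <API_KEY>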
Running a BLAST job with a single command
The quickest way to get a new job running is to specify all of the necessary job parameters as part of a single call to the CLI.
$ lcli job run --name "BLAST Tutorial job" \
--command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
--image lancium/blast --cores 4 --mem 8 \
--input-file cow_db_small.tar.gz \
--input-file human_db.tar.gz
{
"id": 323,
"name": "BLAST Tutorial job",
"status": "created",
"qos": 100,
"command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
"image": "lancium/blast",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200
}
Uploading input data... |████████████████████████████████████| 100%
The lcli job run command takes care of creating a new job, uploading any input files, and submitting the job to the Lancium Compute grid in one step. Input files that are archives (currently .tar.gz and .zip are supported) are automatically expanded in the job's working directory. With the job running, we can use the CLI to check its status.
$ lcli job show 323
{
"id": 323,
"name": "BLAST Tutorial job",
"status": "running",
"qos": 100,
"command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
"image": "lancium/blast",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"input_files": [
{
"id": 1086,
"name": "cow_db_small.tar.gz",
"source_type": "file",
"source": "cow_db_small.tar.gz",
"cache": false
},
{
"id": 1087,
"name": "human_db.tar.gz",
"source_type": "file",
"source": "human_db.tar.gz",
"cache": false
}
]
}
When the job has completed, we can see that there are now two output files available. All Lancium Compute jobs return standard output and standard error automatically. Additional files created by the job can be returned by specifying their file names with the --output flag.
$ lcli job show 323
{
"id": 323,
"name": "BLAST Tutorial job",
"status": "finished",
"qos": 100,
"command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
"image": "lancium/blast",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"input_files": [
{
"id": 1086,
"name": "cow_db_small.tar.gz",
"source_type": "file",
"source": "cow_db_small.tar.gz",
"cache": false
},
{
"id": 1087,
"name": "human_db.tar.gz",
"source_type": "file",
"source": "human_db.tar.gz",
"cache": false
}
],
"output_files": [
{
"name": "stderr.txt",
"size": 61,
"available": true
},
{
"name": "stdout.txt",
"size": 260835,
"available": true
}
]
}
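The stdout.txt and stderr.txt files above are captured for every job. To also bring back a file the job writes itself, one approach is to have blastp write its report with its -out option and name that file via --output at submission time, as sketched below (this assumes --output is accepted by lcli job run alongside the flags shown earlier):
$ lcli job run --name "BLAST Tutorial job (file output)" \
  --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4 -out results.out" \
  --image lancium/blast --cores 4 --mem 8 \
  --input-file cow_db_small.tar.gz --input-file human_db.tar.gz \
  --output results.out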
Output files returned from a job can be viewed or downloaded using the lcli job output command.
$ lcli job output get --view --file stdout.txt 323 | head -50
BLASTP 2.9.0+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.
Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.
Database: human.1.protein.faa
12,098 sequences; 7,911,674 total letters
Query= NP_001193503.1 E3 ubiquitin-protein ligase TRIM56 [Bos taurus]
Length=755
Score E
Sequences producing significant alignments: (Bits) Value
NP_003843.3 transcription intermediary factor 1-alpha isoform b [... 78.6 6e-15
>NP_003843.3 transcription intermediary factor 1-alpha isoform
b [Homo sapiens]
Length=1016
Score = 78.6 bits (192), Expect = 6e-15, Method: Compositional matrix adjust.
Identities = 74/359 (21%), Positives = 134/359 (37%), Gaps = 67/359 (19%)
Query 20 ACKICLEQL--RVPKTLPCLHTYCQDCL--------------------------AQLAEG 51
C +C + + R PK LPCLH++CQ CL G
Sbjct 55 TCAVCHQNIQSRAPKLLPCLHSFCQRCLPAPQRYLMLPAPMLGSAETPPPVPAPGSPVSG 114
Query 52 SR--------LRCPECRESVPVPPAGVAAFKTNFFVNGLLDLVKARAGGDLRAGKPACAL 103
S +RCP C + NFFV ++ + C
Sbjct 115 SSPFATQVGVIRCPVCSQE-----CAERHIIDNFFVKDTTEV----PSSTVEKSNQVCTS 165
Query 104 CPLMGGASAGGPATARCLDCADDLCQACADGHRCTRQTHSHRV-------VDLVGYRAGW 156
C A A G C++C + LC+ C H+ + T H V + VG +
Sbjct 166 C--EDNAEANG----FCVECVEWLCKTCIRAHQRVKFTKDHTVRQKEEVSPEAVGVTS-- 217
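Because --view streams the file contents to standard output (as the head pipeline above shows), a local copy can also be captured with an ordinary shell redirect:
$ lcli job output get --view --file stdout.txt 323 > stdout.txt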
Incrementally creating a job
In addition to creating and executing a job with a single command, the lcli job subcommands can be used to build up a new job across multiple invocations before submitting it. The following commands create a job identical to the one in the previous example (a scripted version of the same flow appears after this transcript). Some of the commands use the -q flag to suppress output of the current job state.
$ lcli job create
{
"id": 324,
"status": "created",
"qos": 100,
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200
}
$ lcli job update --name "BLAST Tutorial job 2" \
--command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
--image lancium/blast --cores 4 --mem 8 324
{
"id": 324,
"name": "BLAST Tutorial job 2",
"status": "created",
"qos": 100,
"command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
"image": "lancium/blast",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2020-08-20T18:40:11.643Z",
"updated_at": "2020-08-20T18:42:41.993Z"
}
$ lcli job input add -q --file cow_db_small.tar.gz 324
$ lcli job input add -q --file human_db.tar.gz 324
$ lcli job submit 324
$ lcli job show 324
{
"id": 324,
"name": "BLAST Tutorial job 2",
"status": "running",
"qos": 100,
"command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
"image": "lancium/blast",
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"input_files": [
{
"id": 1088,
"name": "cow_db_small.tar.gz",
"source_type": "file",
"source": "cow_db_small.tar.gz",
"cache": false
},
{
"id": 1089,
"name": "human_db.tar.gz",
"source_type": "file",
"source": "human_db.tar.gz",
"cache": false
}
]
}
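Because every command prints the job state as JSON, this flow also scripts cleanly. A minimal sketch, assuming jq is available and that -q is accepted by lcli job update as well as lcli job input add:
$ JOB_ID=$(lcli job create | jq -r '.id')
$ lcli job update -q --name "BLAST Tutorial job 2" \
  --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
  --image lancium/blast --cores 4 --mem 8 $JOB_ID
$ lcli job input add -q --file cow_db_small.tar.gz $JOB_ID
$ lcli job input add -q --file human_db.tar.gz $JOB_ID
$ lcli job submit $JOB_ID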
Persistently storing data in Lancium Compute storage
Uploading input files as part of the job creation process works for smaller data sets, but for most purposes it makes more sense to upload a data set into Lancium Compute's data storage area in advance. Uploaded data can be referenced by multiple jobs, saving bandwidth and time.
Each user has a separate data storage area with path semantics similar to a standard file system.
$ lcli data show /
[
{
"name": "tensorflow",
"is_directory": true,
"size": null,
"last_modified": null,
"created": null
},
{
"name": "vgg16_weights.npz",
"is_directory": false,
"size": "553436134",
"last_modified": "2020-08-20T19:09:43.378+00:00",
"created": "2020-07-16T01:20:31.000+00:00"
}
]
$ lcli data makedir blast
$ lcli data show /
[
{
"name": "blast",
"is_directory": true,
"size": null,
"last_modified": null,
"created": null
},
{
"name": "tensorflow",
"is_directory": true,
"size": null,
"last_modified": null,
"created": null
},
{
"name": "vgg16_weights.npz",
"is_directory": false,
"size": "553436134",
"last_modified": "2020-08-20T19:29:48.827+00:00",
"created": "2020-07-16T01:20:31.000+00:00"
}
]
$ lcli data add --file cow_db_small.tar.gz /blast/cow_db_small.tar.gz
$ lcli data add --file human_db.tar.gz /blast/human_db.tar.gz
$ lcli data show /blast
[
{
"name": "cow_db_small.tar.gz",
"is_directory": false,
"size": "89432",
"last_modified": "2020-08-20T19:32:24.028+00:00",
"created": "2020-08-20T19:31:39.000+00:00"
},
{
"name": "human_db.tar.gz",
"is_directory": false,
"size": "7770227",
"last_modified": "2020-08-20T19:32:24.070+00:00",
"created": "2020-08-20T19:32:07.000+00:00"
}
]
Once a data set is uploaded to Lancium Compute storage, it can be added to a job definition as an input file by specifying the storage path.
$ lcli job create
{
"id": 325,
"status": "created",
"qos": 100,
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200
}
$ lcli job input add --data /blast/cow_db_small.tar.gz 325
$ lcli job input add --data /blast/human_db.tar.gz 325
$ lcli job show 325
{
"id": 325,
"status": "created",
"qos": 100,
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"input_files": [
{
"id": 1090,
"name": "cow_db_small.tar.gz",
"source_type": "data",
"source": "blast/cow_db_small.tar.gz",
"cache": false
},
{
"id": 1091,
"name": "human_db.tar.gz",
"source_type": "data",
"source": "blast/human_db.tar.gz",
"cache": false
}
]
}
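The input files now report "source_type": "data" and point at storage paths rather than local uploads. From here, job 325 is finished off with the same update and submit commands used earlier (the job name below is illustrative):
$ lcli job update --name "BLAST Tutorial job 3" \
  --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
  --image lancium/blast --cores 4 --mem 8 325
$ lcli job submit 325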