Running jobs with BLAST using the Lancium CLI

This tutorial shows the basics of running a BLAST search on the Lancium Compute infrastructure in a few different ways. For simplicity’s sake, it assumes that the sequence databases have already been formatted with makeblastdb and packaged into .tar.gz archives containing the database files.

$ ls

cow_db_small.tar.gz  human_db.tar.gz

$ tar ztf cow_db_small.tar.gz

cow.1000.protein.faa
cow.1000.protein.faa.phr
cow.1000.protein.faa.pin
cow.1000.protein.faa.psq

$ tar ztf human_db.tar.gz

human.1.protein.faa
human.1.protein.faa.phr
human.1.protein.faa.pin
human.1.protein.faa.psq
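
For reference, packaging a freshly formatted database into such an archive can be sketched as follows. The touch commands create placeholder files standing in for real makeblastdb output (in practice, makeblastdb -in human.1.protein.faa -dbtype prot writes the .phr/.pin/.psq index files alongside the FASTA):

```shell
# Placeholder files standing in for real makeblastdb output:
touch human.1.protein.faa human.1.protein.faa.phr \
      human.1.protein.faa.pin human.1.protein.faa.psq

# Package the database files into a single archive for upload:
tar czf human_db.tar.gz human.1.protein.faa \
    human.1.protein.faa.phr human.1.protein.faa.pin human.1.protein.faa.psq

# Verify the archive contents before uploading:
tar ztf human_db.tar.gz
```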

Prerequisites

Authentication

The user’s API key can be passed to the CLI either via an environment variable or as part of the command line.

Environment Variable

export LANCIUM_API_KEY=<API_KEY>

Command Line

--api-key <API_KEY>
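
Exporting the variable makes the key visible to every lcli invocation for the rest of the session; a minimal sketch (the key value here is a placeholder):

```shell
# Export once per shell session; child processes such as lcli
# inherit the variable (the value here is a placeholder):
export LANCIUM_API_KEY="my-example-key"

# Any child process now sees the key:
sh -c 'echo "$LANCIUM_API_KEY"'
```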

Running a BLAST job with a single command

The quickest way to get a new job running is to specify all of the necessary job parameters as part of a single call to the CLI.

$ lcli job run --name "BLAST Tutorial job" \
       --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
       --image lancium/blast --cores 4 --mem 8 \
       --input-file cow_db_small.tar.gz \
       --input-file human_db.tar.gz

{
    "id": 323,
    "name": "BLAST Tutorial job",
    "status": "created",
    "qos": 100,
    "command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
    "image": "lancium/blast",
    "resources": {
        "core_count": 4,
        "gpu_count": null,
        "memory": 8,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200
}
Uploading input data... |████████████████████████████████████| 100%

The lcli job run command takes care of creating a new job, uploading any input files, and submitting the job to the Lancium Compute grid in one step. Input files that are archives (currently .tar.gz and .zip are supported) are automatically expanded in the job’s working directory. With the job running, we can use the CLI to check its status.

$ lcli job show 323
{
    "id": 323,
    "name": "BLAST Tutorial job",
    "status": "running",
    "qos": 100,
    "command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
    "image": "lancium/blast",
    "resources": {
        "core_count": 4,
        "gpu_count": null,
        "memory": 8,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200,
    "input_files": [
        {
            "id": 1086,
            "name": "cow_db_small.tar.gz",
            "source_type": "file",
            "source": "cow_db_small.tar.gz",
            "cache": false
        },
        {
            "id": 1087,
            "name": "human_db.tar.gz",
            "source_type": "file",
            "source": "human_db.tar.gz",
            "cache": false
        }
    ]
}

When the job has completed, we can see that two output files are now available. All Lancium Compute jobs return standard output and standard error automatically; additional files created by the job can be returned by specifying their file names with the --output flag.

$ lcli job show 323
{
    "id": 323,
    "name": "BLAST Tutorial job",
    "status": "finished",
    "qos": 100,
    "command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
    "image": "lancium/blast",
    "resources": {
        "core_count": 4,
        "gpu_count": null,
        "memory": 8,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200,
    "input_files": [
        {
            "id": 1086,
            "name": "cow_db_small.tar.gz",
            "source_type": "file",
            "source": "cow_db_small.tar.gz",
            "cache": false
        },
        {
            "id": 1087,
            "name": "human_db.tar.gz",
            "source_type": "file",
            "source": "human_db.tar.gz",
            "cache": false
        }
    ],
    "output_files": [
        {
            "name": "stderr.txt",
            "size": 61,
            "available": true
        },
        {
            "name": "stdout.txt",
            "size": 260835,
            "available": true
        }
    ]
}
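
Because the CLI emits JSON, job state is also easy to check from scripts. A minimal sketch, using a trimmed copy of the response above reproduced inline so the snippet is self-contained (in practice you would save the output of lcli job show directly; python3 is assumed to be available, and jq works equally well):

```shell
# A trimmed copy of the `lcli job show` response, inlined here;
# in practice: lcli job show 323 > job.json
cat > job.json <<'EOF'
{"id": 323, "status": "finished",
 "output_files": [{"name": "stderr.txt"}, {"name": "stdout.txt"}]}
EOF

# Extract the status field (prints: finished):
python3 -c 'import json; print(json.load(open("job.json"))["status"])'

# List the available output files:
python3 -c 'import json
for f in json.load(open("job.json"))["output_files"]:
    print(f["name"])'
```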

Output files returned from a job can be viewed or downloaded with the lcli job output get command.

$ lcli job output get --view --file stdout.txt 323 | head -50
BLASTP 2.9.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: human.1.protein.faa
           12,098 sequences; 7,911,674 total letters



Query= NP_001193503.1 E3 ubiquitin-protein ligase TRIM56 [Bos taurus]

Length=755
                                                                      Score        E
Sequences producing significant alignments:                          (Bits)     Value

NP_003843.3 transcription intermediary factor 1-alpha isoform b [...  78.6       6e-15


>NP_003843.3 transcription intermediary factor 1-alpha isoform
b [Homo sapiens]
Length=1016

 Score = 78.6 bits (192),  Expect = 6e-15, Method: Compositional matrix adjust.
 Identities = 74/359 (21%), Positives = 134/359 (37%), Gaps = 67/359 (19%)

Query  20   ACKICLEQL--RVPKTLPCLHTYCQDCL--------------------------AQLAEG  51
             C +C + +  R PK LPCLH++CQ CL                               G
Sbjct  55   TCAVCHQNIQSRAPKLLPCLHSFCQRCLPAPQRYLMLPAPMLGSAETPPPVPAPGSPVSG  114

Query  52   SR--------LRCPECRESVPVPPAGVAAFKTNFFVNGLLDLVKARAGGDLRAGKPACAL  103
            S         +RCP C +              NFFV    ++        +      C
Sbjct  115  SSPFATQVGVIRCPVCSQE-----CAERHIIDNFFVKDTTEV----PSSTVEKSNQVCTS  165

Query  104  CPLMGGASAGGPATARCLDCADDLCQACADGHRCTRQTHSHRV-------VDLVGYRAGW  156
            C     A A G     C++C + LC+ C   H+  + T  H V        + VG  +
Sbjct  166  C--EDNAEANG----FCVECVEWLCKTCIRAHQRVKFTKDHTVRQKEEVSPEAVGVTS--  217

Incrementally creating a job

In addition to creating and executing a job with a single command, the lcli job subcommands can be used to build up a new job over multiple invocations before submitting it. The following commands create a job identical to the one in the previous example. Some of the commands use the -q flag to suppress printing the current job state.

$ lcli job create
{
    "id": 324,
    "status": "created",
    "qos": 100,
    "resources": {
        "core_count": 2,
        "gpu_count": null,
        "memory": 4,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200
}

$ lcli job update --name "BLAST Tutorial job 2" \
       --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
       --image lancium/blast --cores 4 --mem 8 324
{
    "id": 324,
    "name": "BLAST Tutorial job 2",
    "status": "created",
    "qos": 100,
    "command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
    "image": "lancium/blast",
    "resources": {
        "core_count": 4,
        "gpu_count": null,
        "memory": 8,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200,
    "created_at": "2020-08-20T18:40:11.643Z",
    "updated_at": "2020-08-20T18:42:41.993Z"
}

$ lcli job input add -q --file cow_db_small.tar.gz 324

$ lcli job input add -q --file human_db.tar.gz 324

$ lcli job submit 324

$ lcli job show 324
{
    "id": 324,
    "name": "BLAST Tutorial job 2",
    "status": "running",
    "qos": 100,
    "command_line": "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4",
    "image": "lancium/blast",
    "resources": {
        "core_count": 4,
        "gpu_count": null,
        "memory": 8,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200,
    "input_files": [
        {
            "id": 1088,
            "name": "cow_db_small.tar.gz",
            "source_type": "file",
            "source": "cow_db_small.tar.gz",
            "cache": false
        },
        {
            "id": 1089,
            "name": "human_db.tar.gz",
            "source_type": "file",
            "source": "human_db.tar.gz",
            "cache": false
        }
    ]
}

Persistently storing data in Lancium Compute storage

Uploading input files as part of the job creation process works for smaller data sets, but for most purposes it makes more sense to upload a data set into Lancium Compute’s data storage area in advance. Uploaded data can be referenced by multiple jobs, saving bandwidth and time.

Each user has a separate data storage area with path semantics similar to a standard file system.

$ lcli data show /
[
    {
        "name": "tensorflow",
        "is_directory": true,
        "size": null,
        "last_modified": null,
        "created": null
    },
    {
        "name": "vgg16_weights.npz",
        "is_directory": false,
        "size": "553436134",
        "last_modified": "2020-08-20T19:09:43.378+00:00",
        "created": "2020-07-16T01:20:31.000+00:00"
    }
]

$ lcli data makedir blast

$ lcli data show /
[
    {
        "name": "blast",
        "is_directory": true,
        "size": null,
        "last_modified": null,
        "created": null
    },
    {
        "name": "tensorflow",
        "is_directory": true,
        "size": null,
        "last_modified": null,
        "created": null
    },
    {
        "name": "vgg16_weights.npz",
        "is_directory": false,
        "size": "553436134",
        "last_modified": "2020-08-20T19:29:48.827+00:00",
        "created": "2020-07-16T01:20:31.000+00:00"
    }
]

$ lcli data add --file cow_db_small.tar.gz /blast/cow_db_small.tar.gz

$ lcli data add --file human_db.tar.gz /blast/human_db.tar.gz

$ lcli data show /blast
[
    {
        "name": "cow_db_small.tar.gz",
        "is_directory": false,
        "size": "89432",
        "last_modified": "2020-08-20T19:32:24.028+00:00",
        "created": "2020-08-20T19:31:39.000+00:00"
    },
    {
        "name": "human_db.tar.gz",
        "is_directory": false,
        "size": "7770227",
        "last_modified": "2020-08-20T19:32:24.070+00:00",
        "created": "2020-08-20T19:32:07.000+00:00"
    }
]

Once a data set is uploaded to Lancium Compute storage, it can be added to a job definition as an input file by specifying the storage path.

$ lcli job create
{
    "id": 325,
    "status": "created",
    "qos": 100,
    "resources": {
        "core_count": 2,
        "gpu_count": null,
        "memory": 4,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200
}

$ lcli job input add --data /blast/cow_db_small.tar.gz 325

$ lcli job input add --data /blast/human_db.tar.gz 325

$ lcli job show 325
{
    "id": 325,
    "status": "created",
    "qos": 100,
    "resources": {
        "core_count": 2,
        "gpu_count": null,
        "memory": 4,
        "gpu": null,
        "scratch": null
    },
    "max_run_time": 259200,
    "input_files": [
        {
            "id": 1090,
            "name": "cow_db_small.tar.gz",
            "source_type": "data",
            "source": "blast/cow_db_small.tar.gz",
            "cache": false
        },
        {
            "id": 1091,
            "name": "human_db.tar.gz",
            "source_type": "data",
            "source": "blast/human_db.tar.gz",
            "cache": false
        }
    ]
}
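
From here, job 325 can be finished off exactly as in the previous section: set the command, image, and resources with lcli job update, then launch it with lcli job submit. A sketch assembled from the commands shown earlier (the job name is illustrative):

```shell
$ lcli job update --name "BLAST Tutorial job 3" \
       --command "blastp -query cow_db_small/cow.1000.protein.faa -db human_db/human.1.protein.faa -evalue 1e-5 -max_target_seqs 1 -num_threads 4" \
       --image lancium/blast --cores 4 --mem 8 325

$ lcli job submit 325
```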