Lesson 2: Running Jobs
Starting New Jobs
There are four job commands for creating, specifying, and submitting Compute jobs:
- lcli job run <job_specs>
  The run command creates a new job with the given specifications and immediately submits it to the grid for execution.
- lcli job create <job_specs>
  The create command creates a new job but leaves it in an unsubmitted, editable state.
- lcli job update <job_id> <job_specs>
  The update command alters the job specs of an unsubmitted job. Once a job is submitted to the grid for execution, this command can no longer be used to update it.
- lcli job submit <job_id>
  The submit command schedules the job for execution in the Lancium Compute grid.
The lcli job run command lets you create, specify, and run a job in a single command. To piece a job together incrementally instead, create the job (with any of the specifications or flags listed below), update it with any additional fields it needs, and then submit it to be run.
All of the above commands except submit return a copy of the job information in their output. submit simply returns successfully if the job is accepted by the grid, or returns a non-zero exit code and an error message if there was a problem submitting the job. Among the pieces of information returned for a new job is the id field. This job id is used to reference the job in other job commands.
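As a rough illustration of the incremental workflow, the Python sketch below reads the id field from the JSON printed by a create call and builds the follow-up update and submit command lines. It assumes the command output is a single JSON object, as in the examples in this lesson. The helper names (extract_job_id, followup_commands) are hypothetical and are not part of lcli.

```python
import json
import shlex


def extract_job_id(cli_output):
    """Return the "id" field from the JSON printed by a job command."""
    job = json.loads(cli_output)
    return job["id"]


def followup_commands(job_id, update_flags):
    """Build the update and submit argument lists for an unsubmitted job."""
    return [
        ["lcli", "job", "update", str(job_id), *update_flags],
        ["lcli", "job", "submit", str(job_id)],
    ]


# Sample output shaped like the `lcli job create` examples in this lesson.
sample_output = '{"id": 12649, "status": "created", "qos": "high"}'

job_id = extract_job_id(sample_output)
commands = followup_commands(job_id, ["--name", "example job", "--cores", "4"])

# Print shell-quoted command lines; nothing is executed here.
for argv in commands:
    print(shlex.join(argv))
```

The sketch only prints the commands. Running them (for example with subprocess.run) would require lcli to be installed and configured.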
Specifications for the required job execution environment are passed to job commands via a number of command-line flags:
- --spec <json_file>
  Parses the specified JSON file for job specifications. If a specification file is given along with any of the individual flags listed below, the values specified with flags override the values in the file. Any job requirement that can be specified via a command-line flag can also be set in a job specification file; the two do not both need to be present. As an example, the following job spec includes all the configurable fields:
{
"name": "string",
"notes": "string",
"account": "string",
"qos": "string",
"image": "string",
"command_line": "string",
"expected_run_time": integer,
"max_run_time": integer,
"callback_url": "string",
"resources": {
"core_count": integer,
"gpu_count": integer,
"memory": integer,
"gpu": "string",
"scratch": integer
},
"input_files": [
{
"source_type": "file",
"source": "string",
"cache": boolean,
"name": "string"
}
],
"output_files": [
{
"name": "string"
}
],
"environment": [
{
"value": "string",
"variable": "string"
}
]
}
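The precedence rule described above (flag values override values from the spec file) can be pictured with a small Python sketch. This is only an illustration of the stated rule, not the CLI's actual implementation. The function name apply_flag_overrides is hypothetical, and merging nested sections such as "resources" field by field is an assumption.

```python
import copy


def apply_flag_overrides(file_spec, flag_values):
    """Return a new spec in which flag values take precedence over the file."""
    merged = copy.deepcopy(file_spec)
    for key, value in flag_values.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: nested sections are merged field by field.
            merged[key] = apply_flag_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Values as they might appear in a spec file.
file_spec = {
    "name": "List job working directory",
    "command_line": "ls",
    "image": "lancium/ubuntu",
    "resources": {"core_count": 4, "memory": 8},
}

# Roughly what passing `--name "Bigger job" --cores 8` alongside --spec expresses.
flag_values = {"name": "Bigger job", "resources": {"core_count": 8}}

effective = apply_flag_overrides(file_spec, flag_values)
print(effective)
```

Fields that no flag touches ("command_line", "image", and "memory" here) keep the values from the file.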
- --name <string> / -n <string>
  Assigns a job name for display purposes.
- --notes <text> / --description <text> / -d <text>
  Attaches free-form detailed job information.
- --account <string>
  Provides an internal billing or project reference for the job. It is included in the invoice line item for the job.
- --qos <string>
  The QOS priority for this job. Lower QOS values will result in discounts from the standard core-hour cost. The current QOS tiers are:
  - high: Jobs at this QOS will be scheduled and running at least 90% of the time during any particular period.
  - medium: Jobs at this QOS will be scheduled and running approximately 50% of the time during any particular period.
  - low: Jobs at this QOS will be scheduled and running approximately 25% of the time during any particular period.
  - best_effort: Jobs at this QOS will be scheduled and running only when there are free resources available that can’t be allocated to jobs at a higher QOS.
  Currently, all jobs run at high QOS regardless of this setting.
- --command <string>
  The command line to execute within the chosen Singularity image. The command line should either call a binary that exists within the image or execute a script that was included as input data. Shell built-ins and commands chained with |, ;, or && will not execute properly and should be run from within a script.
- --image <string>
  The path to the requested customer-provided or Lancium-provided Singularity image.
- --max-run-time <integer>
  The maximum amount of time, in seconds, to allow the job to run before it is automatically terminated.
- --expected-run-time <integer>
  The amount of time in which this job is generally expected to run. In the future, accurate estimates of run time will result in discounts from the standard core-hour cost.
- --cores <integer> / --core-count <integer>
  The number of vCPUs to allocate to the job. vCPUs are always allocated in pairs, so odd numbers will be rounded up to the next multiple of two. In addition, if more than 4GB per two vCPUs is requested via the --ram flag, the number of vCPUs will be increased to maintain a 4:2 ratio between memory (in GB) and vCPUs.
- --mem <integer> / --ram <integer>
  The amount of RAM, in GB, to allocate to the job. If less than 4GB per two vCPUs is requested, the amount of memory will be increased to maintain a 4:2 ratio between memory and vCPUs.
- --gpu <string> / --gpu-type <string>
  The GPU type required for the job.
- --gpus <integer> / --gpu-count <integer>
  The number of GPUs to allocate to the job.
- --scratch <integer> / --disk <integer>
  The amount of scratch disk space to provide for the job. Currently ignored.
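The vCPU and memory adjustment rules can be worked through with a short Python sketch. It is a literal reading of the two rules stated for the core and memory flags (vCPUs in pairs, 4 GB per pair), not the scheduler's actual code, and the function name normalize_resources is hypothetical. The service may apply the adjustment at a different stage: the update example later in this lesson reports core_count 4 with memory 4 in the job record.

```python
import math


def normalize_resources(cores, mem_gb):
    """Apply the documented pairing and 4 GB : 2 vCPU rules to a request."""
    # vCPUs are allocated in pairs, so count the pairs the core request needs.
    pairs_for_cores = math.ceil(cores / 2)
    # Each pair carries 4 GB, so count the pairs the memory request needs.
    pairs_for_memory = math.ceil(mem_gb / 4)
    # Whichever request is larger drives the allocation (at least one pair).
    pairs = max(pairs_for_cores, pairs_for_memory, 1)
    return 2 * pairs, 4 * pairs


# (requested cores, requested GB) -> (allocated vCPUs, allocated GB)
for request in [(3, 4), (2, 9), (4, 8)]:
    print(request, "->", normalize_resources(*request))
```

Under this reading, a request for 3 cores and 4 GB becomes 4 vCPUs with 8 GB, and a request for 2 cores and 9 GB becomes 6 vCPUs with 12 GB.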
The following command-line flags can be included multiple times in a single command:
- --input-file <string>
  Specifies the path to a file on the local file system that should be uploaded and made available to the job. If an archive file (.tar.gz or .zip) is specified, it will be automatically expanded inside a folder in the job working directory (JWD) before job execution.
- --input-file-cached <string>
  Specifies the path to a file on the local file system that should be uploaded and staged locally (but read-only) to the job’s working directory. Currently treated identically to standard input files.
- --input-data <string>
  Specifies the path to a file on Lancium’s Persistent Data Service that should be made available to the job.
- --input-data-cached <string>
  Specifies the path to a file on Lancium’s Persistent Data Service that should be made locally available (but read-only) to the job’s working directory. Currently treated identically to standard input data.
- --input-url <string>
  Specifies the URL of a file that should be downloaded and made available to the job.
- --input-url-cached <string>
  Specifies the URL of a file that should be downloaded and staged locally (but read-only) to the job’s working directory. Currently treated identically to standard input URLs.
- -o <string> / --output <string> / -o <string>:<path> / --output <string>:<path>
  Specifies a file name that is expected to exist at the end of the job. If found, the file will be copied out of the JWD and made available to the user; if a path to persistent storage is provided, the file will be saved there.
- -e <string>=<string> / --env <string>=<string>
  Specifies environment variables for the job.
- --duplicate <int>
  The number of times to duplicate this job.
- --mpi
  Flag to run an MPI job.
- --mpi_version
  The version of MPI that an MPI job should run on.
- --tasks
  The number of MPI tasks.
- --tasks-per-node
  The number of MPI tasks to run per node.
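Because the input, output, and environment flags may be repeated, a command line can be assembled from plain lists. The Python sketch below shows one way to do that. The helper name repeated_flags and the file names are hypothetical; only the flag spellings are taken from the list above.

```python
import shlex


def repeated_flags(input_files, outputs, env):
    """Expand lists of inputs, outputs, and env vars into repeated flags."""
    args = []
    for path in input_files:
        args += ["--input-file", path]
    for name in outputs:
        args += ["--output", name]
    for variable, value in env.items():
        # --env takes a single VARIABLE=value argument.
        args += ["--env", f"{variable}={value}"]
    return args


args = repeated_flags(
    input_files=["script.py", "data.tar.gz"],  # hypothetical local files
    outputs=["results.csv"],                   # hypothetical output file
    env={"MODE": "test"},
)

command = [
    "lcli", "job", "run",
    "--name", "demo",
    "--command", "python script.py",
    "--image", "lancium/ubuntu",
    *args,
]

# Print a shell-quoted command line; nothing is executed here.
print(shlex.join(command))
```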
Examples
Example: Create and submit a job in a single step using command line flags
To run a job in a single step, use the lcli job run command with all necessary input flags. The necessary input flags are --name, --command, and --image. Most jobs will also want an --input-file flag or one of its variations (--input-file-cached, --input-data, --input-data-cached, --input-url, --input-url-cached). The following command:
$ lcli job run --name "List job working directory" \
--command "ls" --image lancium/ubuntu --cores 4 --mem 8
creates and submits a job with the following specification (JSON):
{
"id": 12646,
"name": "List job working directory",
"status": "created",
"qos": "high",
"command_line": "ls",
"image": "lancium/ubuntu",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2021-02-02T19:08:15.874Z",
"updated_at": "2021-02-02T19:08:15.874Z"
}
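Before submitting, it can be handy to check that a command line carries the three necessary flags named above. The Python sketch below is a hypothetical pre-flight check and is not something lcli provides. It recognizes the -n alias for --name but does not handle a --flag=value spelling.

```python
import shlex

# Necessary flags from this lesson, with the accepted spellings of each.
REQUIRED = {
    "--name": ("--name", "-n"),
    "--command": ("--command",),
    "--image": ("--image",),
}


def missing_required_flags(argv):
    """Return the necessary flags that do not appear in argv."""
    return [
        flag
        for flag, spellings in REQUIRED.items()
        if not any(spelling in argv for spelling in spellings)
    ]


complete = shlex.split(
    'lcli job run --name "List job working directory" '
    '--command "ls" --image lancium/ubuntu --cores 4 --mem 8'
)
incomplete = shlex.split('lcli job run -n "No image yet" --command "ls"')

print(missing_required_flags(complete))    # prints []
print(missing_required_flags(incomplete))  # prints ['--image']
```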
Example: Create and submit a job in a single step using a job spec file
Instead of manually entering the required flags each time an lcli job run command is used, users can create a job spec file such as:
$ cat /tmp/job_spec.json
{
"name": "List job working directory",
"qos": "high",
"command_line": "ls",
"image": "lancium/ubuntu",
"resources": {
"core_count": 4,
"memory": 8
},
"max_run_time": 259200
}
Once a spec has been created for the desired job, the job can be run using the --spec flag:
$ lcli job run --spec /tmp/job_spec.json
{
"id": 12647,
"name": "List job working directory",
"status": "created",
"qos": "high",
"command_line": "ls",
"image": "lancium/ubuntu",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2021-02-02T19:12:42.376Z",
"updated_at": "2021-02-02T19:12:42.376Z"
}
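A spec file can also be generated programmatically, which avoids hand-written JSON mistakes such as single-quoted strings or trailing commas. The Python sketch below writes the spec from this example to a temporary file and builds the matching command line. The temporary path is an assumption made for illustration; the lesson itself uses /tmp/job_spec.json.

```python
import json
import os
import tempfile

# The job spec from the example above, as a Python dictionary.
spec = {
    "name": "List job working directory",
    "qos": "high",
    "command_line": "ls",
    "image": "lancium/ubuntu",
    "resources": {"core_count": 4, "memory": 8},
    "max_run_time": 259200,
}

with tempfile.TemporaryDirectory() as workdir:
    spec_path = os.path.join(workdir, "job_spec.json")

    # json.dump always emits valid JSON (double-quoted strings, no trailing commas).
    with open(spec_path, "w") as handle:
        json.dump(spec, handle, indent=2)

    # Read the file back to confirm that it parses and matches the original.
    with open(spec_path) as handle:
        reloaded = json.load(handle)

    # The command that would submit this spec; it is not executed here.
    run_command = ["lcli", "job", "run", "--spec", spec_path]

print(reloaded == spec)
print(" ".join(run_command[:4]), "<path to job_spec.json>")
```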
Example: Create a job, update its definition and submit
Jobs do not need to be run in a single step. Users can create a job:
$ lcli job create
{
"id": 12649,
"status": "created",
"qos": "high",
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2021-02-02T19:17:18.731Z",
"updated_at": "2021-02-02T19:17:18.731Z",
"input_files": []
}
The job can then be updated using lcli job update [job ID] with the necessary and optional flags for that job:
$ lcli job update 12649 --name "example job with updates" --cores 4 \
--command "ls" --image lancium/ubuntu
{
"id": 12649,
"name": "example job with updates",
"status": "created",
"qos": "high",
"command_line": "ls",
"image": "lancium/ubuntu",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2021-02-02T19:17:18.731Z",
"updated_at": "2021-02-02T19:19:27.825Z",
"input_files": []
}
$ lcli job update 12649 --ram 8 --notes "more details on updates"
{
"id": 12649,
"name": "example job with updates",
"notes": "more details on updates",
"status": "created",
"qos": "high",
"command_line": "ls",
"image": "lancium/ubuntu",
"resources": {
"core_count": 4,
"gpu_count": null,
"memory": 8,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"created_at": "2021-02-02T19:17:18.731Z",
"updated_at": "2021-02-02T19:21:02.394Z",
"input_files": []
}
$ lcli job submit 12649
Note that you can update a job multiple times. To check on the status of a job, use the lcli job show [job ID] command:
$ lcli job show 46440
#output
{
"id": 46440,
"name": "random job",
"status": "finished",
"qos": "high",
"command_line": "python inputtest.py",
"image": "lancium/ubuntu",
"resources": {
"core_count": 2,
"gpu_count": null,
"memory": 4,
"gpu": null,
"scratch": null
},
"max_run_time": 259200,
"input_files": [
{
"id": 10722,
"name": "inputtest.py",
"source_type": "file",
"source": "/home/rap/bin/inputtest.py",
"cache": false,
"upload_complete": true,
"chunks_received": [
[
1,
105
]
]
}
],
"output_files": [
{
"name": "stderr.txt",
"size": 527,
"available": true
},
{
"name": "stdout.txt",
"size": 11,
"available": true
}
],
"created_at": "2022-04-05T13:42:09.139Z",
"updated_at": "2022-04-05T14:00:09.275Z",
"submitted_at": "2022-04-05T13:42:10.963Z",
"completed_at": "2022-04-05T13:42:27.000Z",
"cost": "0.0",
"memory_used": 38215680
}
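The JSON returned by lcli job show can be inspected in a script. The Python sketch below summarizes a job record by reporting whether its status is "finished" and which output files are marked available. The function name summarize_job is hypothetical, and the sample is a trimmed copy of the output above. Only the "created" and "finished" status values appear in this lesson, so other values are not handled specially.

```python
import json


def summarize_job(show_output):
    """Summarize the JSON printed by `lcli job show`."""
    job = json.loads(show_output)
    available = [
        output["name"]
        for output in job.get("output_files", [])
        if output.get("available")
    ]
    return {
        "id": job["id"],
        "finished": job["status"] == "finished",
        "available_outputs": available,
    }


# Trimmed copy of the `lcli job show 46440` output shown above.
sample = """
{
  "id": 46440,
  "name": "random job",
  "status": "finished",
  "output_files": [
    {"name": "stderr.txt", "size": 527, "available": true},
    {"name": "stdout.txt", "size": 11, "available": true}
  ]
}
"""

summary = summarize_job(sample)
print(summary)
```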
Many other options are available for working with job outputs; follow the command line and use the --help flag to see which options are available to you. For example, you can download an output file to your local directory or view an output file.