The CoeGSS HPC Job Submission component allows users to submit HPC jobs to the clusters of HPC centers (HPCaaS). HPC Job Submission uses the Cloudify orchestration toolchain as a backend, which assumes that the user prepares a Cloudify blueprint for every job to be submitted. The connection between Cloudify and the HPC clusters is established by means of the Cloudify HPC plugin.
The system provides access to the job submission services only to authenticated users, who sign in with their general CoeGSS portal credentials. In the CoeGSS portal, the job submission process includes the following steps, common to all Cloudify applications: uploading the blueprint, deploying it with a set of inputs, running the HPC jobs and, once they have finished, removing the deployment and the blueprint.
As soon as the parallel job is finished, the user is notified by the system. Afterwards, the user can remove the deployment and delete the blueprint if there are no plans to run the HPC job again.
Note that an ordinary user is not permitted to see the jobs of other users.
The administrator of the job submission services has access to the Cloudify Manager Web Interface, the standard Web UI that comes with Cloudify Manager out of the box. The Cloudify Manager Web UI provides the administrator with full control over the workflow of all submitted applications. In particular, the administrator can inspect the blueprints, deployments and executions of all users, follow their logs and events, and cancel running executions.
More details on the Cloudify Web UI can be found here.
Besides the Web UI, the administrator can control the job submission process via the Cloudify CLI.
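For example, the full lifecycle of the blueprint discussed in the next subsection could be driven from the command line roughly as follows. This is only a sketch: the blueprint and deployment IDs are placeholders, and run_jobs is the custom workflow provided by the Cloudify HPC plugin.

# Upload the blueprint archive under a chosen blueprint ID
cfy blueprints upload -b simple-hpc-job blueprint.tar.gz

# Create a deployment from the blueprint, passing the input file
cfy deployments create -b simple-hpc-job -i inputs.yaml simple-hpc-job-run1

# Bootstrap the deployment and submit the HPC jobs
cfy executions start install -d simple-hpc-job-run1
cfy executions start run_jobs -d simple-hpc-job-run1

# Once the jobs have finished, clean up
cfy executions start uninstall -d simple-hpc-job-run1
cfy deployments delete simple-hpc-job-run1
cfy blueprints delete simple-hpc-job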
In order to upload a blueprint, the user should switch to the "Blueprints" tab, which lists all blueprints available to this user, and push the "Create new blueprint" button. This redirects the user to the blueprint upload form.
In this form, the user should specify the blueprint ID, the blueprint file name, and the path to the tar.gz archive that contains the blueprint file together with the other files it uses, such as bootstrapping and reverting scripts.
Figure 2: Blueprint upload form
The blueprint upload starts when the user pushes the "Upload" button.
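As a side note on preparing the tar.gz archive mentioned above, a hypothetical packaging could look as follows; the file names follow the blueprint example later in this section, and blueprint.yaml is an assumed name for the blueprint file.

# Hypothetical contents of the blueprint archive:
#   blueprint.yaml                       - the blueprint itself (assumed name)
#   inputs-def.yaml                      - extra input definitions imported by the blueprint
#   scripts/bootstrap_sbatch_example.sh  - bootstrapping script
#   scripts/revert_sbatch_example.sh     - reverting script
tar czf blueprint.tar.gz blueprint.yaml inputs-def.yaml scripts/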
In order to deploy a blueprint, the user should switch to the "Deployments" tab, which lists the deployments and runs made by this user, and push the "Install" button. This redirects the user to the blueprint deployment form, where she/he should specify the deployment ID, the blueprint ID, and the input file for the deployment.
Figure 3: List of deployments
Afterwards, the deployment can be executed by pressing the "Run Jobs" button. The user can track the status of the running jobs by pushing the "Refresh" button.
Blueprint files are YAML files written in accordance with the OASIS TOSCA standard. They describe the execution plans for the lifecycle of the application, including the installing, starting, terminating, orchestrating, and monitoring steps.
This section presents the details of a blueprint which defines a simple HPC job. For a more sophisticated example, we refer to the well-documented blueprint for the CoeGSS network reconstruction tool available from here. Furthermore, this GitHub repository contains a number of Cloudify HPC blueprints of different complexity.
tosca_definitions_version: cloudify_dsl_1_3

imports:
    # to speed things up, it is possible to download this file
    - http://raw.githubusercontent.com/mso4sc/cloudify-hpc-plugin/master/resources/types/cfy_types.yaml
    # HPC plugin
    - http://raw.githubusercontent.com/MSO4SC/cloudify-hpc-plugin/master/plugin.yaml
    - inputs-def.yaml
inputs:
    ############################################
    # arguments of the script
    ############################################
    new_file_name:
        description: Name of the file created by the HPC job
        default: "test.txt"
        type: string

    ############################################
    # Details of the application lifecycle
    ############################################
    # First HPC configuration
    coegss_hlrs_hazelhen:
        description: Configuration for the primary HPC to be used
        default: {}

    # Job prefix name
    job_prefix:
        description: Job name prefix in HPCs
        default: 'coegss_sn4sp_'
        type: string

    ############################################
    # Data publishing
    ############################################
    coegss_datacatalogue_entrypoint:
        description: entrypoint of the data catalogue
        default: "https://coegss1.man.poznan.pl"

    coegss_datacatalogue_key:
        description: API Key to publish the outputs
        default: ""

    coegss_output_dataset:
        description: ID of the CKAN output dataset
        default: ""
Sample inputs for this example are presented at the end of this section. The node_templates section of the blueprint defines the HPC compute node first_hpc for the cluster Hazelhen and the job single_job assigned to it (see the relationships section). The results of the job are published to CKAN.
node_templates:
    first_hpc:
        type: hpc.nodes.Compute
        properties:
            config: { get_input: coegss_hlrs_hazelhen }
            job_prefix: { get_input: job_prefix }
            base_dir: "$HOME"
            workdir_prefix: "cloudify_coegss_"
            skip_cleanup: True

    single_job:
        type: hpc.nodes.Job
        properties:
            job_options:
                type: 'SBATCH'
                command: "touch.script"
            deployment:
                bootstrap: 'scripts/bootstrap_sbatch_example.sh'
                revert: 'scripts/revert_sbatch_example.sh'
                inputs:
                    - { get_input: new_file_name }
            skip_cleanup: True
            publish:
                - type: "CKAN"
                  entrypoint: { get_input: coegss_datacatalogue_entrypoint }
                  api_key: { get_input: coegss_datacatalogue_key }
                  dataset: { get_input: coegss_output_dataset }
                  file_path: { get_input: new_file_name }
                  name: "Sampled synthetic population"
                  description: ""
        relationships:
            - type: job_contained_in_hpc
              target: first_hpc
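The bootstrapping and reverting scripts referenced in the deployment section above are shipped inside the blueprint archive and are not reproduced here. As an illustration only, a hypothetical bootstrap script consistent with this blueprint could generate the batch script touch.script that the job submits, assuming that the deployment inputs (here new_file_name) are passed to the script as arguments:

#!/bin/bash
# scripts/bootstrap_sbatch_example.sh -- hypothetical sketch
# $1 receives the new_file_name deployment input and is baked into touch.script
cat > touch.script << EOF
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
touch $1
EOF
chmod +x touch.script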
Finally, the blueprint contains the outputs section, where the user should specify the Cloudify outputs.
outputs:
    single_job_name:
        description: single job name in the HPC
        value: { get_attribute: [single_job, job_name] }
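Once the deployment has been installed, these outputs can be inspected, for instance, with the standard Cloudify CLI command (the deployment ID below is a placeholder):

# Print the outputs of the deployment, including single_job_name
cfy deployments outputs simple-hpc-job-run1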
Blueprint inputs define the parameters with which the application should be deployed and executed. The inputs usually specify parameters of the algorithm, as well as parameters defining the system configuration and the blueprint lifecycle. For example, the above-mentioned blueprint requires the following parameters:
new_file_name defines the name of the file created by the HPC job.
coegss_hlrs_hazelhen defines the workload manager information and credentials for the remote HPC system. This input has no meaningful default value and thus must be specified in the input file when the blueprint is deployed.
job_prefix contains the job name prefix on the HPC; the default prefix is coegss_sn4sp_.
coegss_datacatalogue_key defines the API key of the CKAN user.
coegss_output_dataset defines the CKAN dataset to which the output resources are published.
Example of the input file:
new_file_name: "unusual-test.csv"
coegss_datacatalogue_key: "<some-ckan-api-key>"
coegss_output_dataset: "coegss-network-reconstruction-results"

# HLRS Hazelhen cluster configuration
coegss_hlrs_hazelhen:
    credentials:
        host: "hazelhen.hww.hlrs.de"
        user: "USERNAME"
        private_key: |
            -----BEGIN RSA PRIVATE KEY-----
            -----END RSA PRIVATE KEY-----
        private_key_password: "PRIVATE_KEY_PASSWORD"
        password: "PASSWORD"
        login_shell: true
    country_tz: "Europe/Stuttgart"
    workload_manager: "TORQUE"
Note that all the inputs must be initialized in the input file unless the blueprint specifies default values for them.
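For the blueprint above, every input except the HPC configuration carries a default, so a minimal input file can be quite short. Individual values can also be overridden on the command line when creating a deployment via the CLI; the following sketch uses placeholder IDs:

# Use inputs.yaml for the HPC configuration and override new_file_name inline
cfy deployments create -b simple-hpc-job \
    -i inputs.yaml \
    -i new_file_name=custom-output.txt \
    simple-hpc-job-run2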