Running the Pipeline

Starting a pipeline run

The Pre-Processing Pipeline can be run from the command line using a CWL runner, e.g., cwltool or toil. Below we describe how the pipeline can be run with these runners.

$ cwltool --no-container $PREPROCESS_ROOT/workflows/pipeline.cwl input.json

$ toil-cwl-runner --no-container $PREPROCESS_ROOT/workflows/pipeline.cwl input.json

Here, $PREPROCESS_ROOT refers to the location where the CWL files have been installed; this environment variable must be set before running either command. The pipeline parameters are provided via a JSON file, described in the Configuring the pipeline section at the bottom of this page. In addition, cwltool and toil accept a number of useful command-line arguments, some of which are listed below. Please refer to their respective documentation for a full overview.

Note

Do not forget to add the --no-container option when the dependencies have been installed locally on the system. Besides running the pipeline fully inside a container (as explained in the next section), this is the only supported way to execute the Pre-Processing Pipeline.
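As a minimal sketch of the local invocation described above, the following Python helper assembles the cwltool argument list and fails early when $PREPROCESS_ROOT is unset. The function name build_cwltool_command and the example root /opt/prep are illustrative assumptions, not part of the pipeline.

```python
import os
import shlex


def build_cwltool_command(input_json, environ=None):
    """Return the cwltool argument list for a local run without containers.

    Hypothetical helper: it only assembles the command shown in this section.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("PREPROCESS_ROOT")
    if not root:
        # The workflow path cannot be resolved without this variable.
        raise RuntimeError("PREPROCESS_ROOT is not set")
    workflow = root.rstrip("/") + "/workflows/pipeline.cwl"
    return ["cwltool", "--no-container", workflow, str(input_json)]


# Example with an assumed installation root; pass the result to
# subprocess.run(cmd, check=True) to actually start the pipeline.
cmd = build_cwltool_command("input.json", environ={"PREPROCESS_ROOT": "/opt/prep"})
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.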

Starting a run from within a container

If you followed the Docker installation instructions on the Downloading and Installation page, you can run the container using Docker as follows:

$ docker run --rm -v <source_directory>:<mount_point> -w <mount_point> preprocess cwltool --preserve-entire-environment --no-container /usr/local/share/prep/workflows/pipeline.cwl input.json

Since the Pre-Processing Pipeline is running inside a container, do not forget to add the --no-container option to your CWL runner.

cwltool options

There are a number of command-line options you might want to consider adding when running cwltool:

  • --outdir: specifies the (relative) path to the directory containing the output of the pipeline (make sure to mount this directory when running the pipeline in a container)

  • --log-dir: specifies the location of the log files produced from the stdout and stderr of a CommandLineTool (make sure to mount this directory when running the pipeline in a container)

  • --preserve-entire-environment: preserve all of your system’s environment variables when running the tools (add this when the dependencies have been installed manually or when running the pipeline inside a container)

  • --no-container: do not execute jobs in a container (add this when the dependencies have been installed manually or when running fully inside a container)

  • --singularity: use the Apptainer (previously Singularity) runtime for running containers instead of Docker

  • --debug: more verbose output, useful when debugging

Make sure to mount the output and log directories, specified by --outdir and --log-dir, when running the pipeline inside a container to ensure the files are not lost after execution.
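The mounting advice above can be sketched as follows. This Python snippet assembles a docker run command in which --outdir and --log-dir both point below the mounted directory, so their contents remain on the host after the container exits. The helper name, the mount point /data, the host path /home/user/run, and the directory names results and logs are assumptions chosen for illustration; the image name and workflow path are taken from the Docker example on this page.

```python
import shlex
from pathlib import PurePosixPath


def build_docker_command(source_dir, mount_point="/data",
                         outdir="results", logdir="logs"):
    """Assemble a containerised cwltool run with output and logs on the mount."""
    mount = PurePosixPath(mount_point)
    return [
        "docker", "run", "--rm",
        "-v", f"{source_dir}:{mount}",   # host directory -> container mount point
        "-w", str(mount),                # run from the mounted directory
        "preprocess",                    # image name used in the Docker example
        "cwltool",
        "--preserve-entire-environment",
        "--no-container",
        "--outdir", str(mount / outdir),   # written below the mount, kept on the host
        "--log-dir", str(mount / logdir),  # same for the tool logs
        "/usr/local/share/prep/workflows/pipeline.cwl",
        "input.json",
    ]


# Assumed host directory; it should contain input.json.
print(shlex.join(build_docker_command("/home/user/run")))
```

Because input.json is given as a relative path, it is resolved against the working directory set by -w, which is the mounted directory.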

A full overview of CLI arguments is available in the cwltool documentation.

toil options

Similarly, these options might be of interest when using toil:

  • --outdir: specifies the path to the directory containing the output of the pipeline

  • --workDir: specifies the path to the directory where the temporary files generated by Toil should be placed

  • --log-dir: specifies the location of the log files produced from the stdout and stderr of a CommandLineTool

  • --logFile: path to the main log file

  • --jobStore: path to the Toil job-store (must not exist yet)

  • --batchSystem: use a specific batch system of an HPC cluster (e.g., slurm or single_machine)

  • --preserve-entire-environment: preserve all of your system’s environment variables when running the tools (add this when the dependencies have been installed manually)

  • --no-container: do not execute jobs in a container (add this when the dependencies have been installed manually or when running fully inside a container)

  • --singularity: use the Apptainer (previously Singularity) runtime for running containers instead of Docker

  • --stats: with this option Toil collects runtime statistics (they can be used by toil stats)

Make sure to mount the output, log, and working directories when running the pipeline inside a container to ensure the files are not lost after execution.
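To illustrate how the options above fit together, here is a minimal Python sketch that places all Toil directories under one run directory and refuses to proceed when the job store already exists. The helper name build_toil_command and the subdirectory names (jobstore, work, results, logs, toil.log) are assumptions made for this example.

```python
import os
import shlex


def build_toil_command(workflow, input_json, run_dir,
                       batch_system="single_machine"):
    """Assemble a toil-cwl-runner call with all paths below run_dir."""
    job_store = os.path.join(run_dir, "jobstore")
    if os.path.exists(job_store):
        # --jobStore must point to a location that does not exist yet.
        raise FileExistsError(f"job store already exists: {job_store}")
    return [
        "toil-cwl-runner",
        "--no-container",
        "--batchSystem", batch_system,
        "--jobStore", job_store,
        "--workDir", os.path.join(run_dir, "work"),
        "--outdir", os.path.join(run_dir, "results"),
        "--log-dir", os.path.join(run_dir, "logs"),
        "--logFile", os.path.join(run_dir, "toil.log"),
        workflow,
        input_json,
    ]


# Assumed paths, for illustration only.
cmd = build_toil_command("/opt/prep/workflows/pipeline.cwl", "input.json",
                         "/tmp/prep-run-example")
print(shlex.join(cmd))
```

Keeping everything under a single run directory means that only one directory has to be mounted when the runner is started inside a container.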

A full overview of CLI arguments is available in the Toil documentation.

Configuring the pipeline

The parameters of the pipeline are provided as a JSON file. As an example, a minimal input could be a list of MeasurementSets (MSs) that you would like to process:

{
    "msin": [
        {
            "class": "Directory",
            "path": "/data/L888536_SAP000_SB026_uv.MS"
        },
        {
            "class": "Directory",
            "path": "/data/L888536_SAP000_SB027_uv.MS"
        }
    ]
}

Refer to the Overview of the Pipeline section for a full overview of all pipeline parameters and their default values.
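When many MeasurementSets need to be processed, writing the JSON file by hand becomes tedious. The following Python sketch generates the minimal input shown above from a list of paths. The helper names and the *.MS glob pattern are assumptions for illustration; only the msin key with its class and path fields comes from the example on this page.

```python
import json
from pathlib import Path


def build_input(ms_paths):
    """Return the minimal pipeline input for the given MeasurementSets."""
    return {
        "msin": [{"class": "Directory", "path": str(p)} for p in ms_paths]
    }


def find_measurement_sets(data_dir, pattern="*.MS"):
    """List directories in data_dir matching pattern, sorted by name."""
    return sorted(p for p in Path(data_dir).glob(pattern) if p.is_dir())


def write_input(ms_paths, output="input.json"):
    """Write the input file that is passed to the CWL runner."""
    Path(output).write_text(json.dumps(build_input(ms_paths), indent=4) + "\n")


# Print the input for two explicitly listed MeasurementSets.
example = build_input([
    "/data/L888536_SAP000_SB026_uv.MS",
    "/data/L888536_SAP000_SB027_uv.MS",
])
print(json.dumps(example, indent=4))
```

To write the file for every MeasurementSet in a directory, call write_input(find_measurement_sets("/data")).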