
Create a new workflow

Add a workflow to the git repo

Each workflow has a corresponding directory in the automation repository.

The git repository workflow directory serves as the definition/configuration of the workflow. It must contain a Jenkinsfile instructing Jenkins which steps to execute, and a python script for each task that is executed as part of the workflow. Further executables or configuration files can be added. The software used to perform a particular task (CMSSW, private code, etc.) should live in a separate repository, keeping the code development separate from the automation configuration, which is what the git repository described here is all about.
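
For orientation, a workflow directory following this pattern might look like the sketch below (all names are illustrative, not prescriptive):

my-workflow/
    Jenkinsfile         # steps executed by Jenkins
    task1.py            # handler script for the first task
    task2.py            # handler script for a second task
    my_config_cfi.py    # optional workflow-specific configuration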

Note

Drawing a line between what is calibration code and what is configuration is at times not straightforward. For instance, for tasks that run CMSSW we certainly won't include a separate copy of the CMSSW code in the automation repository, but there are workflows for which only a dedicated _cfi.py is needed, and it seems impractical to create an extra repository just to host a single file. In such a case the single python file is considered configuration and is hosted in the automation repo. In other cases, where new plugins are added to existing CMSSW packages or new packages are added altogether, it is natural to have a separate repository.

As an example of the first case see the validation step of the online-pulse-shape workflow; for the second see online-ecalelf and the related CMSSW package.

The template-workflow serves as a prototype for the definition of a workflow with a single task executing jobs in parallel using HTCondor on CERN lxplus.

Task implementation

Each task is handled by a TaskHandler class. The base handler class provides a skeleton for the actual implementation of a class that can handle all the task features and transitions.

The AutoCtrlScriptBase class, on which all handlers are based, provides the features needed to quickly turn a python class into an executable script. This offers a certain degree of flexibility: the class can be used from interactive python while developing, while all its functions are packaged into a predefined script structure to be used by the Jenkinsfile in production.
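
As a rough illustration of this dual use (the command line flag below is an assumption; the class and its init options are taken from the lock example further down):

# interactive use while developing, e.g. in a python shell:
from ecalautoctrl import HTCHandlerByRunDBS

handler = HTCHandlerByRunDBS(task='phisym-reco', dsetname='/AlCaPhiSym/*/RAW')
ret = handler()  # drive the task by hand and inspect the return code

# in production the same class is packaged as an executable script and
# invoked by the Jenkinsfile, e.g.:
#   ./phisym-reco.py --submit   # hypothetical flag parsed via AutoCtrlScriptBase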

The handler controls the task status and execution by interfacing with the influx database. A RunCtrl instance is created as a member of the class (accessible as self.rctrl). The RunCtrl provides the interface to the run status table in the database, where the information about the status of each task is stored for each CMS DAQ run.

The handler also needs an instance of JobCtrl in order to interact with the job table in the database. The job table contains the status of all jobs executed by the automation. The JobCtrl init options are used to select the jobs belonging to a given task. Contrary to the RunCtrl instance, the JobCtrl one is not created as a class member by the base class but is instantiated on the fly within the handler methods. This is necessary since a different instance has to be created for each run number.
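
A rough sketch of this pattern (the constructor options and the run loop are illustrative assumptions, not the actual ecalautoctrl API):

from ecalautoctrl import HandlerBase, JobCtrl  # assumed import path

class ExampleHandler(HandlerBase):
    def check(self):
        for run in self.runs:  # hypothetical: runs currently in "processing"
            # a fresh JobCtrl selects the jobs of this task for this run
            jctrl = JobCtrl(task=self.task, run=run)  # assumed init options
            ...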

The handler class should implement the submit, resubmit and check methods. These define the 3 crucial status transitions a task goes through:

class method   transitions
submit         "new" -> "processing"
               "reprocess" -> "processing"
resubmit       "processing" -> "processing"
check          "processing" -> "done"
               "processing" -> "failed"

The state here refers to the task, i.e. "failed" means that at least one job has been marked as permanently failed (after a number of retries that is specified as a command line option to the check command).

Failed jobs are supposed to be resubmitted by the handler calling the resubmit method, while permanent failures, once marked as such by the check method, require manual intervention to be fixed. The manual resubmission will set the status of the task to reprocess in order to clearly mark in the task history that the task went through a manual resubmission.

The set of methods (submit, resubmit and check) is not strictly defined in the system: each task can in principle define more or fewer. Nonetheless, the 3 methods provide enough flexibility for the majority of tasks while keeping the implementation simple. As explained in the Workflow and task concepts section, if a task requires two jobs to be executed in parallel one should consider splitting it into two separate tasks. Keeping each task as atomic as possible should ensure the 3 methods above are enough to implement every task. The typical task progression follows this scheme:

graph LR
  N[new] --> |<b><mark>submit</mark></b>| P[processing];
  P --> |<mark><b>resubmit</b></mark>| P;
  P --> C{<mark><b>check</b></mark>};
  C --> D[done];
  C --> F[failed];
  F --> |<i>manual reprocess</i>| R[reprocess];
  R --> P;
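
Putting this together, a skeleton handler implementing the three transitions might look as follows (method bodies are placeholders; as noted below, the base classes already provide generic implementations):

from ecalautoctrl import HandlerBase  # assumed import path

class MyTaskHandler(HandlerBase):
    """Hypothetical skeleton covering the three status transitions."""

    def submit(self):
        # "new"/"reprocess" -> "processing": create and submit the jobs
        ...

    def resubmit(self):
        # "processing" -> "processing": resubmit jobs that failed transiently
        ...

    def check(self):
        # "processing" -> "done" or "failed": inspect the job table; jobs
        # exceeding the retry limit (a command line option of the check
        # command) mark the task as permanently failed
        ...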

The check method provided by the HandlerBase is general enough to fit all tasks implemented so far. The submit and resubmit methods, instead, are generalized only for the submission of simple HTCondor jobs.

Single job tasks require the implementation of a dedicated handler class following the standard structure explained below. Although fairly general, the HTCHandler class might not cover all the use cases in which parallel jobs are required; in such cases one should extend it or implement a new class, as in the sketch below.
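
For instance, a minimal extension might override only the submission step (a sketch assuming super().submit() delegates to the generic HTCondor submission):

from ecalautoctrl import HTCHandler  # assumed import path

class CustomParallelHandler(HTCHandler):
    """Hypothetical handler with a non-standard job preparation step."""

    def submit(self):
        # prepare extra inputs or a custom job splitting here, then fall
        # back to the generic HTCondor submission
        ...
        return super().submit()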

The handler classes available in the ecalautoctrl package do not directly implement the groups and files methods. These two methods should provide, respectively, the grouping of runs to be processed together and the relevant input files for each group. The implementation of the two methods is left to derived classes and decorators. Decorators offer a quick way to provide different classes with the same functionality, akin to multiple inheritance but with a higher level of flexibility. A number of decorators implemented in the ecalautoctrl package provide standard grouping and file access: group_by_run, group_by_fill, group_by_intlumi, prev_task_data_source and dbs_data_source. A set of handlers specialized with the above decorators is also provided, e.g. HTCHandlerByFill. Dedicated ones can be created on the fly and stored in the workflow branch:

from ecalautoctrl import HandlerBase, prev_task_data_source, process_by_fill  # assumed import path

@prev_task_data_source # (1)
@process_by_fill(fill_complete=True) # (2)
class ZeeMonHandler(HandlerBase):
    """
    Execute all the steps to generate the Zee monitoring plots.
    Process fills that have been dumped (completed).

    :param task: workflow name.
    :param prev_input: name of the workflow from which gather the input data.
    :param deps_tasks: list of workfow dependencies.
    """

    ...
  1. Load input data from influxdb, accessing the output of a previous task specified in the class constructor via the prev_input option.
  2. Process runs by grouping them based on the LHC fill number. Only process the data once the fill has been completed.
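
To make the decorator mechanism concrete, the sketch below shows one way a grouping decorator could inject a groups method into a handler class (purely illustrative, not the actual ecalautoctrl implementation):

def group_by_run(cls):
    """Illustrative stand-in: attach the simplest grouping policy to cls."""
    def groups(self, runs):
        # one group per run: each run is processed on its own
        return [[run] for run in runs]
    cls.groups = groups
    return cls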

Typically, the first task in a workflow will read data from DBS (a.k.a. DAS), either CMS prompt reco datasets or RAW files from calibration streams. It is important to note that the automation framework is independent of the CMS DAQ and T0 processing (and vice-versa), therefore one has to pay attention to the synchronization between the two systems. Two aspects in particular are crucial:

  • Conddb: a reconstruction task might require synchronization with Conddb (the condition database). In particular, RAW files from calibration streams are usually processed and made available on EOS by T0 before the corresponding physics dataset for the same run has been fully processed in prompt reco. Often the RAW data are available even before the calibrations from dedicated systems (e.g. laser corrections for ECAL) or the PCL (e.g. ECAL pedestals) are updated by the respective workflows. To ensure that the reconstruction happens with the correct conditions, the workflow can accept lock functions that prevent a run from being processed, even if in status "new", unless a certain condition is met. The Conddb related lock functions are available here.

  • Tier0: data files from T0 of any data tier (RAW, PROMPT, EXPRESS, ALCARECO, etc.) are copied to several sites with disk storage after being processed at T0. The information about each file is uploaded to DBS at regular intervals by the T0 software. For each run and dataset combination, the information about the files containing the data for that combination is uploaded several times during the T0 processing. In practice this means that files of a given dataset containing data from a given run might appear while jobs that will produce more of those files are still running at T0. This can lead to automation jobs running on a partial dataset if a task picks up a run before the T0 processing for that run and dataset has fully completed. A set of locks to prevent this issue is provided here.

An example of how to use the locks can be found here:

#!/usr/bin/env python3
import sys
from ecalautoctrl import HTCHandlerByRunDBS, CondDBLockGT, T0ProcDatasetLock
from ecalautoctrl.TaskHandlers import AutoCtrlScriptBase

if __name__ == '__main__':
    laser_ped_lock = CondDBLockGT(records=['EcalLaserAPDPNRatiosRcd', 'EcalPedestalsRcd']) # (1)
    t0lock = T0ProcDatasetLock(dataset='/AlCaPhiSym', stage='Repack') # (2)

    handler = HTCHandlerByRunDBS(task='phisym-reco',
                                 dsetname='/AlCaPhiSym/*/RAW',
                                 locks=[laser_ped_lock, t0lock]) # (3)

    ret = handler()

    sys.exit(ret)

get_opts = AutoCtrlScriptBase.export_options(HTCHandlerByRunDBS) # (4)

  1. Create a lock to wait for a payload to be appended to the specified tags with a "since" time/run-number that is more recent than the run end time.
  2. Create a lock to wait for T0 to fully process and copy to EOS the RAW data from the /AlCaPhiSym stream.
  3. Set the locks in the handler class.
  4. Use the export_options static method to generate a function (assigned to get_opts) that can be used by sphinx to generate automatic documentation (including command line options and their default values) of the python script.

Jenkinsfile