Shifter guide

This page contains detailed instructions for the ECAL shifters (usually DOC/DGL and PFG) in charge of monitoring the status of the automation workflow for the calibration of the prompt reconstruction.

ECAL automation monitoring

The three main tools to monitor the status of the automation are:

Grafana: provides an overview of the runs being processed, as well as logs and job outputs.
Jenkins: the instance where jobs get scheduled for execution.
ECAL Automation Mattermost notification channel: the public Mattermost channel where the automation system sends notifications.

Note

Experts and developers might want to join the ECAL Automation - DEV Mattermost channel as well, where low-level notifications are sent along with notifications from the development branches of each workflow.

The Grafana home dashboard offers links to the monitoring dashboards of the various subsystems.

ECAL grafana overview

Grafana ECAL home with links to the monitoring dashboard

Each dashboard offers monitoring of different workflows/utilities:

Prompt calibration monitoring: the main automation monitoring page. Its Processing overview panel displays the status of each workflow for each run.

ECAL automation known issues and actions

Workflow failures are reported by the system on Mattermost as well as on the main monitoring panel on Grafana.

Currently there is only one major failure that cannot be tracked automatically by the system. It happens when Jenkins loses its connection to lxplus: while the connection is being re-established, the job queue fills up and the lxplus node is subsequently marked as suspended by Jenkins.

Warning

The following steps can only be performed by someone with admin privileges on the Jenkins instance.

This failure can only be detected by checking the status of the system on Jenkins from time to time (it occurs less than once a day). If the lxplus node gets marked as suspended, the job queue will start filling up:

Filled job queue

Full build queue

Meanwhile, in the nodes list, lxplus will appear in this state:

Lxplus suspended

Lxplus in suspended state
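The periodic queue check described above could be partly scripted. The following is a minimal sketch and not part of the ECAL automation code. It assumes that the JSON document served by the Jenkins remote API at `/queue/api/json` has already been fetched and parsed (authentication and the HTTP request are left out). The function names `queue_summary` and `queue_looks_stuck`, as well as the thresholds `max_queued` and `max_wait_s`, are hypothetical and would need tuning against the real system.

```python
def queue_summary(queue_json, now_ms):
    """Summarise a parsed Jenkins /queue/api/json document.

    Returns the number of queued items and how long (in seconds)
    the oldest one has been waiting. `inQueueSince` is a timestamp
    in milliseconds since the epoch.
    """
    items = queue_json.get("items", [])
    if not items:
        return {"queued": 0, "oldest_wait_s": 0.0}
    oldest_ms = min(item["inQueueSince"] for item in items)
    return {
        "queued": len(items),
        "oldest_wait_s": (now_ms - oldest_ms) / 1000.0,
    }


def queue_looks_stuck(queue_json, now_ms, max_queued=10, max_wait_s=1800):
    """Heuristic (assumed thresholds): many queued items AND a long wait.

    This only flags a suspicious queue; the node state still has to be
    confirmed by hand on the Jenkins nodes page.
    """
    summary = queue_summary(queue_json, now_ms)
    return (
        summary["queued"] >= max_queued
        and summary["oldest_wait_s"] >= max_wait_s
    )
```

Requiring both conditions is a deliberate choice in this sketch: a short burst of queued jobs that drains within minutes is then not reported as a failure.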

To restore the lxplus node and free up the build queue, execute the following line in the Jenkins script console:

Jenkins.instance.getNode('lxplus').toComputer().setAcceptingTasks(true);

Executing this line does not return any value. If it succeeds, lxplus comes back online immediately and the build queue starts to drain.

Trigger jobs

At the moment the validation of the new transparency corrections for HLT and L1 is handled separately within the CMSSW Jenkins instance. The jobs there are started by pushing changes to a dedicated GitHub repository [ADD LINK]. The corrections are computed every fill and deployed during the interfill at both L1 and HLT. More details on the impact of the transparency corrections at HLT

In order to trigger the jobs a "scheduler" is needed. Three schedulers are currently running in the ECAL automation Jenkins instance:

List of ecaltrg items

The scheduler jobs (builds, in Jenkins jargon) run on lxplus. Jenkins is configured to connect to lxplus using the ecaltrg service account in order to access the common afs area where the scheduler code is installed (/afs/cern.ch/work/e/ecaltrg/run3/). The connections are set up on two nodes that are identical apart from the name: lxplus-trg and lxplus-trg-backup. As the name suggests, the latter is a backup in case the main node is not available. All three schedulers are configured to try to run jobs on both nodes; lxplus-trg-backup is called into action whenever lxplus-trg is not available (offline, crashing, etc.).

The node status can be checked either on the Jenkins main dashboard (scrolling down along the left side of the page) or by directly accessing the nodes page. The nodes page should look something like this:

Nodes page normal

Jenkins nodes page: normal configuration with lxplus-trg online and lxplus-trg-backup offline (small red cross)

Whenever lxplus-trg goes offline, the same page will show the node as offline. If the backup node has taken over, the page should look like this:

Nodes page backup

Jenkins nodes page: backup configuration with lxplus-trg offline and lxplus-trg-backup online
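The two configurations shown above can also be told apart programmatically. The sketch below is illustrative only and not part of the scheduler code. It assumes the JSON document served by the Jenkins remote API at `/computer/api/json` has already been fetched and parsed, and it relies on the `displayName` and `offline` fields of each entry in the `computer` list. The function names `node_is_online` and `classify_nodes` and the returned labels are hypothetical.

```python
def node_is_online(computer_json, name):
    """Return True if the node called `name` is reported as online."""
    for node in computer_json.get("computer", []):
        if node.get("displayName") == name:
            # Treat a missing `offline` field as offline, to be safe.
            return not node.get("offline", True)
    raise KeyError(f"node {name!r} not found in the nodes list")


def classify_nodes(computer_json, main="lxplus-trg",
                   backup="lxplus-trg-backup"):
    """Map the state of the two nodes to a short label.

    "normal": main online, backup offline
    "backup": main offline, backup online (manual restore advised)
    "both-online" / "both-offline": unexpected, ask the experts
    """
    main_on = node_is_online(computer_json, main)
    backup_on = node_is_online(computer_json, backup)
    if main_on and not backup_on:
        return "normal"
    if backup_on and not main_on:
        return "backup"
    return "both-online" if main_on else "both-offline"
```

For example, a nodes list in which lxplus-trg has `"offline": true` and lxplus-trg-backup has `"offline": false` is classified as `"backup"`.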

A notification is also sent (by each job) to the ETT shift chat. The notification looks like this:

backup notification

Warning

All operations described below require special permissions. If the "launch agent" and "disconnect" buttons mentioned below are not visible it means your account does not have the ecaltrg role within Jenkins. Ask to be added to the cms-ecal-trigger-team e-group.

If the backup node has taken over, it means that one or more jobs might have failed and that the backup node successfully stepped in to resume job execution. The lxplus-trg node is configured to handle the job load better than the backup, so it is advisable to restore the "normal" operating mode by bringing lxplus-trg online and disconnecting lxplus-trg-backup. To do this, go to the nodes page, click on lxplus-trg, and then click the "launch agent" button:

Launch lxplus-trg agent

You will be redirected to the connection page, where you will see the logs from the attempted connection to lxplus. After a few seconds the logs will stop, and you should be able to read "Agent successfully connected and online" at the bottom. In case of problems the connection will get stuck before displaying this message, and intervention from the experts is required.

After connecting the lxplus-trg node, go back to the main nodes page and disconnect the lxplus-trg-backup node by opening its page and clicking the "Disconnect" button in the left panel:

Disconnect lxplus-trg-backup agent

You will be asked to enter a message for the disconnection reason. Please enter something meaningful (most of the time a simple "resuming standard operation" is enough).

If the node status is restored correctly, each job will send a notification acknowledging the switch back to the lxplus-trg node:

restore notification