Shifterguide
This page contains detailed instructions for the ECAL shifters (usually DOC/DGL and PFG) in charge of monitoring the status of the automation workflow for the calibration of the prompt reconstruction.
ECAL automation monitoring
The three main tools to monitor the status of the automation are:
- Grafana: gives an overview of the runs being processed, as well as logs and job outputs.
- Jenkins: the instance where jobs get scheduled for execution.
- ECAL Automation Mattermost notification channel: the public Mattermost channel where the automation system sends notifications.
Note
Experts and developers might also want to join the ECAL Automation - DEV Mattermost channel, where low-level notifications are sent along with notifications from the development branches of each workflow.
The Grafana home dashboard offers links to the monitoring dashboards of the various subsystems.
Grafana ECAL home with links to the monitoring dashboards
Each dashboard offers monitoring of different workflows/utilities:
- Prompt calibration monitoring: the main automation monitoring page. The Processing overview panel displays the status of each workflow for each run.
ECAL automation known issues and actions
Workflow failures are notified by the system on Mattermost as well as on the main monitoring panel on Grafana.
Currently there is only one major failure that can't be automatically tracked by the system.
This happens when Jenkins loses its connection with lxplus and, in the process of re-establishing it, the job queue fills up and the lxplus node is subsequently marked as suspended by Jenkins.
Warning
The following steps can only be performed by someone with admin privileges on the Jenkins instance.
This failure can only be detected by checking the status of the system on Jenkins from time to time (the failure rate is below once a day).
If the lxplus node gets marked as suspended, the job queue will start filling up:
Full build queue
Meanwhile, in the nodes list, lxplus will appear in this state:
Lxplus in suspended state
To restore the lxplus node operation and free up the build queue, execute the following line in the Jenkins script console:
Executing this line will not return any value. Nonetheless, if successful, it will immediately bring lxplus back online and the build queue will start freeing up.
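The "is the build queue filling up?" part of the periodic check described in this section could be scripted. The following is a minimal sketch under stated assumptions, not part of the automation system: it assumes the input is the parsed JSON returned by the standard Jenkins remote API queue endpoint (`queue/api/json`), whose items carry an `inQueueSince` timestamp in milliseconds. The function name and both thresholds (`max_items`, `max_wait_s`) are hypothetical and would need tuning. It only flags the symptom; restoring the node still requires the admin step described above.

```python
def queue_is_backing_up(queue, now_ms, max_items=10, max_wait_s=1800):
    """Return True if the Jenkins build queue looks like it is filling up.

    queue      -- dict parsed from the Jenkins queue JSON (assumed layout:
                  {"items": [{"inQueueSince": <epoch ms>, ...}, ...]})
    now_ms     -- current time as epoch milliseconds
    max_items  -- hypothetical threshold on the number of queued items
    max_wait_s -- hypothetical threshold on the wait of a single item, in seconds
    """
    items = queue.get("items", [])
    # Too many items waiting at once.
    if len(items) > max_items:
        return True
    # Any single item waiting for too long.
    for item in items:
        if "inQueueSince" in item:
            waited_s = (now_ms - item["inQueueSince"]) / 1000
            if waited_s > max_wait_s:
                return True
    return False
```

In practice the `queue` dictionary would come from an authenticated request to the Jenkins instance, and `now_ms` from `int(time.time() * 1000)`.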
Trigger jobs
At the moment the validation of the new transparency corrections for HLT and L1 is handled separately within the CMSSW Jenkins instance. The jobs there are started by pushing changes to a dedicated GitHub repository [ADD LINK]. The corrections are computed every fill and deployed during the interfill both at L1 and HLT. More details on the impact of the transparency corrections at HLT
In order to trigger the jobs a "scheduler" is needed. Three schedulers are currently running in the ECAL automation Jenkins instance:
The scheduler jobs (builds in Jenkins jargon) run on lxplus. Jenkins is configured to connect to lxplus using the ecaltrg service account in order to access the common afs area where the scheduler code is installed (/afs/cern.ch/work/e/ecaltrg/run3/).
The connections are set up on two identical (barring the name) nodes: lxplus-trg and lxplus-trg-backup. As the name suggests, the latter is a backup in case the main node is not available.
All three schedulers are configured to try to run jobs on both nodes: lxplus-trg-backup is called into action whenever lxplus-trg is not available (offline, crashing, etc.).
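The fallback order described above can be illustrated with a short sketch. This only mirrors the selection logic in the text (main node first, backup second); it is not how the Jenkins instance is actually configured, and the function name and the `online` mapping are hypothetical.

```python
# Preference order taken from the text: main node first, then the backup.
PREFERRED_NODES = ["lxplus-trg", "lxplus-trg-backup"]


def pick_node(online, preferred=PREFERRED_NODES):
    """Return the first preferred node that is online, or None if none is.

    online -- hypothetical mapping of node name to True (online) / False (offline)
    """
    for name in preferred:
        if online.get(name, False):
            return name
    return None
```

For example, `pick_node({"lxplus-trg": False, "lxplus-trg-backup": True})` selects the backup node, which corresponds to the backup configuration shown on the nodes page.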
The node status can be checked either on the Jenkins main dashboard (scrolling down along the left side of the page) or by directly accessing the nodes page. The nodes page should look something like this:
Jenkins nodes page: normal configuration with lxplus-trg online and lxplus-trg-backup offline (small red cross)
Whenever lxplus-trg goes offline, the same page will show the node as offline. If the backup node has gone into action, the page should look like this:
Jenkins nodes page: backup configuration with lxplus-trg offline and lxplus-trg-backup online
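The two configurations shown on the nodes page can also be told apart programmatically. The sketch below assumes the input is the parsed JSON of the standard Jenkins remote API node listing (`computer/api/json`), where each entry of the `computer` array has a `displayName` and an `offline` flag. The function name and the returned labels are hypothetical.

```python
def node_configuration(computers, main="lxplus-trg", backup="lxplus-trg-backup"):
    """Classify the node setup as 'normal', 'backup', 'both online' or 'both offline'.

    computers -- dict parsed from the Jenkins node listing (assumed layout:
                 {"computer": [{"displayName": str, "offline": bool}, ...]})
    """
    # A node missing from the listing is treated as offline.
    offline = {
        c.get("displayName"): c.get("offline", True)
        for c in computers.get("computer", [])
    }
    main_up = not offline.get(main, True)
    backup_up = not offline.get(backup, True)

    if main_up and not backup_up:
        return "normal"
    if backup_up and not main_up:
        return "backup"
    if main_up and backup_up:
        return "both online"
    return "both offline"
```

A result of `"backup"` corresponds to the situation in which the normal operating mode should be restored by hand.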
A notification is also sent (by each job) to the ETT shift chat. The notification looks like this:
Warning
All operations described below require special permissions. If the "launch agent" and "disconnect" buttons mentioned below are not visible, your account does not have the ecaltrg role within Jenkins. Ask to be added to the cms-ecal-trigger-team e-group.
If this happens, it means that one or more jobs might have failed and that the backup node successfully went into action to resume job execution. The lxplus-trg node is configured to handle the job load better than the backup, so it is advisable to restore the "normal" operating mode by bringing lxplus-trg online and disconnecting lxplus-trg-backup.
To do this, from the nodes page first click on lxplus-trg and then click on the "launch agent" button:
You will be redirected to the connection page, where you will see logs from the attempted connection to lxplus. After a few seconds the logs will stop, and you should be able to read "Agent successfully connected and online" at the bottom. In case of problems, the connection will get stuck before displaying this message, and intervention from the experts is required.
After connecting the lxplus-trg node, go back to the main nodes page and disconnect the lxplus-trg-backup node by opening its own page and clicking on the "Disconnect" button on the left panel:
You will be asked to enter a message for the disconnection reason. Please enter something meaningful (most of the time a simple "resuming standard operation" is enough).
If the node status is restored correctly, each job will send a notification acknowledging the successful switch back to the lxplus-trg node: