Scalable Job Tracking — HTCondor Manual 24.8.1 documentation


The Python bindings provide two scalable mechanisms for tracking jobs: poll-based tracking and event-based tracking.

Both poll- and event-based tracking have their strengths and weaknesses; the intrepid user can even combine both methodologies to have extremely reliable, low-latency job status tracking.

In this module, we outline the important design considerations behind each approach and walk through examples.

Poll-based Tracking

Poll-based tracking involves periodically querying the schedd(s) for jobs of interest. We have covered the technical aspects of querying the Schedd in prior tutorials. Besides the technical means of polling, the important aspects to consider are how often the poll should be performed and how much data should be retrieved.

Note: When Schedd.query is used, the query will cause the schedd to fork up to SCHEDD_QUERY_WORKERS simultaneous workers. Beyond that point, queries will be handled in a non-blocking manner inside the main condor_schedd process. Thus, the memory used by many concurrent queries can be reduced by decreasing SCHEDD_QUERY_WORKERS.

A job tracking system should not query the Schedd more than once a minute. To minimize the data returned, request only the attributes you need with a projection; to minimize the number of jobs returned, use a query constraint. Better yet, use the AutoCluster flag to have Schedd.query return a list of job summaries instead of individual jobs.
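The guidance above can be sketched as a small throttled poller. This is a minimal illustration, not part of the bindings: the `PollTracker` class, its `clock` parameter, and the `MIN_POLL_INTERVAL` constant are hypothetical names introduced here. It assumes only that the object passed as `schedd` has a `query(constraint=..., projection=...)` method, as `htcondor.Schedd` does, and that each returned ad can be indexed by attribute name.

```python
import time
from collections import Counter

# Hypothetical constant: the text recommends polling at most once a minute.
MIN_POLL_INTERVAL = 60.0

# Standard HTCondor JobStatus codes.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


class PollTracker:
    """Hypothetical helper: caches a status summary and re-queries the
    schedd only when at least min_interval seconds have passed."""

    def __init__(self, schedd, constraint,
                 min_interval=MIN_POLL_INTERVAL, clock=time.monotonic):
        self.schedd = schedd          # assumed to expose query(), like htcondor.Schedd
        self.constraint = constraint  # limits which jobs are returned
        self.min_interval = min_interval
        self.clock = clock            # injectable so the throttle can be tested
        self._last_poll = None
        self._summary = Counter()

    def summary(self):
        now = self.clock()
        due = (self._last_poll is None
               or now - self._last_poll >= self.min_interval)
        if due:
            # The projection keeps each returned ad small.
            ads = self.schedd.query(
                constraint=self.constraint,
                projection=["ClusterId", "ProcId", "JobStatus"],
            )
            self._summary = Counter(
                JOB_STATUS_NAMES.get(ad["JobStatus"], "Unknown") for ad in ads
            )
            self._last_poll = now
        return self._summary
```

With the real bindings, `schedd` would be an `htcondor.Schedd()` instance and the constraint something like `'Owner == "alice"'` (the owner name is only an example). Calling `summary()` more often than once a minute then returns the cached counts instead of issuing a new query.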

Advantages:

Disadvantages:

Event-based Tracking

Each job in the Schedd can specify the UserLog attribute; the Schedd will atomically append a machine-parseable event to the specified file for every state transition the job goes through. By keeping track of the events in the logs, we can build an in-memory representation of the job queue state.

Advantages:

Disadvantages:

At a technical level, event tracking is implemented with the htcondor.JobEventLog class.

    import htcondor

    jel = htcondor.JobEventLog("/tmp/job_one.log")
    # stop_after=0: never wait for new events
    for event in jel.events(stop_after=0):
        print(event)

The return value of JobEventLog.events() is an iterator over htcondor.JobEvent objects. The example above does not block, because stop_after=0 tells the iterator never to wait for new events.
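To illustrate how an in-memory view of the queue might be built from such events, here is a minimal sketch. The `apply_events` function and its two lookup tables are hypothetical names introduced for this example, and the mapping from event types to states is a simplification. It assumes each event exposes `cluster`, `proc`, and `type` attributes, as `htcondor.JobEvent` does, and that the event type has a `name` attribute such as `"SUBMIT"` or `"EXECUTE"`.

```python
# Hypothetical, simplified mapping from event type names to queue states.
STATE_FOR_EVENT = {
    "SUBMIT": "idle",
    "EXECUTE": "running",
    "JOB_EVICTED": "idle",
    "JOB_HELD": "held",
    "JOB_RELEASED": "idle",
}

# Events after which the job is treated as having left the queue.
LEAVES_QUEUE = {"JOB_TERMINATED", "JOB_ABORTED"}


def apply_events(events, state=None):
    """Fold an iterable of job events into a {(cluster, proc): state} dict.

    Pass the previous dict as `state` to update it incrementally.
    """
    state = {} if state is None else state
    for event in events:
        job_id = (event.cluster, event.proc)
        name = event.type.name  # assumption: the event type exposes .name
        if name in LEAVES_QUEUE:
            state.pop(job_id, None)
        elif name in STATE_FOR_EVENT:
            state[job_id] = STATE_FOR_EVENT[name]
        # Any other event type leaves the tracked state unchanged.
    return state
```

With the real bindings, this could be called as `apply_events(jel.events(stop_after=0), state)` on each pass, so that the dictionary is updated with whatever events have been appended to the log since the last read.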