JENKINS-45295 Swarm Client should update labels on the fly when labelsFile changes by basil · Pull Request #110 · jenkinsci/swarm-plugin (original) (raw)

Problem

In JENKINS-45295 and JENKINS-48247, users have reported that when the contents of the file specified by -labelsFile are changed, the client is blindly restarted, interrupting any running jobs. Needless to say, this is disruptive.

Evaluation

LabelFileWatcher has two modes of operation: "soft" updates (which attempt to modify the labels in place) and "hard" updates (which restart the client, interrupting any running jobs). A soft update is first tried, followed by a hard update if the soft update is unsuccessful. I verified experimentally that after changing the labels file, a soft update was attempted and failed. Then the hard update was done, which restarted the client.

Why did the soft update fail? I stepped through the failure and saw that the soft update process made a call to the backend, but the backend returned a 404, claiming that a node by that name could not be found. Looking into this further, I saw that the node created on the backend had a hash appended to it, while the node name being provided by the client did not have this hash. This provided the root cause: we were setting up the label watcher before the node had been named by the server. The requested name was not actually used by the server, but the label watcher thread was still trying to use it when communicating with the server.

Incidentally, the hash is only appended to the node name when -disableClientsUniqueId is not provided, and I was able to work around the issue by passing in -disableClientsUniqueId.

Solution

Create the node on the server before creating the label watcher thread, and use the name returned by the server when setting up the label watcher thread.

Implementation

The implementation was pretty straightforward. I had to wean our test framework off -disableClientsUniqueId in order to be able to test this. I started by removing -disableClientsUniqueId from TestUtils, which entailed writing a custom getComputer function to iterate through nodes with a possible hash in the name. That having been done, we were no longer using -disableClientsUniqueId in PipelineTest, which needed it for tests that restart the agent. I added it back to just those tests. That having been done, I removed disableClientsUniqueId from the label tests.