[SDXL ControlNet Training] Follow-up fixes by sayakpaul · Pull Request #4188 · huggingface/diffusers

Multiple runs of the script might have the same args. Is that OK? I'm not sure; I haven't thought it through well enough.

If that is the case, we would want to skip executing the map function and load from the cache instead, no? Or would that have undesired consequences?

In any case, coming to your suggestion:

> Ideally we can get the PID of the parent process if the parent process is accelerate and hash that. If the parent process is not accelerate, we don't have to pass any additional fingerprint.

Are you thinking of something like:

with accelerator.main_process_first():
    from datasets.fingerprint import Hasher

    # fingerprint used by the cache for the other processes to load the result
    # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
    if accelerator.is_main_process:
        pid = os.getpid()
    new_fingerprint = Hasher.hash(pid)
    train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
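
One caveat with the snippet above: `pid` is assigned only on the main process, so the other ranks would raise a `NameError` at `Hasher.hash(pid)`. Even if every rank called `os.getpid()`, each would get a different value and therefore a different fingerprint. A minimal sketch of the parent-PID idea follows. It assumes every rank is a direct child of the same launcher process, which I have not verified for every launch mode. `launcher_fingerprint` and its `base` argument are hypothetical names, and `hashlib` stands in for `Hasher` only to keep the example self-contained.

```python
import hashlib
import os


def launcher_fingerprint(base: str = "compute_embeddings") -> str:
    """Hypothetical helper: build a fingerprint shared by sibling processes.

    Assumption: all ranks are direct children of one launcher process, so
    os.getppid() returns the same value on each of them.
    """
    payload = f"{base}:{os.getppid()}".encode("utf-8")
    # Keep the first 16 hex characters so the fingerprint stays short.
    return hashlib.sha256(payload).hexdigest()[:16]


# Every rank computes this independently; no value has to be passed between ranks.
fingerprint = launcher_fingerprint()
```

The resulting string could then be passed as `new_fingerprint=` to `train_dataset.map(...)` in place of `Hasher.hash(pid)`. A fresh launch would typically get a different parent PID, so the map would be recomputed rather than loaded from an earlier run's cache. The sketch does not check whether the parent process is actually accelerate.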