How to debug cloud-init - cloud-init 25.1.2 documentation
There are several cloud-init failure modes that one may need to debug. Debugging is specific to the scenario, but the starting points are often similar:
- I cannot log in
- Cloud-init did not run
- Cloud-init did the unexpected
- Cloud-init never finished running
I can’t log in to my instance
One of the more challenging scenarios to debug is when you don’t have shell access to your instance. You have a few options:
- Acquire log messages from the serial console and check for any errors.
- To access instances without SSH available, create a user with password access (using the user-data) and log in via the cloud serial port console. This only works if `cc_users_groups` successfully ran.
- Try running the same user-data locally, such as in one of the tutorials. Use LXD or QEMU locally to get a shell or logs, then debug with these steps.
- Try copying the image to your local system, mounting its filesystem, and inspecting the image logs for clues.
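When working from a copy of the image, the logs live under the mounted root filesystem (for example `/mnt/var/log/cloud-init.log`; the mount point is an assumption). Because `cloud-init.log` is typically appended to on every boot, an image that has booted more than once can mix several runs. The sketch below keeps only the most recent run. `last_boot_log` is a hypothetical helper, not a cloud-init tool, and it assumes each boot starts with cloud-init's version banner for the `init-local` stage (a line containing `running 'init-local'`); pass a different marker if your log differs.

```shell
#!/bin/sh
# last_boot_log FILE [MARKER]
# Hypothetical helper: print the part of a cloud-init.log written during
# the most recent boot. Assumes every boot begins with a banner line that
# contains MARKER (default: "running 'init-local'"). If no banner is found,
# the whole file is printed.
last_boot_log() {
  marker="running 'init-local'"
  if [ -n "$2" ]; then
    marker=$2
  fi
  awk -v marker="$marker" '
    index($0, marker) { buf = "" }   # a new boot starts: forget earlier lines
    { buf = buf $0 "\n" }            # collect lines of the current boot
    END { printf "%s", buf }
  ' "$1"
}
```

For example, `last_boot_log /mnt/var/log/cloud-init.log | grep -n WARNING` limits the search to warnings from the last boot recorded in the image.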
Cloud-init did not run
- Check the output of `cloud-init status --long`:
  - What is the value of the `'extended_status'` key?
  - What is the value of the `'boot_status_code'` key?

  See our reported status explanation for more information on the status.
- Check the contents of `/run/cloud-init/ds-identify.log`. This log is written when cloud-init detects the platform it is running on, the stage that enables or disables cloud-init.
- Check the status of the services:

      systemctl status cloud-init-local.service cloud-init-network.service \
          cloud-config.service cloud-final.service

  Cloud-init may have started to run but not completed. This output shows which cloud-init stages completed, and how many.
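The checks above can be gathered with a few shell helpers. This is a minimal sketch, not part of cloud-init: `status_value`, `explain_boot_code` and `stage_states` are hypothetical names. The parsing assumes that the tabular `cloud-init status --long` output prints each top-level field as `key: value` on one line, and the classification assumes `boot_status_code` values begin with `enabled-` or `disabled-`.

```shell
#!/bin/sh
# status_value KEY
# Print the value of KEY (e.g. boot_status_code) from `cloud-init status --long`
# output read on stdin. Assumes one "key: value" line per top-level field.
status_value() {
  awk -v key="$1: " 'index($0, key) == 1 { print substr($0, length(key) + 1); exit }'
}

# explain_boot_code CODE
# Classify a boot_status_code by its prefix (assumed naming scheme).
explain_boot_code() {
  case "$1" in
    disabled-*) echo "cloud-init was disabled ($1)" ;;
    enabled-*)  echo "cloud-init was enabled ($1)" ;;
    *)          echo "boot status undetermined ($1)" ;;
  esac
}

# stage_states
# Print "unit: state" for each cloud-init service using `systemctl is-active`.
# SYSTEMCTL defaults to systemctl and can point at a stub for testing.
stage_states() {
  for unit in cloud-init-local.service cloud-init-network.service \
              cloud-config.service cloud-final.service; do
    # is-active exits non-zero for inactive/failed units; keep its output anyway
    state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || true
    echo "$unit: ${state:-unknown}"
  done
}
```

A typical session might run `explain_boot_code "$(cloud-init status --long | status_value boot_status_code)"` followed by `stage_states`; a `failed` or `inactive` stage points at where the boot stopped.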
Cloud-init ran, but didn’t do what I want it to
- If you are using cloud-init’s user-data cloud config, make sure to validate your user-data cloud config.
- Check for errors in `cloud-init status --long`:
  - What is the value of the `'errors'` key?
  - What is the value of the `'recoverable_errors'` key?

  See our guide on exported errors for more information on these exported errors.
- For more context on errors, check the log files:
  - `/var/log/cloud-init.log`
  - `/var/log/cloud-init-output.log`

  Identify errors in the logs and the lines preceding these errors.

  Ask yourself:

  - According to the log files, what went wrong?
  - How does the cloud-init error relate to the configuration provided to this instance?
  - What does the documentation say about the parts of the configuration that relate to this error? Did a configuration module fail?
  - What failure state is cloud-init in?
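Finding errors together with "the lines preceding these errors" is a job for grep's before-context option (`-B`). A minimal sketch, assuming cloud-init's default log format in which the level appears in brackets, such as `util.py[WARNING]:`; `log_errors_with_context` is a hypothetical helper name.

```shell
#!/bin/sh
# log_errors_with_context FILE [N]
# Hypothetical helper: print every WARNING, ERROR or CRITICAL line and every
# Python traceback header in FILE, each preceded by N lines of context
# (default 5). grep marks matches as "LINE:", context lines as "LINE-",
# and separates non-adjacent groups with "--".
# Assumes cloud-init's default log format, e.g. "util.py[WARNING]: ...".
log_errors_with_context() {
  grep -n -B "${2:-5}" -E '\[(WARNING|ERROR|CRITICAL)\]|^Traceback' "$1"
}
```

For example, `log_errors_with_context /var/log/cloud-init.log 10` shows ten lines of context before each match; the same call works on either log file listed above.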
Cloud-init never finished running
There are many reasons why cloud-init may fail to complete. Some are internal to cloud-init; in other cases, the failure to complete is a symptom of a failure in another component of the system, or the result of a user configuration.
External reasons
- Other services failed or are stuck.
- Bugs in the kernel or drivers.
- Bugs in external userspace tools that are called by cloud-init.
Internal reasons
- A command in `bootcmd` or `runcmd` that never completes (for example, running `cloud-init status --wait` from one of these modules will deadlock).
- Configurations that disable timeouts or set extremely high timeout values.
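One known deadlock from the list above, calling `cloud-init status --wait` from `bootcmd` or `runcmd`, can be caught before boot with a plain text search over the user-data file. A rough sketch: `find_status_wait` is a hypothetical helper, and because it does not parse YAML it may also flag matches outside `bootcmd`/`runcmd` (for example in comments or `write_files` content).

```shell
#!/bin/sh
# find_status_wait USER_DATA_FILE
# Hypothetical lint: print, with line numbers, lines that appear to run
# `cloud-init status` with --wait or -w. Run from bootcmd or runcmd, such a
# command waits for cloud-init to finish while cloud-init waits for the
# command, so neither completes.
# Text search only: it also matches the list form ([cloud-init, status, --wait]).
find_status_wait() {
  grep -n -E 'cloud-init[^[:alnum:]]+status.*(--wait|[^[:alnum:]-]-w([^[:alnum:]]|$))' "$1"
}
```

Running `find_status_wait user-data.yaml` before launching an instance prints any suspicious lines; no output means nothing matched.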
To start debugging
- Check `dmesg` for errors:

      dmesg -T | grep -i -e warning -e error -e fatal -e exception

- Investigate other systemd services that failed.
- Check the output of `cloud-init status --long`:
  - What is the value of the `'extended_status'` key?
  - What is the value of the `'boot_status_code'` key?

  See our reported status explanation for more information on the status.
- Inspect the boot stage of running services:

      $ systemctl list-jobs --after
      JOB UNIT                    TYPE  STATE
      150 cloud-final.service     start waiting
      └─      waiting for job 147 (cloud-init.target/start)   -     -
      155 blocking-daemon.service start running
      └─      waiting for job 150 (cloud-final.service/start) -     -
      147 cloud-init.target       start waiting
      3 jobs listed.

  In the above example, we can see that `cloud-final.service` is waiting and is ordered before `cloud-init.target`, and that `blocking-daemon.service` is currently running and is ordered before `cloud-final.service`. From this output, we deduce that cloud-init is not complete because `blocking-daemon.service` hasn't yet completed, and that we should investigate `blocking-daemon.service` to understand why it is still running.
- Use the PID of the running service to find all of its running subprocesses. Any running process that was spawned by cloud-init may be blocking cloud-init from continuing.

  Ask yourself:

  - Which process is still running?
  - Why is this process still running?
  - How does this process relate to the configuration that I provided?
- For more context on errors, check the log files:
  - `/var/log/cloud-init.log`
  - `/var/log/cloud-init-output.log`

  Identify errors in the logs and the lines preceding these errors.

  Ask yourself:

  - According to the log files, what went wrong?
  - How does the cloud-init error relate to the configuration provided to this instance?
  - What does the documentation say about the parts of the configuration that relate to this error?
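The job and process inspection described above can be partly scripted. This is a minimal sketch with hypothetical helper names (`running_units`, `descendants`). It assumes the `JOB UNIT TYPE STATE` column layout shown in the `systemctl list-jobs --after` example and a process listing in the POSIX `ps -eo pid=,ppid=,args=` format.

```shell
#!/bin/sh
# running_units
# Read `systemctl list-jobs --after` output on stdin and print the unit of
# every job whose STATE column (4th field) is "running". Header, footer and
# "└─ waiting for job ..." lines do not match.
running_units() {
  awk '$4 == "running" { print $2 }'
}

# descendants PID
# Read a `ps -eo pid=,ppid=,args=` listing on stdin and print every
# descendant of PID as "pid args", depth first.
descendants() {
  awk -v root="$1" '
    function walk(p,    n, kids, i) {
      n = split(children[p], kids, " ")
      for (i = 1; i <= n; i++) {
        print kids[i], cmd[kids[i]]
        walk(kids[i])                 # recurse into this child
      }
    }
    {
      pid = $1; ppid = $2
      $1 = ""; $2 = ""
      sub(/^ +/, "")                  # drop separators left by the cleared fields
      cmd[pid] = $0
      children[ppid] = children[ppid] " " pid
    }
    END { walk(root) }
  '
}
```

For the example output above, `systemctl list-jobs --after | running_units` prints `blocking-daemon.service`. Then `ps -eo pid=,ppid=,args= | descendants "$(systemctl show --property=MainPID --value blocking-daemon.service)"` lists that service's child processes. If `MainPID` is `0`, systemd tracks no main process for the unit; `systemctl status blocking-daemon.service` still shows the processes in its control group.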