moonshine-ai/pi-help-bot: Voice assistant for Raspberry Pi setup and Q&A

Local AI voice agent to handle headless wifi setup.

Introduction

Moonshine Voice is an open source framework for building AI voice agents that run locally on edge devices, like the Raspberry Pi. This project shows how you can use the library to connect to a new wifi network without requiring a display, mouse, or keyboard, by talking to an agent. You can see a five-minute run-through in the video below:

Pi Help Bot demo video

Try it yourself

The simplest way to test it for yourself is to download my customized Raspberry Pi OS image with everything pre-installed, flash it to an SD card, boot up your own Pi 5, and plug in your own headset or earbuds. This image runs the AI agent script on startup and listens for new audio devices being plugged in, so you should hear the agent greet you after boot completes (usually a minute or so). If you run into any problems, check out the troubleshooting guide.

The supported commands are:

Hardware

You'll need a Raspberry Pi 5 and a pair of headphones or earbuds. The framework does run on older Pis, but the 5 gives the lowest latency.

Installing

To set up the agent on a fresh OS image, you first need to install some system libraries and tools:

sudo apt-get update
sudo apt-get install -y portaudio19-dev pipewire-alsa git git-lfs
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Then clone this repository:

git clone https://github.com/moonshine-ai/pi-help-bot
cd pi-help-bot

Install all the Python dependencies:

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Running

The pi-help-bot.py script implements the voice agent. For testing you can run it using Python from your terminal:

You should see some log messages, and then a spoken greeting from the agent.

For practical use you'll want the script to run on startup, which you can set up using the first-run.sh script:

Creating an image

If you want to flash your own SD cards with the voice agent pre-installed, here are the steps I use:

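# Clear package caches, shell history, and SSH host keys so they aren't baked into the image
sudo apt clean
rm -rf ~/.bash_history
sudo rm /etc/ssh/ssh_host_*

# Zero out free space to help compression
sudo dd if=/dev/zero of=/zero bs=4M
sudo rm /zero

# Copy the SD card to an image file, then shrink and compress it with PiShrink
sudo dd if=/dev/mmcblk0 of=/media/pi/USB322FD/pi-help-bot-0.0.1.img bs=4M status=progress
sudo ./pishrink.sh -z /media/pi/USB322FD/pi-help-bot-0.0.1.img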

Troubleshooting

You'll need some way of accessing a terminal to debug issues, either a monitor, keyboard, and mouse, or ssh and a wifi connection.

If you're plugging in an audio device like headphones and you're not hearing anything, try running the speaker-test tool:

This should output a continuous stream of noise to the default speaker. If you still don't hear anything, then your audio device isn't being recognized at the system level.

Sometimes an HDMI connection shows up as an audio output device but isn't actually audible, depending on your setup, so it's worth unplugging any displays to see if that changes the results.

If you are hearing something, but it's very quiet, then you can try the alsamixer tool to boost the system volume:

You'll see an ASCII-based UI. Press the up arrow key to increase the volume, and then the escape key when you're done.

If you want to check on the system service that runs on boot, you can see if it's running with:

systemctl status pi-help-bot.service

If you want to see all of the logging from the script, try:

journalctl -u pi-help-bot.service -f

You should see errors in here if files can't be found, as well as information about audio devices. The --log-io argument is set by default, which will show conversations between people and the agent. If you're concerned about recording this information, you can remove that flag.

To temporarily stop the agent system service running in the current session:

sudo systemctl stop pi-help-bot.service

Implementation guide

This agent is designed as an example you can use as a reference and extend for your own projects. The source code is in pi-help-bot.py and you can find more information about Moonshine Voice and the DialogFlow specification we use at github.com/moonshine-ai/moonshine#getting-started-with-a-conversational-agent.

In summary, we register phrases to listen for (like "Set up wifi"); the flow controller listens for them and calls the Python function registered for whichever phrase an utterance matches. The matching is fuzzy and based on semantic meaning, so different ways of saying the same thing (like "Can you set up your wifi" or "Please do wifi setup") will trigger the correct function.
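To make the register-then-dispatch idea concrete, here is a minimal sketch. It is not Moonshine's implementation: `ToyFlowRegistry`, `match`, and the stub handlers are hypothetical names, and plain character similarity from Python's `difflib` stands in for real semantic matching.

```python
import difflib
import string


def _normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


class ToyFlowRegistry:
    """Hypothetical registry: maps trigger phrases to handler functions."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._flows = {}

    def register_flow(self, phrase, handler):
        self._flows[_normalize(phrase)] = handler

    def match(self, utterance):
        """Return the handler whose phrase is most similar, or None."""
        heard = _normalize(utterance)
        best_handler, best_score = None, 0.0
        for phrase, handler in self._flows.items():
            # Stand-in for semantic similarity: character-level ratio.
            score = difflib.SequenceMatcher(None, heard, phrase).ratio()
            if score > best_score:
                best_handler, best_score = handler, score
        return best_handler if best_score >= self.threshold else None


def connect_to_wifi_stub():
    return "starting wifi setup"


def wifi_status_stub():
    return "reporting wifi status"


registry = ToyFlowRegistry()
registry.register_flow("Set up wifi", connect_to_wifi_stub)
registry.register_flow("Wifi status", wifi_status_stub)

handler = registry.match("Please set up the wifi!")
print(handler() if handler else "no match")
```

Character similarity only copes with wordings that stay close to the registered phrase. Reordered requests such as "Please do wifi setup" are where it becomes unreliable, which is the gap that matching on meaning is there to close.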

Each function then implements a conversational flow. This can be as simple as just responding with information:

def report_wifi_status(d: Dialog):
    ssid = _current_wifi_ssid()
    if ssid:
        yield d.say(f"Wi-Fi is connected to {ssid}.")
    else:
        yield d.say("Wi-Fi is not connected on this Raspberry Pi.")

You can also use it to handle complex, multi-step interactions with the user:

def connect_to_wifi(d: Dialog):
    input_ssid = yield d.ask("What's the name of your Wi-Fi network? Say list if you want to pick from a list or spell if you want to spell out the start of the name")
    input_ssid = input_ssid.strip()

    networks = _scan_wifi_networks()

    if input_ssid.lower().strip(string.punctuation) == "list":
        yield d.say("Say yes to the network you want to connect to.")
        for network in networks:
            if (yield d.confirm(f"{network}?")):
                input_ssid = network
                break
    elif input_ssid.lower().strip(string.punctuation) == "spell":
        input_ssid = yield d.ask("Spell out the start of the network name.", mode=SPELLED)

    found_ssid = fuzzy_match_network(input_ssid, networks)
    if found_ssid is None:
        yield d.say(f"Sorry, I couldn't find a matching network for {input_ssid}.")
        return

    password = yield d.ask(
        f"Please spell the Wi-Fi password for {found_ssid} one character at a time, and say done when finished.",
        mode=SPELLED,
    )

    yield d.say(f"Connecting to {found_ssid}.")

    try:
        result = subprocess.run(
            ["sudo", "nmcli", "device", "wifi",
                "connect", found_ssid, "password", password],
            capture_output=True, text=True, timeout=30,
        )
    except FileNotFoundError:
        yield d.say("Sorry, network manager was not found on this system.")
        return
    except subprocess.TimeoutExpired:
        yield d.say("Sorry, the connection attempt timed out.")
        return

    if result.returncode == 0:
        yield d.say(f"Connected to {found_ssid}.")
    else:
        print(f"[ERROR] nmcli stderr: {result.stderr}", file=sys.stderr)
        yield d.say(
            f"Sorry, I wasn't able to connect to {found_ssid}. "
            "Please check the network name and password and try again."
        )

dialog_flow.register_flow("Connect to Wi-Fi", connect_to_wifi)

The first thing this function does is ask the user for the name of the network they want to join, through the call:

input_ssid = yield d.ask("What's the name of your Wi-Fi network?...")

The Dialog class lets you ask users questions and returns a string containing what they said in response. The only unusual feature here, compared to regular Python code, is the yield keyword. Because the user may take some time to respond, we use yield to hand control back to the main script until their response has been received. This is a general pattern for DialogFlow, and you'll see it wherever we're waiting for the user to say something, so that the wait doesn't block the rest of the program.
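It can help to look at the other side of the conversation to see why yield avoids blocking. The sketch below is a toy driver, not Moonshine's implementation: `ToyDialog`, `run_flow`, `greet`, and the tuple-shaped requests are all hypothetical, and canned replies stand in for speech recognition.

```python
class ToyDialog:
    """Hypothetical stand-in for Dialog: builds request tuples instead of speaking."""

    def say(self, text):
        return ("say", text)

    def ask(self, text):
        return ("ask", text)


def greet(d):
    # The flow pauses at each yield; the reply arrives as the yield's value.
    name = yield d.ask("What's your name?")
    yield d.say(f"Nice to meet you, {name.strip()}.")


def run_flow(flow, replies):
    """Drive a generator flow, feeding one canned reply per question."""
    replies = iter(replies)
    transcript = []
    gen = flow(ToyDialog())
    to_send = None  # the first send() must be None to start the generator
    try:
        while True:
            kind, text = gen.send(to_send)
            transcript.append(text)
            # Only questions get an answer sent back into the flow.
            to_send = next(replies, "") if kind == "ask" else None
    except StopIteration:
        pass  # the flow function returned
    return transcript


print(run_flow(greet, [" Ada "]))
```

Between two `send()` calls the flow is simply suspended, so a real driver is free to return to its audio loop and resume the flow only once a reply has been transcribed.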

    if input_ssid.lower().strip(string.punctuation) == "list":
        yield d.say("Say yes to the network you want to connect to.")
        for network in networks:
            if (yield d.confirm(f"{network}?")):
                input_ssid = network
                break

Our example application supports a few different input methods: running through a list of networks, spelling out the first few letters, or saying the name. Here we implement the list approach by looping through all the available networks and asking the user whether each is the one they want. Regular loops and conditional statements work as you'd expect in Python.

For each network, we call confirm(), which asks a question and then waits for a positive or negative result. Like all matching in the system this is done semantically, so "okay", "affirmative", and "go ahead" will work as well as a straightforward "yes".
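Whichever input method is used, the flow then resolves what it heard against the scanned networks with `fuzzy_match_network`, which isn't reproduced in this guide. The sketch below is one plausible shape for such a helper, under stated assumptions: it tries a prefix match first (for a spelled-out start of a name) and then falls back to string similarity via `difflib.get_close_matches`. The name `toy_match_network` and the cutoff value are hypothetical, and the real helper in pi-help-bot.py may work differently.

```python
import difflib


def toy_match_network(heard, networks, cutoff=0.6):
    """Return the scanned SSID that best matches what was heard, or None."""
    heard_key = heard.strip().lower()
    if not heard_key:
        return None

    # A spelled-out prefix such as "moon" should pick "MoonshineGuest".
    for ssid in networks:
        if ssid.lower().startswith(heard_key):
            return ssid

    # Otherwise fall back to overall string similarity (assumed cutoff).
    by_key = {ssid.lower(): ssid for ssid in networks}
    close = difflib.get_close_matches(heard_key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


networks = ["MoonshineGuest", "HomeNet-5G", "CoffeeShop"]
print(toy_match_network("moon", networks))
print(toy_match_network("coffee shop", networks))
```

Returning `None` when nothing is close enough matches how the flow uses the helper: it apologizes and stops rather than guessing a network.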

    password = yield d.ask(
        f"Please spell the Wi-Fi password for {found_ssid} one character at a time, and say done when finished.",
        mode=SPELLED,
    )

Passwords are tricky to enter by voice: they consist of arbitrary letters, digits, and symbols, so the user has to spell them out. Moonshine supports this through the mode=SPELLED argument. This asks the user to spell out each character, and uses a fine-tuned model to recognize what the user is saying for each. As well as supporting regular utterances like "aitch" or "capital zee", it also supports the NATO alphabet ("alpha", "bravo", etc.) and even short descriptive phrases like "E as in elephant". It repeats back what it heard, and lets you delete mistakes.
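The recognition itself is done by a model, but the mapping from spoken forms to characters can be pictured with a toy lookup table. The sketch below is purely illustrative and is not how Moonshine decodes spelled input: `decode_spelled`, the word tables, and the handling of "done" are assumptions, and the read-back and delete features are left out.

```python
# Hypothetical lookup tables covering a few of the spoken forms mentioned above.
NATO_WORDS = (
    "alpha bravo charlie delta echo foxtrot golf hotel india juliett kilo "
    "lima mike november oscar papa quebec romeo sierra tango uniform "
    "victor whiskey xray yankee zulu"
).split()
SPOKEN = {word: word[0] for word in NATO_WORDS}
SPOKEN.update({"aitch": "h", "zee": "z", "zed": "z"})
SPOKEN.update({
    name: str(digit)
    for digit, name in enumerate(
        "zero one two three four five six seven eight nine".split()
    )
})


def decode_spelled(utterances):
    """Turn per-character utterances into a string, stopping at 'done'."""
    chars = []
    for utterance in utterances:
        words = utterance.lower().split()
        if not words:
            continue
        if words == ["done"]:
            break
        capital = words[0] == "capital"
        if capital:
            words = words[1:]
        # "e as in elephant" -> keep only the letter before "as in".
        if len(words) == 4 and words[1:3] == ["as", "in"]:
            words = words[:1]
        token = " ".join(words)
        # Known spoken form, else a bare single character, else give up.
        char = SPOKEN.get(token, token if len(token) == 1 else None)
        if char is None:
            raise ValueError(f"unrecognized spelled token: {utterance!r}")
        chars.append(char.upper() if capital else char)
    return "".join(chars)


print(decode_spelled(["aitch", "capital zee", "bravo", "E as in elephant", "seven", "done"]))
```

A fixed table like this breaks as soon as the transcription is slightly off, which is presumably why the project relies on a fine-tuned model instead.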

    try:
        result = subprocess.run(
            ["sudo", "nmcli", "device", "wifi",
                "connect", found_ssid, "password", password],
            capture_output=True, text=True, timeout=30,
        )
    except FileNotFoundError:
        yield d.say("Sorry, network manager was not found on this system.")
        return
    except subprocess.TimeoutExpired:
        yield d.say("Sorry, the connection attempt timed out.")
        return

The flow also works with other control structures like exception handlers, so you can specify your conversations using idiomatic code, even for error recovery.

Most of the rest of the code in pi-help-bot.py is for handling hot-plugging of audio devices, since that's fairly complicated and error-prone on Linux. You can safely ignore this unless your project requires handling swapping of audio devices.

License

The code is released under the MIT License.

Acknowledgments

Thanks to all of the open source projects that Moonshine Voice uses, and to the Raspberry Pi team for their help on this project.