short confirmation tones on daemon events so the user gets eyes-free "did it hear me?" feedback without watching the terminal. NOT TTS — short pre-generated .wav beeps.

- audio_out.py — reusable audio-OUT module (the reverse of audio.py's capture, the less-tested WSLg direction). three-tier player: paplay-first (a SEPARATE process, so it doesn't contend with the sounddevice mic stream on the duplex-flaky WSLg bridge), then in-process sounddevice, then powershell.exe SoundPlayer. best-effort per-backend volume. plays a wav path and knows nothing about events — v0.3 TTS reuses it.
- sound.py — Earcons: the single event->tone map (wake/accept/no_match/submit) gated by [sound] config (master enabled + per-event flags). daemon._handle wiring: an injected command plays accept (submit plays submit); no-match / target-missing / unknown-context plays no_match; pure daemon-control commands (list/version/…) play nothing.
- sounds/ — committed earcon wavs + generate.py (regen-only). committed (not generated at install) so the package is self-contained and a missing tone can never appear. packaged via pyproject [tool.setuptools.package-data].
- [sound] config: enabled (master, on), on_wake (OFF by default — bleed/chatty), on_accept/on_no_match/on_submit (on), volume (0-1 best-effort), [sound.files] overrides.
- claudedo test-tone — plays each tone, the audio-OUT gate (mirrors test-audio).
- install.sh now also checks RDPSink (audio-out) alongside RDPSource.

INVARIANT: earcons are fire-and-forget on a worker thread and NEVER block or break the inject path. a missing tone file or dead speaker logs once and is swallowed, never raised — a broken speaker must never stop "claudedo yes" from injecting. de-risks the WSLg audio-OUT path that v0.3 TTS-readback will reuse.

Signed-off-by: disqualifier <dev@disqualifier.me>

# claudedo

Voice control for [Claude Code](https://claude.com/claude-code) on **WSL2**.

`claudedo` listens on your mic, runs **local** speech-to-text, recognizes a wake
phrase plus a small command grammar, and injects the matching keystrokes into your
active Claude Code tmux session via `tmux send-keys`. You answer Claude Code's
prompts ("yes", "option one", "approve") and dictate prompts **by voice** — including
hands-free while another window (a game) is focused.

It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
Claude Code over tmux — fully local and private. You run it in a terminal you watch.

## How it works

```
mic (WSLg/PulseAudio RDPSource)
  -> sounddevice capture
  -> faster-whisper (local STT, on-device)
  -> wake gate: utterance must start with a wake phrase, else DISCARD locally
  -> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/
                    mode/set/target/unset/list/context/reload/system/cancel)
  -> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
  -> tmux send-keys -t <session> "<keys>"
  -> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
```

**Privacy by construction.** STT runs on-device. In listen mode, any speech that
doesn't start with a wake phrase is dropped the instant it's transcribed — never
stored, never sent anywhere. That's what makes always-listening acceptable while
you're on voice comms in a game.

**Injection is PTY-only.** `claudedo` only ever calls `tmux send-keys`. It never uses
OS-level keyboard input and installs no system-wide keyboard hook. Keystrokes are
text into a Linux pseudo-terminal — they work regardless of which window is focused
and never touch Windows input or a game/anticheat's view.

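As a rough illustration of PTY-only injection, the sketch below wraps a single `tmux send-keys` call. The helper names `send_keys_argv` and `inject` are hypothetical and are not taken from claudedo's source; only the `tmux send-keys -t <session>` form and tmux's `-l` (literal) flag are real.

```python
import subprocess


def send_keys_argv(session: str, keys: str, literal: bool = False) -> list[str]:
    """Build the argv for one `tmux send-keys` call (no shell is involved)."""
    argv = ["tmux", "send-keys", "-t", session]
    if literal:
        # -l makes tmux send the string as literal text instead of key names
        argv.append("-l")
    argv.append(keys)
    return argv


def inject(session: str, keys: str, literal: bool = False) -> bool:
    """Run tmux; return False rather than raising if tmux is missing or fails."""
    try:
        result = subprocess.run(
            send_keys_argv(session, keys, literal),
            capture_output=True,
            check=False,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

Passing an argv list rather than a shell string keeps dictated text out of any shell parsing, which seems like the natural choice when the text comes from speech.
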
## Install

```bash
git clone <repo> claudedo && cd claudedo
./install.sh
```

`install.sh` is idempotent. It installs the WSL audio deps, writes the `~/.asoundrc`
Pulse shim, verifies the mic path, pip-installs the package, primes the Whisper
model, and installs the **cc kit** (`~/.config/claudedo/cc.sh`, sourced from every
`~/.zshrc`/`~/.bashrc` you have). It also checks the two Windows-side bits it can't
automate and tells you to fix them:

- **WSLg present** (`/mnt/wslg/PulseServer`). If missing: `wsl --update` in Windows,
  then `wsl --shutdown`, then re-run.
- **Mic permission**: Windows Settings → Privacy & security → Microphone → enable
  *"Let desktop apps access your microphone"*. Required.

Verify the riskiest piece (mic capture) first:

```bash
claudedo test-audio
```

## Usage

**Run it in a terminal you watch — that's the product.** You launch `claudedo start`
and it drops into a visible listen loop (pass `--check` to run a mic check
first). Each utterance prints a timestamped, colored line —
`HH:MM:SS [claude-libs] heard "…" → typed 'fix'` (green for injected, red for drops,
`[SYSTEM]`/`[VOICE]` for state and recognition). That terminal is your
recognition/action console; you attach to the `claude-<name>` session in another pane
to watch the keystrokes land. It runs in the foreground by design — the console is the
point — though `claudedo stop` can signal a stray instance.

```bash
claudedo start              # the visible listen loop (listen mode default; no mic check)
claudedo start --check      # run a mic check before listening
claudedo start --mode ptt   # push-to-talk instead (desk-only — see Modes)
claudedo status             # running? mode? target session?
claudedo stop               # stop a running daemon
claudedo reload             # reload config.toml + contexts.toml in a running daemon
claudedo set <name>         # set the sticky target -> claude-<name> (alias: switch)
claudedo unset              # clear the sticky target
claudedo list               # list running claude-* sessions
claudedo test-audio         # verify the mic capture path
claudedo test-tone          # play each earcon (verify the audio-OUT path)
```

### Modes

- **listen (default)** — continuous capture; only acts on utterances that **start
  with a wake phrase**; all other speech is transcribed locally and discarded
  instantly. This is the hands-free path and works while a game is focused, because
  the trigger is your voice over the mic bridge — not a keyboard hook.
- **ptt** — push-to-talk. **Desk-only:** it captures only while the daemon's own
  terminal window is focused. There is deliberately **no global hotkey** — a
  system-wide keyboard hook is the keylogger/cheat silhouette anticheats watch for,
  and `claudedo` refuses to install one. For hands-free-while-gaming, use listen
  mode. (Terminals don't deliver key-up events, so PTT is press-to-start /
  press-to-stop in the daemon window, not literal hold.)

Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".

## Command grammar

Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**,
**"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has
no token for the coined word "claudedo" and renders it as real words ("claude do"),
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
the wake phrase is optional. When a command's wake phrase matched loosely (e.g. you
said "okay clouds"), the heard line notes which phrase it assumed —
`heard "okay clouds list" -> LIST (wake: okay claude)`.

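To make the wake gate concrete, here is a minimal sketch that checks whether an utterance starts with a wake phrase and returns the remainder. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; claudedo's real matcher may work differently. The function names are hypothetical, while the phrase list and the `0.65` threshold are the defaults described in this README.

```python
from difflib import SequenceMatcher

WAKE_PHRASES = ["claudedo", "claude do", "hey claude", "ok claude", "okay claude"]


def _squash(text: str) -> str:
    """Lowercase and keep only letters/digits (case- and space-insensitive)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def strip_wake(utterance: str, phrases=WAKE_PHRASES, threshold: float = 0.65):
    """Return (matched_phrase, rest) if the utterance starts with a wake phrase.

    Returns None when no prefix is close enough, meaning the utterance
    would be discarded.
    """
    words = utterance.split()
    best = None
    for phrase in phrases:
        # The default phrases are one or two words long, so try both prefixes.
        for n in (1, 2):
            if len(words) < n:
                continue
            head = _squash(" ".join(words[:n]))
            score = SequenceMatcher(None, head, _squash(phrase)).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, phrase, " ".join(words[n:]))
    return None if best is None else (best[1], best[2])
```

With this sketch, `strip_wake("okay clouds list")` resolves to the `okay claude` phrase with `list` as the command text, and ordinary chatter such as `"nice shot"` returns `None`.
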
| Say | Does |
|---|---|
| `yes` / `no` | answer a yes/no prompt |
| `one` / `two` / `three` / `four` | pick numbered option 1–4 |
| `approve` / `deny` | allow / deny a permission prompt |
| `send` / `enter` | submit (Enter) |
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
| `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) |
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
| `mode ptt` / `mode listen` | switch input mode |
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
| `unset` (alias `unsticky`) | clear the sticky target |
| `list` | list running `claude-*` sessions to the daemon console |
| `context <name> <instruction>` (alias `prepare`) | inject a `contexts.toml` blurb as a preamble + the dictated instruction, then **wait** (no submit — say "send") |
| `reload` | re-read `config.toml` + `contexts.toml` live (no daemon restart, model stays loaded) |
| `system status` | print mode / target / model / context count to the console (daemon-control; never injects) |
| `system reload [config\|contexts]` | reload one or both config files |
| `commands` (alias `help`/`menu`) | print the voice-command menu to the console |
| `customs` (alias `custom`) | list the loaded context names |
| `version` | print the claudedo version to the console |
| `cancel` / `escape` | back out of a prompt |

Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)

When no sticky target is set, a bare command does nothing and asks you to `set` one
(the default). Set `auto_target = true` to instead auto-use the single running
`claude-*` session when there's exactly one; with several running it always does
nothing and asks you to `set` one.

Number words are normalized to digits before matching ("one"/"won" → 1).

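The number-word normalization can be pictured as a small token map applied before the grammar match. The sketch below is an assumption about the shape of that step, not claudedo's code: the `NUMBER_WORDS` table and `normalize_numbers` name are hypothetical, and only the "one"/"won" pair and the 1 to 4 range come from this README.

```python
# Hypothetical table: only "one"/"won" -> 1 and options 1-4 are documented above.
NUMBER_WORDS = {"one": "1", "won": "1", "two": "2", "three": "3", "four": "4"}


def normalize_numbers(command: str) -> str:
    """Lowercase the command and replace known number words with digits."""
    tokens = command.lower().split()
    return " ".join(NUMBER_WORDS.get(token, token) for token in tokens)
```

A real implementation would probably leave the dictated phrase after `type` untouched, so that "type one more test" is not rewritten; this sketch does not handle that case.
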
## Targeting

`~/.claude-active` holds the **sticky** target session name (e.g.
`claude-rethink-public`). The **cc kit** writes this file when you attach, and
`claudedo set <name>` (alias `sticky`/`switch`) overwrites it; `unset` clears it.
A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
default — it routes a single command and the next bare command reverts to sticky.

Resolution order (one place — `target.resolve()`): one-shot if present →
sticky if set and the session exists → else, only if `auto_target = true`, the single
running `claude-*` session → else (default, or zero/several sessions) do nothing and
say so. It never guesses, and never injects into a nonexistent session.

Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
to the same session `claude-libs`. The name is your **stable, speakable handle**:
because the kit forces an explicit name (no basename guessing), you always know the
exact word to say.

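The resolution order reads naturally as a short function. The following is a sketch under stated assumptions, not the actual `target.resolve()`: the signatures are invented for illustration, and accepting an already-prefixed name in `session_name` is a guess.

```python
from typing import Optional, Sequence

PREFIX = "claude-"


def session_name(name: str) -> str:
    """Map a spoken or typed name to its session name (libs -> claude-libs)."""
    # Assumption: an already-prefixed name is passed through unchanged.
    return name if name.startswith(PREFIX) else PREFIX + name


def resolve(
    one_shot: Optional[str],
    sticky: Optional[str],
    running: Sequence[str],
    auto_target: bool = False,
) -> Optional[str]:
    """Pick the target session, or None when nothing should be injected."""
    live = [s for s in running if s.startswith(PREFIX)]
    if one_shot:
        # A one-shot target wins, but never resolves to a missing session.
        target = session_name(one_shot)
        return target if target in live else None
    if sticky and session_name(sticky) in live:
        return session_name(sticky)
    if auto_target and len(live) == 1:
        return live[0]
    # Default, or zero/several sessions: do nothing rather than guess.
    return None
```

Returning `None` instead of a best guess keeps the "never injects into a nonexistent session" property in one place; the caller only has to report that nothing was done.
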
The cc kit lives in `~/.config/claudedo/cc.sh` (sourced from your rc; works under
bash and zsh). Every command **requires an explicit name**:

```bash
cc <name>     # attach/create claude-<name>; writes ~/.claude-active
ccr <name>    # re-attach an existing claude-<name> only
ccl           # list claude-* sessions
cck <name>    # kill claude-<name>
cckl          # kill all claude-* sessions
```

## Contexts (named reference blurbs)

`contexts.toml` holds named reference snippets you can inject ahead of a dictated
instruction with the **`context <name> <instruction>`** voice command (alias
`prepare`). It lives next to `config.toml`
(`$CLAUDEDO_CONTEXTS` → `~/.config/claudedo/contexts.toml` → `./contexts.toml`); a
missing file just means no contexts (the feature is opt-in).

```toml
[contexts]
webhooks = "discord webhooks — test: <url> (safe to spam), live: <url> (real, careful)"
testing = "use the test/staging resources only, never touch prod"
```

Saying `context webhooks send a test message` injects the `webhooks` blurb as a
preamble, then the dictated instruction, and **waits** — nothing is auto-submitted. You
say `send` to submit (**read-before-send**; Claude's own permission prompt is the
backstop for anything consequential). A bare `context webhooks` injects just the blurb.
One context per command (no stacking yet); an unknown name is announced and nothing is
injected.

Names are **spoken and fuzzy-matched**, so keep them simple and distinct — they're
looked up on a despaced/lowercased key, so `web hooks` / `web-hooks` / `webhooks` all
resolve to the same entry. Assembly is config-gated: `behavior.context_multiline` (default
`true`) puts the blurb and instruction on separate lines via a Shift+Enter soft newline;
set it `false` to flatten onto one line with `context_separator` (default `" — "`) if
Shift+Enter is unreliable in your terminal.

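The lookup key and the two assembly modes can be sketched as follows. This is an illustration only: `context_key`, `assemble`, and `SOFT_NEWLINE` are hypothetical names, the sketch does exact lookup on the normalized key and leaves out the fuzzy matching, and a plain `"\n"` stands in for whatever key sequence the daemon actually sends for Shift+Enter.

```python
SOFT_NEWLINE = "\n"  # placeholder; the real Shift+Enter sequence is terminal-specific


def context_key(spoken: str) -> str:
    """Despace and lowercase a spoken name; punctuation such as '-' is dropped too."""
    return "".join(ch for ch in spoken.lower() if ch.isalnum())


def assemble(blurb: str, instruction: str = "",
             multiline: bool = True, separator: str = " — ") -> str:
    """Join a context blurb and a dictated instruction; nothing is submitted."""
    if not instruction:
        # A bare `context <name>` injects just the blurb.
        return blurb
    joiner = SOFT_NEWLINE if multiline else separator
    return f"{blurb}{joiner}{instruction}"
```

For example, `context_key("Web-Hooks")` and `context_key("web hooks")` both yield `webhooks`, which is why the three spellings above land on the same entry.
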
Edit `contexts.toml`, then say **`reload`** (or run `claudedo reload`) — it re-reads
`config.toml` and `contexts.toml` live without restarting the daemon or reloading the
Whisper model. The **`system`** namespace gives daemon-control by voice without touching
Claude: `system status` (mode / target / model / context count) and
`system reload [config|contexts]`.

## Earcons (audio feedback tones)

Short confirmation tones play on key events so you get **eyes-free feedback** — "did it
hear me?" — without watching the terminal. They're tones, not speech (not TTS): a bright
blip when a command is accepted/injected, a low buzz when nothing matched, a rising chime
on submit, and an optional blip on wake. Tones are short (<300ms) and quiet, and they're
**additive** to the console feed — mute them and read at the desk, or hear them eyes-free.

Verify the audio-OUT path (the reverse of `test-audio`, and the less-tested direction on
WSLg) with:

```bash
claudedo test-tone   # plays each tone through WSLg — the audio-out gate
```

Tones play through WSLg's PulseAudio sink, **paplay-first** (a separate process, so it
doesn't contend with the sounddevice mic stream), falling back to in-process sounddevice,
then `powershell.exe` on the Windows host. Playback is **fire-and-forget**: a dead speaker
or a missing tone file logs once and is ignored — audio-out can never block or break a
command (`claudedo yes` injects whether or not the speaker works).

Configure under `[sound]`: `enabled` (master, default on), per-event `on_wake` (default
**off** — a blip right before you speak can bleed into the command capture, and it's
chatty), `on_accept` / `on_no_match` / `on_submit` (default on), and `volume` (0.0–1.0,
best-effort — scaled for sounddevice, `--volume` for paplay, ignored by the PowerShell
fallback). A `[sound.files]` table can point any event at your own `.wav`. The shipped
tones live in the package (`claudedo/sounds/*.wav`); `claudedo/sounds/generate.py` is a
synthetic-beep fallback that can regenerate a placeholder set (it does **not** reproduce
the shipped tones — running it overwrites them with plain beeps).

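A small sketch of how the `[sound]` flags might gate playback, with the documented defaults filled in. The function names are hypothetical and the config is modeled as a plain dict. The one external fact used is that `paplay --volume` takes a linear value where 65536 means 100%; how claudedo actually maps `volume` onto it is an assumption.

```python
# Defaults as documented above; a `volume` default is not stated, so it is omitted.
DEFAULTS = {
    "enabled": True,
    "on_wake": False,
    "on_accept": True,
    "on_no_match": True,
    "on_submit": True,
}


def should_play(event: str, sound_cfg: dict) -> bool:
    """True if the master switch and the per-event flag both allow this tone."""
    cfg = {**DEFAULTS, **sound_cfg}
    return bool(cfg["enabled"]) and bool(cfg.get(f"on_{event}", False))


def clamp_volume(volume: float) -> float:
    """Keep the configured volume inside the documented 0.0-1.0 range."""
    return max(0.0, min(1.0, float(volume)))


def paplay_volume(volume: float) -> int:
    """Scale 0.0-1.0 to paplay's linear --volume range (65536 is 100%)."""
    return round(clamp_volume(volume) * 65536)
```

An unknown event name falls through to `False`, so a typo in the event map stays silent instead of raising.
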
## The confirmed Claude Code keymap

The keystrokes in [`keys.py`](src/claudedo/keys.py) were confirmed **empirically**
against a live `claude` v2.1.191 session (not assumed):

- Numbered prompts (trust prompt, permission prompt): pressing the **bare digit**
  selects **and confirms immediately** — **no trailing Enter**.
- Arrow keys move the highlight without acting; Enter then confirms (modeled as an
  alternative sequence).
- Permission prompt is `1. Yes / 2. Yes, and don't ask again / 3. No`; Escape cancels.
- Literal text goes in via `send-keys -l` (no submit); a bare Enter submits.

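One way to picture such a keymap is a table from command to tmux key names. The sketch below is illustrative and is not the contents of `keys.py`: the `KEYMAP` and `keys_for` names are hypothetical, and mapping `approve` to option 1 and `deny` to option 3 is an inference from the permission prompt layout above rather than something confirmed by the source.

```python
# Values are tmux key names as accepted by `tmux send-keys`.
KEYMAP = {
    "one": ["1"],
    "two": ["2"],
    "three": ["3"],
    "four": ["4"],
    "approve": ["1"],  # assumption: option 1 is "Yes"
    "deny": ["3"],     # assumption: option 3 is "No"
    "send": ["Enter"],
    "enter": ["Enter"],
    "cancel": ["Escape"],
    "escape": ["Escape"],
}


def keys_for(command: str) -> list[str]:
    """Return the key sequence for a command; raises KeyError if unknown."""
    return list(KEYMAP[command])
```

Note that the numbered entries carry no trailing `Enter`, since the bare digit already selects and confirms.
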
If Claude Code changes its prompt UI, re-confirm against a live session and update
`keys.py` — it is the single source of truth.

## Config

Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]`
(`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`).
The default model is **`small.en`** (the English-only small model — ~1s/command on a
strong CPU, more accurate on English than multilingual `small` at the same speed);
`medium`/`medium.en` are more accurate but ~3× slower (noticeable lag), `base.en` is
snappier/less accurate, `large-v3` most accurate/slowest. Every `heard` line shows the
STT latency as `(<ms>/<audio>s)` so you can see what a model change costs. VAD
endpointing ends a capture after `[vad].silence_ms` (700) of trailing silence, capped
at `max_seconds` (15). `claudedo -c <path> ...` points at a specific config; otherwise
it searches
`$CLAUDEDO_CONFIG`, `~/.config/claudedo/config.toml`, then `./config.toml`.

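The config search order can be sketched as a first-match lookup. This is a guess at the behavior, not claudedo's loader: `find_config` and its parameters are hypothetical, and falling through to the next candidate when `$CLAUDEDO_CONFIG` points at a missing file is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_config(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
    cwd: Optional[Path] = None,
) -> Optional[Path]:
    """Return the config path to load, or None if no candidate exists."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    cwd = Path.cwd() if cwd is None else Path(cwd)

    if explicit:
        # `-c <path>` wins outright and is not searched for.
        return Path(explicit)

    candidates = []
    if env.get("CLAUDEDO_CONFIG"):
        candidates.append(Path(env["CLAUDEDO_CONFIG"]))
    candidates.append(home / ".config" / "claudedo" / "config.toml")
    candidates.append(cwd / "config.toml")

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

The `env`, `home`, and `cwd` parameters exist only so the lookup can be exercised without touching the real environment.
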
- **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the
  configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`),
  so Whisper is conditioned to expect "claudedo" and the command words.
- **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.65`, lenient) vs
  `command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a
  false *wake* is cheap (it wakes, finds no command, does nothing), but a false
  *command* fires the wrong action. Prefer expanding command synonyms over loosening
  the command threshold.
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
  (default 700) of trailing silence — Alexa-style record-until-pause — capped at
  `max_seconds` (default 15). The pause both ends a command and separates it from
  following chatter (the chatter is a separate capture the wake gate discards).
- **`auto_target`** (default `false`): with no sticky target and one session running,
  `false` does nothing and asks you to `set`; `true` auto-uses that session.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
  see how Whisper renders your wake word, then tune the wake list/threshold.
- **`context_multiline`** (default `true`) / **`context_separator`** (default `" — "`):
  how the `context` command assembles the blurb and instruction — a Shift+Enter soft
  newline between them, or (when `false`) flattened onto one line with the separator.

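The `[vad]` endpointing rule from the list above can be sketched over a sequence of per-frame speech flags. This is a simplified model, not claudedo's capture loop: the `endpoint` function and the `frame_ms` parameter are hypothetical, and how each frame is classified as speech is left out entirely.

```python
from typing import Optional, Sequence, Tuple


def endpoint(
    speech_flags: Sequence[bool],
    frame_ms: int = 30,
    silence_ms: int = 700,
    max_seconds: float = 15.0,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    Capture starts at the first speech frame and ends once `silence_ms` of
    trailing silence has accumulated or the `max_seconds` cap is reached.
    """
    silence_frames = -(-silence_ms // frame_ms)  # ceiling division
    max_frames = int(max_seconds * 1000 / frame_ms)
    start = None
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if start is None:
            if is_speech:
                start = i
            continue
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= silence_frames or (i - start + 1) >= max_frames:
            return start, i + 1
    # Input ended mid-utterance (or never contained speech).
    return None if start is None else (start, len(speech_flags))
```

Speech that resumes after the pause falls outside the returned range, so it would become a separate capture, which is the property that lets the wake gate discard following chatter.
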
## Requirements

Windows 11 + WSL2 (Ubuntu) with WSLg, Python 3.10+, tmux, the `claude` CLI, and
either bash or zsh (the cc kit supports both).