claudedo/README.md
disqualifier 2fa3abab63 v0.2.0: context injection + system daemon-control namespace
context injection — named reference blurbs from contexts.toml injected ahead of a
dictated instruction, read-before-send (never auto-submits):
- new contexts.py mirrors config.py: [contexts] name = "blurb"; missing file = empty
  set; names validated as simple words, looked up on a despaced/lowercased key so
  "web hooks"/"web-hooks"/"webhooks" all resolve the same block.
- grammar: context|prepare <name> <instruction> -> Action("context", (name, dictation)).
  same-utterance dictation (everything after <name> is literal, incl. "send"); bare
  context <name> injects just the blurb. one-shot targeting composes:
  [target <name>] [context <ctx>] [filler] <dictation>.
- daemon assembles blurb + (Shift+Enter soft newline | flattened separator) + dictation
  via the existing send_literal/type path, tracks the uncommitted-input buffer, and
  WAITS. config-gated by behavior.context_multiline / context_separator. unknown context
  name announces and injects nothing.

system daemon-control namespace — lands the pass-through vs control split the router was
structured for. reserved leading "system" routes to _do_system (never injects to
claude): system status (mode/target/model/contexts) and system reload [config|contexts].

live reload — voice reload + CLI claudedo reload (SIGHUP) re-read config.toml +
contexts.toml without reinitializing the loaded whisper model. customs now lists loaded
contexts. install.sh installs the contexts.toml template copy-if-absent (else .new).

keys.NEWLINE (S-Enter) added for the soft-newline assembly. wake list unchanged.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 18:08:08 -04:00

# claudedo
Voice control for [Claude Code](https://claude.com/claude-code) on **WSL2**.
`claudedo` listens on your mic, runs **local** speech-to-text, recognizes a wake
phrase plus a small command grammar, and injects the matching keystrokes into your
active Claude Code tmux session via `tmux send-keys`. You answer Claude Code's
prompts ("yes", "option one", "approve") and dictate prompts **by voice** — including
hands-free while another window (a game) is focused.

It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
Claude Code over tmux — fully local and private. You run it in a terminal you watch.
## How it works
```
mic (WSLg/PulseAudio RDPSource)
-> sounddevice capture
-> faster-whisper (local STT, on-device)
-> wake gate: utterance must start with a wake phrase, else DISCARD locally
-> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/
mode/set/target/unset/list/context/reload/system/cancel)
-> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
-> tmux send-keys -t <session> "<keys>"
-> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
```
**Privacy by construction.** STT runs on-device. In listen mode, any speech that
doesn't start with a wake phrase is dropped the instant it's transcribed — never
stored, never sent anywhere. That's what makes always-listening acceptable while
you're on voice comms in a game.

**Injection is PTY-only.** `claudedo` only ever calls `tmux send-keys`. It never uses
OS-level keyboard input and installs no system-wide keyboard hook. Keystrokes are
text into a Linux pseudo-terminal — they work regardless of which window is focused
and never touch Windows input or a game/anticheat's view.
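
As a rough illustration of what PTY-only injection amounts to, here is a minimal Python sketch that builds the `tmux send-keys` command line. The helper names `send_keys_argv` and `inject` are hypothetical and are not taken from the claudedo source; only the `tmux send-keys -t <session>` call and the `-l` literal flag come from tmux itself.

```python
import subprocess


def send_keys_argv(session: str, keys: str, literal: bool = False) -> list[str]:
    """Build the tmux argv for one injection (hypothetical helper)."""
    argv = ["tmux", "send-keys", "-t", session]
    if literal:
        argv.append("-l")  # -l: send the text as-is, with no key-name lookup
    argv.append(keys)
    return argv


def inject(session: str, keys: str, literal: bool = False) -> None:
    """Run tmux; check=True raises CalledProcessError if tmux exits non-zero."""
    subprocess.run(send_keys_argv(session, keys, literal), check=True)


# Dictated text goes in literally; a separate bare Enter would submit it.
print(send_keys_argv("claude-libs", "fix the failing test", literal=True))
print(send_keys_argv("claude-libs", "Enter"))
```

Nothing here touches the focused window or any OS-level input API: the only side effect is a `tmux` subprocess writing into the session's pseudo-terminal.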
## Install
```bash
git clone <repo> claudedo && cd claudedo
./install.sh
```
`install.sh` is idempotent. It installs the WSL audio deps, writes the `~/.asoundrc`
Pulse shim, verifies the mic path, pip-installs the package, primes the Whisper
model, and installs the **cc kit** (`~/.config/claudedo/cc.sh`, sourced from every
`~/.zshrc`/`~/.bashrc` you have). It also checks the two Windows-side bits it can't
automate and tells you to fix them:
- **WSLg present** (`/mnt/wslg/PulseServer`). If missing: `wsl --update` in Windows,
then `wsl --shutdown`, then re-run.
- **Mic permission**: Windows Settings → Privacy & security → Microphone → enable
*"Let desktop apps access your microphone"*. Required.

Verify the riskiest piece (mic capture) first:
```bash
claudedo test-audio
```
## Usage
**Run it in a terminal you watch — that's the product.** You launch `claudedo
start` and it drops into a visible listen loop (pass `--check` to run a mic check
first). Each utterance prints a timestamped, colored line:
`HH:MM:SS [claude-libs] heard "…" → typed 'fix'`
(green for injected, red for drops, `[SYSTEM]`/`[VOICE]` for state and
recognition). That terminal is your recognition/action console; you attach to the
`claude-<name>` session in another pane to watch the keystrokes land. It runs in the
foreground by design — the console is the point — though `claudedo stop` can signal a
stray instance.
```bash
claudedo start # the visible listen loop (listen mode default; no mic check)
claudedo start --check # run a mic check before listening
claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes)
claudedo status # running? mode? target session?
claudedo stop # stop a running daemon
claudedo reload # reload config.toml + contexts.toml in a running daemon
claudedo set <name> # set the sticky target -> claude-<name> (alias: switch)
claudedo unset # clear the sticky target
claudedo list # list running claude-* sessions
claudedo test-audio # verify the mic capture path
```
### Modes
- **listen (default)** — continuous capture; only acts on utterances that **start
with a wake phrase**; all other speech is transcribed locally and discarded
instantly. This is the hands-free path and works while a game is focused, because
the trigger is your voice over the mic bridge — not a keyboard hook.
- **ptt** — push-to-talk. **Desk-only:** it captures only while the daemon's own
terminal window is focused. There is deliberately **no global hotkey** — a
system-wide keyboard hook is the keylogger/cheat silhouette anticheats watch for,
and `claudedo` refuses to install one. For hands-free-while-gaming, use listen
mode. (Terminals don't deliver key-up events, so PTT is press-to-start /
press-to-stop in the daemon window, not literal hold.)

Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".
## Command grammar
Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**,
**"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has
no token for the coined word "claudedo" and renders it as real words ("claude do"),
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
the wake phrase is optional. When a command's wake phrase matched loosely (e.g. you
said "okay clouds"), the heard line notes which phrase it assumed —
`heard "okay clouds list" -> LIST (wake: okay claude)`.
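
To make the lenient wake matching concrete, here is a minimal sketch of a wake gate. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the README does not say which measure claudedo actually uses, and the names `wake_gate` and `_squash` are hypothetical. The `0.65` threshold is the documented default for `wake_fuzzy_threshold`.

```python
from difflib import SequenceMatcher

WAKE_PHRASES = ["claudedo", "claude do", "hey claude", "ok claude", "okay claude"]


def _squash(text: str) -> str:
    """Lowercase and keep only letters/digits (case/space-insensitive compare)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def wake_gate(utterance, phrases=WAKE_PHRASES, threshold=0.65):
    """Return (assumed_phrase, rest_of_utterance), or None to discard."""
    words = utterance.split()
    best = None  # (score, phrase, rest)
    for n in (1, 2):  # the default phrases are one or two words long
        if len(words) < n:
            break
        head = _squash(" ".join(words[:n]))
        for phrase in phrases:
            score = SequenceMatcher(None, head, _squash(phrase)).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, phrase, " ".join(words[n:]))
    return None if best is None else (best[1], best[2])


print(wake_gate("okay clouds list"))  # loose match on the first two words
print(wake_gate("push mid now"))      # no wake phrase: dropped
```

The gate only looks at the start of the utterance, so anything that does not open with something close to a wake phrase is discarded before the command grammar runs.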
| Say | Does |
|---|---|
| `yes` / `no` | answer a yes/no prompt |
| `one` / `two` / `three` / `four` | pick numbered option 1-4 |
| `approve` / `deny` | allow / deny a permission prompt |
| `send` / `enter` | submit (Enter) |
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
| `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) |
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
| `mode ptt` / `mode listen` | switch input mode |
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
| `unset` (alias `unsticky`) | clear the sticky target |
| `list` | list running `claude-*` sessions to the daemon console |
| `context <name> <instruction>` (alias `prepare`) | inject a `contexts.toml` blurb as a preamble + the dictated instruction, then **wait** (no submit — say "send") |
| `reload` | re-read `config.toml` + `contexts.toml` live (no daemon restart, model stays loaded) |
| `system status` | print mode / target / model / context count to the console (daemon-control; never injects) |
| `system reload [config\|contexts]` | reload one or both config files |
| `commands` (alias `help`/`menu`) | print the voice-command menu to the console |
| `customs` (alias `custom`) | list the loaded context names |
| `version` | print the claudedo version to the console |
| `cancel` / `escape` | back out of a prompt |

Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)

By default, when no sticky target is set, a bare command does nothing and asks you to
`set` one. Set `auto_target = true` to instead auto-use the single running
`claude-*` session when there's exactly one; with several running it still does
nothing and asks you to `set` one.

Number words are normalized to digits before matching ("one"/"won" → 1).
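
A minimal sketch of that normalization step, assuming a simple whole-word lookup table. Only the "one"/"won" pair is confirmed above; the other entries and the `normalize_numbers` name are illustrative guesses, not the project's actual table.

```python
import re

# Assumed table: "one"/"won" come from the README, the rest mirror options 1-4.
NUMBER_WORDS = {"one": "1", "won": "1", "two": "2", "three": "3", "four": "4"}

_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def normalize_numbers(text: str) -> str:
    """Replace whole-word number words with digits, case-insensitively."""
    return _PATTERN.sub(lambda m: NUMBER_WORDS[m.group(1).lower()], text)


print(normalize_numbers("option won"))
print(normalize_numbers("someone wonders"))  # no whole-word match, unchanged
```

Matching on word boundaries keeps ordinary dictation such as "someone" intact while still catching Whisper's homophone for "one".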
## Targeting
`~/.claude-active` holds the **sticky** target session name (e.g.
`claude-rethink-public`). The **cc kit** writes this file when you attach, and
`claudedo set <name>` (alias `sticky`/`switch`) overwrites it; `unset` clears it.

A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
default — it routes a single command and the next bare command reverts to sticky.

Resolution order (one place — `target.resolve()`): one-shot if present →
sticky if set and the session exists → else, only if `auto_target = true`, the single
running `claude-*` session → else (default, or zero/several sessions) do nothing and
say so. It never guesses, and never injects into a nonexistent session.
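
The resolution order can be sketched as a small pure function. This is an illustration under assumptions, not the real `target.resolve()`: the signature is hypothetical, `running` stands for the list of live tmux session names, and returning `None` for a one-shot whose session is not running is one reading of "never injects into a nonexistent session".

```python
from typing import Optional, Sequence


def session_name(name: str) -> str:
    """Map a spoken name to its tmux session name."""
    return f"claude-{name}"


def resolve(one_shot: Optional[str], sticky: Optional[str],
            running: Sequence[str], auto_target: bool = False) -> Optional[str]:
    """Pick the session to inject into, or None to do nothing."""
    if one_shot is not None:
        session = session_name(one_shot)
        # Assumption: a one-shot aimed at a missing session does nothing
        # rather than falling back to the sticky target.
        return session if session in running else None
    if sticky and sticky in running:
        return sticky
    candidates = [s for s in running if s.startswith("claude-")]
    if auto_target and len(candidates) == 1:
        return candidates[0]
    return None  # default, or zero/several sessions: never guess


running = ["claude-libs", "claude-api"]
print(resolve("libs", "claude-api", running))          # one-shot wins
print(resolve(None, "claude-api", running))            # sticky
print(resolve(None, None, running, auto_target=True))  # several running
```

Keeping the whole decision in one function with no I/O makes the "never guesses" guarantee easy to test in isolation.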
Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
to the same session `claude-libs`. The name is your **stable, speakable handle**:
because the kit forces an explicit name (no basename guessing), you always know the
exact word to say.

The cc kit lives in `~/.config/claudedo/cc.sh` (sourced from your rc; works under
bash and zsh). Every command **requires an explicit name**:
```bash
cc <name> # attach/create claude-<name>; writes ~/.claude-active
ccr <name> # re-attach an existing claude-<name> only
ccl # list claude-* sessions
cck <name> # kill claude-<name>
cckl # kill all claude-* sessions
```
## Contexts (named reference blurbs)
`contexts.toml` holds named reference snippets you can inject ahead of a dictated
instruction with the **`context <name> <instruction>`** voice command (alias
`prepare`). It lives next to `config.toml`
(`$CLAUDEDO_CONTEXTS` → `~/.config/claudedo/contexts.toml` → `./contexts.toml`); a
missing file just means no contexts (the feature is opt-in).
```toml
[contexts]
webhooks = "discord webhooks — test: <url> (safe to spam), live: <url> (real, careful)"
testing = "use the test/staging resources only, never touch prod"
```
Saying `context webhooks send a test message` injects the `webhooks` blurb as a
preamble, then the dictated instruction, and **waits** — nothing is auto-submitted. You
say `send` to submit (**read-before-send**; Claude's own permission prompt is the
backstop for anything consequential). A bare `context webhooks` injects just the blurb.
One context per command (no stacking yet); an unknown name announces and injects
nothing.

Names are **spoken and fuzzy-matched**, so keep them simple and distinct — they're
looked up on a despaced/lowercased key, so `web hooks` / `web-hooks` / `webhooks` all
resolve the same block. Assembly is config-gated: `behavior.context_multiline` (default
`true`) puts the blurb and instruction on separate lines via a Shift+Enter soft newline;
set it `false` to flatten onto one line with `context_separator` (default `" — "`) if
Shift+Enter is unreliable in your terminal.
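
The key lookup and the two assembly modes can be sketched as follows. This is a simplified illustration: `context_key` and `assemble` are hypothetical names, the steps are returned as plain `(kind, value)` pairs instead of being sent to tmux, fuzzy matching of the spoken name is left out, and `S-Enter` is assumed to be the key name used for the Shift+Enter soft newline.

```python
def context_key(name: str) -> str:
    """Despaced/lowercased lookup key: 'web hooks' and 'Web-Hooks' -> 'webhooks'."""
    return name.lower().replace(" ", "").replace("-", "")


def assemble(contexts, name, dictation, multiline=True, separator=" \u2014 "):
    """Return the injection steps for one context command; [] if name is unknown."""
    table = {context_key(k): v for k, v in contexts.items()}
    blurb = table.get(context_key(name))
    if blurb is None:
        return []  # unknown context: announce elsewhere, inject nothing
    if not dictation:
        return [("literal", blurb)]  # bare `context <name>`
    if multiline:
        # Blurb and instruction on separate lines via a soft newline.
        return [("literal", blurb), ("key", "S-Enter"), ("literal", dictation)]
    return [("literal", blurb + separator + dictation)]  # flattened onto one line


ctx = {"webhooks": "discord webhooks"}
print(assemble(ctx, "web hooks", "send a test message"))
print(assemble(ctx, "webhooks", "ping", multiline=False, separator=" | "))
```

Note that no step ever ends with a submitting Enter: the assembled text sits in the input box until you say `send`.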
Edit `contexts.toml`, then say **`reload`** (or run `claudedo reload`) — it re-reads
`config.toml` and `contexts.toml` live without restarting the daemon or reloading the
Whisper model. The **`system`** namespace gives daemon-control by voice without touching
Claude: `system status` (mode / target / model / context count) and `system reload
[config|contexts]`.
## The confirmed Claude Code keymap
The keystrokes in [`keys.py`](src/claudedo/keys.py) were confirmed **empirically**
against a live `claude` v2.1.191 session (not assumed):
- Numbered prompts (trust prompt, permission prompt): pressing the **bare digit**
selects **and confirms immediately**, with **no trailing Enter**.
- Arrow keys move the highlight without acting; Enter then confirms (modeled as an
alternative sequence).
- Permission prompt is `1. Yes / 2. Yes, and don't ask again / 3. No`; Escape cancels.
- Literal text goes in via `send-keys -l` (no submit); a bare Enter submits.

If Claude Code changes its prompt UI, re-confirm against a live session and update
`keys.py` — it is the single source of truth.
## Config
Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]`
(`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`).
The default model is **`small.en`** (the English-only small model — ~1s/command on a
strong CPU, more accurate on English than multilingual `small` at the same speed);
`medium`/`medium.en` are more accurate but ~3× slower (noticeable lag), `base.en` is
snappier/less accurate, `large-v3` most accurate/slowest. Every `heard` line shows the
STT latency as `(<ms>/<audio>s)` so you can see what a model change costs. VAD
endpointing ends a capture after `[vad].silence_ms` (700) of trailing silence, capped
at `max_seconds` (15). `claudedo -c <path> ...` points at a specific config; otherwise
it searches
`$CLAUDEDO_CONFIG`, `~/.config/claudedo/config.toml`, then `./config.toml`.
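
The search order reads naturally as a first-match loop. The sketch below is an assumption about behavior, not the project's loader: `find_config` is a hypothetical name, and "first existing file wins" (including skipping a `$CLAUDEDO_CONFIG` that points at a missing file) is a guess at how the search handles misses.

```python
import os
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[str] = None) -> Optional[Path]:
    """Return the config path to load, or None if nothing is found."""
    if explicit:  # `claudedo -c <path>` bypasses the search entirely
        return Path(explicit)
    candidates = []
    env = os.environ.get("CLAUDEDO_CONFIG")
    if env:
        candidates.append(Path(env))
    candidates.append(Path.home() / ".config" / "claudedo" / "config.toml")
    candidates.append(Path.cwd() / "config.toml")
    for path in candidates:
        if path.is_file():  # assumed: the first existing file wins
            return path
    return None


print(find_config("/tmp/custom.toml"))
```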
- **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the
configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`),
so Whisper is conditioned to expect "claudedo" and the command words.
- **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.65`, lenient) vs
`command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a
false *wake* is cheap (it wakes, finds no command, does nothing), but a false
*command* fires the wrong action. Prefer expanding command synonyms over loosening
the command threshold.
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
(default 700) of trailing silence — Alexa-style record-until-pause — capped at
`max_seconds` (default 15). The pause both ends a command and separates it from
following chatter (the chatter is a separate capture the wake gate discards).
- **`auto_target`** (default `false`): with no sticky target and one session running,
`false` does nothing and asks you to `set`; `true` auto-uses that session.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
see how Whisper renders your wake word, then tune the wake list/threshold.
- **`context_multiline`** (default `true`) / **`context_separator`** (default `" — "`):
how the `context` command assembles the blurb and instruction — a Shift+Enter soft
newline between them, or (when `false`) flattened onto one line with the separator.
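
The `[vad]` endpointing rule can be illustrated on a list of per-frame speech flags. This is a minimal sketch, assuming some detector has already labeled each fixed-length audio frame as speech or silence; `endpoint` and `frame_ms` are hypothetical names, while `silence_ms` and `max_seconds` mirror the config keys and their documented defaults.

```python
def endpoint(frames, frame_ms=30, silence_ms=700, max_seconds=15):
    """Return (start, end) frame indices of one capture (end exclusive), or None.

    frames: sequence of bools, True where the frame contains speech.
    """
    silence_needed = -(-silence_ms // frame_ms)      # ceil(silence_ms / frame_ms)
    max_frames = (max_seconds * 1000) // frame_ms
    start = None
    quiet = 0
    for i, is_speech in enumerate(frames):
        if start is None:
            if is_speech:
                start = i                            # capture starts on speech
            continue
        quiet = 0 if is_speech else quiet + 1
        if quiet >= silence_needed:                  # trailing pause ends the capture
            return (start, i + 1)
        if i + 1 - start >= max_frames:              # hard cap on capture length
            return (start, i + 1)
    return None if start is None else (start, len(frames))


# 100 ms frames: 0.3 s silence, 0.5 s speech, then a long pause.
frames = [False] * 3 + [True] * 5 + [False] * 10
print(endpoint(frames, frame_ms=100))
```

Whatever is spoken after the pause would start a new capture, which is what lets the wake gate discard follow-on chatter independently of the command.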
## Requirements
Windows 11 + WSL2 (Ubuntu) with WSLg, Python 3.10+, tmux, the `claude` CLI, and
either bash or zsh (the cc kit supports both).