claudedo/README.md

# claudedo

Voice control for [Claude Code](https://claude.com/claude-code) on **WSL2**.

`claudedo` listens on your mic, runs **local** speech-to-text, recognizes a wake
phrase plus a small command grammar, and injects the matching keystrokes into your
active Claude Code tmux session via `tmux send-keys`. You answer Claude Code's
prompts ("yes", "option one", "approve") and dictate prompts **by voice** — including
hands-free while another window (a game) is focused.

It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
Claude Code over tmux — fully local and private. You run it in a terminal you watch.

## How it works

```
mic (WSLg/PulseAudio RDPSource)
  -> sounddevice capture
  -> faster-whisper (local STT, on-device)
  -> wake gate: utterance must start with a wake phrase, else DISCARD locally
  -> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/
                    mode/set/target/unset/list/cancel)
  -> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
  -> tmux send-keys -t <session> "<keys>"
  -> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
```

**Privacy by construction.** STT runs on-device. In listen mode, any speech that
doesn't start with a wake phrase is dropped the instant it's transcribed — never
stored, never sent anywhere. That's what makes always-listening acceptable while
you're on voice comms in a game.

**Injection is PTY-only.** `claudedo` only ever calls `tmux send-keys`. It never uses
OS-level keyboard input and installs no system-wide keyboard hook. Keystrokes are
text into a Linux pseudo-terminal — they work regardless of which window is focused
and never touch Windows input or a game/anticheat's view.

## Install

```bash
git clone <repo> claudedo && cd claudedo
./install.sh
```

`install.sh` is idempotent. It installs the WSL audio deps, writes the `~/.asoundrc`
Pulse shim, verifies the mic path, pip-installs the package, primes the Whisper
model, and installs the **cc kit** (`~/.config/claudedo/cc.sh`, sourced from every
`~/.zshrc`/`~/.bashrc` you have). It also checks the two Windows-side bits it can't
automate and tells you to fix them:

- **WSLg present** (`/mnt/wslg/PulseServer`). If missing: `wsl --update` in Windows,
  then `wsl --shutdown`, then re-run.
- **Mic permission**: Windows Settings → Privacy & security → Microphone → enable
  *"Let desktop apps access your microphone"*. Required.

Verify the riskiest piece (mic capture) first:

```bash
claudedo test-audio
```

## Usage

**Run it in a terminal you watch — that's the product.** You launch `claudedo
start` and it drops into a visible listen loop (pass `--check` to run a mic check
first). Each utterance prints a timestamped, colored line — `HH:MM:SS [claude-libs]
heard "…" →
typed 'fix'` (green for injected, red for drops, `[SYSTEM]`/`[VOICE]` for state and
recognition). That terminal is your recognition/action console; you attach to the
`claude-<name>` session in another pane to watch the keystrokes land. It runs in the
foreground by design — the console is the point — though `claudedo stop` can signal a
stray instance.

```bash
claudedo start            # the visible listen loop (listen mode default; no mic check)
claudedo start --check    # run a mic check before listening
claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes)
claudedo status           # running? mode? target session?
claudedo stop             # stop a running daemon
claudedo set <name>       # set the sticky target -> claude-<name> (alias: switch)
claudedo unset            # clear the sticky target
claudedo list             # list running claude-* sessions
claudedo test-audio       # verify the mic capture path
```

### Modes

- **listen (default)** — continuous capture; only acts on utterances that **start
  with a wake phrase**; all other speech is transcribed locally and discarded
  instantly. This is the hands-free path and works while a game is focused, because
  the trigger is your voice over the mic bridge — not a keyboard hook.
- **ptt** — push-to-talk. **Desk-only:** it captures only while the daemon's own
  terminal window is focused. There is deliberately **no global hotkey** — a
  system-wide keyboard hook is the keylogger/cheat silhouette anticheats watch for,
  and `claudedo` refuses to install one. For hands-free-while-gaming, use listen
  mode. (Terminals don't deliver key-up events, so PTT is press-to-start /
  press-to-stop in the daemon window, not literal hold.)

Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".

## Command grammar

Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**,
**"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has
no token for the coined word "claudedo" and renders it as real words ("claude do"),
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
the wake phrase is optional.

| Say | Does |
|---|---|
| `yes` / `no` | answer a yes/no prompt |
| `one` / `two` / `three` / `four` | pick numbered option 1–4 |
| `approve` / `deny` | allow / deny a permission prompt |
| `send` / `enter` | submit (Enter) |
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
| `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) |
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
| `mode ptt` / `mode listen` | switch input mode |
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
| `unset` (alias `unsticky`) | clear the sticky target |
| `list` | list running `claude-*` sessions to the daemon console |
| `commands` (alias `help`/`menu`) | print the voice-command menu to the console |
| `customs` (alias `custom`) | custom commands — arriving in v0.2.0 (stub for now) |
| `cancel` / `escape` | back out of a prompt |

Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)

When no sticky target is set, a bare command does nothing and asks you to `set` one
(the default). Set `auto_target = true` to instead auto-use the single running
`claude-*` session when there's exactly one; with several running it always does
nothing and asks you to `set` one.

Number words are normalized to digits before matching ("one"/"won" → 1).

## Targeting

`~/.claude-active` holds the **sticky** target session name (e.g.
`claude-rethink-public`). The **cc kit** writes this file when you attach, and
`claudedo set <name>` (alias `sticky`/`switch`) overwrites it; `unset` clears it.
A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
default — it routes a single command and the next bare command reverts to sticky.

Resolution order (one place — `target.resolve()`): one-shot if present →
sticky if set and the session exists → else, only if `auto_target = true`, the single
running `claude-*` session → else (default, or zero/several sessions) do nothing and
say so. It never guesses, and never injects into a nonexistent session.

Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
to the same session `claude-libs`. The name is your **stable, speakable handle**:
because the kit forces an explicit name (no basename guessing), you always know the
exact word to say.

The cc kit lives in `~/.config/claudedo/cc.sh` (sourced from your rc; works under
bash and zsh). Every command **requires an explicit name**:

```bash
cc <name>    # attach/create claude-<name>; writes ~/.claude-active
ccr <name>   # re-attach an existing claude-<name> only
ccl          # list claude-* sessions
cck <name>   # kill claude-<name>
cckl         # kill all claude-* sessions
```

## The confirmed Claude Code keymap

The keystrokes in [`keys.py`](src/claudedo/keys.py) were confirmed **empirically**
against a live `claude` v2.1.191 session (not assumed):

- Numbered prompts (trust prompt, permission prompt): pressing the **bare digit**
  selects **and confirms immediately** — **no trailing Enter**.
- Arrow keys move the highlight without acting; Enter then confirms (modeled as an
  alternative sequence).
- Permission prompt is `1. Yes / 2. Yes, and don't ask again / 3. No`; Escape cancels.
- Literal text goes in via `send-keys -l` (no submit); a bare Enter submits.

If Claude Code changes its prompt UI, re-confirm against a live session and update
`keys.py` — it is the single source of truth.

## Config

Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]`
(`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`).
The default model is **`small`** (~1s/command on a strong CPU — snappy, and good
enough with initial_prompt biasing); `medium` is more accurate on the coined wake
word but ~3× slower (noticeable lag), `large-v3` most accurate/slowest. Every `heard`
line shows the STT latency as `(<ms>/<audio>s)` so you can see what a model change
costs. `claudedo -c <path> ...` points at a specific config; otherwise it searches
`$CLAUDEDO_CONFIG`, `~/.config/claudedo/config.toml`, then `./config.toml`.

- **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the
  configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`),
  so Whisper is conditioned to expect "claudedo" and the command words.
- **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.6`, lenient) vs
  `command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a
  false *wake* is cheap (it wakes, finds no command, does nothing), but a false
  *command* fires the wrong action. Prefer expanding command synonyms over loosening
  the command threshold.
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
  (default 800) of trailing silence — Alexa-style record-until-pause — capped at
  `max_seconds` (default 10). The pause both ends a command and separates it from
  following chatter (the chatter is a separate capture the wake gate discards).
- **`auto_target`** (default `false`): with no sticky target and one session running,
  `false` does nothing and asks you to `set`; `true` auto-uses that session.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
  see how Whisper renders your wake word, then tune the wake list/threshold.

## Requirements

Windows 11 + WSL2 (Ubuntu) with WSLg, Python 3.10+, tmux, the `claude` CLI, and
either bash or zsh (the cc kit supports both).