Compare commits

...

6 Commits

Author SHA1 Message Date
4abdfd56bc feat: start skips the mic check by default; --check to opt in
invert the pre-listen mic check — default is no check (just start listening); pass
'claudedo start --check' to run it. replaces the old --skip-audio-check flag.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 02:25:20 -04:00
e6dadab143 feat: debug/echo command — print the spoken phrase to the console
'<wake> debug <text>' (alias echo) echoes what you said to the console as
[VOICE] debug: "..." and injects nothing — a no-target test command for checking
wake + STT transcription. added to the STT vocab so it's biased for.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 02:22:37 -04:00
5064f912a4 fix: install.sh installs config.toml to ~/.config/claudedo
the daemon's config lookup falls through to ./config.toml only, so without a copy in
the standard dir it was repo-cwd-only. install config.toml to ~/.config/claudedo/ —
copy if absent, else write config.toml.new beside the user's edited copy (never
clobber). also gitignore COMPACT.md (handoff doc kept on disk, untracked).

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 01:53:25 -04:00
a51c2fbdd4 feat: v0.1.3 STT tuning — medium model, initial_prompt bias, split thresholds, VAD config
default stt.model -> medium (biggest accuracy gain for the coined wake word;
small/large-v3 documented alternatives). seed faster-whisper with an initial_prompt
derived from the configured wake phrases + command vocabulary (grammar.vocabulary /
initial_prompt, one source — command synonyms now live in named _*_VERBS tuples).

split the single fuzzy threshold into wake_fuzzy_threshold (0.6, lenient — a false
wake is cheap) and command_fuzzy_threshold (0.8, tight — a false command fires the
wrong action); grammar.parse() takes both. add a [vad] config section (silence_ms,
max_seconds) for the existing Alexa-style record-until-pause endpointing, which
captures a command whole and lets the trailing pause separate it from following
chatter (that chatter is a separate capture the wake gate discards). bump to 0.1.3.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 01:41:48 -04:00
bd6597352a feat: 'add [a] space' / 'insert <n> spaces' phrasing; drop 'claude due' wake
map 'add a space'/'add space'/'insert two spaces' to the space command (count read
from either side of the noun). remove 'claude due' from the default wake list (it
double-rendered with 'claude do' and wasn't wanted). docs synced.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 01:27:16 -04:00
08bbe3ce58 docs: sync README with v0.1.2 (wake list, editing cmds, auto_target, console)
reconcile README with the shipped code and with CLAUDE.md: full 6-phrase wake list
(claudedo/claude do/claude due/hey claude/ok claude/okay claude) with the Whisper
rationale; space/backspace/erase in the grammar + flow; colored prefixed console
output description; fix the auto_target contradiction (default false = require
set/target, not auto-pick); drop the stale 'backgroundable'.

Signed-off-by: disqualifier <dev@disqualifier.me>
2026-06-26 01:22:00 -04:00
11 changed files with 195 additions and 84 deletions

1
.gitignore vendored
View File

@ -1,4 +1,5 @@
CLAUDE.md CLAUDE.md
COMPACT.md
__pycache__/ __pycache__/
*.pyc *.pyc

View File

@ -11,7 +11,7 @@ hands-free while another window (a game) is focused.
It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
Claude Code over tmux — fully local, private, backgroundable. Claude Code over tmux — fully local and private. You run it in a terminal you watch.
## How it works ## How it works
@ -20,9 +20,11 @@ mic (WSLg/PulseAudio RDPSource)
-> sounddevice capture -> sounddevice capture
-> faster-whisper (local STT, on-device) -> faster-whisper (local STT, on-device)
-> wake gate: utterance must start with a wake phrase, else DISCARD locally -> wake gate: utterance must start with a wake phrase, else DISCARD locally
-> grammar match (yes/no/one..four/approve/deny/send/type/mode/set/target/cancel) -> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/
-> resolve target session (~/.claude-active) mode/set/target/unset/list/cancel)
-> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
-> tmux send-keys -t <session> "<keys>" -> tmux send-keys -t <session> "<keys>"
-> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
``` ```
**Privacy by construction.** STT runs on-device. In listen mode, any speech that **Privacy by construction.** STT runs on-device. In listen mode, any speech that
@ -62,16 +64,19 @@ claudedo test-audio
## Usage ## Usage
**Run it in a terminal you watch — that's the product.** You launch `claudedo **Run it in a terminal you watch — that's the product.** You launch `claudedo
start`, it does a quick mic check, then drops into a visible listen loop that prints start` and it drops into a visible listen loop (pass `--check` to run a mic check
`heard → matched → sent` for every utterance. That terminal is your first). Each utterance prints a timestamped, colored line — `HH:MM:SS [claude-libs]
recognition/action console; you attach to the `claude-<name>` session in another pane heard "…" →
to watch the keystrokes land. There is no backgrounding/daemon mode — the whole point typed 'fix'` (green for injected, red for drops, `[SYSTEM]`/`[VOICE]` for state and
is the console you read. recognition). That terminal is your recognition/action console; you attach to the
`claude-<name>` session in another pane to watch the keystrokes land. It runs in the
foreground by design — the console is the point — though `claudedo stop` can signal a
stray instance.
```bash ```bash
claudedo start # mic-check, then the visible listen loop (listen mode default) claudedo start # the visible listen loop (listen mode default; no mic check)
claudedo start --check # run a mic check before listening
claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes) claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes)
claudedo start --skip-audio-check # skip the pre-listen mic check
claudedo status # running? mode? target session? claudedo status # running? mode? target session?
claudedo stop # stop a running daemon claudedo stop # stop a running daemon
claudedo set <name> # set the sticky target -> claude-<name> (alias: switch) claudedo set <name> # set the sticky target -> claude-<name> (alias: switch)
@ -97,9 +102,12 @@ Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".
## Command grammar ## Command grammar
Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**. Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**,
"claudedo" is a coined word, so the matcher is lenient (accepts "claude do", **"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has
"clauddo", "cloud do", …). In PTT mode the wake phrase is optional. no token for the coined word "claudedo" and renders it as real words ("claude do"),
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
the wake phrase is optional.
| Say | Does | | Say | Does |
|---|---| |---|---|
@ -108,9 +116,10 @@ Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
| `approve` / `deny` | allow / deny a permission prompt | | `approve` / `deny` | allow / deny a permission prompt |
| `send` / `enter` | submit (Enter) | | `send` / `enter` | submit (Enter) |
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") | | `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
| `space [<n>]` | insert n spaces (default 1) | | `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) |
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary | | `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary | | `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
| `mode ptt` / `mode listen` | switch input mode | | `mode ptt` / `mode listen` | switch input mode |
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) | | `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged | | `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
@ -121,8 +130,10 @@ Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
Optional filler (`select` / `use` / `choose`) may precede any command and is ignored: Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.) `select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)
When no sticky target is set, a bare command auto-targets the **only** running When no sticky target is set, a bare command does nothing and asks you to `set` one
`claude-*` session; if several are running it does nothing and asks you to `set` one. (the default). Set `auto_target = true` to instead auto-use the single running
`claude-*` session when there's exactly one; with several running it always does
nothing and asks you to `set` one.
Number words are normalized to digits before matching ("one"/"won" → 1). Number words are normalized to digits before matching ("one"/"won" → 1).
@ -135,9 +146,9 @@ A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
default — it routes a single command and the next bare command reverts to sticky. default — it routes a single command and the next bare command reverts to sticky.
Resolution order (one place — `target.resolve()`): one-shot if present → Resolution order (one place — `target.resolve()`): one-shot if present →
sticky if set and the session exists → else the only running `claude-*` session → sticky if set and the session exists → else, only if `auto_target = true`, the single
else (zero or several) do nothing and say so. It never guesses, and never injects running `claude-*` session → else (default, or zero/several sessions) do nothing and
into a nonexistent session. say so. It never guesses, and never injects into a nonexistent session.
Every name maps to `claude-<name>` through one helper (`target.session_name()`), and Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
@ -174,19 +185,29 @@ If Claude Code changes its prompt UI, re-confirm against a live session and upda
## Config ## Config
Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
key, Whisper model/language/device, audio segmentation thresholds, and `[behavior]` key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]`
(`type_autosend`, `filler_words`, `auto_target`, `print_heard`). The default model is (`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`).
`small`; bump to `medium` if the coined wake word is recognized poorly. `claudedo -c The default model is **`medium`** (best accuracy for the coined wake word on a strong
<path> ...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`, CPU); `small` is faster/less accurate, `large-v3` most accurate. `claudedo -c <path>
...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`,
`~/.config/claudedo/config.toml`, then `./config.toml`. `~/.config/claudedo/config.toml`, then `./config.toml`.
- **`auto_target`** (default `false`): with no sticky target set and exactly one - **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the
`claude-*` session running, `false` makes a bare command do nothing and ask you to configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`),
`set` one; `true` auto-targets that single session. so Whisper is conditioned to expect "claudedo" and the command words.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts to the - **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.6`, lenient) vs
console so you can see how Whisper renders your wake word. Turn it on to debug `command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a
detection, then off. Whisper has no token for "claudedo" — it commonly emits false *wake* is cheap (it wakes, finds no command, does nothing), but a false
"claude do" or "claude due", both of which are in the default wake list. *command* fires the wrong action. Prefer expanding command synonyms over loosening
the command threshold.
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
(default 800) of trailing silence — Alexa-style record-until-pause — capped at
`max_seconds` (default 10). The pause both ends a command and separates it from
following chatter (the chatter is a separate capture the wake gate discards).
- **`auto_target`** (default `false`): with no sticky target and one session running,
`false` does nothing and asks you to `set`; `true` auto-uses that session.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
see how Whisper renders your wake word, then tune the wake list/threshold.
## Requirements ## Requirements

View File

@ -5,7 +5,7 @@
# wake phrases for listen mode. fuzzy-matched: case/space-insensitive, lenient on # wake phrases for listen mode. fuzzy-matched: case/space-insensitive, lenient on
# the coined word "claudedo" (whisper renders it inconsistently). number words are # the coined word "claudedo" (whisper renders it inconsistently). number words are
# normalized to digits before command matching. # normalized to digits before command matching.
phrases = ["claudedo", "claude do", "claude due", "hey claude", "ok claude", "okay claude"] phrases = ["claudedo", "claude do", "hey claude", "ok claude", "okay claude"]
[input] [input]
# "listen" (default): continuous capture; only acts on utterances that start with a # "listen" (default): continuous capture; only acts on utterances that start with a
@ -21,10 +21,10 @@ mode = "listen"
ptt_key = "space" ptt_key = "space"
[stt] [stt]
# faster-whisper model size. "small" is a good accuracy/latency balance for the # faster-whisper model size. "medium" is the default — biggest accuracy gain for the
# short command grammar (~sub-second per chunk on a strong cpu). if the coined wake # coined wake word ("claudedo" / "claude do") and fine on a strong cpu. "small" is
# word "claudedo" is recognized poorly, bump to "medium" (slower per chunk). # faster but less accurate; "large-v3" is most accurate if medium still struggles.
model = "small" model = "medium"
language = "en" language = "en"
# mic device: "auto", or a sounddevice device index (integer) / substring of a # mic device: "auto", or a sounddevice device index (integer) / substring of a
# device name. run `claudedo test-audio` to list devices. # device name. run `claudedo test-audio` to list devices.
@ -36,21 +36,30 @@ compute = "auto"
# capture parameters. 16 kHz mono is what whisper expects. # capture parameters. 16 kHz mono is what whisper expects.
samplerate = 16000 samplerate = 16000
channels = 1 channels = 1
# listen-mode silence segmentation: an utterance ends after this many seconds below # rms energy below this counts as silence (the VAD onset/endpoint floor).
# the rms threshold. keeps latency low without streaming.
silence_threshold = 0.012 silence_threshold = 0.012
silence_duration = 0.8
# ignore utterances shorter than this (clicks, coughs). # ignore utterances shorter than this (clicks, coughs).
min_utterance = 0.3 min_utterance = 0.3
# hard cap on a single utterance so a stuck stream can't grow unbounded.
max_utterance = 15.0 [vad]
# Alexa-style record-until-pause endpointing (listen mode). capture starts on speech
# onset and ends after this much trailing silence — the natural end of an utterance.
# a real pause both ends the command AND separates it from following chatter (the
# chatter becomes a separate capture that the wake gate then discards).
silence_ms = 800
# hard cap so continuous noise can't record forever.
max_seconds = 10.0
[behavior] [behavior]
# dictation never auto-submits: "type <phrase>" inserts literal text only; you say # dictation never auto-submits: "type <phrase>" inserts literal text only; you say
# "send" separately to submit (read-before-send). # "send" separately to submit (read-before-send).
type_autosend = false type_autosend = false
# fuzzy match ratio (0..1) required to accept a wake phrase / command token. # fuzzy match ratios (0..1). the asymmetry is deliberate: a false WAKE is cheap (it
match_threshold = 0.8 # wakes, finds no command, does nothing), so wake is lenient; a false COMMAND fires
# the WRONG action, so commands stay tight. lower = more lenient = more matches.
# prefer expanding command synonyms over loosening command_fuzzy_threshold.
wake_fuzzy_threshold = 0.6
command_fuzzy_threshold = 0.8
# optional filler words that may precede a command and are ignored for matching: # optional filler words that may precede a command and are ignored for matching:
# "select yes" / "use yes" behave like "yes". (a filler word followed by a digit is # "select yes" / "use yes" behave like "yes". (a filler word followed by a digit is
# the select command, e.g. "select 1", and is not dropped.) # the select command, e.g. "select 1", and is not dropped.)

View File

@ -94,6 +94,18 @@ mkdir -p "$CONF_DIR"
install -m 0644 "$REPO_DIR/shell/cc.sh" "$CONF_DIR/cc.sh" install -m 0644 "$REPO_DIR/shell/cc.sh" "$CONF_DIR/cc.sh"
echo " wrote $CONF_DIR/cc.sh" echo " wrote $CONF_DIR/cc.sh"
# install config.toml to the standard location so the daemon finds it from any dir.
# never clobber an edited user config: copy only if absent, else drop a .new to diff.
if [ ! -f "$CONF_DIR/config.toml" ]; then
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"
echo " wrote $CONF_DIR/config.toml"
elif ! cmp -s "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"; then
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml.new"
echo " kept your $CONF_DIR/config.toml; new default written to config.toml.new (diff to merge)"
else
echo " $CONF_DIR/config.toml already current"
fi
# wire EVERY rc that exists (the user may have both zsh and bash). # wire EVERY rc that exists (the user may have both zsh and bash).
wired_any=0 wired_any=0
for RC in "$HOME/.zshrc" "$HOME/.bashrc"; do for RC in "$HOME/.zshrc" "$HOME/.bashrc"; do

View File

@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
name = "claudedo" name = "claudedo"
version = "0.1.2" version = "0.1.3"
description = "voice-control daemon for claude code (local STT -> tmux send-keys)" description = "voice-control daemon for claude code (local STT -> tmux send-keys)"
readme = "README.md" readme = "README.md"
requires-python = ">=3.10" requires-python = ">=3.10"

View File

@ -1,3 +1,3 @@
"""claudedo — voice-control daemon for claude code (local STT -> tmux send-keys)""" """claudedo — voice-control daemon for claude code (local STT -> tmux send-keys)"""
__version__ = "0.1.2" __version__ = "0.1.3"

View File

@ -33,12 +33,12 @@ def cmd_start(args: argparse.Namespace) -> int:
config = _load_or_die(args.config) config = _load_or_die(args.config)
if args.mode: if args.mode:
config.mode = args.mode config.mode = args.mode
if not args.skip_audio_check: if args.check:
print("checking mic before listening (speak briefly) ...") print("checking mic before listening (speak briefly) ...")
peak = _probe_mic(config, seconds=2.0, verbose=False) peak = _probe_mic(config, seconds=2.0, verbose=False)
if peak is None or peak < 0.02: if peak is None or peak < 0.02:
print("mic check failed — no usable input.", file=sys.stderr) print("mic check failed — no usable input.", file=sys.stderr)
print("run `claudedo test-audio` to debug; or `claudedo start --skip-audio-check`", print("run `claudedo test-audio` to debug, or `claudedo start` to skip the check",
file=sys.stderr) file=sys.stderr)
return 1 return 1
print(f"mic OK (peak {peak:.3f}).") print(f"mic OK (peak {peak:.3f}).")
@ -217,8 +217,8 @@ def build_parser() -> argparse.ArgumentParser:
sp = sub.add_parser("start", help="run the daemon (foreground)") sp = sub.add_parser("start", help="run the daemon (foreground)")
sp.add_argument("--mode", choices=("listen", "ptt"), help="override input mode") sp.add_argument("--mode", choices=("listen", "ptt"), help="override input mode")
sp.add_argument("--skip-audio-check", action="store_true", sp.add_argument("--check", action="store_true",
help="skip the pre-listen mic check") help="run a mic check before listening (off by default)")
sp.set_defaults(func=cmd_start) sp.set_defaults(func=cmd_start)
sub.add_parser("stop", help="stop a running daemon").set_defaults(func=cmd_stop) sub.add_parser("stop", help="stop a running daemon").set_defaults(func=cmd_stop)

View File

@ -44,11 +44,12 @@ class Config:
samplerate: int samplerate: int
channels: int channels: int
silence_threshold: float silence_threshold: float
silence_duration: float vad_silence_ms: int
vad_max_seconds: float
min_utterance: float min_utterance: float
max_utterance: float
type_autosend: bool type_autosend: bool
match_threshold: float wake_fuzzy_threshold: float
command_fuzzy_threshold: float
filler_words: tuple[str, ...] filler_words: tuple[str, ...]
auto_target: bool auto_target: bool
print_heard: bool print_heard: bool
@ -98,7 +99,7 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
if mode not in _VALID_MODES: if mode not in _VALID_MODES:
raise ConfigError(f"[input].mode must be one of {_VALID_MODES}, got {mode!r}") raise ConfigError(f"[input].mode must be one of {_VALID_MODES}, got {mode!r}")
model = _require(raw, "stt", "model", (str,), "small") model = _require(raw, "stt", "model", (str,), "medium")
if model not in _VALID_MODELS: if model not in _VALID_MODELS:
log.warning("unknown stt model %r — passing through to faster-whisper", model) log.warning("unknown stt model %r — passing through to faster-whisper", model)
@ -113,19 +114,25 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
samplerate=int(_require(raw, "audio", "samplerate", (int,), 16000)), samplerate=int(_require(raw, "audio", "samplerate", (int,), 16000)),
channels=int(_require(raw, "audio", "channels", (int,), 1)), channels=int(_require(raw, "audio", "channels", (int,), 1)),
silence_threshold=float(_require(raw, "audio", "silence_threshold", (int, float), 0.012)), silence_threshold=float(_require(raw, "audio", "silence_threshold", (int, float), 0.012)),
silence_duration=float(_require(raw, "audio", "silence_duration", (int, float), 0.8)), vad_silence_ms=int(_require(raw, "vad", "silence_ms", (int,), 800)),
vad_max_seconds=float(_require(raw, "vad", "max_seconds", (int, float), 10.0)),
min_utterance=float(_require(raw, "audio", "min_utterance", (int, float), 0.3)), min_utterance=float(_require(raw, "audio", "min_utterance", (int, float), 0.3)),
max_utterance=float(_require(raw, "audio", "max_utterance", (int, float), 15.0)),
type_autosend=bool(_require(raw, "behavior", "type_autosend", (bool,), False)), type_autosend=bool(_require(raw, "behavior", "type_autosend", (bool,), False)),
match_threshold=float(_require(raw, "behavior", "match_threshold", (int, float), 0.8)), wake_fuzzy_threshold=float(_require(raw, "behavior", "wake_fuzzy_threshold", (int, float), 0.6)),
command_fuzzy_threshold=float(_require(raw, "behavior", "command_fuzzy_threshold",
(int, float), 0.8)),
filler_words=tuple(_require(raw, "behavior", "filler_words", (list,), filler_words=tuple(_require(raw, "behavior", "filler_words", (list,),
["select", "use", "choose"])), ["select", "use", "choose"])),
auto_target=bool(_require(raw, "behavior", "auto_target", (bool,), False)), auto_target=bool(_require(raw, "behavior", "auto_target", (bool,), False)),
print_heard=bool(_require(raw, "behavior", "print_heard", (bool,), False)), print_heard=bool(_require(raw, "behavior", "print_heard", (bool,), False)),
source_path=path, source_path=path,
) )
if not 0.0 < cfg.match_threshold <= 1.0: for label, val in (("wake_fuzzy_threshold", cfg.wake_fuzzy_threshold),
raise ConfigError("[behavior].match_threshold must be in (0, 1]") ("command_fuzzy_threshold", cfg.command_fuzzy_threshold)):
if not 0.0 < val <= 1.0:
raise ConfigError(f"[behavior].{label} must be in (0, 1]")
if cfg.vad_silence_ms <= 0 or cfg.vad_max_seconds <= 0:
raise ConfigError("[vad].silence_ms and max_seconds must be positive")
if cfg.samplerate <= 0 or cfg.channels <= 0: if cfg.samplerate <= 0 or cfg.channels <= 0:
raise ConfigError("[audio].samplerate and channels must be positive") raise ConfigError("[audio].samplerate and channels must be positive")
return cfg return cfg

View File

@ -136,6 +136,7 @@ class Daemon:
model=cfg.stt_model, language=cfg.stt_language, model=cfg.stt_model, language=cfg.stt_language,
device=cfg.stt_compute if cfg.stt_compute in ("cpu", "cuda") else "auto", device=cfg.stt_compute if cfg.stt_compute in ("cpu", "cuda") else "auto",
compute_type="auto", compute_type="auto",
initial_prompt=grammar.initial_prompt(cfg.wake_phrases),
) )
if audio.warm_up(cfg.samplerate, cfg.channels, self._device): if audio.warm_up(cfg.samplerate, cfg.channels, self._device):
log.info("mic warmed up (source live)") log.info("mic warmed up (source live)")
@ -151,20 +152,20 @@ class Daemon:
return audio.record_while( return audio.record_while(
cfg.samplerate, cfg.channels, self._device, cfg.samplerate, cfg.channels, self._device,
held=lambda: not self._ptt.wait_press(self.stopped), held=lambda: not self._ptt.wait_press(self.stopped),
max_utterance=cfg.max_utterance, min_utterance=cfg.min_utterance, max_utterance=cfg.vad_max_seconds, min_utterance=cfg.min_utterance,
) )
return audio.record_until_silence( return audio.record_until_silence(
cfg.samplerate, cfg.channels, self._device, cfg.samplerate, cfg.channels, self._device,
silence_threshold=cfg.silence_threshold, silence_duration=cfg.silence_duration, silence_threshold=cfg.silence_threshold, silence_duration=cfg.vad_silence_ms / 1000.0,
min_utterance=cfg.min_utterance, max_utterance=cfg.max_utterance, min_utterance=cfg.min_utterance, max_utterance=cfg.vad_max_seconds,
stop=self.stopped, stop=self.stopped,
) )
def _handle(self, transcript: str) -> None: def _handle(self, transcript: str) -> None:
cfg = self.config cfg = self.config
require_wake = self.mode == "listen" require_wake = self.mode == "listen"
parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.match_threshold, require_wake, parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.wake_fuzzy_threshold,
filler=cfg.filler_words) cfg.command_fuzzy_threshold, require_wake, filler=cfg.filler_words)
if parsed is None or parsed.action is None: if parsed is None or parsed.action is None:
self._console.emit(VOICE, f'heard "{transcript}" -> no command matched', "yellow") self._console.emit(VOICE, f'heard "{transcript}" -> no command matched', "yellow")
return return
@ -192,6 +193,9 @@ class Daemon:
sessions = target.list_sessions() sessions = target.list_sessions()
self._console.emit(SYSTEM, "list -> " + (", ".join(sessions) if sessions else "(none running)")) self._console.emit(SYSTEM, "list -> " + (", ".join(sessions) if sessions else "(none running)"))
return return
if action.name == "debug":
self._console.emit(VOICE, f'debug: "{action.arg}"', "yellow")
return
session, reason = target.resolve(parsed.one_shot, auto_target=cfg.auto_target) session, reason = target.resolve(parsed.one_shot, auto_target=cfg.auto_target)
if session is None: if session is None:
@ -257,7 +261,8 @@ class Daemon:
invariant: non-command speech is discarded, never recorded. invariant: non-command speech is discarded, never recorded.
""" """
cfg = self.config cfg = self.config
return grammar.strip_wake(transcript, cfg.wake_phrases, cfg.match_threshold, True) is not None return grammar.strip_wake(transcript, cfg.wake_phrases,
cfg.wake_fuzzy_threshold, True) is not None
def _print_startup(self) -> None: def _print_startup(self) -> None:
cfg = self.config cfg = self.config

View File

@ -33,11 +33,33 @@ _COUNT_WORDS = {
"sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20,
} }
_YES_VERBS = ("yes", "yeah", "yep", "yup")
_NO_VERBS = ("no", "nope", "nah")
_APPROVE_VERBS = ("approve", "allow")
_DENY_VERBS = ("deny", "reject")
_SUBMIT_VERBS = ("send", "enter", "submit")
_CANCEL_VERBS = ("cancel", "escape")
_TYPE_VERBS = ("type", "dictate", "write")
_BACKSPACE_VERBS = ("backspace", "delete")
_SPACE_VERBS = ("space", "spacebar")
_ADD_VERBS = ("add", "insert")
_ERASE_VERBS = ("erase", "clear", "wipe")
_DEBUG_VERBS = ("debug", "echo")
_MODE_VERBS = ("mode",)
_STICKY_VERBS = ("set", "sticky", "switch") _STICKY_VERBS = ("set", "sticky", "switch")
_ONESHOT_VERBS = ("target",) _ONESHOT_VERBS = ("target",)
_UNSET_VERBS = ("unset", "unsticky") _UNSET_VERBS = ("unset", "unsticky")
_LIST_VERBS = ("list", "sessions") _LIST_VERBS = ("list", "sessions")
_SELECT_VERBS = ("select", "option", "choose", "number") _SELECT_VERBS = ("select", "option", "choose", "number")
# every command/synonym word, for biasing the STT toward the vocabulary we expect.
_COMMAND_WORDS = (
_YES_VERBS + _NO_VERBS + _APPROVE_VERBS + _DENY_VERBS + _SUBMIT_VERBS
+ _CANCEL_VERBS + _TYPE_VERBS + _BACKSPACE_VERBS + _SPACE_VERBS + _ADD_VERBS
+ _ERASE_VERBS + _DEBUG_VERBS + _MODE_VERBS + _STICKY_VERBS + _ONESHOT_VERBS + _UNSET_VERBS
+ _LIST_VERBS + _SELECT_VERBS + ("ptt", "listen")
+ ("one", "two", "three", "four")
)
DEFAULT_FILLER = ("select", "use", "choose") DEFAULT_FILLER = ("select", "use", "choose")
@ -79,6 +101,26 @@ def normalize(text: str) -> str:
return " ".join(tokens) return " ".join(tokens)
def vocabulary(wake_phrases: list[str]) -> list[str]:
"""the wake + command vocabulary, deduped in first-seen order.
single source for biasing the STT: the same wake phrases the matcher uses plus
every command/synonym word in _COMMAND_WORDS. no separate hardcoded copy.
"""
seen: dict[str, None] = {}
for word in list(wake_phrases) + list(_COMMAND_WORDS):
key = word.strip()
if key and key not in seen:
seen[key] = None
return list(seen)
def initial_prompt(wake_phrases: list[str]) -> str:
"""a comma-joined vocabulary string to pass faster-whisper as initial_prompt,
conditioning transcription toward the words we expect (esp. the coined wake)"""
return ", ".join(vocabulary(wake_phrases))
def _ratio(a: str, b: str) -> float: def _ratio(a: str, b: str) -> float:
return SequenceMatcher(None, a, b).ratio() return SequenceMatcher(None, a, b).ratio()
@ -163,34 +205,42 @@ def match_command(remainder: str, threshold: float) -> Action | None:
if head in _INDEX_WORDS: if head in _INDEX_WORDS:
return Action("select", _INDEX_WORDS[head]) return Action("select", _INDEX_WORDS[head])
if _fuzzy_in(head, ("yes", "yeah", "yep", "yup"), threshold): if _fuzzy_in(head, _YES_VERBS, threshold):
return Action("yes") return Action("yes")
if _fuzzy_in(head, ("no", "nope", "nah"), threshold): if _fuzzy_in(head, _NO_VERBS, threshold):
return Action("no") return Action("no")
if _fuzzy_in(head, ("approve", "allow"), threshold): if _fuzzy_in(head, _APPROVE_VERBS, threshold):
return Action("approve") return Action("approve")
if _fuzzy_in(head, ("deny", "reject"), threshold): if _fuzzy_in(head, _DENY_VERBS, threshold):
return Action("deny") return Action("deny")
if _fuzzy_in(head, ("send", "enter", "submit"), threshold): if _fuzzy_in(head, _SUBMIT_VERBS, threshold):
return Action("submit") return Action("submit")
if _fuzzy_in(head, ("cancel", "escape"), threshold): if _fuzzy_in(head, _CANCEL_VERBS, threshold):
return Action("cancel") return Action("cancel")
if _fuzzy_in(head, _SELECT_VERBS, threshold) and rest and rest[0] in _INDEX_WORDS: if _fuzzy_in(head, _SELECT_VERBS, threshold) and rest and rest[0] in _INDEX_WORDS:
return Action("select", _INDEX_WORDS[rest[0]]) return Action("select", _INDEX_WORDS[rest[0]])
if _fuzzy_in(head, ("type", "dictate", "write"), threshold): if _fuzzy_in(head, _TYPE_VERBS, threshold):
text = " ".join(rest).strip() text = " ".join(rest).strip()
return Action("type", text) if text else None return Action("type", text) if text else None
if _fuzzy_in(head, ("backspace", "delete"), threshold): if _fuzzy_in(head, _BACKSPACE_VERBS, threshold):
return Action("backspace", _leading_count(rest, default=1)) return Action("backspace", _leading_count(rest, default=1))
if _fuzzy_in(head, ("space",), threshold): if _fuzzy_in(head, _SPACE_VERBS, threshold):
return Action("space", _leading_count(rest, default=1)) return Action("space", _leading_count(rest, default=1))
if _fuzzy_in(head, ("erase", "clear", "wipe"), threshold): if _fuzzy_in(head, _ADD_VERBS, threshold) and rest:
tail = [t for t in rest if t not in ("a", "an")]
if any(_fuzzy_in(t, ("space", "spaces"), threshold) for t in tail):
count = next((int(t) for t in tail if t.isdigit()),
next((_COUNT_WORDS[t] for t in tail if t in _COUNT_WORDS), 1))
return Action("space", count)
if _fuzzy_in(head, _ERASE_VERBS, threshold):
return Action("erase") return Action("erase")
if _fuzzy_in(head, _DEBUG_VERBS, threshold):
return Action("debug", " ".join(rest).strip())
if _fuzzy_in(head, ("mode",), threshold) and rest: if _fuzzy_in(head, _MODE_VERBS, threshold) and rest:
if _fuzzy_in(rest[0], ("ptt",), threshold) or "push" in rest[0]: if _fuzzy_in(rest[0], ("ptt",), threshold) or "push" in rest[0]:
return Action("mode", "ptt") return Action("mode", "ptt")
if _fuzzy_in(rest[0], ("listen",), threshold): if _fuzzy_in(rest[0], ("listen",), threshold):
@ -221,24 +271,28 @@ def _strip_filler(tokens: list[str], filler: tuple[str, ...], threshold: float)
return tokens return tokens
def parse(transcript: str, wake_phrases: list[str], threshold: float, def parse(transcript: str, wake_phrases: list[str], wake_threshold: float,
require_wake: bool, filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None: command_threshold: float, require_wake: bool,
filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None:
"""full parse: wake gate -> optional one-shot target -> filler -> command. """full parse: wake gate -> optional one-shot target -> filler -> command.
returns a ParsedCommand (one_shot, action), or None if the wake gate dropped the wake_threshold gates the wake phrase (lenient a false wake is cheap, it just
utterance (listen mode, no wake phrase). a ParsedCommand with action=None means a finds no command); command_threshold gates the command words (stricter a false
wake phrase was present but no command matched. command fires the wrong action). returns a ParsedCommand (one_shot, action), or
None if the wake gate dropped the utterance (listen mode, no wake phrase). a
ParsedCommand with action=None means a wake phrase was present but no command
matched.
""" """
remainder = strip_wake(transcript, wake_phrases, threshold, require_wake) remainder = strip_wake(transcript, wake_phrases, wake_threshold, require_wake)
if remainder is None: if remainder is None:
return None return None
tokens = remainder.split(" ") if remainder else [] tokens = remainder.split(" ") if remainder else []
one_shot: str | None = None one_shot: str | None = None
if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, threshold) and len(tokens) >= 2: if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, command_threshold) and len(tokens) >= 2:
one_shot = tokens[1] one_shot = tokens[1]
tokens = tokens[2:] tokens = tokens[2:]
tokens = _strip_filler(tokens, filler, threshold) tokens = _strip_filler(tokens, filler, command_threshold)
action = match_command(" ".join(tokens), threshold) action = match_command(" ".join(tokens), command_threshold)
return ParsedCommand(one_shot=one_shot, action=action) return ParsedCommand(one_shot=one_shot, action=action)

View File

@ -82,8 +82,9 @@ class Transcriber:
"""a loaded faster-whisper model that transcribes float32 mono audio chunks""" """a loaded faster-whisper model that transcribes float32 mono audio chunks"""
def __init__(self, model: str = "small", language: str = "en", device: str = "auto", def __init__(self, model: str = "small", language: str = "en", device: str = "auto",
compute_type: str = "auto") -> None: compute_type: str = "auto", initial_prompt: str | None = None) -> None:
self.language = language self.language = language
self.initial_prompt = initial_prompt
self._model = self._load(model, device, compute_type) self._model = self._load(model, device, compute_type)
self._warm() self._warm()
@ -120,6 +121,7 @@ class Transcriber:
beam_size=1, beam_size=1,
vad_filter=True, vad_filter=True,
condition_on_previous_text=False, condition_on_previous_text=False,
initial_prompt=self.initial_prompt,
) )
text = " ".join(seg.text for seg in segments).strip() text = " ".join(seg.text for seg in segments).strip()
return text return text