Compare commits

..

No commits in common. "4abdfd56bc69eb14bbc4620a52095c45004bf0b8" and "d96dc3898f8c465b993909d367b48787576fe9c4" have entirely different histories.

11 changed files with 84 additions and 195 deletions

1
.gitignore vendored
View File

@ -1,5 +1,4 @@
CLAUDE.md CLAUDE.md
COMPACT.md
__pycache__/ __pycache__/
*.pyc *.pyc

View File

@ -11,7 +11,7 @@ hands-free while another window (a game) is focused.
It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
Claude Code over tmux — fully local and private. You run it in a terminal you watch. Claude Code over tmux — fully local, private, backgroundable.
## How it works ## How it works
@ -20,11 +20,9 @@ mic (WSLg/PulseAudio RDPSource)
-> sounddevice capture -> sounddevice capture
-> faster-whisper (local STT, on-device) -> faster-whisper (local STT, on-device)
-> wake gate: utterance must start with a wake phrase, else DISCARD locally -> wake gate: utterance must start with a wake phrase, else DISCARD locally
-> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/ -> grammar match (yes/no/one..four/approve/deny/send/type/mode/set/target/cancel)
mode/set/target/unset/list/cancel) -> resolve target session (~/.claude-active)
-> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
-> tmux send-keys -t <session> "<keys>" -> tmux send-keys -t <session> "<keys>"
-> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
``` ```
**Privacy by construction.** STT runs on-device. In listen mode, any speech that **Privacy by construction.** STT runs on-device. In listen mode, any speech that
@ -64,19 +62,16 @@ claudedo test-audio
## Usage ## Usage
**Run it in a terminal you watch — that's the product.** You launch `claudedo **Run it in a terminal you watch — that's the product.** You launch `claudedo
start` and it drops into a visible listen loop (pass `--check` to run a mic check start`, it does a quick mic check, then drops into a visible listen loop that prints
first). Each utterance prints a timestamped, colored line — `HH:MM:SS [claude-libs] `heard → matched → sent` for every utterance. That terminal is your
heard "…" → recognition/action console; you attach to the `claude-<name>` session in another pane
typed 'fix'` (green for injected, red for drops, `[SYSTEM]`/`[VOICE]` for state and to watch the keystrokes land. There is no backgrounding/daemon mode — the whole point
recognition). That terminal is your recognition/action console; you attach to the is the console you read.
`claude-<name>` session in another pane to watch the keystrokes land. It runs in the
foreground by design — the console is the point — though `claudedo stop` can signal a
stray instance.
```bash ```bash
claudedo start # the visible listen loop (listen mode default; no mic check) claudedo start # mic-check, then the visible listen loop (listen mode default)
claudedo start --check # run a mic check before listening
claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes) claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes)
claudedo start --skip-audio-check # skip the pre-listen mic check
claudedo status # running? mode? target session? claudedo status # running? mode? target session?
claudedo stop # stop a running daemon claudedo stop # stop a running daemon
claudedo set <name> # set the sticky target -> claude-<name> (alias: switch) claudedo set <name> # set the sticky target -> claude-<name> (alias: switch)
@ -102,12 +97,9 @@ Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".
## Command grammar ## Command grammar
Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**, Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
**"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has "claudedo" is a coined word, so the matcher is lenient (accepts "claude do",
no token for the coined word "claudedo" and renders it as real words ("claude do"), "clauddo", "cloud do", …). In PTT mode the wake phrase is optional.
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
the wake phrase is optional.
| Say | Does | | Say | Does |
|---|---| |---|---|
@ -116,10 +108,9 @@ the wake phrase is optional.
| `approve` / `deny` | allow / deny a permission prompt | | `approve` / `deny` | allow / deny a permission prompt |
| `send` / `enter` | submit (Enter) | | `send` / `enter` | submit (Enter) |
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") | | `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
| `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) | | `space [<n>]` | insert n spaces (default 1) |
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary | | `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary | | `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
| `mode ptt` / `mode listen` | switch input mode | | `mode ptt` / `mode listen` | switch input mode |
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) | | `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged | | `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
@ -130,10 +121,8 @@ the wake phrase is optional.
Optional filler (`select` / `use` / `choose`) may precede any command and is ignored: Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.) `select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)
When no sticky target is set, a bare command does nothing and asks you to `set` one When no sticky target is set, a bare command auto-targets the **only** running
(the default). Set `auto_target = true` to instead auto-use the single running `claude-*` session; if several are running it does nothing and asks you to `set` one.
`claude-*` session when there's exactly one; with several running it always does
nothing and asks you to `set` one.
Number words are normalized to digits before matching ("one"/"won" → 1). Number words are normalized to digits before matching ("one"/"won" → 1).
@ -146,9 +135,9 @@ A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
default — it routes a single command and the next bare command reverts to sticky. default — it routes a single command and the next bare command reverts to sticky.
Resolution order (one place — `target.resolve()`): one-shot if present → Resolution order (one place — `target.resolve()`): one-shot if present →
sticky if set and the session exists → else, only if `auto_target = true`, the single sticky if set and the session exists → else the only running `claude-*` session →
running `claude-*` session → else (default, or zero/several sessions) do nothing and else (zero or several) do nothing and say so. It never guesses, and never injects
say so. It never guesses, and never injects into a nonexistent session. into a nonexistent session.
Every name maps to `claude-<name>` through one helper (`target.session_name()`), and Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
@ -185,29 +174,19 @@ If Claude Code changes its prompt UI, re-confirm against a live session and upda
## Config ## Config
Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]` key, Whisper model/language/device, audio segmentation thresholds, and `[behavior]`
(`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`). (`type_autosend`, `filler_words`, `auto_target`, `print_heard`). The default model is
The default model is **`medium`** (best accuracy for the coined wake word on a strong `small`; bump to `medium` if the coined wake word is recognized poorly. `claudedo -c
CPU); `small` is faster/less accurate, `large-v3` most accurate. `claudedo -c <path> <path> ...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`,
...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`,
`~/.config/claudedo/config.toml`, then `./config.toml`. `~/.config/claudedo/config.toml`, then `./config.toml`.
- **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the - **`auto_target`** (default `false`): with no sticky target set and exactly one
configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`), `claude-*` session running, `false` makes a bare command do nothing and ask you to
so Whisper is conditioned to expect "claudedo" and the command words. `set` one; `true` auto-targets that single session.
- **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.6`, lenient) vs - **`print_heard`** (default `false`, debug): prints non-wake transcripts to the
`command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a console so you can see how Whisper renders your wake word. Turn it on to debug
false *wake* is cheap (it wakes, finds no command, does nothing), but a false detection, then off. Whisper has no token for "claudedo" — it commonly emits
*command* fires the wrong action. Prefer expanding command synonyms over loosening "claude do" or "claude due", both of which are in the default wake list.
the command threshold.
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
(default 800) of trailing silence — Alexa-style record-until-pause — capped at
`max_seconds` (default 10). The pause both ends a command and separates it from
following chatter (the chatter is a separate capture the wake gate discards).
- **`auto_target`** (default `false`): with no sticky target and one session running,
`false` does nothing and asks you to `set`; `true` auto-uses that session.
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
see how Whisper renders your wake word, then tune the wake list/threshold.
## Requirements ## Requirements

View File

@ -5,7 +5,7 @@
# wake phrases for listen mode. fuzzy-matched: case/space-insensitive, lenient on # wake phrases for listen mode. fuzzy-matched: case/space-insensitive, lenient on
# the coined word "claudedo" (whisper renders it inconsistently). number words are # the coined word "claudedo" (whisper renders it inconsistently). number words are
# normalized to digits before command matching. # normalized to digits before command matching.
phrases = ["claudedo", "claude do", "hey claude", "ok claude", "okay claude"] phrases = ["claudedo", "claude do", "claude due", "hey claude", "ok claude", "okay claude"]
[input] [input]
# "listen" (default): continuous capture; only acts on utterances that start with a # "listen" (default): continuous capture; only acts on utterances that start with a
@ -21,10 +21,10 @@ mode = "listen"
ptt_key = "space" ptt_key = "space"
[stt] [stt]
# faster-whisper model size. "medium" is the default — biggest accuracy gain for the # faster-whisper model size. "small" is a good accuracy/latency balance for the
# coined wake word ("claudedo" / "claude do") and fine on a strong cpu. "small" is # short command grammar (~sub-second per chunk on a strong cpu). if the coined wake
# faster but less accurate; "large-v3" is most accurate if medium still struggles. # word "claudedo" is recognized poorly, bump to "medium" (slower per chunk).
model = "medium" model = "small"
language = "en" language = "en"
# mic device: "auto", or a sounddevice device index (integer) / substring of a # mic device: "auto", or a sounddevice device index (integer) / substring of a
# device name. run `claudedo test-audio` to list devices. # device name. run `claudedo test-audio` to list devices.
@ -36,30 +36,21 @@ compute = "auto"
# capture parameters. 16 kHz mono is what whisper expects. # capture parameters. 16 kHz mono is what whisper expects.
samplerate = 16000 samplerate = 16000
channels = 1 channels = 1
# rms energy below this counts as silence (the VAD onset/endpoint floor). # listen-mode silence segmentation: an utterance ends after this many seconds below
# the rms threshold. keeps latency low without streaming.
silence_threshold = 0.012 silence_threshold = 0.012
silence_duration = 0.8
# ignore utterances shorter than this (clicks, coughs). # ignore utterances shorter than this (clicks, coughs).
min_utterance = 0.3 min_utterance = 0.3
# hard cap on a single utterance so a stuck stream can't grow unbounded.
[vad] max_utterance = 15.0
# Alexa-style record-until-pause endpointing (listen mode). capture starts on speech
# onset and ends after this much trailing silence — the natural end of an utterance.
# a real pause both ends the command AND separates it from following chatter (the
# chatter becomes a separate capture that the wake gate then discards).
silence_ms = 800
# hard cap so continuous noise can't record forever.
max_seconds = 10.0
[behavior] [behavior]
# dictation never auto-submits: "type <phrase>" inserts literal text only; you say # dictation never auto-submits: "type <phrase>" inserts literal text only; you say
# "send" separately to submit (read-before-send). # "send" separately to submit (read-before-send).
type_autosend = false type_autosend = false
# fuzzy match ratios (0..1). the asymmetry is deliberate: a false WAKE is cheap (it # fuzzy match ratio (0..1) required to accept a wake phrase / command token.
# wakes, finds no command, does nothing), so wake is lenient; a false COMMAND fires match_threshold = 0.8
# the WRONG action, so commands stay tight. lower = more lenient = more matches.
# prefer expanding command synonyms over loosening command_fuzzy_threshold.
wake_fuzzy_threshold = 0.6
command_fuzzy_threshold = 0.8
# optional filler words that may precede a command and are ignored for matching: # optional filler words that may precede a command and are ignored for matching:
# "select yes" / "use yes" behave like "yes". (a filler word followed by a digit is # "select yes" / "use yes" behave like "yes". (a filler word followed by a digit is
# the select command, e.g. "select 1", and is not dropped.) # the select command, e.g. "select 1", and is not dropped.)

View File

@ -94,18 +94,6 @@ mkdir -p "$CONF_DIR"
install -m 0644 "$REPO_DIR/shell/cc.sh" "$CONF_DIR/cc.sh" install -m 0644 "$REPO_DIR/shell/cc.sh" "$CONF_DIR/cc.sh"
echo " wrote $CONF_DIR/cc.sh" echo " wrote $CONF_DIR/cc.sh"
# install config.toml to the standard location so the daemon finds it from any dir.
# never clobber an edited user config: copy only if absent, else drop a .new to diff.
if [ ! -f "$CONF_DIR/config.toml" ]; then
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"
echo " wrote $CONF_DIR/config.toml"
elif ! cmp -s "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"; then
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml.new"
echo " kept your $CONF_DIR/config.toml; new default written to config.toml.new (diff to merge)"
else
echo " $CONF_DIR/config.toml already current"
fi
# wire EVERY rc that exists (the user may have both zsh and bash). # wire EVERY rc that exists (the user may have both zsh and bash).
wired_any=0 wired_any=0
for RC in "$HOME/.zshrc" "$HOME/.bashrc"; do for RC in "$HOME/.zshrc" "$HOME/.bashrc"; do

View File

@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
name = "claudedo" name = "claudedo"
version = "0.1.3" version = "0.1.2"
description = "voice-control daemon for claude code (local STT -> tmux send-keys)" description = "voice-control daemon for claude code (local STT -> tmux send-keys)"
readme = "README.md" readme = "README.md"
requires-python = ">=3.10" requires-python = ">=3.10"

View File

@ -1,3 +1,3 @@
"""claudedo — voice-control daemon for claude code (local STT -> tmux send-keys)""" """claudedo — voice-control daemon for claude code (local STT -> tmux send-keys)"""
__version__ = "0.1.3" __version__ = "0.1.2"

View File

@ -33,12 +33,12 @@ def cmd_start(args: argparse.Namespace) -> int:
config = _load_or_die(args.config) config = _load_or_die(args.config)
if args.mode: if args.mode:
config.mode = args.mode config.mode = args.mode
if args.check: if not args.skip_audio_check:
print("checking mic before listening (speak briefly) ...") print("checking mic before listening (speak briefly) ...")
peak = _probe_mic(config, seconds=2.0, verbose=False) peak = _probe_mic(config, seconds=2.0, verbose=False)
if peak is None or peak < 0.02: if peak is None or peak < 0.02:
print("mic check failed — no usable input.", file=sys.stderr) print("mic check failed — no usable input.", file=sys.stderr)
print("run `claudedo test-audio` to debug, or `claudedo start` to skip the check", print("run `claudedo test-audio` to debug; or `claudedo start --skip-audio-check`",
file=sys.stderr) file=sys.stderr)
return 1 return 1
print(f"mic OK (peak {peak:.3f}).") print(f"mic OK (peak {peak:.3f}).")
@ -217,8 +217,8 @@ def build_parser() -> argparse.ArgumentParser:
sp = sub.add_parser("start", help="run the daemon (foreground)") sp = sub.add_parser("start", help="run the daemon (foreground)")
sp.add_argument("--mode", choices=("listen", "ptt"), help="override input mode") sp.add_argument("--mode", choices=("listen", "ptt"), help="override input mode")
sp.add_argument("--check", action="store_true", sp.add_argument("--skip-audio-check", action="store_true",
help="run a mic check before listening (off by default)") help="skip the pre-listen mic check")
sp.set_defaults(func=cmd_start) sp.set_defaults(func=cmd_start)
sub.add_parser("stop", help="stop a running daemon").set_defaults(func=cmd_stop) sub.add_parser("stop", help="stop a running daemon").set_defaults(func=cmd_stop)

View File

@ -44,12 +44,11 @@ class Config:
samplerate: int samplerate: int
channels: int channels: int
silence_threshold: float silence_threshold: float
vad_silence_ms: int silence_duration: float
vad_max_seconds: float
min_utterance: float min_utterance: float
max_utterance: float
type_autosend: bool type_autosend: bool
wake_fuzzy_threshold: float match_threshold: float
command_fuzzy_threshold: float
filler_words: tuple[str, ...] filler_words: tuple[str, ...]
auto_target: bool auto_target: bool
print_heard: bool print_heard: bool
@ -99,7 +98,7 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
if mode not in _VALID_MODES: if mode not in _VALID_MODES:
raise ConfigError(f"[input].mode must be one of {_VALID_MODES}, got {mode!r}") raise ConfigError(f"[input].mode must be one of {_VALID_MODES}, got {mode!r}")
model = _require(raw, "stt", "model", (str,), "medium") model = _require(raw, "stt", "model", (str,), "small")
if model not in _VALID_MODELS: if model not in _VALID_MODELS:
log.warning("unknown stt model %r — passing through to faster-whisper", model) log.warning("unknown stt model %r — passing through to faster-whisper", model)
@ -114,25 +113,19 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
samplerate=int(_require(raw, "audio", "samplerate", (int,), 16000)), samplerate=int(_require(raw, "audio", "samplerate", (int,), 16000)),
channels=int(_require(raw, "audio", "channels", (int,), 1)), channels=int(_require(raw, "audio", "channels", (int,), 1)),
silence_threshold=float(_require(raw, "audio", "silence_threshold", (int, float), 0.012)), silence_threshold=float(_require(raw, "audio", "silence_threshold", (int, float), 0.012)),
vad_silence_ms=int(_require(raw, "vad", "silence_ms", (int,), 800)), silence_duration=float(_require(raw, "audio", "silence_duration", (int, float), 0.8)),
vad_max_seconds=float(_require(raw, "vad", "max_seconds", (int, float), 10.0)),
min_utterance=float(_require(raw, "audio", "min_utterance", (int, float), 0.3)), min_utterance=float(_require(raw, "audio", "min_utterance", (int, float), 0.3)),
max_utterance=float(_require(raw, "audio", "max_utterance", (int, float), 15.0)),
type_autosend=bool(_require(raw, "behavior", "type_autosend", (bool,), False)), type_autosend=bool(_require(raw, "behavior", "type_autosend", (bool,), False)),
wake_fuzzy_threshold=float(_require(raw, "behavior", "wake_fuzzy_threshold", (int, float), 0.6)), match_threshold=float(_require(raw, "behavior", "match_threshold", (int, float), 0.8)),
command_fuzzy_threshold=float(_require(raw, "behavior", "command_fuzzy_threshold",
(int, float), 0.8)),
filler_words=tuple(_require(raw, "behavior", "filler_words", (list,), filler_words=tuple(_require(raw, "behavior", "filler_words", (list,),
["select", "use", "choose"])), ["select", "use", "choose"])),
auto_target=bool(_require(raw, "behavior", "auto_target", (bool,), False)), auto_target=bool(_require(raw, "behavior", "auto_target", (bool,), False)),
print_heard=bool(_require(raw, "behavior", "print_heard", (bool,), False)), print_heard=bool(_require(raw, "behavior", "print_heard", (bool,), False)),
source_path=path, source_path=path,
) )
for label, val in (("wake_fuzzy_threshold", cfg.wake_fuzzy_threshold), if not 0.0 < cfg.match_threshold <= 1.0:
("command_fuzzy_threshold", cfg.command_fuzzy_threshold)): raise ConfigError("[behavior].match_threshold must be in (0, 1]")
if not 0.0 < val <= 1.0:
raise ConfigError(f"[behavior].{label} must be in (0, 1]")
if cfg.vad_silence_ms <= 0 or cfg.vad_max_seconds <= 0:
raise ConfigError("[vad].silence_ms and max_seconds must be positive")
if cfg.samplerate <= 0 or cfg.channels <= 0: if cfg.samplerate <= 0 or cfg.channels <= 0:
raise ConfigError("[audio].samplerate and channels must be positive") raise ConfigError("[audio].samplerate and channels must be positive")
return cfg return cfg

View File

@ -136,7 +136,6 @@ class Daemon:
model=cfg.stt_model, language=cfg.stt_language, model=cfg.stt_model, language=cfg.stt_language,
device=cfg.stt_compute if cfg.stt_compute in ("cpu", "cuda") else "auto", device=cfg.stt_compute if cfg.stt_compute in ("cpu", "cuda") else "auto",
compute_type="auto", compute_type="auto",
initial_prompt=grammar.initial_prompt(cfg.wake_phrases),
) )
if audio.warm_up(cfg.samplerate, cfg.channels, self._device): if audio.warm_up(cfg.samplerate, cfg.channels, self._device):
log.info("mic warmed up (source live)") log.info("mic warmed up (source live)")
@ -152,20 +151,20 @@ class Daemon:
return audio.record_while( return audio.record_while(
cfg.samplerate, cfg.channels, self._device, cfg.samplerate, cfg.channels, self._device,
held=lambda: not self._ptt.wait_press(self.stopped), held=lambda: not self._ptt.wait_press(self.stopped),
max_utterance=cfg.vad_max_seconds, min_utterance=cfg.min_utterance, max_utterance=cfg.max_utterance, min_utterance=cfg.min_utterance,
) )
return audio.record_until_silence( return audio.record_until_silence(
cfg.samplerate, cfg.channels, self._device, cfg.samplerate, cfg.channels, self._device,
silence_threshold=cfg.silence_threshold, silence_duration=cfg.vad_silence_ms / 1000.0, silence_threshold=cfg.silence_threshold, silence_duration=cfg.silence_duration,
min_utterance=cfg.min_utterance, max_utterance=cfg.vad_max_seconds, min_utterance=cfg.min_utterance, max_utterance=cfg.max_utterance,
stop=self.stopped, stop=self.stopped,
) )
def _handle(self, transcript: str) -> None: def _handle(self, transcript: str) -> None:
cfg = self.config cfg = self.config
require_wake = self.mode == "listen" require_wake = self.mode == "listen"
parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.wake_fuzzy_threshold, parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.match_threshold, require_wake,
cfg.command_fuzzy_threshold, require_wake, filler=cfg.filler_words) filler=cfg.filler_words)
if parsed is None or parsed.action is None: if parsed is None or parsed.action is None:
self._console.emit(VOICE, f'heard "{transcript}" -> no command matched', "yellow") self._console.emit(VOICE, f'heard "{transcript}" -> no command matched', "yellow")
return return
@ -193,9 +192,6 @@ class Daemon:
sessions = target.list_sessions() sessions = target.list_sessions()
self._console.emit(SYSTEM, "list -> " + (", ".join(sessions) if sessions else "(none running)")) self._console.emit(SYSTEM, "list -> " + (", ".join(sessions) if sessions else "(none running)"))
return return
if action.name == "debug":
self._console.emit(VOICE, f'debug: "{action.arg}"', "yellow")
return
session, reason = target.resolve(parsed.one_shot, auto_target=cfg.auto_target) session, reason = target.resolve(parsed.one_shot, auto_target=cfg.auto_target)
if session is None: if session is None:
@ -261,8 +257,7 @@ class Daemon:
invariant: non-command speech is discarded, never recorded. invariant: non-command speech is discarded, never recorded.
""" """
cfg = self.config cfg = self.config
return grammar.strip_wake(transcript, cfg.wake_phrases, return grammar.strip_wake(transcript, cfg.wake_phrases, cfg.match_threshold, True) is not None
cfg.wake_fuzzy_threshold, True) is not None
def _print_startup(self) -> None: def _print_startup(self) -> None:
cfg = self.config cfg = self.config

View File

@ -33,33 +33,11 @@ _COUNT_WORDS = {
"sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20,
} }
_YES_VERBS = ("yes", "yeah", "yep", "yup")
_NO_VERBS = ("no", "nope", "nah")
_APPROVE_VERBS = ("approve", "allow")
_DENY_VERBS = ("deny", "reject")
_SUBMIT_VERBS = ("send", "enter", "submit")
_CANCEL_VERBS = ("cancel", "escape")
_TYPE_VERBS = ("type", "dictate", "write")
_BACKSPACE_VERBS = ("backspace", "delete")
_SPACE_VERBS = ("space", "spacebar")
_ADD_VERBS = ("add", "insert")
_ERASE_VERBS = ("erase", "clear", "wipe")
_DEBUG_VERBS = ("debug", "echo")
_MODE_VERBS = ("mode",)
_STICKY_VERBS = ("set", "sticky", "switch") _STICKY_VERBS = ("set", "sticky", "switch")
_ONESHOT_VERBS = ("target",) _ONESHOT_VERBS = ("target",)
_UNSET_VERBS = ("unset", "unsticky") _UNSET_VERBS = ("unset", "unsticky")
_LIST_VERBS = ("list", "sessions") _LIST_VERBS = ("list", "sessions")
_SELECT_VERBS = ("select", "option", "choose", "number") _SELECT_VERBS = ("select", "option", "choose", "number")
# every command/synonym word, for biasing the STT toward the vocabulary we expect.
_COMMAND_WORDS = (
_YES_VERBS + _NO_VERBS + _APPROVE_VERBS + _DENY_VERBS + _SUBMIT_VERBS
+ _CANCEL_VERBS + _TYPE_VERBS + _BACKSPACE_VERBS + _SPACE_VERBS + _ADD_VERBS
+ _ERASE_VERBS + _DEBUG_VERBS + _MODE_VERBS + _STICKY_VERBS + _ONESHOT_VERBS + _UNSET_VERBS
+ _LIST_VERBS + _SELECT_VERBS + ("ptt", "listen")
+ ("one", "two", "three", "four")
)
DEFAULT_FILLER = ("select", "use", "choose") DEFAULT_FILLER = ("select", "use", "choose")
@ -101,26 +79,6 @@ def normalize(text: str) -> str:
return " ".join(tokens) return " ".join(tokens)
def vocabulary(wake_phrases: list[str]) -> list[str]:
"""the wake + command vocabulary, deduped in first-seen order.
single source for biasing the STT: the same wake phrases the matcher uses plus
every command/synonym word in _COMMAND_WORDS. no separate hardcoded copy.
"""
seen: dict[str, None] = {}
for word in list(wake_phrases) + list(_COMMAND_WORDS):
key = word.strip()
if key and key not in seen:
seen[key] = None
return list(seen)
def initial_prompt(wake_phrases: list[str]) -> str:
"""a comma-joined vocabulary string to pass faster-whisper as initial_prompt,
conditioning transcription toward the words we expect (esp. the coined wake)"""
return ", ".join(vocabulary(wake_phrases))
def _ratio(a: str, b: str) -> float: def _ratio(a: str, b: str) -> float:
return SequenceMatcher(None, a, b).ratio() return SequenceMatcher(None, a, b).ratio()
@ -205,42 +163,34 @@ def match_command(remainder: str, threshold: float) -> Action | None:
if head in _INDEX_WORDS: if head in _INDEX_WORDS:
return Action("select", _INDEX_WORDS[head]) return Action("select", _INDEX_WORDS[head])
if _fuzzy_in(head, _YES_VERBS, threshold): if _fuzzy_in(head, ("yes", "yeah", "yep", "yup"), threshold):
return Action("yes") return Action("yes")
if _fuzzy_in(head, _NO_VERBS, threshold): if _fuzzy_in(head, ("no", "nope", "nah"), threshold):
return Action("no") return Action("no")
if _fuzzy_in(head, _APPROVE_VERBS, threshold): if _fuzzy_in(head, ("approve", "allow"), threshold):
return Action("approve") return Action("approve")
if _fuzzy_in(head, _DENY_VERBS, threshold): if _fuzzy_in(head, ("deny", "reject"), threshold):
return Action("deny") return Action("deny")
if _fuzzy_in(head, _SUBMIT_VERBS, threshold): if _fuzzy_in(head, ("send", "enter", "submit"), threshold):
return Action("submit") return Action("submit")
if _fuzzy_in(head, _CANCEL_VERBS, threshold): if _fuzzy_in(head, ("cancel", "escape"), threshold):
return Action("cancel") return Action("cancel")
if _fuzzy_in(head, _SELECT_VERBS, threshold) and rest and rest[0] in _INDEX_WORDS: if _fuzzy_in(head, _SELECT_VERBS, threshold) and rest and rest[0] in _INDEX_WORDS:
return Action("select", _INDEX_WORDS[rest[0]]) return Action("select", _INDEX_WORDS[rest[0]])
if _fuzzy_in(head, _TYPE_VERBS, threshold): if _fuzzy_in(head, ("type", "dictate", "write"), threshold):
text = " ".join(rest).strip() text = " ".join(rest).strip()
return Action("type", text) if text else None return Action("type", text) if text else None
if _fuzzy_in(head, _BACKSPACE_VERBS, threshold): if _fuzzy_in(head, ("backspace", "delete"), threshold):
return Action("backspace", _leading_count(rest, default=1)) return Action("backspace", _leading_count(rest, default=1))
if _fuzzy_in(head, _SPACE_VERBS, threshold): if _fuzzy_in(head, ("space",), threshold):
return Action("space", _leading_count(rest, default=1)) return Action("space", _leading_count(rest, default=1))
if _fuzzy_in(head, _ADD_VERBS, threshold) and rest: if _fuzzy_in(head, ("erase", "clear", "wipe"), threshold):
tail = [t for t in rest if t not in ("a", "an")]
if any(_fuzzy_in(t, ("space", "spaces"), threshold) for t in tail):
count = next((int(t) for t in tail if t.isdigit()),
next((_COUNT_WORDS[t] for t in tail if t in _COUNT_WORDS), 1))
return Action("space", count)
if _fuzzy_in(head, _ERASE_VERBS, threshold):
return Action("erase") return Action("erase")
if _fuzzy_in(head, _DEBUG_VERBS, threshold):
return Action("debug", " ".join(rest).strip())
if _fuzzy_in(head, _MODE_VERBS, threshold) and rest: if _fuzzy_in(head, ("mode",), threshold) and rest:
if _fuzzy_in(rest[0], ("ptt",), threshold) or "push" in rest[0]: if _fuzzy_in(rest[0], ("ptt",), threshold) or "push" in rest[0]:
return Action("mode", "ptt") return Action("mode", "ptt")
if _fuzzy_in(rest[0], ("listen",), threshold): if _fuzzy_in(rest[0], ("listen",), threshold):
@ -271,28 +221,24 @@ def _strip_filler(tokens: list[str], filler: tuple[str, ...], threshold: float)
return tokens return tokens
def parse(transcript: str, wake_phrases: list[str], wake_threshold: float, def parse(transcript: str, wake_phrases: list[str], threshold: float,
command_threshold: float, require_wake: bool, require_wake: bool, filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None:
filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None:
"""full parse: wake gate -> optional one-shot target -> filler -> command. """full parse: wake gate -> optional one-shot target -> filler -> command.
wake_threshold gates the wake phrase (lenient a false wake is cheap, it just returns a ParsedCommand (one_shot, action), or None if the wake gate dropped the
finds no command); command_threshold gates the command words (stricter a false utterance (listen mode, no wake phrase). a ParsedCommand with action=None means a
command fires the wrong action). returns a ParsedCommand (one_shot, action), or wake phrase was present but no command matched.
None if the wake gate dropped the utterance (listen mode, no wake phrase). a
ParsedCommand with action=None means a wake phrase was present but no command
matched.
""" """
remainder = strip_wake(transcript, wake_phrases, wake_threshold, require_wake) remainder = strip_wake(transcript, wake_phrases, threshold, require_wake)
if remainder is None: if remainder is None:
return None return None
tokens = remainder.split(" ") if remainder else [] tokens = remainder.split(" ") if remainder else []
one_shot: str | None = None one_shot: str | None = None
if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, command_threshold) and len(tokens) >= 2: if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, threshold) and len(tokens) >= 2:
one_shot = tokens[1] one_shot = tokens[1]
tokens = tokens[2:] tokens = tokens[2:]
tokens = _strip_filler(tokens, filler, command_threshold) tokens = _strip_filler(tokens, filler, threshold)
action = match_command(" ".join(tokens), command_threshold) action = match_command(" ".join(tokens), threshold)
return ParsedCommand(one_shot=one_shot, action=action) return ParsedCommand(one_shot=one_shot, action=action)

View File

@ -82,9 +82,8 @@ class Transcriber:
"""a loaded faster-whisper model that transcribes float32 mono audio chunks""" """a loaded faster-whisper model that transcribes float32 mono audio chunks"""
def __init__(self, model: str = "small", language: str = "en", device: str = "auto", def __init__(self, model: str = "small", language: str = "en", device: str = "auto",
compute_type: str = "auto", initial_prompt: str | None = None) -> None: compute_type: str = "auto") -> None:
self.language = language self.language = language
self.initial_prompt = initial_prompt
self._model = self._load(model, device, compute_type) self._model = self._load(model, device, compute_type)
self._warm() self._warm()
@ -121,7 +120,6 @@ class Transcriber:
beam_size=1, beam_size=1,
vad_filter=True, vad_filter=True,
condition_on_previous_text=False, condition_on_previous_text=False,
initial_prompt=self.initial_prompt,
) )
text = " ".join(seg.text for seg in segments).strip() text = " ".join(seg.text for seg in segments).strip()
return text return text