Compare commits
6 Commits
d96dc3898f
...
4abdfd56bc
| Author | SHA1 | Date | |
|---|---|---|---|
| 4abdfd56bc | |||
| e6dadab143 | |||
| 5064f912a4 | |||
| a51c2fbdd4 | |||
| bd6597352a | |||
| 08bbe3ce58 |
1
.gitignore
vendored
1
.gitignore
vendored
@ -1,4 +1,5 @@
|
||||
CLAUDE.md
|
||||
COMPACT.md
|
||||
|
||||
__pycache__/
|
||||
*.pyc
|
||||
|
||||
81
README.md
81
README.md
@ -11,7 +11,7 @@ hands-free while another window (a game) is focused.
|
||||
It exists because Claude Code's native `/voice` is hardcoded-blocked in WSL (it
|
||||
assumes WSL has no audio). Modern WSL2 + WSLg *does* have working mic input via
|
||||
PulseAudio/RDP. `claudedo` captures the mic itself, transcribes on-device, and drives
|
||||
Claude Code over tmux — fully local, private, backgroundable.
|
||||
Claude Code over tmux — fully local and private. You run it in a terminal you watch.
|
||||
|
||||
## How it works
|
||||
|
||||
@ -20,9 +20,11 @@ mic (WSLg/PulseAudio RDPSource)
|
||||
-> sounddevice capture
|
||||
-> faster-whisper (local STT, on-device)
|
||||
-> wake gate: utterance must start with a wake phrase, else DISCARD locally
|
||||
-> grammar match (yes/no/one..four/approve/deny/send/type/mode/set/target/cancel)
|
||||
-> resolve target session (~/.claude-active)
|
||||
-> grammar match (yes/no/one..four/approve/deny/send/type/space/backspace/erase/
|
||||
mode/set/target/unset/list/cancel)
|
||||
-> resolve target session (one-shot > sticky ~/.claude-active > auto/none)
|
||||
-> tmux send-keys -t <session> "<keys>"
|
||||
-> log the action to the watched terminal ([session]/[SYSTEM]/[VOICE], colored)
|
||||
```
|
||||
|
||||
**Privacy by construction.** STT runs on-device. In listen mode, any speech that
|
||||
@ -62,16 +64,19 @@ claudedo test-audio
|
||||
## Usage
|
||||
|
||||
**Run it in a terminal you watch — that's the product.** You launch `claudedo
|
||||
start`, it does a quick mic check, then drops into a visible listen loop that prints
|
||||
`heard → matched → sent` for every utterance. That terminal is your
|
||||
recognition/action console; you attach to the `claude-<name>` session in another pane
|
||||
to watch the keystrokes land. There is no backgrounding/daemon mode — the whole point
|
||||
is the console you read.
|
||||
start` and it drops into a visible listen loop (pass `--check` to run a mic check
|
||||
first). Each utterance prints a timestamped, colored line — `HH:MM:SS [claude-libs]
|
||||
heard "…" →
|
||||
typed 'fix'` (green for injected, red for drops, `[SYSTEM]`/`[VOICE]` for state and
|
||||
recognition). That terminal is your recognition/action console; you attach to the
|
||||
`claude-<name>` session in another pane to watch the keystrokes land. It runs in the
|
||||
foreground by design — the console is the point — though `claudedo stop` can signal a
|
||||
stray instance.
|
||||
|
||||
```bash
|
||||
claudedo start # mic-check, then the visible listen loop (listen mode default)
|
||||
claudedo start # the visible listen loop (listen mode default; no mic check)
|
||||
claudedo start --check # run a mic check before listening
|
||||
claudedo start --mode ptt # push-to-talk instead (desk-only — see Modes)
|
||||
claudedo start --skip-audio-check # skip the pre-listen mic check
|
||||
claudedo status # running? mode? target session?
|
||||
claudedo stop # stop a running daemon
|
||||
claudedo set <name> # set the sticky target -> claude-<name> (alias: switch)
|
||||
@ -97,9 +102,12 @@ Switch at runtime by voice: "claudedo mode listen" / "claudedo mode ptt".
|
||||
|
||||
## Command grammar
|
||||
|
||||
Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
|
||||
"claudedo" is a coined word, so the matcher is lenient (accepts "claude do",
|
||||
"clauddo", "cloud do", …). In PTT mode the wake phrase is optional.
|
||||
Wake phrases (listen mode), fuzzy-matched. The default list is **"claudedo"**,
|
||||
**"claude do"**, **"hey claude"**, **"ok claude"**, **"okay claude"** — Whisper has
|
||||
no token for the coined word "claudedo" and renders it as real words ("claude do"),
|
||||
so that spelling is listed explicitly. Matching is lenient (case/space-insensitive).
|
||||
Add the spellings you actually see (turn on `print_heard` to find them). In PTT mode
|
||||
the wake phrase is optional.
|
||||
|
||||
| Say | Does |
|
||||
|---|---|
|
||||
@ -108,9 +116,10 @@ Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
|
||||
| `approve` / `deny` | allow / deny a permission prompt |
|
||||
| `send` / `enter` | submit (Enter) |
|
||||
| `type <phrase>` | insert literal text, **no** submit (read-before-send; say "send") |
|
||||
| `space [<n>]` | insert n spaces (default 1) |
|
||||
| `space [<n>]` (also `add [a] space`, `insert <n> spaces`) | insert n spaces (default 1) |
|
||||
| `backspace [<n>]` (alias `delete`) | delete n chars (default 1), capped at the last submit boundary |
|
||||
| `erase` (alias `clear`/`wipe`) | delete everything typed since the last submit/boundary |
|
||||
| `debug <text>` (alias `echo`) | just print what you said to the console (test wake/STT; injects nothing) |
|
||||
| `mode ptt` / `mode listen` | switch input mode |
|
||||
| `set <name>` (alias `sticky`/`switch`) | set the **sticky** target → `claude-<name>` (persists) |
|
||||
| `target <name> <command>` | **one-shot** override: run that command on `claude-<name>` for this utterance only; sticky default unchanged |
|
||||
@ -121,8 +130,10 @@ Wake phrases (listen mode), fuzzy-matched: **"claudedo"**, **"hey claude"**.
|
||||
Optional filler (`select` / `use` / `choose`) may precede any command and is ignored:
|
||||
`select yes` and `use yes` behave like `yes`. (`select 1` is still the select command.)
|
||||
|
||||
When no sticky target is set, a bare command auto-targets the **only** running
|
||||
`claude-*` session; if several are running it does nothing and asks you to `set` one.
|
||||
When no sticky target is set, a bare command does nothing and asks you to `set` one
|
||||
(the default). Set `auto_target = true` to instead auto-use the single running
|
||||
`claude-*` session when there's exactly one; with several running it always does
|
||||
nothing and asks you to `set` one.
|
||||
|
||||
Number words are normalized to digits before matching ("one"/"won" → 1).
|
||||
|
||||
@ -135,9 +146,9 @@ A `target <name>` voice command is a **one-shot** that does NOT touch the sticky
|
||||
default — it routes a single command and the next bare command reverts to sticky.
|
||||
|
||||
Resolution order (one place — `target.resolve()`): one-shot if present →
|
||||
sticky if set and the session exists → else the only running `claude-*` session →
|
||||
else (zero or several) do nothing and say so. It never guesses, and never injects
|
||||
into a nonexistent session.
|
||||
sticky if set and the session exists → else, only if `auto_target = true`, the single
|
||||
running `claude-*` session → else (default, or zero/several sessions) do nothing and
|
||||
say so. It never guesses, and never injects into a nonexistent session.
|
||||
|
||||
Every name maps to `claude-<name>` through one helper (`target.session_name()`), and
|
||||
the cc kit mirrors it exactly — so `cc libs` (shell) and `set libs` (voice) refer
|
||||
@ -174,19 +185,29 @@ If Claude Code changes its prompt UI, re-confirm against a live session and upda
|
||||
## Config
|
||||
|
||||
Everything tunable lives in [`config.toml`](config.toml): wake phrases, mode + PTT
|
||||
key, Whisper model/language/device, audio segmentation thresholds, and `[behavior]`
|
||||
(`type_autosend`, `filler_words`, `auto_target`, `print_heard`). The default model is
|
||||
`small`; bump to `medium` if the coined wake word is recognized poorly. `claudedo -c
|
||||
<path> ...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`,
|
||||
key, Whisper model/language/device, `[vad]` endpointing, and `[behavior]`
|
||||
(`type_autosend`, fuzzy thresholds, `filler_words`, `auto_target`, `print_heard`).
|
||||
The default model is **`medium`** (best accuracy for the coined wake word on a strong
|
||||
CPU); `small` is faster/less accurate, `large-v3` most accurate. `claudedo -c <path>
|
||||
...` points at a specific config; otherwise it searches `$CLAUDEDO_CONFIG`,
|
||||
`~/.config/claudedo/config.toml`, then `./config.toml`.
|
||||
|
||||
- **`auto_target`** (default `false`): with no sticky target set and exactly one
|
||||
`claude-*` session running, `false` makes a bare command do nothing and ask you to
|
||||
`set` one; `true` auto-targets that single session.
|
||||
- **`print_heard`** (default `false`, debug): prints non-wake transcripts to the
|
||||
console so you can see how Whisper renders your wake word. Turn it on to debug
|
||||
detection, then off. Whisper has no token for "claudedo" — it commonly emits
|
||||
"claude do" or "claude due", both of which are in the default wake list.
|
||||
- **STT biasing.** The transcriber is seeded with an `initial_prompt` built from the
|
||||
configured wake phrases + command vocabulary (one source — `grammar.vocabulary()`),
|
||||
so Whisper is conditioned to expect "claudedo" and the command words.
|
||||
- **Split fuzzy thresholds.** `wake_fuzzy_threshold` (default `0.6`, lenient) vs
|
||||
`command_fuzzy_threshold` (default `0.8`, tight). The asymmetry is deliberate: a
|
||||
false *wake* is cheap (it wakes, finds no command, does nothing), but a false
|
||||
*command* fires the wrong action. Prefer expanding command synonyms over loosening
|
||||
the command threshold.
|
||||
- **`[vad]` endpointing.** Capture starts on speech and ends after `silence_ms`
|
||||
(default 800) of trailing silence — Alexa-style record-until-pause — capped at
|
||||
`max_seconds` (default 10). The pause both ends a command and separates it from
|
||||
following chatter (the chatter is a separate capture the wake gate discards).
|
||||
- **`auto_target`** (default `false`): with no sticky target and one session running,
|
||||
`false` does nothing and asks you to `set`; `true` auto-uses that session.
|
||||
- **`print_heard`** (default `false`, debug): prints non-wake transcripts so you can
|
||||
see how Whisper renders your wake word, then tune the wake list/threshold.
|
||||
|
||||
## Requirements
|
||||
|
||||
|
||||
33
config.toml
33
config.toml
@ -5,7 +5,7 @@
|
||||
# wake phrases for listen mode. fuzzy-matched: case/space-insensitive, lenient on
|
||||
# the coined word "claudedo" (whisper renders it inconsistently). number words are
|
||||
# normalized to digits before command matching.
|
||||
phrases = ["claudedo", "claude do", "claude due", "hey claude", "ok claude", "okay claude"]
|
||||
phrases = ["claudedo", "claude do", "hey claude", "ok claude", "okay claude"]
|
||||
|
||||
[input]
|
||||
# "listen" (default): continuous capture; only acts on utterances that start with a
|
||||
@ -21,10 +21,10 @@ mode = "listen"
|
||||
ptt_key = "space"
|
||||
|
||||
[stt]
|
||||
# faster-whisper model size. "small" is a good accuracy/latency balance for the
|
||||
# short command grammar (~sub-second per chunk on a strong cpu). if the coined wake
|
||||
# word "claudedo" is recognized poorly, bump to "medium" (slower per chunk).
|
||||
model = "small"
|
||||
# faster-whisper model size. "medium" is the default — biggest accuracy gain for the
|
||||
# coined wake word ("claudedo" / "claude do") and fine on a strong cpu. "small" is
|
||||
# faster but less accurate; "large-v3" is most accurate if medium still struggles.
|
||||
model = "medium"
|
||||
language = "en"
|
||||
# mic device: "auto", or a sounddevice device index (integer) / substring of a
|
||||
# device name. run `claudedo test-audio` to list devices.
|
||||
@ -36,21 +36,30 @@ compute = "auto"
|
||||
# capture parameters. 16 kHz mono is what whisper expects.
|
||||
samplerate = 16000
|
||||
channels = 1
|
||||
# listen-mode silence segmentation: an utterance ends after this many seconds below
|
||||
# the rms threshold. keeps latency low without streaming.
|
||||
# rms energy below this counts as silence (the VAD onset/endpoint floor).
|
||||
silence_threshold = 0.012
|
||||
silence_duration = 0.8
|
||||
# ignore utterances shorter than this (clicks, coughs).
|
||||
min_utterance = 0.3
|
||||
# hard cap on a single utterance so a stuck stream can't grow unbounded.
|
||||
max_utterance = 15.0
|
||||
|
||||
[vad]
|
||||
# Alexa-style record-until-pause endpointing (listen mode). capture starts on speech
|
||||
# onset and ends after this much trailing silence — the natural end of an utterance.
|
||||
# a real pause both ends the command AND separates it from following chatter (the
|
||||
# chatter becomes a separate capture that the wake gate then discards).
|
||||
silence_ms = 800
|
||||
# hard cap so continuous noise can't record forever.
|
||||
max_seconds = 10.0
|
||||
|
||||
[behavior]
|
||||
# dictation never auto-submits: "type <phrase>" inserts literal text only; you say
|
||||
# "send" separately to submit (read-before-send).
|
||||
type_autosend = false
|
||||
# fuzzy match ratio (0..1) required to accept a wake phrase / command token.
|
||||
match_threshold = 0.8
|
||||
# fuzzy match ratios (0..1). the asymmetry is deliberate: a false WAKE is cheap (it
|
||||
# wakes, finds no command, does nothing), so wake is lenient; a false COMMAND fires
|
||||
# the WRONG action, so commands stay tight. lower = more lenient = more matches.
|
||||
# prefer expanding command synonyms over loosening command_fuzzy_threshold.
|
||||
wake_fuzzy_threshold = 0.6
|
||||
command_fuzzy_threshold = 0.8
|
||||
# optional filler words that may precede a command and are ignored for matching:
|
||||
# "select yes" / "use yes" behave like "yes". (a filler word followed by a digit is
|
||||
# the select command, e.g. "select 1", and is not dropped.)
|
||||
|
||||
12
install.sh
12
install.sh
@ -94,6 +94,18 @@ mkdir -p "$CONF_DIR"
|
||||
install -m 0644 "$REPO_DIR/shell/cc.sh" "$CONF_DIR/cc.sh"
|
||||
echo " wrote $CONF_DIR/cc.sh"
|
||||
|
||||
# install config.toml to the standard location so the daemon finds it from any dir.
|
||||
# never clobber an edited user config: copy only if absent, else drop a .new to diff.
|
||||
if [ ! -f "$CONF_DIR/config.toml" ]; then
|
||||
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"
|
||||
echo " wrote $CONF_DIR/config.toml"
|
||||
elif ! cmp -s "$REPO_DIR/config.toml" "$CONF_DIR/config.toml"; then
|
||||
install -m 0644 "$REPO_DIR/config.toml" "$CONF_DIR/config.toml.new"
|
||||
echo " kept your $CONF_DIR/config.toml; new default written to config.toml.new (diff to merge)"
|
||||
else
|
||||
echo " $CONF_DIR/config.toml already current"
|
||||
fi
|
||||
|
||||
# wire EVERY rc that exists (the user may have both zsh and bash).
|
||||
wired_any=0
|
||||
for RC in "$HOME/.zshrc" "$HOME/.bashrc"; do
|
||||
|
||||
@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "claudedo"
|
||||
version = "0.1.2"
|
||||
version = "0.1.3"
|
||||
description = "voice-control daemon for claude code (local STT -> tmux send-keys)"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.10"
|
||||
|
||||
@ -1,3 +1,3 @@
|
||||
"""claudedo — voice-control daemon for claude code (local STT -> tmux send-keys)"""
|
||||
|
||||
__version__ = "0.1.2"
|
||||
__version__ = "0.1.3"
|
||||
|
||||
@ -33,12 +33,12 @@ def cmd_start(args: argparse.Namespace) -> int:
|
||||
config = _load_or_die(args.config)
|
||||
if args.mode:
|
||||
config.mode = args.mode
|
||||
if not args.skip_audio_check:
|
||||
if args.check:
|
||||
print("checking mic before listening (speak briefly) ...")
|
||||
peak = _probe_mic(config, seconds=2.0, verbose=False)
|
||||
if peak is None or peak < 0.02:
|
||||
print("mic check failed — no usable input.", file=sys.stderr)
|
||||
print("run `claudedo test-audio` to debug; or `claudedo start --skip-audio-check`",
|
||||
print("run `claudedo test-audio` to debug, or `claudedo start` to skip the check",
|
||||
file=sys.stderr)
|
||||
return 1
|
||||
print(f"mic OK (peak {peak:.3f}).")
|
||||
@ -217,8 +217,8 @@ def build_parser() -> argparse.ArgumentParser:
|
||||
|
||||
sp = sub.add_parser("start", help="run the daemon (foreground)")
|
||||
sp.add_argument("--mode", choices=("listen", "ptt"), help="override input mode")
|
||||
sp.add_argument("--skip-audio-check", action="store_true",
|
||||
help="skip the pre-listen mic check")
|
||||
sp.add_argument("--check", action="store_true",
|
||||
help="run a mic check before listening (off by default)")
|
||||
sp.set_defaults(func=cmd_start)
|
||||
|
||||
sub.add_parser("stop", help="stop a running daemon").set_defaults(func=cmd_stop)
|
||||
|
||||
@ -44,11 +44,12 @@ class Config:
|
||||
samplerate: int
|
||||
channels: int
|
||||
silence_threshold: float
|
||||
silence_duration: float
|
||||
vad_silence_ms: int
|
||||
vad_max_seconds: float
|
||||
min_utterance: float
|
||||
max_utterance: float
|
||||
type_autosend: bool
|
||||
match_threshold: float
|
||||
wake_fuzzy_threshold: float
|
||||
command_fuzzy_threshold: float
|
||||
filler_words: tuple[str, ...]
|
||||
auto_target: bool
|
||||
print_heard: bool
|
||||
@ -98,7 +99,7 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
|
||||
if mode not in _VALID_MODES:
|
||||
raise ConfigError(f"[input].mode must be one of {_VALID_MODES}, got {mode!r}")
|
||||
|
||||
model = _require(raw, "stt", "model", (str,), "small")
|
||||
model = _require(raw, "stt", "model", (str,), "medium")
|
||||
if model not in _VALID_MODELS:
|
||||
log.warning("unknown stt model %r — passing through to faster-whisper", model)
|
||||
|
||||
@ -113,19 +114,25 @@ def load_config(explicit: str | os.PathLike | None = None) -> Config:
|
||||
samplerate=int(_require(raw, "audio", "samplerate", (int,), 16000)),
|
||||
channels=int(_require(raw, "audio", "channels", (int,), 1)),
|
||||
silence_threshold=float(_require(raw, "audio", "silence_threshold", (int, float), 0.012)),
|
||||
silence_duration=float(_require(raw, "audio", "silence_duration", (int, float), 0.8)),
|
||||
vad_silence_ms=int(_require(raw, "vad", "silence_ms", (int,), 800)),
|
||||
vad_max_seconds=float(_require(raw, "vad", "max_seconds", (int, float), 10.0)),
|
||||
min_utterance=float(_require(raw, "audio", "min_utterance", (int, float), 0.3)),
|
||||
max_utterance=float(_require(raw, "audio", "max_utterance", (int, float), 15.0)),
|
||||
type_autosend=bool(_require(raw, "behavior", "type_autosend", (bool,), False)),
|
||||
match_threshold=float(_require(raw, "behavior", "match_threshold", (int, float), 0.8)),
|
||||
wake_fuzzy_threshold=float(_require(raw, "behavior", "wake_fuzzy_threshold", (int, float), 0.6)),
|
||||
command_fuzzy_threshold=float(_require(raw, "behavior", "command_fuzzy_threshold",
|
||||
(int, float), 0.8)),
|
||||
filler_words=tuple(_require(raw, "behavior", "filler_words", (list,),
|
||||
["select", "use", "choose"])),
|
||||
auto_target=bool(_require(raw, "behavior", "auto_target", (bool,), False)),
|
||||
print_heard=bool(_require(raw, "behavior", "print_heard", (bool,), False)),
|
||||
source_path=path,
|
||||
)
|
||||
if not 0.0 < cfg.match_threshold <= 1.0:
|
||||
raise ConfigError("[behavior].match_threshold must be in (0, 1]")
|
||||
for label, val in (("wake_fuzzy_threshold", cfg.wake_fuzzy_threshold),
|
||||
("command_fuzzy_threshold", cfg.command_fuzzy_threshold)):
|
||||
if not 0.0 < val <= 1.0:
|
||||
raise ConfigError(f"[behavior].{label} must be in (0, 1]")
|
||||
if cfg.vad_silence_ms <= 0 or cfg.vad_max_seconds <= 0:
|
||||
raise ConfigError("[vad].silence_ms and max_seconds must be positive")
|
||||
if cfg.samplerate <= 0 or cfg.channels <= 0:
|
||||
raise ConfigError("[audio].samplerate and channels must be positive")
|
||||
return cfg
|
||||
|
||||
@ -136,6 +136,7 @@ class Daemon:
|
||||
model=cfg.stt_model, language=cfg.stt_language,
|
||||
device=cfg.stt_compute if cfg.stt_compute in ("cpu", "cuda") else "auto",
|
||||
compute_type="auto",
|
||||
initial_prompt=grammar.initial_prompt(cfg.wake_phrases),
|
||||
)
|
||||
if audio.warm_up(cfg.samplerate, cfg.channels, self._device):
|
||||
log.info("mic warmed up (source live)")
|
||||
@ -151,20 +152,20 @@ class Daemon:
|
||||
return audio.record_while(
|
||||
cfg.samplerate, cfg.channels, self._device,
|
||||
held=lambda: not self._ptt.wait_press(self.stopped),
|
||||
max_utterance=cfg.max_utterance, min_utterance=cfg.min_utterance,
|
||||
max_utterance=cfg.vad_max_seconds, min_utterance=cfg.min_utterance,
|
||||
)
|
||||
return audio.record_until_silence(
|
||||
cfg.samplerate, cfg.channels, self._device,
|
||||
silence_threshold=cfg.silence_threshold, silence_duration=cfg.silence_duration,
|
||||
min_utterance=cfg.min_utterance, max_utterance=cfg.max_utterance,
|
||||
silence_threshold=cfg.silence_threshold, silence_duration=cfg.vad_silence_ms / 1000.0,
|
||||
min_utterance=cfg.min_utterance, max_utterance=cfg.vad_max_seconds,
|
||||
stop=self.stopped,
|
||||
)
|
||||
|
||||
def _handle(self, transcript: str) -> None:
|
||||
cfg = self.config
|
||||
require_wake = self.mode == "listen"
|
||||
parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.match_threshold, require_wake,
|
||||
filler=cfg.filler_words)
|
||||
parsed = grammar.parse(transcript, cfg.wake_phrases, cfg.wake_fuzzy_threshold,
|
||||
cfg.command_fuzzy_threshold, require_wake, filler=cfg.filler_words)
|
||||
if parsed is None or parsed.action is None:
|
||||
self._console.emit(VOICE, f'heard "{transcript}" -> no command matched', "yellow")
|
||||
return
|
||||
@ -192,6 +193,9 @@ class Daemon:
|
||||
sessions = target.list_sessions()
|
||||
self._console.emit(SYSTEM, "list -> " + (", ".join(sessions) if sessions else "(none running)"))
|
||||
return
|
||||
if action.name == "debug":
|
||||
self._console.emit(VOICE, f'debug: "{action.arg}"', "yellow")
|
||||
return
|
||||
|
||||
session, reason = target.resolve(parsed.one_shot, auto_target=cfg.auto_target)
|
||||
if session is None:
|
||||
@ -257,7 +261,8 @@ class Daemon:
|
||||
invariant: non-command speech is discarded, never recorded.
|
||||
"""
|
||||
cfg = self.config
|
||||
return grammar.strip_wake(transcript, cfg.wake_phrases, cfg.match_threshold, True) is not None
|
||||
return grammar.strip_wake(transcript, cfg.wake_phrases,
|
||||
cfg.wake_fuzzy_threshold, True) is not None
|
||||
|
||||
def _print_startup(self) -> None:
|
||||
cfg = self.config
|
||||
|
||||
@ -33,11 +33,33 @@ _COUNT_WORDS = {
|
||||
"sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20,
|
||||
}
|
||||
|
||||
_YES_VERBS = ("yes", "yeah", "yep", "yup")
|
||||
_NO_VERBS = ("no", "nope", "nah")
|
||||
_APPROVE_VERBS = ("approve", "allow")
|
||||
_DENY_VERBS = ("deny", "reject")
|
||||
_SUBMIT_VERBS = ("send", "enter", "submit")
|
||||
_CANCEL_VERBS = ("cancel", "escape")
|
||||
_TYPE_VERBS = ("type", "dictate", "write")
|
||||
_BACKSPACE_VERBS = ("backspace", "delete")
|
||||
_SPACE_VERBS = ("space", "spacebar")
|
||||
_ADD_VERBS = ("add", "insert")
|
||||
_ERASE_VERBS = ("erase", "clear", "wipe")
|
||||
_DEBUG_VERBS = ("debug", "echo")
|
||||
_MODE_VERBS = ("mode",)
|
||||
_STICKY_VERBS = ("set", "sticky", "switch")
|
||||
_ONESHOT_VERBS = ("target",)
|
||||
_UNSET_VERBS = ("unset", "unsticky")
|
||||
_LIST_VERBS = ("list", "sessions")
|
||||
_SELECT_VERBS = ("select", "option", "choose", "number")
|
||||
|
||||
# every command/synonym word, for biasing the STT toward the vocabulary we expect.
|
||||
_COMMAND_WORDS = (
|
||||
_YES_VERBS + _NO_VERBS + _APPROVE_VERBS + _DENY_VERBS + _SUBMIT_VERBS
|
||||
+ _CANCEL_VERBS + _TYPE_VERBS + _BACKSPACE_VERBS + _SPACE_VERBS + _ADD_VERBS
|
||||
+ _ERASE_VERBS + _DEBUG_VERBS + _MODE_VERBS + _STICKY_VERBS + _ONESHOT_VERBS + _UNSET_VERBS
|
||||
+ _LIST_VERBS + _SELECT_VERBS + ("ptt", "listen")
|
||||
+ ("one", "two", "three", "four")
|
||||
)
|
||||
DEFAULT_FILLER = ("select", "use", "choose")
|
||||
|
||||
|
||||
@ -79,6 +101,26 @@ def normalize(text: str) -> str:
|
||||
return " ".join(tokens)
|
||||
|
||||
|
||||
def vocabulary(wake_phrases: list[str]) -> list[str]:
|
||||
"""the wake + command vocabulary, deduped in first-seen order.
|
||||
|
||||
single source for biasing the STT: the same wake phrases the matcher uses plus
|
||||
every command/synonym word in _COMMAND_WORDS. no separate hardcoded copy.
|
||||
"""
|
||||
seen: dict[str, None] = {}
|
||||
for word in list(wake_phrases) + list(_COMMAND_WORDS):
|
||||
key = word.strip()
|
||||
if key and key not in seen:
|
||||
seen[key] = None
|
||||
return list(seen)
|
||||
|
||||
|
||||
def initial_prompt(wake_phrases: list[str]) -> str:
|
||||
"""a comma-joined vocabulary string to pass faster-whisper as initial_prompt,
|
||||
conditioning transcription toward the words we expect (esp. the coined wake)"""
|
||||
return ", ".join(vocabulary(wake_phrases))
|
||||
|
||||
|
||||
def _ratio(a: str, b: str) -> float:
|
||||
return SequenceMatcher(None, a, b).ratio()
|
||||
|
||||
@ -163,34 +205,42 @@ def match_command(remainder: str, threshold: float) -> Action | None:
|
||||
if head in _INDEX_WORDS:
|
||||
return Action("select", _INDEX_WORDS[head])
|
||||
|
||||
if _fuzzy_in(head, ("yes", "yeah", "yep", "yup"), threshold):
|
||||
if _fuzzy_in(head, _YES_VERBS, threshold):
|
||||
return Action("yes")
|
||||
if _fuzzy_in(head, ("no", "nope", "nah"), threshold):
|
||||
if _fuzzy_in(head, _NO_VERBS, threshold):
|
||||
return Action("no")
|
||||
if _fuzzy_in(head, ("approve", "allow"), threshold):
|
||||
if _fuzzy_in(head, _APPROVE_VERBS, threshold):
|
||||
return Action("approve")
|
||||
if _fuzzy_in(head, ("deny", "reject"), threshold):
|
||||
if _fuzzy_in(head, _DENY_VERBS, threshold):
|
||||
return Action("deny")
|
||||
if _fuzzy_in(head, ("send", "enter", "submit"), threshold):
|
||||
if _fuzzy_in(head, _SUBMIT_VERBS, threshold):
|
||||
return Action("submit")
|
||||
if _fuzzy_in(head, ("cancel", "escape"), threshold):
|
||||
if _fuzzy_in(head, _CANCEL_VERBS, threshold):
|
||||
return Action("cancel")
|
||||
|
||||
if _fuzzy_in(head, _SELECT_VERBS, threshold) and rest and rest[0] in _INDEX_WORDS:
|
||||
return Action("select", _INDEX_WORDS[rest[0]])
|
||||
|
||||
if _fuzzy_in(head, ("type", "dictate", "write"), threshold):
|
||||
if _fuzzy_in(head, _TYPE_VERBS, threshold):
|
||||
text = " ".join(rest).strip()
|
||||
return Action("type", text) if text else None
|
||||
|
||||
if _fuzzy_in(head, ("backspace", "delete"), threshold):
|
||||
if _fuzzy_in(head, _BACKSPACE_VERBS, threshold):
|
||||
return Action("backspace", _leading_count(rest, default=1))
|
||||
if _fuzzy_in(head, ("space",), threshold):
|
||||
if _fuzzy_in(head, _SPACE_VERBS, threshold):
|
||||
return Action("space", _leading_count(rest, default=1))
|
||||
if _fuzzy_in(head, ("erase", "clear", "wipe"), threshold):
|
||||
if _fuzzy_in(head, _ADD_VERBS, threshold) and rest:
|
||||
tail = [t for t in rest if t not in ("a", "an")]
|
||||
if any(_fuzzy_in(t, ("space", "spaces"), threshold) for t in tail):
|
||||
count = next((int(t) for t in tail if t.isdigit()),
|
||||
next((_COUNT_WORDS[t] for t in tail if t in _COUNT_WORDS), 1))
|
||||
return Action("space", count)
|
||||
if _fuzzy_in(head, _ERASE_VERBS, threshold):
|
||||
return Action("erase")
|
||||
if _fuzzy_in(head, _DEBUG_VERBS, threshold):
|
||||
return Action("debug", " ".join(rest).strip())
|
||||
|
||||
if _fuzzy_in(head, ("mode",), threshold) and rest:
|
||||
if _fuzzy_in(head, _MODE_VERBS, threshold) and rest:
|
||||
if _fuzzy_in(rest[0], ("ptt",), threshold) or "push" in rest[0]:
|
||||
return Action("mode", "ptt")
|
||||
if _fuzzy_in(rest[0], ("listen",), threshold):
|
||||
@ -221,24 +271,28 @@ def _strip_filler(tokens: list[str], filler: tuple[str, ...], threshold: float)
|
||||
return tokens
|
||||
|
||||
|
||||
def parse(transcript: str, wake_phrases: list[str], threshold: float,
|
||||
require_wake: bool, filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None:
|
||||
def parse(transcript: str, wake_phrases: list[str], wake_threshold: float,
|
||||
command_threshold: float, require_wake: bool,
|
||||
filler: tuple[str, ...] = DEFAULT_FILLER) -> ParsedCommand | None:
|
||||
"""full parse: wake gate -> optional one-shot target -> filler -> command.
|
||||
|
||||
returns a ParsedCommand (one_shot, action), or None if the wake gate dropped the
|
||||
utterance (listen mode, no wake phrase). a ParsedCommand with action=None means a
|
||||
wake phrase was present but no command matched.
|
||||
wake_threshold gates the wake phrase (lenient — a false wake is cheap, it just
|
||||
finds no command); command_threshold gates the command words (stricter — a false
|
||||
command fires the wrong action). returns a ParsedCommand (one_shot, action), or
|
||||
None if the wake gate dropped the utterance (listen mode, no wake phrase). a
|
||||
ParsedCommand with action=None means a wake phrase was present but no command
|
||||
matched.
|
||||
"""
|
||||
remainder = strip_wake(transcript, wake_phrases, threshold, require_wake)
|
||||
remainder = strip_wake(transcript, wake_phrases, wake_threshold, require_wake)
|
||||
if remainder is None:
|
||||
return None
|
||||
|
||||
tokens = remainder.split(" ") if remainder else []
|
||||
one_shot: str | None = None
|
||||
if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, threshold) and len(tokens) >= 2:
|
||||
if tokens and _fuzzy_in(tokens[0], _ONESHOT_VERBS, command_threshold) and len(tokens) >= 2:
|
||||
one_shot = tokens[1]
|
||||
tokens = tokens[2:]
|
||||
|
||||
tokens = _strip_filler(tokens, filler, threshold)
|
||||
action = match_command(" ".join(tokens), threshold)
|
||||
tokens = _strip_filler(tokens, filler, command_threshold)
|
||||
action = match_command(" ".join(tokens), command_threshold)
|
||||
return ParsedCommand(one_shot=one_shot, action=action)
|
||||
|
||||
@ -82,8 +82,9 @@ class Transcriber:
|
||||
"""a loaded faster-whisper model that transcribes float32 mono audio chunks"""
|
||||
|
||||
def __init__(self, model: str = "small", language: str = "en", device: str = "auto",
|
||||
compute_type: str = "auto") -> None:
|
||||
compute_type: str = "auto", initial_prompt: str | None = None) -> None:
|
||||
self.language = language
|
||||
self.initial_prompt = initial_prompt
|
||||
self._model = self._load(model, device, compute_type)
|
||||
self._warm()
|
||||
|
||||
@ -120,6 +121,7 @@ class Transcriber:
|
||||
beam_size=1,
|
||||
vad_filter=True,
|
||||
condition_on_previous_text=False,
|
||||
initial_prompt=self.initial_prompt,
|
||||
)
|
||||
text = " ".join(seg.text for seg in segments).strip()
|
||||
return text
|
||||
|
||||
Loading…
Reference in New Issue
Block a user