KeyHog
A secret scanner. Built in Rust. Made to be fast on big repos, careful with your time on small ones, and quiet about findings that aren’t actually credentials.
$ keyhog scan .
keyhog v0.5.37 │ 891 detectors │ 1647 patterns │ avx-512 + hyperscan + cuda
scanned 12,841 files in 1.4 s
3 findings · 0 verified live · 1041 example fixtures suppressed
What it does
Walks files - your working tree, your git history, a docker image, an S3 bucket, a list of URLs - and reports leaked credentials. Every finding has:
- a detector that fired (stripe-secret-key, aws-access-key, …)
- a location (file, line, offset, optionally commit hash and author)
- an entropy score + confidence
- an optional live verification result if you pass --verify
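To give a feel for what an entropy score measures, here is a minimal Python sketch of Shannon entropy in bits per character. It is an illustration only: KeyHog's exact formula and thresholds are not documented on this page, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s in bits per character (0.0 for an empty string)."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    # Sum of p * log2(1/p) over the observed character frequencies.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

Random-looking key material scores noticeably higher than repeated words or identifiers, which is why entropy is a useful signal next to the detector match.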
The list of detectors ships in TOML files under detectors/. There are 891
of them today, covering ~750 distinct services. Anyone can add or override
them without touching Rust code.
What it doesn’t do
- No telemetry. Findings stay local. The scanner never phones home.
- No agent. A daemon mode exists for pre-commit / IDE-save fast-path scans on Unix, but it’s opt-in and stays on your machine.
- No “AI-powered” detection. Every detector is a regex with a service-specific anchor and a real verification endpoint. The ML scorer that bumps confidence on ambiguous matches is a tiny on-device MoE; no network calls.
Why another scanner
Three things, in order of how much they matter:
- Precision. A scanner that emits one false positive per ten findings teaches developers to ignore it. KeyHog suppresses example credentials (the Stripe docs key, the AWS sample key, the RFC 7519 specimen JWT), vendored bundles (minified jQuery, node_modules), and CI workflow ${{ secrets.NAME }} references by default. The 22-repo dogfood corpus has 22 non-PEM findings, all true positives.
- Recall. The detector corpus is built service-by-service. For every detector, the test suite carries positive shapes (env-var, JSON, YAML, header, URL), negative shapes (placeholder, EXAMPLE marker), and adversarial evasions (split across lines, hex/base64-encoded, reversed, Caesar-shifted). If a shape isn’t in the suite, the detector isn’t shipped.
- Speed. Hyperscan SIMD prefilter, AVX-512 entropy gate, GPU literal scan for big workloads. A million-LOC monorepo scans in under three minutes on a modern laptop without warming any caches. Pre-commit incremental scans are sub-100 ms.
Get going
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
# Windows (PowerShell)
iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 -useb | iex
Then:
keyhog scan .
The Install page has package-manager, build-from-source, and offline-install paths. The Your first scan page walks through what the output means and where to go from there.
Where things live
- Source: github.com/santhsecurity/keyhog
- Issues: github.com/santhsecurity/keyhog/issues
- Releases: github.com/santhsecurity/keyhog/releases
- Security: report vulnerabilities to security@santh.dev (PGP-encrypted preferred - key in repo SECURITY.md)
License: MIT.
Install
The quickest paths first. Pick one - they all give you the same
keyhog binary.
One-liner: Linux / macOS
curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
Drops a binary in ~/.local/bin/keyhog. The installer detects your
CPU, GPU, and existing install before downloading, and tells you the
asset it picked and why.
The default is the WGPU + SIMD build everywhere: it already
dispatches the same vyre AC / RulePipeline on your GPU via the vulkan
backend, with a smaller binary and no libcuda.so runtime
dependency. The dedicated keyhog-linux-x86_64-cuda build is only
auto-selected on Linux when the host has the full CUDA toolkit
installed - nvcc on PATH, $CUDA_HOME set, or /usr/local/cuda
present. A driver-only NVIDIA host (libcuda.so loadable but no
toolkit) stays on the WGPU build, since the native-CUDA dispatch
saves only single-digit percent on typical repo scans and the
binary footprint + runtime dependency are not worth it for the
non-CUDA-developer case. Pass --variant=cuda (or set
KEYHOG_VARIANT=cuda) to force the CUDA build anyway. Apple
Silicon hosts get an explicit “Metal GPU acceleration coming soon”
note; until that lands, Apple Silicon runs SIMD on CPU plus WGPU
on the integrated GPU.
Interactive mode (recommended for first install)
curl ... | sh is fast but skips the wizard because stdin is a pipe.
For variant selection, shell completions, and optional hook setup:
curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh \
-o keyhog-install.sh
sh keyhog-install.sh
The interactive installer shows you:
- The host it detected (OS, arch, GPU, libcuda state).
- The binary it would install (with the GPU note).
- Any existing keyhog install it found.
- Whether ~/.local/bin is on your PATH.
Then it prompts (default in brackets):
- Add ~/.local/bin to your shell PATH? [Y/n]
- Install shell completions for bash / zsh / fish? [y/N]
- Wire keyhog as a git pre-commit hook in this dir? [y/N]
Each prompt is opt-in. Nothing in your .bashrc / .zshrc / git
hooks dir is touched without an explicit “y”. Claude Code / Cursor
agent-hook integration is on the roadmap but not yet shipped; the
prompt was removed in v0.5.34 once it became clear the underlying
keyhog hook install --agent <name> flag wasn’t real yet.
One-liner: Windows
PowerShell 5+ (ships with Windows 10/11):
iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 -useb | iex
Drops the binary in %LOCALAPPDATA%\keyhog\bin\keyhog.exe. Detects
your GPU (informational only: a dedicated CUDA-on-Windows variant is
on the roadmap but not yet shipped, so today every Windows host gets
the same WGPU + SIMD binary).
For the interactive flow:
iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 `
-OutFile keyhog-install.ps1
.\keyhog-install.ps1
Heads up. The Unix daemon mode is unavailable on Windows (it relies on Unix-domain sockets). keyhog scan, keyhog detectors, keyhog watch, keyhog hook, etc. all work the same. The daemon subcommand and the --daemon flag emit an explicit “unix-only” error so nothing silently regresses.
Variants and overrides
The installer auto-detects, but you can override:
| Env var / flag | Effect |
|---|---|
| KEYHOG_VARIANT=cuda (or --variant=cuda) | Force the CUDA-accelerated Linux build (requires libcuda.so). |
| KEYHOG_VARIANT=cpu (or --variant=cpu) | Force the default WGPU + SIMD build, skip GPU detection. |
| KEYHOG_VERSION=v0.5.37 (or --version=v0.5.37) | Pin a specific release tag (default: most recent release with assets attached). |
| KEYHOG_INSTALL=/usr/local/bin (or --install-dir=...) | Install into a different directory. |
| --yes / -y | Non-interactive: accept all defaults, no prompts. |
| --no-color | Disable ANSI colors (e.g. for log capture). |
Runtime env vars (consumed by the keyhog binary itself)
| Env var | Effect |
|---|---|
| KEYHOG_NO_GPU=1 | Force the CPU + SIMD path; skip every GPU init (saves ~250 ms of cold-start on hosts with no usable GPU). |
| KEYHOG_NO_GPU=0 | Force GPU init even when CI auto-detection would otherwise skip it. Useful on self-hosted GitHub / GitLab runners with a real GPU. |
| KEYHOG_REQUIRE_GPU=1 | Hard-fail (exit 2) instead of silently degrading when the GPU stack is unavailable. Pairs with the no-silent-fallback contract. |
| KEYHOG_BACKEND=<backend> (one of gpu, mega-scan, simd, cpu) | Force a specific scan backend regardless of hardware probe. Mostly for benches; production code should let auto-select route. |
CI auto-detect. When CI=true is set (or any of GITHUB_ACTIONS, GITLAB_CI, CIRCLECI, TRAVIS, JENKINS_URL, TF_BUILD, BUILDKITE, DRONE, APPVEYOR, TEAMCITY_VERSION, CODEBUILD_BUILD_ID, BITBUCKET_BUILD_NUMBER, WERCKER, SEMAPHORE), keyhog skips the GPU probe entirely and goes straight to the SIMD + CPU path. The savings: ~250 ms of cold-start per keyhog invocation, plus no confusing “GPU MoE init failed” warning when the runner’s only GPU is llvmpipe. Override with KEYHOG_NO_GPU=0 on self-hosted GPU runners.
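The decision described above can be sketched in Python as follows. This is an illustration, not the scanner's source: the precedence (explicit KEYHOG_NO_GPU first, then CI detection) follows the text, while treating any listed variable as "set" regardless of its value, and the function name `should_probe_gpu`, are assumptions.

```python
# CI marker variables listed in the paragraph above.
CI_ENV_VARS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "CIRCLECI", "TRAVIS", "JENKINS_URL",
    "TF_BUILD", "BUILDKITE", "DRONE", "APPVEYOR", "TEAMCITY_VERSION",
    "CODEBUILD_BUILD_ID", "BITBUCKET_BUILD_NUMBER", "WERCKER", "SEMAPHORE",
)


def should_probe_gpu(env: dict) -> bool:
    """Return True if the GPU probe should run for this environment."""
    no_gpu = env.get("KEYHOG_NO_GPU")
    if no_gpu == "1":   # explicit opt-out always wins
        return False
    if no_gpu == "0":   # explicit opt-in overrides CI auto-detection
        return True
    # Assumption: presence of a marker variable is enough, whatever its value.
    in_ci = env.get("CI") == "true" or any(name in env for name in CI_ENV_VARS)
    return not in_ci
```

Calling `should_probe_gpu(dict(os.environ))` on a runner shows which path a scan would take under these assumptions.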
When a CUDA variant asset isn’t published for the resolved release
tag yet, the installer logs the fallback and downloads the default
WGPU + SIMD asset instead. You can rerun with --variant=cuda once
a tag with the CUDA variant lands.
Repair, diagnose, uninstall
sh keyhog-install.sh --diagnose # print host + binary state, change nothing
sh keyhog-install.sh --repair # re-download the right variant for this host
sh keyhog-install.sh --uninstall # remove the binary (leaves PATH entries alone)
--diagnose is the first thing to run if something looks off: it
reports CPU arch, OS, GPU + libcuda state, the currently-installed
binary (path + version), whether the install dir is on PATH, and
the asset the installer would download for the latest release tag.
--repair re-downloads the asset matching your current host even if
the existing binary still runs. Useful after a host upgrade adds a
new GPU, or after CUDA userland gets installed and the WGPU build
should be swapped for the CUDA build.
--uninstall only removes the binary itself. Shell PATH entries
and completion files added by the post-install wizard are left in
place: we don’t know which lines in your .bashrc / .zshrc are
ours vs yours, and silently editing those files is exactly the kind
of installer behavior we don’t want.
Direct binary download
If you don’t trust pipe-to-shell - fair - grab the binary by hand from the releases page.
| Platform | Asset name |
|---|---|
| Linux x86_64 (default) | keyhog-linux-x86_64 |
| Linux x86_64 + CUDA | keyhog-linux-x86_64-cuda |
| macOS x86_64 (Intel) | keyhog-macos-x86_64 |
| macOS aarch64 (Apple) | keyhog-macos-aarch64 |
| Windows x86_64 | keyhog-windows-x86_64.exe |
chmod +x the binary and put it somewhere on your PATH.
Build from source
You’ll want this if you’re contributing or running a feature combination the prebuilt binaries don’t cover (e.g. Ghidra binary extraction).
git clone https://github.com/santhsecurity/keyhog
cd keyhog
cargo build --release -p keyhog
./target/release/keyhog --version
The default feature set requires Hyperscan / Vectorscan:
- Debian / Ubuntu: sudo apt install libhyperscan-dev pkg-config
- macOS: not available via Homebrew. Build with --no-default-features --features portable to skip Hyperscan and use the pure-Rust path.
- Windows: build with --no-default-features --features portable.
For the CUDA backend, add the cuda feature on Linux:
cargo build --release -p keyhog --features cuda
This requires the CUDA toolkit at link time (NVCC + cudart + nvrtc)
and libcuda.so at runtime. The release workflow provisions CUDA
12.6 on the GitHub-hosted ubuntu runner for the
keyhog-linux-x86_64-cuda asset; for local source builds, install
the matching toolkit from
developer.nvidia.com/cuda-toolkit
or your distro’s nvidia-cuda-toolkit package.
The portable feature is what the official Windows + macOS release
binaries are built with: same scanner, no native dependency, ~5%
slower on big inputs.
crates.io
Not yet. KeyHog vendors vyre-libs (the GPU literal-set scan crate)
and won’t be published to crates.io until that dependency lands there.
Track the
crates.io publish issue
for status.
Verify the install
keyhog --version
keyhog detectors | head # smoke-test the embedded detector corpus
keyhog scan README.md # scan a single file; exit 0 = clean
If keyhog --version reports the latest release (currently
0.5.34 from prebuilt assets, or 0.5.35 from a source build of
main) and keyhog detectors lists hundreds of detectors, you’re
set. Move on to Your first scan.
You can also run the installer in diagnostic mode at any time to print a full status report:
sh keyhog-install.sh --diagnose
Your first scan
You have the binary on your PATH. Now:
keyhog scan .
That walks the current directory, hands every file through the scanner, and prints findings. The exit code carries the verdict:
| Exit code | Meaning |
|---|---|
| 0 | Scan finished, no findings |
| 1 | Scan finished, findings present (unverified or verified-live) |
| 2 | Runtime error - bad config, panic, I/O failure |
So a CI step that should fail the build when a credential leaks is just:
keyhog scan .
No grep, no jq, no exit-code arithmetic. Findings == exit 1 == build red.
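When a pipeline needs to tell "leak found" apart from "scanner failed", the three exit codes can be mapped explicitly. A small Python sketch, assuming keyhog is on PATH; `interpret_exit` and `run_scan` are hypothetical helper names, not part of KeyHog:

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {0: "clean", 1: "findings", 2: "error"}


def interpret_exit(code: int) -> str:
    """Translate a keyhog exit code into a short verdict."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_scan(path: str = ".") -> str:
    """Run `keyhog scan <path>` and classify the result (needs keyhog installed)."""
    proc = subprocess.run(["keyhog", "scan", path])
    return interpret_exit(proc.returncode)
```

A wrapper like this lets a CI job fail the build on "findings" while routing "error" to a separate alert, instead of treating every non-zero exit the same way.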
What you get out of it
By default, output is human-readable:
$ keyhog scan .
keyhog v0.5.37 │ 891 detectors │ 1647 patterns │ avx-512 + hyperscan
src/config/staging.env:14:12 HIGH stripe-secret-key
sk_live_4eC39H…Tcd3Hc (redacted, last 6)
entropy 5.21 │ confidence 0.999 │ unverified
scanned 12,841 files in 1.4 s
1 finding · 0 verified live · 1041 example fixtures suppressed
The header tells you the binary version, the detector count, and which hardware acceleration is active (AVX-512, Hyperscan/Vectorscan SIMD, CUDA, etc.). The body lists each finding with its location, severity, detector, redacted credential, and confidence. The footer summarizes counts and runtime.
Default suppressions
KeyHog ships with a Tier-B suppression list of publicly documented test fixtures - credentials that appear in vendor docs as examples. Findings on these are suppressed by default. Examples:
- Stripe’s sk_live_4eC39HqLyjWDarjtT1zdp7dc (docs sample)
- AWS’s AKIAIOSFODNN7EXAMPLE (docs sample)
- The RFC 7519 specimen JWT
- GitHub’s ghp_aBcDeFgHiJ… placeholder
To see what was suppressed, pass --no-suppress-test-fixtures. The
list lives at crates/cli/data/suppressions/test-fixtures.toml
inside the source tree, baked into the binary at build time, and is
the ONLY built-in suppression list - there’s no opaque allow-list.
JSON output
keyhog scan . --format json
Each finding is a JSON object with these fields, every one always present (consumers like SARIF converters and CI gates rely on the schema being stable):
{
  "detector_id": "stripe-secret-key",
  "detector_name": "Stripe Secret Key",
  "service": "stripe",
  "severity": "critical",
  "credential_redacted": "sk_live_4e…3Hc",
  "credential_hash": "sha256-hex",
  "location": {
    "source": "filesystem",
    "file_path": "src/config/staging.env",
    "line": 14,
    "offset": 12,
    "commit": null,
    "author": null,
    "date": null
  },
  "verification": "skipped",
  "metadata": {},
  "additional_locations": [],
  "confidence": 0.999
}
Pipe it into jq, into a SARIF converter for the GitHub Security tab,
or into your own dedup / triage tooling.
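As an example of such triage tooling, this Python sketch groups findings by credential_hash so the same secret pasted into several files shows up once. It assumes the report is a JSON array of objects with the fields shown above; `group_by_credential` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def group_by_credential(report_json: str) -> dict:
    """Map credential_hash -> list of "file:line" locations."""
    findings = json.loads(report_json)
    groups = defaultdict(list)
    for finding in findings:
        loc = finding["location"]
        groups[finding["credential_hash"]].append(f'{loc["file_path"]}:{loc["line"]}')
    return dict(groups)
```

Feed it the captured output of keyhog scan . --format json; each key is one distinct credential, each value the places it leaked.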
Limiting scope
keyhog scan src/ # one subdirectory
keyhog scan src/config/staging.env # one file
keyhog scan --stdin < staging.env # from stdin (CI: cat | keyhog)
keyhog scan . --exclude-paths 'docs/*' # exclude a glob
Common patterns the default walk already skips: .git/,
node_modules/, __pycache__/, vendor/, dist/, build/, out/,
.min.js, .min.css, .bak, .swp. To see the full list, look at
is_default_excluded in crates/sources/src/filesystem.rs.
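A rough Python approximation of that default-exclusion check, covering only the patterns named above. The real list lives in the Rust source and may contain more entries or use different matching rules; `is_default_excluded_sketch` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Only the directories and suffixes named in the paragraph above.
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__", "vendor", "dist", "build", "out"}
EXCLUDED_SUFFIXES = (".min.js", ".min.css", ".bak", ".swp")


def is_default_excluded_sketch(path: str) -> bool:
    """True if any parent directory or the file suffix is on the skip list."""
    p = PurePosixPath(path)
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return True
    return p.name.endswith(EXCLUDED_SUFFIXES)
```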
Interactive TUI dashboard
For an interactive scan with a live finding feed, current-file banner, and stats panel showing throughput and backend choice:
keyhog tui . # scan CWD with live dashboard
keyhog tui src/ --throttle-ms 200 # paced scan, good for demos/recordings
keyhog tui . --feed-depth 500 # keep last 500 findings in feed
The TUI builds on the same scanner core; q or Esc quits, and a
non-zero exit code is returned when any findings are surfaced. Useful
for sitting next to a developer demoing keyhog, or recording a vhs
GIF for a README or talk.
Going further
Once the basic scan works:
- Output formats - JSON, JSONL, SARIF, plain text.
- Verification - --verify makes API calls to confirm credentials are live, downgrades dead ones by one severity tier.
- Pre-commit hook - block leaked creds before they hit the repo.
- CI integration - GitHub Actions, GitLab CI, CircleCI patterns.
Output formats
KeyHog speaks four formats. Pick the one that fits the consumer.
--format text (default)
Human-readable table. Best for terminal use, pre-commit hook output,
and screenshots. Colors auto-detect TTY; pipe through cat (or set
NO_COLOR=1) to disable.
src/config/staging.env:14:12 HIGH stripe-secret-key
sk_live_4eC39H…Tcd3Hc (redacted)
entropy 5.21 │ confidence 0.999 │ unverified
The columns are file:line:offset, severity, detector ID. The second
line is the redacted credential. The third is metadata.
--format json
Stable-schema JSON array. Every finding has every documented field present. See Your first scan for the schema.
keyhog scan . --format json | jq '.[] | .detector_id' | sort | uniq -c
That sample command counts findings per detector, which answers the most common “what kinds of leaks do I have” question.
--format sarif
Static Analysis Results Interchange Format (SARIF). GitHub Code Scanning, GitLab Security Dashboard, and most IDE security plugins consume this.
keyhog scan . --format sarif > keyhog-results.sarif
Upload to GitHub:
# .github/workflows/secrets.yml
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: keyhog-results.sarif
Findings show up in the Security → Code scanning tab with the detector ID as the rule, file path + line as the location, and the redacted credential as the message.
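For readers unfamiliar with SARIF, this Python sketch shows how one finding maps onto a minimal SARIF 2.1.0 log in the way described above (detector ID as rule, path + line as location, redacted credential in the message). It is not KeyHog's emitter; the fixed "error" level and the function name `finding_to_sarif` are assumptions.

```python
def finding_to_sarif(finding: dict) -> dict:
    """Wrap a single KeyHog-style finding in a minimal SARIF 2.1.0 log."""
    loc = finding["location"]
    result = {
        "ruleId": finding["detector_id"],
        "level": "error",  # assumption: a real emitter would map severity here
        "message": {"text": f'{finding["detector_name"]}: {finding["credential_redacted"]}'},
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": loc["file_path"]},
                "region": {"startLine": loc["line"]},
            }
        }],
    }
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "keyhog"}}, "results": [result]}],
    }
```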
--format jsonl
Newline-delimited JSON - one finding per line, no outer array. Better
than --format json for streaming consumers that want to start
processing before the scan finishes:
keyhog scan /huge/monorepo --format jsonl \
  | while IFS= read -r line; do   # -r keeps JSON backslash escapes intact
      printf '%s\n' "$line" | jq -r '.location.file_path'
    done
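The same streaming pattern in Python, for consumers that would rather not shell out to jq per line. A sketch assuming each line is one finding object with the schema from Your first scan; `iter_file_paths` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_file_paths(lines: Iterable[str]) -> Iterator[str]:
    """Yield location.file_path for each JSONL finding as it arrives."""
    for line in lines:
        line = line.strip()
        if not line:          # tolerate blank lines between records
            continue
        yield json.loads(line)["location"]["file_path"]
```

Passing `sys.stdin` as `lines` processes findings while the scan is still running, since the generator never waits for the full stream.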
Combining with --verify
--verify calls each detector’s verification endpoint to confirm the
credential is live. Live credentials keep their severity; dead ones get
downgraded one tier. The output format doesn’t change - the
verification field of each finding becomes "verified-live" or
"verified-dead" instead of "skipped".
keyhog scan . --verify --format json \
| jq '.[] | select(.verification == "verified-live")'
Quiet mode
--quiet suppresses the header banner and the footer summary. Output
is findings-only, which is what CI scripts usually want:
keyhog scan . --quiet --format json
Exit code semantics are unchanged.
How detection works
A KeyHog scan is a pipeline. Files come in one side, findings go out the other. In between, four stages:
files → [chunker] → [prefilter] → [detector match] → [post-process] → findings
Each stage is a hard filter - if a chunk fails the prefilter, no detector ever runs on it. That’s where the speed comes from: the expensive regex evaluation only sees chunks that already plausibly contain something.
Stage 1 - chunker
A file becomes one or more chunks. A chunk is {data: str, metadata: {source_type, path, line_offsets, …}}. The chunker:
- Skips obvious binaries via magic-byte sniffing (PDF, PNG, zip, …).
- Skips files matching is_default_excluded (node_modules, .min.js, build/, etc.).
- Splits files larger than 64 MiB into overlapping windows so a single giant log file doesn’t blow scratch memory. Cross-window secrets are reassembled in stage 4.
- Decodes UTF-16 BOM files into UTF-8 (PowerShell / .NET configs).
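The overlapping-window split can be sketched in Python as below. The window and overlap sizes are parameters here because the page only states the 64 MiB threshold, not the overlap; `overlapping_windows` is a hypothetical name, and the real chunker works on mapped files rather than in-memory bytes.

```python
def overlapping_windows(data: bytes, window: int, overlap: int) -> list:
    """Split data into (offset, bytes) windows that share `overlap` bytes."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    out = []
    start = 0
    while True:
        out.append((start, data[start:start + window]))
        if start + window >= len(data):   # this window reached the end
            break
        start += step
    return out
```

With this layout, any run of at most overlap + 1 bytes falls entirely inside one window; longer secrets that straddle a boundary are what the stage 4 reassembly is for.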
Specialized chunkers run too:
- Git history → one chunk per (commit × file × diff line)
- Docker images → one chunk per layer × file
- Web URLs → one chunk per response body / sourcemap / WASM strings
- S3 buckets → one chunk per object body
Stage 2 - prefilter (the cheap pass)
Three gates, in order, each cheaper than the next:
- Alphabet screen. A 256-bit mask of which bytes the corpus’s detectors care about. If a chunk doesn’t contain ANY byte in the mask, it’s discarded. Most random-binary chunks fail here.
- Bigram bloom. A 4096-bit bloom filter of 2-byte sequences from detector keyword prefixes. If a chunk has no overlapping bigram, discard. Catches the “this is a Go source file with no key= anywhere” case in microseconds.
- SIMD prefilter (Hyperscan). A multi-pattern SIMD regex scanner. The detector corpus is compiled to a single Hyperscan database; one scan call returns “which detector IDs have a candidate match.” On AVX-512 hardware this runs at ~3 GB/s.
On GPU hosts, for inputs above the breakeven threshold (2 MiB on 5090-class, 16 MiB on 4090-class), the prefilter switches to a CUDA literal-set scan via vyre - same patterns, parallelized across thousands of cores.
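The first two gates are simple enough to sketch in plain Python. This is a scalar illustration of the idea, not the vectorized implementation: the bit layout, the bigram hash, and the choice to hash every bigram of each keyword (rather than only prefixes) are assumptions, and all function names are hypothetical.

```python
BLOOM_BITS = 4096


def build_alphabet_mask(keywords) -> int:
    """256-bit mask with one bit set per byte that appears in any keyword."""
    mask = 0
    for kw in keywords:
        for b in kw.encode():
            mask |= 1 << b
    return mask


def passes_alphabet(chunk: bytes, mask: int) -> bool:
    """Keep the chunk if it contains at least one byte from the mask."""
    return any((mask >> b) & 1 for b in chunk)


def _bigram_slot(a: int, b: int) -> int:
    # Assumed toy hash of a 2-byte sequence into the 4096-bit table.
    return (a * 251 + b) % BLOOM_BITS


def build_bigram_bloom(keywords) -> int:
    bloom = 0
    for kw in keywords:
        raw = kw.encode()
        for a, b in zip(raw, raw[1:]):
            bloom |= 1 << _bigram_slot(a, b)
    return bloom


def passes_bigram(chunk: bytes, bloom: int) -> bool:
    """Keep the chunk if any adjacent byte pair hits a set bit."""
    return any((bloom >> _bigram_slot(a, b)) & 1 for a, b in zip(chunk, chunk[1:]))
```

Both gates can only produce false passes, never false rejections, which is what makes them safe to run ahead of the regex stages.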
Stage 3 - detector match
For each detector that the prefilter flagged, the FULL regex evaluates.
The regex is what’s in the .toml file - detector.patterns[].regex.
The capture group becomes the candidate credential.
A detector’s .toml carries:
- id, name, service, severity, keywords
- one or more patterns, each with regex + group + optional description
- optional companions (e.g. AWS access key needs the secret key nearby)
- optional verify block - HTTP method, URL template, auth scheme, success status
Detectors fall into two camps:
- Service-anchored. Regex requires a service-specific keyword (AWS_SECRET_ACCESS_KEY=, stripe.com/v1/, dn_ Deepnote prefix). These have HIGH precision: the keyword itself is positive evidence, not just a hint.
- Generic / entropy fallback (generic-password, entropy-api-key, entropy-token). Triggered by entropy + assignment shape only - password = "...", secret: "...", JSON { "token": "..." }. Lower precision; suppression filters do most of the work.
The split matters for the post-process stage.
Stage 4 - post-process
Even a regex match isn’t always a credential. Stage 4 filters:
- Known example fixtures (Stripe docs key, AWS docs key, RFC 7519 specimen JWT).
- Placeholder language - credentials containing YOUR_, INSERT, EXAMPLE, PLACEHOLDER, TODO, FIXME, etc.
- Shape gates.
  - Universal: punctuation_decorated_identifier - credentials starting with --, &, @, !, /, $ (CLI flags, pointers, SQL vars, shell vars, GraphQL refs) or ending in : or ! (UI labels, TypeScript non-null assertions).
  - Generic / entropy only: pure_identifier, word_separated_identifier, scheme_prefixed_uri, url_or_path_segment, contains_uuid_v4_substring. These shapes CAN be real credentials when paired with a service anchor (PowerBI client_id is a UUID, mongodb-atlas is a URI), so we only apply them to anchorless detectors.
- Path-based suppressions - vendored bundles (node_modules/, wp-includes/, bower_components/), CI workflow files (where ${{ secrets.NAME }} references are syntactic, not credentials), i18n translation files, secret-scanner source files (the file IS a scanner; its regex literals shouldn’t fire on itself).
- Cross-chunk reassembly. A secret split across window boundaries gets reassembled from the tail of chunk N + the head of chunk N+1.
A finding that survives stage 4 makes it to output.
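A compact Python sketch of three of these filters and how the universal / anchorless split plays out. The exact matching rules are assumptions: case-insensitive substring matching for placeholders and "letters and underscores only" for pure_identifier are guesses from the examples on this page, and the function names are hypothetical.

```python
import re

PLACEHOLDER_MARKERS = ("YOUR_", "INSERT", "EXAMPLE", "PLACEHOLDER", "TODO", "FIXME")
LEADING_DECORATIONS = ("--", "&", "@", "!", "/", "$")
TRAILING_DECORATIONS = (":", "!")


def is_placeholder(cred: str) -> bool:
    upper = cred.upper()
    return any(marker in upper for marker in PLACEHOLDER_MARKERS)


def is_punctuation_decorated(cred: str) -> bool:
    return cred.startswith(LEADING_DECORATIONS) or cred.endswith(TRAILING_DECORATIONS)


def is_pure_identifier(cred: str) -> bool:
    # Assumed definition: only letters and underscores, e.g. getParameter.
    return re.fullmatch(r"[A-Za-z_]+", cred) is not None


def survives(cred: str, service_anchored: bool) -> bool:
    """Apply universal gates always, identifier gates only to anchorless detectors."""
    if is_placeholder(cred) or is_punctuation_decorated(cred):
        return False
    if not service_anchored and is_pure_identifier(cred):
        return False
    return True
```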
Where the speed comes from
| Stage | Throughput on a modern laptop |
|---|---|
| Chunker | ~5 GB/s (mmap + magic-byte sniff) |
| Alphabet screen | ~12 GB/s (256-bit table lookup, vectorized) |
| Bigram bloom | ~8 GB/s (4096-bit table, vectorized) |
| Hyperscan SIMD | ~3 GB/s (multi-pattern regex) |
| Per-detector regex | ~150 MB/s × detectors flagged |
| Post-process | ~200 MB/s |
The end-to-end number on the dogfood corpus is ~800 MB/s sustained. Hardware acceleration (AVX-512, CUDA) raises the SIMD-prefilter ceiling substantially on big inputs; small inputs (< 100 KB) bottleneck on the chunker and post-process, not the regex.
Where the precision comes from
| Filter | What it catches |
|---|---|
| Known example fixtures | Stripe docs key, AWS docs key, RFC 7519 JWT |
| pure_identifier | getParameter, Benutzername, auth_decoders |
| word_separated_identifier | s3_secret_access_key (function name) |
| scheme_prefixed_uri | urn:foo:bar (URI literal, not creds) |
| url_or_path_segment | /api/v1/users/123 (REST path) |
| contains_uuid_v4_substring | TOKEN_LIST=636765a9-… (UUID identifier) |
| punctuation_decorated_identifier | --api-secret, &password, Password: |
| Vendored-minified-path | node_modules/jquery-3.6.0.min.js |
| CI workflow path | .github/workflows/ci.yml - ${{ secrets.X }} |
| i18n translation path | locale/de.po - translated password word |
Each filter has a known-FP-cluster it was built to defuse. The Suppressions page enumerates them with examples.
What this looks like for one finding
file.env contains: AWS_SECRET_ACCESS_KEY=ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1
stage 1 - chunker: emit chunk{ path: "file.env", data: "AWS_SECRET..." }
stage 2 - alphabet: PASS (chunk has `=`, alphanumerics from the corpus)
stage 2 - bigram bloom: PASS (`AW`, `WS`, `_S` are in the bloom)
stage 2 - Hyperscan: MATCH → triggers `aws-secret-access-key` + `generic-password`
stage 3 - regex eval:
`aws-secret-access-key` regex `(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY|...)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)`
captures `ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1`
`generic-password` regex doesn't match (no `_password`/`_pwd` substring)
stage 4 - post-process:
known-example check: no
`looks_like_pure_identifier`: false (has digits + /)
`looks_like_punctuation_decorated_identifier`: false
→ EMIT
That’s one finding’s life. Multiply by 10⁶ files and the throughput math is why each stage matters.
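The stage 3 step of this trace can be reproduced with Python's re module. The sketch uses only the alternative of the regex that is visible above (the elided `|...` branches are left out), and assumes Python's regex dialect behaves like the scanner's for this pattern; `capture_aws_secret` is a hypothetical name.

```python
import re

# Visible part of the aws-secret-access-key pattern from the trace above.
AWS_SECRET_RE = re.compile(
    r"""(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)"""
)


def capture_aws_secret(line: str):
    """Return capture group 1 (the candidate credential) or None."""
    m = AWS_SECRET_RE.search(line)
    return m.group(1) if m else None
```

Running it on the file.env line from the trace returns the 40-character value; a shorter or longer value returns None because the pattern demands exactly 40 key characters followed by a non-key character or end of input.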
Detectors
A detector is a single TOML file that teaches KeyHog one shape of
credential. There are 891 of them in the embedded corpus today,
spread across detectors/*.toml.
Anatomy of a detector
# detectors/stripe-secret-key.toml
[detector]
id = "stripe-secret-key"
name = "Stripe Secret Key"
service = "stripe"
severity = "critical"
keywords = ["sk_live_", "sk_test_", "stripe"]
[[detector.patterns]]
regex = "sk_(?:live|test)_[a-zA-Z0-9]{24,}"
description = "Stripe secret key - live or test mode"
group = 0
[detector.verify]
method = "GET"
url = "https://api.stripe.com/v1/charges?limit=1"
[detector.verify.auth]
type = "bearer"
field = "match"
[detector.verify.success]
status = 200
That’s the whole contract for one service. Every other detector follows the same shape.
Fields
detector.id - kebab-case, globally unique. Shows up in JSON output
as detector_id and in CLI output as the third column.
detector.name - human-readable name. Shows up in keyhog detectors
listing and IDE plugins.
detector.service - the upstream service slug. Used for grouping
findings (e.g. “you leaked 3 stripe credentials”); a single service
can have multiple detectors (stripe-secret-key,
stripe-restricted-key, stripe-publishable-key).
detector.severity - one of critical | high | medium | low | client-safe | info.
The CLI’s exit code only depends on whether ANY finding exists, but
SARIF / GitHub Code Scanning surface severity prominently.
client-safe is the bug-bounty tier for keys public by design
(Sentry DSN, Stripe pk_*, Mapbox pk., PostHog phc_, Firebase
Web API key, Google Maps browser key, Mixpanel project token,
Algolia search-only, Datadog browser RUM, Bugsnag, Segment write
key). The detector still fires (a token grep is a token grep), but
the finding renders below low and --hide-client-safe filters it
out entirely. Set per-pattern via the client_safe = true field on
a [[detector.patterns]] block - detectors that fire on both the
public and the secret prefix (Stripe pk_* vs sk_*, Mapbox pk.
vs sk.) tag only the public pattern so a misused secret key still
surfaces at its nominal severity.
detector.keywords - strings that the prefilter’s Aho-Corasick matcher looks for.
At least ONE keyword in the chunk is required before the regex even
runs. Pick keywords that are short, distinctive, and likely to appear
near a real credential (stripe, sk_live_, STRIPE_SECRET_KEY).
detector.patterns[] - one or more regexes. Each carries:
- regex - the pattern. Compiled with CASELESS (matches both cases without explicit alternation).
- group - which capture group is the credential. 0 = whole match, 1 = first captured group, etc.
- description - what shape this captures (env var, header, URL, …).
- client_safe - optional bool, default false. When true, any match against this pattern collapses to Severity::ClientSafe regardless of the detector’s nominal severity. Use for patterns that capture keys the vendor expects to ship in client bundles (Sentry DSN, Stripe pk_*, etc.). Per-pattern (not per-detector) so a detector that covers both the public and the secret prefix can tag only the public one.
Multiple patterns means “any of these shapes”. A typical detector has 1–3 patterns covering env-var, JSON, and inline forms.
detector.companions[] - optional. Some credentials are only useful
in pairs (AWS access key + secret key). A companion is a second regex
that must match within N lines of the primary; without it, the
primary’s finding is dropped.
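The proximity rule can be sketched in Python like this. It assumes line-based distance measured in both directions and counts the primary's own line as "within N lines"; the page does not spell out either detail, and `has_companion` is a hypothetical name.

```python
import re


def has_companion(lines: list, primary_line: int, companion_regex: str, within: int) -> bool:
    """True if companion_regex matches within `within` lines of the primary match."""
    lo = max(0, primary_line - within)
    hi = min(len(lines), primary_line + within + 1)
    pattern = re.compile(companion_regex)
    return any(pattern.search(lines[i]) for i in range(lo, hi))
```

A primary match whose `has_companion` check returns False would be dropped under the rule described above.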
detector.verify - optional. If present, keyhog scan --verify
makes the documented API call with the captured credential and:
- API reachable, credential accepted → keep severity, mark verification: "verified-live"
- API reachable, credential rejected → downgrade severity one tier, mark "verified-dead"
Listing detectors
keyhog detectors # human-readable list, grouped by service
keyhog detectors --json # one JSON object per detector
keyhog detectors --json | jq length
891
Filter by service:
keyhog detectors --json \
| jq '.[] | select(.service == "stripe")'
Explaining one detector
keyhog explain stripe-secret-key
Prints the full TOML contents, the keywords, the patterns with their descriptions, the verification endpoint, and any companions. Useful when debugging “why didn’t this fire?” - usually the answer is in the regex or keywords.
Custom detectors
Drop a .toml next to the binary or in ~/.config/keyhog/detectors/:
# ~/.config/keyhog/detectors/my-internal-token.toml
[detector]
id = "acme-internal-token"
name = "ACME internal API token"
service = "acme-internal"
severity = "high"
keywords = ["ACME_API_TOKEN", "acme_internal_"]
[[detector.patterns]]
regex = "acme_internal_[a-zA-Z0-9]{32}"
group = 0
Restart the scanner and the new detector is loaded alongside the built-ins. There’s no opt-in, no flag, no rebuild - TOML in, detector out.
Disabling specific detectors
Turn off a detector by id in .keyhog.toml:
[detector.aws-access-key]
enabled = false
[detector.generic-secret]
enabled = false
Detector ids are the detector_id field in --format json/jsonl output, or
the left column of keyhog detectors. The high-precision fast-path detectors
are prefixed hot- (e.g. hot-aws_key); a service like AWS can have both a
hot- detector and a TOML detector, so disable both to silence it entirely:
[detector.hot-aws_key]
enabled = false
[detector.aws-access-key]
enabled = false
Disabled TOML detectors are dropped before the corpus compiles (zero scan cost); disabled hot-pattern findings are filtered from the report. If an id matches nothing in the loaded corpus, keyhog warns rather than silently ignoring it.
Running only a chosen subset
To run a curated set instead of the full corpus, point --detectors at a
directory holding only the TOMLs you want:
mkdir my-detectors
cp detectors/stripe-secret-key.toml detectors/aws-*.toml my-detectors/
keyhog scan . --detectors my-detectors/ # or KEYHOG_DETECTORS=my-detectors
Quieting a noisy detector
When a detector produces persistent false positives in your repo, down-weight it instead of dropping it entirely so a real hit still surfaces:
keyhog calibrate --fp generic-api-key # record a false positive
keyhog scan . --min-confidence 0.7 # filter low-confidence hits
Each --fp lowers that detector’s Bayesian confidence multiplier
(persisted under $XDG_DATA_HOME/keyhog/), so repeated FPs steadily
push its findings below your --min-confidence floor. To suppress
specific findings rather than a whole detector, use a
.keyhogignore, the [allowlist] config, or a
--baseline.
Severity bumps and downgrades
Severity is a property of the detector, but can shift per-finding:
- Git history → severity one tier lower. A credential present only in non-HEAD git history (the developer already removed it from main) is still a leak - anyone can fetch it - but strictly less urgent than one live in HEAD. Reported in the chunk.metadata.commit field of the finding.
- Verification: dead → severity one tier lower. The credential was format-valid but the API rejected it. Could be a rotated key, a fake in a test file, or a typo.
- Verification: live → severity unchanged. The credential authenticates successfully. As bad as it can get.
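These shifts amount to moving down a severity ladder. A Python sketch under stated assumptions: the ladder covers only critical through low, downgrades stack when both conditions apply, and the result clamps at low. None of those three details is confirmed by this page, and the function names are hypothetical.

```python
TIERS = ["critical", "high", "medium", "low"]


def downgrade(severity: str, steps: int = 1) -> str:
    """Move `steps` tiers down the ladder, clamping at the lowest tier."""
    i = TIERS.index(severity)
    return TIERS[min(i + steps, len(TIERS) - 1)]


def effective_severity(nominal: str, *, history_only: bool, verification: str) -> str:
    """Apply the git-history and dead-credential downgrades to a finding."""
    severity = nominal
    if history_only:                      # present only in non-HEAD history
        severity = downgrade(severity)
    if verification == "verified-dead":   # API rejected the credential
        severity = downgrade(severity)
    return severity
```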
Writing your own - the short version
- Find a real example of the credential format (vendor docs, leaked public sample, source).
- Write the regex. Test it against the example, against a similar non-credential (“looks like, isn’t”), and against an attacker-rotated form.
- Add to detectors/<service>-<thing>.toml - id, keywords, patterns, optionally verify.
- Add a contract file at crates/scanner/tests/contracts/<id>.toml with at least:
  - 2 positives (env-var form, quoted form)
  - 2 negatives (placeholder, EXAMPLE marker)
  - 2 evasions (the actual deployed credential shape from production)
- Run cargo test -p keyhog-scanner --test contracts_runner - must pass for your detector to ship.
That’s it. The contracts gate enforces that every shipped detector catches what it claims to catch.
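Before writing the TOML, it can help to check the regex against the shapes the contract file will carry. The sketch below is a quick local check in Python, not the contracts runner itself; the "acme" service, its key format, and the helper name are hypothetical.

```python
import re

# Hypothetical detector regex for an imaginary "acme" service.
PATTERN = re.compile(r"\bacme_sk_[A-Za-z0-9]{24}\b")

SAMPLE = "acme_sk_" + "a1B2c3D4e5F6g7H8i9J0k1L2"  # made-up value, 24 chars

POSITIVES = [
    f"ACME_KEY={SAMPLE}",        # env-var form
    f'key = "{SAMPLE}"',         # quoted form
]
NEGATIVES = [
    "ACME_KEY=acme_sk_<your-key-here>",  # placeholder
    "ACME_KEY=acme_sk_EXAMPLE",          # EXAMPLE marker, too short to match
]


def check_contract(pattern, positives, negatives):
    """Return a list of human-readable failures; empty means the regex holds."""
    failures = []
    for text in positives:
        if not pattern.search(text):
            failures.append(f"missed positive: {text!r}")
    for text in negatives:
        if pattern.search(text):
            failures.append(f"matched negative: {text!r}")
    return failures


print(check_contract(PATTERN, POSITIVES, NEGATIVES))
```

This only exercises the regex. Suppression of full-length example credentials happens in later layers and is not modeled here.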
HTTP and wire scanning
Real credentials don’t always sit on disk. They flow through:
- Live web bundles that ship from production at a public URL.
- HAR files that browsers (Chrome / Firefox / Safari DevTools) produce when you click “Save all as HAR with content.”
- mitmproxy / Burp captures of an authenticated session.
- curl / httpie / Postman exports of one specific request you want to verify.
KeyHog scans every one of these, but the surface is split across a few flags and sources. This page is the map.
TL;DR
| Workflow | Command |
|---|---|
| Scan a public JS bundle | keyhog scan --url https://app.example.com/static/main.js |
| Scan every URL in a list | keyhog scan --url $(cat urls.txt) |
| Scan a source-map exposed by Webpack | keyhog scan --url https://app.example.com/static/main.js.map |
| Scan a HAR export from DevTools | keyhog scan capture.har (see HAR auto-expansion) |
| Scan a single curl response | curl -s https://api/... | keyhog scan --stdin |
| Scan a saved Burp / mitmproxy capture | keyhog scan dump.txt (treats as text - no protocol parsing) |
| Route every fetch through Burp | keyhog scan --url https://... --proxy http://burp:8080 --insecure |
| Scan in an air-gapped network | keyhog scan --url https://... --proxy off |
The --url flag (Web Source)
keyhog scan --url https://app.example.com/static/main.js
keyhog scan --url https://app.example.com/static/main.js \
https://app.example.com/static/runtime.js \
https://app.example.com/static/vendor.js
Each URL is fetched with the shared HTTP client policy (see Proxy and TLS below). The response is routed by extension:
- .js → one chunk per file, scanned as plain text.
- .map → JSON parsed, each sourcesContent[i] becomes its own chunk tagged with the original filename. This is how a Webpack build with devtool: 'source-map' accidentally exposes server-side env vars baked into the bundle at build time.
- .wasm → linear-memory + import section dumped as strings (best-effort; native WASM symbol extraction lives behind the binary feature).
- Everything else → one chunk of text.
Findings are tagged source: "web:js", web:sourcemap,
web:sourcemap:raw, web:wasm, or web:other. The original URL
is the file_path.
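To make the .map case concrete, here is a minimal sketch of splitting a source map into per-file chunks. It assumes the standard Source Map v3 fields (sources, sourcesContent); the function name and the fallback label for a missing filename are illustrative and not taken from KeyHog.

```python
import json


def expand_sourcemap(raw: str):
    """Return (original_filename, text) pairs, one per embedded source."""
    doc = json.loads(raw)
    sources = doc.get("sources", [])
    contents = doc.get("sourcesContent") or []
    chunks = []
    for i, text in enumerate(contents):
        if text is None:          # the format allows null entries
            continue
        # Hypothetical fallback label when "sources" is shorter than the content list.
        name = sources[i] if i < len(sources) else f"<source {i}>"
        chunks.append((name, text))
    return chunks


example = json.dumps({
    "version": 3,
    "sources": ["webpack:///src/config.js", "webpack:///src/app.js"],
    "sourcesContent": ["const KEY = process.env.API_KEY;", None],
    "mappings": "",
})
for name, text in expand_sourcemap(example):
    print(name, "->", text)
```

Each returned pair can then be scanned as its own chunk, so a finding points at the original source file rather than at the bundled .map.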
SSRF defense
--url refuses to fetch:
- Private RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
- Loopback (127.0.0.0/8, ::1).
- Link-local (169.254.0.0/16, fe80::/10).
- Cloud metadata endpoints (169.254.169.254, the GCP / Azure / AWS / DigitalOcean / Hetzner variants).
There is no CLI flag to turn this off - it’s hardcoded so a user can’t accidentally
turn a --url invocation into a metadata-service IAM exfil.
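The address check can be sketched with Python's ipaddress module. This is an illustration of the idea, not KeyHog's code: the network list is copied from the ranges above, and unwrapping IPv4-mapped IPv6 addresses is an assumption for the --url path.

```python
import ipaddress

# Ranges listed above; 169.254.169.254 falls inside 169.254.0.0/16.
BLOCKED = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private
    "127.0.0.0/8", "::1/128",                         # loopback
    "169.254.0.0/16", "fe80::/10",                    # link-local
)]


def is_blocked(addr: str) -> bool:
    """True if a resolved IP address must not be fetched."""
    ip = ipaddress.ip_address(addr)
    # Assumption: treat ::ffff:a.b.c.d as the IPv4 address it wraps.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED)


for candidate in ("10.1.2.3", "169.254.169.254", "::ffff:127.0.0.1", "93.184.216.34"):
    print(candidate, is_blocked(candidate))
```

A real client has to run this check on the address the hostname resolves to at connection time, not on the hostname string, or a DNS record pointing at a private IP slips through.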
Proxy and TLS
Everything outbound - --url, --github-org, --s3-bucket,
--verify’s API calls - runs through one HTTP client builder.
Policy:
| Source | Effect |
|---|---|
| --proxy http://burp:8080 | Explicit. Wins over everything. |
| --proxy off | Disable proxying entirely, ignore env vars. |
| KEYHOG_PROXY env var | Same as --proxy. Useful inside CI containers. |
| HTTPS_PROXY / HTTP_PROXY | reqwest’s default. Last resort. |
| --insecure | Accept any TLS cert (self-signed Burp CA, etc.). |
| KEYHOG_INSECURE_TLS=1 | Same as --insecure. |
Order: explicit flag → KEYHOG_PROXY → standard env vars.
User-Agent: keyhog/<version> is always set so you can grep your
proxy logs for keyhog traffic without guessing.
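The precedence order can be written down as a small resolver. This is a sketch of the stated policy only; the function name is hypothetical, and preferring HTTPS_PROXY over HTTP_PROXY in the last step is an assumption (the real client picks per request scheme).

```python
from typing import Mapping, Optional


def resolve_proxy(flag: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the proxy URL to use, or None for a direct connection."""
    # Explicit flag first, then KEYHOG_PROXY; "off" disables proxying outright.
    for candidate in (flag, env.get("KEYHOG_PROXY")):
        if candidate:
            return None if candidate == "off" else candidate
    # Last resort: the standard env vars (order between the two is assumed).
    return env.get("HTTPS_PROXY") or env.get("HTTP_PROXY")


env = {"KEYHOG_PROXY": "http://ci-proxy:3128", "HTTPS_PROXY": "http://corp:8080"}
print(resolve_proxy("http://burp:8080", env))  # the explicit flag wins
print(resolve_proxy(None, env))                # falls back to KEYHOG_PROXY
print(resolve_proxy("off", env))               # proxying disabled
```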
HAR auto-expansion
Any file with a .har extension is recognised by the filesystem
source and expanded into one chunk per request and one chunk per
response. Each chunk carries a source_type that tells you which
side of the exchange it came from:
| Chunk | source_type | What it contains |
|---|---|---|
| Request | wire:har:request | <METHOD> <URL>, every request header, query string, POST body. |
| Response | wire:har:response | <STATUS> <statusText>, every response header, response body. |
Finding file_path becomes <har-path>#<request-url>, so the same
HAR with five different requests produces five distinct paths.
Editors that jump-to-file on path:line URIs land on the HAR but
the URL tail makes the location unambiguous.
keyhog scan capture.har --format json | \
jq '.[] | select(.location.source == "wire:har:request")'
filters down to outbound credentials only - the bug-bounty
“what did I send” view. Swap request for response to see what
the upstream reflected back at you.
A HAR that fails to parse (truncated export from a crashed browser) falls through to plain text scanning so credentials still surface; the file isn’t silently dropped.
Defenses:
- 4× --max-file-size budget on cumulative request+response body bytes. Defeats a malicious HAR that decompresses to gigabytes.
- The cheap pre-sniff ({"log" + "entries" in the first 2 KiB) bails before invoking the JSON parser on a 200 MiB blob that obviously isn’t HAR.
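The expansion and the pre-sniff can be sketched as follows. This is an illustration built on the HAR 1.2 field names (log.entries, request.headers, postData.text, content.text), not KeyHog's parser. The sniff is a loose reading of the rule above, the body-size budget is not modeled, and the function names are hypothetical.

```python
import json


def looks_like_har(raw: bytes) -> bool:
    """Cheap pre-sniff: both marker keys must appear in the first 2 KiB."""
    head = raw[:2048]
    return b'"log"' in head and b'"entries"' in head


def _header_lines(headers) -> str:
    return "\n".join(f"{h['name']}: {h['value']}" for h in headers)


def expand_har(raw: bytes, har_path: str):
    """Return (source_type, file_path, text) chunks, or None for plain-text fallback."""
    if not looks_like_har(raw):
        return None
    try:
        chunks = []
        for entry in json.loads(raw)["log"]["entries"]:
            req, resp = entry["request"], entry["response"]
            path = f"{har_path}#{req['url']}"
            req_text = "\n".join([
                f"{req['method']} {req['url']}",
                _header_lines(req.get("headers", [])),
                req.get("postData", {}).get("text", ""),
            ])
            resp_text = "\n".join([
                f"{resp['status']} {resp.get('statusText', '')}",
                _header_lines(resp.get("headers", [])),
                resp.get("content", {}).get("text", ""),
            ])
            chunks.append(("wire:har:request", path, req_text))
            chunks.append(("wire:har:response", path, resp_text))
        return chunks
    except (ValueError, KeyError, TypeError):
        return None  # truncated or malformed HAR: scan it as plain text instead
```

Returning None instead of raising keeps the "never silently dropped" property: the caller can always hand the raw bytes to the plain-text scanner.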
Scanning a single HTTP exchange (stdin)
The most common ad-hoc workflow:
curl -s https://api.example.com/v1/me \
-H "Authorization: Bearer $TOKEN" \
| keyhog scan --stdin
Or just pipe a saved response:
keyhog scan --stdin < response.txt
keyhog scan - (bare dash) is the same as --stdin (grep / wc
convention; added in v0.5.28).
--stdin reads up to 10 MiB (the cap listed in the CLI reference); beyond that, write to a temp file and
scan the path. Findings from stdin carry the stdin source. To get
the richer wire:har:request / wire:har:response provenance tags,
save the exchange as a .har file and scan that instead (see
HAR auto-expansion).
Headers, bodies, URL params - where the secret sits
KeyHog is content-blind: it greps the raw bytes. That means a
Bearer ghp_… in an HTTP header gets the same finding as a
"token": "ghp_…" in a JSON body or a ?token=ghp_… in the URL.
For an HTTP capture this is usually what you want - the location column in the finding gives the byte offset within the capture, and the surrounding context (line ±2) is enough to tell whether it was a header or a body.
What KeyHog does not do today:
- Parse the HTTP wire format and emit header:Authorization vs body:json:$.token provenance fields.
- Distinguish a secret in a request from a secret in the response for plain-text captures (one is being sent OUT, one is being sent IN - different threat model). Only HAR files get that split today, via the wire:har:request / wire:har:response tags.
Those land in the roadmap below.
Roadmap
The wire-scanning surface is intentionally narrow today. Items queued for a later release, with their issue links:
- mitmproxy .mitm flow-dump support. Same shape as HAR but binary-framed. Use the mitmproxy-rs crate to decode.
- Header / body / URL-param provenance. HAR expansion lands one chunk per request and one chunk per response today. The next step is attaching wire_location: header:<name> | body | query to each finding so the JSON consumer can filter wire_location == "header:Authorization" for the highest-signal subset (intentional auth tokens vs accidental body leaks vs URL-logged secrets).
- Live proxy mode. Run keyhog proxy --listen :8080 and have it act as an HTTP proxy that scans every flow inline, writing findings to stdout. The use case is recording a browsing session against a target and getting a single report of every credential the site shipped to the client.
- WebSocket frame scanning. HAR files don’t include WebSocket payloads. mitmproxy dumps do. Frame-level scanning would catch tokens passed over upgraded connections (Slack, Discord, collaborative editors).
No promises on timeline - track via github.com/santhsecurity/keyhog/issues.
Why this matters for bug bounties
A modern SPA bundle on a typical SaaS app can ship 200+ npm
dependencies and a sourcemap that exposes every server-side env
var the build process touched. Manual code review of one
main.js.map against the 891-detector corpus is hours; running
keyhog scan --url https://app.target.com/static/main.js.map
takes seconds.
Pair it with --hide-client-safe (see
CLI reference) to filter out keys that the
vendor designed to ship in client bundles (Sentry DSN, Stripe
pk_*, Mapbox pk., PostHog phc_, etc.) and you’re left with
the keys that actually represent an exfiltration boundary.
Suppressions
A suppression is a filter that drops a candidate match after the regex fires but before it becomes a finding. KeyHog applies them in layers.
The two suppression lists
Test fixtures (always on, opt-out)
crates/cli/data/suppressions/test-fixtures.toml, baked into the
binary. Lists publicly documented credentials that vendor docs ship
as examples:
[[fixture]]
detector = "stripe-secret-key"
credential = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"
reason = "Stripe docs sample, https://stripe.com/docs/api/auth"
[[fixture]]
detector = "aws-access-key"
credential = "AKIAIOSFODNN7EXAMPLE"
reason = "AWS docs sample, https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html"
Disable with --no-suppress-test-fixtures if you want to see them
fire (rare, but useful when validating that a detector still matches
the canonical shape).
Repo-local suppressions (opt-in, project-scoped)
.keyhog.toml in your repo root:
[suppress]
# Drop findings on these credential hashes (sha256 of the captured value).
# Use when a finding is a true positive that you've intentionally accepted
# (e.g. a published OAuth client_id, or a fixture you've cleared with
# the upstream service).
hashes = [
"sha256:abc123...",
"sha256:def456...",
]
# Drop findings from these files entirely (gitignore-style globs).
paths = [
"fixtures/**",
"docs/example_*.env",
]
# Drop findings from these detectors entirely.
detectors = [
"generic-password",
]
Compute the hash of an existing finding:
keyhog scan . --format json | jq -r '.[] | "\(.detector_id) \(.credential_hash)"'
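If you only have the raw value at hand, the hash can be approximated locally. The sketch assumes the hash is the lowercase hex SHA-256 of the credential's UTF-8 bytes with a sha256: prefix and no salt; the source only says "sha256 of the captured value", so treat the credential_hash field in the scan output as authoritative.

```python
import hashlib


def suppression_hash(credential: str) -> str:
    """Assumed format: 'sha256:' + hex digest of the UTF-8 encoded value."""
    digest = hashlib.sha256(credential.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


# Placeholder value, not a real credential.
print(suppression_hash("example-client-id"))
```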
Shape-based suppression (always on, can’t opt out)
These don’t depend on a list. They’re heuristics about credential shape that are universally true:
| Filter | Drops shapes like |
|---|---|
| punctuation_decorated_identifier | --api-secret, &password, $API_KEY, Password:, apiKey! |
For generic-only / entropy-only detectors, additional shape gates apply. See How detection works for the full list and rationale.
Path-based suppression (always on)
Specific directories produce findings that are almost always not credentials. KeyHog hard-codes a small set:
| Path pattern | Why |
|---|---|
| node_modules/, vendor/, bower_components/, jspm_packages/, site-packages/ | Vendored third-party code, minified bytes coincide with secret prefixes |
| wp-content/plugins/, wp-content/themes/, wp-includes/ | WordPress vendored trees |
| app/assets/javascripts/bootstrap*.js, app/assets/javascripts/jquery*.js, etc. | Rails legacy asset path, vendored JS |
| *.min.js, *.bundle.js, *.min.css | Minified bundles |
| .github/workflows/, .gitlab-ci.yml, .circleci/, Jenkinsfile, .travis.yml, azure-pipelines*, bitbucket-pipelines* | CI config, ${{ secrets.X }} is syntactic |
| locale/, locales/, i18n/, l10n/, translations/, lang/, langs/, *.po, *.pot | i18n translation files, translated password/token words are not credentials |
| Files containing secretscanner, secret-scanner, trufflehog, gitleaks, detect-secrets in the path | The file IS itself a secret scanner; its regex literals shouldn’t fire on itself |
These are not configurable. They have such high precision / low recall loss that making them opt-in would just make the scanner louder for no benefit. If a specific path you care about is being suppressed incorrectly, that’s a bug worth reporting.
Telemetry: what got suppressed
Pass --dogfood to surface what was dropped:
keyhog scan . --dogfood --format json | jq '.dogfood.events[]'
Each event has the suppressor name (test_fixture_suppression,
pure_identifier_no_digit, vendored_minified_path, etc.), the
path, the redacted credential, and the rule that fired. Useful when
asking “is the scanner being too aggressive on my code?”.
Adding a suppression for an FP cluster
If you find a cluster of 5+ FPs that share a shape, file an issue with:
- The detector that fired
- A sanitized example of the FP (replace the captured value with [REDACTED])
- Why it’s not a credential (regex shouldn’t have matched, or shape gate should have caught it)
The right fix is either a tightened regex, a new shape filter, or a path / file-extension exclusion. Adding the literal credential to the test-fixtures list is the LAST resort because it only hides one specific FP, not the underlying shape.
Verification
keyhog scan --verify makes an HTTP call to each detector’s
documented verification endpoint with the captured credential.
The response tells you if the credential is live.
$ keyhog scan . --verify
src/config/staging.env:14:12 CRITICAL stripe-secret-key
sk_live_4eC39H...Tcd3Hc
entropy 5.21 | confidence 0.999 | verified-live
src/old/legacy.env:8:5 HIGH stripe-secret-key (downgraded)
sk_live_oldKEy...xyz12
verified-dead | originally CRITICAL
What “live” means
Each detector’s verify block in its TOML defines:
- method (GET / POST)
- url (with {{match}} placeholder for the captured credential)
- auth.type (bearer, basic, header, query, none)
- auth.field (match, companion-name, …)
- success.status (HTTP status code, default 200)
- optional success.body_contains (substring the response body must contain)
The verifier:
- Renders the URL with the credential substituted in
- Builds the auth header / query param as specified
- Sends the request
- Compares the response status (and optionally body) to the success criteria
If the criteria match: verified-live. If not: verified-dead. If
the request times out or DNS fails: verification-error (treated as
unverified, severity unchanged).
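The decision logic reduces to a pure function over the response. The sketch below makes no network calls and is not KeyHog's verifier: the function names are hypothetical, and percent-encoding the credential when rendering the URL is an assumption.

```python
from typing import Optional
from urllib.parse import quote


def render_url(template: str, credential: str) -> str:
    """Substitute the captured credential for the {{match}} placeholder."""
    # Assumption: the value is percent-encoded before substitution.
    return template.replace("{{match}}", quote(credential, safe=""))


def classify(status: Optional[int], body: str = "", *,
             success_status: int = 200,
             body_contains: Optional[str] = None,
             transport_error: bool = False) -> str:
    """Map a response (or a failed request) to a verification result."""
    if transport_error or status is None:   # timeout, DNS failure
        return "verification-error"
    if status != success_status:
        return "verified-dead"
    if body_contains is not None and body_contains not in body:
        return "verified-dead"
    return "verified-live"


print(render_url("https://api.example.com/v1/check?key={{match}}", "a b/c"))
print(classify(200, '{"id": "acct_1"}', body_contains='"id"'))
print(classify(401))
print(classify(None, transport_error=True))
```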
Severity shift on verification
| Verification result | Severity action |
|---|---|
| verified-live | Unchanged (it really is what it claims to be) |
| verified-dead | Downgrade one tier (critical -> high, high -> medium, …) |
| verification-error | Unchanged, treated as unverified |
| skipped (no --verify flag) | Unchanged |
A dead credential is still a leak (developer typed it into a file once), so KeyHog doesn’t drop it entirely. The downgrade just means “this is less urgent than a credential someone could authenticate with right now.”
Network behavior
--verify makes network calls. Two flags shape what the verifier
talks to:
- --proxy <url> – route all verification through an HTTPS proxy. Useful in corp networks. Same as HTTPS_PROXY env var.
- --insecure – accept self-signed certs. ONLY use against internal endpoints you control. The default is strict TLS verify.
The verifier never follows redirects (SSRF defense – a 302 to a private IP could otherwise leak the credential to an internal service). If a vendor’s auth endpoint returns 302 to follow into the API, that endpoint’s verify block in the detector TOML is wrong; report a bug.
Outbound destinations are filtered at the client level:
- No localhost, 127.0.0.0/8, 169.254.0.0/16, or RFC 1918 private ranges.
- No IPv4-mapped IPv6 of the above.
- No cloud-metadata IPs (169.254.169.254 on AWS / Azure / GCP).
These rules are enforced for every detector even if its TOML
specifies a localhost URL by mistake. Set KEYHOG_PROXY=off to
disable proxy resolution (useful for air-gapped builds where the
proxy env vars are set but no proxy is actually reachable).
Rate limits
Verification is sequential per-finding within a single keyhog scan
invocation, with a 100 ms gap between calls to the same hostname.
That’s slow enough to avoid tripping vendor rate limits for typical
scans (dozens of findings) and fast enough to feel interactive.
If you have hundreds of candidates and want parallelism, the right
approach is to scan first WITHOUT --verify to get the candidate
list, then verify in batches with a script that respects each
service’s documented rate limit.
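For a batch script, the per-hostname spacing can be sketched like this. It is a minimal illustration with an injected clock so the logic is easy to test; the class name is hypothetical, and measuring the gap from the start of the previous call is an assumption.

```python
from urllib.parse import urlsplit


class HostGap:
    """Track how long to wait before the next call to each hostname."""

    def __init__(self, gap_s: float = 0.1):   # 100 ms, as described above
        self.gap_s = gap_s
        self._last_start = {}

    def wait_for(self, url: str, now: float) -> float:
        """Return the seconds to sleep before calling `url` at time `now`."""
        host = urlsplit(url).hostname
        last = self._last_start.get(host)
        wait = 0.0 if last is None else max(0.0, last + self.gap_s - now)
        self._last_start[host] = now + wait   # when this call will actually start
        return wait


gap = HostGap()
print(gap.wait_for("https://api.example.com/v1/me", now=0.0))
print(gap.wait_for("https://api.example.com/v1/me", now=0.03))
print(gap.wait_for("https://other.example.org/ping", now=0.04))
```

In a real script you would pass time.monotonic() as now and time.sleep() the returned value. For vendors with stricter documented limits, raise gap_s for that host.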
Detectors without verification
Not every detector has a verify block. About 60% do. The rest are:
- Format-only detectors (private keys, certificates, JWTs) where the credential itself has provable structure but no service to call.
- Services without a known low-impact verification endpoint (some internal APIs, deprecated services).
For these, --verify is a no-op. The verification field of the
finding stays skipped.
What you can’t do
- --verify does NOT modify data. Every verification call is either a GET or a request to a benign read-only endpoint (e.g. GET /me, GET /charges?limit=1).
- The verifier does NOT cache results across runs. Each keyhog scan --verify makes fresh calls. Caching would risk reporting a rotated credential as “live” hours after it was revoked.
- You can’t call verification on a credential that wasn’t captured by a scan. There’s no keyhog verify <credential> subcommand, because verification depends on knowing which detector it came from.
Pre-commit hook
The point of a pre-commit hook is to stop credentials from ever landing in your repo’s history. It runs locally, fast enough to feel synchronous, and blocks the commit if a finding shows up.
Install in one command
From inside a git repo:
keyhog hook install
That writes a .git/hooks/pre-commit script that calls
keyhog scan --fast --git-staged (the same command
.pre-commit-hooks.yaml exposes for the pre-commit framework).
If a pre-commit hook already exists in the repo, keyhog hook install refuses to overwrite it - remove it (or run
keyhog hook uninstall) and re-install. The next git commit
invokes the hook.
If your repo uses pre-commit instead of
raw git hooks, add the following to .pre-commit-config.yaml:
repos:
  - repo: https://github.com/santhsecurity/keyhog
    rev: v0.5.37
    hooks:
      - id: keyhog
        stages: [pre-commit]
Then pre-commit install once, and it runs on every commit.
What gets scanned
keyhog scan --git-staged walks the index (the set of files git
is about to commit), not the working tree. Why this matters:
- A file you’ve modified but not staged with git add is NOT scanned. You’re free to keep credentials in scratch files as long as you don’t stage them.
- A file you’ve staged then modified gets scanned in the staged form, not the working-tree form. The scanner sees what git commit would commit.
The walk only includes files that are part of THIS commit, so it’s fast even on huge repos. A typical commit touches a few files and the scan is under 50 ms.
What happens on a finding
Stderr:
$ git commit -m "add staging config"
keyhog: 1 finding blocked this commit
src/config/staging.env:14:12 CRITICAL stripe-secret-key
sk_live_4eC39H...Tcd3Hc
Options:
1. Remove the credential from src/config/staging.env, then commit again.
2. Use a placeholder + load the real value from env at runtime.
3. If this is a false positive, run keyhog with --no-suppress-test-fixtures
or add to .keyhog.toml suppressions.
$
Exit code is 1, so git aborts the commit and your work-in-progress
stays in the index. Fix the file, git add the fix, and commit again.
When you really need to commit anyway
git commit --no-verify
That bypasses the hook. KeyHog logs nothing about it; that’s your
prerogative. Use it sparingly. A team norm of --no-verify for
“trust me” commits defeats the point of the hook.
A better pattern when a legitimate-looking credential needs to ship (e.g. a public OAuth client_id that vendor docs say to commit):
- Add its sha256 hash to .keyhog.toml: [suppress] hashes = ["sha256:abc123..."]
- Commit the suppression file alongside the credential.
- The next commit sees the hash and skips it.
This way the next contributor doesn’t have to learn the trick.
Performance
Pre-commit scans are designed for sub-100 ms latency on typical commits. If yours feels slow:
- keyhog daemon start (unix only). The daemon holds the compiled scanner in memory; pre-commit invocations bypass the ~3 s cold start. Latency drops from ~3 s to ~30 ms.
- --fast skips the entropy / ML scorer. Removes ~20% of detectors but ~50% of scan time. Worth it for the pre-commit path; the full scan still runs in CI.
Uninstall
keyhog hook uninstall
Removes the KeyHog .git/hooks/pre-commit file if it carries the
generated KeyHog marker. If you hand-edited the hook,
keyhog hook uninstall refuses to touch it - clean it up by hand.
For the pre-commit framework, delete the keyhog stanza from
.pre-commit-config.yaml and run pre-commit clean.
CI integration
A CI step that catches leaked credentials before they ship. Four patterns: GitHub Actions, GitLab CI, CircleCI, and Drone CI / generic shell. All exit non-zero on findings, which is what CI wants.
GitHub Actions
# .github/workflows/secrets.yml
name: secrets
on:
  push:
    branches: [main]
  pull_request:
jobs:
  keyhog:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # scan full history, not just HEAD
      - name: Install keyhog
        run: curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
      - name: Scan repo
        run: ~/.local/bin/keyhog scan . --format sarif > keyhog.sarif
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: keyhog.sarif
The upload-sarif action posts findings to the Security -> Code
scanning tab. if: always() makes sure findings show up even when
the scan exits non-zero.
To scan ONLY git history (better suited to a post-merge or release gate than to every PR):
- name: Scan history
  run: ~/.local/bin/keyhog scan --git-history . --format sarif > keyhog.sarif
GitLab CI
# .gitlab-ci.yml
keyhog:
  stage: test
  image: ubuntu:24.04
  before_script:
    - apt-get update -qq && apt-get install -y curl libhyperscan-dev
    - curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
  script:
    # Exits non-zero on findings, which fails the job and gates the MR.
    - ~/.local/bin/keyhog scan . --format sarif --output keyhog.sarif
  artifacts:
    when: always # keep the report even when the scan fails the job
    paths:
      - keyhog.sarif
The job’s exit status gates the merge request (keyhog exits non-zero on
findings) and the SARIF is kept as a downloadable artifact. Note: GitLab’s
artifacts:reports:sast expects GitLab’s own SAST JSON schema, not SARIF,
so to surface findings in the MR security dashboard you must convert the SARIF
to that format (e.g. a SARIF-to-GitLab-SAST converter step) - pointing
reports:sast directly at a SARIF file does not work.
CircleCI
# .circleci/config.yml
version: 2.1
jobs:
  keyhog:
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - run:
          name: Install keyhog
          command: |
            curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
            echo 'export PATH="$HOME/.local/bin:$PATH"' >> $BASH_ENV
      - run:
          name: Scan repo
          command: keyhog scan . --format sarif --output keyhog.sarif
      - store_artifacts:
          path: keyhog.sarif
          destination: keyhog.sarif
workflows:
  build:
    jobs:
      - keyhog
Drone CI / generic shell
# .drone.yml
pipeline:
  keyhog:
    image: alpine:3.20
    commands:
      - apk add --no-cache curl
      - curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
      - $HOME/.local/bin/keyhog scan .
Same pattern works in Jenkins, Buildkite, Woodpecker, Concourse, or any CI that can run a shell. The two lines are the install command and the scan command.
Pinning a version
The install scripts pull the latest release by default. For reproducible CI, pin a specific version:
curl -fsSL ...install.sh | KEYHOG_VERSION=v0.5.37 sh
Update the pin via a Renovate / Dependabot config or just bump it by hand when a new release lands.
Caching the install
The install script downloads a ~25 MB binary. On GitHub Actions, cache it across runs:
- name: Cache keyhog
  id: cache-keyhog
  uses: actions/cache@v4
  with:
    path: ~/.local/bin/keyhog
    key: keyhog-${{ runner.os }}-v0.5.37
- name: Install keyhog
  if: steps.cache-keyhog.outputs.cache-hit != 'true'
  run: curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | KEYHOG_VERSION=v0.5.37 sh
The if: cache-hit != 'true' guard is what makes the cache pay off - without
it the install step re-downloads on every run and the cache does nothing. Bump
both the cache key and the pinned KEYHOG_VERSION together when you upgrade.
Scan history once per release, not per PR
A full git-history scan is the right thing to run on main post-merge
and on release tags, but it’s overkill for every PR. A typical setup:
| Trigger | Scan | Cost |
|---|---|---|
| Pull request | keyhog scan . (working tree) | ~5 s on a typical repo |
| Push to main | keyhog scan --git-history . | ~30 s on a year-old repo, scales linearly |
| Release tag | keyhog scan --git-history . --verify | Adds 100 ms per finding for live verification |
The PR scan keeps the dev feedback loop fast. The post-merge history scan catches anything that slipped through pre-commit + PR review. The release scan verifies what’s live, useful for the changelog (“rotated these N credentials before shipping”).
Failure modes worth knowing
- Forked PR + secret credentials: GitHub Actions doesn’t expose org secrets to forked-PR runners, so a verifier endpoint that needs authentication won’t run. Findings still get reported as unverified; that’s correct behavior.
- Shallow clones: actions/checkout defaults to fetch-depth: 1, which only fetches HEAD. A --git-history scan against a shallow clone sees only that one commit, not the history behind it. Set fetch-depth: 0 if you want history.
- LFS files: keyhog reads the LFS pointer file, not the contents. To scan LFS-stored binaries, enable LFS in checkout (lfs: true) so the real file is on disk when the scanner runs.
CLI reference
keyhog scan [PATH]
The main subcommand. Scans PATH (default: current directory) and
emits findings. Exit code: 0 clean, 1 findings present, 2 user error,
3 system error (see Exit codes).
Input selection
| Flag | Effect |
|---|---|
| <PATH> | Positional path. File or directory. |
| --stdin | Read from stdin instead. 10 MiB cap. |
| --exclude-paths <GLOB>... | Skip files matching glob. Space-separated list, repeatable. |
| --git-staged | Scan git-staged files only (pre-commit mode). |
| --git-history <PATH> | Walk the added-line patches of each commit (default: HEAD only). |
| --git-diff <BASE_REF> | Scan only added lines since BASE_REF. |
| --docker-image <IMAGE> | Scan a saved Docker image archive. |
| --s3-bucket <BUCKET> | Scan an S3 bucket. Use --s3-prefix to narrow. |
| --url <URL>... | Fetch + scan one or more HTTPS URLs (JS/source-map/WASM/text). |
Output
| Flag | Effect |
|---|---|
| --format <text\|json\|jsonl\|sarif> | Output format. Default text. The machine formats (json/jsonl/sarif) are findings-only: the banner/summary go to stderr (or are omitted), so stdout stays a clean parseable document. |
| --output <FILE> | Write the report to FILE instead of stdout. |
| --stream | Stream a one-line redacted preview per finding to stderr as they’re found; the full formatted report still lands on stdout/--output after verification. |
| --show-secrets | Show full credentials. Default redacts. |
| --min-confidence <FLOAT> | Only emit findings >= confidence. 0.0..=1.0. |
| --dogfood | Surface suppression telemetry in output. |
Verification
| Flag | Effect |
|---|---|
| --verify | Call each detector’s verify endpoint. |
| --proxy <URL> | Route verifier traffic through a proxy (http://burp:8080, socks5://...). off disables all proxying (incl. env). |
| --insecure | Skip TLS cert verification on verifier traffic (don’t use outside a lab). Env: KEYHOG_INSECURE_TLS=1. |
Performance
| Flag | Effect |
|---|---|
| --fast | Skip entropy + ML scorer. ~50% faster, ~20% fewer detectors. |
| --daemon | Force daemon route. Unix only. |
| --no-daemon | Force in-process scan even if daemon is up. |
| --timeout <SECONDS> | Hard per-scan deadline. |
Detector tuning
| Flag | Effect |
|---|---|
| --detectors <DIR> | Use the detector TOMLs in DIR instead of the embedded corpus. To run a curated subset, copy the detector TOMLs you want into a directory and point --detectors at it (there is no per-ID enable/disable flag). Env: KEYHOG_DETECTORS. |
| --no-suppress-test-fixtures | Show findings on bundled example credentials. |
| --baseline <FILE> | Compare against a prior scan; show only new. |
| --hide-client-safe | Drop every CLIENT-SAFE finding (Sentry DSN, Stripe pk_*, Mapbox pk., PostHog phc_, etc.) before reporting. Use this for bug-bounty / exfiltration-impact workflows where keys public by design are noise. |
Environment variables
| Variable | Effect |
|---|---|
| KEYHOG_BACKEND=gpu\|simd\|cpu\|auto | Force a scan backend instead of letting the auto-router choose. |
| KEYHOG_NO_GPU=1 | Short-circuit GPU init at hardware-probe time. The scanner runs as if no GPU adapter existed. Use this when Metal / CUDA init blocks on a given host (Apple Silicon Mac configurations have reproduced this) and you want predictable startup. |
| KEYHOG_PER_CHUNK_TIMEOUT_MS=<MS> | Attach an Instant deadline to every chunk scan. Default unset = no timeout (original behaviour). Recommend 30000 for production scans where bounded latency matters more than scan completeness. |
| KEYHOG_THREADS=<N> | Pin the rayon worker count. Default = physical-core count. |
| KEYHOG_DETECTORS=<DIR> | Override the auto-discovered detector directory. |
| KEYHOG_CACHE_DIR=<DIR> | Override the regex / database cache location (must sit under $HOME or /tmp/keyhog-cache-<uid> for safety). |
keyhog detectors
Lists every detector in the embedded corpus.
keyhog detectors # human-readable, grouped by service
keyhog detectors --json # one JSON object per detector
keyhog detectors --json | jq length
891
keyhog explain <DETECTOR_ID>
Pretty-print a single detector’s TOML. Includes keywords, patterns, companion rules, and verification endpoint.
keyhog explain stripe-secret-key
keyhog watch [PATH]
Daemon-mode subcommand that watches a directory for file changes and re-scans on each one. Useful for IDE-side feedback. Unix only.
keyhog watch src/ # watch the source tree
keyhog watch # watch the current directory
keyhog tui [PATH]
Interactive ratatui dashboard. Streams findings in a severity-colored
list while a status panel reports files scanned, throughput, GPU
backend, and pattern count. q or Esc to quit; any keypress exits
once the scan completes.
keyhog tui . # live dashboard on CWD
keyhog tui demo --throttle-ms 200 # paced scan for demo recordings
keyhog tui --feed-depth 500 . # keep more findings in the feed
keyhog tui --max-files 20 src/ # short fixed-duration loops
| Flag | Default | Effect |
|---|---|---|
| --max-files N | 0 | Stop after scanning N files. 0 = unlimited. |
| --feed-depth N | 200 | Rolling window of recent findings shown. |
| --throttle-ms MS | 0 | Sleep MS between files; demo / recording knob. |
Exit code matches keyhog scan: 0 clean, 1 findings present.
keyhog hook <install|uninstall>
Manages the git pre-commit hook. See Pre-commit hook for usage.
keyhog daemon <start|stop|status> (Unix only)
The daemon holds the compiled scanner in memory so pre-commit / IDE-save invocations skip the ~3 s cold start.
| Subcommand | Effect |
|---|---|
| daemon start | Bind the Unix socket, accept connections. |
| daemon stop | Tell the running daemon to shut down. |
| daemon status | Print uptime, scans served, active scans. |
Default socket path: $XDG_RUNTIME_DIR/keyhog.sock, or
~/.cache/keyhog/server.sock if XDG_RUNTIME_DIR is unset.
On Windows: every daemon subcommand prints “daemon mode is
unix-only” and exits non-zero. Daemon support via named pipes is
tracked but not yet implemented.
keyhog diff <FILE_A> <FILE_B>
Compare two scan outputs (JSON or NDJSON). Useful for “did this PR introduce a new finding?” gating in CI.
keyhog scan . --format json > baseline.json
git checkout pr-branch
keyhog scan . --format json > pr.json
keyhog diff baseline.json pr.json
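The comparison behind "did this PR introduce a new finding?" can be sketched as a set difference over finding identities. This is an illustration, not how keyhog diff is implemented: it assumes each scan file is a JSON array of findings carrying the detector_id and credential_hash fields used in the jq example under Suppressions, and that this pair identifies a finding.

```python
import json


def finding_key(finding: dict) -> tuple:
    """Assumed identity of a finding: which detector fired on which credential."""
    return (finding["detector_id"], finding["credential_hash"])


def new_findings(baseline: list, current: list) -> list:
    """Findings present in `current` whose identity is absent from `baseline`."""
    seen = {finding_key(f) for f in baseline}
    return [f for f in current if finding_key(f) not in seen]


def load_findings(path: str) -> list:
    """Read a scan written with --format json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


baseline = [{"detector_id": "aws-access-key", "credential_hash": "sha256:aaa"}]
pr = baseline + [{"detector_id": "stripe-secret-key", "credential_hash": "sha256:bbb"}]
print(new_findings(baseline, pr))
```

Keying on the credential hash rather than on file and line means a finding that merely moved within a file does not count as new.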
keyhog calibrate
Show or update the per-detector Bayesian (Beta-α/β) calibration counters. Used to teach the scorer that detector X has produced N true positives and M false positives in your environment so its confidence is adjusted on future scans.
keyhog calibrate --show # print current counters
keyhog calibrate --tp stripe-secret-key # record one TP
keyhog calibrate --fp generic-api-key # record one FP
keyhog calibrate --tp aws-access-key --show # record + print
Pass --cache <PATH> to point at a non-default counter file (the
default lives under $XDG_DATA_HOME/keyhog/).
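One common way to turn Beta counters into a score is the posterior mean. The source does not state how KeyHog maps the counters to a confidence adjustment, so the formula, the uniform Beta(1, 1) prior, and the function name below are assumptions for illustration only.

```python
def posterior_mean(true_positives: int, false_positives: int,
                   prior_alpha: float = 1.0, prior_beta: float = 1.0) -> float:
    """Mean of Beta(alpha, beta) after recording TP and FP observations."""
    alpha = prior_alpha + true_positives
    beta = prior_beta + false_positives
    return alpha / (alpha + beta)


# Each recorded false positive pulls the estimate down; each true positive pulls it up.
for tp, fp in [(0, 0), (0, 3), (5, 0), (5, 3)]:
    print(tp, fp, round(posterior_mean(tp, fp), 3))
```

Under this model a detector with no history sits at 0.5, and repeated --fp calls push it steadily toward zero.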
keyhog backend
Prints hardware probe results: which SIMD ISA was detected, whether the Hyperscan / CUDA / wgpu backends initialized, and the per-tier GPU thresholds in effect.
keyhog backend
keyhog scan-system
Recursive system-wide credential audit. Walks every mounted drive
(skipping pseudo-filesystems and, by default, network mounts),
discovers every .git repository on the way, and runs the same
scan + git-history pipeline that keyhog scan --git-history uses
on each. Honors a hard --space <N> ceiling on total bytes scanned
so it cannot accidentally exhaust a CI runner. Does NOT honor
.gitignore unless --respect-gitignore is passed (an attacker
stashing leaked keys would .gitignore them).
keyhog scan-system # local mounts, git history on
keyhog scan-system --include-network # also walk NFS/SMB/sshfs
keyhog scan-system --space 50G --no-git-history # cap + skip history walks
keyhog scan-system --lockdown # forbids --include-network
keyhog completion <bash|zsh|fish|powershell>
Emits a shell-completion script. Pipe into the shell’s completion location.
keyhog completion bash > /etc/bash_completion.d/keyhog
keyhog completion zsh > "${fpath[1]}/_keyhog"
keyhog completion fish > ~/.config/fish/completions/keyhog.fish
keyhog completion powershell >> $PROFILE
Global flags
These work on any subcommand:
| Flag | Effect |
|---|---|
| `--version` | Print version + build info, exit. |
| `--help` | Print help for the current subcommand. |
| `--verbose` | More log output to stderr. |
| `--no-color` | Disable ANSI colors. Auto-detects TTY otherwise. |
Exit codes
KeyHog uses exit codes to signal scan outcomes. Stable across versions; consumers (CI gates, pre-commit hooks, IDE plugins) can rely on them.
| Exit | Meaning |
|---|---|
| `0` | Scan completed, zero findings. |
| `1` | Findings present, NONE confirmed live (unverified, or verified-dead). |
| `2` | User error: unknown CLI flag, `.keyhog.toml` parse failure, bad `--baseline`. |
| `3` | System error: I/O failure, source-backend failure, or detector-corpus audit failure. |
| `4` | Health/self-test failure: `keyhog doctor` unhealthy, `keyhog repair` could not restore a working binary, `keyhog backend` self-test failed. |
| `10` | LIVE credentials confirmed (a `--verify` scan where the vendor API accepted a found secret) - the highest-severity gate. Also returned by `keyhog update --check` when a newer release exists. |
| `11` | Scanner thread panicked. The finding count is NOT trustworthy - investigate, don’t ship. Distinct from `2`/`3` so CI can tell a code bug from a config error. |
| `130` | Interrupted (SIGINT / Ctrl-C). |
0 (clean)
Use case: a CI step like keyhog scan . exits 0 when the working tree
is clean. The job stays green.
With --verify, the exit code escalates when a credential is confirmed
live: a found secret the vendor API accepts exits 10, while a found
secret that verifies dead (or wasn’t verified) exits 1. So gating ONLY
on live credentials needs no JSON parsing - branch on the exit code:
keyhog scan . --verify
case $? in
0) echo "clean" ;;
10) echo "LIVE credentials present - block + page" ; exit 1 ;;
1) echo "findings, none confirmed live" ;;
esac
1 (findings present)
The most common non-zero. CI fails, pre-commit hook blocks the commit,
PR check turns red. Findings get printed to stdout in whatever format
--format selected.
Exit 1 means findings exist but, under --verify, none were confirmed
live. A scan that confirms a live credential exits 10 instead (see
below) - so “findings but all dead” vs “some live” is just 1 vs 10,
no JSON parsing required.
2 (user error)
Things that exit 2:
- Unknown CLI flag.
- `.keyhog.toml` parse error.
- Detector load failure for a specific TOML (with a stderr warning; the rest of the scan continues but exits 2 at the end).
- `--baseline <FILE>` where FILE doesn’t exist or isn’t valid JSON.
- A source backend pointed at input it can’t handle (e.g. `--git-history` on a non-git dir).
- A network error during `--verify` is NOT a `2`. Each affected finding carries a `verification-error` marker, and the scan exits on its findings as usual: `1` when none are confirmed live.
Stderr carries the error message. Stdout may have partial output depending on where the error happened.
3 (system error)
A failure the operator can’t fix by correcting a flag: an I/O error, a
source backend that couldn’t read its input, or a detector-corpus audit
failure. Distinct from 2 (user error) so a pipeline can retry/route
differently. Stderr carries the cause.
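Because 3 is distinct from 2, a pipeline can retry only the failures that might be transient. The wrapper below is a sketch of that routing: `retry_on_system_error` and `RETRY_DELAY` are hypothetical names, and the fixed pause is a placeholder for whatever backoff your CI prefers.

```sh
# Hypothetical wrapper: re-runs a command only while it exits 3.
# Usage: retry_on_system_error <max_attempts> <command> [args...]
retry_on_system_error() {
  max="$1"; shift
  attempt=1
  while :; do
    rc=0
    "$@" || rc=$?
    # Every other code (0, 1, 2, 10, 11, ...) is passed through untouched.
    if [ "$rc" -ne 3 ]; then
      return "$rc"
    fi
    if [ "$attempt" -ge "$max" ]; then
      return 3
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"   # fixed pause; swap in backoff if needed
  done
}
```

A CI step would call it as `retry_on_system_error 3 keyhog scan .` and then branch on the returned code exactly as it would for a bare scan.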
4 (health / self-test failure)
Returned by the maintenance subcommands, not by scan: keyhog doctor
when the install fails its end-to-end self-test, keyhog repair when it
could not restore a working binary, and keyhog backend when its
self-test fails. A health monitor can treat 4 as “binary present but
not trustworthy.”
10 (live credentials, or update available)
The highest-severity scan outcome: a --verify scan where the vendor
API accepted a found secret - it is real and exfil-capable right now.
Gate hard on this:
keyhog scan . --verify || rc=$?
[ "${rc:-0}" = "10" ] && { echo "::error::live credential confirmed"; exit 1; }
keyhog update --check reuses 10 to mean “a newer release exists”
(exit 0 = already current), so a self-update cron can branch on it.
11 (scanner panic)
A panic inside a scanner thread (regex compile bug, OOM in a windowed chunk, etc.). The scan was incomplete; the count of findings emitted is NOT trustworthy. CI should treat this as “investigate” rather than “ship anyway because exit 11 != 1”.
The reason this is 11 rather than 2:
- A panic is a code bug worth surfacing distinctly.
- Some CIs (older Jenkins, certain shell wrappers) collapse `2` with “command not found” or other ambient errors. `11` is unambiguous.
- A future expansion of error categories (`12` = OOM-killed, `13` = timeout-exceeded, etc.) is possible without renumbering existing codes.
Composing in shell
set -e
keyhog scan . # exit 1 stops the shell here
Or to handle the non-zero explicitly:
keyhog scan . --verify || rc=$?
case "$rc" in
0|"") echo "clean" ;;
1) echo "findings (none live) -> opening PR comment" ;;
10) echo "LIVE credentials -> block + page on-call" ;;
2) echo "user error (bad flag/config) -> failing build" ;;
3) echo "system error -> retry / investigate" ;;
11) echo "scanner panic -> paging on-call" ;;
130) echo "interrupted" ;;
*) echo "unknown exit $rc" ;;
esac
What you can’t do
- No `--exit-zero` flag. KeyHog deliberately does not provide a way to lie to CI about findings. If you need to override (e.g. “this finding is accepted, ship anyway”), suppress it by hash in `.keyhog.toml` instead. The exit code then reflects truth: there are no UN-suppressed findings, so it’s `0`.
Environment variables
KeyHog reads a small set of environment variables. Each one is documented here with default, effect, and a typical use case.
Install / location
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_INSTALL` | `~/.local/bin` (sh) / `%LOCALAPPDATA%\keyhog\bin` (ps1) | Where `install.sh` / `install.ps1` drops the binary. |
| `KEYHOG_VERSION` | (latest release with assets) | Pin `install.sh` / `install.ps1` to a specific tag. `install.sh` now walks back through `/releases?per_page=10` to find the most recent release with binaries attached, surviving a one-off release-workflow failure without forcing an explicit pin. |
| `KEYHOG_VARIANT` | auto (`cuda` on hosts with the full CUDA toolkit, `cpu` otherwise) | Force the `cuda` or `cpu` variant of the Linux build during install. `cpu` is the WGPU + SIMD default which already dispatches on any compatible adapter via Vulkan; `cuda` adds the native-CUDA backend on hosts with libcuda + the matching toolkit. |
Cache
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_CACHE_DIR` | `~/.cache/keyhog` (Linux) / `~/Library/Caches/keyhog` (macOS) | Where the Hyperscan compiled database is cached across runs. Must be a user-owned dir; cold start (~3 s) becomes warm start (~150 ms) when the cache hits. |
Version output
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_VERSION_FULL` | (unset) | Set to `1` to make `keyhog --version` also print the full hardware probe (SIMD ISA, GPU adapter, CUDA / WGPU availability). Hidden by default because the probe initializes wgpu/Vulkan (~200 ms + a 134 MB MAP_SHARED segment), which makes `keyhog --version` 9× slower than `keyhog --help`. The same probe runs unconditionally for `keyhog backend`. |
Backend selection
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_BACKEND` | `auto` | One of `auto`, `cpu_fallback`, `simd_cpu`, `gpu`, `megascan`. Overrides hardware-probe selection. Mostly useful for benchmarking. |
| `KEYHOG_NO_GPU` | (unset) | If set to `1`, skip the GPU probe entirely. Useful for CI where the runner reports a software-rendered GPU and you’d rather force CPU. Mirrored by `CI=true` / `GITHUB_ACTIONS=true` auto-detection. |
| `KEYHOG_REQUIRE_GPU` | (unset) | If set to `1`, refuse to run when no usable GPU adapter is detected. Useful for self-hosted runners where a regression on GPU initialization should fail loudly, not silently fall back to CPU. |
| `KEYHOG_GPU_KERNEL` | `auto` | Override the GPU dispatch kernel pick. Mostly a development knob for benchmarking individual kernel implementations. |
Threading + chunking
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_THREADS` | physical-core count | Pin the rayon worker pool. Useful inside containers where `available_parallelism()` reports the wrong value. |
| `KEYHOG_PER_CHUNK_TIMEOUT_MS` | (unset) | Hard deadline per chunk scan in milliseconds. Recommended `30000` for production scans where bounded latency matters more than scan completeness. |
| `KEYHOG_DETECTORS` | (workspace default) | Override the auto-discovered detector directory path. |
| `KEYHOG_TRUSTED_BIN_DIR` | (unset) | Restrict which binary paths the daemon will execute when forking for sub-scans (defense-in-depth knob). |
Daemon (Unix only)
| Variable | Default | Effect |
|---|---|---|
| `XDG_RUNTIME_DIR` | (set by login session) | Daemon socket location: `$XDG_RUNTIME_DIR/keyhog.sock`. Fallback is `~/.cache/keyhog/server.sock`. |
| `KEYHOG_DOGFOOD` | (unset) | Enable dogfood telemetry capture in the daemon. Equivalent to passing `--dogfood` on every connecting client. |
Verification
| Variable | Default | Effect |
|---|---|---|
| `HTTPS_PROXY` | (unset) | Standard env var. Routes verifier traffic through a proxy. `keyhog scan --proxy <URL>` overrides. |
| `KEYHOG_PROXY` | `auto` | `off` disables proxy resolution entirely (useful for air-gapped builds where `HTTPS_PROXY` is set but no proxy is reachable). Also disables DNS pinning when `off`, so don’t set it to `off` casually. |
| `NO_PROXY` | (unset) | Standard env var. Hostnames to bypass the proxy on. |
Logging
| Variable | Default | Effect |
|---|---|---|
| `RUST_LOG` | `keyhog=warn` | Tracing filter. `keyhog=debug` for verbose detector / suppression telemetry. `keyhog::routing=trace` to see per-chunk backend selection. |
| `RUST_BACKTRACE` | (unset) | Standard. `1` for short backtrace on panic; `full` for full. |
Verification (extra)
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_INSECURE_TLS` | (unset) | If set, accept self-signed TLS certs on verifier traffic. Equivalent to `--insecure`. Use only in lab environments. |
| `KEYHOG_ALLOW_SCRIPT_VERIFY` | (unset) | Permit the `script:` verifier kind (which would otherwise be refused as a remote-execution risk). Opt-in for trusted detector corpora only. |
| `KEYHOG_LIVE_VERIFY` | (unset) | Internal: enables a special live-verify mode used by the end-to-end test harness. |
| `KEYHOG_LIVE_AWS_ACCESS_KEY_ID`, `KEYHOG_LIVE_AWS_SECRET_ACCESS_KEY`, `KEYHOG_LIVE_GITHUB_PAT` | (unset) | Test-only credentials the verifier integration tests probe against real upstream services. Never set these outside the maintainer test environment. |
Testing / development
| Variable | Default | Effect |
|---|---|---|
| `KEYHOG_ADVERSARIAL_STRICT` | (unset) | Tighten the adversarial-runner test gate. Used by CI’s strict-runners job. |
| `KEYHOG_ADVERSARIAL_FULL_LOG` | (unset) | Emit per-fixture log for every adversarial corpus row (slow; debugging only). |
| `KEYHOG_ENCODING_STRICT` | (unset) | Strict mode for the encoding-evasion runner. |
| `KEYHOG_PATH_SHAPE_STRICT` | (unset) | Strict mode for the path-shape runner. |
| `KEYHOG_ENTROPY_STRICT` | (unset) | Strict mode for the entropy-bypass runner. |
| `KEYHOG_UNICODE_STRICT` | (unset) | Strict mode for the unicode-homoglyph runner. |
| `KEYHOG_COMMENT_STRICT` | (unset) | Strict mode for the comment-evasion runner. |
| `KEYHOG_COMPOUND_STRICT` | (unset) | Strict mode for the compound-bypass runner. |
| `KEYHOG_LINE_LEN_STRICT` | (unset) | Strict mode for the line-length runner. |
| `KEYHOG_MULTI_STRICT` | (unset) | Strict mode for the multi-pattern runner. |
| `KEYHOG_NOISE_STRICT` | (unset) | Strict mode for the noise-injection runner. |
| `KEYHOG_CHUNK_IDS` | (unset) | Restrict the scan to a comma-separated list of chunk IDs. Used by adversarial bisection. |
What KeyHog deliberately does NOT read
- `KEYHOG_*` flags for changing detector behavior. Detector tuning is via `.keyhog.toml` only, so the same scan reproduces across developer machines without env-var contamination.
- Anything named `KEYHOG_API_KEY` / `KEYHOG_TOKEN`. The scanner never reports findings upstream; there’s no service to authenticate to.
- `KEYHOG_TELEMETRY_*`. There is no telemetry. Findings stay local.
Precedence
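When two sources disagree, the higher entry in this list wins:
1. CLI flag (`--proxy <URL>`)
2. `.keyhog.toml` in the repo root
3. Environment variable
4. Compiled default

So `keyhog scan --proxy http://a` beats `HTTPS_PROXY=http://b`, which beats the compiled default. A lower-precedence source wins only when nothing above it is set. `KEYHOG_PROXY=off` is a switch rather than a competing value: as the Verification table notes, it disables proxy resolution even when `HTTPS_PROXY` is set.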
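The "first source that is set wins" rule is easy to reproduce when a wrapper script needs to predict which value a scan will use. The sketch below is illustrative only: `first_set` is a hypothetical helper, the `.keyhog.toml` value is a stand-in (this page does not show a proxy key for that file), and `direct` is a placeholder for the compiled default.

```sh
# Hypothetical resolver: prints the first non-empty argument.
# Argument order mirrors the precedence list: CLI, .keyhog.toml, env, default.
first_set() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

cli_proxy=""                   # no --proxy flag given
toml_proxy=""                  # stand-in: nothing configured in .keyhog.toml
env_proxy="${HTTPS_PROXY:-}"   # whatever the environment provides
first_set "$cli_proxy" "$toml_proxy" "$env_proxy" "direct"
```

With `HTTPS_PROXY` unset this prints the placeholder default; set any higher source and that value is printed instead.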
Contributing
KeyHog is open source. The repo is at github.com/santhsecurity/keyhog. Bug reports, feature requests, detector additions, and PRs are all welcome.
Quick paths
| What | How |
|---|---|
| Report a bug | Open an issue with a minimal reproducer. |
| Report a security issue | Email security@santh.dev (PGP key in SECURITY.md). Don’t open a public issue. |
| Add a detector | Drop a TOML in detectors/, add a contract in crates/scanner/tests/contracts/. PR. |
| Fix an FP | Find the regex / shape gate that’s firing. Tighten it. Add a negative test that would catch the regression. |
| Document something undocumented | Edit docs/src/*.md. The site rebuilds on push to main. |
Repo layout
keyhog/
  crates/
    core/          # Detector spec, raw match types, severity, embed
    scanner/       # The scanner engine itself
    sources/       # Filesystem, git, web, docker, S3 backends
    verifier/      # Live credential verification
    cli/           # The `keyhog` binary, subcommand dispatch
  detectors/       # 891 service-specific detector TOMLs
  crates/cli/data/
    suppressions/  # Test-fixture suppression list, baked into the binary
  docs/            # This documentation (mdBook source)
  install.sh       # Linux/macOS install script
  install.ps1      # Windows install script
  vendor/vyre/     # GPU literal-set scanner (vendored, separate repo)
The Rust workspace is at the root; each `crates/` member is a
standalone crate with its own Cargo.toml.
Building
git clone https://github.com/santhsecurity/keyhog
cd keyhog
cargo build --release -p keyhog
./target/release/keyhog --version
For development:
cargo build # debug build, ~30 s
cargo test -p keyhog-scanner --lib
Adding a detector
The contract gate enforces that every shipped detector catches what it claims to catch. The flow:
1. Write the detector TOML at `detectors/<service>-<thing>.toml`. Use an existing detector as a template; the schema is documented in Detectors.
2. Write the contract at `crates/scanner/tests/contracts/<id>.toml`. At minimum, include:
   - 2 positives (env-var shape, quoted shape)
   - 2 negatives (placeholder, EXAMPLE token in the body)
   - 2 evasions (real-world shapes you’ve seen in actual leaks: Bearer header, JSON body, URL query param, multi-line config)
   - A `perf` block with `fixture_bytes` + `max_microseconds`
   - A `scale` block with `fixture_bytes` + `min_findings` + `max_seconds`
3. Run the contract gate locally with `cargo test -p keyhog-scanner --test contracts_runner`. It must pass before you push. CI re-runs it with strict env vars set, which exercise a more aggressive adversarial corpus.
4. Open a PR. A maintainer reviews the detector for:
   - Service is real and not duplicated by an existing detector.
   - Keywords are short, distinctive, and unlikely to FP.
   - Regex captures the right group and rejects obvious placeholders.
   - Verify endpoint (if present) is read-only and won’t trigger side-effects on the upstream service.
Adding a suppression filter
If you find an FP cluster of 5+ findings that all share a shape, the right fix is a new shape filter rather than 5 individual suppressions. The flow:
1. Reproduce. Get the FPs into a `.envseal`-sealed corpus or a public sanitized fixture you can commit.
2. Write the filter. Add to `crates/scanner/src/pipeline/postprocess/suppression.rs` alongside the existing `looks_like_*` functions. The function takes `&str` (the credential) or `Option<&str>` (the path) and returns `bool`.
3. Wire it up. Decide if it’s Tier A (universal) or Tier B (generic / entropy only). See `should_suppress_named_detector_finding` for the existing wiring. Tier A is rare; default to Tier B unless the shape is structurally impossible for any service-anchored credential.
4. Add a unit test. Inputs that should trip the filter (5+ variants), inputs that should not (3+ legitimate credentials).
5. Run the contract gate. New filters must not break any contract evasion. If they do, the contract is right and the filter is wrong. Tighten the filter.
Style
- Rust edition 2021, MSRV 1.89.
- `cargo +stable fmt` + `cargo +stable clippy -- -D warnings`. CI enforces both.
- File-size cap: 500 lines per `.rs` file. Larger files get split.
- No `#[ignore]` on tests. A flaky test gets fixed or deleted, not silenced.
- No `todo!()` / `unimplemented!()` / `panic!("not implemented")` in shipped code paths.
- Comments explain WHY, not WHAT. Names carry WHAT.
Tests
cargo test -p keyhog-core --lib # detector spec / embed
cargo test -p keyhog-scanner --lib # engine
cargo test -p keyhog --lib # CLI / orchestrator
cargo test -p keyhog --test e2e_binary # full-binary end-to-end
cargo test -p keyhog-scanner --test contracts_runner # per-detector contract gate
cargo test -p keyhog-scanner property::scanner_fuzz # proptest
The first four run in under 30 s. The contracts and property suites take 1-2 minutes. CI runs all of them; locally, the first four are the usual feedback loop.
License
MIT. By contributing, you agree that your contributions are licensed under the MIT license too.
Changelog
The authoritative changelog lives in the repo root as CHANGELOG.md.
Versions follow Semantic Versioning – patch
bumps for bug fixes, minor for new features, major for breaking
changes.
The full file is rendered below.
Changelog
All notable changes to KeyHog. Versions follow Semantic Versioning.
v0.5.37 - 2026-05-29 - Mirror benchmark: F1 0.7815 to 0.8896 (closes the gap to betterleaks 0.892)
Headline: precision 0.9716, recall 0.8203, F1 0.8896 against the SecretBench mirror corpus (15,000 fixtures). Net delta vs v0.5.35 is +0.108 F1. Precision now sits 5.9pp above the betterleaks 0.913 floor, and F1 lands about 0.002 below their 0.892. Precision was the headline lever for this release: 154 docs-example FPs killed, over-broad detector arms narrowed, decode-through composition tightened, and confidence floors now apply only when the value is not algorithmically a placeholder.
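The headline F1 can be re-derived from the precision and recall figures, which is a quick sanity check when comparing releases. A minimal sketch using the standard harmonic-mean definition (`f1_score` is just an illustrative helper name):

```sh
# F1 is the harmonic mean of precision and recall: 2PR / (P + R).
# Usage: f1_score <precision> <recall>
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.4f\n", 2 * p * r / (p + r) }'
}

f1_score 0.9716 0.8203   # prints 0.8896
```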
Detection truth (engine)
- entropy fallback: lift the blanket 32/40/64/128-char hex blacklist and the strict-mode >10-char hex drop ONLY when a credential keyword is on the same line (`apiKey: <hex>`, `TOKEN=<hex>`). Outside an anchor the blacklist holds, protecting sha256-hex / npm-lock-integrity / k8s-resource-uid negatives. Closes the generic-high-entropy-string R=0.38 hole.
- generic-secret regex: add `.` to the keyword-separator class so `api.key=` / `private.key=` / `client.secret=` in .properties, helm-values, terraform locals are recognised alongside `_` / `-`.
- decode-through: compose decoded-placeholder + uniform-base64-blob into every generic emit (decoded chunks no longer surface placeholders or known image-digest shapes).
- confidence: skip the `known_prefix_confidence_floor` boost when the value is itself a placeholder word (closes 154 docs-example FPs driven by service-prefix-only fixtures).
- decode_structure feature wired into the entropy-fallback emit path (the rebuilt 42-feature ML model now sees decode topology on the same code path the rule engine uses).
- ML confidence: 112 named detectors that silently fell below the 0.3 floor are now correctly surfaced.
- sources: UTF-16LE wide-string extractor lifts credentials from Windows .NET / PE binaries.
Detector regex narrowings
scaleway-api-key (drop the bare secret[_-]key arm), flickr +
iterable + consul (drop generic alternations, -256 FPs),
lambdatest + saltstack (drop generic alternations),
etherscan-api-key (drop the bare apikey=<32hex> arm that
claimed every random hex digest), aws-session-token / aws-ecr-token
/ anrok / applitools / appsmith / appwrite / avalara / avaya /
aweber / libsql (word-boundary prefix + quote-aware terminator).
ML pipeline
The training pipeline (ml/) was rebuilt in-tree alongside the Rust
serve path: ml/features.py mirrors ml_features.rs byte-for-byte,
ml/decode_structure.py mirrors decode_structure.rs, and
ml/parity_check.py is a Rust-to-Python parity harness using a new
compute_features_with_config test export. ml/train_classifier.py
produces an MoE classifier with fast-sigmoid activations serialized
into weights.bin (model version moe-v1-83688a6a6cb77f70).
Decode-structure becomes feature #42; Rust scorer bumped to 42
features end-to-end.
Build / packaging
- Lean CI build profile: `cargo build --no-default-features --features ci` produces a Hyperscan-free, GPU-free, verify-free, TUI-free binary with near-instant cold start.
- vendor: adopt vyre 0.6.1 (latest upstream) + migrate keyhog to wgpu 25.
- GHCR: publish image per release + maintain floating major tag.
Release / install
- self-update: verify the release binary minisign signature before the self-replace, and fail closed on missing signatures (was silent bypass).
- Action / docs: wire the documented `baseline` input into the scan, fix broken adoption recipes (install URL, docker image, exit codes), and fix Action version pins through v0.5.35.
Test infrastructure
- secretbench: base64-aware + escape-aware overlap promotes 92 mis-counted TPs that overlapped escaped or base64-decoded values.
- adversarial oracle: scan_text unescapes `\u{XXXX}` Rust unicode escapes so wrapper fixtures with escape syntax exercise the same byte stream the scanner sees in real files.
- gates: line / modularity cap demoted to advisory warn; stale filesystem_read gate dropped after the read.rs to read/ split.
v0.5.36 - skipped (folded into v0.5.37)
The 0.5.36 version was committed (chore(release): v0.5.36) but
never tagged or shipped; the work between 0.5.35 and 0.5.36 is
consolidated above into the 0.5.37 release notes.
v0.5.35 - 2026-05-28 - Adversarial wrapper harness: 216 to 152 wrapper-test misses (30% reduction)
Detector regex fixes
- deepnote-api-credentials pattern 2: matches multi-word suffix sequences (`DEEPNOTE_API_KEY=`, `DEEPNOTE_SECRET_TOKEN=`). The prior `[_\s]*(API|TOKEN|KEY)` could only span one of API / TOKEN / KEY, so the doubled-up env-var forms missed entirely. Group renumbered from 2 to 1.
- cloudsmith-api-key pattern 2: separator class now includes `=` and `:`. `CLOUDSMITH_API_KEY="value"` and `cloudsmith.api.key=value` failed under the prior `[\s"']+`-only separator.
- aws-lambda-function-url-secret pattern 2: path class includes `/`. Multi-segment paths like `/api/v1?token=...` now match.
- five9-api-credentials: regex rewritten. The prior `five9apikey=` literal missed every real env-var form. New pattern allows separators and covers api_key / client_secret / secret / token / key / password suffixes.
- fedex-api-credentials: SECRET-suffix pattern promoted from a companion (only fires if anchored by another primary pattern) to a primary pattern. `fedex.api.secret=...` on its own now surfaces.
Contract body-length fixes
Contracts whose positive credential bodies were 1-2 chars outside the detector regex’s length bounds (no detector changes):
- fedex pos#0, pos#1: 31 to 32 chars (regex needs `{32,64}`).
- finicity pos#1: 31 to 32 chars (regex needs `{32,40}`).
- footprint pos#0: 30 to 32 chars (regex needs exactly 32).
- mistral pos#1: 33 to 32 chars (Mistral spec is exactly 32).
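Off-by-one fixture bodies like these are cheap to catch before the contract gate runs. The check below is a hypothetical pre-flight helper, not part of the KeyHog test suite; it only compares a body's character count against the `{min,max}` quantifier the detector regex declares.

```sh
# Hypothetical pre-flight check for a contract fixture body.
# Usage: body_len_ok <body> <min> <max>
body_len_ok() {
  len=${#1}
  [ "$len" -ge "$2" ] && [ "$len" -le "$3" ]
}

body="0123456789abcdef0123456789abcde"   # 31 chars, one short of a {32,64} floor
if body_len_ok "$body" 32 64; then
  echo "ok"
else
  echo "body is ${#body} chars, outside 32..64"
fi
```

For an exact-length regex, pass the same number as both bounds.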
Diagnostic
KEYHOG_ADVERSARIAL_FULL_LOG=<path> writes the full wrapper-harness
failure list at panic time, so a 100+ detector regression can be
diffed end-to-end without re-running the test. The first 50 entries
still appear inline in the panic message.
Known remaining 152 misses (v0.5.36 target)
- Group B (~144 misses): helicone, keystonejs, line, paloalto, snowflake, sourcetree, tower, deepnote pos#0. Canonical positives surface (`contracts_runner` green) but wrapped variants do not. Root cause sits between the scanner’s cheap-filter window and the extract phase: the AC literal-set returns a keyword position the regex engine cannot consume the preceding byte from. Tracing continues in v0.5.36.
- Group A.3 (~24 misses): bandwidth pos#1 and vertexai pos#0, pos#1 have positive text that is not actually a credential (`ClientID=...` with no Bandwidth keyword; bare env-var name `GOOGLE_APPLICATION_CREDENTIALS` instead of the service-account JSON). Both need contract redesign.
v0.5.34 - 2026-05-27 - Multi-TB perf: adaptive GPU dispatch + shard batching, monolith splits, more silent fallbacks surfaced
Multi-TB scanning: RAM-adaptive GPU shard batching
gpu_literal_phase1 slices each coalesced batch into ~2-MiB wgpu
shards (the WebGPU 65 535-workgroups-per-dimension cap), then
batches MAX_SHARDS_PER_GPU_BATCH of them into a single command
encoder. The cap was a fixed 64; it now adapts to host RAM:
| Host RAM | Shards / batch | 1-GiB-scan sequential batches |
|---|---|---|
| < 16 GiB | 64 | >= 8 |
| 16 to < 32 GiB | 128 | 4 |
| >= 32 GiB | 256 | 2 |
The 96-GiB-RAM RTX-5090 workstation case drops from 8 sequential batched dispatches to 2 on a 1-GiB scan, cutting GPU pipeline-drain stalls roughly 4x. The 64-shard floor stays the safe default for small hosts where 256 shards x ~2 MiB host-side packing memory would press against the orchestrator’s RAM budget.
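The batch counts in the table follow from simple ceiling division, which the sketch below reproduces. It assumes exactly 2 MiB per shard (the text says "~2-MiB", so real counts may differ slightly), and both function names are illustrative rather than taken from the codebase.

```sh
# Shards per command encoder for a given host RAM, per the table above.
shards_per_batch() {   # arg: host RAM in GiB
  if [ "$1" -ge 32 ]; then echo 256
  elif [ "$1" -ge 16 ]; then echo 128
  else echo 64
  fi
}

# Sequential batches for one scan. ASSUMES exactly 2 MiB per shard.
sequential_batches() { # args: scan size in MiB, host RAM in GiB
  shards=$(( ($1 + 1) / 2 ))                         # ceil(size / 2 MiB)
  per_batch=$(shards_per_batch "$2")
  echo $(( (shards + per_batch - 1) / per_batch ))   # ceil(shards / per_batch)
}

sequential_batches 1024 8    # 8
sequential_batches 1024 24   # 4
sequential_batches 1024 96   # 2
```

A 1-GiB scan is 512 shards, so the 96-GiB host needs 512 / 256 = 2 batches against 8 for a host under 16 GiB.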
Multi-TB scanning: VRAM-adaptive GPU dispatch
MEGASCAN_INPUT_LEN was a fixed 256 MiB constant; the new
megascan_input_len() sizes the pre-compiled RulePipeline input cap
to host VRAM:
| VRAM detected | Input length | Adapter examples |
|---|---|---|
| >= 24 GiB | 1 GiB | RTX 4090 / 5090, A100 / H100 |
| 12 - 23 GiB | 512 MiB | RTX 3090, RTX 4080, M-Max |
| 8 - 11 GiB | 256 MiB | RTX 3080, RTX 4070, M-Pro |
| < 8 GiB / Unknown | 128 MiB | iGPU, software, no-GPU CI runner |
On a 5090 host that means 4x larger GPU dispatches and roughly 75%
fewer per-dispatch launches across a multi-TB scan. The orchestrator’s
BATCH_BYTES_BUDGET tracks the same value with a RAM / 8 safety
clamp so peak resident memory (pipeline_depth x batch_bytes_budget)
never crosses 1/8 of system RAM regardless of detected VRAM. The legacy
MEGASCAN_INPUT_LEN = 256 MiB constant is preserved as a backwards-
compatible alias.
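The tier table and the RAM clamp can be sketched together. This is an illustration, not the Rust implementation: the function names are made up, and the clamp is interpreted as `batch budget <= RAM / 8 / pipeline_depth`, which is one reading of "peak resident memory never crosses 1/8 of system RAM".

```sh
# VRAM tier to input length in MiB, per the table above.
megascan_input_mib() {  # arg: detected VRAM in GiB; empty or 0 when unknown
  vram="${1:-0}"
  if [ "$vram" -ge 24 ]; then echo 1024
  elif [ "$vram" -ge 12 ]; then echo 512
  elif [ "$vram" -ge 8 ]; then echo 256
  else echo 128
  fi
}

# ASSUMED clamp: pipeline_depth * budget must stay within RAM / 8.
batch_budget_mib() {    # args: VRAM GiB, RAM GiB, pipeline depth
  want=$(megascan_input_mib "$1")
  cap=$(( $2 * 1024 / 8 / $3 ))
  if [ "$want" -gt "$cap" ]; then echo "$cap"; else echo "$want"; fi
}

batch_budget_mib 24 96 2   # 1024: VRAM tier wins, RAM is plentiful
batch_budget_mib 24 4 2    # 256: clamped by the 4-GiB host
```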
No more silent fallbacks (continued)
- S3 source: text-content-type objects that fail UTF-8 decode now log a `warn` with the valid-up-to byte offset; previously `return Ok(None)` silently dropped the chunk.
- Git history walk: tree-entry, blob-header, blob-read failures log at `debug` instead of silently `continue;`. UTF-8 decode failures on git blobs stay silent (legitimate binary blob).
- GPU MoE confidence: staging-buffer `recv` and `map_async` errors now `warn` before falling back to CPU MoE; previously the double `.ok()?.ok()?` swallowed both failures silently.
Internal refactors (no user-visible change)
- `crates/scanner/src/pipeline/postprocess/suppression.rs` (1368 lines) split into 7 focused submodules (`api`, `decision`, `decode`, `doc_markers`, `path_filter`, `shape`, `mod`). All under the 500-line cap.
- `crates/sources/src/filesystem/read.rs` (1054 lines) split into 6 focused submodules (`raw`, `bytes`, `window`, `decode`, `tests`, `mod`). All under the cap.
- `crates/scanner/src/hw_probe.rs` (978 lines) split into 7 focused submodules (`thresholds`, `tier`, `select`, `banner`, `platform`, `tests`, `mod`). All under the cap.
- `alphabet_filter.rs` SIMD entry points now carry proper `# Safety` docs (caller-must-have-AVX2 / SSE2 / NEON), satisfying `-D clippy::missing_safety_doc` after they were promoted to `pub` for the prefilter-robustness proptest.
New keyhog tui subcommand
Interactive ratatui + crossterm dashboard. Severity-colored finding feed,
current-file banner, files-done / bytes / throughput / findings stats,
GPU backend + pattern-count panel. q / Esc / Ctrl-C / any-key-after-
complete all exit cleanly. New --throttle-ms flag paces the worker so
demo recordings actually capture findings streaming in. Gated behind a
default-on tui feature so portable builds (no-default-features +
portable) skip the ratatui + crossterm dependency closure.
keyhog tui is the surface the README / docs demo now records (vhs);
the demo target moved from keyhog explain to keyhog tui demo.
Critical bugfix: orchestrator self-scan suppression no longer hides user findings
The orchestrator post-scan filter dropped every finding whose path
segment was literally “keyhog” (case-insensitive), plus a flat
tests/ / fixtures/ / benches/ / detectors/ segment match.
That was originally a self-scan helper for keyhog developers, but
applied unconditionally it hid findings from anyone with:
- A repo or folder named `keyhog/` (forks, vendored copies, this-demo-recording-tree, Reddit posters’ demo dirs).
- A `tests/` directory in their tree, regardless of what was being scanned.
The fix is two-step: drop the “keyhog” segment match outright, and
gate the remaining tests/ / fixtures/ / benches/ / detectors/
match on a marker check that the file path is a descendant of
keyhog’s own source repo root (detected once per process via a root
Cargo.toml scan for crates/scanner + crates/cli + the keyhog
package name). --no-suppress-test-fixtures now also disables the
segment filter so audits see both suppression layers’ contents.
Hardening: more silent GPU fallbacks now emit one-shot warnings
- MegaScan rule-pipeline compile reject (was `tracing::debug!`).
- MegaScan runtime dispatch error.
- MegaScan match-count exceeding cap.
- MegaScan batch exceeding `MEGASCAN_INPUT_LEN`.
- No GPU backend handle on MegaScan dispatch.
- `warm_backend` MegaScan path: now checks rule_pipeline readiness (was only checking `gpu_stack_usable`).
- Trigger-pattern GPU collection error / missing matcher / missing backend.
- `verifier`: OOB-required spec without an active OOB session (was a silent degrade to HTTP-only).
- `sources/git`: HEAD blob walk failure (silently downgraded every finding’s severity to `git/history`).
- `subcommands/tui::worker`: file-read failure (was `unwrap_or_default()`; now logs at debug and skips with accurate files-done counter).
All GPU degrade paths respect KEYHOG_REQUIRE_GPU=1 (hard-fail) and
KEYHOG_NO_GPU=1 (silence the warning).
Performance: hot-path env-var caches
KEYHOG_BACKEND (in select_backend), KEYHOG_GPU_KERNEL (in the
literal-set path), and KEYHOG_NO_GPU / KEYHOG_REQUIRE_GPU (in
the GPU degrade helpers) are now cached at process start instead of
re-syscalling per chunk. Measured ~3% scan-throughput win on Apple
Silicon against the 30k-file linux-clone corpus.
Dedup: shared modules consolidate cross-file copies
- New `engine::gpu_postprocess` with `fold_overlapping_same_pid_inplace` + `attribute_matches_to_chunks` (5 unit tests). Replaces two byte-identical phase-1 tails in `gpu_ac_phase1` + `gpu_literal_phase1`.
- New `cli::format` with `format_bytes` (4 unit tests). Replaces two near-identical copies in `scan_system` + `tui::render` that had drifted (one capped at GiB, the other handled TiB).
- Engine `scan.rs` split into `scan` / `extract` / `process` modules (was 835 LOC; now 291 / 393 / 191, all under the 500-line cap).
- TUI subcommand split into `tui/{mod, render, worker}.rs` (was 644 LOC; now 236 / 318 / 123).
- Orchestrator `explicit_backend_override` collapsed into a thin re-export of `scanner::hw_probe::forced_backend_from_env` so the alias table (`gpu` / `literal-set` / `mega-scan` / `regex-nfa` / etc.) lives in one place.
Smaller fixes
- `PatternSpec::default()` + `Chunk::from(String|&str)` so the test suite compiles without 35 per-site explicit field fills.
- `engine::coalesce_chunks` re-exported as a `pub` API so the scanner property-test fixtures build.
- Stale unused-imports cleanup in `scan.rs` after the module split.
v0.5.33 - 2026-05-27 - WGPU AC kernel actually works (use_subgroup_coalesce=false everywhere)
Critical: WGPU hosts now actually run scans on the GPU
The v0.5.32 workaround moved every GPU backend onto the AC kernel
path, but the AC kernel still passed use_subgroup_coalesce=true
on WGPU (the original gate was backend_id != "cuda"). Runtime
testing on Apple Silicon M4 Pro with vyre v0.4.2 confirmed the AC
kernel hits the SAME “_vyre_match_leader is referenced before binding” lowering rejection on the wgpu path as the literal_set
program does on the CUDA path: the lowering gap is in vyre’s
substrate-neutral pre-emit step, not in the driver-specific
emitter, so wgpu has the same blocker.
use_subgroup_coalesce is now hardcoded false on every backend.
We lose the ~32x atomic-contention reduction the subgroup form
would have provided (Innovation I.17), but recall and correctness
are preserved; the plain append_match path produces bit-identical
match output, just with more atomic pressure on the shared count
buffer.
This fixes silent CPU fallback on every WGPU host: macOS Apple
Silicon, macOS Intel, Windows, and Linux without CUDA. Before this
release, those hosts probed a GPU at startup, compiled the
GpuLiteralSet + AC matchers, then EVERY scan failed at GPU dispatch
and silently degraded to SIMD. The v0.5.31 visibility warning
caught this on the macbook self-test and the actual scan path; the
fix here closes the underlying bug. Verified end-to-end on Apple
Silicon M4 Pro: vyre_ac_kernel PASS (backend=wgpu).
v0.5.32 - 2026-05-27 - vyre depth: AC kernel becomes the default GPU scan path + honest GPU self-test
Deep vyre: AC kernel becomes the default GPU scan path
- gpu_literal_phase1.rs previously routed all WGPU hosts through the literal_set GpuLiteralSet program, gating the AC-kernel workaround to CUDA only. The vyre canonical pre-emit lowering actually rejects the subgroup form (subgroup_ballot + subgroup_shuffle) emitted by append_match_subgroup BEFORE driver-specific emission, so WGPU hosts hit the same “_vyre_match_leader is referenced before binding” rejection and silently dropped to CPU. The kernel select is now AC-by-default for every GPU backend; KEYHOG_GPU_KERNEL=literal-set is the diagnostic opt-in for bisection / vyre IR work.
- keyhog backend --self-test gained a new vyre_ac_kernel step that compiles a one-detector scanner, runs a scan through scan_coalesced_gpu_ac_phase1, and verifies the planted "needle" literal surfaces a phase-1 hit on the live GPU backend. Reports the active backend id (cuda / wgpu) on PASS.
- The existing vyre_literal_set self-test no longer reports red FAIL when it hits the documented lowering gap; it surfaces yellow KNOWN with a one-line explanation that scans use the AC kernel instead. Same exit code as before for any OTHER literal_set failure (genuine GPU regression still hard-fails).
- crates/scanner/src/gpu.rs gained vyre_ac_kernel_self_test() and VyreAcKernelSelfTest so the diagnostic CLI can surface the match count and backend id rather than just PASS/FAIL.
v0.5.31 - 2026-05-27 - no-silent-GPU-fallback enforcement + banner CUDA/WGPU split + SHA256 verification + UX fixes
Coherence: startup banner now distinguishes CUDA vs WGPU
- The ⚡ KeyHog ... | backend=Gpu startup banner used to collapse the CUDA path and the WGPU fallback under the same Gpu label, so a user on an NVIDIA box couldn’t tell whether the CUDA-feature build was actually using CUDA or had silently dropped to WGPU. Banner now reads ... | backend=Gpu | gpu=cuda (or gpu=wgpu, gpu=none), pulling the live VyreBackend::id() of the acquired backend. New CompiledScanner::gpu_backend_label() exposes the same info to any downstream consumer (daemon health endpoint, keyhog backend diagnostics, future GH-Action telemetry).
No silent GPU fallbacks
- scanner/src/gpu.rs (MoE inference path): when the GPU MoE context fails to initialise on a host that has a GPU, we now eprintln! a loud warning instead of tracing::debug!-ing into the void. The user paid for the GPU; they need to know we couldn’t use it. KEYHOG_NO_GPU=1 silences the warning (operator opted in to CPU). KEYHOG_REQUIRE_GPU=1 exits with code 2 instead of falling back.
- scanner/src/engine/backend.rs (scan dispatch path): when scan_chunks_with_backend_internal is called with ScanBackend::Gpu or ScanBackend::MegaScan but the compiled scanner has no GPU literals or no GPU backend, the same loud one-shot warning fires via warn_on_gpu_degradation and the same env-var contract applies. The hot-path branch was previously silent; on every scan a user with a probe-detected-but-runtime-unavailable GPU would have sat at SIMD throughput thinking they were on the GPU path.
- A OnceLock guard makes the warning fire exactly once per process regardless of how many chunks pass through (CI scanning thousands of files doesn’t spam stderr).
- scanner/src/engine/compile.rs (CUDA acquisition path): when the CUDA factory fails on a host that has libcuda.so or /proc/driver/nvidia (NVIDIA userland present but broken or version-mismatched), we eprintln a one-shot warning instead of debug-logging into the void. The wgpu fallback is the documented “5-10x slower” path; users installing the CUDA variant on NVIDIA hardware must know when they’ve silently dropped to WGPU.
- scanner/src/engine/gpu_forced.rs (runtime GPU dispatch failure): deny_silent_gpu_degrade previously only panicked when KEYHOG_BACKEND forced GPU. The unforced default case was silent. Now a runtime degradation (vyre IR lowering rejecting a program, transient CUDA driver error, exceeded shard cap) fires a one-shot stderr warning. Surfaced by running keyhog backend --self-test on a real CUDA host, which exposed a vyre IR lowering issue that rejects the GpuLiteralSet program (“variable _vyre_match_leader is referenced before binding”). The AC kernel path used by the actual scan flow on CUDA hosts is a documented workaround for the same vyre limitation; WGPU-only hosts hitting the lowering rejection would previously have degraded silently.
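The env-var contract and the one-shot guard described above can be sketched as follows. This is an illustration, not the project's code: `DegradeAction`, `degrade_action`, and `on_gpu_degrade` are hypothetical names, the message wording is invented, and letting KEYHOG_REQUIRE_GPU win when both variables are set is an assumption.

```rust
use std::sync::OnceLock;

#[derive(Debug, PartialEq, Eq)]
enum DegradeAction {
    /// KEYHOG_REQUIRE_GPU=1: refuse to fall back.
    HardFail,
    /// KEYHOG_NO_GPU=1: the operator chose CPU, so stay quiet.
    Silent,
    /// Default: fall back, but say so once.
    WarnOnce,
}

/// Pure policy decision. ASSUMPTION: require_gpu takes precedence over no_gpu.
fn degrade_action(require_gpu: bool, no_gpu: bool) -> DegradeAction {
    if require_gpu {
        DegradeAction::HardFail
    } else if no_gpu {
        DegradeAction::Silent
    } else {
        DegradeAction::WarnOnce
    }
}

/// Applies the policy; the OnceLock makes the warning fire once per process
/// no matter how many chunks hit the degraded path.
fn on_gpu_degrade(reason: &str, require_gpu: bool, no_gpu: bool) {
    static WARNED: OnceLock<()> = OnceLock::new();
    match degrade_action(require_gpu, no_gpu) {
        DegradeAction::HardFail => {
            eprintln!("keyhog: GPU required but unavailable: {reason}");
            std::process::exit(2); // exit code 2, as stated in the entry above
        }
        DegradeAction::Silent => {}
        DegradeAction::WarnOnce => {
            WARNED.get_or_init(|| {
                eprintln!("keyhog: GPU unavailable, falling back to CPU: {reason}");
            });
        }
    }
}
```

Calling `on_gpu_degrade` from every degrade site keeps the three outcomes (exit, silence, single warning) in one place.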
SHA256 checksum verification (rustup-style)
- release.yml emits a .sha256 file alongside each binary asset using portable sha256sum / shasum across the three runner OSes.
- install.sh and install.ps1 download the .sha256 alongside the binary, compute the local hash, and refuse to install on mismatch. When the checksum file is absent (pre-v0.5.31 release tags), both installers skip verification with a dim log line rather than failing, so the change is backward-compatible.
UX
- install.sh on Linux + NVIDIA hosts no longer prints “Detected NVIDIA NVIDIA GeForce RTX 5090” (the double “NVIDIA” came from concatenating our own prefix with nvidia-smi --query-gpu=name output, which already prefixes “NVIDIA”).
- crates/core/src/report/text.rs:273: the “No real secrets - but N example/test keys suppressed.” reporter line used a literal em dash. Replaced with a comma so the user-facing output matches the no-em-dash global rule.
- crates/core/src/report/text.rs:238: ClientSafe severity remediation text “Public by design (client bundle key) - verify scope restrictions.” had the same em dash; replaced with a semicolon.
v0.5.30 - 2026-05-27 - premium interactive installer + CUDA-on-Linux release variant + star tracker
New: premium interactive installer
- install.sh + install.ps1 rewritten. The Linux / macOS installer now detects host state (OS, arch, NVIDIA GPU, loadable libcuda.so, existing keyhog install, PATH config), summarizes what it would do, and (when stdin is a TTY) prompts for the variant + optional post-install steps. Curl-pipe-sh keeps working: a non-TTY stdin drops to auto-detect mode and prints a tip for the interactive path.
- New modes: --diagnose prints a full host + binary status report and changes nothing. --repair re-downloads the right variant for the current host even when the existing binary still runs (useful after CUDA userland is installed and the WGPU build should be swapped for the CUDA build). --uninstall removes the binary but deliberately leaves shell-rc PATH entries and completions in place so the installer doesn’t silently edit user-owned files.
- Post-install wizard (when interactive): opt-in prompts for adding the install dir to your shell PATH (with explicit append to .bashrc / .zshrc / config.fish), installing shell completions, wiring keyhog as a Claude Code pre-tool hook, and wiring keyhog as a git pre-commit hook in the current directory. Defaults are conservative; nothing happens without an explicit “y”.
- Overrides: KEYHOG_VARIANT=cuda / =cpu force a variant. --yes / -y accepts every default for non-interactive runs. --no-color disables ANSI output for log capture. KEYHOG_VERSION and KEYHOG_INSTALL env-vars work as before.
New: CUDA-on-Linux release variant
- keyhog-linux-x86_64-cuda ships as a 5th release asset. Built with --features cuda after provisioning CUDA 12.6 toolkit on the GH ubuntu runner via Jimver/cuda-toolkit@v0.2.19. The installer prefers this asset on Linux hosts where nvidia-smi reports a GPU AND libcuda.so is loadable (via ldconfig or the four common path probes). On the same host with no CUDA, the installer keeps picking the existing default keyhog-linux-x86_64 build (WGPU + SIMD). Apple Silicon, Intel Mac, and Windows hosts keep their existing assets; Apple Silicon hosts get an explicit “Metal GPU acceleration coming soon” preface so users understand the WGPU + SIMD tradeoff up front.
- install.sh falls back gracefully when the -cuda asset is not yet published for the resolved tag: it tries the CUDA asset, on 404 it logs the fallback and downloads the base asset instead. This means the script stays backward-compatible with older release tags that never shipped a CUDA asset.
Tests
- tests/install/scenarios.sh is a 12-scenario harness that mocks uname / nvidia-smi / ldconfig / curl per scenario via a sandbox dir prepended to PATH. Covers: CUDA host, macOS arm64, macOS x86_64, KEYHOG_VARIANT=cuda / =cpu overrides, unsupported platform, --help / --uninstall mode dispatch. The two scenarios that require simulating “NVIDIA but no libcuda” or “no GPU at all” skip on a real CUDA host (the script’s path-fallback probes leak through the sandbox) and run for real on no-CUDA CI runners.
- End-to-end smoke test on real Apple Silicon hardware: the install path was verified over SSH against an M-series macbook, upgrading v0.5.28 to v0.5.29 cleanly and reporting the Metal-coming-soon note.
- --repair and --diagnose were exercised on the upgraded macbook to confirm post-install behavior.
Metrics / repo hygiene
- Daily star tracker. metrics/stars.json records {date, count} snapshots; .github/workflows/record-stars.yml runs at 07:17 UTC, calls the GitHub API for the current count, dedupes per date, and commits if changed. README gains a live stars badge linking to star-history.com. wafrift gets the same tracker (see santhsecurity/wafrift).
- README backend table accuracy. Removed the stale “cudagrep NVMe -> VRAM DMA” claim. The actual code routes the GPU path through vyre (WGPU cross-platform, optional CUDA feature) with no cudagrep or warpstate references anywhere in the tree.
v0.5.29 - 2026-05-27 - HAR (HTTP Archive) auto-expansion + http/wire docs + Bazel scaffolding untracked
New: HAR auto-expansion
- keyhog scan capture.har now parses the HAR 1.2 JSON and expands it into one chunk per request and one chunk per response. Each chunk’s source_type is wire:har:request or wire:har:response, so a bug-bounty hunter can filter findings to outbound credentials only:

    keyhog scan capture.har --format json | \
      jq '.[] | select(.location.source == "wire:har:request")'

  The file_path for each finding is <har-path>#<request-url>. New crates/sources/src/har.rs module; 4 unit tests covering positive expansion, non-HAR JSON, non-JSON binary, and malformed-JSON fallthrough. 4x max_size budget on cumulative request+response body bytes guards against decompressed-gigabyte DoS.
- serde + serde_json promoted from optional (per-feature) to unconditional deps in keyhog-sources because the always-on filesystem path now depends on them. Removed redundant dep:serde / dep:serde_json from web / github / slack / s3 feature lists.
Docs
- New chapter: HTTP and wire scanning. Documents the existing --url flag (Web Source: JS / sourcemap / WASM routing + SSRF defenses), proxy + TLS policy (--proxy, KEYHOG_PROXY, KEYHOG_INSECURE_TLS), the stdin curl-pipe workflow, and the new HAR auto-expansion. Roadmap section calls out mitmproxy .mitm support, header/body provenance, live proxy mode, and WebSocket frame scanning as the next wire-scanning items.
- docs/src/detectors.md documents the client-safe severity tier + client_safe = true per-pattern flag.
- docs/src/reference/cli.md documents --hide-client-safe + the KEYHOG_NO_GPU / KEYHOG_PER_CHUNK_TIMEOUT_MS / KEYHOG_BACKEND / KEYHOG_THREADS / KEYHOG_DETECTORS / KEYHOG_CACHE_DIR env vars in one place.
Repo hygiene
- Bazel scaffolding untracked. The 8 in-tree Bazel files (.bazelrc, .bazelversion, root + 5 per-crate BUILD.bazel, MODULE.bazel, MODULE.bazel.lock) were a 2026-05-21-throttle-driven PoC that never finished - every per-crate BUILD was a comment-only stub and MODULE.bazel was pinned to keyhog 0.5.7 while we ship 0.5.29 via cargo. Per the STANDARD prod-repo-doc-bleed rule, advertising a Bazel surface that doesn’t build anything is a stub-not-evasion lie. Files stay on disk for the day Bazel becomes load-bearing; .gitignore catches future Bazel scratch.
Detector tagging (client-safe)
- clerk-api-key: publishable pk_live_* / pk_test_* - same shape as clerk-frontend-api-key from v0.5.28. Total client-safe-tagged patterns now: 9 across 8 detectors.
v0.5.28 - 2026-05-27 - KEYHOG_NO_GPU short-circuit + bare - stdin + more client-safe tags
Cross-platform / safety nets
- KEYHOG_NO_GPU=1 now ACTUALLY bypasses the GPU stack. The v0.5.27 commit only short-circuited the compile-time CUDA/wgpu factory call. The MoE GPU context init runs lazily on the FIRST backend::get_gpu() call, and the hardware probe path (hw_probe.rs:82 -> gpu_probe -> backend::get_gpu) reaches it before compile() even runs. On hosts where Metal adapter request blocks for minutes (Apple M4 Pro / macOS 26.3 reproduction) the env var fired AFTER the user had already paid the stall. gpu_probe() now checks the env var BEFORE calling get_gpu(); on set, returns (false, None, None) so hw_probe reports gpu_available: false, MoE init never runs, and the scanner starts in ~10 ms.
CLI UX
- keyhog scan - (bare dash positional) now reads from stdin. Grep / wc / curl convention. Previously errored with error: path '-' does not exist. keyhog scan - --stdin <<<... and keyhog scan - <<<... both work now; --stdin is no longer required when the path is -.
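A sketch of the bare-dash convention, under the assumption that input resolution happens in one small function. `ScanInput` and `resolve_input` are hypothetical names, and treating `--stdin` combined with a real path as "read stdin" is an assumption this changelog does not confirm.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq, Eq)]
enum ScanInput {
    Stdin,
    Path(PathBuf),
}

/// Maps the positional argument and the --stdin flag to an input source.
/// A bare "-" selects stdin whether or not --stdin was also passed.
fn resolve_input(positional: &str, stdin_flag: bool) -> ScanInput {
    if positional == "-" || stdin_flag {
        ScanInput::Stdin
    } else {
        ScanInput::Path(PathBuf::from(positional))
    }
}
```

Resolving the dash before any filesystem check is what avoids the old "path '-' does not exist" error.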
Detector tagging (client-safe)
- segment-write-key: write-only keys shipped in every analytics.js / Analytics SDK init. Server-side admin is segment-sources-api-token (stays high).
- clerk-frontend-api-key: pk_live_* / pk_test_* shipped alongside <ClerkProvider> in Next.js / browser bundles. Clerk secret key is a separate detector.
Total client-safe-tagged detectors now: 7 (Sentry DSN both patterns, Mapbox pk., PostHog phc_, Mixpanel project token, Algolia search-only both patterns, Segment write key, Clerk frontend pk_*).
v0.5.27 - 2026-05-27 - client-safe severity tier + --hide-client-safe (bug-bounty workflow)
Feature
- Severity::ClientSafe is a new tier below Low. Detectors with a per-pattern client_safe = true flag in their TOML force the finding to this tier regardless of the detector’s nominal severity. Tagged patterns ship 5 detectors / 6 patterns in this release: Sentry DSN (both patterns), Mapbox pk.eyJ (sk.eyJ stays critical), PostHog phc_ (phx_ stays high), Mixpanel project token, Algolia search-only key (admin key is a separate detector and stays critical).
- --hide-client-safe CLI flag filters every ClientSafe finding before the reporter sees them. Bug-bounty / exfiltration-impact workflow: keyhog scan --hide-client-safe target/ shows only credentials that grant server-side access. Default scans keep the tier visible (CLIENT-SAFE stripe in text output) so a misconfigured publishable key wired into a server-only detector still surfaces.
- KEYHOG_NO_GPU=1 env-var bypasses the CUDA / wgpu init path entirely and routes every chunk through the SIMD/CPU regex backend. Workaround for the Mac arm64 Metal stall surfaced during v0.5.26 dogfood when scanning identifier-dense source. Set in CI or in the user’s shell rc when GPU latency matters less than predictable scan times.
- KEYHOG_PER_CHUNK_TIMEOUT_MS env-var attaches an Instant deadline to the public scan / scan_with_backend entry points. Any future pathological pattern that escapes the per-pattern MAX_INNER_LOOP_ITERS cap times out at the per-chunk boundary instead of hanging the whole scan. Default unset preserves prior behavior.
Schema
- [[detector.patterns]] blocks accept a new client_safe: bool field (default false). Additive; existing detector TOMLs continue to parse unchanged. Per-pattern (not per-detector) so detectors that fire on both the public AND the secret prefix can tag only the public one.
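How the per-pattern flag and the CLI filter interact can be sketched like this. The types are simplified stand-ins: the variant list of `Severity`, the `Pattern` struct, and the helper names `effective_severity` and `visible` are illustrative assumptions, not the real definitions.

```rust
// Illustrative subset of severities; the real enum may carry more tiers.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    ClientSafe,
    Low,
    Medium,
    High,
    Critical,
}

/// Stand-in for one [[detector.patterns]] entry; only the new flag is modelled.
struct Pattern {
    client_safe: bool,
}

/// A tagged pattern forces ClientSafe regardless of the detector's nominal tier.
fn effective_severity(nominal: Severity, pattern: &Pattern) -> Severity {
    if pattern.client_safe {
        Severity::ClientSafe
    } else {
        nominal
    }
}

/// --hide-client-safe drops ClientSafe findings before the reporter sees them.
fn visible(severity: Severity, hide_client_safe: bool) -> bool {
    !(hide_client_safe && severity == Severity::ClientSafe)
}
```

Because the flag lives on the pattern, a detector that matches both a publishable and a secret prefix can downgrade only the publishable one.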
Reporter changes
- Text format: new CLIENT-SAFE 11-char label rendered in dim cyan (2;36) with a public-by-design remediation action (“Public by design (client bundle key) - verify scope restrictions.”). All severities right-justified to 11 chars so bordered boxes line up regardless of which tier fires.
- SARIF: ClientSafe → SARIF note level (same as Info / Low).
- Rule-filter / .keyhogignore severity-name: client-safe (kebab-case, matches the new serde rename_all).
v0.5.26 - 2026-05-27 - Mac arm64 hang fix (var-ref-concat regex DFA stall) + Windows UNC path strip + repo-hygiene gitignore
Cross-platform
- Mac arm64 keyhog scan hang on identifier-dense source. Cross-platform dogfood on Apple M4 Pro / macOS 26.3 / portable build (no Hyperscan) reproduced a 6+ minute stall on a 171-byte input: var token = circleCiScan.Flag("token", "X").Required().Envar("X").String(). Root cause is the var-ref-concat regex in multiline::config::has_var_ref_concat_line - the {1,8}-bounded alternation drives regex 1.12’s lazy-DFA construction into a quadratic loop on aarch64-apple-darwin. Linux x86_64 portable runs the same input in 0.6 s. Fix: cheap precheck - if the line contains no +, bail before the regex (the pattern requires at least one + to match, so this is correctness-preserving). Adds KEYHOG_PER_CHUNK_TIMEOUT_MS env-var deadline as a belt-and-suspenders backstop on the public scan / scan_with_backend entry points so any future pathological pattern caps out instead of hanging the whole scan.
- Windows UNC verbatim-prefix strip. Every finding’s location.file_path rendered as \\?\C:\Users\... (Rust’s std::fs::canonicalize always returns the extended-length form on Windows). Editors don’t jump-to-file on the verbatim form and the prefix leaks through JSON output as "\\\\?\\C:\\...". Added pub(crate) display_path(&Path) -> String in keyhog-sources::filesystem that strips the \\?\ prefix on Windows; the underlying PathBuf we use for I/O keeps the UNC form so >260-char paths still resolve. Wired through eight chunk-emit sites (filesystem.rs windowed mmap + buffered fallback + plain file + archive entries text/binary; binary/mod.rs ghidra decompiled + strings + section/strings).
- Cross-platform detector-dir discovery. auto_discover_detectors hardcoded /usr/share/keyhog/detectors and /usr/local/share/keyhog/detectors which silently no-op on Windows. Wrapped the Unix paths in cfg!(unix) and added dirs::data_dir() / dirs::data_local_dir() lookups so Windows users get %APPDATA%\keyhog\detectors / %LOCALAPPDATA%\keyhog\detectors discovery. Embedded detectors remain the default; the dir paths are only consulted when a user supplies a custom detector set.
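Two of the fixes above are small enough to sketch. Both functions are simplified assumptions: the real display_path takes a &Path and is Windows-gated, while this version works on &str so it runs anywhere, and `may_match_var_ref_concat` is a hypothetical name for the `+` precheck.

```rust
/// Strips the Windows verbatim prefix (\\?\) for display only.
/// The caller keeps the original path for I/O so long paths still resolve.
fn display_path(raw: &str) -> String {
    match raw.strip_prefix(r"\\?\") {
        Some(rest) => rest.to_string(),
        None => raw.to_string(),
    }
}

/// Cheap precheck before the expensive regex: the pattern needs at least one
/// '+', so a line without one cannot match and the regex can be skipped.
fn may_match_var_ref_concat(line: &str) -> bool {
    line.contains('+')
}
```

The precheck only ever skips work; a line that contains `+` still goes through the regex, so no match is lost.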
Repo hygiene
- Untrack coordination / plan / audit scratch files. Per the new Santh STANDARD prod-repo doc bleed rule, standalone repos like santhsecurity/keyhog track exactly README + SPEC + CHANGELOG + docs/. The 31 internal coordination files (coordination/ round briefs, ROUNDS.md, TESTING_PROGRAM.md, KEYHOG_LINUX_QUALITY_PROGRAM.md, WAVE10_AGENT_PUSH.md, GAP_FINDINGS.toml, TODO.md) were untracked from git and added to .gitignore. Files stay on disk via the backup santhsecurity/Santh monorepo - they just stop polluting the prod repo a crates.io / GitHub-Pages reader sees. Extended .gitignore with WAVE*.md, *_AUDIT*.md, *_PROGRAM.md, plan.md, .audits/, plans/ patterns so future scratch files are caught at write-time.
Build / test
- build_scanner_config: pub(crate) → pub. Four integration tests under crates/cli/tests/unit/orchestrator/build_scanner_config_*.rs import the function and need it externally visible. Was a pre-existing breakage in cargo test --workspace --no-run that CI didn’t catch because the failing tests aren’t in the per-crate --lib subset CI runs.
- exclude_paths_parses_from_cli Rust-1.83 fix. Old assertion Some(&["a.txt"[..]]) produced &[str; 1] which Rust 1.83+ rejects as an unsized array element. Rebuilt as a Vec<&str> collected from the Vec<String> field.
v0.5.25 - 2026-05-27 - cross-platform fixes (Windows build, basename \ separators, UTF-16 BOM decode) + contract recall (412 → 52 regressions restored via shape-filter Tier-A/Tier-B split + caseless fallback regex)
Cross-platform
- Windows build (E0432/E0433) - daemon module gated #[cfg(unix)]. It hard-imported tokio::net::UnixStream and std::os::unix::net::UnixStream, neither of which exist on Windows. keyhog daemon and --daemon now emit a clear “unix-only” error there instead of a build failure. Per-named-pipe Windows IPC support is tracked but unimplemented.
- Cross-platform path-separator suppression - five sites used POSIX-only rsplit('/') for basename extraction or contains("/dir/") for vendored-tree detection. Windows checkouts (C:\src\app\node_modules\…) silently skipped every gate. Switched to rsplit(['/', '\\']) + new contains_path_segment helper that tests both /seg/ and \seg\. Behaviour on POSIX paths unchanged.
- UTF-16 BOM file decode - decode_text_file unconditionally rejected every file starting with the literal UTF-16 BOM (\xff\xfe / \xfe\xff) as binary, before decode_utf16 (right below it) could decode them. Every UTF-16-BOM PowerShell / .NET config that ships on Windows was silently invisible to the scanner. Removed the false-positive guard; decode_utf16 handles BOM dispatch internally.
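The separator fix can be illustrated with two small helpers. `contains_path_segment` is named in the entry above, but its signature here is a guess, and `basename` is a hypothetical name for the rsplit-based extraction; treat both as a sketch of the idea rather than the shipped code.

```rust
/// Returns the final path component, splitting on either separator.
fn basename(path: &str) -> &str {
    // rsplit always yields at least one item, so the fallback is defensive.
    path.rsplit(['/', '\\']).next().unwrap_or(path)
}

/// True when `segment` appears as a whole directory component,
/// written with either POSIX or Windows separators.
fn contains_path_segment(path: &str, segment: &str) -> bool {
    let posix = format!("/{segment}/");
    let windows = format!("\\{segment}\\");
    path.contains(posix.as_str()) || path.contains(windows.as_str())
}
```

On POSIX paths the first branch behaves exactly like the old `contains("/dir/")` check, which is why behaviour there is unchanged.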
Recall - contract evasions restored (412 → 52)
- Shape-filter Tier-A / Tier-B split. Five shape-suppression filters (looks_like_pure_identifier, looks_like_word_separated_identifier, looks_like_scheme_prefixed_uri, looks_like_url_or_path_segment, contains_uuid_v4_substring) were applied universally in should_suppress_named_detector_finding as of v0.5.21..v0.5.24. They dropped legitimate service-anchored credentials whose body looks like an identifier / URL / UUID - PowerBI client_id UUIDs, mongodb:// URIs, avalanche RPC URLs, cockroachdb word-separated keys. Per the anti-rigging law: contracts are truth - when evasions DROP, fix the engine, not the contract. New is_generic_or_entropy_detector helper gates the five filters as Tier-B (generic-* / entropy-* only). looks_like_punctuation_decorated_identifier stays universal (Tier A): --api-secret, &password, Password: are grammar markers, never a credential body. Self-scan: 0 real findings, 1041 example/test keys suppressed (was 1020 pre-fix).
- Fallback regex compiler - caseless to match Hyperscan. shared_regex() built the regex crate without case_insensitive(true), but Hyperscan compiles every pattern CASELESS. Detectors with mixed-case alternations ((?:FRAMER|framer)[_=:\s"']+(?:api[_-]?)?(?:key|token)) bake uppercase only in the leading anchor, leaving api / key lowercase. FRAMER_API_KEY=<token> (uppercase) was matched by Hyperscan but silently missed by the fallback path - ~30 detectors affected.
Detector-specific
- transifex-api-token - second-pattern regex was transifex\.com.*[=:\s"']+(...). Hyperscan .* doesn’t span \n, so the canonical # https://transifex.com/api/3/\nAuthorization: Bearer <token> shape never matched. Switched to [\s\S]*? (lazy any-char). Keeps existing positives; restores the documented evasion.
- weatherapi-api-key - added a third pattern for the canonical curl shape (https://api.weatherapi.com/v1/...?key=<key>) where the domain appears BEFORE the key. The previous two patterns both required domain AFTER the key, missing the standard SDK invocation.
- intercom-access-token - TOML parse error silently dropped this detector from the embedded corpus since v0.5.21. The regex line used a single-quoted TOML literal string with an embedded ', which TOML literal strings do not allow. Switched to a triple-quoted (multi-line) literal string. Build script counted 891 but loader saw 890; this restores the missing detector.
Test infrastructure
- Boundary tests - STRADDLE_ABCDEFGHIJKLMNOPQRST (29 pure-alpha chars) was tripping looks_like_pure_identifier after v0.5.21’s filter widened to catch CamelCase / single-underscore identifiers in the 8..=40 alpha range. Test fixture now uses STRADDLE_A1CDEFGH2JKLMNOPQ8ST (digits sprinkled in), matching the AWS-access-key shape the test was designed to mirror.
- README banner pattern count - README_PATTERN_COUNT = 1646 → 1647 (one pattern added by the weatherapi third regex + one restored by the intercom fix).
- Clippy 1.95 - ten new lints (doc_lazy_continuation, manual_range_contains, manual_pattern_char_comparison, manual_contains, manual_char_is_ascii) on pre-existing code in suppression.rs. Idiom-only modernizations, no behavior change.
v0.5.24 - 2026-05-26 - dogfood non-PEM 27 → 22 (138 → 22 vs v0.5.21 baseline = −84%) via UUID-substring + email + blockchain-address-keyword + $ sigil + base64 hot-pattern wiring
Precision
- contains_uuid_v4_substring - captured values that wrap a UUID v4 / RFC-4122 (TOKEN_LIST=636765a9-1f92-4b40-ab0b-85ebd1e2c23d in bat-go docker-compose.reputation.yml). The entropy detector grabs the whole env-var assignment; the high-entropy payload is just the UUID, which is a public identifier, not a credential.
- looks_like_email_address - noreply@gogs.localhost (gogs TestInit.golden.ini:89 USER=… captured because of nearby PASSWORD= line). Email addresses are public identifiers, never credentials. Tightened local + domain alphabet checks keep real user:password DSN strings outside the rejection set.
- Blockchain / network-address keyword context in entropy fallback. Lines like SOLANA_BAT_MINT_ADDRS=EPeU…1Tpz, OWNER_PUBKEY=…, CONTRACT_ADDRESS=0x…, WALLET=… name a PUBLIC blockchain or network identifier - not a credential. Skip the entropy emit when the env-var key contains any of _ADDR, _ADDRS, _ADDRESS, _WALLET, _MINT_ADDR, _PUBKEY, _PUBLIC_KEY, _CONTRACT, _OWNER, _ACCOUNT_ID, _PEER_ID, _NODE_ID.
- Leading $ sigil rejection - GraphQL variable references ($api_key in shopify-cli mutation), shell variable expansions ($API_KEY), template placeholders (${SECRET}). Real credentials never start with $.
- base64_string.txt / base64_* filename pattern + hot-pattern path wiring. metasploitable3/.../base64_string.txt is a 600 KiB pure-base64 PNG flag file. Random byte sequences in the base64 stream coincidentally match the AWS Session Token ASIA[A-Z0-9]{16} literal-prefix hot pattern. The base64 decoder still produces its own filesystem/base64 chunk; only raw text-mode hits on these files are suppressed. Wired in BOTH should_suppress_named_detector_finding and the hot-pattern fast path.
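Two of the precision gates above reduce to simple string checks, sketched here. The marker list is copied from the entry; the function names, the ASCII upper-casing of the key, and the plain substring match are assumptions about how the real filter is written.

```rust
/// Key fragments that name a public blockchain or network identifier.
const PUBLIC_ID_MARKERS: [&str; 12] = [
    "_ADDR", "_ADDRS", "_ADDRESS", "_WALLET", "_MINT_ADDR", "_PUBKEY",
    "_PUBLIC_KEY", "_CONTRACT", "_OWNER", "_ACCOUNT_ID", "_PEER_ID", "_NODE_ID",
];

/// True when the env-var key contains any public-identifier marker.
/// ASSUMPTION: matching is case-insensitive via ASCII upper-casing.
fn key_names_public_identifier(key: &str) -> bool {
    let upper = key.to_ascii_uppercase();
    PUBLIC_ID_MARKERS.iter().any(|marker| upper.contains(*marker))
}

/// True for GraphQL variables, shell expansions, and template placeholders,
/// all of which start with '$'.
fn has_leading_dollar_sigil(value: &str) -> bool {
    value.starts_with('$')
}
```

With a substring match, a bare key such as `WALLET` (no leading underscore) would not hit `_WALLET`; whether the real filter handles that case separately is not stated here.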
Per-detector dogfood deltas vs v0.5.23
generic-secret 7 → 6 (shopify-cli graphql $api_key killed)
entropy-api-key 1 → 0 (Solana mint address killed by blockchain-keyword)
entropy-token 1 → 0 (UUID-substring killed TOKEN_LIST=<uuid>)
entropy-password 3 → 2 (email-shape killed noreply@gogs.localhost)
hot-aws_session_key 1 → 0 (base64_string.txt killed via hot-pattern wiring)
TOTAL non-PEM 27 → 22 (−19% this release; −84% vs v0.5.21 baseline)
private-key recall 782 + 30 = 812 unchanged
Residual 22 findings
21 of the 22 are TRUE POSITIVES that the engine should keep firing on:
- 6 alist OAuth client secrets committed to source (real public OAuth secrets in cloud-storage driver bindings - known leak by design).
- 4 metasploitable3 chef users.rb passwords (Dark_syD3, @dm1n1str8r, mesah_p@ssw0rd, Dark_syD3-class) - CTF/vulnerable-app credentials intentionally weak but ARE real credentials.
- 4 metasploitable3 / govwa generic-secret CTF passwords (govwaP@ss, D@rjeel1ng, but_master:, admin1234).
- 2 gogs golden test fixtures (PASSWORD=12345678, PASSWORD=87654321) - sequential-digit test passwords; engine correctly flags them.
- 1 metasploitable3 Autounattend.xml Microsoft Windows public-key token (real public ID, ambiguous).
- 1 railsgoat seeds.rb CTF password (motoXXX1445).
- 1 claude-code Datadog public client token (real, intentional public Datadog logging key).
- 1 shopify-api-ruby test JWT (shipping label JWT in a test response fixture).
- 1 openssl SSH private-key in test data (real PEM in test/recipes/).
The only remaining true FP is saltstack-credentials on railsgoat/config/initializers/constants.rb - engine offset bug (defect #80) emits a finding with no regex match; needs deeper investigation.
v0.5.23 - 2026-05-26 - dogfood non-PK 63 → 27 (−57%, 138 → 27 vs v0.5.21 baseline = −80%) via shape-filter unification + Rails-vendored detection + .b64 file skip + URI type-annotation suppression
Precision
- All shape filters now apply to every detector, not just generic-* / entropy-*. looks_like_pure_identifier, looks_like_word_separated_identifier, looks_like_scheme_prefixed_uri, looks_like_punctuation_decorated_identifier, looks_like_url_or_path_segment no longer gate on detector_id. Service detectors like cryptocompare-api-key were firing on SetMultipartFormData Go method names because their regex used Authorization[=:\s"']+([a-zA-Z0-9]{20,}) and the named-detector path bypassed shape gates. Real credentials have digits / long random suffixes / mixed alphabet - every filter has internal guards (!has_digit, max_word_len ≤ 10) that keep real keys outside the rejection set.
- looks_like_punctuation_decorated_identifier fixed for PEM blocks. The b'-' leading-sigil reject was too eager: -----BEGIN ... PRIVATE KEY----- starts with 5 dashes and was being suppressed alongside --api-secret CLI flags. Tightened to bytes.starts_with(b"--") && bytes[2] != b'-' so PEM markers (3+ dashes) survive but -- CLI flags still reject.
- .b64 / .base64 raw-file skip. Files explicitly marked as base64-encoded blobs (metasploitable3/resources/flags/jack_of_diamonds.b64 is a base64-encoded PNG) hold alphabet-coincidence matches inside the base64 stream (AIza…, sk-…, ASIA…). The base64 decoder pass still produces a separate filesystem/base64 chunk with the decoded content; only raw text-mode hits on the base64 source are suppressed.
- looks_like_scheme_prefixed_uri <short-alpha>:<short-alpha> type-annotation branch. bool:false, int:42, string:USD, kind:Secret documentation examples (llama-cpp arg.cpp:2468 --override-kv tokenizer.ggml.add_bos_token=bool:false,…) captured as bool:false and emitted as generic-secret. Real credentials never have this <3-15 alpha>:<≤10 alpha> shape.
- looks_like_vendored_minified_path extended for Rails-asset vendored JS. app/assets/javascripts/<name>.js is the legacy Rails asset path where vendored libraries (bootstrap, jquery, alertify, datatables, fullcalendar, etc.) live. First-party Rails JS today lives under app/javascript/ or app/assets/builds/. Match by basename prefix against a known-vendor list. Catches the railsgoat bootstrap-image-gallery-main.js honeybadger-api-key FP.
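The PEM-safe tightening of the leading-dash check can be sketched as below. The function name `looks_like_cli_flag` is hypothetical, and using `get(2)` instead of the `bytes[2]` indexing quoted above is a deliberate deviation in this sketch so a bare `--` input cannot index out of bounds.

```rust
/// True for "--flag" style values, false for PEM markers ("-----BEGIN ...")
/// and for anything with three or more leading dashes.
fn looks_like_cli_flag(value: &str) -> bool {
    let bytes = value.as_bytes();
    // Exactly two leading dashes followed by a non-dash byte.
    bytes.starts_with(b"--") && matches!(bytes.get(2), Some(b) if *b != b'-')
}
```

The third-byte test is what separates a CLI flag from a PEM header: both start with `--`, but only the PEM header continues with another dash.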
Per-detector dogfood deltas (v0.5.22 → v0.5.23)
generic-secret              8 → 7
cryptocompare-api-key       1 → 0
google-api-key              1 → 0
hot-aws_key                 1 → 0
hot-aws_session_key         3 → 1
honeybadger-api-key         1 → 0
redis-connection-string     1 → 0
saltstack-credentials       2 → 1
openai-api-key (transient)  2 → 0
TOTAL non-PK                63 → 27 (−57% this release)
TOTAL non-PK                138 → 27 (−80% vs v0.5.21 baseline)
private-key recall          782 unchanged (PEM filter regression caught + fixed)
v0.5.22 - 2026-05-26 - 22-repo dogfood drops non-PK findings 138 → 63 (−54%) via 8 new suppression filters + short-prefix anchor sweep
Precision (all 22-repo dogfood-driven)
- `looks_like_word_separated_identifier` - digit-bearing snake_case / kebab-case identifiers (`s3_secret_access_key`, `d2i_PKCS7_bio`, `sqlite3_int`, `curlx_memdup0`, `X-Shopify-Access-Token`, `Shopify-Storefront-Private-Token`). Max-word-length ≤ 10 keeps real credentials with `<prefix>_<long-random>` shape unaffected.
- `looks_like_scheme_prefixed_uri` - URI / URN / compound-scheme prefixes (`urn:shopify:params:oauth:token-type:online-access-token`, `secret-token:<base64>`, `sha256:<hex>` content digests).
- `looks_like_punctuation_decorated_identifier` - non-credential decorated shapes: CLI flags (`--api-secret`), C/Go pointers (`&gss_recv_token`), SQL/Ruby binds (`@v_password`), JS coercions (`!!apiKeyOrOAuthToken`), UI labels (`Password:`), TS non-null (`token!`), Unix paths (`/etc/passwd:/etc/passwd:ro`).
- `looks_like_url_or_path_segment` - multi-segment paths (`user/settings/password`, `/api/v1/access_token`).
- `looks_like_vendored_minified_path` - codemirror / pdfjs / wp-includes / node_modules / `.min.js` / `.bundle.js` - random byte sequences in vendored bundles are never credential leaks. Applied to BOTH named-detector and hot-pattern paths.
- `looks_like_secret_scanner_source` - the scanned file IS itself a secret scanner (`secretScanner.ts`, `trufflehog/`, `gitleaks/`). Every detector matches its own regex DEFINITIONS - path-keyword skip closes the gap that `looks_like_regex_literal_tail` left after unicode-escape / caesar decoders mangle trailing sigils.
- `looks_like_regex_literal_tail` promoted + hardened - shared between hot-patterns, generic-secret fallback, and named-detector path. Added `)/g,`, `)/gi,`, `)/i,`, `)/m,` suffixes for JS object-literal patterns (`{ key: /pat/g, … }`).
- Native-binary string-extraction source (`filesystem:binary-strings` and `filesystem/archive-binary`): all named-detector + hot-pattern findings suppressed. Compiled ELF / Mach-O / PE / wasm binaries produce random byte sequences that match short-prefix detectors (`sk-`, `pk_`, `AKIA`, `ASIA`, `K00M`, `AIza`, `dn_`). Real native-binary credential scanning lives behind the optional `binary` feature (Ghidra extraction with context).
- `has_binary_magic` extended to ELF / Mach-O 32-bit + 64-bit / PE / gzip / bzip2 / xz / 7z / RAR / GIF / JPEG / Ogg / ICO / WebAssembly / Unix `ar` / Python pickle magic bytes. Previously only PDF / ZIP / PNG / OLE - a 2.3 MB ELF binary with no extension (metasploitable3 `sinatra/aws/loader`) slipped past the binary filter.
- Entropy-fallback whitespace + comma reject - labels (`brave-talk-free sku token v1` macaroon ids) and DSN-shape config strings (`tcp,addr=:6379,password=macaron,db=0,…`) are never credentials.
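As a rough illustration of the magic-byte check described above, the sketch below tests a file's leading bytes against a handful of well-known signatures. The function name `has_binary_magic_sketch` and the table are hypothetical and cover only a subset of the listed formats; KeyHog's actual `has_binary_magic` may be structured quite differently.

```rust
/// Leading-byte signatures for a subset of the formats named in the changelog.
/// Illustrative only: this is NOT KeyHog's real table.
const MAGIC_PREFIXES: &[&[u8]] = &[
    b"\x7fELF",              // ELF
    b"\xfe\xed\xfa\xce",     // Mach-O 32-bit
    b"\xfe\xed\xfa\xcf",     // Mach-O 64-bit
    b"\xce\xfa\xed\xfe",     // Mach-O 32-bit, byte-swapped
    b"\xcf\xfa\xed\xfe",     // Mach-O 64-bit, byte-swapped
    b"\x1f\x8b",             // gzip
    b"BZh",                  // bzip2
    b"\xfd7zXZ\x00",         // xz
    b"7z\xbc\xaf\x27\x1c",   // 7z
    b"\0asm",                // WebAssembly
    b"!<arch>\n",            // Unix ar
    b"%PDF",                 // PDF
    b"PK\x03\x04",           // ZIP
    b"\x89PNG\r\n\x1a\n",    // PNG
];

/// Returns true when `head` (the first bytes of a file) starts with one of
/// the signatures above. Extension-less binaries are caught the same way as
/// named ones because only content is inspected.
pub fn has_binary_magic_sketch(head: &[u8]) -> bool {
    MAGIC_PREFIXES.iter().any(|magic| head.starts_with(magic))
}
```

Short signatures such as the two-byte gzip header can in principle collide with text content, so a real filter would likely combine this with other signals.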
Detector tightening
- z85-encoded-secret: dropped generic `encoded` keyword anchor. Go/JS/Python ubiquitously name their base64/hex output variable `encoded`; the detector was firing on every `encoded := …` value-position alphabet hit (bat-go suggestions_test.go, claude-code yoloClassifier.ts, gogs internal/tool/tool.go).
- helicone-api-key (`sk-` / `pk-` / `eu-`), stabilityai-api-key (`sk-`), clickup-api-token (`pk_`), deepnote-api-credentials (`dn_`) - all anchored to start-of-string or non-identifier byte. Pre-fix: `dn_` matched any 3 alpha-numeric continuation chars (e.g. `idn_curlx_convert_wchar_to_UTF8` in curl/lib/idn.c), `sk-` matched random ELF rodata.
Per-detector dogfood deltas vs v0.5.21 baseline
| detector | v0.5.21 | v0.5.22 | Δ |
|---|---|---|---|
| generic-secret | 38 | 8 | −79% |
| generic-password | 22 | 11 | −50% |
| entropy-* | 60 | 5 | −92% |
| z85-encoded-secret | 3 | 0 | −100% |
| deepnote | 3 | 0 | −100% |
| helicone | 1 | 0 | −100% |
| clickup | 1 | 0 | −100% |
| stabilityai | 2 | 0 | −100% |
| hot-aws_key | 1 | 0 | −100% |
| hot-aws_session_key | 3 | 1 | −67% |
| TOTAL non-PK | 138 | 63 | −54% |
Testing
10 new a3-pipeline unit tests covering each new shape (positive proves
suppression + adversarial twin proves real credentials still fire).
Stripe / MailChimp / Slack / GitHub-PAT fixture literals defanged via
concat!() for GitHub push-protection.
v0.5.21 - 2026-05-26 - regex-literal suppression + fallback identifier sharing + bandwidth promiscuous-pattern fix
Precision
- Regex-literal-tail suppression (hot-patterns fast-path AND generic-secret fallback). Source files that ship secret-scanner code (claude-code’s `teamMemorySync/secretScanner.ts`, `components/Feedback.tsx`, every trufflehog / gitleaks competitor) emit hot-pattern findings on their own regex DEFINITIONS - `AKIA[A-Z0-9]{16,17})/g`, `ASIA[A-Z0-9]{16})\b`, `xoxb-[0-9-]*`. Real tokens never end in regex sigils (no service uses `)/g` or `})\b` in its token alphabet). Tail check is O(1) across 20 known sigil suffixes - kills 4+ FPs in claude-code’s src/components/Feedback.tsx + utils/teamMemorySync/secretScanner.ts.
- `looks_like_pure_identifier` now wired into fallback_generic. Previously the named-detector path applied this filter (suppressing `getParameter` / `Benutzername` / `curlx_strdup`) but the generic-secret fallback emitted matches directly. Same pattern as the entropy-fallback fix in v0.5.19. `Get-Location` (PowerShell verb-noun, 12 chars, 1 hyphen, no digit) was the remaining FP shape this catches - claude-code’s `utils/powershell/parser.ts` line 1343 (`pwd: 'Get-Location'`).
- bandwidth-api-key dropped its bare `ClientID` / `ClientSecret` pattern. Those tokens are generic OAuth2 terminology, not Bandwidth-specific. alist’s drivers/pikpak/util.go, drivers/thunder/driver.go, drivers/pcloud/util.go all have `ClientSecret = "..."` for Xunlei/PikPak/PCloud OAuth flows - the captured values ARE leaked client secrets, but for entirely different services. The generic-secret fallback catches the same values via its `client[_-]?secret` keyword alternation, so recall is preserved at correct service attribution. 7 → 0 mis-attributed bandwidth-api-key findings.
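The regex-literal-tail idea can be sketched as a suffix test over a fixed list. The suffixes below are only the ones quoted in these notes; the full 20-entry list is not reproduced here, and the function name `looks_like_regex_literal_tail_sketch` is a hypothetical stand-in for the real filter.

```rust
/// Illustrative subset of regex-sigil suffixes taken from the examples in
/// the changelog. The shipped list is described as having 20 entries.
const REGEX_SIGIL_SUFFIXES: &[&str] = &[")/g", ")/gi", ")/i", ")/m", "})\\b", "]*"];

/// Returns true when a captured value ends like a regex definition rather
/// than like a credential. Cost is bounded by the fixed suffix list, not by
/// the length of the candidate.
pub fn looks_like_regex_literal_tail_sketch(candidate: &str) -> bool {
    REGEX_SIGIL_SUFFIXES
        .iter()
        .any(|suffix| candidate.ends_with(*suffix))
}
```

The design relies on the observation stated above: no service uses these sigils in its token alphabet, so a suffix hit is strong evidence of a pattern definition.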
v0.5.20 - 2026-05-26 - hot-pattern correctness + identifier filter extension + service-detector tightening
Critical correctness
- `SG.` hot-pattern fired on `MSG.length` JavaScript substrings. The fast-path scanner (`engine::hot_patterns`) emits Critical-severity findings without re-running the full detector regex; the per-pattern minimum-credential-length floor was 8 for every short-prefix pattern except `AKIA` / `ASIA`. `PASTE_HERE_MSG.length` contains the substring `SG.length` (9 chars) which sailed past the 8-byte floor and became a Critical `hot-sendgrid_key` finding in claude-code’s OAuthFlowStep.tsx. Same class affected `ghp_` (8-byte `ghp_xxxx` passes), `sk-proj-`, `xoxb-`, `xoxp-`, `sq0csp-`. Tightened to the true minimum length of each token format:
  - `ghp_`: 8 → 40 (ghp_ + 36 base62 = real GitHub PAT)
  - `sk-proj-`: 8 → 20 (sk-proj- + 12 alnum)
  - `SG.`: 8 → 26 (SG. + 22 first-segment base64)
  - `xoxb-`: 8 → 16 (xoxb- + 11 alnum)
  - `xoxp-`: 8 → 16 (xoxp- + 11 alnum)
  - `sq0csp-`: 8 → 16 (sq0csp- + 9 alnum)

  Real tokens still match (their length is well above the new floor); every shorter substring becomes a no-op.
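A minimal sketch of the per-prefix length floor, using the floor values listed above. The table name, the function name, and the choice to let unknown prefixes pass are assumptions for illustration; the real `engine::hot_patterns` code is not shown in these notes.

```rust
/// (prefix, minimum total length) pairs copied from the v0.5.20 notes.
const HOT_PATTERN_FLOORS: &[(&str, usize)] = &[
    ("ghp_", 40),
    ("sk-proj-", 20),
    ("SG.", 26),
    ("xoxb-", 16),
    ("xoxp-", 16),
    ("sq0csp-", 16),
];

/// Returns false when `candidate` starts with a known short prefix but is
/// shorter than that prefix's floor. Candidates with no listed prefix are
/// not governed by this table and pass through (an assumption of this sketch).
pub fn passes_length_floor(candidate: &str) -> bool {
    for &(prefix, floor) in HOT_PATTERN_FLOORS {
        if candidate.starts_with(prefix) {
            return candidate.len() >= floor;
        }
    }
    true
}
```

With this table, `SG.length` (9 bytes) and `ghp_xxxx` (8 bytes) are rejected, while a 40-byte `ghp_` value still passes.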
Precision
- `looks_like_pure_identifier` widened. The single-underscore / kebab-case shape escaped the prior `>= 2 underscores` or `0 separators` branches. Added `<= 1 separator (_ or -) + pure ASCII letters + no digit + 8..=40 chars` arm. Covers `curlx_strdup` (curl/lib/netrc.c), `auth_decoders` (curl/lib/http_aws_sigv4.c), `gss_token`, `user-password` (Go config field names), `aria-secret`, `Get-Function` (PowerShell verb-noun). All slipped through v0.5.19; now suppressed on the named-detector and entropy-fallback paths (the filter is shared crate-internal).
- blockcypher-api-token: dropped the global `token=<hex>` pattern. Was `token[=:\s\"']+([a-f0-9]{24,32})` - fired on every `Authorization: token <hex>` line in any REST-API test fixture (41 Shopify API test SHAs in v0.5.19 dogfood). Replaced with host-scoped pattern requiring `api.blockcypher.com` in the URL. 41 → 0 FPs.
- oxylabs-credentials: dropped the global `user-X:X` pattern. Matched every CSS `user-select:none`, `user-modify:read-write`, `user-drag:auto` declaration in pdf.js viewer.css / font-awesome / store-brave-com bundle.css. Real Oxylabs accounts are still caught via the context anchor below (extended to recognize `pr.oxylabs.io` / `dc.oxylabs.io` hostnames). 20+ CSS FPs killed.
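The new identifier arm (`<= 1 separator`, pure ASCII letters, no digit, 8..=40 chars) could look roughly like the following. This is a sketch of that single arm only, under a hypothetical name; it is not the full `looks_like_pure_identifier` filter and ignores its other branches.

```rust
/// Sketch of the single-separator arm: at most one `_` or `-`, every other
/// byte an ASCII letter, and total length within 8..=40. Any digit or other
/// symbol disqualifies the value, so it stays eligible as a credential.
pub fn single_separator_identifier_arm(value: &str) -> bool {
    if !(8..=40).contains(&value.len()) {
        return false;
    }
    let mut separators = 0usize;
    for b in value.bytes() {
        match b {
            b'_' | b'-' => separators += 1,
            b if b.is_ascii_alphabetic() => {}
            _ => return false, // digit, symbol, or non-ASCII byte
        }
    }
    separators <= 1
}
```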
Dogfood scope
49-target sweep with all v0.5.20 fixes:
| metric | v0.5.19 | v0.5.20 |
|---|---|---|
| blockcypher-api-token | 41 | 0 |
| oxylabs-credentials | 21 | 0 |
| generic-password | 90 | 77 |
| hot-sendgrid_key (FP) | 2 | 0 |
| total findings | 1212 | 1125 |
| zero-finding targets | 15 | 15 |
Real positives preserved: openssl 816 (test PEMs), PayloadsAllTheThings 61 (security-training examples), wafrift-cf-deploy 78 (test fixtures).
v0.5.19 - 2026-05-26 - entropy-fallback FP sweep (gogs 149 → 27, -82%; entropy total -79%)
Precision
- CI workflow files: entropy fallbacks no longer fire in `.github/workflows/`, `.gitlab-ci.yml`, `.circleci/`, `azure-pipelines*`, `bitbucket-pipelines*`, `.travis.yml`, `Jenkinsfile`. Real secrets in CI configs live behind `${{ secrets.NAME }}`; raw values are action version refs (`aws-actions/configure-aws-credentials@v1.0`), step names (`Setup Node`), bash subshells (`$(echo ${SHA} | base64)`). Named detectors (github-pat, aws-akia, slack-token) still fire on these paths via service-specific anchors. 25+ FPs killed across bat-go / bat-ledger / brave-talk / malachite / orb-firmware workflows.
- Shell expansion shapes: captures starting `$(`, `${`, `\"${`, `[{ \"`, `{ \"a`, `$ECR`, `$RUN`, or `$UPPER` (env-var refs) are shell command substitutions and template interpolations, not credentials. Workflow YAML emits these in volume; this filter catches the spillover when CI logic lives in `scripts/*.sh` or `Makefile` outside `.github/`.
- i18n / translation files: entropy-* now skipped in `/locale/`, `/locales/`, `/i18n/`, `/l10n/`, `/translations/`, `/lang/`, `/langs/` directories, `.po` / `.pot` files (gettext), and filename conventions like `locale_<region>.<ext>`, `messages_<lang>.properties`, `strings_<lang>.xml`. Translated strings around localized “password” / “token” / “key” keywords contain non-ASCII bytes (é, ã, ç, ī) whose Shannon entropy crosses the keyword-context floor. 103 → 0 entropy-password FPs in gogs locale_*.ini alone; whole-target drop 149 → 27 findings (-82%).
- Shared identifier-shape filter: extracted `looks_like_pure_identifier` from the named-detector suppression path to crate-internal scope and wired the entropy fallback through it. Previously the `_password = getParameter(…)` and German “Benutzername” cases were suppressed via the named path but the entropy fallback emitted them directly - same shape, different code path. Now both share one identifier-shape contract (snake_case ≥2 `_` no-digit, CamelCase no-digit, pure-alphabetic word 8..=32).
Dogfood scope (proof, not sample)
23-target sweep; entropy-* family delta:
| detector | v0.5.18 | v0.5.19 | Δ |
|---|---|---|---|
| entropy-password | 107 | 11 | -90% |
| entropy-token | 26 | 13 | -50% |
| entropy-api-key | 21 | 8 | -62% |
| entropy total | 154 | 32 | -79% |
Per-target highlights: gogs 149 → 27 (-82%), brave-talk 5 → 0, orb-firmware 13 → 1 (-92%), malachite 10 → 1 (-90%), webgoat 5 → 2, bat-ledger 14 → 9, bat-go 29 → 21. Twelve targets in the 23-target sweep now report 0 findings (brave-talk, colly, constellation, diffvg, mpc-lib, nitriding-daemon, orb-relay-messages, qtrap, spill, _self - keyhog scanning itself - plus the existing two). openssl’s 816 are test-PEM private-key findings (true positives in fixtures, not FPs); PayloadsAllTheThings’s 61 are intentional security-training examples.
v0.5.18 - 2026-05-26 - dogfood FP sweep (12-target deep scan, 160 → 83 findings, ~48% FP reduction)
Precision
- deel-api-key matched Java JNI macro names. Pattern was `org_[a-zA-Z0-9_-]{30,}` which fired on every `org_sqlite_jni_capi_CApi_*` macro in `javah`-generated C headers (41 FPs in sqlite alone, applies to every Java-bindings library shipping JNI). Tightened to `org_[a-zA-Z0-9]{30,}` - real Deel org tokens are opaque base62 with no underscores or hyphens. Same fix for the `organization_` variant.
- generic-secret captured C++ / Rust scope resolution. The bridge regex consumed one `:`; the second stayed in-value because `:` is in the alphabet to keep `nginx@sha256:<hex>` recall. The leak captured `:open_paren:` (jinja lexer enum redirects, 32+ in llama-cpp), `PrivateKey::`, `Etc::passwd`, `K256Config::SigningKey` (malachite signing-ecdsa). Added two filters: drop captures starting with `:` AND captures containing `::` anywhere. Sha256 digests pass both filters (start with hex, no `::`).
- generic-secret captured Rust/Java/C# type names. Pure-CamelCase values like `K256SigningKey`, `P256VerifyingKey`, `ShopifyToken` slipped the pure-CamelCase identifier filter because they include digits. Added a “type-name shape” filter: 8..=40 chars, starts with uppercase, ≥ 2 uppercase letters, has lowercase, pure ASCII alphanumeric. Real random credentials only hit this shape by coincidence; structured TypeName-with-version-digit is overwhelmingly an identifier.
- generic-password captured Java method references. Lines like `databasePassword = getParameter(servlet, DATABASE_PASSWORD);` (webgoat WebgoatContext.java) captured `getParameter` (12-char pure CamelCase, no digit). Extended `looks_like_pure_identifier` to also suppress pure-alphabetic 8..=32 char values with no digit (covers CamelCase identifiers AND natural-language dictionary words like German “Benutzername”). Real credentials have at least one digit or symbol.
- entropy-api-key captured Java keystore filenames. Bat-go’s docker-compose.yml had 4+ findings on `kafka.broker1.keystore.jks` / `kafka.broker1.truststore.jks` next to `KEYSTORE_FILENAME:` anchors. Added a filename-suffix filter that drops values ending in `.jks`, `.yml`, `.yaml`, `.toml`, `.json`, `.properties`, `.pem`, `.key`, `.crt`, `.cer`, `.pfx`, `.p12`, `.keystore`, `.truststore`, `.conf`, `.ini`, `.env`, `.lock`, `.log`. Real credentials never end in a known file extension.
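The “type-name shape” rule above translates fairly directly into a predicate. The sketch below encodes exactly the five stated conditions; the function name is hypothetical and the real filter may apply additional checks.

```rust
/// Sketch of the type-name shape filter: 8..=40 chars, starts with an
/// uppercase letter, at least two uppercase letters, at least one lowercase
/// letter, and nothing but ASCII alphanumerics.
pub fn looks_like_type_name_sketch(value: &str) -> bool {
    let len_ok = (8..=40).contains(&value.len());
    let all_alnum = value.bytes().all(|b| b.is_ascii_alphanumeric());
    let starts_upper = value
        .as_bytes()
        .first()
        .map_or(false, |b| b.is_ascii_uppercase());
    let upper_count = value.bytes().filter(|b| b.is_ascii_uppercase()).count();
    let has_lower = value.bytes().any(|b| b.is_ascii_lowercase());
    len_ok && all_alnum && starts_upper && upper_count >= 2 && has_lower
}
```

Values containing `::` fail the alphanumeric check here, which is consistent with those shapes being handled by the separate scope-resolution filters.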
CI / tests
- Test gate stayed red on integration-test type drift. `bconcat!` macro was removed in c031c84 but two call sites kept the old form; `S3Source.name()` test didn’t import the `Source` trait. Both fixed: `bconcat!(...)` → `concat!(...).as_bytes()`, `use keyhog_core::Source;` added to the S3 gate.
- Exit code consolidation. `main.rs` was redefining `EXIT_SCANNER_PANIC = 11` locally; now imports `keyhog::orchestrator::EXIT_SCANNER_PANIC`. One source of truth.
Dogfood scope (proof of FP reduction, not a sample)
Twelve real-world targets, all pre-v0.5.18 captures verified manually: sqlite, nginx, flutter, shopify-cli, shopify-api-ruby, malachite, webgoat, llama-cpp-turboquant, bat-go, orb-firmware, brave-talk, nitriding-daemon. Per-target totals:
| target | v0.5.17 | v0.5.18 | Δ |
|---|---|---|---|
| sqlite (deel JNI) | 41 | 6 | -85% |
| llama-cpp (jinja) | 41 | 7 | -83% |
| webgoat (Java) | 5 | 3 | -40% |
| malachite (Rust) | 10 | 8 | -20% |
| shopify-api-ruby | 10 | 8 | -20% |
| shopify-cli | 5 | 4 | -20% |
| bat-go (filenames) | 29 | 28 | -3% |
| orb-firmware | 13 | 13 | 0 |
| brave-talk | 5 | 5 | 0 |
| nginx | 1 | 1 | 0 |
| nitriding-daemon | 0 | 0 | ✓ |
| _self (keyhog repo) | 0 | 0 | ✓ |
| total | 160 | 83 | -48% |
Detector-level deltas: deel-api-key 35→0 (-100%), generic-secret 61→22 (-64%), generic-password 4→0 (-100%), entropy-api-key 27→27 (filename filter wave 2 still pending wider rollout).
v0.5.17 - 2026-05-26 - SSRF redirect closure + --insecure honor + oob hygiene
Security
- SSRF redirect bypass in DNS-pinned client closed. The per-request client rebuild in `verify::request::resolved_client_for_url` was `Client::builder().timeout().resolve_to_addrs().build()` - silently inheriting reqwest’s default `Policy::limited(10)` instead of the engine’s `Policy::none()`. An attacker-controlled verification target could return `302 Location: http://internal-target/` and the pinned client would follow it; the DNS pin only covers the ORIGINAL host, so reqwest re-resolved the redirect target via the system resolver with no second pass through the SSRF guards. Now the rebuild explicitly sets `redirect(Policy::none())`. Adversarial test `pinned_client_does_not_follow_redirect_to_private_target` proves it.
- SSRF bypass via hex / octal-encoded IPv4 hosts closed. `verifier::ssrf::is_private_url` blocked decimal (`2130706433`) and dotted-decimal (`127.0.0.1`) but accepted hex (`0x7f000001`) and octal (`017700000001`). glibc / musl resolvers canonicalize all four to loopback, so the gap let an attacker controlling a verification target redirect requests to internal hosts. Both radix paths now blocked. See `crates/verifier/src/ssrf.rs`.
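To make the hex / octal gap concrete, here is a small standard-library-only sketch that canonicalizes a single-integer host in decimal, hex, or octal form to an `Ipv4Addr` and then applies range checks. Both function names are hypothetical, the set of blocked ranges is an assumption, and dotted forms with per-octet radix prefixes are not handled; the shipped logic lives in `crates/verifier/src/ssrf.rs` and may differ.

```rust
use std::net::Ipv4Addr;

/// Parses a host written as ONE integer (decimal, `0x` hex, or leading-zero
/// octal) into an IPv4 address. Returns None for anything else, including
/// ordinary hostnames.
pub fn parse_integer_ipv4(host: &str) -> Option<Ipv4Addr> {
    let hex = host.strip_prefix("0x").or_else(|| host.strip_prefix("0X"));
    let value = if let Some(digits) = hex {
        u32::from_str_radix(digits, 16).ok()?
    } else if host.len() > 1 && host.starts_with('0') {
        u32::from_str_radix(&host[1..], 8).ok()?
    } else {
        host.parse::<u32>().ok()?
    };
    Some(Ipv4Addr::from(value))
}

/// Returns true when the integer-encoded host resolves to a loopback,
/// private, link-local, or unspecified address (illustrative range set).
pub fn is_private_integer_host(host: &str) -> bool {
    parse_integer_ipv4(host).map_or(false, |ip| {
        ip.is_loopback() || ip.is_private() || ip.is_link_local() || ip.is_unspecified()
    })
}
```

All three encodings of loopback quoted above (`2130706433`, `0x7f000001`, `017700000001`) map to `127.0.0.1` under this parser, which is the canonicalization the note says the resolvers perform.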
Fixed
- `--insecure` flag now honored on the DNS-pinned path. Same root cause as the redirect bypass above: the per-request client rebuild dropped `danger_accept_invalid_certs(insecure_tls)` baked into the engine’s base client, so `--insecure` (and `KEYHOG_INSECURE_TLS`) silently did nothing for direct (non-proxy) verifications. Threaded `insecure_tls` through `VerifyTaskShared` → `verify_with_retry` → `resolved_client_for_url` and re-applied it on the rebuild.
- Scanner-panic exit code no longer collides with detector-audit. Mid-scan scanner thread panic returned exit code 3, the same value `detectors --audit` uses for “audit flagged a quality issue”. CI scripts had no way to tell “scanner crashed mid-run, results unreliable” from “detector quality regression”. Scanner-panic now exits 11, matching the orchestrator’s `EXIT_SCANNER_PANIC` and documented in `keyhog --help`.
- scan-system exit code. `keyhog scan-system` returned 0 regardless of findings; CI pipelines couldn’t gate on it. Now returns 1 when `all_findings` is non-empty, matching the scan / hook contract.
- find_companion off-by-one. `pipeline::find_companion` shifted the search window past line 1 because `primary_line` is already 1-based but the code added `FIRST_LINE_NUMBER` again. Companions on the line immediately above the radius were silently missed.
- UTF-8 in JSON value extraction. `decode::json::extract_json_strings` iterated raw bytes and pushed `byte as char`, corrupting every multi-byte UTF-8 sequence inside JSON strings into Latin-1 garbage. Switched to `char_indices()`.
- Zero-width regex hits in `extract_plain_matches`. Sibling function `extract_grouped_matches` already skipped zero-width matches; plain-match path didn’t and emitted empty-credential findings on lookahead-only patterns. Added the matching guard.
- Panic-on-init paths removed from prefilter + disclaimer loaders. Three `.expect()` calls on `AhoCorasick::new` / `toml::from_str` poisoned `LazyLock` and killed worker threads on any platform-specific compile failure. Converted to soft fallback (`Option` / empty list) with `tracing::warn!`. Worker threads now survive a corrupted-binary / build regression.
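The UTF-8 bug in the list above is easy to reproduce in isolation. The two helpers below are hypothetical and exist only to contrast the byte-wise cast with character-wise iteration; they are not the code in `decode::json`.

```rust
/// Reproduces the bug shape: each raw byte is cast to a char, so every
/// multi-byte UTF-8 sequence is split into separate Latin-1 code points.
pub fn lossy_bytes_as_chars(s: &str) -> String {
    s.bytes().map(|b| b as char).collect()
}

/// The corrected shape: iterate decoded characters (with their byte
/// offsets available if needed), so multi-byte sequences stay intact.
pub fn chars_preserved(s: &str) -> String {
    s.char_indices().map(|(_, c)| c).collect()
}
```

For example, `é` is encoded as the bytes `0xC3 0xA9`; the byte-wise version turns it into the two characters U+00C3 and U+00A9, while the `char_indices()` version returns it unchanged.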
Changed
- `InteractshClient::for_test` returns `Result` instead of panicking. The helper formerly carried `RsaPrivateKey::new(...).expect("test RSA key generates")` - a panic-in-production path the no-unwrap gate caught. Returns `Result<Self, InteractshError>` now (mapped to `KeyGen`); test callers wrap with `.unwrap()` at the test boundary. Source: gate `oob_client_no_unwrap_expect`.
- `oob::client` split: `decrypt_entry` moved to `oob::decrypt`. File hit 516 lines (over the 500 modularity cap). Natural seam - client owns RSA state + HTTP I/O, decrypt owns AES-256-CFB per-entry decode. No behaviour change. Source: gate `oob_client_file_size_cap`.
- README exit codes match `--help`. Documented codes 3 (detectors --audit failure), 4 (backend --self-test failure), 10 (live findings under `--verify`), and 11 (scanner panic) - README previously listed only 0/1/2.
- Hash-digest gate is no longer always-on for named detectors. Service-anchored detectors (`ALCHEMY_API_KEY=<32hex>`, `HEROKU_API_KEY=<uuid>`, `DATADOG_API_KEY=<32hex>`) now bypass both the hash-digest and UUID-shape gates - the regex anchor is positive evidence the value is a credential, not a hash. Generic / entropy / private-key paths stay gated. Fixed 21 contracts that were failing their scale gate because their legitimate credential body was being suppressed as hash-shaped.
- `kubernetes-secret` detector disabled. Was the #1 false-positive source (795 FPs on SecretBench-medium) because it surfaced the base64-encoded value while the truth set was the decoded value, so the scorer never matched the overlap. Structured preprocessor already extracts + decodes `data:` values and appends them as plaintext lines for every downstream detector. Detector file kept (vs deleted) so the embedded count stays stable.
- Case-insensitive variants added to azure-subscription-key, cloudflare-api-token, heroku-api-key, honeybadger-api-key - camelCase and kebab-case env-var forms now match. New `aws-secret-access-key` detector matches the 40-char body in SCREAMING_SNAKE, camelCase, INI / properties, and kebab-case contexts. New `azure-storage-account-key` detector matches the 88-char body after `AccountKey=` in connection strings.
- Verifier SSRF blocklist routed through the vendored bogon crate. The hand-maintained IANA-bogon match arms (loopback, link-local, private, multicast, benchmark, documentation, broadcast) were drifting; the bogon crate tracks the registries.
- README overhauled. Stale ~60-line Roadmap section killed. New “What it catches” section enumerates detector categories with concrete services. “Why higher recall, fewer false positives” rewritten around the five real moats. Daemon mode, scan-system, and lockdown promoted from sub-sections to top-level. Honest dual recall numbers (96% on synthetic / 69% on realistic SecretBench-medium).
Added
- Documentation site under `site/`. 17 hand-authored pages (intro, install, quickstart, scan, output formats, baselines, allowlists, CI/SARIF, pre-commit hooks, daemon mode, system triage, detector catalog with live filter over all 891, configuration, library API, architecture, performance, lockdown, FAQ). Black-on-white with restrained yellow accents. Build with `python3 site/build.py`; deploy to GitHub Pages.
- Per-detector self-validation test (`tests/all_detectors_self_validate.rs`). Walks every TOML in `detectors/`, asserts each loads, compiles into the scanner regex backend, declares ≥1 keyword ≥3 chars, has service + patterns metadata, and contributes to the `tests/contracts/` coverage floor (currently 38%). Catches load-but-never-fires regressions before they ship.
- SecretBench v5 corpus + provider-anchor wrappers. Bench fixtures now wrap 70% of secrets in their service-anchored env-var name (`AWS_SECRET_ACCESS_KEY=…`, etc.) instead of generic `SECRET_KEY=…`. Matches real-repo distribution. `fn_analyze.py` companion to `fp_analyze.py` for triaging false-negative buckets the same way as false-positive ones.
- CI workflows fixed. secretbench-nightly and vendor-vyre were both failing on YAML scope errors (inline Python in block scalars). Python summary now lives in `tools/secretbench/scoring/print_summary.py`; vendor-vyre commit message built via `printf` into a temp file. The vendor-vyre workflow now exits cleanly when the optional `SANTH_GITHUB_PAT` secret is missing instead of failing red.
Performance
SecretBench-medium scoreboard (15k fixtures, seed 0):

| run | F1 | precision | recall | TP | FP | FN |
|---|---|---|---|---|---|---|
| v17 | 0.7710 | 0.8449 | 0.7089 | 10634 | 1952 | 4366 |
| v18 | 0.7120 | 0.7078 | 0.7162 | 10743 | 4436 | 4257 |
| v19 | 0.7815 | 0.9018 | 0.6895 | 10342 | 1126 | 4658 |

v18 was a regression (bypass-all-shape-gates added 3304 FPs in the sha-hex / git-commit-sha buckets); v19 restored the hash-digest gate as always-on; the Unreleased bypass-on-anchor fix is being measured next.
v0.5.16 - 2026-05-23 - JsonDecoder wired into decode registry
Fixed
JsonDecoder is now in the decode-through pipeline. It had a
splice-aware implementation in crates/scanner/src/decode/json.rs
since v0.5.15 but was never registered in get_decoders() - pure
dead code. Credentials stored as JSON-encoded fields (the most
common shape after .env) silently went unsurfaced.
Result on the adversarial_explosion_runner corpus (348 detectors × ~2 positives × 8 real-world wrappers):
| state | variants firing |
|---|---|
| v0.5.15 | 5719 / 5792 (73 JSON-wrapper misses) |
| v0.5.16 | 5792 / 5792 (corpus is wrapper-tight) |
The runner is now strict-by-default
(KEYHOG_ADVERSARIAL_STRICT=0 to opt out) so any future
regression that loses a single variant turns CI red.
v0.5.15 - 2026-05-23 - decode-through splice: base64/hex recall 30% → 93%
Fixed
Decode-through pipeline preserves companion context now. Decoded
chunks used to be bare bytes with no surrounding text - every
detector anchored on a companion keyword (aws_secret = …,
Authorization: Bearer …, api_key: …) lost its anchor as soon
as the credential was recovered from an encoded blob.
push_decoded_text_chunk_spliced in
crates/scanner/src/decode/pipeline.rs now splices the decoded
text BACK into the parent at the position of the original encoded
blob. Measured on the new encoding_explosion_runner corpus
(348 detectors × ~2 positives):
| encoding | before | after | delta |
|---|---|---|---|
| base64-std | 30.5% | 93.1% | +62.6pp |
| base64-url | 30.5% | 92.8% | +62.3pp |
| hex | 30.5% | 92.8% | +62.3pp |
| url-percent | 15.5% | 79.7% | +64.2pp |
Migrated decoders: base64 (Base64Decoder + Z85Decoder), hex,
json, url (via decode_candidates). Splice path is memory-capped
at 256 KiB parent so multi-MB chunks don’t blow allocation.
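The splice step can be pictured as replacing the encoded span in the parent text with its decoded form, so companion keywords on either side stay adjacent to the recovered credential. The sketch below is a simplified, hypothetical version (the name `splice_decoded`, its signature, and returning `None` above the cap are all assumptions); it is not the `push_decoded_text_chunk_spliced` implementation.

```rust
/// Parent-size cap mentioned in the notes: 256 KiB.
const MAX_SPLICE_PARENT_BYTES: usize = 256 * 1024;

/// Replaces `parent[blob_start..blob_end]` (the encoded blob) with `decoded`.
/// Returns None when the parent exceeds the cap or the range is invalid
/// (out of bounds, inverted, or not on a UTF-8 character boundary).
pub fn splice_decoded(
    parent: &str,
    blob_start: usize,
    blob_end: usize,
    decoded: &str,
) -> Option<String> {
    if parent.len() > MAX_SPLICE_PARENT_BYTES || blob_start > blob_end {
        return None;
    }
    let before = parent.get(..blob_start)?;
    let after = parent.get(blob_end..)?;
    Some(format!("{before}{decoded}{after}"))
}
```

With the parent `aws_secret = QUJD` and the blob at bytes 13..17, splicing in the decoded text `ABC` yields `aws_secret = ABC`, which keeps the `aws_secret` anchor next to the value for downstream detectors.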
Added
- `keyhog scan --proxy <URL>` - route every outbound HTTP request through an HTTP/HTTPS/SOCKS5 proxy. Falls back to `KEYHOG_PROXY` / `HTTPS_PROXY` / `HTTP_PROXY` / `ALL_PROXY` env. `--proxy off` disables proxying including env inheritance (air-gapped scans).
- `keyhog scan --insecure` - skip TLS verification for every outbound request. Needed when scanning through Burp / mitmproxy CAs with self-signed certificates. Env: `KEYHOG_INSECURE_TLS=1`.
- Shared `keyhog_sources::http` policy module. Single source of truth for proxy + TLS + UA so an operator setting `KEYHOG_PROXY` affects every outbound request uniformly.
- 40 000-case proptest suite for the HTTP-client policy and SARIF dedup contracts (`crates/sources/tests/property/http_fuzz.rs`, `crates/core/tests/property/sarif_dedup.rs`).
- 5 500-case adversarial wrapper-explosion runner - re-embeds every contract positive in 8 real-world formats and asserts the detector fires.
- 6 500-case path-shape runner - replays every positive at 5 production paths and 4 suppressed-shape paths.
- 5 070-case encoding-explosion runner with split decode-hit vs incidental-hit metrics. Floors pinned so a regression below 88% base64 / 92% hex / 75% url-percent trips the gate.
- `tests/live_verify.rs` - env-gated live-verify smoke against real AWS/GitHub creds (`KEYHOG_LIVE_VERIFY=1`).
- `tools/diff_bench/` - single-shot runner that drives keyhog + trufflehog + gitleaks across one labeled corpus (positives synthesized at CI runtime to dodge push-protection) and emits `differential_results.json` with per-scanner precision / recall / F1 / timing. `.github/workflows/differential-bench.yml` runs nightly + on workflow_dispatch.
v0.5.14 - 2026-05-23 - macOS x86_64 + Windows release binaries
Added
release.yml now produces five assets per tag instead of two:
- `keyhog-linux-x86_64` (default features, dynamic Hyperscan)
- `keyhog-macos-aarch64` (Apple Silicon, `portable` features)
- `keyhog-macos-x86_64` (Intel mac, `portable` features) - new
- `keyhog-windows-x86_64.exe` (MSVC, `portable` features) - new
The Windows + Intel-mac variants share the existing portable
feature subset (every detector data feature, every git / web /
github / s3 / docker / verify source backend, no Hyperscan /
Ghidra / CUDA system libs). Daemon IPC is #[cfg(unix)]-gated,
so it compiles to a stub on Windows hosts without disabling the
rest of the binary surface. v0.5.13 only shipped the prior two
assets because the matrix change landed after the tag was cut.
v0.5.13 - 2026-05-23 - SARIF dedup so GitHub Code Scanning accepts uploads
Fixed
SARIF v2.1.0 forbids duplicate items in relatedLocations. When a
finding had the same supplemental location reported twice (e.g.
verifier echo + scanner overlap), GitHub Code Scanning rejected the
whole SARIF with relatedLocations contains duplicate item,
silently losing every finding on the upload. The dedup runs on a
(file_path, line, offset) key before serialization, so each
related location appears at most once.
This is what unblocks the fleet-wide keyhog.yml CI rollout -
prior to this fix every repo that produced a finding lost its
SARIF, leaving the Code Scanning tab empty even when the run was
“green”.
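The dedup described in this entry amounts to keeping the first occurrence of each `(file_path, line, offset)` key. A minimal sketch, assuming a hypothetical `RelatedLocation` struct with just those three fields (the real SARIF types carry more data):

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down stand-in for a SARIF related location.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RelatedLocation {
    pub file_path: String,
    pub line: u32,
    pub offset: u64,
}

/// Drops repeated locations while preserving the order of first appearance,
/// so each (file_path, line, offset) key is serialized at most once.
pub fn dedup_related_locations(locations: Vec<RelatedLocation>) -> Vec<RelatedLocation> {
    let mut seen: HashSet<(String, u32, u64)> = HashSet::new();
    locations
        .into_iter()
        .filter(|loc| seen.insert((loc.file_path.clone(), loc.line, loc.offset)))
        .collect()
}
```

`HashSet::insert` returns `false` for a key that is already present, which is what filters out the second copy of a verifier echo or scanner overlap.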
v0.5.12 - 2026-05-23 - dedup 9 more dup-primary detectors
Fixed
Dropped the duplicate “secret/companion” primary in nine more detectors. Companion-only text no longer fires the detector without the id-half nearby.
- hashicorp-vault-approle-credentials (Vault Secret ID)
- qualys-api-credentials (qualys_username)
- remitly-api-credentials (Remitly client ID)
- smartproxy-credentials (smartproxy_username)
- tidb-cloud-credentials (TiDB Public Key)
- veracode-api-credentials (veracode_api_secret)
- zscaler-api-key (zscaler_client_secret)
- zuora-api-credentials (zuora_client_secret)
- cloudflare-zero-trust-service-token (client_secret) - positives use the Client-Id shape, so dedup is safe even with main contract.
belvo, crisp, env0, exoscale, checkmarx, crowdstrike, fastspring, fedex still have the dup-shape - their main contracts have a secret-only positive that fires by design, so dedup would regress recall and isn’t a safe local sweep.
Changed
- Pattern count 1674 → 1665 across README + e2e_binary + readme_claims gate.
v0.5.11 - 2026-05-23 - dedup carbon-black + databricks
Fixed
- carbon-black-api-key: dropped duplicate org-key primary (kept as required companion). org_key=… alone no longer fires the detector without a CB API KEY primary nearby.
- databricks-token: dropped duplicate workspace-url primary
(kept as companion). A bare workspace URL with no
dapitoken nearby no longer fires the detector.
Same SURPLUS shape as the v0.5.9/v0.5.10 sweeps. These two had existing main contracts whose positives did NOT depend on the dropped primary firing alone - verified before edit.
Changed
- Pattern count 1676 → 1674 across README + e2e_binary + readme_claims gate.
v0.5.10 - 2026-05-23 - detector dedup sweep + binary/crates alignment
Fixed
- Dedupe primary-equals-companion in 14 detectors (idenfy, infura, jumio, marvel, packer, scaleway, sovos, thomson-reuters-onesource, time4vps, twilio-iot, upcloud, vonage-video, wix, woocommerce). Each listed the “secret / companion” half as a duplicate primary regex; companion-only text would fire the detector. Same SURPLUS shape closed in v0.5.9 for ringcentral / booking-com / vanta / trulioo / appdynamics / avalara / akoya - this sweep covers the rest of the corpus that has no main contracts yet, so existing positives can’t regress.
- Test-target clippy lints in gpu_ac_recall_bug_56, cve_replay_runner, companion_contracts_runner, property/scanner_fuzz.
Changed
- Pattern count 1697 → 1676 across README banner + `e2e_binary::README_PATTERN_COUNT` + `readme_claims` gate.
- v0.5.10 binary release and crates.io publish are built from the same commit. v0.5.9 shipped a linux binary built from the tag commit before CI dedup landed; crates.io was never published at 0.5.9 (CI test red on the pattern-count drift).
v0.5.9 - 2026-05-23 - companion contracts gate + LFS coverage
Fixed
- Companion contracts gate (12 issues closed). Five detectors (ringcentral, booking-com, vanta, trulioo, appdynamics) listed the “secret” half as a duplicate primary regex, so the secret-only `negative_companion_lookalike` fixture fired the detector. Removed the duplicate primaries; secret is now companion-only. Akoya / avalara had the same dup-primary shape.
- bitbucket-app-password companion regex. Was `[a-zA-Z0-9._-]+` (matched anything), so primary-only text populated `companion.username` from inside the primary’s own assignment line and verification proceeded despite `must_not_verify`. Re-anchored to `bitbucket_username=` shape.
- ringcentral companion now anchored to client_secret= shape so id-only text no longer populates `client_pair` and triggers VERIFY-RISK.
- Three twilio companion fixtures used `xxx` / `fake` placeholders containing non-hex characters that the example-credential filter suppressed; swapped to realistic hex so the gate tests the engine behavior, not the example-credential filter.
- rustfmt - `scan_gpu.rs` + `engine/mod.rs` re-joined now-short calls after the `matching` → `scan` module migration.
Changed
- `.gitattributes` now covers `contracts/companion/*.toml` in LFS. The original LFS rule was non-recursive; companion fixtures with Twilio-shaped strings would otherwise trip GitHub push-protection.
v0.5.8 - 2026-05-23 - daemon wire-v2, GitHub Action, contracts gate
Added
- GitHub Action that actually works.
uses: santhsecurity/keyhog/.github/actions/keyhog@v0.5.10now installs the Rust toolchain + Vectorscan/Hyperscan and builds keyhog, or downloads a prebuilt binary from the matching GitHub Release when one exists. Previously the action rancargo buildwithout setup, so every downstream Ubuntu run failed withcargo: command not foundor a hyperscan-sys linker error. SARIF output auto-uploads to code-scanning whenformat: sarif. README example was also pointing at a nonexistentkeyhog/keyhog-action@v1repo - fixed to the bundled action path. .github/workflows/release.yml- tag-driven binary build- upload. Pushing a
v*tag now compileskeyhogforkeyhog-linux-x86_64(default features incl. Hyperscan via apt) andkeyhog-macos-aarch64(feature subset, no Hyperscan), then attaches the artifacts to the release. The composite action prefers these prebuilt binaries over a cold cargo build whenever the host triple matches.
- upload. Pushing a
- `KEYHOG_DOGFOOD=1` - daemon-side dogfood capture. Set when starting the daemon (`KEYHOG_DOGFOOD=1 keyhog daemon start`) to enable per-scan event capture inside the daemon; the events cross the wire to the client and flow into `--dogfood` output. Per-request toggling is not wired - env-var gating keeps one client’s debug session from bleeding into another client’s payload on a shared daemon, which a per-request flag would break without additional isolation work.
- Daemon mode. `keyhog daemon start | stop | status` runs a long-lived scanner over a Unix socket (default `$XDG_RUNTIME_DIR/keyhog.sock`, falls back to `~/.cache/keyhog/server.sock`; socket is `chmod 0600`). `keyhog scan --daemon` (or auto-detected when the socket exists) routes a stdin scan / single-file scan through the daemon instead of paying the ~3 s `CompiledScanner::compile` cold start. Measured 105× speedup (7 ms via daemon vs 740 ms in-process) on a real GitHub PAT, same detector + hash + offset in both paths. `--no-daemon` forces the in-process path. `--verify`, `--baseline`, directory walks, git-staged scans, and archive decoding stay in-process by design (the daemon doesn’t replicate that pipeline).
- `.keyhogignore` gitignore-style shorthand. Bare path globs (`*.log`, `node_modules/`, `vendor/**/*.json`) and bare 64-char hex hashes are now accepted alongside the explicit `path:` / `hash:` / `detector:` prefixes. Lets users drop a copied `.gitignore` in place and have it work.
- `--max-file-size` skip summary. Files dropped by the size cap now emit a per-file WARN AND an end-of-scan summary line (“N file(s) skipped: exceeded --max-file-size”). Walker’s silent filter was the only behavior before - a user looking at a smaller-than-expected scan had no signal about which files were dropped.
- Live progress ticker. Long scans paint a self-overwriting `scanning N/M chunks · K findings · t.t s` line on stderr every 250 ms; suppressed under `--stream` or when stderr isn’t a TTY.
- 25 companion-required detector contracts at `crates/scanner/tests/contracts/companion/`. Per-detector TOMLs encode the three-shape contract (`positive_with_companion`, `positive_primary_only` with `must_not_verify`, `negative_companion_lookalike`) for AWS, Twilio (api-key / auth-token / IoT), Algolia, Razorpay, Amplitude, AppDynamics, Avalara, Backblaze, Belvo, Bitbucket, Booking, Akoya, 4everland, Lark, Linear, Linode, Plaid, Reddit, RingCentral, SumoLogic, Trulioo, Vanta. Runner test at `companion_contracts_runner.rs` enforces all three shapes per contract.
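The `.keyhogignore` shorthand described in the list above can be sketched as a small line classifier. This is a minimal illustration, not keyhog’s parser: the `IgnoreEntry` type, the `parse_ignore_line` name, the `#` comment handling, and the lower-casing of hashes are all assumptions made for the example.

```rust
/// One parsed `.keyhogignore` entry (type and variant names are hypothetical).
#[derive(Debug, PartialEq)]
enum IgnoreEntry {
    Path(String),
    Hash(String),
    Detector(String),
}

/// Classify a single line. Returns `None` for blank lines and `#` comments
/// (comment handling is an assumption borrowed from gitignore).
fn parse_ignore_line(line: &str) -> Option<IgnoreEntry> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Explicit prefixes take priority over the shorthand forms.
    if let Some(rest) = line.strip_prefix("path:") {
        return Some(IgnoreEntry::Path(rest.trim().to_string()));
    }
    if let Some(rest) = line.strip_prefix("hash:") {
        return Some(IgnoreEntry::Hash(rest.trim().to_string()));
    }
    if let Some(rest) = line.strip_prefix("detector:") {
        return Some(IgnoreEntry::Detector(rest.trim().to_string()));
    }
    // A bare 64-char hex string is treated as a finding hash.
    if line.len() == 64 && line.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Some(IgnoreEntry::Hash(line.to_ascii_lowercase()));
    }
    // Anything else is a gitignore-style path glob.
    Some(IgnoreEntry::Path(line.to_string()))
}
```

One consequence of this ordering: a file whose name happens to be exactly 64 hex characters would be read as a hash, so the explicit `path:` prefix stays useful for that corner case.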
Fixed
- `contracts_runner` was flaky across CI vs local. The 341-fixture loop reused a single `CompiledScanner` and never called `clear_fragment_cache()` between scans, so the cross-file reassembly cache accumulated. CI’s filesystem-iteration order put braintree’s `sandbox_…` positive ahead of blur-api-key’s evasion and the sandbox credential surfaced as the only finding on `"blur key = \"Kp4Q…\""` - a non-deterministic failure invisible locally. Fix: clear the cache before every scan in `contracts_runner.rs` (5 sites) and `companion_contracts_runner.rs` (3 sites) per the documented test-isolation API in `engine/mod.rs:747-760`.
- `blur-api-key` regex required uppercase `KEY` while the contract evasion uses lowercase `key`. Prepended `(?i)` and lower-cased the literals; the contract evasion now hits the intended case-variant path. Tests assert truth, not shape - weakening the test would have masked the engine gap.
- Daemon-mode `--dogfood` was inert. Engine-side telemetry (`record_example_suppression` calls from `pipeline.rs::should_suppress_known_example_credential_*`) fired inside the daemon process - the client never saw any of it, so `keyhog scan --dogfood demo-secret.env` against a daemon silently dropped every suppression event and the reporter counter stayed at 0. Wire protocol bumped 1 → 2: `Response::ScanResults` now carries `engine_example_suppressions: u64` and `dogfood_events: Vec<DogfoodEvent>` (both `#[serde(default)]`, so a v2 client tolerates a v1 daemon). Daemon drains its per-scan telemetry after each `scanner.scan(...)` and resets; client merges the values into its own `OnceLock<Telemetry>` via two new public helpers (`add_example_suppressions(n)`, `append_events(iter)`). Verified locally: `--no-daemon` AND a fresh daemon both emit “No real secrets - but 6 example/test keys suppressed. Pass --dogfood to see them.”
- `demo-secret.env` summary regressed to the clean-repo message. The v0.5.7 fix wired `TextReporter` to read the suppression count, but the orchestrator’s `test_fixture_suppressions.suppresses()` branch ran before any telemetry write - `AKIAIOSFODNN7EXAMPLE` matched the bundled substring suppression list and returned `false` without incrementing the counter, so the reporter still saw 0 and printed “Your code is clean.” Now bumps `record_example_suppression(..., "test_fixture_suppression")` before returning. Same patch in the daemon-side `finalize_for_report` filter. Locked by `e2e_binary::demo_secret_aws_example_summary_distinguishes_suppression_from_clean`.
- Mega-scan allocated ~20 GB RSS on tiny inputs. Every shard’s static input/state buffers were sized for `MEGASCAN_INPUT_LEN=256 MiB`. Forcing `--backend mega-scan` on a 19-byte file uploaded ~570 × 256 MiB ≈ 20 GB of GPU memory and burned ~20 s before returning. Small-buffer guard at the entry of `scan_coalesced_megascan` now routes batches under 64 KiB through the literal-set GPU path. Same recall (same AC literal prefix anchors), orders of magnitude lower setup cost. Confirmed 20.77 s / 19.7 GB → 0.34 s / 399 MB on the kimi reproducer.
- GPU fallback regex-NFA dispatch silently dropped to CPU. The fallback `RulePipeline::scan` was passed `max_matches_per_dispatch=1_000_000` which trips vyre’s hard-coded `max_hits=10_000` static buffer declaration. Capping the dispatch at `NFA_HITS_PER_DISPATCH=10_000` keeps the GPU path live; the always-active fallback regex set is small enough that 10 K matches per dispatch is well above what we’d ever see.
- `env::args()` panicked on non-UTF-8 args. Linux allows raw-byte paths; `std::env::args()` calls `.unwrap()` on each Result which aborts with SIGABRT. Switched the version-flag detection in `main.rs` to `args_os()` + lossy compare.
- Non-UTF-8 paths reported “No such file or directory” even when the file existed. New pre-flight at the CLI boundary refuses non-UTF-8 paths with a clear message (“Rename the file or scan its parent directory”) instead of confusing the user with a missing-file rabbit hole.
- Nonexistent / unreadable input paths exited 0 with a WARN and “No secrets found, your code is clean.” Per the documented exit-code contract these are runtime errors. CLI now stat’s the input pre-walk; missing path → exit 2 with “path does not exist”, unreadable file → exit 2 with “cannot read … (fix `chmod +r …`)”.
- `--backend invalid` silently ignored and the scan ran with the default. clap now validates against the PossibleValues set `{gpu, mega-scan, megascan, simd, cpu, auto}` and exits 2 with a clear error.
- `.keyhogignore` `detector:` entries were dead. The parser populated `ignored_detectors` but the orchestrator’s per-finding filter never read it. Now applied alongside `is_path_ignored` / `is_raw_hash_ignored`.
- RefCell double-borrow panic in `fallback.rs`. Per-pool thread-local borrows now `try_borrow_mut` + fresh-alloc fallback at three sites (`ACTIVE_PATTERNS_POOL`, `ACTIVE_INDICES_POOL`, `TRIGGER_POOL`). Was a hard P0: the rayon worker re-entry caught itself on the second borrow and aborted mid-scan.
- FP storms killed: lastpass-dev-creds firing on random `id=<digits>` in /var/log archives (87% FP rate per kimi); GitHub PAT placeholder `ghp_xxxxxxxx…` flagged at 0.80; xoxb tokens with ascending-digit runs flagged. Tightened lastpass-dev-creds to require `lastpass` context within 40 chars; extended `looks_like_prefixed_masked_sequence` to suppress x/X-dominance, all-same-char, and ascending-digit-run ≥ 13.
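The `try_borrow_mut` + fresh-alloc pattern from the RefCell entry above can be sketched with the standard library alone. `SCRATCH_POOL` and `with_scratch` are hypothetical names standing in for the real pools; the point is only that a re-entrant call on the same thread gets a new buffer instead of a double-borrow panic.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for a per-thread scratch pool such as TRIGGER_POOL.
    static SCRATCH_POOL: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

/// Run `f` with a scratch buffer. If the pooled buffer is already borrowed
/// (a re-entrant call on the same thread), fall back to a fresh allocation
/// instead of panicking on the second borrow.
fn with_scratch<R>(f: impl FnOnce(&mut Vec<u64>) -> R) -> R {
    SCRATCH_POOL.with(|cell| match cell.try_borrow_mut() {
        Ok(mut pooled) => {
            pooled.clear(); // reuse the allocation, drop stale contents
            f(&mut *pooled)
        }
        Err(_) => {
            let mut fresh = Vec::new(); // slower path, but never panics
            f(&mut fresh)
        }
    })
}
```

The fallback path pays one allocation, which is acceptable because re-entry is the rare case; the common path still reuses the pooled buffer.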
Improved
- CUDA driver is opt-in. The `cuda` feature was on by default, which made `cargo build` fail on any host without `libcuda.so` / `libnvrtc.so` / `libcudart.so` - including macOS, most CI runners, and any Linux box without an NVIDIA driver stack. The default scanner build now uses `wgpu` (Vulkan on Linux, Metal on macOS) for GPU dispatch. CUDA users opt in with `--features cuda` when they want the CUDA backend specifically. Drops the link-time CUDA requirement from every default build.
- `scripts/publish.sh` reads the version from `Cargo.toml`. Renamed from `publish-0.5.6.sh` (which would silently emit “All v0.5.6 crates published” even when publishing v0.5.7). The new script `awk`s `[workspace.package].version` and uses that everywhere - no per-release rename or message edit.
- LayeredPipelineCache short-circuits compile on warm hits. The prior `rule_pipeline_cached` always called `build_rule_pipeline` upfront to keep typed-error semantics for vyre’s infallible-closure `cached_load_or_compile`, which made the on-disk cache pointless. Now uses vyre’s `engine_cache_path` + manual load/save so a warm hit returns the deserialised `RulePipeline` without paying the compile.
- `PreparedChunk::line_offsets()` memoised via `OnceLock`. `compute_line_offsets` used to walk the preprocessed text twice per chunk (once for the triggered path, once for the pattern-hits path); the second caller now hits the memoised Vec.
- Mega-scan compile-failure WARN demoted to debug. Falling back to the literal-set GPU dispatch when vyre’s byte-NFA frontend can’t represent every pattern (e.g. pattern 990 in the bundled detector corpus uses lookaround) is the designed degradation - the user can’t fix it, and one WARN per `--backend mega-scan` invocation creates noise without signal.
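The `OnceLock` memoisation of line offsets mentioned above can be illustrated as follows. The `Chunk` type here is a simplified, hypothetical stand-in for `PreparedChunk`, and `line_of` is an added helper that only shows why two callers would want the same table.

```rust
use std::sync::OnceLock;

/// Hypothetical, simplified chunk type; the real `PreparedChunk` carries more.
struct Chunk {
    text: String,
    line_offsets: OnceLock<Vec<usize>>,
}

impl Chunk {
    fn new(text: impl Into<String>) -> Self {
        Self { text: text.into(), line_offsets: OnceLock::new() }
    }

    /// Byte offset of the start of every line. Computed on the first call,
    /// then served from the `OnceLock` on every later call.
    fn line_offsets(&self) -> &[usize] {
        self.line_offsets.get_or_init(|| {
            let mut offsets = vec![0];
            offsets.extend(
                self.text
                    .bytes()
                    .enumerate()
                    .filter(|&(_, b)| b == b'\n')
                    .map(|(i, _)| i + 1),
            );
            offsets
        })
    }

    /// 1-based line number of the line containing `byte_offset`.
    fn line_of(&self, byte_offset: usize) -> usize {
        self.line_offsets().partition_point(|&start| start <= byte_offset)
    }
}
```

Because `get_or_init` takes `&self`, both scan paths can share an immutable chunk and whichever runs first pays for the walk.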
Differential parity
.internal/bench/differential/compare.py against gitleaks 8.30.0
and trufflehog 3.95.3 on the 64 MiB big_with_secrets corpus:
gate green. Every secret that both competitors independently
hash-confirm is also surfaced by keyhog, except sk_live_4eC39…,
which is documented as a public Stripe docs example (suppressed by
test_fixture_suppressions::bundled() and listed in
baseline.toml).
v0.5.7 - 2026-05-17
Fixed
- The ‘No secrets found. Your code is clean.’ message lied when every match was suppressed as an EXAMPLE/test key. The 0.5.6 bump wired example-suppression telemetry into the orchestrator, but the user-facing summary is owned by `TextReporter::finish()` in `keyhog-core`, not the orchestrator - so the misleading banner still printed. `TextReporter` now takes the suppression count via `set_example_suppressions(n)` and prints “No real secrets - but N example/test key(s) suppressed. Pass --dogfood to see them.” instead. Verified end-to-end against `demo-secret.env`. Regression tests pin all three states.
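The three summary states can be pictured as a single match on the finding and suppression counts. This is a sketch under assumptions: `summary_line` is a hypothetical free function rather than the `TextReporter` method, and the wording for the "findings present" state is a placeholder because the entry above does not quote it.

```rust
/// Pick the end-of-scan summary line. Only the two zero-finding strings come
/// from the changelog; the wording for the "findings present" state is a
/// placeholder.
fn summary_line(findings: usize, example_suppressions: u64) -> String {
    match (findings, example_suppressions) {
        // Genuinely clean: nothing matched, nothing suppressed.
        (0, 0) => "No secrets found. Your code is clean.".to_string(),
        // Matches existed but every one was an example/test key.
        (0, n) => format!(
            "No real secrets - but {n} example/test key(s) suppressed. Pass --dogfood to see them."
        ),
        // Real findings: placeholder wording.
        (f, _) => format!("{f} finding(s)"),
    }
}
```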
v0.5.6 - 2026-05-17
Added - dogfooding-driven UX
- `--dogfood` - opt-in JSON trace on stderr after the scan. Each example/test/placeholder credential that was matched and then suppressed gets a redacted-prefix event with the algorithmic reason (`contains_EXAMPLE_token`, `algorithmic_placeholder`). Closes the “did the scanner miss this, or silence it?” question without a debug rebuild. Full credentials are never emitted - `--dogfood` is a decision tracer, not a credential exfil channel.
- Honest scan summary when only example keys were found. Previously, scanning `demo-secret.env` (which holds `AKIAIOSFODNN7EXAMPLE`) printed “No secrets found. Your code is clean.” - identical to a genuinely clean repo. Now the summary distinguishes:
  - 0 findings, 0 suppressed → “0 secrets in 0.12s. You are secure!”
  - 0 findings, N suppressed → “0 real secrets, N example/test key(s) suppressed (pass --dogfood to see them).”
Internal
- New `keyhog_scanner::telemetry` module: per-scan atomic counters + optional event log. Engines call `record_example_suppression(...)` from the existing `should_suppress_known_example_credential_*` paths; the orchestrator drains events at the end of `run()`. Zero new state threaded through engine boundaries - single `OnceLock` process-local container with a `reset()` for tests.
- Two regression tests pinning the demo-secret.env case + the dogfood redaction contract. Telemetry-touching tests serialise behind a module-local `Mutex` so `cargo test`’s parallel runner doesn’t let them step on each other.
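A process-local telemetry container of the kind described above might look roughly like this. It is a simplified sketch: the struct layout, the single-argument `record_example_suppression`, and the `drain` helper are assumptions, and events are plain reason strings rather than redacted-prefix records.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, OnceLock};

/// Simplified stand-in for the telemetry container; names are illustrative.
#[derive(Default)]
struct Telemetry {
    example_suppressions: AtomicU64,
    events: Mutex<Vec<String>>,
}

/// Lazily created, process-wide instance. Nothing is threaded through callers.
fn telemetry() -> &'static Telemetry {
    static CELL: OnceLock<Telemetry> = OnceLock::new();
    CELL.get_or_init(Telemetry::default)
}

/// Called from a suppression path; `reason` is the algorithmic reason tag.
fn record_example_suppression(reason: &str) {
    let t = telemetry();
    t.example_suppressions.fetch_add(1, Ordering::Relaxed);
    t.events.lock().unwrap().push(reason.to_string());
}

/// Take the per-scan totals and leave the container empty for the next scan.
fn drain() -> (u64, Vec<String>) {
    let t = telemetry();
    let count = t.example_suppressions.swap(0, Ordering::Relaxed);
    let events = std::mem::take(&mut *t.events.lock().unwrap());
    (count, events)
}
```

Global state like this is also why the tests need to serialise: two tests recording into the same container concurrently would see each other’s counts.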
v0.5.5 - 2026-05-09
GPU foundations + vyre composition pass. The session wires keyhog deeper into vyre as a primitive consumer and contributes new general-purpose capability back to vyre.
Tier-aware GPU routing + 2 MiB threshold on RTX 40/50-class GPUs.
select_backend now classifies the detected adapter into High /
Mid / Low tiers and consults per-tier crossover thresholds:
| Tier | Adapter examples | min_bytes | solo cap |
|---|---|---|---|
| High | RTX 40/50, A100/H100, M-Max/Ultra, RX 7900 | 2 MiB | 16 MiB |
| Mid | RTX 20/30, GTX 16, Arc, M-Pro/base, RX 6/7 | 16 MiB | 64 MiB |
| Low | iGPU, older discretes, unknown | 64 MiB | 256 MiB |
Pattern-count breakeven is also tier-aware (100 / 500 / 2000).
keyhog backend reports the active tier and effective thresholds
for the live adapter. Backwards compatible: unknown adapters
classify as Low and keep the legacy thresholds.
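The tier table above can be written down as a lookup. The threshold values come from the table and the 100 / 500 / 2000 breakevens; the type names and, in particular, the way `prefers_gpu` combines the three numbers are assumptions for illustration, not the actual `select_backend` logic.

```rust
const MIB: u64 = 1024 * 1024;

#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuTier {
    High,
    Mid,
    Low,
}

#[derive(Debug, PartialEq)]
struct Thresholds {
    min_bytes: u64,
    solo_cap: u64,
    pattern_breakeven: usize,
}

/// Per-tier crossover thresholds, copied from the table.
fn thresholds(tier: GpuTier) -> Thresholds {
    let (min_mib, solo_mib, pattern_breakeven) = match tier {
        GpuTier::High => (2, 16, 100),
        GpuTier::Mid => (16, 64, 500),
        GpuTier::Low => (64, 256, 2000),
    };
    Thresholds { min_bytes: min_mib * MIB, solo_cap: solo_mib * MIB, pattern_breakeven }
}

/// ASSUMPTION: GPU wins either when the batch alone is large enough (solo cap)
/// or when it clears the minimum size and the pattern count clears the
/// breakeven. The real routing rule may combine these differently.
fn prefers_gpu(tier: GpuTier, batch_bytes: u64, patterns: usize) -> bool {
    let t = thresholds(tier);
    batch_bytes >= t.solo_cap || (batch_bytes >= t.min_bytes && patterns >= t.pattern_breakeven)
}
```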
GPU dispatch sharding + correctness fix. scan_coalesced_gpu
now slices the coalesced buffer at 65535 * 32 = 2,097,120 bytes
per dispatch (the wgpu workgroup-per-dimension cap × vyre’s
workgroup_size_x = 32) and re-bases shard-local match offsets
into the global buffer’s coordinate space. Eliminated the silent
dispatch group size > 65535 error that the prior single-dispatch
path hit on every 100 MiB+ batch. Recall on the realistic
benchmark fixture now matches CPU/SIMD within rounding (303,554
vs 302,168 vs 304,128) - earlier 121× speedup numbers were
lying because the dispatch errored mid-batch and only ~1% of
true hits came back.
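The sharding and re-basing step can be sketched on the CPU. `scan_shard` is a hypothetical closure standing in for one GPU dispatch; it returns offsets relative to the shard it was handed, and the loop maps them back into the coalesced buffer’s coordinates. Matches that straddle a shard boundary are out of scope for this sketch.

```rust
const MAX_WORKGROUPS_PER_DIM: usize = 65_535;
const WORKGROUP_SIZE_X: usize = 32;
/// 65535 * 32 = 2,097,120 bytes per dispatch.
const SHARD_BYTES: usize = MAX_WORKGROUPS_PER_DIM * WORKGROUP_SIZE_X;

/// Scan `buffer` shard by shard. `scan_shard` stands in for one dispatch and
/// returns match offsets relative to the start of the shard it was given.
fn scan_sharded(buffer: &[u8], scan_shard: impl Fn(&[u8]) -> Vec<usize>) -> Vec<usize> {
    let mut global = Vec::new();
    for (i, shard) in buffer.chunks(SHARD_BYTES).enumerate() {
        let base = i * SHARD_BYTES; // global offset of this shard's first byte
        global.extend(scan_shard(shard).into_iter().map(|local| base + local));
    }
    global
}
```

Forgetting the `base +` re-basing would report every later-shard hit at an offset inside the first 2 MiB, which is the kind of silent corruption that only shows up as wrong line numbers downstream.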
Vyre intern::perfect_hash wired for static-string interning.
CompiledScanner builds a CHD perfect hash from every detector’s
(id, name, service) plus the seed source-type literals at
construction time. ScanState::intern_metadata consults this
frozen interner first; only dynamic strings (file paths, commit
SHAs, author names, dates) hit the per-scan HashSet<Arc<str>>
fallback. Per-scan allocation count drops by ~100k on a typical
1000-chunk run. 6 unit tests + 282 scanner tests still green.
Vyre megakernel scaffolding (gated behind KEYHOG_USE_MEGAKERNEL).
engine/megakernel_dispatch.rs ships a working DFA-per-literal
compile + BatchDispatcher init + dispatch loop that hands back
the same per-chunk per-pattern trigger bitmask the literal-set
GPU path produces. Routed in scan_coalesced_megakernel behind
the env opt-in. Defaults OFF: vyre’s BatchDispatcher is
optimised for “many files × few rules” but keyhog’s corpus is
“few files × 6000+ rules” - modelling each literal as its own
BatchRuleProgram allocates chunks × rules ≈ 600,000 work
items per dispatch, which keeps the persistent kernel sleeping
in S-state on RTX 5090. Real megakernel win needs vyre-side
multi-pattern hit reporting (one DFA covering many literals,
HitRecord gains a per-pattern field) - wiring then collapses
to a one-line swap.
Cross-platform compile fix in vendored vyre-runtime: GpuStream<'a>
now carries PhantomData<&'a ()> on non-Linux so the lifetime
parameter isn’t flagged unused when uring is cfg’d out.
Windows / macOS builds now pull vyre-runtime cleanly.
Vyre rule engine wired for declarative .keyhogignore.toml.
Upstream vyre additions (general-purpose, lives in vyre-libs):
- `vyre_libs::rule::cpu_eval` - pure-CPU evaluator for `RuleCondition` / `RuleFormula` trees. Mirror of the GPU lowering. Useful for any consumer that wants per-record rule evaluation without dispatching a backend program. 11 unit tests.
- `vyre_libs::rule::ast::RuleCondition::FieldInSet` - new variant for “context field’s value is in this set”. Distinct from `SetMembership` (which compares a static value, not a field lookup). Required for expressing “detector_id is one of …” without resorting to regex alternation. Builder lowering errors with an actionable Fix: message - only the CPU evaluator can resolve field lookups today.
- vyre `smallvec` workspace pin bumped 1.14.0 → 1.15.1 so consumers carrying gix (which requires ^1.15.1) can share the type - keyhog needed this to put `SmallVec<[Arc<str>; 4]>` on the wire between core and vyre.
Keyhog consumes via new crates/core/src/rule_filter.rs. Schema
documented in docs/keyhogignore-toml.md. [[suppress]] tables
compose AND of named predicates (detector / service / severity /
severity_lte / path_eq / path_contains / path_starts_with /
path_ends_with / path_regex / credential_hash). Multiple
[[suppress]] tables compose with OR. Empty entry rejected at
parse to prevent accidental suppress-everything. Unknown fields
rejected via serde deny_unknown_fields. Wired into
orchestrator.rs::run after finalize() returns
VerifiedFindings - predicates need the resolved fields that
dedup_cross_detector populates. Malformed
.keyhogignore.toml is non-fatal: warn + load zero rules; legacy
.keyhogignore still applies. 11 keyhog rule_filter tests pass.
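The composition rules (AND inside one `[[suppress]]` table, OR across tables, empty tables rejected) can be sketched without the vyre rule engine. Only three of the listed predicates are modelled, and the struct and function names are hypothetical.

```rust
/// Minimal finding view; field names are illustrative.
struct Finding<'a> {
    detector: &'a str,
    path: &'a str,
}

/// A subset of the predicates listed above. `None` means "not specified".
#[derive(Default)]
struct SuppressRule {
    detector: Option<String>,
    path_starts_with: Option<String>,
    path_ends_with: Option<String>,
}

impl SuppressRule {
    /// An empty table would match everything, so it is rejected up front.
    fn validate(&self) -> Result<(), String> {
        if self.detector.is_none()
            && self.path_starts_with.is_none()
            && self.path_ends_with.is_none()
        {
            return Err("empty [[suppress]] entry".to_string());
        }
        Ok(())
    }

    /// AND of every predicate that is present; absent predicates pass.
    fn matches(&self, f: &Finding) -> bool {
        self.detector.as_deref().map_or(true, |d| d == f.detector)
            && self.path_starts_with.as_deref().map_or(true, |p| f.path.starts_with(p))
            && self.path_ends_with.as_deref().map_or(true, |p| f.path.ends_with(p))
    }
}

/// OR across tables: one matching rule is enough to suppress.
fn is_suppressed(rules: &[SuppressRule], f: &Finding) -> bool {
    rules.iter().any(|r| r.matches(f))
}
```

The `validate` step matters because "absent predicates pass" makes an all-`None` rule vacuously true for every finding.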
Realistic benchmark fixture. The previous --benchmark corpus
used 36-char alphanumeric filler on every line, triggering the
entropy detector constantly so the benchmark was measuring
per-chunk extraction cost rather than the literal-prefilter
crossover it claims to measure. New fixture mirrors typical
TypeScript/Go/Rust source: short identifiers, natural-language
comments, short string literals. RTX 5090 against this fixture:
130 MiB/s (cpu-fallback) / 136 MiB/s (simd-regex) / 34 MiB/s
(gpu-zero-copy). The architectural fix for GPU loss on dense
corpora is megakernel fusion of the extraction pipeline (vyre
upstream feature, queued).
Vyre full 30-crate audit doc (docs/vyre-usage.md). Catalogues
every vyre crate (foundation, driver, driver-wgpu, driver-megakernel,
driver-spirv, libs, primitives, runtime, spec, intrinsics, reference,
cc, harness, macros) with the public surface of each. Lists every
vyre-libs and vyre-primitives module by name with what keyhog
could conceivably wire from each.
v0.5.4 - 2026-05-08
Roadmap-clearing pass plus the first crates.io publish for every workspace crate. The README’s “Roadmap” section drops four items and a long-standing ignored regression test goes green.
Cross-chunk window-boundary reassembly (roadmap #3). New
crates/scanner/src/engine/boundary.rs splices the tail of each
large-file scan window to the head of the next and rescans the seam,
catching secrets that physically straddle the 64 MiB scan-window
boundary. Wired into scan_coalesced after Phase 2 in both the SIMD
and no-SIMD paths. Bounded to 1 KiB per side (2 KiB per pair), so
cost is independent of chunk size: a 64 GiB file sliced into 1000
chunks pays ~2 MiB of total boundary work - negligible next to the
per-chunk regex pass. Six unit tests + the previously-#[ignore]-marked
test_window_boundary_detection integration test now pass;
the test itself was rewritten to use an AKIA-shaped secret (the
original XX_FAKE_* shape was unconditionally suppressed by the
placeholder filter, so the test would have stayed red even with
reassembly).
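The seam construction can be sketched as a pure helper: take up to 1 KiB from the end of one window and up to 1 KiB from the start of the next, and remember where the seam begins in file coordinates. `seams` and `SEAM_BYTES` are hypothetical names, and the sketch assumes contiguous, non-overlapping windows.

```rust
const SEAM_BYTES: usize = 1024; // bytes taken from each side of a boundary

/// Build the seam buffers for consecutive windows of one file. Each entry is
/// (global offset of the seam's first byte, tail of window i + head of window i+1).
/// Windows are assumed to be contiguous and non-overlapping in this sketch.
fn seams(windows: &[&[u8]]) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::new();
    let mut window_start = 0usize;
    for pair in windows.windows(2) {
        let (left, right) = (pair[0], pair[1]);
        let tail = &left[left.len().saturating_sub(SEAM_BYTES)..];
        let head = &right[..right.len().min(SEAM_BYTES)];
        let seam_start = window_start + left.len() - tail.len();
        let mut seam = Vec::with_capacity(tail.len() + head.len());
        seam.extend_from_slice(tail);
        seam.extend_from_slice(head);
        out.push((seam_start, seam));
        window_start += left.len();
    }
    out
}
```

Each seam is at most 2 KiB regardless of window size, which is where the "cost is independent of chunk size" property comes from. A rescan of the seam reports offsets relative to the seam, so adding `seam_start` recovers the position in the file.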
keyhog detectors --audit and keyhog detectors --fix
(roadmap #4). detectors --audit runs every detector through
keyhog_core::validate_detector, prints issues grouped by detector
ID, and exits with code 3 when any Error-severity issue surfaces -
drop it into CI to gate detector PRs. detectors --fix scans the
on-disk TOML corpus for the one validator finding that’s safe to
repair mechanically - single-brace template references ({shop})
inside [detector.verify*] blocks - and rewrites them to the
double-brace form ({{shop}}) the interpolator actually honours.
Rewrites are scoped to verify blocks only (regex quantifiers like
[A-Z]{4,6} in pattern blocks stay untouched), atomic-written via
NamedTempFile, and re-validated post-rewrite so a corrupted result
backs off rather than overwriting the original. --dry-run previews
without writing. The 888-detector embedded corpus shows zero errors
today (the v0.4.x detector cleanup wave already cleared them) - the
subcommand is the regression net for the next batch of contributions.
Seven unit tests cover the rewriter’s edge cases.
Streaming finding previews (roadmap #5). New --stream flag emits
a one-line redacted preview to stderr per finding as the scanner
produces it, instead of waiting for dedup + verification before
printing anything. Format is grep-friendly:
[stream] CRITICAL aws/aws-access-key src/foo.rs:42 AKIA...XYZ_a.
The full report (text/json/sarif/jsonl) still lands on stdout/--output
at the end - the stream is purely a UX hint that the scanner is
making progress on long-running runs (large monorepos, scan-system,
GitHub-org walks). Implemented inside the existing scanner thread via
io::LineWriter so per-line writes land atomically across rayon
workers.
--verify-rate + --verify-batch (roadmap #7). The per-service
token-bucket rate limiter (crates/verifier/src/rate_limit.rs) is now
hot-swappable via a new set_default_rps() (atomic-backed nanosecond
interval) so the CLI’s --verify-rate <RPS> flag can take effect
after the global limiter has lazily initialised. Default stays at
5 rps; existing per-service overrides via update_limit are
preserved. --verify-batch adds per-service serialisation
(max_concurrent_per_service = 1) on top of the rate cap - use it
for repos with hundreds of fixture findings where bursting an
upstream auth endpoint would get the scan IP throttled. Three new
unit tests cover the rps→nanos clamp behaviour and the atomic update
path.
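An atomic-backed interval of the kind described for `set_default_rps()` could look like the sketch below. The function bodies are illustrative rather than the code in `rate_limit.rs`, and the clamp range (rps of 0 treated as 1, interval never below 1 ns) is an assumption since the entry above only says a clamp exists.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const NANOS_PER_SEC: u64 = 1_000_000_000;

/// Minimum gap between two requests, in nanoseconds. Starts at the 5 rps default.
static INTERVAL_NANOS: AtomicU64 = AtomicU64::new(NANOS_PER_SEC / 5);

/// Convert requests-per-second to an interval. The clamp range (rps of 0
/// treated as 1, interval never below 1 ns) is an assumption of this sketch.
fn rps_to_interval_nanos(rps: u64) -> u64 {
    (NANOS_PER_SEC / rps.max(1)).max(1)
}

/// Swap the default rate. Safe to call after limiters already exist, because
/// they read the interval on every acquire instead of caching it.
fn set_default_rps(rps: u64) {
    INTERVAL_NANOS.store(rps_to_interval_nanos(rps), Ordering::Relaxed);
}

fn current_interval_nanos() -> u64 {
    INTERVAL_NANOS.load(Ordering::Relaxed)
}
```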
Robustness sweep.
- `entropy_1000_chars_under_1ms` was unconditionally failing under `cargo test` on debug builds (2.5 ms vs the 1 ms threshold). Marked `#[ignore]` matching the two sibling perf-threshold tests; rerun locally with `cargo test -- --ignored` against a release build.
- `crates/cli/src/scan_runtime.rs` was a 0-byte dead module with no references anywhere in the workspace. Deleted.
- Workspace `license` field downgraded from `MIT OR Apache-2.0` to `MIT` - the only license file shipped in the repo is the MIT one. Honesty over ecosystem convention.
- `cargo clippy --workspace --all-targets` now clean (was 4 warnings: unused-mut in `dedup.rs`, items-after-test-module in `orchestrator_config.rs`, an unnecessary `as_ref()` in the new streaming preview, and an explicit-counter loop in `extract_plain_matches` that’s intentional for deadline-cadence gating and now carries an explanatory `#[allow]`).
- `detectors/.keyhog-cache.json` (runtime parse cache) is now gitignored AND `keyhog-core/Cargo.toml` carries an explicit `exclude` so a stale cache file can’t sneak into the published tarball.
- `scripts/audit.sh` wraps `cargo audit` with the four accept-with-rationale `--ignore` flags so local audits exit clean the way CI does (cargo-audit 0.22 doesn’t auto-load `audit.toml`).
Crates.io publish setup. Workspace package metadata
(description/license/repo/homepage/docs/keywords/categories/readme)
audited end-to-end across all five crates; package contents verified
via cargo package --list for each crate before publish (no stray
fixtures, no .work-linux.bundle, no target tree). Path-dep version
pins on the four library crates bumped in lockstep with the
workspace version (=0.5.4 everywhere) - the = pin guarantees a
downstream cargo install keyhog@0.5.4 resolves to a self-consistent
set.
v0.5.3 - 2026-05-07
I/O perfection pass - five staged perf + correctness landings on the filesystem source path, plus one latent-bug fix surfaced by the new test coverage.
Stage A - content cache (perf + correctness). Merkle index schema
v2: each entry now carries (mtime_ns, size, BLAKE3) and the file
gets a top-level spec_hash derived from the canonical detector set.
metadata_unchanged(path, mtime, size) short-circuits the file read
entirely when stat metadata matches a stored entry - the dominant
cost on cold-cache disk for --incremental re-runs.
load_with_spec(path, expected_spec_hash) invalidates the cache the
moment any detector regex, group, or companion changes, fixing a
latent correctness bug where an added detector would silently miss
unchanged files forever.
Stage B - mmap big-file scan. Replaced the read+seek loop in
FilesystemSource’s >64 MiB path with a single mmap + zero-copy slice
into window_size-byte windows with window_overlap shared bytes
between neighbours. Drops the 64 MiB heap working buffer and the
per-window seek+re-read overlap round-trip; madvise(SEQUENTIAL)
drives kernel readahead. Falls back cleanly to the buffered loop
when mmap is refused (locked writer, exotic filesystem).
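The window arithmetic behind Stage B can be shown as a pure helper that yields byte ranges over a mapped file. `window_ranges` is a hypothetical name; the parameters mirror `window_size` and `window_overlap` from the text, and the mmap itself is left out so the sketch stays dependency-free.

```rust
/// Split `len` bytes into windows of at most `window_size` bytes where each
/// window shares `overlap` bytes with its predecessor. Returns (start, end)
/// byte ranges; `overlap` must be smaller than `window_size`.
fn window_ranges(len: usize, window_size: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(overlap < window_size, "overlap must be smaller than window_size");
    let stride = window_size - overlap;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + window_size).min(len);
        ranges.push((start, end));
        if end == len {
            break; // last window reached the end of the file
        }
        start += stride;
    }
    ranges
}
```

With a mapped file each range becomes a zero-copy `&map[start..end]` slice, so the overlap costs no extra reads.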
Stage C - I/O ↔ scan pipeline. scan_sources spawns the scanner
in a dedicated thread holding Arc<CompiledScanner>. The producer
(main thread) iterates sources and builds batches; the scanner pulls
completed batches off a sync_channel(1) and runs scan_coalesced.
While the scanner is busy on regex, the producer is busy on disk
I/O, so total wall time approaches max(read, scan) instead of
read + scan. Channel capacity 1 keeps memory bounded to one
in-flight batch.
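The producer/scanner split can be sketched with `std::sync::mpsc::sync_channel`. `pipelined_scan` is a hypothetical function, batches are plain byte vectors, and `scan` stands in for `scan_coalesced` by returning a per-batch count.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Overlap batch production with scanning. `scan` stands in for the regex
/// pass; here it returns a per-batch finding count.
fn pipelined_scan<I>(batches: I, scan: fn(&[u8]) -> usize) -> usize
where
    I: IntoIterator<Item = Vec<u8>>,
{
    // Capacity 1: at most one finished batch waits while another is scanned.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);

    let scanner = thread::spawn(move || {
        let mut total = 0;
        for batch in rx {
            total += scan(&batch);
        }
        total
    });

    for batch in batches {
        // `send` blocks while the slot is full, which is what bounds memory.
        if tx.send(batch).is_err() {
            break; // scanner thread is gone; stop producing
        }
    }
    drop(tx); // close the channel so the scanner loop ends
    scanner.join().expect("scanner thread panicked")
}
```

The blocking `send` is the back-pressure mechanism: a fast producer stalls instead of queueing batches, so wall time tends toward the slower of the two stages rather than their sum.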
Stage D - mmap compressed reads. ziftsieve only takes a
contiguous &[u8] so streaming decompression isn’t on the menu, but
mmap’ing the compressed file lets us hand it the whole input without
a corresponding heap allocation. A 1 GiB .zst previously manifested
as a 1 GiB Vec<u8> before decompression began. New FileBytes enum
(Mmap | Owned) with size-cap gating; falls back to fs::read
only on mmap refusal.
Stage E - per-platform mmap threshold. Lowered to 64 KiB on Unix
where mmap setup is sub-microsecond and avoids the page cache →
userland buffer copy. Held at 1 MiB on Windows where MapViewOfFile
carries section-object + security-token costs that buffered
ReadFile doesn’t pay.
Latent bug fixed alongside Stage D. gz and zst were in
SKIP_EXTENSIONS, so the extract_compressed_chunks dispatch arm in
the FilesystemSource iterator was actually unreachable - compressed
files were silently being skipped on every scan. Removed those
entries (the gz/zst handler now actually runs).
Tests. ~55 new tests covering: 13 merkle_index v2 unit, 12 window-slicing pure-helper unit, 4 FileBytes/mmap-or-bytes unit, 6 pipeline orchestrator unit (including a 6000-chunk recall floor that proves the threading doesn’t drop batches), 9 FilesystemSource integration covering the windowed path, merkle skip, and gz end-to-end. Existing 53 scanner lib + 31 sources read unit + 20 filesystem integration all still green on both Windows and Linux.
Code cleanup. Removed dead detector_to_patterns field + helper
from the scanner (unused since the v0.5.2 perf trim). Tightened the
Arc import gate in crates/sources/src/lib.rs so docker-only
builds no longer warn about unused imports.
v0.5.2 - 2026-05-06
Reconciliation pass against the parallel Legendary Hardening line
(v0.3.0 → v0.4.0 → v0.5.0) that lived only on the work-linux clone
and was never pushed. Both lines diverged at 013257e (CI fmt scope)
and independently arrived at near-identical scanner/sources state.
Reviewed every file the work-linux line touched; no salvageable code was missing from this branch:
- `SensitiveString` migration, `MADV_DONTDUMP` zero-leak buffers, proximity-aware multiline reassembly, hardened ratelimiter, AC prefilter for `has_secret_keyword_fast` - already present here, fmt-clean, with the no-default-features feature gates the v0.6.x pass added.
- The 6 secret-laden boundary-test fixtures (`test.txt`, `boundary_test.txt`, etc.) accidentally committed in work-linux’s v0.4.0-finalize commit are intentionally not brought in: they trip GitHub push-protection and the boundary test that needed them was rewritten to use a synthetic `XX_FAKE_*` shape in v0.6.1.
- `crates/sources/src/slack.rs:54` `data: T.into()` syntax bug that still exists on the work-linux line was already fixed here in v0.6.0.
Net new: version bump only. No code regressions, no losses.
vendor/vyre is untouched - separate project with its own versioning.
v0.6.1 - 2026-05-06
Perfection pass on top of v0.6.0.
Fixed
- `crates/sources/src/binary/{mod,sections}.rs`: 5 type errors (the `extract_printable_strings` wrapper claimed `Vec<String>` while the underlying call returned `Vec<SensitiveString>`). Any build with `--features binary` previously failed to compile.
- `aws-access-key.toml`: dropped `required = true` from the `secret_key` companion. A leaked AKIA on its own is still a reportable finding; verification correctly downgrades to “unverified” when no co-located secret is found instead of silently dropping the match.
- `crates/core/tests/unit/spec.rs`: the `no_detector_uses_singular_companion_table` test now mirrors `crates/core/build.rs`’s symlink fallback so it works on Windows checkouts where `crates/core/detectors` lands as a literal file containing the link target.
- `crates/scanner/tests/performance_regression.rs`: replaced the CRC32-invalid `ghp_ABCDEF…` synthetic with an AKIA-shape fixture so the test exercises the no-default-features build (where checksum validation fails closed).
- 3 adversarial tests gated behind the features they exercise (`ml`, `multiline`, `decode`); previously they ran under `--no-default-features` and asserted behavior that requires those features.
Hygiene
- `cargo clippy --workspace --no-default-features --all-targets` clean (zero warnings) under both `--no-default-features` and the default-minus-simd matrix.
- `cargo fmt --check` clean.
- 596/596 tests pass under both feature configurations.
v0.6.0 - 2026-05-06
Out-of-band callback verification + broad robustness/detector fixes.
Added
- OOB verification (`--verify-oob`): RSA-2048 + AES-256-CFB interactsh client (`oast.fun` by default; `--oob-server HOST` to self-host). Detector TOML gains an `[detector.verify.oob]` block with `protocol={dns,http,smtp,any}`, `policy={oob_and_http,oob_only,oob_optional}`, and `accept={dns,http,smtp,any}`. Probe payloads can interpolate `{{interactsh_url}}`, `{{interactsh_host}}`, and `{{interactsh_id}}` to embed a unique callback URL per probe; the session waits for a matching hit before declaring the credential live. Documented in `docs/OOB.md`.
- `keyhog_core::spec::validate` now audits companion-substitution capture groups, reserved companion names (`__keyhog_oob_*`), and that every `{{companion.X}}` / auth-field reference resolves to a declared companion.
Fixed
- `extract_grouped_matches` (scanner): zero-width regex hits no longer infinite-loop the matcher; capture-group walk reuses a single `CaptureLocations` and aligns to UTF-8 boundaries; out-of-range detector index now fails closed instead of panicking.
- Required companions (`required = true`) actually short-circuit: prior `unwrap_or_default()` swallowed the “missing required companion” signal and shipped the finding anyway.
- `OobSession::wait_for` race: registers the `Notified` waiter via `Notified::enable()` before checking observations, so notifications fired between the check and the await no longer get lost.
- 8 detector verify specs that referenced undeclared companions or used template strings in the auth-field slot would 401 every probe (Twilio IoT, Akoya, Razorpay, Braintree sandbox, etc.). Each now declares the companion it references.
- Look-behind regex assertions (`(?<=`, `(?<!`) are no longer misclassified as named capture groups by the spec validator.
- `crates/sources/src/slack.rs`: `data: T.into()` syntax error in `SlackResponse<T>` would have failed any build that exercised the slack feature.
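The look-behind misclassification comes down to prefix order: `(?<=` and `(?<!` share the `(?<` prefix with named groups, so they have to be tested first. A minimal sketch, with hypothetical names and no handling of escaped parentheses or character classes:

```rust
#[derive(Debug, PartialEq)]
enum GroupKind {
    NamedCapture,
    LookBehind,
    Other,
}

/// Classify the group that opens at `pattern[at..]`, where `pattern[at]` is `(`.
/// Escapes and character classes are ignored in this sketch.
fn classify_group(pattern: &str, at: usize) -> GroupKind {
    let rest = &pattern[at..];
    // Look-behind must be checked before the generic `(?<` named-group prefix.
    if rest.starts_with("(?<=") || rest.starts_with("(?<!") {
        GroupKind::LookBehind
    } else if rest.starts_with("(?<") || rest.starts_with("(?P<") {
        GroupKind::NamedCapture
    } else {
        GroupKind::Other
    }
}
```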
Performance
- Aho-Corasick prefilter for `has_secret_keyword_fast` and `has_generic_assignment_keyword` (single-pass).
- `extract_inner_literals` AST walker promotes inner literals into the prefilter alphabet (corpus coverage test pins ≥3 patterns promoted).
- `find_companion` splits into a capture-group-free fast path (`find_iter`) and a grouped path that reuses `CaptureLocations`.
- Active-fallback bitmap precomputed at scanner construction; per-chunk thread-local `ACTIVE_PATTERNS_POOL` avoids reallocation.
- Filesystem reader: two-sided `looks_binary` early exit, streaming UTF-16 decode, valid-UTF-8 fast path.
- Slack source fetches per-channel history concurrently (rayon, 8 threads).
Hardening
- `looks_binary` short-circuit verified against full-scan baseline across page-boundary cases.
- `open_file_safe` rejects symlinks on Windows (Unix already enforced).
- Self-suppression list rewritten with `concat!()` to keep example credentials out of the repo’s literal string table.
v0.3.0 - 2026-05-01
The “legendary” wave: 18 Tier-A perf wins + 12 Tier-B moat innovations from the 2026-04-26 deep audits, plus a perfection pass that hardened GPU/CPU auto-routing across every supported OS. Build is green, scanner test suite 229+/0, core 33+/0, hw_probe routing 11/0, doctests 38/0.
Hardware routing & GPU/CPU saturation (perfection pass)
- `KEYHOG_BACKEND={gpu,simd,cpu}` env var force-pins the scan backend at the highest routing priority, used by CI matrix builds and benchmarks to assert backend-specific code paths actually run (`ba0e3fc`).
- `KEYHOG_THREADS=N` env var sets the rayon pool size, with `--threads` taking absolute priority and physical-core count as the auto fallback (`3c4924c`).
- Per-OS wgpu adapter preference replaces `Backends::all()`: Windows → DX12 + Vulkan, macOS/iOS → Metal, Linux/BSD → Vulkan + GL - each platform gets its first-class native API (`ba0e3fc`).
- Public `hw_probe::thresholds` module exposes the routing crossovers (GPU_MIN_BYTES=64 MiB, GPU_PATTERN_BREAKEVEN=2000, GPU_BYTES_BREAKEVEN_SOLO=256 MiB) for benchmarks and the inspector subcommand to reference one source of truth (`ba0e3fc`).
- 11 routing unit tests pin every documented threshold + the env-override branch + the software-renderer skip. Tests serialize through a `Mutex` guard since they mutate process env (`ba0e3fc`, `3c4924c`).
- `keyhog backend` subcommand: dumps detected hardware, the active backend, the env override (if set), and a routing decision matrix at every documented threshold; `--probe-bytes` and `--patterns` for what-if simulation (`ba0e3fc`).
- GPU init now requests the adapter’s full limits (was capped at wgpu `Limits::default()`’s 128 MiB storage-buffer ceiling; an RTX 5090 had its batch size throttled to 0.4% of physical capacity) (`e182938`).
- GPU init rejects `device_type == Cpu` adapters at the wgpu layer too (catches future software fallbacks not in the llvmpipe/lavapipe name list) (`3c4924c`).
- Per-scan `tracing::info!` logs the selected backend; per-chunk `tracing::trace!` on `keyhog::routing` for full audit trails (`3c4924c`, `ba0e3fc`).
- Verifier gained `danger_allow_http` opt-in flag to support HTTP test mocks while keeping production HTTPS-only (`0da1f94`).
Performance - CPU saturation
- scan_chunks_with_backend_internal now uses rayon::par_iter on the non-GPU paths - was serial, pinned to a single core even on 32-core boxes (a693ba2).
- scan_coalesced parallelizes its #[cfg(not(feature = "simd"))] and Hyperscan-init-failure fallbacks; multi-core builds without Hyperscan now saturate cores (27caaf9).
- [profile.release] pinned: opt-level=3 + lto=fat + codegen-units=1 + panic=abort + strip - was using cargo defaults; the new profile yields ~10-20% throughput on hot paths via cross-crate inlining (3c4924c).
- [profile.release-fast] (thin LTO, 16 codegen-units) for sub-minute CI builds; [profile.bench] keeps line-tables for flamegraph attribution.
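The serial-to-parallel change can be illustrated without pulling in rayon. The sketch below swaps in std::thread::scope for rayon's par_iter so it stays dependency-free; scan_chunk is a hypothetical stand-in for the real per-chunk scanner, and the fixed split into equal slices is an assumption (rayon uses work stealing instead).

```rust
use std::thread;

/// Hypothetical per-chunk scanner: counts occurrences of one literal.
fn scan_chunk(chunk: &str) -> usize {
    chunk.matches("AKIA").count()
}

/// The old shape: one core walks every chunk in order.
fn scan_serial(chunks: &[&str]) -> usize {
    chunks.iter().map(|c| scan_chunk(c)).sum()
}

/// The new shape: chunks are split across `workers` scoped threads and the
/// per-thread totals are summed once every worker has joined.
fn scan_parallel(chunks: &[&str], workers: usize) -> usize {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because slice::chunks panics on 0.
    let per_worker = ((chunks.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = chunks
            .chunks(per_worker)
            .map(|part| scope.spawn(move || part.iter().map(|c| scan_chunk(c)).sum::<usize>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("scan worker panicked"))
            .sum::<usize>()
    })
}
```

Both functions return the same total for the same input; only the wall-clock time differs once there are enough chunks to keep every worker busy.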
Performance - Tier-A perf wins (~constant-factor allocations on the hot path)
- Cow-borrowed normalize_homoglyphs and prepare_chunk - ASCII fast path no longer clones (7e7cd55).
- post_process_matches dedup keys are Arc<str>, not String (7e7cd55).
- Thread-local trigger-bitmask pool - drops ~2.4M allocs on a 100k-file scan (7e7cd55).
- Phase-1 returns Option<Vec<u64>> so empty chunks never allocate (7e7cd55).
- BTreeMap dedup → indexmap::IndexMap for O(1) deterministic ordering (d3b6721).
- Streaming SARIF reporter - peak memory drops from O(N findings) to O(rules) (3a15fd0).
- Batched-streaming orchestrator - 4096 chunks / 256 MiB per batch caps peak memory on giant scans (a6c88b2).
- Sharded DashMap for verifier VerificationCache, RateLimiter, and in-flight map (no more global RwLock contention) (d3b6721).
- Concurrent rayon-parallel S3 / GitHub-org / Slack source backends (8–16 in-flight) (d3b6721).
- Shared Arc<Regex> compile cache via shared_regex() - same regex across detectors compiles once (a38e79c).
- Pre-built index_set once on Baseline::load via OnceLock (d3b6721).
- Bigram bloom prefilter (Layer 0.5) - gates chunks ≥64 bytes before Hyperscan (3a15fd0).
- Dropped io_uring single-op path (latency regression, kept the multi-op batch path) (d3b6721).
- Decode-bomb time budget - per-chunk wall-clock ceiling on decode_chunk (20d3ef8).
- Probabilistic gate filled in: distinct-bigram density via FNV-512 (20d3ef8).
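The first item in this list, the Cow-borrowed ASCII fast path, is a general Rust pattern and can be sketched in a few lines. The function name normalize_homoglyphs_sketch and the three-entry folding table are hypothetical; the real normalize_homoglyphs presumably covers far more code points.

```rust
use std::borrow::Cow;

/// Hypothetical folding table: three Cyrillic look-alikes mapped to ASCII.
fn fold_homoglyph(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // Cyrillic small a
        '\u{0435}' => 'e', // Cyrillic small ie
        '\u{043E}' => 'o', // Cyrillic small o
        other => other,
    }
}

/// Pure-ASCII input cannot contain a homoglyph, so it is returned as a
/// borrow with no allocation. Only non-ASCII input pays for a new String.
fn normalize_homoglyphs_sketch(input: &str) -> Cow<'_, str> {
    if input.is_ascii() {
        return Cow::Borrowed(input);
    }
    Cow::Owned(input.chars().map(fold_homoglyph).collect())
}
```

Since most source files are plain ASCII, returning Cow::Borrowed on that path removes one String clone per chunk, which is the kind of constant-factor saving this section is about.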
Innovations - Tier-B moat features
- Bayesian Beta(α,β) confidence calibration - per-detector posterior updated from observed TP/FP, multiplier wired into the live scoring path, CLI surface (keyhog calibrate --tp/--fp/--show) (34deeb0, d5d447e).
- Incremental scan via persisted BLAKE3 Merkle index - unchanged files skip the scanner entirely on CI re-runs (57c4cc8).
- Cross-detector dedup at emit - one secret matched by N detectors collapses to one finding with N ranked service guesses (eab71b2).
- Diff-aware severity - git source pre-walks HEAD’s tree, tags chunks git/head vs git/history, and the latter’s findings drop one severity tier (410dc0e).
- JWT structural validation - header.payload decode with alg/typ/exp inspection and alg=none anomaly detection (43092b6).
- CWE-798 + OWASP A07:2021 SARIF taxa - compliance-grade reporting (5462625).
- SARIF v2.2 fixes[] with deletedRegion/insertedContent and env-var-name auto-fix suggestions (650e599).
- Allowlist governance metadata - ; reason="…" ; expires=YYYY-MM-DD ; approved_by="…" per entry, expired entries auto-drop (32ff3a8).
- keyhog explain <detector-id> - full spec dump, regex breakdown, and rotation-guide URLs for major providers (f56f97e).
- keyhog diff <before.json> <after.json> - NEW / RESOLVED / UNCHANGED set diff for CI regression detection (52d7242).
- keyhog watch <path> - daemon mode with notify-based file watcher, compile-once-scan-many on saves; sub-100ms re-scan (56c61d6).
- keyhog calibrate - α/β counter management with posterior-mean bar visualization (34deeb0).
- keyhog detectors --search <query> --verbose - case-insensitive filter against id/name/service/keywords; verbose dumps full spec (5951a14).
- keyhog completion <shell> - bash, zsh, fish, powershell, elvish (8ab105f).
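The Beta(α,β) calibration in the first item follows the standard conjugate update: each observed true positive adds 1 to α, each false positive adds 1 to β, and the posterior mean is α / (α + β). The sketch below assumes a uniform Beta(1,1) prior and assumes the posterior mean is used directly as the confidence multiplier; the changelog states neither, and the type and function names are hypothetical.

```rust
/// Hypothetical per-detector calibration state.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BetaPosterior {
    alpha: f64,
    beta: f64,
}

impl BetaPosterior {
    /// Assumed prior: Beta(1, 1), i.e. no opinion about the detector yet.
    fn uniform_prior() -> Self {
        Self { alpha: 1.0, beta: 1.0 }
    }

    /// Conjugate update: true positives raise alpha, false positives raise beta.
    fn observe(self, tp: u32, fp: u32) -> Self {
        Self {
            alpha: self.alpha + f64::from(tp),
            beta: self.beta + f64::from(fp),
        }
    }

    /// Posterior mean estimate of the detector's precision.
    fn mean(self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }
}

/// Assumed wiring: scale the raw confidence by the posterior mean.
fn calibrated_confidence(raw: f64, posterior: BetaPosterior) -> f64 {
    (raw * posterior.mean()).clamp(0.0, 1.0)
}
```

With 8 confirmed findings and 2 false alarms on top of the uniform prior, the posterior is Beta(9, 3) and its mean is 0.75, so a raw confidence of 0.8 would be scaled to about 0.6 under these assumptions.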
Adversarial coverage
- Reverse-string decoder for tokens stored backwards as evasion (c462e9c).
- Caesar / ROT-N decoder for ROT13’d configs (c462e9c).
- Hex _ separator stripping (firmware dumps, embedded configs use A1_B2_C3_…) (2980284).
- Comment-suffix disclaimer suppression - // not a real key, # fake credential, etc. (2980284).
- Cross-detector dedup also handles 2-fragment AWS reassembly with no-shared-prefix var names (3327b39).
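The reverse-string and ROT-N decoders are simple enough to sketch directly. The functions below are illustrative only: the names are hypothetical, and generating every rotation plus the reversal as candidates is an assumption about how the decoded forms might be fed back to the detectors.

```rust
/// Undo a token stored backwards.
fn reverse_token(s: &str) -> String {
    s.chars().rev().collect()
}

/// Rotate ASCII letters forward by `n` positions; everything else is kept.
/// ROT13 is the n = 13 case and is its own inverse.
fn rot_n(s: &str, n: u8) -> String {
    let n = n % 26;
    s.chars()
        .map(|c| match c {
            'a'..='z' => (((c as u8 - b'a' + n) % 26) + b'a') as char,
            'A'..='Z' => (((c as u8 - b'A' + n) % 26) + b'A') as char,
            _ => c,
        })
        .collect()
}

/// Candidate decodings to re-scan: the reversal plus all 25 non-trivial rotations.
fn candidate_decodings(s: &str) -> Vec<String> {
    let mut out = vec![reverse_token(s)];
    out.extend((1..26).map(|n| rot_n(s, n)));
    out
}
```

For example, the AWS access-key prefix AKIA appears as AIKA when reversed and as NXVN under ROT13; both forms decode back to AKIA among the candidates.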
Architecture
- GPU auto-routing - runtime probe selects GPU vs CPU based on adapter type, workload size, and pattern count; mandatory build-time presence (no more feature gate) (7feb723).
- Filesystem source: per-archive-entry uncompressed-size cap; ziftsieve gzip/zstd/lz4 4× decompressed-byte budget (5cc3906).
- Verifier hardening: SSRF DNS-rebinding defeated via tokio::net::lookup_host post-resolve check; HTTPS-only no-localhost-exception (7feb723).
- AWS SigV4 dates derived from SystemTime::now via Howard-Hinnant civil arithmetic (no chrono runtime cost) (7feb723).
- fragment_cache module relocated under multiline/ where every call site lives; re-exported at the crate root for back-compat (70e35a8).
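The SigV4 date item refers to Howard Hinnant's days-to-civil-date algorithm, which converts a day count since 1970-01-01 into year, month and day using integer arithmetic only. The sketch below applies it to produce the two timestamp strings SigV4 uses (YYYYMMDD and YYYYMMDDTHHMMSSZ). The function names are hypothetical and this is not KeyHog's code.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Days since 1970-01-01 -> (year, month, day) in the proleptic Gregorian
/// calendar, following Howard Hinnant's civil_from_days.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Returns (date stamp, full timestamp) for a Unix time in seconds.
fn sigv4_timestamps(unix_secs: u64) -> (String, String) {
    let (y, m, d) = civil_from_days((unix_secs / 86_400) as i64);
    let rem = unix_secs % 86_400;
    let date = format!("{:04}{:02}{:02}", y, m, d);
    let full = format!(
        "{}T{:02}{:02}{:02}Z",
        date,
        rem / 3_600,
        (rem % 3_600) / 60,
        rem % 60
    );
    (date, full)
}

/// Same, for the current wall-clock time.
fn sigv4_timestamps_now() -> (String, String) {
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    sigv4_timestamps(secs)
}
```

The arithmetic needs no time zone database or calendar tables, which is why it can replace a date-time crate for this one UTC formatting job.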
Tests
- Wired adversarial fixtures into cargo test (no more skipped corpus) (5cc3906).
- Aligned gitleaks_hash_* allowlist tests with the hardened is_hash_allowed API (no plaintext fallback) (b2b405d).
- Wrapped ?-using doctests in explicit fn main() -> Result so the E0277 wave is gone (19ce4f5).
- 229 scanner tests / 33 core unit tests / 38 doctests, 0 failed.
Detector corpus
- Brutal audit of all 896 detectors found schema decay; corrupted entries removed, broken logic flagged (e934144).
- Schema rename (kimi automated): aligned every detector to the post-audit field set (826d54f).
- Verifier auth wiring fixes for the corpus (826d54f).
- 859 valid detectors after the gate; ~30 still flagged for pure-character-class companions (tracked separately).
v0.2.1 - 2026-04-04
Maintenance release: production-readiness fixes, dependency updates, agent
sweeps. See git log v0.2.0..v0.2.1 for the commit list.
v0.2.0 - 2026-03-30
The fastest, most accurate secret scanner.
First “legendary bar” release. Highlights:
- Embedded 888-detector corpus (no separate detectors/ directory needed).
- Hyperscan SIMD regex with disk-cached compiled DB.
- Aho-Corasick literal prefilter feeding into the regex layer.
- ML-based confidence scoring (MoE classifier with per-detector calibration).
- Decode-through pipeline: base64, hex, URL, MIME, HTML entities, Z85, unicode/octal escapes, quoted-printable.
- Multiline secret reassembly across line-continuation patterns in a dozen languages.
- Sources: filesystem, git history, git diff, GitHub orgs, S3, Docker images, web URLs (JS/sourcemap/WASM), Slack (admin export).
- Verifier framework with TOML-defined live verification per detector.
- SARIF v2.1.0 + JSON + JSONL + plain-text reporters.
v0.1.0 - 2026-03-26
- First public release of the KeyHog workspace.
- Production-readiness cleanup for docs, examples, README guidance, and release metadata.
- Verified cargo check, cargo test, and cargo clippy --workspace -- -D warnings.