KeyHog

A secret scanner. Built in Rust. Made to be fast on big repos, careful with your time on small ones, and quiet about findings that aren’t actually credentials.

$ keyhog scan .
keyhog v0.5.37 │ 891 detectors │ 1647 patterns │ avx-512 + hyperscan + cuda

scanned 12,841 files in 1.4 s
3 findings · 0 verified live · 1041 example fixtures suppressed

What it does

Walks files - your working tree, your git history, a docker image, an S3 bucket, a list of URLs - and reports leaked credentials. Every finding has:

  • a detector that fired (stripe-secret-key, aws-access-key, …)
  • a location (file, line, offset, optionally commit hash and author)
  • an entropy score + confidence
  • an optional live verification result if you pass --verify

The list of detectors ships in TOML files under detectors/. There are 891 of them today, covering ~750 distinct services. Anyone can add or override them without touching Rust code.

What it doesn’t do

  • No telemetry. Findings stay local. The scanner never phones home.
  • No agent. A daemon mode exists for pre-commit / IDE-save fast-path scans on Unix, but it’s opt-in and stays on your machine.
  • No “AI-powered” detection. Every detector is a regex with a service-specific anchor and a real verification endpoint. The ML scorer that bumps confidence on ambiguous matches is a tiny on-device MoE; no network calls.

Why another scanner

Three things, in order of how much they matter:

  1. Precision. A scanner that emits one false positive per ten findings teaches developers to ignore it. KeyHog suppresses example credentials (the Stripe docs key, the AWS sample key, the RFC 7519 specimen JWT), vendored bundles (minified jQuery, node_modules), and CI workflow ${{ secrets.NAME }} references by default. The 22-repo dogfood corpus has 22 non-PEM findings, all true positives.

  2. Recall. The detector corpus is built service-by-service. For every detector, the test suite carries positive shapes (env-var, JSON, YAML, header, URL), negative shapes (placeholder, EXAMPLE marker), and adversarial evasions (split across lines, hex/base64-encoded, reversed, Caesar-shifted). If a shape isn’t in the suite, the detector isn’t shipped.

  3. Speed. Hyperscan SIMD prefilter, AVX-512 entropy gate, GPU literal scan for big workloads. A million-LOC monorepo scans in under three minutes on a modern laptop without warming any caches. Pre-commit incremental scans are sub-100 ms.

Get going

# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh

# Windows (PowerShell)
iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 -useb | iex

Then:

keyhog scan .

The Install page has package-manager, build-from-source, and offline-install paths. The Your first scan page walks through what the output means and where to go from there.

Where things live

License: MIT.

Install

The quickest paths first. Pick one - they all give you the same keyhog binary.

One-liner: Linux / macOS

curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh

Drops a binary in ~/.local/bin/keyhog. The installer detects your CPU, GPU, and existing install before downloading, and tells you the asset it picked and why.

The default is the WGPU + SIMD build everywhere: it already dispatches the same vyre AC / RulePipeline on your GPU via the Vulkan backend, with a smaller binary and no libcuda.so runtime dependency.

The dedicated keyhog-linux-x86_64-cuda build is only auto-selected on Linux when the host has the full CUDA toolkit installed - nvcc on PATH, $CUDA_HOME set, or /usr/local/cuda present. A driver-only NVIDIA host (libcuda.so loadable but no toolkit) stays on the WGPU build, since the native-CUDA dispatch saves only single-digit percent on typical repo scans and the binary footprint + runtime dependency are not worth it for the non-CUDA-developer case. Pass --variant=cuda (or set KEYHOG_VARIANT=cuda) to force the CUDA build anyway.

Apple Silicon hosts get an explicit “Metal GPU acceleration coming soon” note; until that lands, Apple Silicon runs SIMD on CPU plus WGPU on the integrated GPU.
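The selection rules above reduce to a short decision function. The sketch below models them in Python for illustration only: the real installer is a shell script, and the name pick_variant and its parameters are hypothetical. It encodes just the rules stated in this section and ignores the fallback for releases that have no CUDA asset.

```python
def pick_variant(os_name, nvcc_on_path, cuda_home_set, usr_local_cuda_exists,
                 override=None):
    """Return "cuda" or "default" following the rules described in the text.

    `override` models --variant / KEYHOG_VARIANT ("cuda", "cpu", or None).
    Every name here is hypothetical; this is not the installer's code.
    """
    # An explicit override always wins over auto-detection.
    if override == "cuda":
        return "cuda"
    if override == "cpu":
        return "default"

    # Auto-selection: CUDA only on Linux, and only with the full toolkit.
    toolkit_present = nvcc_on_path or cuda_home_set or usr_local_cuda_exists
    if os_name == "linux" and toolkit_present:
        return "cuda"

    # Driver-only NVIDIA hosts, macOS, and Windows stay on WGPU + SIMD.
    return "default"
```

A driver-only host corresponds to all three toolkit flags being false, which lands on the default build unless the override is set.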

curl ... | sh is fast but skips the wizard because stdin is a pipe. For variant selection, shell completions, and optional hook setup:

curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh \
    -o keyhog-install.sh
sh keyhog-install.sh

The interactive installer shows you:

  • The host it detected (OS, arch, GPU, libcuda state).
  • The binary it would install (with the GPU note).
  • Any existing keyhog install it found.
  • Whether ~/.local/bin is on your PATH.

Then it prompts (default in brackets):

  • Add ~/.local/bin to your shell PATH? [Y/n]
  • Install shell completions for bash / zsh / fish? [y/N]
  • Wire keyhog as a git pre-commit hook in this dir? [y/N]

Each prompt is opt-in. Nothing in your .bashrc / .zshrc / git hooks dir is touched without an explicit “y”. Claude Code / Cursor agent-hook integration is on the roadmap but not yet shipped; the prompt was removed in v0.5.34 once it became clear the underlying keyhog hook install --agent <name> flag wasn’t real yet.

One-liner: Windows

PowerShell 5+ (ships with Windows 10/11):

iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 -useb | iex

Drops the binary in %LOCALAPPDATA%\keyhog\bin\keyhog.exe. Detects your GPU (informational only: a dedicated CUDA-on-Windows variant is on the roadmap but not yet shipped, so today every Windows host gets the same WGPU + SIMD binary).

For the interactive flow:

iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 `
    -OutFile keyhog-install.ps1
.\keyhog-install.ps1

Heads up. The Unix daemon mode is unavailable on Windows (it relies on Unix-domain sockets). keyhog scan, keyhog detectors, keyhog watch, keyhog hook, etc. all work the same. The daemon subcommand and the --daemon flag emit an explicit “unix-only” error so nothing silently regresses.

Variants and overrides

The installer auto-detects, but you can override:

Env var / flag                                          Effect
KEYHOG_VARIANT=cuda (or --variant=cuda)                 Force the CUDA-accelerated Linux build (requires libcuda.so).
KEYHOG_VARIANT=cpu (or --variant=cpu)                   Force the default WGPU + SIMD build, skip GPU detection.
KEYHOG_VERSION=v0.5.37 (or --version=v0.5.37)           Pin a specific release tag (default: most recent release with assets attached).
KEYHOG_INSTALL=/usr/local/bin (or --install-dir=...)    Install into a different directory.
--yes / -y                                              Non-interactive: accept all defaults, no prompts.
--no-color                                              Disable ANSI colors (e.g. for log capture).

Runtime env vars (consumed by the keyhog binary itself)

Env var                                  Effect
KEYHOG_NO_GPU=1                          Force the CPU + SIMD path; skip every GPU init (saves ~250 ms of cold-start on hosts with no usable GPU).
KEYHOG_NO_GPU=0                          Force GPU init even when CI auto-detection would otherwise skip it. Useful on self-hosted GitHub / GitLab runners with a real GPU.
KEYHOG_REQUIRE_GPU=1                     Hard-fail (exit 2) instead of silently degrading when the GPU stack is unavailable. Pairs with the no-silent-fallback contract.
KEYHOG_BACKEND=gpu|mega-scan|simd|cpu    Force a specific scan backend regardless of hardware probe. Mostly for benches; production code should let auto-select route.

CI auto-detect. When CI=true is set (or any of GITHUB_ACTIONS, GITLAB_CI, CIRCLECI, TRAVIS, JENKINS_URL, TF_BUILD, BUILDKITE, DRONE, APPVEYOR, TEAMCITY_VERSION, CODEBUILD_BUILD_ID, BITBUCKET_BUILD_NUMBER, WERCKER, SEMAPHORE), keyhog skips the GPU probe entirely and goes straight to the SIMD + CPU path. The savings: ~250 ms of cold-start per keyhog invocation, plus no confusing “GPU MoE init failed” warning when the runner’s only GPU is llvmpipe. Override with KEYHOG_NO_GPU=0 on self-hosted GPU runners.
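The precedence between KEYHOG_NO_GPU and CI auto-detection can be sketched as follows. This is an illustrative Python model, not the binary's Rust code; the function name skip_gpu_probe is hypothetical, and treating a CI marker variable as "set" whenever it is present (whatever its value) is an assumption.

```python
# CI marker variables listed in the text above.
CI_MARKERS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "CIRCLECI", "TRAVIS", "JENKINS_URL",
    "TF_BUILD", "BUILDKITE", "DRONE", "APPVEYOR", "TEAMCITY_VERSION",
    "CODEBUILD_BUILD_ID", "BITBUCKET_BUILD_NUMBER", "WERCKER", "SEMAPHORE",
)

def skip_gpu_probe(env):
    """Return True when the GPU probe would be skipped for this environment.

    `env` is a dict of environment variables (e.g. dict(os.environ)).
    """
    no_gpu = env.get("KEYHOG_NO_GPU")
    if no_gpu == "1":          # explicit opt-out of every GPU init
        return True
    if no_gpu == "0":          # explicit opt-in, overrides CI detection
        return False
    # No explicit setting: skip the probe only when running under CI.
    # Assumption: a marker counts as soon as the variable exists.
    return env.get("CI") == "true" or any(name in env for name in CI_MARKERS)
```

The explicit variable is checked first, which is what lets a self-hosted GPU runner re-enable the probe with KEYHOG_NO_GPU=0.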

When a CUDA variant asset isn’t published for the resolved release tag yet, the installer logs the fallback and downloads the default WGPU + SIMD asset instead. You can rerun with --variant=cuda once a tag with the CUDA variant lands.

Repair, diagnose, uninstall

sh keyhog-install.sh --diagnose    # print host + binary state, change nothing
sh keyhog-install.sh --repair      # re-download the right variant for this host
sh keyhog-install.sh --uninstall   # remove the binary (leaves PATH entries alone)

--diagnose is the first thing to run if something looks off: it reports CPU arch, OS, GPU + libcuda state, the currently-installed binary (path + version), whether the install dir is on PATH, and the asset the installer would download for the latest release tag.

--repair re-downloads the asset matching your current host even if the existing binary still runs. Useful after a host upgrade adds a new GPU, or after CUDA userland gets installed and the WGPU build should be swapped for the CUDA build.

--uninstall only removes the binary itself. Shell PATH entries and completion files added by the post-install wizard are left in place: we don’t know which lines in your .bashrc / .zshrc are ours vs yours, and silently editing those files is exactly the kind of installer behavior we don’t want.

Direct binary download

If you don’t trust pipe-to-shell - fair - grab the binary by hand from the releases page.

Platform                  Asset name
Linux x86_64 (default)    keyhog-linux-x86_64
Linux x86_64 + CUDA       keyhog-linux-x86_64-cuda
macOS x86_64 (Intel)      keyhog-macos-x86_64
macOS aarch64 (Apple)     keyhog-macos-aarch64
Windows x86_64            keyhog-windows-x86_64.exe

chmod +x the binary and put it somewhere on your PATH.

Build from source

You’ll want this if you’re contributing or running a feature combination the prebuilt binaries don’t cover (e.g. Ghidra binary extraction).

git clone https://github.com/santhsecurity/keyhog
cd keyhog
cargo build --release -p keyhog
./target/release/keyhog --version

The default feature set requires Hyperscan / Vectorscan:

  • Debian / Ubuntu: sudo apt install libhyperscan-dev pkg-config
  • macOS: not available via Homebrew. Build with --no-default-features --features portable to skip Hyperscan and use the pure-Rust path.
  • Windows: build with --no-default-features --features portable.

For the CUDA backend, add the cuda feature on Linux:

cargo build --release -p keyhog --features cuda

This requires the CUDA toolkit at link time (NVCC + cudart + nvrtc) and libcuda.so at runtime. The release workflow provisions CUDA 12.6 on the GitHub-hosted ubuntu runner for the keyhog-linux-x86_64-cuda asset; for local source builds, install the matching toolkit from developer.nvidia.com/cuda-toolkit or your distro’s nvidia-cuda-toolkit package.

The portable feature is what the official Windows + macOS release binaries are built with: same scanner, no native dependency, ~5% slower on big inputs.

crates.io

Not yet. KeyHog vendors vyre-libs (the GPU literal-set scan crate) and isn’t published to crates.io until that dependency lands there. Track the crates.io publish issue for status.

Verify the install

keyhog --version
keyhog detectors | head     # smoke-test the embedded detector corpus
keyhog scan README.md       # scan a single file; exit 0 = clean

If keyhog --version reports the release you installed and keyhog detectors lists hundreds of detectors, you’re set. Move on to Your first scan.

You can also run the installer in diagnostic mode at any time to print a full status report:

sh keyhog-install.sh --diagnose

Your first scan

You have the binary on your PATH. Now:

keyhog scan .

That walks the current directory, hands every file through the scanner, and prints findings. The exit code carries the verdict:

Exit code    Meaning
0            Scan finished, no findings
1            Scan finished, findings present (unverified or verified-live)
2            Runtime error - bad config, panic, I/O failure

So a CI step that should fail the build when a credential leaks is just:

keyhog scan .

No grep, no jq, no exit-code arithmetic. Findings == exit 1 == build red.
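A plain shell step needs nothing more than the command itself. When the scan is launched from a script instead, the same three exit codes can be mapped to messages as in the sketch below. It is a minimal illustration: the helper names describe_exit and run_scan are hypothetical, and run_scan assumes the keyhog binary is on PATH.

```python
import subprocess

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "clean: scan finished, no findings",
    1: "findings present (unverified or verified-live)",
    2: "runtime error: bad config, panic, or I/O failure",
}

def describe_exit(code):
    """Translate a keyhog exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_scan(path="."):
    """Run `keyhog scan <path>` and return (exit_code, description).

    Output is not captured, so findings still print to the terminal.
    """
    proc = subprocess.run(["keyhog", "scan", path])
    return proc.returncode, describe_exit(proc.returncode)
```

Distinguishing 1 from 2 matters mostly for alerting: a build that is red because of a finding and a build that is red because the scanner crashed usually go to different people.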

What you get out of it

By default, output is human-readable:

$ keyhog scan .
keyhog v0.5.37 │ 891 detectors │ 1647 patterns │ avx-512 + hyperscan

src/config/staging.env:14:12  HIGH  stripe-secret-key
                              sk_live_4eC39H…Tcd3Hc (redacted, last 6)
                              entropy 5.21 │ confidence 0.999 │ unverified

scanned 12,841 files in 1.4 s
1 finding · 0 verified live · 1041 example fixtures suppressed

The header tells you the binary version, the detector count, and which hardware acceleration is active (AVX-512, Hyperscan/Vectorscan SIMD, CUDA, etc.). The body lists each finding with its location, severity, detector, redacted credential, and confidence. The footer summarizes counts and runtime.

Default suppressions

KeyHog ships with a Tier-B suppression list of publicly documented test fixtures - credentials that appear in vendor docs as examples. Findings on these are suppressed by default. Examples:

  • Stripe’s sk_live_4eC39HqLyjWDarjtT1zdp7dc (docs sample)
  • AWS’s AKIAIOSFODNN7EXAMPLE (docs sample)
  • The RFC 7519 specimen JWT
  • GitHub’s ghp_aBcDeFgHiJ… placeholder

To see what was suppressed, pass --no-suppress-test-fixtures. The list lives at crates/cli/data/suppressions/test-fixtures.toml inside the source tree, baked into the binary at build time, and is the ONLY built-in suppression list - there’s no opaque allow-list.

JSON output

keyhog scan . --format json

Each finding is a JSON object with these fields, every one always present (consumers like SARIF converters and CI gates rely on the schema being stable):

{
  "detector_id":        "stripe-secret-key",
  "detector_name":      "Stripe Secret Key",
  "service":            "stripe",
  "severity":           "critical",
  "credential_redacted": "sk_live_4e…3Hc",
  "credential_hash":     "sha256-hex",
  "location": {
    "source":    "filesystem",
    "file_path": "src/config/staging.env",
    "line":      14,
    "offset":    12,
    "commit":    null,
    "author":    null,
    "date":      null
  },
  "verification": "skipped",
  "metadata": {},
  "additional_locations": [],
  "confidence": 0.999
}

Pipe it into jq, into a SARIF converter for the GitHub Security tab, or into your own dedup / triage tooling.
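As one possible shape for such triage tooling, the sketch below groups findings by service and deduplicates them by credential_hash. It uses only field names from the schema above; the function names are hypothetical, and the sample findings are abbreviated to the fields the code reads (real output carries every field).

```python
import json
from collections import Counter

def count_by_service(findings):
    """Count findings per upstream service ("stripe", "aws", ...)."""
    return Counter(f["service"] for f in findings)

def dedup_by_credential(findings):
    """Keep the first finding seen for each distinct credential hash."""
    unique = {}
    for finding in findings:
        unique.setdefault(finding["credential_hash"], finding)
    return list(unique.values())

# Abbreviated sample: the same credential leaked in two files.
SAMPLE = json.loads("""
[
  {"detector_id": "stripe-secret-key", "service": "stripe",
   "credential_hash": "aaaa", "verification": "skipped",
   "location": {"file_path": "src/config/staging.env", "line": 14}},
  {"detector_id": "stripe-secret-key", "service": "stripe",
   "credential_hash": "aaaa", "verification": "skipped",
   "location": {"file_path": "deploy/backup.env", "line": 3}}
]
""")
```

In practice the list would come from json.load on the output of keyhog scan . --format json rather than from an inline sample.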

Limiting scope

keyhog scan src/                        # one subdirectory
keyhog scan src/config/staging.env      # one file
keyhog scan --stdin < staging.env       # from stdin (CI: cat | keyhog)
keyhog scan . --exclude-paths 'docs/*'  # exclude a glob

Common patterns the default walk already skips: .git/, node_modules/, __pycache__/, vendor/, dist/, build/, out/, .min.js, .min.css, .bak, .swp. To see the full list, look at is_default_excluded in crates/sources/src/filesystem.rs.

Interactive TUI dashboard

For an interactive scan with a live finding feed, current-file banner, and stats panel showing throughput and backend choice:

keyhog tui .                       # scan CWD with live dashboard
keyhog tui src/ --throttle-ms 200  # paced scan, good for demos/recordings
keyhog tui . --feed-depth 500      # keep last 500 findings in feed

The TUI builds on the same scanner core; q or Esc quits, and a non-zero exit code is returned when any findings are surfaced. Useful for sitting next to a developer demoing keyhog, or recording a vhs GIF for a README or talk.

Going further

Once the basic scan works:

  • Output formats - text, JSON, JSONL, SARIF.
  • Verification - --verify makes API calls to confirm credentials are live and downgrades dead ones by one severity tier.
  • Pre-commit hook - block leaked creds before they hit the repo.
  • CI integration - GitHub Actions, GitLab CI, CircleCI patterns.

Output formats

KeyHog speaks four formats. Pick the one that fits the consumer.

--format text (default)

Human-readable table. Best for terminal use, pre-commit hook output, and screenshots. Colors auto-detect TTY; pipe through cat (or set NO_COLOR=1) to disable.

src/config/staging.env:14:12  HIGH  stripe-secret-key
                              sk_live_4eC39H…Tcd3Hc (redacted)
                              entropy 5.21 │ confidence 0.999 │ unverified

The columns are file:line:offset, severity, detector ID. The second line is the redacted credential. The third is metadata.

--format json

Stable-schema JSON array. Every finding has every documented field present. See Your first scan for the schema.

keyhog scan . --format json | jq '.[] | .detector_id' | sort | uniq -c

That sample command counts findings per detector, which answers the most common question: “what kinds of leaks do I have?”

--format sarif

SARIF 2.1.0 - Static Analysis Results Interchange Format. GitHub Code Scanning, GitLab Security Dashboard, and most IDE security plugins consume this.

keyhog scan . --format sarif > keyhog-results.sarif

Upload to GitHub:

# .github/workflows/secrets.yml
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: keyhog-results.sarif

Findings show up in the Security → Code scanning tab with the detector ID as the rule, file path + line as the location, and the redacted credential as the message.

--format jsonl

Newline-delimited JSON - one finding per line, no outer array. Better than --format json for streaming consumers that want to start processing before the scan finishes:

keyhog scan /huge/monorepo --format jsonl \
  | while IFS= read -r line; do
      # -r keeps backslashes in the JSON intact; printf avoids echo's escape handling
      printf '%s\n' "$line" | jq -r '.location.file_path'
    done
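The same streaming consumption can be done without spawning jq once per line. The Python sketch below parses one finding per input line as it arrives; the function names are hypothetical and the only field it reads is location.file_path from the documented schema.

```python
import json
import sys

def iter_findings(stream):
    """Yield one parsed finding per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:                      # tolerate blank lines between records
            yield json.loads(line)

def print_paths(stream, out=sys.stdout):
    """Print the file path of each finding as soon as its line is read."""
    for finding in iter_findings(stream):
        print(finding["location"]["file_path"], file=out, flush=True)
```

To use it as a filter, call print_paths(sys.stdin) at the bottom of the script and pipe keyhog scan ... --format jsonl into it. Because the generator never buffers the whole input, processing starts before the scan finishes.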

Combining with --verify

--verify calls each detector’s verification endpoint to confirm the credential is live. Live credentials keep their severity; dead ones get downgraded one tier. The output format doesn’t change - the verification field of each finding becomes "verified-live" or "verified-dead" instead of "skipped".

keyhog scan . --verify --format json \
  | jq '.[] | select(.verification == "verified-live")'

Quiet mode

--quiet suppresses the header banner and the footer summary. Output is findings-only, which is what CI scripts usually want:

keyhog scan . --quiet --format json

Exit code semantics are unchanged.

How detection works

A KeyHog scan is a pipeline. Files come in one side, findings go out the other. In between, four stages:

files → [chunker] → [prefilter] → [detector match] → [post-process] → findings

Each stage is a hard filter - if a chunk fails the prefilter, no detector ever runs on it. That’s where the speed comes from: the expensive regex evaluation only sees chunks that already plausibly contain something.

Stage 1 - chunker

A file becomes one or more chunks. A chunk is {data: str, metadata: {source_type, path, line_offsets, …}}. The chunker:

  • Skips obvious binaries via magic-byte sniffing (PDF, PNG, zip, …).
  • Skips files matching is_default_excluded (node_modules, .min.js, build/, etc.).
  • Splits files larger than 64 MiB into overlapping windows so a single giant log file doesn’t blow scratch memory. Cross-window secrets are reassembled in stage 4.
  • Decodes UTF-16 BOM files into UTF-8 (PowerShell / .NET configs).
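The overlapping-window split can be sketched like this. It is a simplified Python model, not the chunker's Rust implementation: the function name split_windows is hypothetical, and the overlap size is a free parameter here because the text only specifies the 64 MiB threshold.

```python
def split_windows(data, window, overlap):
    """Yield (start_offset, chunk) pairs covering `data` with overlap.

    Consecutive windows share `overlap` items, so any token no longer
    than `overlap` lies entirely inside at least one window.
    Works on both str and bytes.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    start = 0
    while True:
        yield start, data[start:start + window]
        if start + window >= len(data):   # this window reached the end
            break
        start += step
```

Tokens longer than the overlap can still straddle a boundary, which is the case the stage 4 reassembly exists for.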

Specialized chunkers run too:

  • Git history → one chunk per (commit × file × diff line)
  • Docker images → one chunk per layer × file
  • Web URLs → one chunk per response body / sourcemap / WASM strings
  • S3 buckets → one chunk per object body

Stage 2 - prefilter (the cheap pass)

Three gates, in order, each cheaper than the next:

  1. Alphabet screen. A 256-bit mask of which bytes the corpus’s detectors care about. If a chunk doesn’t contain ANY byte in the mask, it’s discarded. Most random-binary chunks fail here.

  2. Bigram bloom. A 4096-bit bloom filter of 2-byte sequences from detector keyword prefixes. If a chunk has no overlapping bigram, discard. Catches the “this is a Go source file with no key= anywhere” case in microseconds.

  3. SIMD prefilter (Hyperscan). A multi-pattern SIMD regex scanner. The detector corpus is compiled to a single Hyperscan database; one scan call returns “which detector IDs have a candidate match.” On AVX-512 hardware this runs at ~3 GB/s.

    On GPUs above the breakeven threshold (2 MiB on 5090-class, 16 MiB on 4090-class), the prefilter switches to a CUDA literal-set scan via vyre - same patterns, parallelized across thousands of cores.
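The first two gates are simple enough to model in a few lines. The sketch below is a scalar Python illustration, not the vectorized implementation: the function names are hypothetical, the bigram hash is a stand-in chosen for readability, and inserting every bigram of each keyword (rather than only prefixes) is a simplification.

```python
BLOOM_BITS = 4096

def build_alphabet_mask(keywords):
    """256-bit mask as an int: bit b is set if byte b occurs in a keyword."""
    mask = 0
    for kw in keywords:
        for b in kw:                  # iterating bytes yields ints 0..255
            mask |= 1 << b
    return mask

def passes_alphabet(chunk, mask):
    """Gate 1: keep the chunk if it contains any byte from the mask."""
    return any((mask >> b) & 1 for b in chunk)

def bigram_slot(a, b):
    """Map a 2-byte sequence to one of 4096 bits (stand-in hash)."""
    return (a * 251 + b) % BLOOM_BITS

def build_bigram_bloom(keywords):
    """4096-bit filter holding every adjacent byte pair of every keyword."""
    bloom = 0
    for kw in keywords:
        for a, b in zip(kw, kw[1:]):
            bloom |= 1 << bigram_slot(a, b)
    return bloom

def passes_bigram(chunk, bloom):
    """Gate 2: keep the chunk if any of its bigrams hits a set bit."""
    return any((bloom >> bigram_slot(a, b)) & 1 for a, b in zip(chunk, chunk[1:]))
```

Both gates can produce false passes but never false rejections for chunks that actually contain a keyword, which is what makes them safe to run ahead of the expensive regex stage.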

Stage 3 - detector match

For each detector that the prefilter flagged, the FULL regex evaluates. The regex is what’s in the .toml file - detector.patterns[].regex. The capture group becomes the candidate credential.

A detector’s .toml carries:

  • id, name, service, severity, keywords
  • one or more patterns, each with regex + group + optional description
  • optional companions (e.g. AWS access key needs the secret key nearby)
  • optional verify block - HTTP method, URL template, auth scheme, success status

Detectors fall into two camps:

  • Service-anchored. Regex requires a service-specific keyword (AWS_SECRET_ACCESS_KEY=, stripe.com/v1/, dn_ Deepnote prefix). These have HIGH precision: the keyword itself is positive evidence, not just a hint.

  • Generic / entropy fallback (generic-password, entropy-api-key, entropy-token). Triggered by entropy + assignment shape only - password = "...", secret: "...", JSON { "token": "..." }. Lower precision; suppression filters do most of the work.

The split matters for the post-process stage.

Stage 4 - post-process

Even a regex match isn’t always a credential. Stage 4 filters:

  • Known example fixtures (Stripe docs key, AWS docs key, RFC 7519 specimen JWT).
  • Placeholder language - credentials containing YOUR_, INSERT, EXAMPLE, PLACEHOLDER, TODO, FIXME, etc.
  • Shape gates.
    • Universal: punctuation_decorated_identifier - credentials starting with --, &, @, !, /, $ (CLI flags, pointers, SQL vars, shell vars, GraphQL refs) or ending in : / ! (UI labels, TypeScript non-null assertions).
    • Generic / entropy only: pure_identifier, word_separated_identifier, scheme_prefixed_uri, url_or_path_segment, contains_uuid_v4_substring. These shapes CAN be real credentials when paired with a service anchor (PowerBI client_id is a UUID, mongodb-atlas is a URI), so we only apply them to anchorless detectors.
  • Path-based suppressions - vendored bundles (node_modules/, wp-includes/, bower_components/), CI workflow files (where ${{ secrets.NAME }} references are syntactic, not credentials), i18n translation files, secret-scanner source files (the file IS a scanner; its regex literals shouldn’t fire on itself).
  • Cross-chunk reassembly. A secret split across window boundaries gets reassembled from the tail of chunk N + the head of chunk N+1.

A finding that survives stage 4 makes it to output.
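A few of these filters can be approximated as below. This is a hedged Python sketch rather than the scanner's logic: the function names are hypothetical, case-insensitive placeholder matching is an assumption, and the pure_identifier test (letters and underscores only) is a guess at the shape based on the examples in this section.

```python
import re

PLACEHOLDER_MARKERS = ("YOUR_", "INSERT", "EXAMPLE", "PLACEHOLDER", "TODO", "FIXME")
DECORATED_PREFIXES = ("--", "&", "@", "!", "/", "$")
DECORATED_SUFFIXES = (":", "!")
# Assumption: a "pure identifier" has no digits or punctuation besides "_".
PURE_IDENTIFIER = re.compile(r"[A-Za-z_]+")

def has_placeholder_language(cred):
    """Placeholder filter: markers such as YOUR_ or EXAMPLE anywhere."""
    upper = cred.upper()
    return any(marker in upper for marker in PLACEHOLDER_MARKERS)

def is_punctuation_decorated(cred):
    """Universal shape gate: CLI flags, shell vars, UI labels, etc."""
    return cred.startswith(DECORATED_PREFIXES) or cred.endswith(DECORATED_SUFFIXES)

def survives_post_process(cred, service_anchored):
    """Return True if the candidate passes the filters modeled here."""
    if has_placeholder_language(cred) or is_punctuation_decorated(cred):
        return False
    # Generic-only gates apply to anchorless detectors exclusively.
    if not service_anchored and PURE_IDENTIFIER.fullmatch(cred):
        return False
    return True
```

The service_anchored switch is the part that carries the design decision: the same string can be dropped for a generic detector and kept for a service-anchored one.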

Where the speed comes from

Stage                 Throughput on a modern laptop
Chunker               ~5 GB/s (mmap + magic-byte sniff)
Alphabet screen       ~12 GB/s (256-bit table lookup, vectorized)
Bigram bloom          ~8 GB/s (4096-bit table, vectorized)
Hyperscan SIMD        ~3 GB/s (multi-pattern regex)
Per-detector regex    ~150 MB/s × detectors flagged
Post-process          ~200 MB/s

The end-to-end number on the dogfood corpus is ~800 MB/s sustained. Hardware acceleration (AVX-512, CUDA) raises the SIMD-prefilter ceiling substantially on big inputs; small inputs (< 100 KB) bottleneck on the chunker and post-process, not the regex.

Where the precision comes from

Filter                              What it catches
Known example fixtures              Stripe docs key, AWS docs key, RFC 7519 JWT
pure_identifier                     getParameter, Benutzername, auth_decoders
word_separated_identifier           s3_secret_access_key (function name)
scheme_prefixed_uri                 urn:foo:bar (URI literal, not creds)
url_or_path_segment                 /api/v1/users/123 (REST path)
contains_uuid_v4_substring          TOKEN_LIST=636765a9-… (UUID identifier)
punctuation_decorated_identifier    --api-secret, &password, Password:
Vendored-minified-path              node_modules/jquery-3.6.0.min.js
CI workflow path                    .github/workflows/ci.yml - ${{ secrets.X }}
i18n translation path               locale/de.po - translated password word

Each filter has a known-FP-cluster it was built to defuse. The Suppressions page enumerates them with examples.

What this looks like for one finding

file.env contains: AWS_SECRET_ACCESS_KEY=ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1

stage 1 - chunker:        emit chunk{ path: "file.env", data: "AWS_SECRET..." }
stage 2 - alphabet:       PASS (chunk has `=`, alphanumerics from the corpus)
stage 2 - bigram bloom:   PASS (`AW`, `WS`, `_S` are in the bloom)
stage 2 - Hyperscan:      MATCH → triggers `aws-secret-access-key` + `generic-password`
stage 3 - regex eval:
  `aws-secret-access-key` regex `(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY|...)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)`
    captures `ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1`
  `generic-password` regex doesn't match (no `_password`/`_pwd` substring)
stage 4 - post-process:
  known-example check: no
  `looks_like_pure_identifier`: false (has digits + /)
  `looks_like_punctuation_decorated_identifier`: false
  → EMIT

That’s one finding’s life. Multiply by 10⁶ files and the throughput math is why each stage matters.
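The stage 3 step of this walkthrough can be reproduced with Python's re module. The sketch below uses the regex exactly as printed above, minus the alternatives elided by "...", so it only recognizes the AWS_SECRET_ACCESS_KEY spelling; Python's regex engine is also not the engine KeyHog uses, so treat this as an approximation.

```python
import re

# Regex from the walkthrough, without the alternatives elided by "...".
AWS_SECRET_RE = re.compile(
    r"""(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)"""
)

def capture_secret(line):
    """Return capture group 1 (the candidate credential) or None."""
    match = AWS_SECRET_RE.search(line)
    return match.group(1) if match else None

LINE = "AWS_SECRET_ACCESS_KEY=ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1"
```

capture_secret(LINE) returns the 40-character value after the equals sign. The trailing non-capturing group is what rejects longer runs: a 41st character from the same alphabet makes the match fail at that position instead of silently truncating.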

Detectors

A detector is a single TOML file that teaches KeyHog one shape of credential. There are 891 of them in the embedded corpus today, spread across detectors/*.toml.

Anatomy of a detector

# detectors/stripe-secret-key.toml

[detector]
id = "stripe-secret-key"
name = "Stripe Secret Key"
service = "stripe"
severity = "critical"
keywords = ["sk_live_", "sk_test_", "stripe"]

[[detector.patterns]]
regex = "sk_(?:live|test)_[a-zA-Z0-9]{24,}"
description = "Stripe secret key - live or test mode"
group = 0

[detector.verify]
method = "GET"
url = "https://api.stripe.com/v1/charges?limit=1"

[detector.verify.auth]
type = "bearer"
field = "match"

[detector.verify.success]
status = 200

That’s the whole contract for one service. Every other detector follows the same shape.

Fields

detector.id - kebab-case, globally unique. Shows up in JSON output as detector_id and in CLI output as the third column.

detector.name - human-readable name. Shows up in keyhog detectors listing and IDE plugins.

detector.service - the upstream service slug. Used for grouping findings (e.g. “you leaked 3 stripe credentials”); a single service can have multiple detectors (stripe-secret-key, stripe-restricted-key, stripe-publishable-key).

detector.severity - one of critical | high | medium | low | client-safe | info. The CLI’s exit code only depends on whether ANY finding exists, but SARIF / GitHub Code Scanning surface severity prominently.

client-safe is the bug-bounty tier for keys public by design (Sentry DSN, Stripe pk_*, Mapbox pk., PostHog phc_, Firebase Web API key, Google Maps browser key, Mixpanel project token, Algolia search-only, Datadog browser RUM, Bugsnag, Segment write key). The detector still fires (a token grep is a token grep), but the finding renders below low and --hide-client-safe filters it out entirely. Set per-pattern via the client_safe = true field on a [[detector.patterns]] block - detectors that fire on both the public and the secret prefix (Stripe pk_* vs sk_*, Mapbox pk. vs sk.) tag only the public pattern so a misused secret key still surfaces at its nominal severity.

detector.keywords - strings the prefilter’s Aho-Corasick automaton matches on. At least ONE keyword in the chunk is required before the regex even runs. Pick keywords that are short, distinctive, and likely to appear near a real credential (stripe, sk_live_, STRIPE_SECRET_KEY).

detector.patterns[] - one or more regexes. Each carries:

  • regex - the pattern. Compiled with CASELESS (matches both cases without explicit alternation).
  • group - which capture group is the credential. 0 = whole match, 1 = first captured group, etc.
  • description - what shape this captures (env var, header, URL, …).
  • client_safe - optional bool, default false. When true, any match against this pattern collapses to Severity::ClientSafe regardless of the detector’s nominal severity. Use for patterns that capture keys the vendor expects to ship in client bundles (Sentry DSN, Stripe pk_*, etc.). Per-pattern (not per-detector) so a detector that covers both the public and the secret prefix can tag only the public one.

Multiple patterns means “any of these shapes”. A typical detector has 1–3 patterns covering env-var, JSON, and inline forms.

detector.companions[] - optional. Some credentials are only useful in pairs (AWS access key + secret key). A companion is a second regex that must match within N lines of the primary; without it, the primary’s finding is dropped.
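The proximity rule can be pictured with the sketch below. It is a simplified Python model under stated assumptions: has_companion is a hypothetical name, the window N is a parameter because the text does not give its value, and the companion regex is an illustrative pattern rather than one taken from the shipped detectors. The sample values are the AWS documentation example keys.

```python
import re

def has_companion(lines, primary_idx, companion_re, max_distance):
    """True if `companion_re` matches within `max_distance` lines of the primary.

    The primary's own line counts as distance 0.
    """
    lo = max(0, primary_idx - max_distance)
    hi = min(len(lines), primary_idx + max_distance + 1)
    return any(companion_re.search(lines[i]) for i in range(lo, hi))

# Illustrative companion pattern (assumption, not a shipped detector regex).
SECRET_NEARBY = re.compile(r"aws_secret_access_key\s*=\s*\S{40}")

LINES = [
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE",
    "",
    "aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
]
```

Here the secret sits two lines below the access key, so has_companion(LINES, 0, SECRET_NEARBY, 2) holds while a window of 1 does not, and in the latter case the primary finding would be dropped.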

detector.verify - optional. If present, keyhog scan --verify makes the documented API call with the captured credential and:

  • API accepts the credential → keep severity, mark verification: "verified-live"
  • API rejects the credential → downgrade severity one tier, mark "verified-dead"

Listing detectors

keyhog detectors                  # human-readable list, grouped by service
keyhog detectors --json           # one JSON object per detector
keyhog detectors --json | jq length
891

Filter by service:

keyhog detectors --json \
  | jq '.[] | select(.service == "stripe")'

Explaining one detector

keyhog explain stripe-secret-key

Prints the full TOML contents, the keywords, the patterns with their descriptions, the verification endpoint, and any companions. Useful when debugging “why didn’t this fire?” - usually the answer is in the regex or keywords.

Custom detectors

Drop a .toml next to the binary or in ~/.config/keyhog/detectors/:

# ~/.config/keyhog/detectors/my-internal-token.toml

[detector]
id = "acme-internal-token"
name = "ACME internal API token"
service = "acme-internal"
severity = "high"
keywords = ["ACME_API_TOKEN", "acme_internal_"]

[[detector.patterns]]
regex = "acme_internal_[a-zA-Z0-9]{32}"
group = 0

Restart the scanner and the new detector is loaded alongside the built-ins. There’s no opt-in, no flag, no rebuild - TOML in, detector out.

Disabling specific detectors

Turn off a detector by id in .keyhog.toml:

[detector.aws-access-key]
enabled = false

[detector.generic-secret]
enabled = false

Detector ids are the detector_id field in --format json/jsonl output, or the left column of keyhog detectors. The high-precision fast-path detectors are prefixed hot- (e.g. hot-aws_key); a service like AWS can have both a hot- detector and a TOML detector, so disable both to silence it entirely:

[detector.hot-aws_key]
enabled = false
[detector.aws-access-key]
enabled = false

Disabled TOML detectors are dropped before the corpus compiles (zero scan cost); disabled hot-pattern findings are filtered from the report. If an id matches nothing in the loaded corpus, keyhog warns rather than silently ignoring it.

Running only a chosen subset

To run a curated set instead of the full corpus, point --detectors at a directory holding only the TOMLs you want:

mkdir my-detectors
cp detectors/stripe-secret-key.toml detectors/aws-*.toml my-detectors/
keyhog scan . --detectors my-detectors/     # or KEYHOG_DETECTORS=my-detectors

Quieting a noisy detector

When a detector produces persistent false positives in your repo, down-weight it instead of dropping it entirely so a real hit still surfaces:

keyhog calibrate --fp generic-api-key       # record a false positive
keyhog scan . --min-confidence 0.7          # filter low-confidence hits

Each --fp lowers that detector’s Bayesian confidence multiplier (persisted under $XDG_DATA_HOME/keyhog/), so repeated FPs steadily push its findings below your --min-confidence floor. To suppress specific findings rather than a whole detector, use a .keyhogignore, the [allowlist] config, or a --baseline.

Severity bumps and downgrades

Severity is a property of the detector, but can shift per-finding:

  • Git history → severity one tier lower. A credential present only in non-HEAD git history (the developer already removed it from main) is still a leak - anyone can fetch it - but strictly less urgent than one live in HEAD. Reported in the chunk.metadata.commit field of the finding.

  • Verification: dead → severity one tier lower. The credential was format-valid but the API rejected it. Could be a rotated key, a fake in a test file, or a typo.

  • Verification: live → severity unchanged. The credential authenticates successfully. As bad as it can get.
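The shifts above amount to stepping down an ordered list of tiers. A minimal Python sketch, assuming a four-tier scale and assuming the history and dead-verification downgrades stack; neither assumption is confirmed by the text above, and the function names are hypothetical:

```python
# Hypothetical tier order; the docs only spell out "critical -> high, high -> medium, ...".
TIERS = ["critical", "high", "medium", "low"]

def downgrade(severity: str) -> str:
    """Step one tier down, clamping at the lowest tier."""
    i = TIERS.index(severity)
    return TIERS[min(i + 1, len(TIERS) - 1)]

def effective_severity(base: str, history_only: bool = False,
                       verification: str = "skipped") -> str:
    """Apply the per-finding shifts to a detector's base severity."""
    severity = base
    if history_only:                      # present only in non-HEAD git history
        severity = downgrade(severity)
    if verification == "verified-dead":   # format-valid, but the API rejected it
        severity = downgrade(severity)    # assumption: stacks with the history shift
    return severity                       # live / error / skipped leave it unchanged
```

Under these assumptions, a critical detector whose only hit is a dead key in old history would land at medium.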

Writing your own - the short version

  1. Find a real example of the credential format (vendor docs, leaked public sample, source).
  2. Write the regex. Test it against the example, against a similar non-credential (“looks like, isn’t”), and against an attacker-rotated form.
  3. Add to detectors/<service>-<thing>.toml - id, keywords, patterns, optionally verify.
  4. Add a contract file at crates/scanner/tests/contracts/<id>.toml with at least:
    • 2 positives (env-var form, quoted form)
    • 2 negatives (placeholder, EXAMPLE marker)
    • 2 evasions (the actual deployed credential shape from production)
  5. Run cargo test -p keyhog-scanner --test contracts_runner - must pass for your detector to ship.

That’s it. The contracts gate enforces that every shipped detector catches what it claims to catch.

HTTP and wire scanning

Real credentials don’t always sit on disk. They flow through:

  • Live web bundles that ship from production at a public URL.
  • HAR files that browsers (Chrome / Firefox / Safari DevTools) produce when you click “Save all as HAR with content.”
  • mitmproxy / Burp captures of an authenticated session.
  • curl / httpie / Postman exports of one specific request you want to verify.

KeyHog scans every one of these, but the surface is split across a few flags and sources. This page is the map.

TL;DR

| Workflow | Command |
| --- | --- |
| Scan a public JS bundle | keyhog scan --url https://app.example.com/static/main.js |
| Scan every URL in a list | keyhog scan --url $(cat urls.txt) |
| Scan a source-map exposed by Webpack | keyhog scan --url https://app.example.com/static/main.js.map |
| Scan a HAR export from DevTools | keyhog scan capture.har (see HAR auto-expansion) |
| Scan a single curl response | curl -s https://api/... \| keyhog scan --stdin |
| Scan a saved Burp / mitmproxy capture | keyhog scan dump.txt (treats as text - no protocol parsing) |
| Route every fetch through Burp | keyhog scan --url https://... --proxy http://burp:8080 --insecure |
| Scan in an air-gapped network | keyhog scan --url https://... --proxy off |

The --url flag (Web Source)

keyhog scan --url https://app.example.com/static/main.js
keyhog scan --url https://app.example.com/static/main.js \
            https://app.example.com/static/runtime.js \
            https://app.example.com/static/vendor.js

Each URL is fetched with the shared HTTP client policy (see Proxy and TLS below). The response is routed by extension:

  • .js → one chunk per file, scanned as plain text.
  • .map → JSON parsed, each sourcesContent[i] becomes its own chunk tagged with the original filename. This is how a Webpack build with devtool: 'source-map' accidentally exposes server-side env vars baked into the bundle at build time.
  • .wasm → linear-memory + import section dumped as strings (best-effort; native WASM symbol extraction lives behind the binary feature).
  • Everything else → one chunk of text.

Findings are tagged source: "web:js", web:sourcemap, web:sourcemap:raw, web:wasm, or web:other. The original URL is the file_path.
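To make the .map case concrete, here is a short Python sketch of splitting a source map into per-file chunks. It relies only on the standard source-map v3 fields sources and sourcesContent; the function name and the fallback label are illustrative, and KeyHog's own expansion may differ:

```python
import json

def sourcemap_chunks(map_text: str):
    """Return (original_filename, source_text) pairs from a v3 source map."""
    doc = json.loads(map_text)
    sources = doc.get("sources", [])
    contents = doc.get("sourcesContent") or []   # the field is optional in the spec
    chunks = []
    for i, content in enumerate(contents):
        if content is None:                      # entry present but content omitted
            continue
        # Illustrative fallback label when the sources array is shorter than expected.
        name = sources[i] if i < len(sources) else f"<source {i}>"
        chunks.append((name, content))
    return chunks
```

Each pair can then be scanned as its own chunk, which is why a finding points at the original file name rather than at one giant line of JSON.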

SSRF defense

--url refuses to fetch:

  • Private RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
  • Loopback (127.0.0.0/8, ::1).
  • Link-local (169.254.0.0/16, fe80::/10).
  • Cloud metadata endpoints (169.254.169.254, the GCP / Azure / AWS / DigitalOcean / Hetzner variants).

This isn’t a CLI flag - it’s hardcoded so a user can’t accidentally turn an --url invocation into a metadata-service IAM exfil.
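As an illustration of the kind of check involved, here is a Python sketch that tests an IP literal against the ranges listed above using the standard ipaddress module. It is not KeyHog's implementation; the names are hypothetical, and a real defense also has to check what a hostname resolves to, which this sketch does not do:

```python
import ipaddress

# Ranges taken from the list above. 169.254.169.254 (cloud metadata)
# already falls inside the link-local block.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",   # RFC 1918 private
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16", "fe80::/10",                     # link-local
)]

def is_blocked(ip_literal: str) -> bool:
    """True if the address sits in one of the refused ranges."""
    addr = ipaddress.ip_address(ip_literal)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot dodge the IPv4 rules.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return any(addr.version == net.version and addr in net
               for net in BLOCKED_NETWORKS)
```

For example, is_blocked("169.254.169.254") is true while is_blocked("8.8.8.8") is false.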

Proxy and TLS

Everything outbound - --url, --github-org, --s3-bucket, --verify’s API calls - runs through one HTTP client builder. Policy:

| Source | Effect |
| --- | --- |
| --proxy http://burp:8080 | Explicit. Wins over everything. |
| --proxy off | Disable proxying entirely, ignore env vars. |
| KEYHOG_PROXY env var | Same as --proxy. Useful inside CI containers. |
| HTTPS_PROXY / HTTP_PROXY | reqwest’s default. Last resort. |
| --insecure | Accept any TLS cert (self-signed Burp CA, etc.). |
| KEYHOG_INSECURE_TLS=1 | Same as --insecure. |

Order: explicit flag → KEYHOG_PROXY → standard env vars.
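The precedence can be sketched as a small Python function. This is a simplified model, not the actual client builder: the function name is hypothetical, and checking HTTPS_PROXY before HTTP_PROXY is an assumption (reqwest picks per request scheme):

```python
def resolve_proxy(flag, env):
    """Return the proxy URL to use, or None for a direct connection.

    flag: value of --proxy (None if not passed); env: mapping of env vars.
    """
    # The explicit flag wins, then KEYHOG_PROXY; "off" disables proxying
    # and also stops the standard env vars from being consulted.
    for candidate in (flag, env.get("KEYHOG_PROXY")):
        if candidate:
            return None if candidate == "off" else candidate
    # Last resort: the standard env vars (ordering here is an assumption).
    return env.get("HTTPS_PROXY") or env.get("HTTP_PROXY")
```

So resolve_proxy("off", {"HTTPS_PROXY": "http://corp:3128"}) yields None, which is the air-gapped case from the TL;DR table.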

User-Agent: keyhog/<version> is always set so you can grep your proxy logs for keyhog traffic without guessing.

HAR auto-expansion

Any file with a .har extension is recognised by the filesystem source and expanded into one chunk per request and one chunk per response. Each chunk carries a source-type that tells you which side of the exchange it came from:

| Chunk | source_type | What it contains |
| --- | --- | --- |
| Request | wire:har:request | <METHOD> <URL>, every request header, query string, POST body. |
| Response | wire:har:response | <STATUS> <statusText>, every response header, response body. |

Finding file_path becomes <har-path>#<request-url>, so the same HAR with five different requests produces five distinct paths. Editors that jump-to-file on path:line URIs land on the HAR but the URL tail makes the location unambiguous.

keyhog scan capture.har --format json | \
  jq '.[] | select(.location.source == "wire:har:request")'

filters down to outbound credentials only - the bug-bounty “what did I send” view. Swap request for response to see what the upstream reflected back at you.

A HAR that fails to parse (truncated export from a crashed browser) falls through to plain text scanning so credentials still surface; the file isn’t silently dropped.

Defenses:

  • --max-file-size budget on cumulative request+response body bytes. Defeats a malicious HAR that decompresses to gigabytes.
  • The cheap pre-sniff ({"log" + "entries" in the first 2 KiB) bails before invoking the JSON parser on a 200 MiB blob that obviously isn’t HAR.
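For readers who want to see the shape of the expansion, here is a rough Python sketch built on the standard HAR 1.2 layout (log.entries[].request / .response). The chunk dictionary keys and function names are illustrative; the size budget, the pre-sniff, and the plain-text fallback described above are left out:

```python
import json

def _headers(items):
    """Render HAR header objects as 'Name: value' lines."""
    return "\n".join(f"{h['name']}: {h['value']}" for h in items)

def expand_har(har_path: str, har_text: str):
    """Return two chunks (request, response) per HAR entry."""
    chunks = []
    for entry in json.loads(har_text)["log"]["entries"]:
        req, resp = entry["request"], entry["response"]
        path = f"{har_path}#{req['url']}"          # <har-path>#<request-url>
        req_body = (req.get("postData") or {}).get("text") or ""
        resp_body = (resp.get("content") or {}).get("text") or ""
        chunks.append({
            "source_type": "wire:har:request",
            "file_path": path,
            "text": f"{req['method']} {req['url']}\n"
                    f"{_headers(req.get('headers', []))}\n\n{req_body}",
        })
        chunks.append({
            "source_type": "wire:har:response",
            "file_path": path,
            "text": f"{resp['status']} {resp.get('statusText', '')}\n"
                    f"{_headers(resp.get('headers', []))}\n\n{resp_body}",
        })
    return chunks
```

The request URL already carries the query string, so a ?token=... parameter ends up in the request chunk without extra handling.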

Scanning a single HTTP exchange (stdin)

The most common ad-hoc workflow:

curl -s https://api.example.com/v1/me \
     -H "Authorization: Bearer $TOKEN" \
| keyhog scan --stdin

Or just pipe a saved response:

keyhog scan --stdin < response.txt

keyhog scan - (bare dash) is the same as --stdin (grep / wc convention; added in v0.5.28).

--stdin reads up to ~1 GiB; beyond that, write to a temp file and scan the path. Findings from stdin carry the stdin source. To get the richer wire:har:request / wire:har:response provenance tags, save the exchange as a .har file and scan that instead (see HAR auto-expansion).

Headers, bodies, URL params - where the secret sits

KeyHog is content-blind: it greps the raw bytes. That means a Bearer ghp_… in an HTTP header gets the same finding as a "token": "ghp_…" in a JSON body or a ?token=ghp_… in the URL.

For an HTTP capture this is usually what you want - the location column in the finding gives the byte offset within the capture, and the surrounding context (line ±2) is enough to tell whether it was a header or a body.

What KeyHog does not do today:

  • Parse the HTTP wire format and emit header:Authorization vs body:json:$.token provenance fields.
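  • Distinguish a secret in a request from a secret in the response for raw captures (stdin, plain-text dumps). One is being sent OUT, one is being sent IN - different threat model. HAR files already get this split through their wire:har:request / wire:har:response chunks.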

Those land in the roadmap below.

Roadmap

The wire-scanning surface is intentionally narrow today. Items queued for a later release, with their issue links:

  1. mitmproxy .mitm flow-dump support. Same shape as HAR but binary-framed. Use the mitmproxy-rs crate to decode.

  2. Header / body / URL-param provenance. HAR expansion lands one chunk per request and one chunk per response today. The next step is attaching wire_location: header:<name> | body | query to each finding so the JSON consumer can filter wire_location == "header:Authorization" for the highest-signal subset (intentional auth tokens vs accidental body leaks vs URL-logged secrets).

  3. Live proxy mode. Run keyhog proxy --listen :8080 and have it act as an HTTP proxy that scans every flow inline, writing findings to stdout. The use case is recording a browsing session against a target and getting a single report of every credential the site shipped to the client.

  4. WebSocket frame scanning. HAR files don’t include WebSocket payloads. mitmproxy dumps do. Frame-level scanning would catch tokens passed over upgraded connections (Slack, Discord, collaborative editors).

No promises on timeline - track via github.com/santhsecurity/keyhog/issues.

Why this matters for bug bounties

A modern SPA bundle on a typical SaaS app can ship 200+ npm dependencies and a sourcemap that exposes every server-side env var the build process touched. Manually reviewing one main.js.map against the 891-detector corpus takes hours; running keyhog scan --url https://app.target.com/static/main.js.map takes seconds.

Pair it with --hide-client-safe (see CLI reference) to filter out keys that the vendor designed to ship in client bundles (Sentry DSN, Stripe pk_*, Mapbox pk., PostHog phc_, etc.) and you’re left with the keys that actually represent an exfiltration boundary.

Suppressions

A suppression is a filter that drops a candidate match after the regex fires but before it becomes a finding. KeyHog applies them in layers.

The two suppression lists

Test fixtures (always on, opt-out)

crates/cli/data/suppressions/test-fixtures.toml, baked into the binary. Lists publicly documented credentials that vendor docs ship as examples:

[[fixture]]
detector = "stripe-secret-key"
credential = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"
reason = "Stripe docs sample, https://stripe.com/docs/api/auth"

[[fixture]]
detector = "aws-access-key"
credential = "AKIAIOSFODNN7EXAMPLE"
reason = "AWS docs sample, https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html"

Disable with --no-suppress-test-fixtures if you want to see them fire (rare, but useful when validating that a detector still matches the canonical shape).

Repo-local suppressions (opt-in, project-scoped)

.keyhog.toml in your repo root:

[suppress]
# Drop findings on these credential hashes (sha256 of the captured value).
# Use when a finding is a true positive that you've intentionally accepted
# (e.g. a published OAuth client_id, or a fixture you've cleared with
# the upstream service).
hashes = [
    "sha256:abc123...",
    "sha256:def456...",
]

# Drop findings from these files entirely (gitignore-style globs).
paths = [
    "fixtures/**",
    "docs/example_*.env",
]

# Drop findings from these detectors entirely.
detectors = [
    "generic-password",
]

Compute the hash of an existing finding:

keyhog scan . --format json | jq -r '.[] | "\(.detector_id) \(.credential_hash)"'
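If you want to sanity-check a hash by hand, a Python sketch like the one below may help. It assumes the hash is a plain SHA-256 over the UTF-8 bytes of the captured value, hex-encoded with a sha256: prefix; that detail is an assumption, so prefer the credential_hash KeyHog prints if the two ever disagree:

```python
import hashlib

def credential_hash(value: str) -> str:
    """Hash a captured value in the 'sha256:<hex>' form used in .keyhog.toml.

    Assumption: no trimming or normalization happens before hashing.
    """
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()
```

The result can be pasted into the hashes list shown above.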

Shape-based suppression (always on, can’t opt out)

These don’t depend on a list. They’re heuristics about credential shape that are universally true:

| Filter | Drops shapes like |
| --- | --- |
| punctuation_decorated_identifier | --api-secret, &password, $API_KEY, Password:, apiKey! |

For generic-only / entropy-only detectors, additional shape gates apply. See How detection works for the full list and rationale.

Path-based suppression (always on)

Specific directories produce findings that are almost always not credentials. KeyHog hard-codes a small set:

| Path pattern | Why |
| --- | --- |
| node_modules/, vendor/, bower_components/, jspm_packages/, site-packages/ | Vendored third-party code, minified bytes coincide with secret prefixes |
| wp-content/plugins/, wp-content/themes/, wp-includes/ | WordPress vendored trees |
| app/assets/javascripts/bootstrap*.js, app/assets/javascripts/jquery*.js, etc. | Rails legacy asset path, vendored JS |
| *.min.js, *.bundle.js, *.min.css | Minified bundles |
| .github/workflows/, .gitlab-ci.yml, .circleci/, Jenkinsfile, .travis.yml, azure-pipelines*, bitbucket-pipelines* | CI config, ${{ secrets.X }} is syntactic |
| locale/, locales/, i18n/, l10n/, translations/, lang/, langs/, *.po, *.pot | i18n translation files, translated password/token words are not credentials |
| Files containing secretscanner, secret-scanner, trufflehog, gitleaks, detect-secrets in the path | The file IS itself a secret scanner; its regex literals shouldn’t fire on itself |

These are not configurable. They have such high precision / low recall loss that making them opt-in would just make the scanner louder for no benefit. If a specific path you care about is being suppressed incorrectly, that’s a bug worth reporting.
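To see how two of these rules behave, here is a small Python sketch covering only the vendored-directory and minified-bundle rows. It is an illustration of the matching idea with hypothetical names, not the hard-coded list KeyHog ships:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Two rows of the table above; the remaining rows are omitted for brevity.
VENDORED_DIRS = {"node_modules", "vendor", "bower_components",
                 "jspm_packages", "site-packages"}
MINIFIED_GLOBS = ("*.min.js", "*.bundle.js", "*.min.css")

def is_path_suppressed(path: str) -> bool:
    """True if any directory component is vendored or the file name is minified."""
    p = PurePosixPath(path)
    if VENDORED_DIRS.intersection(p.parts[:-1]):   # directory components only
        return True
    return any(fnmatch(p.name, glob) for glob in MINIFIED_GLOBS)
```

Note that a file merely named vendor.js is not matched, because only directory components are compared against the vendored set.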

Telemetry: what got suppressed

Pass --dogfood to surface what was dropped:

keyhog scan . --dogfood --format json | jq '.dogfood.events[]'

Each event has the suppressor name (test_fixture_suppression, pure_identifier_no_digit, vendored_minified_path, etc.), the path, the redacted credential, and the rule that fired. Useful when asking “is the scanner being too aggressive on my code?”.

Adding a suppression for an FP cluster

If you find a cluster of 5+ FPs that share a shape, file an issue with:

  1. The detector that fired
  2. A sanitized example of the FP (replace the captured value with [REDACTED])
  3. Why it’s not a credential (regex shouldn’t have matched, or shape gate should have caught it)

The right fix is either a tightened regex, a new shape filter, or a path / file-extension exclusion. Adding the literal credential to the test-fixtures list is the LAST resort because it only hides one specific FP, not the underlying shape.

Verification

keyhog scan --verify makes an HTTP call to each detector’s documented verification endpoint with the captured credential. The response tells you if the credential is live.

$ keyhog scan . --verify
src/config/staging.env:14:12  CRITICAL  stripe-secret-key
                              sk_live_4eC39H...Tcd3Hc
                              entropy 5.21 | confidence 0.999 | verified-live
src/old/legacy.env:8:5        HIGH      stripe-secret-key   (downgraded)
                              sk_live_oldKEy...xyz12
                              verified-dead | originally CRITICAL

What “live” means

Each detector’s verify block in its TOML defines:

  • method (GET / POST)
  • url (with {{match}} placeholder for the captured credential)
  • auth.type (bearer, basic, header, query, none)
  • auth.field (match, companion-name, …)
  • success.status (HTTP status code, default 200)
  • optional success.body_contains (substring the response body must contain)

The verifier:

  1. Renders the URL with the credential substituted in
  2. Builds the auth header / query param as specified
  3. Sends the request
  4. Compares the response status (and optionally body) to the success criteria

If the criteria match: verified-live. If not: verified-dead. If the request times out or DNS fails: verification-error (treated as unverified, severity unchanged).
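The decision at the end of those steps is small enough to sketch in Python. The function below takes an already-received response rather than making a request; its name and signature are hypothetical, and only the three result labels and the two success fields come from the text above:

```python
def classify(status=None, body="", *, success_status=200,
             body_contains=None, error=False):
    """Map a verification response onto the three documented outcomes.

    error=True (or no status at all) models a timeout or DNS failure.
    """
    if error or status is None:
        return "verification-error"        # treated as unverified
    if status != success_status:
        return "verified-dead"
    if body_contains is not None and body_contains not in body:
        return "verified-dead"             # status matched, body did not
    return "verified-live"
```

A detector with success.body_contains set therefore needs both checks to pass before a finding is marked live.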

Severity shift on verification

| Verification result | Severity action |
| --- | --- |
| verified-live | Unchanged (it really is what it claims to be) |
| verified-dead | Downgrade one tier (critical -> high, high -> medium, …) |
| verification-error | Unchanged, treated as unverified |
| skipped (no --verify flag) | Unchanged |

A dead credential is still a leak (developer typed it into a file once), so KeyHog doesn’t drop it entirely. The downgrade just means “this is less urgent than a credential someone could authenticate with right now.”

Network behavior

--verify makes network calls. Two flags shape what the verifier talks to:

  • --proxy <url> – route all verification through a proxy. Useful in corp networks. Works like the HTTPS_PROXY env var, but takes precedence over it.
  • --insecure – accept self-signed certs. ONLY use against internal endpoints you control. The default is strict TLS verify.

The verifier never follows redirects (SSRF defense – a 302 to a private IP could otherwise leak the credential to an internal service). If a vendor’s auth endpoint returns 302 to follow into the API, that endpoint’s verify block in the detector TOML is wrong; report a bug.

Outbound destinations are filtered at the client level:

  • No localhost or loopback (127.0.0.0/8), no link-local (169.254.0.0/16), and no RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
  • No IPv4-mapped IPv6 of the above.
  • No cloud-metadata IPs (169.254.169.254 AWS/Azure/GCP).

These rules are enforced for every detector even if its TOML specifies a localhost URL by mistake. Set KEYHOG_PROXY=off to disable proxy resolution (useful for air-gapped builds where the proxy env vars are set but no proxy is actually reachable).

Rate limits

Verification is sequential per-finding within a single keyhog scan invocation, with a 100 ms gap between calls to the same hostname. That’s slow enough to avoid tripping vendor rate limits for typical scans (dozens of findings) and fast enough to feel interactive.
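A per-hostname gap like this is easy to model. The Python sketch below is an illustration of the pacing rule only, with hypothetical names; the clock and sleep functions are injectable so the behavior can be checked without real waiting:

```python
import time

class HostThrottle:
    """Enforce a minimum gap between consecutive calls to the same host."""

    def __init__(self, min_gap_s=0.1, clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s
        self.clock = clock
        self.sleep = sleep
        self.last_call = {}                 # host -> time of the previous call

    def wait(self, host: str) -> None:
        """Block until a call to `host` is allowed, then record it."""
        now = self.clock()
        last = self.last_call.get(host)
        if last is not None:
            remaining = self.min_gap_s - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call[host] = now
```

Calls to different hosts never wait on each other, so a scan with findings spread across many vendors pays almost none of the 100 ms gaps.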

If you have hundreds of candidates and want parallelism, the right approach is to scan first WITHOUT --verify to get the candidate list, then verify in batches with a script that respects each service’s documented rate limit.

Detectors without verification

Not every detector has a verify block. About 60% do. The rest are:

  • Format-only detectors (private keys, certificates, JWTs) where the credential itself has provable structure but no service to call.
  • Services without a known low-impact verification endpoint (some internal APIs, deprecated services).

For these, --verify is a no-op. The verification field of the finding stays skipped.

What you can’t do

  • --verify does NOT write data. Every verification call is either a GET or a call to a benign read-only endpoint (e.g. GET /me, GET /charges?limit=1).
  • The verifier does NOT cache results across runs. Each keyhog scan --verify makes fresh calls. Caching would risk reporting a rotated credential as “live” hours after it was revoked.
  • You can’t call verification on a credential that wasn’t captured by a scan. There’s no keyhog verify <credential> subcommand, because verification depends on knowing which detector it came from.

Pre-commit hook

The point of a pre-commit hook is to stop credentials from ever landing in your repo’s history. It runs locally, fast enough to feel synchronous, and blocks the commit if a finding shows up.

Install in one command

From inside a git repo:

keyhog hook install

That writes a .git/hooks/pre-commit script that calls keyhog scan --fast --git-staged (the same command .pre-commit-hooks.yaml exposes for the pre-commit framework). If a pre-commit hook already exists in the repo, keyhog hook install refuses to overwrite it - remove it (or run keyhog hook uninstall) and re-install. The next git commit invokes the hook.

If your repo uses pre-commit instead of raw git hooks, add the following to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/santhsecurity/keyhog
    rev: v0.5.37
    hooks:
      - id: keyhog
        stages: [pre-commit]

Then pre-commit install once, and it runs on every commit.

What gets scanned

keyhog scan --git-staged walks the index (the set of files git is about to commit), not the working tree. Why this matters:

  • A file you’ve modified but not git added is NOT scanned. You’re free to keep credentials in scratch files as long as you don’t stage them.
  • A file you’ve staged then modified gets scanned in the staged form, not the working-tree form. The scanner sees what git commit would commit.

The walk only includes files that are part of THIS commit, so it’s fast even on huge repos. A typical commit touches a few files and the scan is under 50 ms.

What happens on a finding

Stderr:

$ git commit -m "add staging config"
keyhog: 1 finding blocked this commit

src/config/staging.env:14:12  CRITICAL  stripe-secret-key
                              sk_live_4eC39H...Tcd3Hc

Options:
  1. Remove the credential from src/config/staging.env, then commit again.
  2. Use a placeholder + load the real value from env at runtime.
  3. If this is a false positive, run keyhog with --no-suppress-test-fixtures
     or add to .keyhog.toml suppressions.

$

Exit code is 1, so git aborts the commit and your work-in-progress stays in the index. Fix the file, git add the fix, and commit again.

When you really need to commit anyway

git commit --no-verify

That bypasses the hook. KeyHog logs nothing about it; that’s your prerogative. Use it sparingly. A team norm of --no-verify for “trust me” commits defeats the point of the hook.

A better pattern when a legitimate-looking credential needs to ship (e.g. a public OAuth client_id that vendor docs say to commit):

  1. Add its sha256 hash to .keyhog.toml:
    [suppress]
    hashes = ["sha256:abc123..."]
    
  2. Commit the suppression file alongside the credential.
  3. The next commit sees the hash and skips it.

This way the next contributor doesn’t have to learn the trick.

Performance

Pre-commit scans are designed for sub-100 ms latency on typical commits. If yours feels slow:

  • keyhog daemon start (unix only). The daemon holds the compiled scanner in memory; pre-commit invocations bypass the ~3 s cold start. Latency drops from ~3 s to ~30 ms.
  • --fast skips the entropy / ML scorer. Removes ~20% of detectors but ~50% of scan time. Worth it for the pre-commit path; the full scan still runs in CI.

Uninstall

keyhog hook uninstall

Removes the KeyHog .git/hooks/pre-commit file if it carries the generated KeyHog marker. If you hand-edited the hook, keyhog hook uninstall refuses to touch it - clean it up by hand. For the pre-commit framework, delete the keyhog stanza from .pre-commit-config.yaml and run pre-commit clean.

CI integration

A CI step that catches leaked credentials before they ship. Four patterns: GitHub Actions, GitLab CI, CircleCI, and Drone CI / generic shell. All exit non-zero on findings, which is what CI wants.

GitHub Actions

# .github/workflows/secrets.yml
name: secrets

on:
  push:
    branches: [main]
  pull_request:

jobs:
  keyhog:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # fetch full history; required for --git-history scans
      - name: Install keyhog
        run: curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
      - name: Scan repo
        run: ~/.local/bin/keyhog scan . --format sarif > keyhog.sarif
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: keyhog.sarif

The upload-sarif action posts findings to the Security -> Code scanning tab. if: always() makes sure findings show up even when the scan exits non-zero.

To scan ONLY git history (the more common pre-merge gate):

      - name: Scan history
        run: ~/.local/bin/keyhog scan --git-history . --format sarif > keyhog.sarif

GitLab CI

# .gitlab-ci.yml
keyhog:
  stage: test
  image: ubuntu:24.04
  before_script:
    - apt-get update -qq && apt-get install -y curl libhyperscan-dev
    - curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
  script:
    # Exits non-zero on findings, which fails the job and gates the MR.
    - ~/.local/bin/keyhog scan . --format sarif --output keyhog.sarif
  artifacts:
    when: always           # keep the report even when the scan fails the job
    paths:
      - keyhog.sarif

The job’s exit status gates the merge request (keyhog exits non-zero on findings) and the SARIF is kept as a downloadable artifact. Note: GitLab’s artifacts:reports:sast expects GitLab’s own SAST JSON schema, not SARIF, so to surface findings in the MR security dashboard you must convert the SARIF to that format (e.g. a SARIF-to-GitLab-SAST converter step) - pointing reports:sast directly at a SARIF file does not work.

CircleCI

# .circleci/config.yml
version: 2.1

jobs:
  keyhog:
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - run:
          name: Install keyhog
          command: |
            curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
            echo 'export PATH="$HOME/.local/bin:$PATH"' >> $BASH_ENV
      - run:
          name: Scan repo
          command: keyhog scan . --format sarif --output keyhog.sarif
      - store_artifacts:
          path: keyhog.sarif
          destination: keyhog.sarif

workflows:
  build:
    jobs:
      - keyhog

Drone CI / generic shell

# .drone.yml
pipeline:
  keyhog:
    image: alpine:3.20
    commands:
      - apk add --no-cache curl
      - curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh
      - $HOME/.local/bin/keyhog scan .

Same pattern works in Jenkins, Buildkite, Woodpecker, Concourse, or any CI that can run a shell. The only two lines that matter are the install command and the scan command.

Pinning a version

The install scripts pull the latest release by default. For reproducible CI, pin a specific version:

curl -fsSL ...install.sh | KEYHOG_VERSION=v0.5.37 sh

Update the pin via a Renovate / Dependabot config or just bump it by hand when a new release lands.

Caching the install

The install script downloads a ~25 MB binary. On GitHub Actions, cache it across runs:

      - name: Cache keyhog
        id: cache-keyhog
        uses: actions/cache@v4
        with:
          path: ~/.local/bin/keyhog
          key: keyhog-${{ runner.os }}-v0.5.37
      - name: Install keyhog
        if: steps.cache-keyhog.outputs.cache-hit != 'true'
        run: curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | KEYHOG_VERSION=v0.5.37 sh

The if: cache-hit != 'true' guard is what makes the cache pay off - without it the install step re-downloads on every run and the cache does nothing. Bump both the cache key and the pinned KEYHOG_VERSION together when you upgrade.

Scan history once per release, not per PR

A full git-history scan is the right thing to run on main post-merge and on release tags, but it’s overkill for every PR. A typical setup:

| Trigger | Scan | Cost |
| --- | --- | --- |
| Pull request | keyhog scan . (working tree) | ~5 s on a typical repo |
| Push to main | keyhog scan --git-history . | ~30 s on a year-old repo, scales linearly |
| Release tag | keyhog scan --git-history . --verify | Adds 100 ms per finding for live verification |

The PR scan keeps the dev feedback loop fast. The post-merge history scan catches anything that slipped through pre-commit + PR review. The release scan verifies what’s live, useful for the changelog (“rotated these N credentials before shipping”).

Failure modes worth knowing

  • Forked PR + secret credentials: GitHub Actions doesn’t expose org secrets to forked-PR runners, so a verifier endpoint that needs authentication won’t run. Findings still get reported as unverified; that’s correct behavior.
  • Shallow clones: actions/checkout defaults to fetch-depth: 1, which only fetches HEAD. A --git-history scan against a shallow clone sees zero commits. Set fetch-depth: 0 if you want history.
  • LFS files: without LFS enabled, keyhog reads the LFS pointer file, not the contents. To scan LFS-stored binaries, enable LFS in checkout (lfs: true) so the real files are on disk when the scanner runs.

CLI reference

keyhog scan [PATH]

The main subcommand. Scans PATH (default: current directory) and emits findings. Exit code: 0 clean, 1 findings present, 2 runtime error.
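A wrapper script that shells out to keyhog can branch on these codes. A minimal Python sketch with hypothetical helper names; the mapping itself comes straight from the sentence above:

```python
# Documented exit codes for `keyhog scan`.
EXIT_MEANINGS = {0: "clean", 1: "findings present", 2: "runtime error"}

def interpret_exit(code: int) -> str:
    """Translate an exit code into the documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def should_fail_build(code: int) -> bool:
    """Fail on findings and on scanner errors alike; only 0 passes."""
    return code != 0
```

In practice the code would come from something like subprocess.run(["keyhog", "scan", "."]).returncode. Treating 2 as a failure matters: a scanner that crashed has not told you the repo is clean.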

Input selection

| Flag | Effect |
| --- | --- |
| <PATH> | Positional path. File or directory. |
| --stdin | Read from stdin instead. 10 MiB cap. |
| --exclude-paths <GLOB>... | Skip files matching glob. Space-separated list, repeatable. |
| --git-staged | Scan git-staged files only (pre-commit mode). |
| --git-history <PATH> | Walk the added-line patches of each commit (default: HEAD only). |
| --git-diff <BASE_REF> | Scan only added lines since BASE_REF. |
| --docker-image <IMAGE> | Scan a saved Docker image archive. |
| --s3-bucket <BUCKET> | Scan an S3 bucket. Use --s3-prefix to narrow. |
| --url <URL>... | Fetch + scan one or more HTTPS URLs (JS/source-map/WASM/text). |

Output

| Flag | Effect |
| --- | --- |
| --format <text\|json\|jsonl\|sarif> | Output format. Default text. The machine formats (json/jsonl/sarif) are findings-only: the banner/summary go to stderr (or are omitted), so stdout stays a clean parseable document. |
| --output <FILE> | Write the report to FILE instead of stdout. |
| --stream | Stream a one-line redacted preview per finding to stderr as they’re found; the full formatted report still lands on stdout/--output after verification. |
| --show-secrets | Show full credentials. Default redacts. |
| --min-confidence <FLOAT> | Only emit findings >= confidence. 0.0..=1.0. |
| --dogfood | Surface suppression telemetry in output. |

Verification

| Flag | Effect |
| --- | --- |
| --verify | Call each detector’s verify endpoint. |
| --proxy <URL> | Route outbound traffic (verifier calls and network sources such as --url) through a proxy (http://burp:8080, socks5://...). off disables all proxying (incl. env). |
| --insecure | Skip TLS cert verification on outbound traffic (don’t use outside a lab). Env: KEYHOG_INSECURE_TLS=1. |

Performance

| Flag | Effect |
| --- | --- |
| --fast | Skip entropy + ML scorer. ~50% faster, ~20% fewer detectors. |
| --daemon | Force daemon route. Unix only. |
| --no-daemon | Force in-process scan even if daemon is up. |
| --timeout <SECONDS> | Hard per-scan deadline. |

Detector tuning

| Flag | Effect |
| --- | --- |
| --detectors <DIR> | Use the detector TOMLs in DIR instead of the embedded corpus. To run a curated subset, copy the detector TOMLs you want into a directory and point --detectors at it (there is no per-ID enable/disable flag). Env: KEYHOG_DETECTORS. |
| --no-suppress-test-fixtures | Show findings on bundled example credentials. |
| --baseline <FILE> | Compare against a prior scan; show only new. |
| --hide-client-safe | Drop every CLIENT-SAFE finding (Sentry DSN, Stripe pk_*, Mapbox pk., PostHog phc_, etc.) before reporting. Use this for bug-bounty / exfiltration-impact workflows where keys public by design are noise. |

Environment variables

| Variable | Effect |
| --- | --- |
| KEYHOG_BACKEND=gpu\|simd\|cpu\|auto | Force a scan backend instead of letting the auto-router choose. |
| KEYHOG_NO_GPU=1 | Short-circuit GPU init at hardware-probe time. The scanner runs as if no GPU adapter existed. Use this when Metal / CUDA init blocks on a given host (Apple Silicon Mac configurations have reproduced this) and you want predictable startup. |
| KEYHOG_PER_CHUNK_TIMEOUT_MS=<MS> | Attach an Instant deadline to every chunk scan. Default unset = no timeout (original behaviour). Recommend 30000 for production scans where bounded latency matters more than scan completeness. |
| KEYHOG_THREADS=<N> | Pin the rayon worker count. Default = physical-core count. |
| KEYHOG_DETECTORS=<DIR> | Override the auto-discovered detector directory. |
| KEYHOG_CACHE_DIR=<DIR> | Override the regex / database cache location (must sit under $HOME or /tmp/keyhog-cache-<uid> for safety). |

keyhog detectors

Lists every detector in the embedded corpus.

keyhog detectors                  # human-readable, grouped by service
keyhog detectors --json           # one JSON object per detector
keyhog detectors --json | jq length
891

keyhog explain <DETECTOR_ID>

Pretty-print a single detector’s TOML. Includes keywords, patterns, companion rules, and verification endpoint.

keyhog explain stripe-secret-key

keyhog watch [PATH]

Daemon-mode subcommand that watches a directory for file changes and re-scans on each one. Useful for IDE-side feedback. Unix only.

keyhog watch src/                 # watch the source tree
keyhog watch                      # watch the current directory

keyhog tui [PATH]

Interactive ratatui dashboard. Streams findings in a severity-colored list while a status panel reports files scanned, throughput, GPU backend, and pattern count. q or Esc to quit; any keypress exits once the scan completes.

keyhog tui .                          # live dashboard on CWD
keyhog tui demo --throttle-ms 200     # paced scan for demo recordings
keyhog tui --feed-depth 500 .         # keep more findings in the feed
keyhog tui --max-files 20 src/        # short fixed-duration loops

| Flag | Default | Effect |
| --- | --- | --- |
| --max-files N | 0 | Stop after scanning N files. 0 = unlimited. |
| --feed-depth N | 200 | Rolling window of recent findings shown. |
| --throttle-ms MS | 0 | Sleep MS between files; demo / recording knob. |

Exit code matches keyhog scan: 0 clean, 1 findings present.

keyhog hook <install|uninstall>

Manages the git pre-commit hook. See Pre-commit hook for usage.

keyhog daemon <start|stop|status> (Unix only)

The daemon holds the compiled scanner in memory so pre-commit / IDE-save invocations skip the ~3 s cold start.

| Subcommand | Effect |
| --- | --- |
| daemon start | Bind the Unix socket, accept connections. |
| daemon stop | Tell the running daemon to shut down. |
| daemon status | Print uptime, scans served, active scans. |

Default socket path: $XDG_RUNTIME_DIR/keyhog.sock, or ~/.cache/keyhog/server.sock if XDG_RUNTIME_DIR is unset.
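The fallback rule reads as follows in Python. This is a sketch with a hypothetical function name; treating an empty XDG_RUNTIME_DIR the same as an unset one is an assumption:

```python
import posixpath

def default_socket_path(env, home: str) -> str:
    """Pick the daemon socket path from the environment (Unix paths only)."""
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if runtime_dir:                       # assumption: empty string counts as unset
        return posixpath.join(runtime_dir, "keyhog.sock")
    return posixpath.join(home, ".cache", "keyhog", "server.sock")
```

Note the two file names differ (keyhog.sock vs server.sock), which is worth knowing when hunting for a stale socket.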

On Windows: every daemon subcommand prints “daemon mode is unix-only” and exits non-zero. Daemon support via named pipes is tracked but not yet implemented.

keyhog diff <FILE_A> <FILE_B>

Compare two scan outputs (JSON or NDJSON). Useful for “did this PR introduce a new finding?” gating in CI.

keyhog scan . --format json > baseline.json
git checkout pr-branch
keyhog scan . --format json > pr.json
keyhog diff baseline.json pr.json

keyhog calibrate

Show or update the per-detector Bayesian (Beta-α/β) calibration counters. Used to teach the scorer that detector X has produced N true positives and M false positives in your environment so its confidence is adjusted on future scans.

keyhog calibrate --show                       # print current counters
keyhog calibrate --tp stripe-secret-key       # record one TP
keyhog calibrate --fp generic-api-key         # record one FP
keyhog calibrate --tp aws-access-key --show   # record + print

Pass --cache <PATH> to point at a non-default counter file (the default lives under $XDG_DATA_HOME/keyhog/).
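How the scorer turns the counters into a confidence adjustment is not spelled out here. One common reading of Beta-α/β counters is the posterior mean under a uniform Beta(1, 1) prior; the sketch below computes that for a given TP / FP tally. The prior and the helper name beta_mean are assumptions for illustration, not KeyHog's actual formula:

```shell
# Hypothetical illustration: posterior mean of a Beta distribution after
# observing TP true positives ($1) and FP false positives ($2).
# Assumption: uniform Beta(1, 1) prior, so alpha = TP + 1 and beta = FP + 1.
beta_mean() {
  awk -v tp="$1" -v fp="$2" 'BEGIN { printf "%.3f\n", (tp + 1) / (tp + fp + 2) }'
}
```

Under these assumptions `beta_mean 8 2` prints 0.750 and `beta_mean 0 0` prints 0.500: a detector with no history starts neutral, and each recorded TP or FP nudges the estimate.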

keyhog backend

Prints hardware probe results: which SIMD ISA was detected, whether the Hyperscan / CUDA / wgpu backends initialized, and the per-tier GPU thresholds in effect.

keyhog backend

keyhog scan-system

Recursive system-wide credential audit. Walks every mounted drive (skipping pseudo-filesystems and, by default, network mounts), discovers every .git repository on the way, and runs the same scan + git-history pipeline that keyhog scan --git-history uses on each. Honors a hard --space <N> ceiling on total bytes scanned so it cannot accidentally exhaust a CI runner. Does NOT honor .gitignore unless --respect-gitignore is passed (an attacker stashing leaked keys would .gitignore them).

keyhog scan-system                                  # local mounts, git history on
keyhog scan-system --include-network                # also walk NFS/SMB/sshfs
keyhog scan-system --space 50G --no-git-history     # cap + skip history walks
keyhog scan-system --lockdown                       # forbids --include-network

keyhog completion <bash|zsh|fish|powershell>

Emits a shell-completion script. Pipe into the shell’s completion location.

keyhog completion bash > /etc/bash_completion.d/keyhog
keyhog completion zsh > "${fpath[1]}/_keyhog"
keyhog completion fish > ~/.config/fish/completions/keyhog.fish
keyhog completion powershell >> $PROFILE

Global flags

These work on any subcommand:

| Flag | Effect |
|---|---|
| --version | Print version + build info, exit. |
| --help | Print help for the current subcommand. |
| --verbose | More log output to stderr. |
| --no-color | Disable ANSI colors. Auto-detects TTY otherwise. |

Exit codes

KeyHog uses exit codes to signal scan outcomes. Stable across versions; consumers (CI gates, pre-commit hooks, IDE plugins) can rely on them.

| Exit | Meaning |
|---|---|
| 0 | Scan completed, zero findings. |
| 1 | Findings present, NONE confirmed live (unverified, or verified-dead). |
| 2 | User error: unknown CLI flag, .keyhog.toml parse failure, bad --baseline. |
| 3 | System error: I/O failure, source-backend failure, or detector-corpus audit failure. |
| 4 | Health/self-test failure: keyhog doctor unhealthy, keyhog repair could not restore a working binary, keyhog backend self-test failed. |
| 10 | LIVE credentials confirmed (a --verify scan where the vendor API accepted a found secret) - the highest-severity gate. Also returned by keyhog update --check when a newer release exists. |
| 11 | Scanner thread panicked. The finding count is NOT trustworthy - investigate, don’t ship. Distinct from 2/3 so CI can tell a code bug from a config error. |
| 130 | Interrupted (SIGINT / Ctrl-C). |

0 (clean)

Use case: a CI step like keyhog scan . exits 0 when the working tree is clean. The job stays green.

With --verify, the exit code escalates when a credential is confirmed live: a found secret the vendor API accepts exits 10, while a found secret that verifies dead (or wasn’t verified) exits 1. So gating ONLY on live credentials needs no JSON parsing - branch on the exit code:

keyhog scan . --verify
case $? in
  0)  echo "clean" ;;
  10) echo "LIVE credentials present - block + page" ; exit 1 ;;
  1)  echo "findings, none confirmed live" ;;
esac

1 (findings present)

The most common non-zero. CI fails, pre-commit hook blocks the commit, PR check turns red. Findings get printed to stdout in whatever format --format selected.

Exit 1 means findings exist but, under --verify, none were confirmed live. A scan that confirms a live credential exits 10 instead (see below) - so “findings but all dead” vs “some live” is just 1 vs 10, no JSON parsing required.

2 (user error)

Things that exit 2:

  • Unknown CLI flag.
  • .keyhog.toml parse error.
  • Detector load failure for a specific TOML (with a stderr warning; the rest of the scan continues but exits 2 at the end).
  • --baseline <FILE> where FILE doesn’t exist or isn’t valid JSON.
  • A source backend failure (e.g. --git-history on a non-git dir).
  • Network error during --verify is NOT a 2; it’s a verification-error marker per finding. The affected findings count as unverified, so the scan exits 1 unless another finding is confirmed live.

Stderr carries the error message. Stdout may have partial output depending on where the error happened.

3 (system error)

A failure the operator can’t fix by correcting a flag: an I/O error, a source backend that couldn’t read its input, or a detector-corpus audit failure. Distinct from 2 (user error) so a pipeline can retry/route differently. Stderr carries the cause.

4 (health / self-test failure)

Returned by the maintenance subcommands, not by scan: keyhog doctor when the install fails its end-to-end self-test, keyhog repair when it could not restore a working binary, and keyhog backend when its self-test fails. A health monitor can treat 4 as “binary present but not trustworthy.”

10 (live credentials, or update available)

The highest-severity scan outcome: a --verify scan where the vendor API accepted a found secret - it is real and exfil-capable right now. Gate hard on this:

rc=0                                   # start clean so a stale $rc cannot leak in
keyhog scan . --verify || rc=$?        # capture the exit code without tripping set -e
if [ "$rc" -eq 10 ]; then echo "::error::live credential confirmed"; exit 1; fi

keyhog update --check reuses 10 to mean “a newer release exists” (exit 0 = already current), so a self-update cron can branch on it.
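A self-update cron job can branch on those two codes. A minimal sketch; update_action is a hypothetical helper, and treating every other code as an error is an assumption:

```shell
# Hypothetical helper: map the exit code of `keyhog update --check`
# to an action label for a cron wrapper.
update_action() {
  case "$1" in
    0)  echo "current" ;;           # already on the newest release
    10) echo "update-available" ;;  # a newer release exists
    *)  echo "error" ;;             # assumption: anything else is a failure
  esac
}

# Usage in a cron wrapper:
#   rc=0; keyhog update --check || rc=$?
#   [ "$(update_action "$rc")" = "update-available" ] && echo "newer release found"
```

Keeping the mapping in one function avoids confusing the update meaning of 10 with the live-credential meaning that keyhog scan --verify gives the same code.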

11 (scanner panic)

A panic inside a scanner thread (regex compile bug, OOM in a windowed chunk, etc.). The scan was incomplete; the count of findings emitted is NOT trustworthy. CI should treat this as “investigate” rather than “ship anyway because exit 11 != 1”.

The reason this is 11 rather than 2:

  • A panic is a code bug worth surfacing distinctly.
  • Some CIs (older Jenkins, certain shell wrappers) collapse 2 with “command not found” or other ambient errors. 11 is unambiguous.
  • A future expansion of error categories (12 = OOM-killed, 13 = timeout-exceeded, etc.) is possible without renumbering existing codes.

Composing in shell

set -e
keyhog scan .                # exit 1 stops the shell here

Or to handle the non-zero explicitly:

rc=0                                   # start clean so a stale $rc cannot leak in
keyhog scan . --verify || rc=$?
case "$rc" in
  0)     echo "clean" ;;
  1)     echo "findings (none live) -> opening PR comment" ;;
  10)    echo "LIVE credentials -> block + page on-call" ;;
  2)     echo "user error (bad flag/config) -> failing build" ;;
  3)     echo "system error -> retry / investigate" ;;
  11)    echo "scanner panic -> paging on-call" ;;
  130)   echo "interrupted" ;;
  *)     echo "unknown exit $rc" ;;
esac

What you can’t do

  • No --exit-zero flag. KeyHog deliberately does not provide a way to lie to CI about findings. If you need to override (e.g. “this finding is accepted, ship anyway”), suppress it by hash in .keyhog.toml instead. The exit code then reflects truth: there are no UN-suppressed findings, so it’s 0.

Environment variables

KeyHog reads a small set of environment variables. Each one is documented here with default, effect, and a typical use case.

Install / location

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_INSTALL | ~/.local/bin (sh) / %LOCALAPPDATA%\keyhog\bin (ps1) | Where install.sh / install.ps1 drops the binary. |
| KEYHOG_VERSION | (latest release with assets) | Pin install.sh / install.ps1 to a specific tag. install.sh now walks back through /releases?per_page=10 to find the most recent release with binaries attached, surviving a one-off release-workflow failure without forcing an explicit pin. |
| KEYHOG_VARIANT | auto (cuda on hosts with the full CUDA toolkit, cpu otherwise) | Force the cuda or cpu variant of the Linux build during install. cpu is the WGPU + SIMD default which already dispatches on any compatible adapter via Vulkan; cuda adds the native-CUDA backend on hosts with libcuda + the matching toolkit. |

Cache

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_CACHE_DIR | ~/.cache/keyhog (Linux) / ~/Library/Caches/keyhog (macOS) | Where the Hyperscan compiled database is cached across runs. Must be a user-owned dir; cold start (~3 s) becomes warm start (~150 ms) when the cache hits. |
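A CI job that wants to persist the compiled database between runs needs to know which directory to cache. A minimal sketch of the lookup described above; keyhog_cache_dir is a hypothetical helper, and treating every non-macOS platform as Linux is an assumption, since only those two defaults are documented:

```shell
# Hypothetical helper: resolve the cache directory to persist between CI runs.
# Assumption: KEYHOG_CACHE_DIR, when set and non-empty, wins over the per-OS default.
keyhog_cache_dir() {
  if [ -n "${KEYHOG_CACHE_DIR:-}" ]; then
    printf '%s\n' "$KEYHOG_CACHE_DIR"
    return 0
  fi
  case "$(uname -s)" in
    Darwin) printf '%s\n' "$HOME/Library/Caches/keyhog" ;;   # macOS default
    *)      printf '%s\n' "$HOME/.cache/keyhog" ;;           # Linux default (assumed for others)
  esac
}
```

Pointing the CI cache step at the printed path is what lets later runs hit the warm-start case.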

Version output

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_VERSION_FULL | (unset) | Set to 1 to make keyhog --version also print the full hardware probe (SIMD ISA, GPU adapter, CUDA / WGPU availability). Hidden by default because the probe initializes wgpu/Vulkan (~200 ms + a 134 MB MAP_SHARED segment), which makes keyhog --version 9× slower than keyhog --help. The same probe runs unconditionally for keyhog backend. |

Backend selection

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_BACKEND | auto | One of auto, cpu_fallback, simd_cpu, gpu, megascan. Overrides hardware-probe selection. Mostly useful for benchmarking. |
| KEYHOG_NO_GPU | (unset) | If set to 1, skip the GPU probe entirely. Useful for CI where the runner reports a software-rendered GPU and you’d rather force CPU. Mirrored by CI=true/GITHUB_ACTIONS=true auto-detection. |
| KEYHOG_REQUIRE_GPU | (unset) | If set to 1, refuse to run when no usable GPU adapter is detected. Useful for self-hosted runners where a regression on GPU initialization should fail loudly, not silently fall back to CPU. |
| KEYHOG_GPU_KERNEL | auto | Override the GPU dispatch kernel pick. Mostly a development knob for benchmarking individual kernel implementations. |

Threading + chunking

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_THREADS | physical-core count | Pin the rayon worker pool. Useful inside containers where available_parallelism() reports the wrong value. |
| KEYHOG_PER_CHUNK_TIMEOUT_MS | (unset) | Hard deadline per chunk scan in milliseconds. Recommended 30000 for production scans where bounded latency matters more than scan completeness. |
| KEYHOG_DETECTORS | (workspace default) | Override the auto-discovered detector directory path. |
| KEYHOG_TRUSTED_BIN_DIR | (unset) | Restrict which binary paths the daemon will execute when forking for sub-scans (defense-in-depth knob). |
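Inside a container, a launcher script can derive KEYHOG_THREADS from the cgroup CPU quota instead of trusting the host core count. A minimal sketch, assuming cgroup v2 (where /sys/fs/cgroup/cpu.max holds "<quota> <period>", or "max <period>" when no quota is set); threads_from_cpu_max is a hypothetical helper, and rounding a fractional quota up to a whole thread is a choice made here, not something KeyHog prescribes:

```shell
# Hypothetical helper: turn the contents of a cgroup v2 cpu.max file into a
# thread count. $1 is the file contents, $2 is the fallback when no quota is set.
# The body runs in a subshell so its variables do not leak into the caller.
threads_from_cpu_max() (
  fallback="$2"
  set -- $1                      # word-split "<quota> <period>" into $1 and $2
  if [ "${1:-}" = "max" ] || [ -z "${2:-}" ]; then
    echo "$fallback"             # unlimited quota, or file missing / empty
  else
    echo $(( ($1 + $2 - 1) / $2 ))   # ceiling division: 1.5 CPUs -> 2 threads
  fi
)
```

A launcher would call it as `export KEYHOG_THREADS="$(threads_from_cpu_max "$(cat /sys/fs/cgroup/cpu.max 2>/dev/null)" "$(nproc)")"`, which assumes nproc from coreutils is available in the image.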

Daemon (Unix only)

| Variable | Default | Effect |
|---|---|---|
| XDG_RUNTIME_DIR | (set by login session) | Daemon socket location: $XDG_RUNTIME_DIR/keyhog.sock. Fallback is ~/.cache/keyhog/server.sock. |
| KEYHOG_DOGFOOD | (unset) | Enable dogfood telemetry capture in the daemon. Equivalent to passing --dogfood on every connecting client. |

Verification

| Variable | Default | Effect |
|---|---|---|
| HTTPS_PROXY | (unset) | Standard env var. Routes verifier traffic through a proxy. keyhog scan --proxy <URL> overrides. |
| KEYHOG_PROXY | auto | off disables proxy resolution entirely (useful for air-gapped builds where HTTPS_PROXY is set but no proxy is reachable). Also disables DNS pinning when off, so don’t set it to off casually. |
| NO_PROXY | (unset) | Standard env var. Hostnames to bypass the proxy on. |

Logging

| Variable | Default | Effect |
|---|---|---|
| RUST_LOG | keyhog=warn | Tracing filter. keyhog=debug for verbose detector / suppression telemetry. keyhog::routing=trace to see per-chunk backend selection. |
| RUST_BACKTRACE | (unset) | Standard. 1 for short backtrace on panic; full for full. |

Verification (extra)

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_INSECURE_TLS | (unset) | If set, accept self-signed TLS certs on verifier traffic. Equivalent to --insecure. Use only in lab environments. |
| KEYHOG_ALLOW_SCRIPT_VERIFY | (unset) | Permit the script: verifier kind (which would otherwise be refused as a remote-execution risk). Opt-in for trusted detector corpora only. |
| KEYHOG_LIVE_VERIFY | (unset) | Internal: enables a special live-verify mode used by the end-to-end test harness. |
| KEYHOG_LIVE_AWS_ACCESS_KEY_ID, KEYHOG_LIVE_AWS_SECRET_ACCESS_KEY, KEYHOG_LIVE_GITHUB_PAT | (unset) | Test-only credentials the verifier integration tests probe against real upstream services. Never set these outside the maintainer test environment. |

Testing / development

| Variable | Default | Effect |
|---|---|---|
| KEYHOG_ADVERSARIAL_STRICT | (unset) | Tighten the adversarial-runner test gate. Used by CI’s strict-runners job. |
| KEYHOG_ADVERSARIAL_FULL_LOG | (unset) | Emit per-fixture log for every adversarial corpus row (slow; debugging only). |
| KEYHOG_ENCODING_STRICT | (unset) | Strict mode for the encoding-evasion runner. |
| KEYHOG_PATH_SHAPE_STRICT | (unset) | Strict mode for the path-shape runner. |
| KEYHOG_ENTROPY_STRICT | (unset) | Strict mode for the entropy-bypass runner. |
| KEYHOG_UNICODE_STRICT | (unset) | Strict mode for the unicode-homoglyph runner. |
| KEYHOG_COMMENT_STRICT | (unset) | Strict mode for the comment-evasion runner. |
| KEYHOG_COMPOUND_STRICT | (unset) | Strict mode for the compound-bypass runner. |
| KEYHOG_LINE_LEN_STRICT | (unset) | Strict mode for the line-length runner. |
| KEYHOG_MULTI_STRICT | (unset) | Strict mode for the multi-pattern runner. |
| KEYHOG_NOISE_STRICT | (unset) | Strict mode for the noise-injection runner. |
| KEYHOG_CHUNK_IDS | (unset) | Restrict the scan to a comma-separated list of chunk IDs. Used by adversarial bisection. |

What KeyHog deliberately does NOT read

  • KEYHOG_* flags for changing detector behavior. Detector tuning is via .keyhog.toml only, so the same scan reproduces across developer machines without env-var contamination.
  • Anything named KEYHOG_API_KEY / KEYHOG_TOKEN. The scanner never reports findings upstream; there’s no service to authenticate to.
  • KEYHOG_TELEMETRY_*. There is no telemetry. Findings stay local.

Precedence

When two sources disagree:

  1. CLI flag (--proxy <URL>)
  2. .keyhog.toml in the repo root
  3. Environment variable
  4. Compiled default

So keyhog scan --proxy http://a beats HTTPS_PROXY=http://b. Between the two proxy variables, KEYHOG_PROXY=off switches proxy resolution off, so HTTPS_PROXY is then ignored. A lower-precedence source wins only when nothing above it is set.
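The ordering amounts to "the first source that is set wins". A minimal sketch of that rule; resolve_setting is a hypothetical helper that takes the four candidate values in precedence order, and treating an empty string as "not set" is an assumption:

```shell
# Hypothetical helper: pick the highest-precedence value that is set.
# Arguments, in order: CLI flag, .keyhog.toml value, environment variable,
# compiled default. Assumption: an empty string means "not set".
resolve_setting() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # nothing was set at any level
}
```

For example, `resolve_setting "http://a" "" "http://b" ""` prints http://a, while `resolve_setting "" "" "http://b" ""` falls through to http://b.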

Contributing

KeyHog is open source. The repo is at github.com/santhsecurity/keyhog. Bug reports, feature requests, detector additions, and PRs are all welcome.

Quick paths

| What | How |
|---|---|
| Report a bug | Open an issue with a minimal reproducer. |
| Report a security issue | Email security@santh.dev (PGP key in SECURITY.md). Don’t open a public issue. |
| Add a detector | Drop a TOML in detectors/, add a contract in crates/scanner/tests/contracts/. PR. |
| Fix an FP | Find the regex / shape gate that’s firing. Tighten it. Add a negative test that would catch the regression. |
| Document something undocumented | Edit docs/src/*.md. The site rebuilds on push to main. |

Repo layout

keyhog/
  crates/
    core/             # Detector spec, raw match types, severity, embed
    scanner/          # The scanner engine itself
    sources/          # Filesystem, git, web, docker, S3 backends
    verifier/         # Live credential verification
    cli/              # The `keyhog` binary, subcommand dispatch
  detectors/          # 891 service-specific detector TOMLs
  crates/cli/data/
    suppressions/     # Test-fixture suppression list, baked into the binary
  docs/               # This documentation (mdBook source)
  install.sh          # Linux/macOS install script
  install.ps1         # Windows install script
  vendor/vyre/        # GPU literal-set scanner (vendored, separate repo)

The Rust workspace is at the root; each crates/ member is a standalone crate with its own Cargo.toml.

Building

git clone https://github.com/santhsecurity/keyhog
cd keyhog
cargo build --release -p keyhog
./target/release/keyhog --version

For development:

cargo build               # debug build, ~30 s
cargo test -p keyhog-scanner --lib

Adding a detector

The contract gate enforces that every shipped detector catches what it claims to catch. The flow:

  1. Write the detector TOML at detectors/<service>-<thing>.toml. Use an existing detector as a template; the schema is documented in Detectors.

  2. Write the contract at crates/scanner/tests/contracts/<id>.toml. At minimum, include:

    • 2 positives (env-var shape, quoted shape)
    • 2 negatives (placeholder, EXAMPLE token in the body)
    • 2 evasions (real-world shapes you’ve seen in actual leaks: Bearer header, JSON body, URL query param, multi-line config)
    • A perf block with fixture_bytes + max_microseconds
    • A scale block with fixture_bytes + min_findings + max_seconds
  3. Run the contract gate locally:

    cargo test -p keyhog-scanner --test contracts_runner
    

    Must pass before you push. CI re-runs it with strict env vars set, which exercises a more aggressive adversarial corpus.

  4. Open a PR. A maintainer reviews the detector for:

    • Service is real and not duplicated by an existing detector.
    • Keywords are short, distinctive, and unlikely to FP.
    • Regex captures the right group and rejects obvious placeholders.
    • Verify endpoint (if present) is read-only and won’t trigger side-effects on the upstream service.

Adding a suppression filter

If you find an FP cluster of 5+ findings that all share a shape, the right fix is a new shape filter rather than 5 individual suppressions. The flow:

  1. Reproduce. Get the FPs into a .envseal-sealed corpus or a public sanitized fixture you can commit.

  2. Write the filter. Add to crates/scanner/src/pipeline/postprocess/suppression.rs alongside the existing looks_like_* functions. The function takes &str (the credential) or Option<&str> (the path) and returns bool.

  3. Wire it up. Decide if it’s Tier A (universal) or Tier B (generic / entropy only). See should_suppress_named_detector_finding for the existing wiring. Tier A is rare; default to Tier B unless the shape is structurally impossible for any service-anchored credential.

  4. Add a unit test. Inputs that should trip the filter (5+ variants), inputs that should not (3+ legitimate credentials).

  5. Run the contract gate. New filters must not break any contract evasion. If they do, the contract is right and the filter is wrong. Tighten the filter.

Style

  • Rust edition 2021, MSRV 1.89.
  • cargo +stable fmt + cargo +stable clippy -- -D warnings. CI enforces both.
  • File-size cap: 500 lines per .rs file. Larger files get split.
  • No #[ignore] on tests. A flaky test gets fixed or deleted, not silenced.
  • No todo!() / unimplemented!() / panic!("not implemented") in shipped code paths.
  • Comments explain WHY, not WHAT. Names carry WHAT.

Tests

cargo test -p keyhog-core --lib          # detector spec / embed
cargo test -p keyhog-scanner --lib       # engine
cargo test -p keyhog --lib               # CLI / orchestrator
cargo test -p keyhog --test e2e_binary   # full-binary end-to-end
cargo test -p keyhog-scanner --test contracts_runner   # per-detector contract gate
cargo test -p keyhog-scanner property::scanner_fuzz    # proptest

The first four run in under 30 s. The contracts and property suites take 1-2 minutes. CI runs all of them; locally, the first four are the usual feedback loop.

License

MIT. By contributing, you agree that your contributions are licensed under the MIT license too.

Changelog

The authoritative changelog lives in the repo root as CHANGELOG.md. Versions follow Semantic Versioning – patch bumps for bug fixes, minor for new features, major for breaking changes.

The full file is rendered below.


Changelog

All notable changes to KeyHog. Versions follow Semantic Versioning.

v0.5.37 - 2026-05-29 - Mirror benchmark: F1 0.7815 to 0.8896 (closes the gap to betterleaks 0.892)

Headline: precision 0.9716, recall 0.8203, F1 0.8896 against the SecretBench mirror corpus (15,000 fixtures). Net delta vs v0.5.35 is +0.108 F1. Precision sits 5.9pp above the betterleaks 0.913 floor, and F1 lands 0.003 below their 0.892. Precision was the headline lever for this release: 154 docs-example FPs killed, over-broad detector arms narrowed, decode-through composition tightened, and confidence floors applied only when the value is not algorithmically a placeholder.
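The headline F1 follows directly from the precision and recall figures. The helper below just recomputes it with the standard harmonic-mean formula; f1_score is an illustrative name, not part of the benchmark tooling:

```shell
# F1 is the harmonic mean of precision (P) and recall (R): 2PR / (P + R).
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.4f\n", 2 * p * r / (p + r) }'
}

f1_score 0.9716 0.8203   # prints 0.8896
```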

Detection truth (engine)

  • entropy fallback: lift the blanket 32/40/64/128-char hex blacklist and the strict-mode >10-char hex drop ONLY when a credential keyword is on the same line (apiKey: <hex>, TOKEN=<hex>). Outside an anchor the blacklist holds, protecting sha256-hex / npm-lock-integrity / k8s-resource-uid negatives. Closes the generic-high-entropy-string R=0.38 hole.
  • generic-secret regex: add . to the keyword-separator class so api.key= / private.key= / client.secret= in .properties, helm-values, terraform locals are recognised alongside _/-.
  • decode-through: compose decoded-placeholder + uniform-base64-blob into every generic emit (decoded chunks no longer surface placeholders or known image-digest shapes).
  • confidence: skip the known_prefix_confidence_floor boost when the value is itself a placeholder word (closes 154 docs-example FPs driven by service-prefix-only fixtures).
  • decode_structure feature wired into the entropy-fallback emit path (the rebuilt 42-feature ML model now sees decode topology on the same code path the rule engine uses).
  • ML confidence: 112 named detectors that silently fell below the 0.3 floor are now correctly surfaced.
  • sources: UTF-16LE wide-string extractor lifts credentials from Windows .NET / PE binaries.

Detector regex narrowings

  • scaleway-api-key: drop the bare secret[_-]key arm.
  • flickr + iterable + consul: drop generic alternations (-256 FPs).
  • lambdatest + saltstack: drop generic alternations.
  • etherscan-api-key: drop the bare apikey=<32hex> arm that claimed every random hex digest.
  • aws-session-token / aws-ecr-token / anrok / applitools / appsmith / appwrite / avalara / avaya / aweber / libsql: word-boundary prefix + quote-aware terminator.

ML pipeline

The training pipeline (ml/) was rebuilt in-tree alongside the Rust serve path: ml/features.py mirrors ml_features.rs byte-for-byte, ml/decode_structure.py mirrors decode_structure.rs, and ml/parity_check.py is a Rust-to-Python parity harness using a new compute_features_with_config test export. ml/train_classifier.py produces an MoE classifier with fast-sigmoid activations serialized into weights.bin (model version moe-v1-83688a6a6cb77f70). Decode-structure becomes feature #42; Rust scorer bumped to 42 features end-to-end.

Build / packaging

  • Lean CI build profile: cargo build --no-default-features --features ci produces a Hyperscan-free, GPU-free, verify-free, TUI-free binary with near-instant cold start.
  • vendor: adopt vyre 0.6.1 (latest upstream) + migrate keyhog to wgpu 25.
  • GHCR: publish image per release + maintain floating major tag.

Release / install

  • self-update: verify the release binary minisign signature before the self-replace, and fail closed on missing signatures (was silent bypass).
  • Action / docs: wire the documented baseline input into the scan, fix broken adoption recipes (install URL, docker image, exit codes), and fix Action version pins through v0.5.35.

Test infrastructure

  • secretbench: base64-aware + escape-aware overlap promotes 92 mis-counted TPs that overlapped escaped or base64-decoded values.
  • adversarial oracle: scan_text unescapes \u{XXXX} Rust unicode escapes so wrapper fixtures with escape syntax exercise the same byte stream the scanner sees in real files.
  • gates: line / modularity cap demoted to advisory warn; stale filesystem_read gate dropped after the read.rs to read/ split.

v0.5.36 - skipped (folded into v0.5.37)

The 0.5.36 version was committed (chore(release): v0.5.36) but never tagged or shipped; the work between 0.5.35 and 0.5.36 is consolidated above into the 0.5.37 release notes.

v0.5.35 - 2026-05-28 - Adversarial wrapper harness: 216 to 152 wrapper-test misses (30% reduction)

Detector regex fixes

  • deepnote-api-credentials pattern 2: matches multi-word suffix sequences (DEEPNOTE_API_KEY=, DEEPNOTE_SECRET_TOKEN=). The prior [_\s]*(API|TOKEN|KEY) could only span one of API / TOKEN / KEY, so the doubled-up env-var forms missed entirely. Group renumbered from 2 to 1.
  • cloudsmith-api-key pattern 2: separator class now includes = and :. CLOUDSMITH_API_KEY="value" and cloudsmith.api.key=value failed under the prior [\s"']+-only separator.
  • aws-lambda-function-url-secret pattern 2: path class includes /. Multi-segment paths like /api/v1?token=... now match.
  • five9-api-credentials: regex rewritten. The prior five9apikey= literal missed every real env-var form. New pattern allows separators and covers api_key / client_secret / secret / token / key / password suffixes.
  • fedex-api-credentials: SECRET-suffix pattern promoted from a companion (only fires if anchored by another primary pattern) to a primary pattern. fedex.api.secret=... on its own now surfaces.
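The deepnote fix is easy to reproduce with grep -E. The two patterns below are simplified stand-ins, not the shipped detector regexes: the trailing = anchor, the SECRET alternative, and the probe helper are assumptions added so the difference between a single suffix word and a repeated suffix group is visible:

```shell
# Simplified stand-ins for the deepnote patterns (hypothetical, not the shipped regex).
old_pattern='DEEPNOTE[_[:space:]]*(API|TOKEN|KEY)='                  # one suffix word only
new_pattern='DEEPNOTE([_[:space:]]*(API|SECRET|TOKEN|KEY))+='        # one or more suffix words

# Print "match" or "miss" for pattern $1 against text $2.
probe() {
  if printf '%s\n' "$2" | grep -Eq "$1"; then echo "match"; else echo "miss"; fi
}

probe "$old_pattern" 'DEEPNOTE_API_KEY=abc123'        # miss: _KEY is left over before the =
probe "$new_pattern" 'DEEPNOTE_API_KEY=abc123'        # match
probe "$new_pattern" 'DEEPNOTE_SECRET_TOKEN=abc123'   # match
```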

Contract body-length fixes

Contracts whose positive credential bodies were 1-2 chars outside the detector regex’s length bounds (no detector changes):

  • fedex pos#0, pos#1: 31 to 32 chars (regex needs {32,64}).
  • finicity pos#1: 31 to 32 chars (regex needs {32,40}).
  • footprint pos#0: 30 to 32 chars (regex needs exactly 32).
  • mistral pos#1: 33 to 32 chars (Mistral spec is exactly 32).

Diagnostic

KEYHOG_ADVERSARIAL_FULL_LOG=<path> writes the full wrapper-harness failure list at panic time, so a 100+ detector regression can be diffed end-to-end without re-running the test. The first 50 entries still appear inline in the panic message.

Known remaining 152 misses (v0.5.36 target)

  • Group B (~144 misses): helicone, keystonejs, line, paloalto, snowflake, sourcetree, tower, deepnote pos#0. Canonical positives surface (contracts_runner green) but wrapped variants do not. Root cause sits between the scanner’s cheap-filter window and the extract phase: the AC literal-set returns a keyword position the regex engine cannot consume the preceding byte from. Tracing continues in v0.5.36.
  • Group A.3 (~24 misses): bandwidth pos#1 and vertexai pos#0, pos#1 have positive text that is not actually a credential (ClientID=... with no Bandwidth keyword; bare env-var name GOOGLE_APPLICATION_CREDENTIALS instead of the service-account JSON). Both need contract redesign.

v0.5.34 - 2026-05-27 - Multi-TB perf: adaptive GPU dispatch + shard batching, monolith splits, more silent fallbacks surfaced

Multi-TB scanning: RAM-adaptive GPU shard batching

gpu_literal_phase1 slices each coalesced batch into ~2-MiB wgpu shards (the WebGPU 65 535-workgroups-per-dimension cap), then batches MAX_SHARDS_PER_GPU_BATCH of them into a single command encoder. The cap was a fixed 64; it now adapts to host RAM:

| Host RAM | Shards / batch | 1-GiB-scan sequential batches |
|---|---|---|
| < 16 GiB | 64 | >= 8 |
| 16-32 GiB | 128 | 4 |
| >= 32 GiB | 256 | 2 |

The 96-GiB-RAM RTX-5090 workstation case drops from 8 sequential batched dispatches to 2 on a 1-GiB scan, cutting GPU pipeline-drain stalls roughly 4x. The 64-shard floor stays the safe default for small hosts where 256 shards x ~2 MiB host-side packing memory would press against the orchestrator’s RAM budget.
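The batch counts follow from the shard arithmetic: a 1-GiB scan at ~2 MiB per shard is 512 shards, divided by the per-batch cap. A sketch of that calculation; the function names are hypothetical, the 2-MiB shard size is taken as exact, and a host with exactly 32 GiB is assumed to fall in the ">= 32 GiB" tier:

```shell
# Hypothetical mirror of the RAM tiers: shards per GPU batch for host RAM in GiB ($1).
# Assumption: exactly 32 GiB falls in the ">= 32 GiB" tier.
shards_per_batch() {
  if [ "$1" -lt 16 ]; then echo 64
  elif [ "$1" -lt 32 ]; then echo 128
  else echo 256
  fi
}

# Sequential batches for a scan of $1 MiB on a host with $2 GiB of RAM,
# assuming exactly 2 MiB per shard.
sequential_batches() (
  shards=$(( ($1 + 1) / 2 ))                          # ceil(scan_mib / 2)
  per_batch=$(shards_per_batch "$2")
  echo $(( (shards + per_batch - 1) / per_batch ))    # ceil(shards / per_batch)
)

sequential_batches 1024 8     # 1 GiB scan, 8 GiB host: prints 8
sequential_batches 1024 96    # 1 GiB scan, 96 GiB host: prints 2
```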

Multi-TB scanning: VRAM-adaptive GPU dispatch

MEGASCAN_INPUT_LEN was a fixed 256 MiB constant; the new megascan_input_len() sizes the pre-compiled RulePipeline input cap to host VRAM:

| VRAM detected | Input length | Adapter examples |
|---|---|---|
| >= 24 GiB | 1 GiB | RTX 4090 / 5090, A100 / H100 |
| 12 - 23 GiB | 512 MiB | RTX 3090, RTX 4080, M-Max |
| 8 - 11 GiB | 256 MiB | RTX 3080, RTX 4070, M-Pro |
| < 8 GiB / Unknown | 128 MiB | iGPU, software, no-GPU CI runner |

On a 5090 host that means 4x larger GPU dispatches and roughly 75% fewer per-dispatch launches across a multi-TB scan. The orchestrator’s BATCH_BYTES_BUDGET tracks the same value with a RAM / 8 safety clamp so peak resident memory (pipeline_depth x batch_bytes_budget) never crosses 1/8 of system RAM regardless of detected VRAM. The legacy MEGASCAN_INPUT_LEN = 256 MiB constant is preserved as a backwards-compatible alias.
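The tiering and the clamp can be sketched as two small lookups. These are hypothetical shell mirrors of the behaviour described above, not the Rust megascan_input_len() itself; whole-GiB inputs and reading the clamp as min(input cap, RAM / 8) are assumptions:

```shell
# Hypothetical mirror of the VRAM tiers: input cap in MiB for VRAM in whole GiB ($1).
# Pass an empty string (or nothing) when VRAM is unknown.
megascan_input_len_mib() (
  vram="${1:-}"
  if [ -z "$vram" ] || [ "$vram" -lt 8 ]; then echo 128
  elif [ "$vram" -lt 12 ]; then echo 256
  elif [ "$vram" -lt 24 ]; then echo 512
  else echo 1024
  fi
)

# Batch budget in MiB for VRAM $1 (GiB) and host RAM $2 (GiB).
# Assumption: the clamp is min(input cap, RAM / 8).
batch_budget_mib() (
  cap=$(megascan_input_len_mib "$1")
  clamp=$(( $2 * 1024 / 8 ))
  if [ "$cap" -lt "$clamp" ]; then echo "$cap"; else echo "$clamp"; fi
)
```

Under these assumptions `batch_budget_mib 32 96` prints 1024 (the VRAM tier decides), while `batch_budget_mib 32 4` prints 512 (the RAM clamp decides).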

No more silent fallbacks (continued)

  • S3 source: text-content-type objects that fail UTF-8 decode now log a warn with the valid-up-to byte offset; previously return Ok(None) silently dropped the chunk.
  • Git history walk: tree-entry, blob-header, and blob-read failures now log at debug instead of hitting a silent continue. UTF-8 decode failures on git blobs stay silent (legitimate binary blob).
  • GPU MoE confidence: staging-buffer recv and map_async errors now warn before falling back to CPU MoE; previously the double .ok()?.ok()? swallowed both failures silently.

Internal refactors (no user-visible change)

  • crates/scanner/src/pipeline/postprocess/suppression.rs (1368 lines) split into 7 focused submodules (api, decision, decode, doc_markers, path_filter, shape, mod). All under the 500-line cap.
  • crates/sources/src/filesystem/read.rs (1054 lines) split into 6 focused submodules (raw, bytes, window, decode, tests, mod). All under the cap.
  • crates/scanner/src/hw_probe.rs (978 lines) split into 7 focused submodules (thresholds, tier, select, banner, platform, tests, mod). All under the cap.
  • alphabet_filter.rs SIMD entry points now carry proper # Safety docs (caller-must-have-AVX2 / SSE2 / NEON), satisfying -D clippy::missing_safety_doc after they were promoted to pub for the prefilter-robustness proptest.

New keyhog tui subcommand

Interactive ratatui + crossterm dashboard. Severity-colored finding feed, current-file banner, files-done / bytes / throughput / findings stats, GPU backend + pattern-count panel. q / Esc / Ctrl-C / any-key-after-complete all exit cleanly. New --throttle-ms flag paces the worker so demo recordings actually capture findings streaming in. Gated behind a default-on tui feature so portable builds (no-default-features + portable) skip the ratatui + crossterm dependency closure.

keyhog tui is the surface the README / docs demo now records (vhs); the demo target moved from keyhog explain to keyhog tui demo.

Critical bugfix: orchestrator self-scan suppression no longer hides user findings

The orchestrator post-scan filter dropped every finding whose path segment was literally “keyhog” (case-insensitive), plus a flat tests/ / fixtures/ / benches/ / detectors/ segment match. That was originally a self-scan helper for keyhog developers, but applied unconditionally it hid findings from anyone with:

  • A repo or folder named keyhog/ (forks, vendored copies, this-demo-recording-tree, Reddit posters’ demo dirs).
  • A tests/ directory in their tree, regardless of what was being scanned.

The fix is two-step: drop the “keyhog” segment match outright, and gate the remaining tests/ / fixtures/ / benches/ / detectors/ match on a marker check that the file path is a descendant of keyhog’s own source repo root (detected once per process via a root Cargo.toml scan for crates/scanner + crates/cli + the keyhog package name). --no-suppress-test-fixtures now also disables the segment filter so audits see both suppression layers’ contents.

Hardening: more silent GPU fallbacks now emit one-shot warnings

  • MegaScan rule-pipeline compile reject (was tracing::debug!).
  • MegaScan runtime dispatch error.
  • MegaScan match-count exceeding cap.
  • MegaScan batch exceeding MEGASCAN_INPUT_LEN.
  • No GPU backend handle on MegaScan dispatch.
  • warm_backend MegaScan path: now checks rule_pipeline readiness (was only checking gpu_stack_usable).
  • Trigger-pattern GPU collection error / missing matcher / missing backend.
  • verifier: OOB-required spec without an active OOB session (was a silent degrade to HTTP-only).
  • sources/git: HEAD blob walk failure (silently downgraded every finding’s severity to git/history).
  • subcommands/tui::worker: file-read failure (was unwrap_or_default(); now logs at debug and skips with accurate files-done counter).

All GPU degrade paths respect KEYHOG_REQUIRE_GPU=1 (hard-fail) and KEYHOG_NO_GPU=1 (silence the warning).

Performance: hot-path env-var caches

KEYHOG_BACKEND (in select_backend), KEYHOG_GPU_KERNEL (in the literal-set path), and KEYHOG_NO_GPU / KEYHOG_REQUIRE_GPU (in the GPU degrade helpers) are now cached at process start instead of re-syscalling per chunk. Measured ~3% scan-throughput win on Apple Silicon against the 30k-file linux-clone corpus.

Dedup: shared modules consolidate cross-file copies

  • New engine::gpu_postprocess with fold_overlapping_same_pid_inplace + attribute_matches_to_chunks (5 unit tests). Replaces two byte-identical phase-1 tails in gpu_ac_phase1 + gpu_literal_phase1.
  • New cli::format with format_bytes (4 unit tests). Replaces two near-identical copies in scan_system + tui::render that had drifted (one capped at GiB, the other handled TiB).
  • Engine scan.rs split into scan / extract / process modules (was 835 LOC; now 291 / 393 / 191, all under the 500-line cap).
  • TUI subcommand split into tui/{mod, render, worker}.rs (was 644 LOC; now 236 / 318 / 123).
  • Orchestrator explicit_backend_override collapsed into a thin re-export of scanner::hw_probe::forced_backend_from_env so the alias table (gpu / literal-set / mega-scan / regex-nfa / etc.) lives in one place.
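For the format_bytes consolidation above, a single helper that handles every unit up to TiB might look like the following. The output format (one decimal place, binary unit suffixes) is an assumption; the real cli::format output may differ.

```rust
/// Formats a byte count with binary units, capped at TiB.
pub fn format_bytes(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KiB", "MiB", "GiB", "TiB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    // Step up one unit at a time; stop at the largest known unit.
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{bytes} B")
    } else {
        format!("{value:.1} {}", UNITS[unit])
    }
}
```

Keeping the unit table in one place is what prevents the GiB-vs-TiB drift described above.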

Smaller fixes

  • PatternSpec::default() + Chunk::from(String|&str) so the test suite compiles without 35 per-site explicit field fills.
  • engine::coalesce_chunks re-exported as a pub API so the scanner property-test fixtures build.
  • Stale unused-imports cleanup in scan.rs after the module split.

v0.5.33 - 2026-05-27 - WGPU AC kernel actually works (use_subgroup_coalesce=false everywhere)

Critical: WGPU hosts now actually run scans on the GPU

The v0.5.32 workaround moved every GPU backend onto the AC kernel path, but the AC kernel still passed use_subgroup_coalesce=true on WGPU (the original gate was backend_id != "cuda"). Runtime testing on Apple Silicon M4 Pro with vyre v0.4.2 confirmed that the AC kernel hits the same “_vyre_match_leader is referenced before binding” lowering rejection on the wgpu path as the literal_set program does on the CUDA path: the lowering gap is in vyre’s substrate-neutral pre-emit step, not in the driver-specific emitter, so wgpu has the same blocker.

use_subgroup_coalesce is now hardcoded false on every backend. We lose the ~32x atomic-contention reduction the subgroup form would have provided (Innovation I.17), but recall and correctness are preserved; the plain append_match path produces bit-identical match output, just with more atomic pressure on the shared count buffer.

This fixes silent CPU fallback on every WGPU host: macOS Apple Silicon, macOS Intel, Windows, and Linux without CUDA. Before this release, those hosts probed a GPU at startup, compiled the GpuLiteralSet + AC matchers, then EVERY scan failed at GPU dispatch and silently degraded to SIMD. The v0.5.31 visibility warning caught this on the macbook self-test and the actual scan path; the fix here closes the underlying bug. Verified end-to-end on Apple Silicon M4 Pro: vyre_ac_kernel PASS (backend=wgpu).

v0.5.32 - 2026-05-27 - vyre depth: AC kernel becomes the default GPU scan path + honest GPU self-test

Deep vyre: AC kernel becomes the default GPU scan path

  • gpu_literal_phase1.rs previously routed all WGPU hosts through the literal_set GpuLiteralSet program, gating the AC-kernel workaround to CUDA only. The vyre canonical pre-emit lowering actually rejects the subgroup form (subgroup_ballot + subgroup_shuffle) emitted by append_match_subgroup BEFORE driver-specific emission, so WGPU hosts hit the same “_vyre_match_leader is referenced before binding” rejection and silently dropped to CPU. The kernel select is now AC-by-default for every GPU backend; KEYHOG_GPU_KERNEL=literal-set is the diagnostic opt-in for bisection / vyre IR work.
  • keyhog backend --self-test gained a new vyre_ac_kernel step that compiles a one-detector scanner, runs a scan through scan_coalesced_gpu_ac_phase1, and verifies the planted "needle" literal surfaces a phase-1 hit on the live GPU backend. Reports the active backend id (cuda / wgpu) on PASS.
  • The existing vyre_literal_set self-test no longer reports red FAIL when it hits the documented lowering gap; it surfaces yellow KNOWN with a one-line explanation that scans use the AC kernel instead. Same exit code as before for any OTHER literal_set failure (genuine GPU regression still hard-fails).
  • crates/scanner/src/gpu.rs gained vyre_ac_kernel_self_test() + VyreAcKernelSelfTest so the diagnostic CLI can surface the match count and backend id rather than just PASS/FAIL.

v0.5.31 - 2026-05-27 - no-silent-GPU-fallback enforcement + banner CUDA/WGPU split + SHA256 verification + UX fixes

Coherence: startup banner now distinguishes CUDA vs WGPU

  • The ⚡ KeyHog ...| backend=Gpu startup banner used to collapse the CUDA path and the WGPU fallback under the same Gpu label, so a user on an NVIDIA box couldn’t tell whether the CUDA-feature build was actually using CUDA or had silently dropped to WGPU. Banner now reads ... | backend=Gpu | gpu=cuda (or gpu=wgpu, gpu=none), pulling the live VyreBackend::id() of the acquired backend. New CompiledScanner::gpu_backend_label() exposes the same info to any downstream consumer (daemon health endpoint, keyhog backend diagnostics, future GH-Action telemetry).

No silent GPU fallbacks

  • scanner/src/gpu.rs (MoE inference path): when the GPU MoE context fails to initialise on a host that has a GPU, we now eprintln! a loud warning instead of tracing::debug!-ing into the void. The user paid for the GPU; they need to know we couldn’t use it. KEYHOG_NO_GPU=1 silences the warning (operator opted in to CPU). KEYHOG_REQUIRE_GPU=1 exits with code 2 instead of falling back.
  • scanner/src/engine/backend.rs (scan dispatch path): when scan_chunks_with_backend_internal is called with ScanBackend::Gpu or ScanBackend::MegaScan but the compiled scanner has no GPU literals or no GPU backend, the same loud one-shot warning fires via warn_on_gpu_degradation and the same env-var contract applies. The hot-path branch was previously silent; on every scan a user with a probe-detected-but-runtime-unavailable GPU would have sat at SIMD throughput thinking they were on the GPU path.
  • A OnceLock guard makes the warning fire exactly once per process regardless of how many chunks pass through (CI scanning thousands of files doesn’t spam stderr).
  • scanner/src/engine/compile.rs (CUDA acquisition path): when the CUDA factory fails on a host that has libcuda.so or /proc/driver/nvidia (NVIDIA userland present but broken or version-mismatched), we eprintln a one-shot warning instead of debug-logging into the void. The wgpu fallback is the documented “5-10x slower” path; users installing the CUDA variant on NVIDIA hardware must know when they’ve silently dropped to WGPU.
  • scanner/src/engine/gpu_forced.rs (runtime GPU dispatch failure): deny_silent_gpu_degrade previously only panicked when KEYHOG_BACKEND forced GPU. The unforced default case was silent. Now a runtime degradation (vyre IR lowering rejecting a program, transient CUDA driver error, exceeded shard cap) fires a one-shot stderr warning. Surfaced by running keyhog backend --self-test on a real CUDA host, which exposed a vyre IR lowering issue that rejects the GpuLiteralSet program (“variable _vyre_match_leader is referenced before binding”). The AC kernel path used by the actual scan flow on CUDA hosts is a documented workaround for the same vyre limitation; WGPU-only hosts hitting the lowering rejection would previously have degraded silently.
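The env-var contract and the once-per-process guard described in this list can be sketched as one decision function. The enum, the function name, and the precedence of KEYHOG_REQUIRE_GPU over KEYHOG_NO_GPU when both are set are assumptions of this sketch; the real code exits with code 2 where this returns HardFail.

```rust
use std::sync::OnceLock;

#[derive(Debug, PartialEq, Eq)]
pub enum Degrade {
    /// KEYHOG_REQUIRE_GPU=1: refuse to fall back.
    HardFail,
    /// First degradation in this process: the warning was printed.
    Warn,
    /// Warning already printed, or KEYHOG_NO_GPU=1 opted in to CPU.
    Quiet,
}

/// Decides what to do when a GPU path degrades to CPU/SIMD.
pub fn on_gpu_degrade(require_gpu: bool, no_gpu: bool, warned: &OnceLock<()>) -> Degrade {
    if require_gpu {
        return Degrade::HardFail;
    }
    if no_gpu {
        return Degrade::Quiet;
    }
    // `set` succeeds only for the first caller, so the warning fires once
    // no matter how many chunks pass through.
    if warned.set(()).is_ok() {
        eprintln!("warning: GPU unavailable, falling back to CPU/SIMD");
        Degrade::Warn
    } else {
        Degrade::Quiet
    }
}
```

In a real binary the OnceLock would be a process-wide static; it is a parameter here so the behaviour can be exercised in isolation.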

SHA256 checksum verification (rustup-style)

  • release.yml emits a .sha256 file alongside each binary asset using portable sha256sum / shasum across the three runner OSes.
  • install.sh and install.ps1 download the .sha256 alongside the binary, compute the local hash, and refuse to install on mismatch. When the checksum file is absent (pre-v0.5.31 release tags), both installers skip verification with a dim log line rather than failing, so the change is backward-compatible.

UX

  • install.sh on Linux + NVIDIA hosts no longer prints “Detected NVIDIA NVIDIA GeForce RTX 5090” (the double “NVIDIA” came from concatenating our own prefix with nvidia-smi --query-gpu=name output, which already prefixes “NVIDIA”).
  • crates/core/src/report/text.rs:273: the “No real secrets - but N example/test keys suppressed.” reporter line used a literal em dash. Replaced with a comma so the user-facing output matches the no-em-dash global rule.
  • crates/core/src/report/text.rs:238: ClientSafe severity remediation text “Public by design (client bundle key) - verify scope restrictions.” had the same em dash; replaced with a semicolon.

v0.5.30 - 2026-05-27 - premium interactive installer + CUDA-on-Linux release variant + star tracker

New: premium interactive installer

  • install.sh + install.ps1 rewritten. The Linux / macOS installer now detects host state (OS, arch, NVIDIA GPU, loadable libcuda.so, existing keyhog install, PATH config), summarizes what it would do, and (when stdin is a TTY) prompts for the variant + optional post-install steps. Curl-pipe-sh keeps working: a non-TTY stdin drops to auto-detect mode and prints a tip for the interactive path.
  • New modes: --diagnose prints a full host + binary status report and changes nothing. --repair re-downloads the right variant for the current host even when the existing binary still runs (useful after CUDA userland is installed and the WGPU build should be swapped for the CUDA build). --uninstall removes the binary but deliberately leaves shell-rc PATH entries and completions in place so the installer doesn’t silently edit user-owned files.
  • Post-install wizard (when interactive): opt-in prompts for adding the install dir to your shell PATH (with explicit append to .bashrc / .zshrc / config.fish), installing shell completions, wiring keyhog as a Claude Code pre-tool hook, and wiring keyhog as a git pre-commit hook in the current directory. Defaults are conservative; nothing happens without an explicit “y”.
  • Overrides: KEYHOG_VARIANT=cuda / =cpu force a variant. --yes / -y accepts every default for non-interactive runs. --no-color disables ANSI output for log capture. KEYHOG_VERSION and KEYHOG_INSTALL env-vars work as before.

New: CUDA-on-Linux release variant

  • keyhog-linux-x86_64-cuda ships as a 5th release asset. Built with --features cuda after provisioning CUDA 12.6 toolkit on the GH ubuntu runner via Jimver/cuda-toolkit@v0.2.19. The installer prefers this asset on Linux hosts where nvidia-smi reports a GPU AND libcuda.so is loadable (via ldconfig or the four common path probes). On the same host with no CUDA, the installer keeps picking the existing default keyhog-linux-x86_64 build (WGPU + SIMD). Apple Silicon, Intel Mac, and Windows hosts keep their existing assets; Apple Silicon hosts get an explicit “Metal GPU acceleration coming soon” preface so users understand the WGPU + SIMD tradeoff up front.
  • install.sh falls back gracefully when the -cuda asset is not yet published for the resolved tag: it tries the CUDA asset, on 404 it logs the fallback and downloads the base asset instead. This means the script stays compatible with older release tags that predate the CUDA asset.

Tests

  • tests/install/scenarios.sh is a 12-scenario harness that mocks uname / nvidia-smi / ldconfig / curl per scenario via a sandbox dir prepended to PATH. Covers: CUDA host, macOS arm64, macOS x86_64, KEYHOG_VARIANT=cuda / =cpu overrides, unsupported platform, --help / --uninstall mode dispatch. The two scenarios that require simulating “NVIDIA but no libcuda” or “no GPU at all” skip on a real CUDA host (the script’s path-fallback probes leak through the sandbox) and run for real on no-CUDA CI runners.
  • End-to-end smoke test on real Apple Silicon hardware: the install path was verified over SSH against an M-series macbook, upgrading v0.5.28 to v0.5.29 cleanly and reporting the Metal-coming-soon note. --repair and --diagnose were exercised on the upgraded macbook to confirm post-install behavior.

Metrics / repo hygiene

  • Daily star tracker. metrics/stars.json records {date, count} snapshots; .github/workflows/record-stars.yml runs at 07:17 UTC, calls the GitHub API for the current count, dedupes per date, and commits if changed. README gains a live stars badge linking to star-history.com. wafrift gets the same tracker (see santhsecurity/wafrift).
  • README backend table accuracy. Removed the stale “cudagrep NVMe -> VRAM DMA” claim. The actual code routes the GPU path through vyre (WGPU cross-platform, optional CUDA feature) with no cudagrep or warpstate references anywhere in the tree.

v0.5.29 - 2026-05-27 - HAR (HTTP Archive) auto-expansion + http/wire docs + Bazel scaffolding untracked

New: HAR auto-expansion

  • keyhog scan capture.har now parses the HAR 1.2 JSON and expands it into one chunk per request and one chunk per response. Each chunk’s source_type is wire:har:request or wire:har:response, so a bug-bounty hunter can filter findings to outbound credentials only:
    keyhog scan capture.har --format json | \
      jq '.[] | select(.location.source == "wire:har:request")'
    
    The file_path for each finding is <har-path>#<request-url>. New crates/sources/src/har.rs module; 4 unit tests covering positive expansion, non-HAR JSON, non-JSON binary, and malformed-JSON fallthrough. 4x max_size budget on cumulative request+response body bytes guards against decompressed-gigabyte DoS.
  • serde + serde_json promoted from optional (per-feature) to unconditional deps in keyhog-sources because the always-on filesystem path now depends on them. Removed redundant dep:serde / dep:serde_json from web / github / slack / s3 feature lists.
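The cumulative body budget mentioned for HAR expansion can be pictured as a small counter. This is a sketch only: the type and method names are hypothetical, and what the real har.rs does once the budget is exhausted is not specified here.

```rust
/// Tracks cumulative request + response body bytes against 4x `max_size`.
pub struct BodyBudget {
    remaining: u64,
}

impl BodyBudget {
    pub fn new(max_size: u64) -> Self {
        // saturating_mul avoids overflow for very large max_size values.
        Self { remaining: max_size.saturating_mul(4) }
    }

    /// Charges `len` bytes and returns true if they still fit;
    /// leaves the budget untouched and returns false otherwise.
    pub fn try_charge(&mut self, len: u64) -> bool {
        match self.remaining.checked_sub(len) {
            Some(left) => {
                self.remaining = left;
                true
            }
            None => false,
        }
    }
}
```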

Docs

  • New chapter: HTTP and wire scanning. Documents the existing --url flag (Web Source: JS / sourcemap / WASM routing + SSRF defenses), proxy + TLS policy (--proxy, KEYHOG_PROXY, KEYHOG_INSECURE_TLS), the stdin curl-pipe workflow, and the new HAR auto-expansion. Roadmap section calls out mitmproxy .mitm support, header/body provenance, live proxy mode, and WebSocket frame scanning as the next wire-scanning items.
  • docs/src/detectors.md documents the client-safe severity tier + client_safe = true per-pattern flag.
  • docs/src/reference/cli.md documents --hide-client-safe + the KEYHOG_NO_GPU / KEYHOG_PER_CHUNK_TIMEOUT_MS / KEYHOG_BACKEND / KEYHOG_THREADS / KEYHOG_DETECTORS / KEYHOG_CACHE_DIR env vars in one place.

Repo hygiene

  • Bazel scaffolding untracked. The 8 in-tree Bazel files (.bazelrc, .bazelversion, root + 5 per-crate BUILD.bazel, MODULE.bazel, MODULE.bazel.lock) were a 2026-05-21-throttle-driven PoC that never finished - every per-crate BUILD was a comment-only stub and MODULE.bazel was pinned to keyhog 0.5.7 while we ship 0.5.29 via cargo. Per the STANDARD prod-repo-doc-bleed rule, advertising a Bazel surface that doesn’t build anything is a stub-not-evasion lie. Files stay on disk for the day Bazel becomes load-bearing; .gitignore catches future Bazel scratch.

Detector tagging (client-safe)

  • clerk-api-key: publishable pk_live_* / pk_test_* - same shape as clerk-frontend-api-key from v0.5.28. Total client-safe-tagged patterns now: 9 across 8 detectors.

v0.5.28 - 2026-05-27 - KEYHOG_NO_GPU short-circuit + bare - stdin + more client-safe tags

Cross-platform / safety nets

  • KEYHOG_NO_GPU=1 now ACTUALLY bypasses the GPU stack. The v0.5.27 commit only short-circuited the compile-time CUDA/wgpu factory call. The MoE GPU context init runs lazily on the FIRST backend::get_gpu() call, and the hardware probe path (hw_probe.rs:82 -> gpu_probe -> backend::get_gpu) reaches it before compile() even runs. On hosts where Metal adapter request blocks for minutes (Apple M4 Pro / macOS 26.3 reproduction) the env var fired AFTER the user had already paid the stall. gpu_probe() now checks the env var BEFORE calling get_gpu(); on set, returns (false, None, None) so hw_probe reports gpu_available: false, MoE init never runs, and the scanner starts in ~10 ms.

CLI UX

  • keyhog scan - (bare dash positional) now reads from stdin. Grep / wc / curl convention. Previously errored with error: path '-' does not exist. keyhog scan - --stdin <<<... and keyhog scan - <<<... both work now; --stdin is no longer required when the path is -.
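A sketch of the input resolution, assuming a positional path string and a boolean for --stdin. The enum and function are illustrative; treating --stdin as taking precedence over a non-dash path is an assumption.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq, Eq)]
pub enum ScanInput {
    Stdin,
    File(PathBuf),
}

/// Maps the positional argument and the --stdin flag to an input source.
pub fn resolve_input(positional: &str, stdin_flag: bool) -> ScanInput {
    // A bare "-" means stdin, following the grep / wc / curl convention.
    if stdin_flag || positional == "-" {
        ScanInput::Stdin
    } else {
        ScanInput::File(PathBuf::from(positional))
    }
}
```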

Detector tagging (client-safe)

  • segment-write-key: write-only keys shipped in every analytics.js / Analytics SDK init. Server-side admin is segment-sources-api-token (stays high).
  • clerk-frontend-api-key: pk_live_* / pk_test_* shipped alongside <ClerkProvider> in Next.js / browser bundles. Clerk secret key is a separate detector.

Total client-safe-tagged detectors now: 7 (Sentry DSN both patterns, Mapbox pk., PostHog phc_, Mixpanel project token, Algolia search-only both patterns, Segment write key, Clerk frontend pk_*).

v0.5.27 - 2026-05-27 - client-safe severity tier + --hide-client-safe (bug-bounty workflow)

Feature

  • Severity::ClientSafe is a new tier below Low. Detectors with a per-pattern client_safe = true flag in their TOML force the finding to this tier regardless of the detector’s nominal severity. This release tags 6 patterns across 5 detectors: Sentry DSN (both patterns), Mapbox pk.eyJ (sk.eyJ stays critical), PostHog phc_ (phx_ stays high), Mixpanel project token, Algolia search-only key (admin key is a separate detector and stays critical).
  • --hide-client-safe CLI flag filters every ClientSafe finding before the reporter sees them. Bug-bounty / exfiltration-impact workflow: keyhog scan --hide-client-safe target/ shows only credentials that grant server-side access. Default scans keep the tier visible (CLIENT-SAFE stripe in text output) so a misconfigured publishable key wired into a server-only detector still surfaces.
  • KEYHOG_NO_GPU=1 env-var bypasses the CUDA / wgpu init path entirely and routes every chunk through the SIMD/CPU regex backend. Workaround for the Mac arm64 Metal stall surfaced during v0.5.26 dogfood when scanning identifier-dense source. Set in CI or in the user’s shell rc when GPU latency matters less than predictable scan times.
  • KEYHOG_PER_CHUNK_TIMEOUT_MS env-var attaches an Instant deadline to the public scan / scan_with_backend entry points. Any future pathological pattern that escapes the per-pattern MAX_INNER_LOOP_ITERS cap times out at the per-chunk boundary instead of hanging the whole scan. Default unset preserves prior behavior.
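The per-chunk deadline can be sketched as an optional Instant checked between units of work. Both functions are hypothetical; treating empty, zero, or unparsable values as "no deadline" is an assumption of this sketch.

```rust
use std::time::{Duration, Instant};

/// Parses a KEYHOG_PER_CHUNK_TIMEOUT_MS-style value into a deadline.
/// Unset, zero, or unparsable values mean "no deadline" (sketch assumption).
pub fn deadline_from(raw: Option<&str>, now: Instant) -> Option<Instant> {
    let ms: u64 = raw?.trim().parse().ok()?;
    if ms == 0 {
        return None;
    }
    now.checked_add(Duration::from_millis(ms))
}

/// Runs `step` until it reports completion or the deadline passes.
/// Returns false when the deadline cut the work short.
pub fn run_with_deadline(deadline: Option<Instant>, mut step: impl FnMut() -> bool) -> bool {
    loop {
        if deadline.is_some_and(|d| Instant::now() >= d) {
            return false;
        }
        if step() {
            return true;
        }
    }
}
```

With no deadline set, the loop behaves exactly as it did before, which matches the "default unset preserves prior behavior" contract.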

Schema

  • [[detector.patterns]] blocks accept a new client_safe: bool field (default false). Additive; existing detector TOMLs continue to parse unchanged. Per-pattern (not per-detector) so detectors that fire on both the public AND the secret prefix can tag only the public one.
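How a per-pattern flag overrides the detector's nominal severity, and how --hide-client-safe then filters, can be sketched as follows. The tier list is simplified and the type and function names are illustrative assumptions.

```rust
/// Simplified severity ladder; variant order gives ClientSafe < Low < ... < Critical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    ClientSafe,
    Low,
    Medium,
    High,
    Critical,
}

/// The only pattern field this sketch cares about.
pub struct Pattern {
    pub client_safe: bool,
}

/// A client-safe pattern forces the tier regardless of nominal severity.
pub fn effective_severity(nominal: Severity, pattern: &Pattern) -> Severity {
    if pattern.client_safe {
        Severity::ClientSafe
    } else {
        nominal
    }
}

/// Applies the --hide-client-safe filter before reporting.
pub fn visible(severity: Severity, hide_client_safe: bool) -> bool {
    !(hide_client_safe && severity == Severity::ClientSafe)
}
```

Because the flag lives on the pattern, a detector with both a public and a secret prefix can downgrade only the public one.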

Reporter changes

  • Text format: new CLIENT-SAFE 11-char label rendered in dim cyan (2;36) with a public-by-design remediation action (“Public by design (client bundle key) - verify scope restrictions.”). All severities right-justified to 11 chars so bordered boxes line up regardless of which tier fires.
  • SARIF: ClientSafe → SARIF note level (same as Info / Low).
  • Rule-filter / .keyhogignore severity-name: client-safe (kebab-case, matches the new serde rename_all).

v0.5.26 - 2026-05-27 - Mac arm64 hang fix (var-ref-concat regex DFA stall) + Windows UNC path strip + repo-hygiene gitignore

Cross-platform

  • Mac arm64 keyhog scan hang on identifier-dense source. Cross-platform dogfood on Apple M4 Pro / macOS 26.3 / portable build (no Hyperscan) reproduced a 6+ minute stall on a 171-byte input: var token = circleCiScan.Flag("token", "X").Required().Envar("X").String(). Root cause is the var-ref-concat regex in multiline::config::has_var_ref_concat_line - the {1,8}-bounded alternation drives regex 1.12’s lazy-DFA construction into a quadratic loop on aarch64-apple-darwin. Linux x86_64 portable runs the same input in 0.6 s. Fix: cheap precheck - if the line contains no +, bail before the regex (the pattern requires at least one + to match, so this is correctness-preserving). Adds KEYHOG_PER_CHUNK_TIMEOUT_MS env-var deadline as a belt-and-suspenders backstop on the public scan / scan_with_backend entry points so any future pathological pattern caps out instead of hanging the whole scan.
  • Windows UNC verbatim-prefix strip. Every finding’s location.file_path rendered as \\?\C:\Users\... (Rust’s std::fs::canonicalize always returns the extended-length form on Windows). Editors don’t jump-to-file on the verbatim form and the prefix leaks through JSON output as "\\\\?\\C:\\...". Added pub(crate) display_path(&Path) -> String in keyhog-sources::filesystem that strips the \\?\ prefix on Windows; the underlying PathBuf we use for I/O keeps the UNC form so >260-char paths still resolve. Wired through eight chunk-emit sites (filesystem.rs windowed mmap + buffered fallback + plain file + archive entries text/binary; binary/mod.rs ghidra decompiled + strings + section/strings).
  • Cross-platform detector-dir discovery. auto_discover_detectors hardcoded /usr/share/keyhog/detectors and /usr/local/share/keyhog/detectors which silently no-op on Windows. Wrapped the Unix paths in cfg!(unix) and added dirs::data_dir() / dirs::data_local_dir() lookups so Windows users get %APPDATA%\keyhog\detectors / %LOCALAPPDATA%\keyhog\detectors discovery. Embedded detectors remain the default; the dir paths are only consulted when a user supplies a custom detector set.
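The verbatim-prefix strip from the second item can be sketched on plain strings. The real display_path takes a &Path and runs on Windows only; this version takes &str so it behaves the same on any host, and its handling of the \\?\UNC\ form is an added assumption, not something the release notes describe.

```rust
/// Strips the Windows verbatim prefix for display purposes only.
/// The path used for I/O should keep its original form.
pub fn display_path(raw: &str) -> String {
    if let Some(rest) = raw.strip_prefix(r"\\?\UNC\") {
        // Assumption: \\?\UNC\server\share is shown as \\server\share.
        format!(r"\\{rest}")
    } else if let Some(rest) = raw.strip_prefix(r"\\?\") {
        rest.to_string()
    } else {
        raw.to_string()
    }
}
```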

Repo hygiene

  • Untrack coordination / plan / audit scratch files. Per the new Santh STANDARD prod-repo doc bleed rule, standalone repos like santhsecurity/keyhog track exactly README + SPEC + CHANGELOG + docs/. The 31 internal coordination files (coordination/ round briefs, ROUNDS.md, TESTING_PROGRAM.md, KEYHOG_LINUX_QUALITY_PROGRAM.md, WAVE10_AGENT_PUSH.md, GAP_FINDINGS.toml, TODO.md) were untracked from git and added to .gitignore. Files stay on disk via the backup santhsecurity/Santh monorepo - they just stop polluting the prod repo a crates.io / GitHub-Pages reader sees. Extended .gitignore with WAVE*.md, *_AUDIT*.md, *_PROGRAM.md, plan.md, .audits/, plans/ patterns so future scratch files are caught at write-time.

Build / test

  • build_scanner_config: pub(crate) → pub. Four integration tests under crates/cli/tests/unit/orchestrator/build_scanner_config_*.rs import the function and need it externally visible. Was a pre-existing breakage in cargo test --workspace --no-run that CI didn’t catch because the failing tests aren’t in the per-crate --lib subset CI runs.
  • exclude_paths_parses_from_cli Rust-1.83 fix. Old assertion Some(&["a.txt"[..]]) produced &[str; 1] which Rust 1.83+ rejects as an unsized array element. Rebuilt as a Vec<&str> collected from the Vec<String> field.

v0.5.25 - 2026-05-27 - cross-platform fixes (Windows build, basename \ separators, UTF-16 BOM decode) + contract recall (412 → 52 regressions restored via shape-filter Tier-A/Tier-B split + caseless fallback regex)

Cross-platform

  • Windows build (E0432/E0433) - daemon module gated #[cfg(unix)]. It hard-imported tokio::net::UnixStream and std::os::unix::net::UnixStream, neither of which exist on Windows. keyhog daemon and --daemon now emit a clear “unix-only” error there instead of a build failure. Per-named-pipe Windows IPC support is tracked but unimplemented.
  • Cross-platform path-separator suppression - five sites used POSIX-only rsplit('/') for basename extraction or contains("/dir/") for vendored-tree detection. Windows checkouts (C:\src\app\node_modules\…) silently skipped every gate. Switched to rsplit(['/', '\\']) + new contains_path_segment helper that tests both /seg/ and \seg\. Behaviour on POSIX paths unchanged.
  • UTF-16 BOM file decode - decode_text_file unconditionally rejected every file starting with the literal UTF-16 BOM (\xff\xfe / \xfe\xff) as binary, before decode_utf16 (right below it) could decode them. Every UTF-16-BOM PowerShell / .NET config that ships on Windows was silently invisible to the scanner. Removed the false-positive guard; decode_utf16 handles BOM dispatch internally.
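The separator-agnostic helpers from the second item can be sketched as below. contains_path_segment is named in the notes; its signature here, and the basename helper, are assumptions.

```rust
/// Returns the final path component, splitting on either separator.
pub fn basename(path: &str) -> &str {
    path.rsplit(['/', '\\']).next().unwrap_or(path)
}

/// True when `segment` appears as a whole directory component,
/// written either as /seg/ or as \seg\.
pub fn contains_path_segment(path: &str, segment: &str) -> bool {
    let unix = format!("/{segment}/");
    let windows = format!("\\{segment}\\");
    path.contains(&unix) || path.contains(&windows)
}
```

Requiring a separator on both sides keeps names that merely contain the segment, such as my_node_modules_backup, out of the vendored-tree check.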

Recall - contract evasions restored (412 → 52)

  • Shape-filter Tier-A / Tier-B split. Five shape-suppression filters (looks_like_pure_identifier, looks_like_word_separated_identifier, looks_like_scheme_prefixed_uri, looks_like_url_or_path_segment, contains_uuid_v4_substring) were applied universally in should_suppress_named_detector_finding as of v0.5.21..v0.5.24. They dropped legitimate service-anchored credentials whose body looks like an identifier / URL / UUID - PowerBI client_id UUIDs, mongodb:// URIs, avalanche RPC URLs, cockroachdb word-separated keys. Per the anti-rigging law: contracts are truth - when evasions DROP, fix the engine, not the contract. New is_generic_or_entropy_detector helper gates the five filters as Tier-B (generic-* / entropy-* only). looks_like_punctuation_decorated_identifier stays universal (Tier A) - --api-secret, &password, Password: are grammar markers, never a credential body. Self-scan: 0 real findings, 1041 example/test keys suppressed (was 1020 pre-fix).
  • Fallback regex compiler - caseless to match Hyperscan. shared_regex() built the regex crate without case_insensitive(true), but Hyperscan compiles every pattern CASELESS. Detectors with mixed-case alternations ((?:FRAMER|framer)[_=:\s"']+(?:api[_-]?)?(?:key|token)) bake uppercase only in the leading anchor, leaving api/key lowercase. FRAMER_API_KEY=<token> (uppercase) was matched by Hyperscan but silently missed by the fallback path - ~30 detectors affected.

Detector-specific

  • transifex-api-token - second-pattern regex was transifex\.com.*[=:\s"']+(...). Hyperscan .* doesn’t span \n, so the canonical # https://transifex.com/api/3/\nAuthorization: Bearer <token> shape never matched. Switched to [\s\S]*? (lazy any-char). Keeps existing positives; restores the documented evasion.
  • weatherapi-api-key - added a third pattern for the canonical curl shape (https://api.weatherapi.com/v1/...?key=<key>) where the domain appears BEFORE the key. The previous two patterns both required domain AFTER the key, missing the standard SDK invocation.
  • intercom-access-token - a TOML parse error had silently dropped this detector from the embedded corpus since v0.5.21. The regex line used a single-quoted TOML literal string with an embedded ', which single-line literal strings do not allow. Switched to a triple-quoted (multi-line) literal string. The build script counted 891 but the loader saw 890; this restores the missing detector.

Test infrastructure

  • Boundary tests - STRADDLE_ABCDEFGHIJKLMNOPQRST (29 pure-alpha chars) was tripping looks_like_pure_identifier after v0.5.21’s filter widened to catch CamelCase / single-underscore identifiers in the 8..=40 alpha range. Test fixture now uses STRADDLE_A1CDEFGH2JKLMNOPQ8ST (digits sprinkled in), matching the AWS-access-key shape the test was designed to mirror.
  • README banner pattern count - README_PATTERN_COUNT = 1646 → 1647 (one pattern added by the weatherapi third regex + one restored by the intercom fix).
  • Clippy 1.95 - ten new lints (doc_lazy_continuation, manual_range_contains, manual_pattern_char_comparison, manual_contains, manual_char_is_ascii) on pre-existing code in suppression.rs. Idiom-only modernizations, no behavior change.

v0.5.24 - 2026-05-26 - dogfood non-PEM 27 → 22 (138 → 22 vs v0.5.21 baseline = −84%) via UUID-substring + email + blockchain-address-keyword + $ sigil + base64 hot-pattern wiring

Precision

  • contains_uuid_v4_substring - captured values that wrap a UUID v4 / RFC-4122 (TOKEN_LIST=636765a9-1f92-4b40-ab0b-85ebd1e2c23d in bat-go docker-compose.reputation.yml). The entropy detector grabs the whole env-var assignment; the high-entropy payload is just the UUID, which is a public identifier, not a credential.
  • looks_like_email_address - noreply@gogs.localhost (gogs TestInit.golden.ini:89 USER=… captured because of nearby PASSWORD= line). Email addresses are public identifiers, never credentials. Tightened local + domain alphabet checks keep real user:password DSN strings outside the rejection set.
  • Blockchain / network-address keyword context in entropy fallback. Lines like SOLANA_BAT_MINT_ADDRS=EPeU…1Tpz, OWNER_PUBKEY=…, CONTRACT_ADDRESS=0x…, WALLET=… name a PUBLIC blockchain or network identifier - not a credential. Skip the entropy emit when the env-var key contains any of _ADDR, _ADDRS, _ADDRESS, _WALLET, _MINT_ADDR, _PUBKEY, _PUBLIC_KEY, _CONTRACT, _OWNER, _ACCOUNT_ID, _PEER_ID, _NODE_ID.
  • Leading $ sigil rejection - GraphQL variable references ($api_key in shopify-cli mutation), shell variable expansions ($API_KEY), template placeholders (${SECRET}). Real credentials never start with $.
  • base64_string.txt / base64_* filename pattern + hot-pattern path wiring. metasploitable3/.../base64_string.txt is a 600 KiB pure-base64 PNG flag file. Random byte sequences in the base64 stream coincidentally match the AWS Session Token ASIA[A-Z0-9]{16} literal-prefix hot pattern. The base64 decoder still produces its own filesystem/base64 chunk; only raw text-mode hits on these files are suppressed. Wired in BOTH should_suppress_named_detector_finding and the hot-pattern fast path.
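Two of the cheaper gates above, the public-identifier keyword check and the leading $ sigil rejection, can be sketched with plain string operations. The function names and the ASCII-uppercase normalisation are assumptions; the keyword list is copied from the item above.

```rust
/// Env-var key fragments that name a public blockchain / network identifier.
const PUBLIC_ID_KEYWORDS: [&str; 12] = [
    "_ADDR", "_ADDRS", "_ADDRESS", "_WALLET", "_MINT_ADDR", "_PUBKEY",
    "_PUBLIC_KEY", "_CONTRACT", "_OWNER", "_ACCOUNT_ID", "_PEER_ID", "_NODE_ID",
];

/// True when the key names a public identifier, so the entropy emit is skipped.
pub fn key_names_public_identifier(key: &str) -> bool {
    // Normalising case is an assumption of this sketch.
    let upper = key.to_ascii_uppercase();
    PUBLIC_ID_KEYWORDS.iter().any(|kw| upper.contains(*kw))
}

/// True for $api_key, $API_KEY, ${SECRET}: variable references, not values.
pub fn starts_with_dollar_sigil(value: &str) -> bool {
    value.starts_with('$')
}
```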

Per-detector dogfood deltas vs v0.5.23

generic-secret         7 → 6    (shopify-cli graphql $api_key killed)
entropy-api-key        1 → 0    (Solana mint address killed by blockchain-keyword)
entropy-token          1 → 0    (UUID-substring killed TOKEN_LIST=<uuid>)
entropy-password       3 → 2    (email-shape killed noreply@gogs.localhost)
hot-aws_session_key    1 → 0    (base64_string.txt killed via hot-pattern wiring)
TOTAL non-PEM          27 → 22  (−19% this release; −84% vs v0.5.21 baseline)
private-key recall     782 + 30 = 812 unchanged

Residual 22 findings

21 of the 22 are TRUE POSITIVES that the engine should keep firing on:

  • 6 alist OAuth client secrets committed to source (real public OAuth secrets in cloud-storage driver bindings - known leak by design).
  • 4 metasploitable3 chef users.rb passwords (Dark_syD3, @dm1n1str8r, mesah_p@ssw0rd, Dark_syD3-class) - CTF/vulnerable-app credentials intentionally weak but ARE real credentials.
  • 4 metasploitable3 / govwa generic-secret CTF passwords (govwaP@ss, D@rjeel1ng, but_master:, admin1234).
  • 2 gogs golden test fixtures (PASSWORD=12345678, PASSWORD=87654321) - sequential-digit test passwords; engine correctly flags them.
  • 1 metasploitable3 Autounattend.xml Microsoft Windows public-key token (real public ID, ambiguous).
  • 1 railsgoat seeds.rb CTF password (motoXXX1445).
  • 1 claude-code Datadog public client token (real, intentional public Datadog logging key).
  • 1 shopify-api-ruby test JWT (shipping label JWT in a test response fixture).
  • 1 openssl SSH private-key in test data (real PEM in test/recipes/).

The only remaining true FP is saltstack-credentials on railsgoat/config/initializers/constants.rb - engine offset bug (defect #80) emits a finding with no regex match; needs deeper investigation.

v0.5.23 - 2026-05-26 - dogfood non-PK 63 → 27 (−57%, 138 → 27 vs v0.5.21 baseline = −80%) via shape-filter unification + Rails-vendored detection + .b64 file skip + URI type-annotation suppression

Precision

  • All shape filters now apply to every detector, not just generic-*/entropy-*. looks_like_pure_identifier, looks_like_word_separated_identifier, looks_like_scheme_prefixed_uri, looks_like_punctuation_decorated_identifier, looks_like_url_or_path_segment no longer gate on detector_id. Service detectors like cryptocompare-api-key were firing on SetMultipartFormData Go method names because their regex used Authorization[=:\s"']+([a-zA-Z0-9]{20,}) and the named-detector path bypassed shape gates. Real credentials have digits / long random suffixes / mixed alphabet - every filter has internal guards (!has_digit, max_word_len ≤ 10) that keep real keys outside the rejection set.

  • looks_like_punctuation_decorated_identifier fixed for PEM blocks. The b'-' leading-sigil reject was too eager - -----BEGIN ... PRIVATE KEY----- starts with 5 dashes and was being suppressed alongside --api-secret CLI flags. Tightened to bytes.starts_with(b"--") && bytes[2] != b'-' so PEM markers (3+ dashes) survive but -- CLI flags still reject.

  • .b64 / .base64 raw-file skip. Files explicitly marked as base64-encoded blobs (metasploitable3/resources/flags/jack_of_diamonds.b64 is a base64-encoded PNG) hold alphabet-coincidence matches inside the base64 stream (AIza…, sk-…, ASIA…). The base64 decoder pass still produces a separate filesystem/base64 chunk with the decoded content; only raw text-mode hits on the base64 source are suppressed.

  • looks_like_scheme_prefixed_uri gained a <short-alpha>:<short-alpha> type-annotation branch. Documentation examples such as bool:false, int:42, string:USD, kind:Secret (llama-cpp arg.cpp:2468 --override-kv tokenizer.ggml.add_bos_token=bool:false,…) were captured as bool:false and emitted as generic-secret. Real credentials never have this <3-15 alpha>:<≤10 alpha> shape.

  • looks_like_vendored_minified_path extended for Rails-asset vendored JS. app/assets/javascripts/<name>.js is the legacy Rails asset path where vendored libraries (bootstrap, jquery, alertify, datatables, fullcalendar, etc.) live. First-party Rails JS today lives under app/javascript/ or app/assets/builds/. Match by basename prefix against a known-vendor list. Catches the railsgoat bootstrap-image-gallery-main.js honeybadger-api-key FP.
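The PEM fix in the second item above can be sketched as a small predicate. This is an illustration only: `looks_like_cli_flag` is a hypothetical name, and the bounds-checked `get(2)` is an assumption added here so that a bare `--` cannot index out of range; the changelog itself quotes `bytes[2]`.

```rust
/// Hypothetical helper: true when `candidate` looks like a `--flag` CLI
/// option rather than a credential. PEM armor lines start with five
/// dashes, so a third dash means "not a CLI flag".
fn looks_like_cli_flag(candidate: &str) -> bool {
    let bytes = candidate.as_bytes();
    // `get(2)` instead of `bytes[2]` so a two-byte "--" cannot panic.
    bytes.starts_with(b"--") && matches!(bytes.get(2), Some(&b) if b != b'-')
}
```

Under this sketch `--api-secret` is rejected as a flag while `-----BEGIN RSA PRIVATE KEY-----` passes through to the detectors.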

Per-detector dogfood deltas (v0.5.22 → v0.5.23)

generic-secret               8 → 7
cryptocompare-api-key        1 → 0
google-api-key               1 → 0
hot-aws_key                  1 → 0
hot-aws_session_key          3 → 1
honeybadger-api-key          1 → 0
redis-connection-string      1 → 0
saltstack-credentials        2 → 1
openai-api-key (transient)   2 → 0
TOTAL non-PK                 63 → 27 (−57% this release)
TOTAL non-PK                 138 → 27 (−80% vs v0.5.21 baseline)
private-key recall           782 unchanged (PEM filter regression caught + fixed)

v0.5.22 - 2026-05-26 - 22-repo dogfood drops non-PK findings 138 → 63 (−54%) via 8 new suppression filters + short-prefix anchor sweep

Precision (all 22-repo dogfood-driven)

  • looks_like_word_separated_identifier - digit-bearing snake_case / kebab-case identifiers (s3_secret_access_key, d2i_PKCS7_bio, sqlite3_int, curlx_memdup0, X-Shopify-Access-Token, Shopify-Storefront-Private-Token). Max-word-length ≤ 10 keeps real credentials with <prefix>_<long-random> shape unaffected.
  • looks_like_scheme_prefixed_uri - URI / URN / compound-scheme prefixes (urn:shopify:params:oauth:token-type:online-access-token, secret-token:<base64>, sha256:<hex> content digests).
  • looks_like_punctuation_decorated_identifier - non-credential decorated shapes: CLI flags (--api-secret), C/Go pointers (&gss_recv_token), SQL/Ruby binds (@v_password), JS coercions (!!apiKeyOrOAuthToken), UI labels (Password:), TS non-null (token!), Unix paths (/etc/passwd:/etc/passwd:ro).
  • looks_like_url_or_path_segment - multi-segment paths (user/settings/password, /api/v1/access_token).
  • looks_like_vendored_minified_path - codemirror / pdfjs / wp-includes / node_modules / .min.js / .bundle.js - random byte sequences in vendored bundles are never credential leaks. Applied to BOTH named-detector and hot-pattern paths.
  • looks_like_secret_scanner_source - the scanned file IS itself a secret scanner (secretScanner.ts, trufflehog/, gitleaks/). Every detector matches its own regex DEFINITIONS - path-keyword skip closes the gap that looks_like_regex_literal_tail left after unicode-escape / caesar decoders mangle trailing sigils.
  • looks_like_regex_literal_tail promoted + hardened - shared between hot-patterns, generic-secret fallback, and named-detector path. Added )/g,, )/gi,, )/i,, )/m, suffixes for JS object-literal patterns ({ key: /pat/g, … }).
  • Native-binary string-extraction source (filesystem:binary-strings and filesystem/archive-binary): all named-detector + hot-pattern findings suppressed. Compiled ELF / Mach-O / PE / wasm binaries produce random byte sequences that match short-prefix detectors (sk-, pk_, AKIA, ASIA, K00M, AIza, dn_). Real native-binary credential scanning lives behind the optional binary feature (Ghidra extraction with context).
  • has_binary_magic extended to ELF / Mach-O 32-bit + 64-bit / PE / gzip / bzip2 / xz / 7z / RAR / GIF / JPEG / Ogg / ICO / WebAssembly / Unix ar / Python pickle magic bytes. Previously only PDF / ZIP / PNG / OLE - a 2.3 MB ELF binary with no extension (metasploitable3 sinatra/aws/loader) slipped past the binary filter.
  • Entropy-fallback whitespace + comma reject - labels (brave-talk-free sku token v1 macaroon ids) and DSN-shape config strings (tcp,addr=:6379,password=macaron,db=0,…) are never credentials.
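A magic-byte check of the kind has_binary_magic performs might look like the sketch below. The table is an illustrative subset built from well-known file signatures, not KeyHog's actual list, and `detect_binary_magic` is a hypothetical name.

```rust
/// Well-known magic-byte prefixes for a few of the formats named above.
/// Illustrative subset only; the real table covers more formats.
const MAGIC_PREFIXES: &[(&str, &[u8])] = &[
    ("ELF", b"\x7fELF"),
    ("PE", b"MZ"),
    ("gzip", b"\x1f\x8b"),
    ("PNG", b"\x89PNG\r\n\x1a\n"),
    ("PDF", b"%PDF-"),
    ("ZIP", b"PK\x03\x04"),
    ("WebAssembly", b"\0asm"),
    ("Mach-O 64-bit (little-endian)", b"\xcf\xfa\xed\xfe"),
];

/// Returns the format name when the first bytes of a file match a known
/// signature, so extension-less binaries are still recognised.
fn detect_binary_magic(head: &[u8]) -> Option<&'static str> {
    MAGIC_PREFIXES
        .iter()
        .find(|(_, magic)| head.starts_with(magic))
        .map(|(name, _)| *name)
}
```

Checking content rather than the file extension is what would catch a case like the extension-less ELF loader mentioned above.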

Detector tightening

  • z85-encoded-secret: dropped generic encoded keyword anchor. Go/JS/Python ubiquitously name their base64/hex output variable encoded; the detector was firing on every encoded := … value-position alphabet hit (bat-go suggestions_test.go, claude-code yoloClassifier.ts, gogs internal/tool/tool.go).
  • helicone-api-key (sk- / pk- / eu-), stabilityai-api-key (sk-), clickup-api-token (pk_), deepnote-api-credentials (dn_) - all anchored to start-of-string or non-identifier byte. Pre-fix: dn_ matched any 3 alpha-numeric continuation chars (e.g. idn_curlx_convert_wchar_to_UTF8 in curl/lib/idn.c), sk- matched random ELF rodata.
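The "start-of-string or non-identifier byte" anchor can be sketched without a regex engine. Both function names are hypothetical, and treating ASCII alphanumerics plus `_` as identifier bytes is an assumption; the shipped detectors express the anchor in their TOML regexes.

```rust
/// Assumption: identifier bytes are ASCII letters, digits, and underscore.
fn is_identifier_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns the byte offsets where `prefix` occurs at the start of `text`
/// or immediately after a non-identifier byte.
fn anchored_prefix_offsets(text: &str, prefix: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    text.match_indices(prefix)
        .map(|(offset, _)| offset)
        .filter(|&offset| offset == 0 || !is_identifier_byte(bytes[offset - 1]))
        .collect()
}
```

With this anchor the `dn_` inside `idn_curlx_convert_wchar_to_UTF8` is ignored because it follows the identifier byte `i`, while a quoted `"dn_…"` value still matches.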

Per-detector dogfood deltas vs v0.5.21 baseline

generic-secret         38 → 8 (−79%)
generic-password       22 → 11 (−50%)
entropy-*              60 → 5 (−92%)
z85-encoded-secret     3 → 0 (−100%)
deepnote               3 → 0 (−100%)
helicone               1 → 0 (−100%)
clickup                1 → 0 (−100%)
stabilityai            2 → 0 (−100%)
hot-aws_key            1 → 0 (−100%)
hot-aws_session_key    3 → 1 (−67%)
TOTAL non-PK           138 → 63 (−54%)

Testing

10 new a3-pipeline unit tests covering each new shape (positive proves suppression + adversarial twin proves real credentials still fire). Stripe / MailChimp / Slack / GitHub-PAT fixture literals defanged via concat!() for GitHub push-protection.

v0.5.21 - 2026-05-26 - regex-literal suppression + fallback identifier sharing + bandwidth promiscuous-pattern fix

Precision

  • Regex-literal-tail suppression (hot-patterns fast-path AND generic-secret fallback). Source files that ship secret-scanner code (claude-code’s teamMemorySync/secretScanner.ts, components/Feedback.tsx, every trufflehog / gitleaks competitor) emit hot-pattern findings on their own regex DEFINITIONS - AKIA[A-Z0-9]{16,17})/g, ASIA[A-Z0-9]{16})\b, xoxb-[0-9-]*. Real tokens never end in regex sigils (no service uses )/g or })\b in its token alphabet). Tail check is O(1) across 20 known sigil suffixes - kills 4+ FPs in claude-code’s src/components/Feedback.tsx + utils/teamMemorySync/secretScanner.ts.

  • looks_like_pure_identifier now wired into fallback_generic. Previously the named-detector path applied this filter (suppressing getParameter / Benutzername / curlx_strdup) but the generic-secret fallback emitted matches directly. Same pattern as the entropy-fallback fix in v0.5.19. Get-Location (PowerShell verb-noun, 12 chars, 1 hyphen, no digit) was the remaining FP shape this catches - claude-code’s utils/powershell/parser.ts line 1343 (pwd: 'Get-Location').

  • bandwidth-api-key dropped its bare ClientID/ClientSecret pattern. Those tokens are generic OAuth2 terminology, not Bandwidth-specific. alist’s drivers/pikpak/util.go, drivers/thunder/driver.go, drivers/pcloud/util.go all have ClientSecret = "..." for Xunlei/PikPak/PCloud OAuth flows - the captured values ARE leaked client secrets, but for entirely different services. The generic-secret fallback catches the same values via its client[_-]?secret keyword alternation, so recall is preserved at correct service attribution. 7 → 0 mis-attributed bandwidth-api-key findings.
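The regex-literal-tail check from the first item above amounts to a suffix test. The sketch below uses only suffixes quoted in this changelog plus a few obvious flag variants; the list is an illustrative subset of the 20 suffixes mentioned, and the constant name is hypothetical.

```rust
/// Illustrative subset of regex-sigil suffixes. A captured value ending in
/// one of these is assumed to be part of a regex definition, not a token.
const REGEX_TAIL_SIGILS: &[&str] = &[
    ")/g", ")/gi", ")/i", ")/m",
    ")/g,", ")/gi,", ")/i,", ")/m,",
    "})\\b", ")\\b",
];

/// True when the candidate ends in a regex sigil such as `)/g` or `})\b`.
fn looks_like_regex_literal_tail(candidate: &str) -> bool {
    REGEX_TAIL_SIGILS
        .iter()
        .any(|sigil| candidate.ends_with(*sigil))
}
```

Because the suffix list is fixed and short, the cost per candidate does not grow with input size.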

v0.5.20 - 2026-05-26 - hot-pattern correctness + identifier filter extension + service-detector tightening

Critical correctness

  • SG. hot-pattern fired on MSG.length JavaScript substrings. The fast-path scanner (engine::hot_patterns) emits Critical-severity findings without re-running the full detector regex; the per-pattern minimum-credential-length floor was 8 for every short-prefix pattern except AKIA/ASIA. PASTE_HERE_MSG.length contains the substring SG.length (9 chars) which sailed past the 8-byte floor and became a Critical hot-sendgrid_key finding in claude-code’s OAuthFlowStep.tsx. Same class affected ghp_ (8-byte ghp_xxxx passes), sk-proj-, xoxb-, xoxp-, sq0csp-. Tightened to the true minimum length of each token format:
    • ghp_: 8 → 40 (ghp_ + 36 base62 = real GitHub PAT)
    • sk-proj-: 8 → 20 (sk-proj- + 12 alnum)
    • SG.: 8 → 26 (SG. + 22 first-segment base64)
    • xoxb-: 8 → 16 (xoxb- + 11 alnum)
    • xoxp-: 8 → 16 (xoxp- + 11 alnum)
    • sq0csp-: 8 → 16 (sq0csp- + 9 alnum)

    Real tokens still match (their length is well above the new floor); every shorter substring becomes a no-op.
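The per-prefix floors listed above can be read as a lookup table. The sketch below copies the new floor values from this entry; the table and function names are hypothetical, and how the fast path delimits a candidate before this check is not modelled.

```rust
/// New minimum candidate lengths per hot-pattern prefix, copied from the
/// list above. Names are illustrative, not the engine's own.
const HOT_PATTERN_FLOORS: &[(&str, usize)] = &[
    ("ghp_", 40),
    ("sk-proj-", 20),
    ("SG.", 26),
    ("xoxb-", 16),
    ("xoxp-", 16),
    ("sq0csp-", 16),
];

/// True when `candidate` starts with a known prefix and is at least as
/// long as that prefix's floor.
fn passes_length_floor(candidate: &str) -> bool {
    HOT_PATTERN_FLOORS
        .iter()
        .any(|(prefix, floor)| candidate.starts_with(*prefix) && candidate.len() >= *floor)
}
```

Under these floors `SG.length` (9 bytes) and `ghp_xxxx` (8 bytes) are dropped, while a full-length 40-byte `ghp_` token still passes.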

Precision

  • looks_like_pure_identifier widened. The single-underscore / kebab-case shape escaped the prior >= 2 underscores or 0 separators branches. Added <= 1 separator (_ or -) + pure ASCII letters + no digit + 8..=40 chars arm. Covers curlx_strdup (curl/lib/netrc.c), auth_decoders (curl/lib/http_aws_sigv4.c), gss_token, user-password (Go config field names), aria-secret, Get-Function (PowerShell verb-noun). All slipped through v0.5.19; now suppressed on the named-detector and entropy-fallback paths (the filter is shared crate-internal).

  • blockcypher-api-token: dropped the global token=<hex> pattern. Was token[=:\s\"']+([a-f0-9]{24,32}) - fired on every Authorization: token <hex> line in any REST-API test fixture (41 Shopify API test SHAs in v0.5.19 dogfood). Replaced with host-scoped pattern requiring api.blockcypher.com in the URL. 41 → 0 FPs.

  • oxylabs-credentials: dropped the global user-X:X pattern. Matched every CSS user-select:none, user-modify:read-write, user-drag:auto declaration in pdf.js viewer.css / font-awesome / store-brave-com bundle.css. Real Oxylabs accounts are still caught via the context anchor below (extended to recognize pr.oxylabs.io / dc.oxylabs.io hostnames). 20+ CSS FPs killed.
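The new looks_like_pure_identifier arm from the first item above (at most one separator, ASCII letters only, no digit, 8..=40 chars) can be sketched as follows. The function name is hypothetical and the sketch covers only this one arm, not the filter's other branches.

```rust
/// Sketch of the single-separator arm: at most one `_` or `-`, every other
/// byte an ASCII letter (so no digits), and a length of 8..=40.
fn is_single_separator_identifier(value: &str) -> bool {
    let is_separator = |b: u8| b == b'_' || b == b'-';
    let separators = value.bytes().filter(|&b| is_separator(b)).count();
    let letters_only = value
        .bytes()
        .all(|b| b.is_ascii_alphabetic() || is_separator(b));
    (8..=40).contains(&value.len()) && separators <= 1 && letters_only
}
```

Values such as `curlx_strdup` and `Get-Function` match this shape; anything containing a digit does not, which is the guard that keeps digit-bearing credentials out of the rejection set.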

Dogfood scope

49-target sweep with all v0.5.20 fixes:

metric                   v0.5.19   v0.5.20
blockcypher-api-token    41        0
oxylabs-credentials      21        0
generic-password         90        77
hot-sendgrid_key (FP)    2         0
total findings           1212      1125
zero-finding targets     15        15

Real positives preserved: openssl 816 (test PEMs), PayloadsAllTheThings 61 (security-training examples), wafrift-cf-deploy 78 (test fixtures).

v0.5.19 - 2026-05-26 - entropy-fallback FP sweep (gogs 149 → 27, -82%; entropy total -79%)

Precision

  • CI workflow files: entropy fallbacks no longer fire in .github/workflows/, .gitlab-ci.yml, .circleci/, azure-pipelines*, bitbucket-pipelines*, .travis.yml, Jenkinsfile. Real secrets in CI configs live behind ${{ secrets.NAME }}; raw values are action version refs (aws-actions/configure-aws-credentials@v1.0), step names (Setup Node), bash subshells ($(echo ${SHA} | base64)). Named detectors (github-pat, aws-akia, slack-token) still fire on these paths via service-specific anchors. 25+ FPs killed across bat-go / bat-ledger / brave-talk / malachite / orb-firmware workflows.

  • Shell expansion shapes: captures starting $(, ${, \"${, [{ \", { \"a, $ECR, $RUN, or $UPPER (env-var refs) are shell command substitutions and template interpolations, not credentials. Workflow YAML emits these in volume; this filter catches the spillover when CI logic lives in scripts/*.sh or Makefile outside .github/.

  • i18n / translation files: entropy-* now skipped in /locale/, /locales/, /i18n/, /l10n/, /translations/, /lang/, /langs/ directories, .po / .pot files (gettext), and filename conventions like locale_<region>.<ext>, messages_<lang>.properties, strings_<lang>.xml. Translated strings around localized “password” / “token” / “key” keywords contain non-ASCII bytes (é, ã, ç, ī) whose Shannon entropy crosses the keyword-context floor. 103 → 0 entropy-password FPs in gogs locale_*.ini alone; whole-target drop 149 → 27 findings (-82%).

  • Shared identifier-shape filter: extracted looks_like_pure_identifier from the named-detector suppression path to crate-internal scope and wired the entropy fallback through it. Previously the _password = getParameter(…) and German “Benutzername” cases were suppressed via the named path but the entropy fallback emitted them directly - same shape, different code path. Now both share one identifier-shape contract (snake_case≥2_no-digit, CamelCase no-digit, pure-alphabetic word 8..=32).
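The CI-workflow path rule from the first item above could be expressed as a path predicate like the one below. The function name is hypothetical, and the exact matching rules (substring versus path component, backslash normalisation) are assumptions made for the sketch.

```rust
/// Sketch: true when `path` points at one of the CI config locations
/// listed above, where entropy fallbacks are skipped.
fn is_ci_workflow_path(path: &str) -> bool {
    // Normalise Windows separators so one set of checks covers both.
    let normalized = path.replace('\\', "/");
    let file_name = normalized.rsplit('/').next().unwrap_or("");
    normalized.contains(".github/workflows/")
        || normalized.contains(".circleci/")
        || matches!(file_name, ".gitlab-ci.yml" | ".travis.yml" | "Jenkinsfile")
        || file_name.starts_with("azure-pipelines")
        || file_name.starts_with("bitbucket-pipelines")
}
```

Only the entropy fallbacks would consult a predicate like this; named detectors keep firing on these paths, as described above.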

Dogfood scope (proof, not sample)

23-target sweep; entropy-* family delta:

detector            v0.5.18   v0.5.19   Δ
entropy-password    107       11        -90%
entropy-token       26        13        -50%
entropy-api-key     21        8         -62%
entropy total       154       32        -79%

Per-target highlights: gogs 149 → 27 (-82%), brave-talk 5 → 0, orb-firmware 13 → 1 (-92%), malachite 10 → 1 (-90%), webgoat 5 → 2, bat-ledger 14 → 9, bat-go 29 → 21. Twelve targets in the 23-target sweep now report 0 findings (brave-talk, colly, constellation, diffvg, mpc-lib, nitriding-daemon, orb-relay-messages, qtrap, spill, _self - keyhog scanning itself - plus the existing two). openssl’s 816 are test-PEM private-key findings (true positives in fixtures, not FPs); PayloadsAllTheThings’s 61 are intentional security-training examples.

v0.5.18 - 2026-05-26 - dogfood FP sweep (12-target deep scan, 160 → 83 findings, ~48% FP reduction)

Precision

  • deel-api-key matched Java JNI macro names. Pattern was org_[a-zA-Z0-9_-]{30,} which fired on every org_sqlite_jni_capi_CApi_* macro in javah-generated C headers (41 FPs in sqlite alone, applies to every Java-bindings library shipping JNI). Tightened to org_[a-zA-Z0-9]{30,} - real Deel org tokens are opaque base62 with no underscores or hyphens. Same fix for the organization_ variant.
  • generic-secret captured C++ / Rust scope resolution. The bridge regex consumed one :; the second stayed in-value because : is in the alphabet to keep nginx@sha256:<hex> recall. The leak captured :open_paren: (jinja lexer enum redirects, 32+ in llama-cpp), PrivateKey::, Etc::passwd, K256Config::SigningKey (malachite signing-ecdsa). Added two filters: drop captures starting with : AND captures containing :: anywhere. Sha256 digests pass both filters (start with hex, no ::).
  • generic-secret captured Rust/Java/C# type names. Pure-CamelCase values like K256SigningKey, P256VerifyingKey, ShopifyToken slipped the pure-CamelCase identifier filter because they include digits. Added a “type-name shape” filter: 8..=40 chars, starts with uppercase, ≥ 2 uppercase letters, has lowercase, pure ASCII alphanumeric. Real random credentials only hit this shape by coincidence; structured TypeName-with-version-digit is overwhelmingly an identifier.
  • generic-password captured Java method references. Lines like databasePassword = getParameter(servlet, DATABASE_PASSWORD); (webgoat WebgoatContext.java) captured getParameter (12-char pure CamelCase, no digit). Extended looks_like_pure_identifier to also suppress pure-alphabetic 8..=32 char values with no digit (covers CamelCase identifiers AND natural-language dictionary words like German “Benutzername”). Real credentials have at least one digit or symbol.
  • entropy-api-key captured Java keystore filenames. Bat-go’s docker-compose.yml had 4+ findings on kafka.broker1.keystore.jks / kafka.broker1.truststore.jks next to KEYSTORE_FILENAME: anchors. Added a filename-suffix filter that drops values ending in .jks, .yml, .yaml, .toml, .json, .properties, .pem, .key, .crt, .cer, .pfx, .p12, .keystore, .truststore, .conf, .ini, .env, .lock, .log. Real credentials never end in a known file extension.
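The "type-name shape" filter from the third item above translates almost directly into code. This is a sketch under the stated criteria (8..=40 chars, leading uppercase, at least two uppercase letters, some lowercase, pure ASCII alphanumeric); `looks_like_type_name` is a hypothetical name.

```rust
/// Sketch of the type-name shape: CamelCase identifiers that may carry
/// digits, such as `K256SigningKey`.
fn looks_like_type_name(value: &str) -> bool {
    let starts_upper = value
        .bytes()
        .next()
        .map_or(false, |b| b.is_ascii_uppercase());
    let upper_count = value.bytes().filter(|b| b.is_ascii_uppercase()).count();
    let has_lower = value.bytes().any(|b| b.is_ascii_lowercase());
    let all_alnum = value.bytes().all(|b| b.is_ascii_alphanumeric());
    (8..=40).contains(&value.len())
        && starts_upper
        && upper_count >= 2
        && has_lower
        && all_alnum
}
```

`K256SigningKey` and `ShopifyToken` match; `getParameter` does not (it starts lowercase and is handled by the pure-identifier filter instead).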

CI / tests

  • Test gate stayed red on integration-test type drift. bconcat! macro was removed in c031c84 but two call sites kept the old form; S3Source.name() test didn’t import the Source trait. Both fixed: bconcat!(...) → concat!(...).as_bytes(), and use keyhog_core::Source; added to the S3 gate.
  • Exit code consolidation. main.rs was redefining EXIT_SCANNER_PANIC = 11 locally; now imports keyhog::orchestrator::EXIT_SCANNER_PANIC. One source of truth.

Dogfood scope (proof of FP reduction, not a sample)

Twelve real-world targets, all pre-v0.5.18 captures verified manually: sqlite, nginx, flutter, shopify-cli, shopify-api-ruby, malachite, webgoat, llama-cpp-turboquant, bat-go, orb-firmware, brave-talk, nitriding-daemon. Per-target totals:

target                 v0.5.17   v0.5.18   Δ
sqlite (deel JNI)      41        6         -85%
llama-cpp (jinja)      41        7         -83%
webgoat (Java)         5         3         -40%
malachite (Rust)       10        8         -20%
shopify-api-ruby       10        8         -20%
shopify-cli            5         4         -20%
bat-go (filenames)     29        28        -3%
orb-firmware           13        13        0
brave-talk             5         5         0
nginx                  1         1         0
nitriding-daemon       0         0
_self (keyhog repo)    0         0
total                  160       83        -48%

Detector-level deltas: deel-api-key 35→0 (-100%), generic-secret 61→22 (-64%), generic-password 4→0 (-100%), entropy-api-key 27→27 (filename filter wave 2 still pending wider rollout).

v0.5.17 - 2026-05-26 - SSRF redirect closure + --insecure honor + oob hygiene

Security

  • SSRF redirect bypass in DNS-pinned client closed. The per-request client rebuild in verify::request::resolved_client_for_url was Client::builder().timeout().resolve_to_addrs().build() - silently inheriting reqwest’s default Policy::limited(10) instead of the engine’s Policy::none(). An attacker-controlled verification target could return 302 Location: http://internal-target/ and the pinned client would follow it; the DNS pin only covers the ORIGINAL host, so reqwest re-resolved the redirect target via the system resolver with no second pass through the SSRF guards. Now the rebuild explicitly sets redirect(Policy::none()). Adversarial test pinned_client_does_not_follow_redirect_to_private_target proves it.
  • SSRF bypass via hex / octal-encoded IPv4 hosts closed. verifier::ssrf::is_private_url blocked decimal (2130706433) and dotted-decimal (127.0.0.1) but accepted hex (0x7f000001) and octal (017700000001). glibc / musl resolvers canonicalize all four to loopback, so the gap let an attacker controlling a verification target redirect requests to internal hosts. Both radix paths now blocked. See crates/verifier/src/ssrf.rs.
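The integer-host gap described in the second item can be illustrated with a small parser that accepts all three radixes before the range check. This is a sketch, not the code in crates/verifier/src/ssrf.rs: both function names are hypothetical, it handles only single-integer hosts (not dotted forms), and the set of blocked ranges is an assumption.

```rust
use std::net::Ipv4Addr;

/// Parses a host written as one integer in decimal, hex (`0x...`), or
/// octal (leading `0`) into the IPv4 address a resolver would use.
fn parse_integer_ipv4_host(host: &str) -> Option<Ipv4Addr> {
    let value = if let Some(hex) = host
        .strip_prefix("0x")
        .or_else(|| host.strip_prefix("0X"))
    {
        u32::from_str_radix(hex, 16).ok()?
    } else if host.len() > 1 && host.starts_with('0') {
        u32::from_str_radix(&host[1..], 8).ok()?
    } else {
        host.parse::<u32>().ok()?
    };
    Some(Ipv4Addr::from(value))
}

/// True when the integer host canonicalises to a loopback, private, or
/// link-local address (illustrative subset of blocked ranges).
fn is_blocked_integer_host(host: &str) -> bool {
    parse_integer_ipv4_host(host)
        .map_or(false, |ip| ip.is_loopback() || ip.is_private() || ip.is_link_local())
}
```

All three spellings from the entry above (2130706433, 0x7f000001, 017700000001) resolve to 127.0.0.1 in this sketch, which is why each radix needs to reach the same guard.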

Fixed

  • --insecure flag now honored on the DNS-pinned path. Same root cause as the redirect bypass above: the per-request client rebuild dropped danger_accept_invalid_certs(insecure_tls) baked into the engine’s base client, so --insecure (and KEYHOG_INSECURE_TLS) silently did nothing for direct (non-proxy) verifications. Threaded insecure_tls through VerifyTaskShared → verify_with_retry → resolved_client_for_url and re-applied it on the rebuild.
  • Scanner-panic exit code no longer collides with detector-audit. Mid-scan scanner thread panic returned exit code 3, the same value detectors --audit uses for “audit flagged a quality issue”. CI scripts had no way to tell “scanner crashed mid-run, results unreliable” from “detector quality regression”. Scanner-panic now exits 11, matching the orchestrator’s EXIT_SCANNER_PANIC and documented in keyhog --help.
  • scan-system exit code. keyhog scan-system returned 0 regardless of findings; CI pipelines couldn’t gate on it. Now returns 1 when all_findings is non-empty, matching the scan / hook contract.
  • find_companion off-by-one. pipeline::find_companion shifted the search window past line 1 because primary_line is already 1-based but the code added FIRST_LINE_NUMBER again. Companions on the line immediately above the radius were silently missed.
  • UTF-8 in JSON value extraction. decode::json::extract_json_strings iterated raw bytes and pushed byte as char, corrupting every multi-byte UTF-8 sequence inside JSON strings into Latin-1 garbage. Switched to char_indices().
  • Zero-width regex hits in extract_plain_matches. Sibling function extract_grouped_matches already skipped zero-width matches; plain-match path didn’t and emitted empty-credential findings on lookahead-only patterns. Added the matching guard.
  • Panic-on-init paths removed from prefilter + disclaimer loaders. Three .expect() calls on AhoCorasick::new / toml::from_str poisoned LazyLock and killed worker threads on any platform-specific compile failure. Converted to soft fallback (Option/empty list) with tracing::warn!. Worker threads now survive a corrupted-binary / build regression.
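The UTF-8 item above is easy to reproduce in isolation. The two functions below are hypothetical stand-ins for the before and after shapes of the loop, not the body of extract_json_strings.

```rust
/// The buggy shape: each byte is widened to a `char`, so one multi-byte
/// UTF-8 sequence becomes several Latin-1 code points.
fn collect_bytes_as_chars(input: &str) -> String {
    input.bytes().map(|byte| byte as char).collect()
}

/// The fixed shape: iterate decoded characters (with their byte offsets,
/// which a real extractor would use for slicing).
fn collect_chars(input: &str) -> String {
    input.char_indices().map(|(_, ch)| ch).collect()
}
```

For the single character U+00E9 (é, encoded as the bytes C3 A9) the first function yields the two characters U+00C3 and U+00A9, while the second returns the input unchanged.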

Changed

  • InteractshClient::for_test returns Result instead of panicking. The helper formerly carried RsaPrivateKey::new(...).expect("test RSA key generates") - a panic-in-production path the no-unwrap gate caught. Returns Result<Self, InteractshError> now (mapped to KeyGen); test callers wrap with .unwrap() at the test boundary. Source: gate oob_client_no_unwrap_expect.
  • oob::client split: decrypt_entry moved to oob::decrypt. File hit 516 lines (over the 500 modularity cap). Natural seam - client owns RSA state + HTTP I/O, decrypt owns AES-256-CFB per-entry decode. No behaviour change. Source: gate oob_client_file_size_cap.
  • README exit codes match --help. Documented codes 3 (detectors --audit failure), 4 (backend --self-test failure), 10 (live findings under --verify), and 11 (scanner panic) - README previously listed only 0/1/2.
  • Hash-digest gate is no longer always-on for named detectors. Service-anchored detectors (ALCHEMY_API_KEY=<32hex>, HEROKU_API_KEY=<uuid>, DATADOG_API_KEY=<32hex>) now bypass both the hash-digest and UUID-shape gates - the regex anchor is positive evidence the value is a credential, not a hash. Generic / entropy / private-key paths stay gated. Fixed 21 contracts that were failing their scale gate because their legitimate credential body was being suppressed as hash-shaped.
  • kubernetes-secret detector disabled. Was the #1 false-positive source (795 FPs on SecretBench-medium) because it surfaced the base64-encoded value while the truth set was the decoded value, so the scorer never matched the overlap. Structured preprocessor already extracts + decodes data: values and appends them as plaintext lines for every downstream detector. Detector file kept (vs deleted) so the embedded count stays stable.
  • Case-insensitive variants added to azure-subscription-key, cloudflare-api-token, heroku-api-key, honeybadger-api-key - camelCase and kebab-case env-var forms now match. New aws-secret-access-key detector matches the 40-char body in SCREAMING_SNAKE, camelCase, INI / properties, and kebab-case contexts. New azure-storage-account-key detector matches the 88-char body after AccountKey= in connection strings.
  • Verifier SSRF blocklist routed through the vendored bogon crate. The hand-maintained IANA-bogon match arms (loopback, link-local, private, multicast, benchmark, documentation, broadcast) were drifting; the bogon crate tracks the registries.
  • README overhauled. Stale ~60-line Roadmap section killed. New “What it catches” section enumerates detector categories with concrete services. “Why higher recall, fewer false positives” rewritten around the five real moats. Daemon mode, scan-system, and lockdown promoted from sub-sections to top-level. Honest dual recall numbers (96% on synthetic / 69% on realistic SecretBench-medium).

Added

  • Documentation site under site/. 17 hand-authored pages (intro, install, quickstart, scan, output formats, baselines, allowlists, CI/SARIF, pre-commit hooks, daemon mode, system triage, detector catalog with live filter over all 891, configuration, library API, architecture, performance, lockdown, FAQ). Black-on-white with restrained yellow accents. Build with python3 site/build.py; deploy to GitHub Pages.
  • Per-detector self-validation test (tests/all_detectors_self_validate.rs). Walks every TOML in detectors/, asserts each loads, compiles into the scanner regex backend, declares ≥1 keyword ≥3 chars, has service + patterns metadata, and contributes to the tests/contracts/ coverage floor (currently 38%). Catches load-but-never-fires regressions before they ship.
  • SecretBench v5 corpus + provider-anchor wrappers. Bench fixtures now wrap 70% of secrets in their service-anchored env-var name (AWS_SECRET_ACCESS_KEY=…, etc.) instead of generic SECRET_KEY=…. Matches real-repo distribution. fn_analyze.py companion to fp_analyze.py for triaging false-negative buckets the same way as false-positive ones.
  • CI workflows fixed. secretbench-nightly and vendor-vyre were both failing on YAML scope errors (inline Python in block scalars). Python summary now lives in tools/secretbench/scoring/print_summary.py; vendor-vyre commit message built via printf into a temp file. The vendor-vyre workflow now exits cleanly when the optional SANTH_GITHUB_PAT secret is missing instead of failing red.

Performance

  • SecretBench-medium scoreboard (15k fixtures, seed 0):

    run   F1       precision   recall   TP      FP     FN
    v17   0.7710   0.8449      0.7089   10634   1952   4366
    v18   0.7120   0.7078      0.7162   10743   4436   4257
    v19   0.7815   0.9018      0.6895   10342   1126   4658

    v18 was a regression (bypass-all-shape-gates added 3304 FPs in the sha-hex / git-commit-sha buckets); v19 restored the hash-digest gate as always-on; the Unreleased bypass-on-anchor fix is being measured next.
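The F1, precision, and recall columns follow from the TP / FP / FN counts by the standard definitions. A minimal sketch that recomputes them (names are illustrative, and it assumes non-zero denominators):

```rust
/// Precision, recall, and F1 derived from raw confusion counts.
struct Scores {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Standard definitions: P = TP / (TP + FP), R = TP / (TP + FN),
/// F1 = 2PR / (P + R). Assumes TP + FP and TP + FN are non-zero.
fn scores(tp: u64, fp: u64, fn_count: u64) -> Scores {
    let (tp, fp, fn_count) = (tp as f64, fp as f64, fn_count as f64);
    let precision = tp / (tp + fp);
    let recall = tp / (tp + fn_count);
    let f1 = 2.0 * precision * recall / (precision + recall);
    Scores { precision, recall, f1 }
}
```

Feeding in the v19 row (10342, 1126, 4658) reproduces the table's 0.9018 / 0.6895 / 0.7815 to four decimal places; TP + FN is 15000 in every row, matching the 15k fixture count.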

v0.5.16 - 2026-05-23 - JsonDecoder wired into decode registry

Fixed

JsonDecoder is now in the decode-through pipeline. It had a splice-aware implementation in crates/scanner/src/decode/json.rs since v0.5.15 but was never registered in get_decoders() - pure dead code. Credentials stored as JSON-encoded fields (the most common shape after .env) silently went unsurfaced.

Result on the adversarial_explosion_runner corpus (348 detectors × ~2 positives × 8 real-world wrappers):

state      variants firing
v0.5.15    5719 / 5792 (73 JSON-wrapper misses)
v0.5.16    5792 / 5792 (corpus is wrapper-tight)

The runner is now strict-by-default (KEYHOG_ADVERSARIAL_STRICT=0 to opt out) so any future regression that loses a single variant turns CI red.

v0.5.15 - 2026-05-23 - decode-through splice: base64/hex recall 30% → 93%

Fixed

Decode-through pipeline preserves companion context now. Decoded chunks used to be bare bytes with no surrounding text - every detector anchored on a companion keyword (aws_secret = …, Authorization: Bearer …, api_key: …) lost its anchor as soon as the credential was recovered from an encoded blob. push_decoded_text_chunk_spliced in crates/scanner/src/decode/pipeline.rs now splices the decoded text BACK into the parent at the position of the original encoded blob. Measured on the new encoding_explosion_runner corpus (348 detectors × ~2 positives):

encoding       before   after   delta
base64-std     30.5%    93.1%   +62.6pp
base64-url     30.5%    92.8%   +62.3pp
hex            30.5%    92.8%   +62.3pp
url-percent    15.5%    79.7%   +64.2pp

Migrated decoders: base64 (Base64Decoder + Z85Decoder), hex, json, url (via decode_candidates). Splice path is memory-capped at 256 KiB parent so multi-MB chunks don’t blow allocation.
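The splice idea can be sketched as a string operation: replace the encoded blob's byte range in the parent with its decoded text so that companion keywords stay adjacent. `splice_decoded` is a hypothetical name, not push_decoded_text_chunk_spliced itself, and returning `None` above the 256 KiB cap is an assumption about how the cap might be applied.

```rust
use std::ops::Range;

/// Parent-size cap taken from the entry above.
const MAX_SPLICE_PARENT_BYTES: usize = 256 * 1024;

/// Replaces the encoded blob at byte range `blob` inside `parent` with
/// `decoded`. Returns `None` when the parent is over the cap or the range
/// is out of bounds or not on a character boundary.
fn splice_decoded(parent: &str, blob: Range<usize>, decoded: &str) -> Option<String> {
    if parent.len() > MAX_SPLICE_PARENT_BYTES || blob.start > blob.end {
        return None;
    }
    let before = parent.get(..blob.start)?;
    let after = parent.get(blob.end..)?;
    let mut spliced = String::with_capacity(before.len() + decoded.len() + after.len());
    spliced.push_str(before);
    spliced.push_str(decoded);
    spliced.push_str(after);
    Some(spliced)
}
```

For a parent line `aws_secret = c2VjcmV0`, splicing the decoded text over the blob yields `aws_secret = secret`, so a detector anchored on `aws_secret` sees its keyword next to the recovered value.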

Added

  • keyhog scan --proxy <URL> - route every outbound HTTP request through an HTTP/HTTPS/SOCKS5 proxy. Falls back to KEYHOG_PROXY / HTTPS_PROXY / HTTP_PROXY / ALL_PROXY env. --proxy off disables proxying including env inheritance (air-gapped scans).
  • keyhog scan --insecure - skip TLS verification for every outbound request. Needed when scanning through Burp / mitmproxy CAs with self-signed certificates. Env: KEYHOG_INSECURE_TLS=1.
  • Shared keyhog_sources::http policy module. Single source of truth for proxy + TLS + UA so an operator setting KEYHOG_PROXY affects every outbound request uniformly.
  • 40 000-case proptest suite for the HTTP-client policy and SARIF dedup contracts (crates/sources/tests/property/http_fuzz.rs, crates/core/tests/property/sarif_dedup.rs).
  • 5 500-case adversarial wrapper-explosion runner - re-embeds every contract positive in 8 real-world formats and asserts the detector fires.
  • 6 500-case path-shape runner - replays every positive at 5 production paths and 4 suppressed-shape paths.
  • 5 070-case encoding-explosion runner with split decode-hit vs incidental-hit metrics. Floors pinned so a regression below 88% base64 / 92% hex / 75% url-percent trips the gate.
  • tests/live_verify.rs - env-gated live-verify smoke against real AWS/GitHub creds (KEYHOG_LIVE_VERIFY=1).
  • tools/diff_bench/ - single-shot runner that drives keyhog + trufflehog + gitleaks across one labeled corpus (positives synthesized at CI runtime to dodge push-protection) and emits differential_results.json with per-scanner precision / recall / F1 / timing. .github/workflows/differential-bench.yml runs nightly + on workflow_dispatch.
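The --proxy fallback chain from the first item in this list can be sketched as a pure function over the flag value and an environment lookup. Everything here is illustrative: `ProxyChoice` and `resolve_proxy` are hypothetical names, and the precedence among the four environment variables is assumed to follow the order in which they are listed.

```rust
/// Outcome of proxy resolution in this sketch.
#[derive(Debug, PartialEq)]
enum ProxyChoice {
    /// `--proxy off`: no proxy, environment ignored.
    Disabled,
    /// A proxy URL from the flag or the environment.
    Url(String),
    /// Nothing configured: connect directly.
    Direct,
}

/// Resolves the proxy from the CLI flag first, then from the environment.
/// `env` is injected so the function can be exercised without touching
/// the real process environment.
fn resolve_proxy(cli_flag: Option<&str>, env: impl Fn(&str) -> Option<String>) -> ProxyChoice {
    match cli_flag {
        Some("off") => ProxyChoice::Disabled,
        Some(url) => ProxyChoice::Url(url.to_string()),
        None => ["KEYHOG_PROXY", "HTTPS_PROXY", "HTTP_PROXY", "ALL_PROXY"]
            .iter()
            .find_map(|name| env(*name).filter(|value| !value.is_empty()))
            .map_or(ProxyChoice::Direct, ProxyChoice::Url),
    }
}
```

In a real binary the lookup could be `|name| std::env::var(name).ok()`; keeping it as a parameter makes the "off disables env inheritance" case straightforward to test.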

v0.5.14 - 2026-05-23 - macOS x86_64 + Windows release binaries

Added

release.yml now produces five assets per tag instead of two:

  • keyhog-linux-x86_64 (default features, dynamic Hyperscan)
  • keyhog-macos-aarch64 (Apple Silicon, portable features)
  • keyhog-macos-x86_64 (Intel mac, portable features) - new
  • keyhog-windows-x86_64.exe (MSVC, portable features) - new

The Windows + Intel-mac variants share the existing portable feature subset (every detector data feature, every git / web / github / s3 / docker / verify source backend, no Hyperscan / Ghidra / CUDA system libs). Daemon IPC is #[cfg(unix)]-gated, so it compiles to a stub on Windows hosts without disabling the rest of the binary surface. v0.5.13 only shipped the prior two assets because the matrix change landed after the tag was cut.

v0.5.13 - 2026-05-23 - SARIF dedup so GitHub Code Scanning accepts uploads

Fixed

SARIF v2.1.0 forbids duplicate items in relatedLocations. When a finding had the same supplemental location reported twice (e.g. verifier echo + scanner overlap), GitHub Code Scanning rejected the whole SARIF with relatedLocations contains duplicate item, silently losing every finding on the upload. The dedup runs on a (file_path, line, offset) key before serialization, so each related location appears at most once.
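An order-preserving dedup on the (file_path, line, offset) key might look like the sketch below. The `RelatedLocation` struct and its fields are hypothetical stand-ins for whatever type the serializer uses; only the key shape comes from the text above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a SARIF related location.
#[derive(Debug, Clone, PartialEq)]
struct RelatedLocation {
    file_path: String,
    line: u32,
    offset: u32,
    message: String,
}

/// Keeps the first location for each (file_path, line, offset) key and
/// drops later duplicates, preserving the original order.
fn dedup_related_locations(locations: Vec<RelatedLocation>) -> Vec<RelatedLocation> {
    let mut seen = HashSet::new();
    locations
        .into_iter()
        // `insert` returns false when the key was already present.
        .filter(|loc| seen.insert((loc.file_path.clone(), loc.line, loc.offset)))
        .collect()
}
```

Keying on position rather than on the whole struct means two entries that differ only in message text (for example a verifier echo and a scanner overlap) still collapse to one.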

This is what unblocks the fleet-wide keyhog.yml CI rollout - prior to this fix every repo that produced a finding lost its SARIF, leaving the Code Scanning tab empty even when the run was “green”.

v0.5.12 - 2026-05-23 - dedup 9 more dup-primary detectors

Fixed

Dropped the duplicate “secret/companion” primary in nine more detectors. Companion-only text no longer fires the detector without the id-half nearby.

  • hashicorp-vault-approle-credentials (Vault Secret ID)
  • qualys-api-credentials (qualys_username)
  • remitly-api-credentials (Remitly client ID)
  • smartproxy-credentials (smartproxy_username)
  • tidb-cloud-credentials (TiDB Public Key)
  • veracode-api-credentials (veracode_api_secret)
  • zscaler-api-key (zscaler_client_secret)
  • zuora-api-credentials (zuora_client_secret)
  • cloudflare-zero-trust-service-token (client_secret) - positives use the Client-Id shape, so dedup is safe even with main contract.

belvo, crisp, env0, exoscale, checkmarx, crowdstrike, fastspring, fedex still have the dup-shape - their main contracts have a secret-only positive that fires by design, so dedup would regress recall and isn’t a safe local sweep.

Changed

  • Pattern count 1674 → 1665 across README + e2e_binary + readme_claims gate.

v0.5.11 - 2026-05-23 - dedup carbon-black + databricks

Fixed

  • carbon-black-api-key: dropped duplicate org-key primary (kept as required companion). org_key=… alone no longer fires the detector without a CB API KEY primary nearby.
  • databricks-token: dropped duplicate workspace-url primary (kept as companion). A bare workspace URL with no dapi token nearby no longer fires the detector.

Same SURPLUS shape as the v0.5.9/v0.5.10 sweeps. These two had existing main contracts whose positives did NOT depend on the dropped primary firing alone - verified before edit.

Changed

  • Pattern count 1676 → 1674 across README + e2e_binary + readme_claims gate.

v0.5.10 - 2026-05-23 - detector dedup sweep + binary/crates alignment

Fixed

  • Dedupe primary-equals-companion in 14 detectors (idenfy, infura, jumio, marvel, packer, scaleway, sovos, thomson-reuters-onesource, time4vps, twilio-iot, upcloud, vonage-video, wix, woocommerce). Each listed the “secret / companion” half as a duplicate primary regex; companion-only text would fire the detector. Same SURPLUS shape closed in v0.5.9 for ringcentral/booking-com/vanta/trulioo/appdynamics/avalara/akoya - sweeping the rest of the corpus that has no main contracts yet so existing positives can’t regress.
  • Test-target clippy lints in gpu_ac_recall_bug_56, cve_replay_runner, companion_contracts_runner, property/scanner_fuzz.

Changed

  • Pattern count 1697 → 1676 across README banner + e2e_binary::README_PATTERN_COUNT + readme_claims gate.
  • v0.5.10 binary release and crates.io publish are built from the same commit. v0.5.9 shipped a linux binary built from the tag commit before CI dedup landed; crates.io was never published at 0.5.9 (CI test red on the pattern-count drift).

v0.5.9 - 2026-05-23 - companion contracts gate + LFS coverage

Fixed

  • Companion contracts gate (12 issues closed). Five detectors (ringcentral, booking-com, vanta, trulioo, appdynamics) listed the “secret” half as a duplicate primary regex, so the secret-only negative_companion_lookalike fixture fired the detector. Removed the duplicate primaries; secret is now companion-only. Akoya / avalara had the same dup-primary shape.
  • bitbucket-app-password companion regex. Was [a-zA-Z0-9._-]+ (matched anything), so primary-only text populated companion.username from inside the primary’s own assignment line and verification proceeded despite must_not_verify. Re-anchored to bitbucket_username= shape.
  • ringcentral companion now anchored to client_secret= shape so id-only text no longer populates client_pair and triggers VERIFY-RISK.
  • Three twilio companion fixtures used xxx / fake placeholders containing non-hex characters that the example-credential filter suppressed; swapped to realistic hex so the gate tests the engine behavior, not the example-credential filter.
  • rustfmt - scan_gpu.rs + engine/mod.rs re-joined now-short calls after the matchingscan module migration.

Changed

  • .gitattributes now covers contracts/companion/*.toml in LFS. The original LFS rule was non-recursive; companion fixtures with Twilio-shaped strings would otherwise trip GitHub push-protection.

v0.5.8 - 2026-05-23 - daemon wire-v2, GitHub Action, contracts gate

Added

  • GitHub Action that actually works. uses: santhsecurity/keyhog/.github/actions/keyhog@v0.5.10 now installs the Rust toolchain + Vectorscan/Hyperscan and builds keyhog, or downloads a prebuilt binary from the matching GitHub Release when one exists. Previously the action ran cargo build without setup, so every downstream Ubuntu run failed with cargo: command not found or a hyperscan-sys linker error. SARIF output auto-uploads to code-scanning when format: sarif. README example was also pointing at a nonexistent keyhog/keyhog-action@v1 repo - fixed to the bundled action path.
  • .github/workflows/release.yml - tag-driven binary build + upload. Pushing a v* tag now compiles keyhog for keyhog-linux-x86_64 (default features incl. Hyperscan via apt) and keyhog-macos-aarch64 (feature subset, no Hyperscan), then attaches the artifacts to the release. The composite action prefers these prebuilt binaries over a cold cargo build whenever the host triple matches.
  • KEYHOG_DOGFOOD=1 - daemon-side dogfood capture. Set when starting the daemon (KEYHOG_DOGFOOD=1 keyhog daemon start) to enable per-scan event capture inside the daemon; the events cross the wire to the client and flow into --dogfood output. Per-request toggling is not wired - env-var gating keeps one client’s debug session from bleeding into another client’s payload on a shared daemon, which a per-request flag would break without additional isolation work.
  • Daemon mode. keyhog daemon start | stop | status runs a long-lived scanner over a Unix socket (default $XDG_RUNTIME_DIR/keyhog.sock, falls back to ~/.cache/keyhog/server.sock; socket is chmod 0600). keyhog scan --daemon (or auto-detected when the socket exists) routes a stdin scan / single-file scan through the daemon instead of paying the ~3 s CompiledScanner::compile cold start. Measured 105× speedup (7 ms via daemon vs 740 ms in-process) on a real GitHub PAT, same detector + hash + offset in both paths. --no-daemon forces the in-process path. --verify, --baseline, directory walks, git-staged scans, and archive decoding stay in-process by design (the daemon doesn’t replicate that pipeline).
  • .keyhogignore gitignore-style shorthand. Bare path globs (*.log, node_modules/, vendor/**/*.json) and bare 64-char hex hashes are now accepted alongside the explicit path: / hash: / detector: prefixes. Lets users drop a copied .gitignore in place and have it work.
  • --max-file-size skip summary. Files dropped by the size cap now emit a per-file WARN AND an end-of-scan summary line (“N file(s) skipped: exceeded --max-file-size”). Walker’s silent filter was the only behavior before - a user looking at a smaller-than-expected scan had no signal about which files were dropped.
  • Live progress ticker. Long scans paint a self-overwriting scanning N/M chunks · K findings · t.t s line on stderr every 250 ms; suppressed under --stream or when stderr isn’t a TTY.
  • 25 companion-required detector contracts at crates/scanner/tests/contracts/companion/. Per-detector TOMLs encode the three-shape contract (positive_with_companion, positive_primary_only with must_not_verify, negative_companion_lookalike) for AWS, Twilio (api-key / auth-token / IoT), Algolia, Razorpay, Amplitude, AppDynamics, Avalara, Backblaze, Belvo, Bitbucket, Booking, Akoya, 4everland, Lark, Linear, Linode, Plaid, Reddit, RingCentral, SumoLogic, Trulioo, Vanta. Runner test at companion_contracts_runner.rs enforces all three shapes per contract.
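
The .keyhogignore shorthand described above can be pictured as a small line classifier. This is a minimal sketch, not keyhog's parser: the IgnoreEntry type and parse_ignore_line function are hypothetical names, and treating blank lines and #-prefixed lines as non-rules is an assumption borrowed from .gitignore conventions.

```rust
/// One parsed `.keyhogignore` entry. The type and variant names are illustrative.
#[derive(Debug, PartialEq)]
enum IgnoreEntry {
    Path(String),
    Hash(String),
    Detector(String),
}

/// Classify a single line: explicit prefixes first, then the shorthand forms.
fn parse_ignore_line(line: &str) -> Option<IgnoreEntry> {
    let line = line.trim();
    // Assumption: blank lines and `#` comments carry no rule, as in .gitignore.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    if let Some(rest) = line.strip_prefix("path:") {
        return Some(IgnoreEntry::Path(rest.trim().to_string()));
    }
    if let Some(rest) = line.strip_prefix("hash:") {
        return Some(IgnoreEntry::Hash(rest.trim().to_string()));
    }
    if let Some(rest) = line.strip_prefix("detector:") {
        return Some(IgnoreEntry::Detector(rest.trim().to_string()));
    }
    // Shorthand: a bare 64-char hex string is a hash, anything else is a path glob.
    if line.len() == 64 && line.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Some(IgnoreEntry::Hash(line.to_string()));
    }
    Some(IgnoreEntry::Path(line.to_string()))
}
```

Checking the explicit prefixes before the shorthand keeps existing files working unchanged, while a copied .gitignore falls through to the path-glob case.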

Fixed

  • contracts_runner was flaky across CI vs local. The 341-fixture loop reused a single CompiledScanner and never called clear_fragment_cache() between scans, so the cross-file reassembly cache accumulated. CI’s filesystem-iteration order put braintree’s sandbox_… positive ahead of blur-api-key’s evasion and the sandbox credential surfaced as the only finding on "blur key = \"Kp4Q…\"" - a non-deterministic failure invisible locally. Fix: clear the cache before every scan in contracts_runner.rs (5 sites) and companion_contracts_runner.rs (3 sites) per the documented test-isolation API in engine/mod.rs:747-760.
  • blur-api-key regex required uppercase KEY while the contract evasion uses lowercase key. Prepended (?i) and lower-cased the literals; the contract evasion now hits the intended case-variant path. Tests assert truth, not shape - weakening the test would have masked the engine gap.
  • Daemon-mode --dogfood was inert. Engine-side telemetry (record_example_suppression calls from pipeline.rs::should_suppress_known_example_credential_*) fired inside the daemon process - the client never saw any of it, so keyhog scan --dogfood demo-secret.env against a daemon silently dropped every suppression event and the reporter counter stayed at 0. Wire protocol bumped 1 → 2: Response::ScanResults now carries engine_example_suppressions: u64 and dogfood_events: Vec<DogfoodEvent> (both #[serde(default)], so a v2 client tolerates a v1 daemon). Daemon drains its per-scan telemetry after each scanner.scan(...) and resets; client merges the values into its own OnceLock<Telemetry> via two new public helpers (add_example_suppressions(n), append_events(iter)). Verified locally: --no-daemon AND a fresh daemon both emit “No real secrets - but 6 example/test keys suppressed. Pass --dogfood to see them.”
  • demo-secret.env summary regressed to the clean-repo message. The v0.5.7 fix wired TextReporter to read the suppression count, but the orchestrator’s test_fixture_suppressions.suppresses() branch ran before any telemetry write - AKIAIOSFODNN7EXAMPLE matched the bundled substring suppression list and returned false without incrementing the counter, so the reporter still saw 0 and printed “Your code is clean.” Now bumps record_example_suppression(..., "test_fixture_suppression") before returning. Same patch in the daemon-side finalize_for_report filter. Locked by e2e_binary::demo_secret_aws_example_summary_distinguishes_suppression_from_clean.
  • Mega-scan allocated ~20 GB RSS on tiny inputs. Every shard’s static input/state buffers were sized for MEGASCAN_INPUT_LEN=256 MiB. Forcing --backend mega-scan on a 19-byte file uploaded ~570 × 256 MiB ≈ 20 GB of GPU memory and burned ~20 s before returning. Small-buffer guard at the entry of scan_coalesced_megascan now routes batches under 64 KiB through the literal-set GPU path. Same recall (same AC literal prefix anchors), orders of magnitude lower setup cost. Confirmed 20.77 s / 19.7 GB → 0.34 s / 399 MB on the kimi reproducer.
  • GPU fallback regex-NFA dispatch silently dropped to CPU. The fallback RulePipeline::scan was passed max_matches_per_dispatch=1_000_000 which trips vyre’s hard-coded max_hits=10_000 static buffer declaration. Capping the dispatch at NFA_HITS_PER_DISPATCH=10_000 keeps the GPU path live; the always-active fallback regex set is small enough that 10 K matches per dispatch is well above what we’d ever see.
  • env::args() panicked on non-UTF-8 args. Linux allows raw-byte paths; std::env::args() calls .unwrap() on each Result which aborts with SIGABRT. Switched the version-flag detection in main.rs to args_os() + lossy compare.
  • Non-UTF-8 paths reported “No such file or directory” even when the file existed. New pre-flight at the CLI boundary refuses non-UTF-8 paths with a clear message (“Rename the file or scan its parent directory”) instead of confusing the user with a missing-file rabbit hole.
  • Nonexistent / unreadable input paths exited 0 with a WARN and “No secrets found, your code is clean.” Per the documented exit-code contract these are runtime errors. CLI now stat’s the input pre-walk; missing path → exit 2 with “path does not exist”, unreadable file → exit 2 with “cannot read … (fix chmod +r …)”.
  • --backend invalid silently ignored and the scan ran with the default. clap now validates against the PossibleValues set {gpu, mega-scan, megascan, simd, cpu, auto} and exits 2 with a clear error.
  • .keyhogignore detector: entries were dead. The parser populated ignored_detectors but the orchestrator’s per-finding filter never read it. Now applied alongside is_path_ignored / is_raw_hash_ignored.
  • RefCell double-borrow panic in fallback.rs. Per-pool thread-local borrows now try_borrow_mut + fresh-alloc fallback at three sites (ACTIVE_PATTERNS_POOL, ACTIVE_INDICES_POOL, TRIGGER_POOL). Was a hard P0: the rayon worker re-entry caught itself on the second borrow and aborted mid-scan.
  • FP storms killed: lastpass-dev-creds firing on random id=<digits> in /var/log archives (87% FP rate per kimi); GitHub PAT placeholder ghp_xxxxxxxx… flagged at 0.80; xoxb tokens with ascending-digit runs flagged. Tightened lastpass-dev-creds to require lastpass context within 40 chars; extended looks_like_prefixed_masked_sequence to suppress x/X-dominance, all-same-char, and ascending-digit-run ≥ 13.
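
The try_borrow_mut + fresh-alloc fallback from the RefCell fix above can be sketched as follows. SCRATCH_POOL and with_scratch are hypothetical stand-ins for the real pools (ACTIVE_PATTERNS_POOL and friends), whose element types are not given in these notes; the sketch only shows how a re-entrant call avoids the double-borrow panic.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for a per-thread scratch pool.
    static SCRATCH_POOL: RefCell<Vec<u32>> = RefCell::new(Vec::new());
}

/// Run `f` with a scratch buffer. The pooled buffer is used when it is free;
/// a re-entrant call on the same thread gets a fresh allocation instead of
/// panicking on a second mutable borrow.
fn with_scratch<R>(f: impl FnOnce(&mut Vec<u32>) -> R) -> R {
    SCRATCH_POOL.with(|cell| match cell.try_borrow_mut() {
        Ok(mut pooled) => {
            pooled.clear(); // reuse the allocation, not the contents
            f(&mut *pooled)
        }
        Err(_) => {
            let mut fresh = Vec::new(); // pool is busy further up the stack
            f(&mut fresh)
        }
    })
}
```

The fallback trades one extra allocation on the rare re-entrant path for the guarantee that a nested call can never abort the scan.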

Improved

  • CUDA driver is opt-in. The cuda feature was on by default, which made cargo build fail on any host without libcuda.so / libnvrtc.so / libcudart.so - including macOS, most CI runners, and any Linux box without an NVIDIA driver stack. The default scanner build now uses wgpu (Vulkan on Linux, Metal on macOS) for GPU dispatch. CUDA users opt in with --features cuda when they want the CUDA backend specifically. Drops the link-time CUDA requirement from every default build.
  • scripts/publish.sh reads the version from Cargo.toml. Renamed from publish-0.5.6.sh (which would silently emit “All v0.5.6 crates published” even when publishing v0.5.7). The new script awks [workspace.package].version and uses that everywhere - no per-release rename or message edit.
  • LayeredPipelineCache short-circuits compile on warm hits. The prior rule_pipeline_cached always called build_rule_pipeline upfront to keep typed-error semantics for vyre’s infallible-closure cached_load_or_compile, which made the on-disk cache pointless. Now uses vyre’s engine_cache_path + manual load/save so a warm hit returns the deserialised RulePipeline without paying the compile.
  • PreparedChunk::line_offsets() memoised via OnceLock. compute_line_offsets used to walk the preprocessed text twice per chunk (once for the triggered path, once for the pattern-hits path); the second caller now hits the memoised Vec.
  • Mega-scan compile-failure WARN demoted to debug. Falling back to the literal-set GPU dispatch when vyre’s byte-NFA frontend can’t represent every pattern (e.g. pattern 990 in the bundled detector corpus uses lookaround) is the designed degradation - the user can’t fix it, and one WARN per --backend mega-scan invocation creates noise without signal.
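
The OnceLock memoisation of line offsets mentioned above might look like this in miniature. The Chunk type is a hypothetical stand-in for PreparedChunk, and line_of is an illustrative helper that is not named in the changelog.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a prepared chunk: text plus a lazily built index.
struct Chunk {
    text: String,
    line_offsets: OnceLock<Vec<usize>>,
}

impl Chunk {
    fn new(text: impl Into<String>) -> Self {
        Self { text: text.into(), line_offsets: OnceLock::new() }
    }

    /// Byte offset at which each line starts. The text is walked on the first
    /// call only; later callers get the stored Vec.
    fn line_offsets(&self) -> &[usize] {
        self.line_offsets.get_or_init(|| {
            let mut offsets = vec![0];
            offsets.extend(
                self.text
                    .bytes()
                    .enumerate()
                    .filter(|(_, b)| *b == b'\n')
                    .map(|(i, _)| i + 1),
            );
            offsets
        })
    }

    /// 1-based line number containing `byte_offset` (illustrative helper).
    fn line_of(&self, byte_offset: usize) -> usize {
        self.line_offsets().partition_point(|&start| start <= byte_offset)
    }
}
```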

Differential parity

.internal/bench/differential/compare.py against gitleaks 8.30.0 and trufflehog 3.95.3 on the 64 MiB big_with_secrets corpus: gate green. Every secret two independent competitors HASH-confirm keyhog also surfaces, except sk_live_4eC39… which is documented as a public Stripe docs example (suppressed by test_fixture_suppressions::bundled() and listed in baseline.toml).

v0.5.7 - 2026-05-17

Fixed

  • The ‘No secrets found. Your code is clean.’ message lied when every match was suppressed as an EXAMPLE/test key. The 0.5.6 bump wired example-suppression telemetry into the orchestrator, but the user-facing summary is owned by TextReporter::finish() in keyhog-core, not the orchestrator - so the misleading banner still printed. TextReporter now takes the suppression count via set_example_suppressions(n) and prints “No real secrets - but N example/test key(s) suppressed. Pass --dogfood to see them.” instead. Verified end-to-end against demo-secret.env. Regression tests pin all three states.

v0.5.6 - 2026-05-17

Added - dogfooding-driven UX

  • --dogfood - opt-in JSON trace on stderr after the scan. Each example/test/placeholder credential that was matched and then suppressed gets a redacted-prefix event with the algorithmic reason (contains_EXAMPLE_token, algorithmic_placeholder). Closes the “did the scanner miss this, or silence it?” question without a debug rebuild. Full credentials are never emitted - --dogfood is a decision tracer, not a credential exfil channel.
  • Honest scan summary when only example keys were found. Previously, scanning demo-secret.env (which holds AKIAIOSFODNN7EXAMPLE) printed “No secrets found. Your code is clean.” - identical to a genuinely clean repo. Now the summary distinguishes:
    • 0 findings, 0 suppressed → “0 secrets in 0.12s. You are secure!”
    • 0 findings, N suppressed → “0 real secrets, N example/test key(s) suppressed (pass --dogfood to see them).”

Internal

  • New keyhog_scanner::telemetry module: per-scan atomic counters + optional event log. Engines call record_example_suppression(...) from the existing should_suppress_known_example_credential_* paths; the orchestrator drains events at the end of run(). Zero new state threaded through engine boundaries - single OnceLock process-local container with a reset() for tests.
  • Two regression tests pinning the demo-secret.env case + the dogfood redaction contract. Telemetry-touching tests serialise behind a module-local Mutex so cargo test’s parallel runner doesn’t let them step on each other.
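
A process-local telemetry container of the kind described above could be sketched like this. The signature of record_example_suppression, the drain function, and the four-character redaction prefix are all assumptions made for illustration; the real module's API is not spelled out in these notes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, OnceLock};

#[derive(Debug, Clone, PartialEq)]
struct SuppressionEvent {
    redacted_prefix: String,
    reason: &'static str,
}

#[derive(Default)]
struct Telemetry {
    example_suppressions: AtomicU64,
    events: Mutex<Vec<SuppressionEvent>>,
}

/// Single process-local container, created on first use.
fn telemetry() -> &'static Telemetry {
    static GLOBAL: OnceLock<Telemetry> = OnceLock::new();
    GLOBAL.get_or_init(Telemetry::default)
}

/// Hypothetical signature: count the suppression and log a redacted event.
fn record_example_suppression(credential: &str, reason: &'static str) {
    let t = telemetry();
    t.example_suppressions.fetch_add(1, Ordering::Relaxed);
    // Assumption: keep only a short prefix so the full credential never leaves.
    let prefix: String = credential.chars().take(4).collect();
    t.events.lock().unwrap().push(SuppressionEvent {
        redacted_prefix: format!("{prefix}..."),
        reason,
    });
}

/// Take the per-scan totals and reset the container for the next scan.
fn drain() -> (u64, Vec<SuppressionEvent>) {
    let t = telemetry();
    let count = t.example_suppressions.swap(0, Ordering::Relaxed);
    let events = std::mem::take(&mut *t.events.lock().unwrap());
    (count, events)
}
```

Because the container is global, tests that touch it need to run one at a time, which is why the changelog mentions a module-local Mutex around them.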

v0.5.5 - 2026-05-09

GPU foundations + vyre composition pass. The session wires keyhog deeper into vyre as a primitive consumer and contributes new general-purpose capability back to vyre.

Tier-aware GPU routing + 2 MiB threshold on RTX 40/50-class GPUs. select_backend now classifies the detected adapter into High / Mid / Low tiers and consults per-tier crossover thresholds:

| Tier | Adapter examples | min_bytes | solo cap |
| --- | --- | --- | --- |
| High | RTX 40/50, A100/H100, M-Max/Ultra, RX 7900 | 2 MiB | 16 MiB |
| Mid | RTX 20/30, GTX 16, Arc, M-Pro/base, RX 6/7 | 16 MiB | 64 MiB |
| Low | iGPU, older discretes, unknown | 64 MiB | 256 MiB |

Pattern-count breakeven is also tier-aware (100 / 500 / 2000). keyhog backend reports the active tier and effective thresholds for the live adapter. Backwards compatible: unknown adapters classify as Low and keep the legacy thresholds.
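
As a rough illustration, the per-tier thresholds can be expressed as a lookup table. The type and function names here are hypothetical, the mapping of the 100 / 500 / 2000 breakeven values to High / Mid / Low assumes they are listed in tier order, and reading min_bytes as "smallest batch for which the GPU is considered" is an interpretation rather than something these notes state.

```rust
const MIB: u64 = 1024 * 1024;

#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuTier {
    High,
    Mid,
    Low,
}

#[derive(Debug, PartialEq)]
struct TierThresholds {
    min_bytes: u64,
    solo_cap: u64,
    // Assumption: 100 / 500 / 2000 map to High / Mid / Low in that order.
    pattern_breakeven: usize,
}

fn thresholds(tier: GpuTier) -> TierThresholds {
    match tier {
        GpuTier::High => TierThresholds { min_bytes: 2 * MIB, solo_cap: 16 * MIB, pattern_breakeven: 100 },
        GpuTier::Mid => TierThresholds { min_bytes: 16 * MIB, solo_cap: 64 * MIB, pattern_breakeven: 500 },
        GpuTier::Low => TierThresholds { min_bytes: 64 * MIB, solo_cap: 256 * MIB, pattern_breakeven: 2000 },
    }
}

/// Unknown adapters classify as Low, keeping the legacy thresholds.
fn effective_tier(detected: Option<GpuTier>) -> GpuTier {
    detected.unwrap_or(GpuTier::Low)
}

/// Assumed reading of min_bytes: batches below it are not sent to the GPU.
fn clears_min_bytes(tier: GpuTier, batch_bytes: u64) -> bool {
    batch_bytes >= thresholds(tier).min_bytes
}
```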

GPU dispatch sharding + correctness fix. scan_coalesced_gpu now slices the coalesced buffer at 65535 * 32 = 2,097,120 bytes per dispatch (the wgpu workgroup-per-dimension cap × vyre’s workgroup_size_x = 32) and re-bases shard-local match offsets into the global buffer’s coordinate space. Eliminated the silent dispatch group size > 65535 error that the prior single-dispatch path hit on every 100 MiB+ batch. Recall on the realistic benchmark fixture now matches CPU/SIMD within rounding (303,554 vs 302,168 vs 304,128) - earlier 121× speedup numbers were lying because the dispatch errored mid-batch and only ~1% of true hits came back.
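
The sharding and offset re-basing can be sketched on the CPU. Hit and scan_sharded are hypothetical names, and the per-shard scanner is passed in as a closure so the example stays self-contained; matches that straddle a shard edge are outside the scope of this sketch.

```rust
/// Per-dispatch byte cap from the text above: 65535 workgroups x 32 bytes each.
const MAX_SHARD_BYTES: usize = 65_535 * 32;

#[derive(Debug, PartialEq)]
struct Hit {
    offset: usize,
    pattern: usize,
}

/// Scan `buffer` in shards of at most `shard_len` bytes and translate each
/// shard-local hit offset back into the coordinate space of the whole buffer.
fn scan_sharded(
    buffer: &[u8],
    shard_len: usize,
    mut scan_shard: impl FnMut(&[u8]) -> Vec<Hit>,
) -> Vec<Hit> {
    assert!(shard_len > 0, "shard length must be non-zero");
    let mut all = Vec::new();
    for (index, shard) in buffer.chunks(shard_len).enumerate() {
        let base = index * shard_len; // global offset of this shard's first byte
        for mut hit in scan_shard(shard) {
            hit.offset += base;
            all.push(hit);
        }
    }
    all
}
```

In real use the shard length would be MAX_SHARD_BYTES (2,097,120); the parameter exists so the re-basing can be exercised on tiny inputs.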

Vyre intern::perfect_hash wired for static-string interning. CompiledScanner builds a CHD perfect hash from every detector’s (id, name, service) plus the seed source-type literals at construction time. ScanState::intern_metadata consults this frozen interner first; only dynamic strings (file paths, commit SHAs, author names, dates) hit the per-scan HashSet<Arc<str>> fallback. Per-scan allocation count drops by ~100k on a typical 1000-chunk run. 6 unit tests + 282 scanner tests still green.

Vyre megakernel scaffolding (gated behind KEYHOG_USE_MEGAKERNEL). engine/megakernel_dispatch.rs ships a working DFA-per-literal compile + BatchDispatcher init + dispatch loop that hands back the same per-chunk per-pattern trigger bitmask the literal-set GPU path produces. Routed in scan_coalesced_megakernel behind the env opt-in. Defaults OFF: vyre’s BatchDispatcher is optimised for “many files × few rules” but keyhog’s corpus is “few files × 6000+ rules” - modelling each literal as its own BatchRuleProgram allocates chunks × rules ≈ 600,000 work items per dispatch, which keeps the persistent kernel sleeping in S-state on RTX 5090. Real megakernel win needs vyre-side multi-pattern hit reporting (one DFA covering many literals, HitRecord gains a per-pattern field) - wiring then collapses to a one-line swap.

Cross-platform compile fix in vendored vyre-runtime: GpuStream<'a> now carries PhantomData<&'a ()> on non-Linux so the lifetime parameter isn’t flagged unused when uring is cfg’d out. Windows / macOS builds now pull vyre-runtime cleanly.

Vyre rule engine wired for declarative .keyhogignore.toml.

Upstream vyre additions (general-purpose, lives in vyre-libs):

  • vyre_libs::rule::cpu_eval - pure-CPU evaluator for RuleCondition / RuleFormula trees. Mirror of the GPU lowering. Useful for any consumer that wants per-record rule evaluation without dispatching a backend program. 11 unit tests.
  • vyre_libs::rule::ast::RuleCondition::FieldInSet - new variant for “context field’s value is in this set”. Distinct from SetMembership (which compares a static value, not a field lookup). Required for expressing “detector_id is one of …” without resorting to regex alternation. Builder lowering errors with an actionable Fix: message - only the CPU evaluator can resolve field lookups today.
  • vyre smallvec workspace pin bumped 1.14.0 → 1.15.1 so consumers carrying gix (which requires ^1.15.1) can share the type - keyhog needed this to put SmallVec<[Arc<str>; 4]> on the wire between core and vyre.

Keyhog consumes via new crates/core/src/rule_filter.rs. Schema documented in docs/keyhogignore-toml.md. [[suppress]] tables compose AND of named predicates (detector / service / severity / severity_lte / path_eq / path_contains / path_starts_with / path_ends_with / path_regex / credential_hash). Multiple [[suppress]] tables compose with OR. Empty entry rejected at parse to prevent accidental suppress-everything. Unknown fields rejected via serde deny_unknown_fields. Wired into orchestrator.rs::run after finalize() returns VerifiedFindings - predicates need the resolved fields that dedup_cross_detector populates. Malformed .keyhogignore.toml is non-fatal: warn + load zero rules; legacy .keyhogignore still applies. 11 keyhog rule_filter tests pass.
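
The AND-within-a-table, OR-across-tables composition can be sketched with a handful of the predicates. Everything below is illustrative: the Finding and SuppressRule types and the function names are hypothetical, and only four of the listed predicates are modelled.

```rust
/// Fields of a finding that the predicates below inspect (a hypothetical subset).
struct Finding<'a> {
    detector: &'a str,
    service: &'a str,
    path: &'a str,
}

/// One `[[suppress]]` table. Unset predicates are ignored; set ones are ANDed.
#[derive(Default)]
struct SuppressRule {
    detector: Option<String>,
    service: Option<String>,
    path_starts_with: Option<String>,
    path_ends_with: Option<String>,
}

impl SuppressRule {
    fn is_empty(&self) -> bool {
        self.detector.is_none()
            && self.service.is_none()
            && self.path_starts_with.is_none()
            && self.path_ends_with.is_none()
    }

    fn matches(&self, f: &Finding) -> bool {
        self.detector.as_deref().map_or(true, |d| d == f.detector)
            && self.service.as_deref().map_or(true, |s| s == f.service)
            && self.path_starts_with.as_deref().map_or(true, |p| f.path.starts_with(p))
            && self.path_ends_with.as_deref().map_or(true, |p| f.path.ends_with(p))
    }
}

/// Reject empty tables up front so a stray `[[suppress]]` cannot match everything.
fn validate_rules(rules: Vec<SuppressRule>) -> Result<Vec<SuppressRule>, String> {
    match rules.iter().position(SuppressRule::is_empty) {
        Some(i) => Err(format!("[[suppress]] entry {i} has no predicates")),
        None => Ok(rules),
    }
}

/// Tables compose with OR: one matching table is enough to suppress.
fn is_suppressed(rules: &[SuppressRule], f: &Finding) -> bool {
    rules.iter().any(|r| r.matches(f))
}
```

The empty-table check matters because an all-unset rule would otherwise evaluate to true for every finding.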

Realistic benchmark fixture. The previous --benchmark corpus used 36-char alphanumeric filler on every line, triggering the entropy detector constantly so the benchmark was measuring per-chunk extraction cost rather than the literal-prefilter crossover it claims to measure. New fixture mirrors typical TypeScript/Go/Rust source: short identifiers, natural-language comments, short string literals. RTX 5090 against this fixture: 130 MiB/s (cpu-fallback) / 136 MiB/s (simd-regex) / 34 MiB/s (gpu-zero-copy). The architectural fix for GPU loss on dense corpora is megakernel fusion of the extraction pipeline (vyre upstream feature, queued).

Vyre full 30-crate audit doc (docs/vyre-usage.md). Catalogues every vyre crate (foundation, driver, driver-wgpu, driver-megakernel, driver-spirv, libs, primitives, runtime, spec, intrinsics, reference, cc, harness, macros) with the public surface of each. Lists every vyre-libs and vyre-primitives module by name with what keyhog could conceivably wire from each.

v0.5.4 - 2026-05-08

Roadmap-clearing pass plus the first crates.io publish for every workspace crate. The README’s “Roadmap” section drops four items and a long-standing ignored regression test goes green.

Cross-chunk window-boundary reassembly (roadmap #3). New crates/scanner/src/engine/boundary.rs splices the tail of each large-file scan window to the head of the next and rescans the seam, catching secrets that physically straddle the 64 MiB scan-window boundary. Wired into scan_coalesced after Phase 2 in both the SIMD and no-SIMD paths. Bounded to 1 KiB per side (2 KiB per pair), so cost is independent of chunk size: a 64 GiB file sliced into 1000 chunks pays ~2 MiB of total boundary work - negligible next to the per-chunk regex pass. Six unit tests + the previously-#[ignore]-marked test_window_boundary_detection integration test now pass; the test itself was rewritten to use an AKIA-shaped secret (the original XX_FAKE_* shape was unconditionally suppressed by the placeholder filter, so the test would have stayed red even with reassembly).
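
A seam buffer of this kind can be built with two slice copies. The sketch below uses hypothetical names (SEAM_SIDE, build_seam, total_seam_bytes) and only covers constructing the seam and its global offset; deduplicating seam hits against hits already found inside each window is left out.

```rust
/// Bytes taken from each side of a window boundary (1 KiB per side).
const SEAM_SIDE: usize = 1024;

/// Join the last `side` bytes of `prev` with the first `side` bytes of `next`.
/// Returns the seam and the global offset of its first byte, given the global
/// offset at which `prev` starts.
fn build_seam(prev: &[u8], next: &[u8], prev_global_start: usize, side: usize) -> (Vec<u8>, usize) {
    let tail_start = prev.len().saturating_sub(side);
    let head_end = next.len().min(side);
    let mut seam = Vec::with_capacity((prev.len() - tail_start) + head_end);
    seam.extend_from_slice(&prev[tail_start..]);
    seam.extend_from_slice(&next[..head_end]);
    (seam, prev_global_start + tail_start)
}

/// Total seam bytes rescanned for a file split into `windows` windows.
fn total_seam_bytes(windows: usize, side: usize) -> usize {
    windows.saturating_sub(1) * 2 * side
}
```

With 1000 windows and 1 KiB per side the total is 999 × 2 KiB = 2,045,952 bytes, which is the "~2 MiB" figure quoted above.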

keyhog detectors --audit and keyhog detectors --fix (roadmap #4). detectors --audit runs every detector through keyhog_core::validate_detector, prints issues grouped by detector ID, and exits with code 3 when any Error-severity issue surfaces - drop it into CI to gate detector PRs. detectors --fix scans the on-disk TOML corpus for the one validator finding that’s safe to repair mechanically - single-brace template references ({shop}) inside [detector.verify*] blocks - and rewrites them to the double-brace form ({{shop}}) the interpolator actually honours. Rewrites are scoped to verify blocks only (regex quantifiers like [A-Z]{4,6} in pattern blocks stay untouched), atomic-written via NamedTempFile, and re-validated post-rewrite so a corrupted result backs off rather than overwriting the original. --dry-run previews without writing. The 888-detector embedded corpus shows zero errors today (the v0.4.x detector cleanup wave already cleared them) - the subcommand is the regression net for the next batch of contributions. Seven unit tests cover the rewriter’s edge cases.
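
The single-brace to double-brace rewrite, scoped to verify blocks, can be approximated as below. This is a sketch under stated assumptions, not the shipped rewriter: the function names are hypothetical, a template name is assumed to be an identifier made of ASCII letters, digits, underscores and dots, and TOML table headers are detected by a simple line heuristic. Atomic writes and post-rewrite validation are not shown.

```rust
/// Assumed shape of a template name: identifier characters plus dots.
fn is_template_name(name: &str) -> bool {
    let mut chars = name.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.')
}

/// Rewrite `{name}` to `{{name}}`, leaving `{{name}}` and quantifiers like `{4,6}` alone.
fn fix_single_braces(line: &str) -> String {
    let bytes = line.as_bytes();
    let mut out = String::with_capacity(line.len() + 8);
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'{' {
            if bytes.get(i + 1) == Some(&b'{') {
                // Already double-braced: copy through the closing `}}` untouched.
                if let Some(rel) = line[i..].find("}}") {
                    out.push_str(&line[i..i + rel + 2]);
                    i += rel + 2;
                    continue;
                }
            } else if let Some(rel) = line[i + 1..].find('}') {
                let name = &line[i + 1..i + 1 + rel];
                if is_template_name(name) {
                    out.push_str("{{");
                    out.push_str(name);
                    out.push_str("}}");
                    i += rel + 2;
                    continue;
                }
            }
        }
        // Anything else is copied one character at a time.
        let ch = line[i..].chars().next().unwrap();
        out.push(ch);
        i += ch.len_utf8();
    }
    out
}

/// Apply the rewrite only to lines that sit under a `[detector.verify*]` header.
fn fix_verify_blocks(toml: &str) -> String {
    let mut in_verify = false;
    let mut out = String::with_capacity(toml.len());
    for line in toml.split_inclusive('\n') {
        let trimmed = line.trim();
        // Heuristic header test; a real implementation would use a TOML parser.
        if trimmed.starts_with('[') && trimmed.ends_with(']') && !trimmed.contains('"') {
            in_verify = trimmed.trim_start_matches('[').starts_with("detector.verify");
            out.push_str(line);
        } else if in_verify {
            out.push_str(&fix_single_braces(line));
        } else {
            out.push_str(line);
        }
    }
    out
}
```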

Streaming finding previews (roadmap #5). New --stream flag emits a one-line redacted preview to stderr per finding as the scanner produces it, instead of waiting for dedup + verification before printing anything. Format is grep-friendly: [stream] CRITICAL aws/aws-access-key src/foo.rs:42 AKIA...XYZ_a. The full report (text/json/sarif/jsonl) still lands on stdout/--output at the end - the stream is purely a UX hint that the scanner is making progress on long-running runs (large monorepos, scan-system, GitHub-org walks). Implemented inside the existing scanner thread via io::LineWriter so per-line writes land atomically across rayon workers.

--verify-rate + --verify-batch (roadmap #7). The per-service token-bucket rate limiter (crates/verifier/src/rate_limit.rs) is now hot-swappable via a new set_default_rps() (atomic-backed nanosecond interval) so the CLI’s --verify-rate <RPS> flag can take effect after the global limiter has lazily initialised. Default stays at 5 rps; existing per-service overrides via update_limit are preserved. --verify-batch adds per-service serialisation (max_concurrent_per_service = 1) on top of the rate cap - use it for repos with hundreds of fixture findings where bursting an upstream auth endpoint would get the scan IP throttled. Three new unit tests cover the rps→nanos clamp behaviour and the atomic update path.
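
An atomic-backed interval that can be swapped after initialisation might be sketched as follows. Only the name set_default_rps comes from the text above; its body, the helper names, and the clamp rules (invalid input falls back to the 5 rps default, the interval never drops below 1 ns) are assumptions made for this example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const NANOS_PER_SEC: u64 = 1_000_000_000;
const DEFAULT_RPS: u64 = 5;

/// Minimum gap between two requests, in nanoseconds. Stored atomically so a
/// CLI flag can change it after the limiter has already been initialised.
static DEFAULT_INTERVAL_NANOS: AtomicU64 = AtomicU64::new(NANOS_PER_SEC / DEFAULT_RPS);

/// Convert requests-per-second into a per-request interval.
/// Assumed clamps: invalid input keeps the default; the result is at least 1 ns.
fn rps_to_interval_nanos(rps: f64) -> u64 {
    if !rps.is_finite() || rps <= 0.0 {
        return NANOS_PER_SEC / DEFAULT_RPS;
    }
    let nanos = (NANOS_PER_SEC as f64 / rps).round();
    (nanos as u64).max(1) // float-to-int casts saturate, so huge values are safe
}

fn set_default_rps(rps: f64) {
    DEFAULT_INTERVAL_NANOS.store(rps_to_interval_nanos(rps), Ordering::Relaxed);
}

fn default_interval_nanos() -> u64 {
    DEFAULT_INTERVAL_NANOS.load(Ordering::Relaxed)
}
```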

Robustness sweep.

  • entropy_1000_chars_under_1ms was unconditionally failing under cargo test on debug builds (2.5 ms vs the 1 ms threshold). Marked #[ignore] matching the two sibling perf-threshold tests; rerun locally with cargo test -- --ignored against a release build.
  • crates/cli/src/scan_runtime.rs was a 0-byte dead module with no references anywhere in the workspace. Deleted.
  • Workspace license field downgraded from MIT OR Apache-2.0 to MIT - the only license file shipped in the repo is the MIT one. Honesty over ecosystem convention.
  • cargo clippy --workspace --all-targets now clean (was 4 warnings: unused-mut in dedup.rs, items-after-test-module in orchestrator_config.rs, an unnecessary as_ref() in the new streaming preview, and an explicit-counter loop in extract_plain_matches that’s intentional for deadline-cadence gating and now carries an explanatory #[allow]).
  • detectors/.keyhog-cache.json (runtime parse cache) is now gitignored AND keyhog-core/Cargo.toml carries an explicit exclude so a stale cache file can’t sneak into the published tarball.
  • scripts/audit.sh wraps cargo audit with the four accept-with-rationale --ignore flags so local audits exit clean the way CI does (cargo-audit 0.22 doesn’t auto-load audit.toml).

Crates.io publish setup. Workspace package metadata (description/license/repo/homepage/docs/keywords/categories/readme) audited end-to-end across all five crates; package contents verified via cargo package --list for each crate before publish (no stray fixtures, no .work-linux.bundle, no target tree). Path-dep version pins on the four library crates bumped in lockstep with the workspace version (=0.5.4 everywhere) - the = pin guarantees a downstream cargo install keyhog 0.5.4 resolves to a self-consistent set.

v0.5.3 - 2026-05-07

I/O perfection pass - five staged perf + correctness landings on the filesystem source path, plus one latent-bug fix surfaced by the new test coverage.

Stage A - content cache (perf + correctness). Merkle index schema v2: each entry now carries (mtime_ns, size, BLAKE3) and the file gets a top-level spec_hash derived from the canonical detector set. metadata_unchanged(path, mtime, size) short-circuits the file read entirely when stat metadata matches a stored entry - the dominant cost on cold-cache disk for --incremental re-runs. load_with_spec(path, expected_spec_hash) invalidates the cache the moment any detector regex, group, or companion changes, fixing a latent correctness bug where an added detector would silently miss unchanged files forever.
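
The two checks in Stage A, a stat-only skip and a spec-hash invalidation, can be sketched with an in-memory index. The Entry and Index types and the reuse_if_spec_matches helper are hypothetical; hashes are treated as opaque 32-byte arrays so the example does not depend on a BLAKE3 crate, and loading from disk is not shown.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Entry {
    mtime_ns: u128,
    size: u64,
    content_hash: [u8; 32], // opaque here; BLAKE3 in the real index
}

struct Index {
    spec_hash: [u8; 32],
    entries: HashMap<String, Entry>,
}

impl Index {
    fn new(spec_hash: [u8; 32]) -> Self {
        Self { spec_hash, entries: HashMap::new() }
    }

    /// Keep stored entries only when they were built against the current
    /// detector set; otherwise start from an empty index.
    fn reuse_if_spec_matches(stored: Index, expected_spec_hash: [u8; 32]) -> Index {
        if stored.spec_hash == expected_spec_hash {
            stored
        } else {
            Index::new(expected_spec_hash)
        }
    }

    /// True when stat metadata matches a stored entry, so the file read can be skipped.
    fn metadata_unchanged(&self, path: &str, mtime_ns: u128, size: u64) -> bool {
        self.entries
            .get(path)
            .map_or(false, |e| e.mtime_ns == mtime_ns && e.size == size)
    }
}
```

Dropping the whole index on a spec change is what prevents a newly added detector from missing files whose metadata has not changed.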

Stage B - mmap big-file scan. Replaced the read+seek loop in FilesystemSource’s >64 MiB path with a single mmap + zero-copy slice into window_size-byte windows with window_overlap shared bytes between neighbours. Drops the 64 MiB heap working buffer and the per-window seek+re-read overlap round-trip; madvise(SEQUENTIAL) drives kernel readahead. Falls back cleanly to the buffered loop when mmap is refused (locked writer, exotic filesystem).
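
The window arithmetic is independent of mmap and can be shown as a pure helper. The function name window_ranges is hypothetical; window_size and overlap mirror the parameter names used above.

```rust
use std::ops::Range;

/// Split `len` bytes into windows of at most `window_size` bytes, where each
/// window after the first starts `overlap` bytes before the previous one ended.
fn window_ranges(len: usize, window_size: usize, overlap: usize) -> Vec<Range<usize>> {
    assert!(window_size > overlap, "window must be larger than its overlap");
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + window_size).min(len);
        ranges.push(start..end);
        if end == len {
            break; // final window reached the end of the file
        }
        start = end - overlap; // neighbours share `overlap` bytes
    }
    ranges
}
```

Each range can then be used to slice the mapped file directly, so no window needs its own heap buffer.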

Stage C - I/O ↔ scan pipeline. scan_sources spawns the scanner in a dedicated thread holding Arc<CompiledScanner>. The producer (main thread) iterates sources and builds batches; the scanner pulls completed batches off a sync_channel(1) and runs scan_coalesced. While the scanner is busy on regex, the producer is busy on disk I/O, so total wall time approaches max(read, scan) instead of read + scan. Channel capacity 1 keeps memory bounded to one in-flight batch.
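
The producer/scanner split can be sketched with the standard library alone. run_pipeline and its generic parameters are hypothetical; the scan function is passed in so the example does not depend on CompiledScanner, and it returns a finding count rather than real findings.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;

/// Feed `batches` from the calling thread to a dedicated scanner thread through
/// a capacity-1 channel, and return the total count reported by `scan`.
fn run_pipeline<S, B>(
    scanner: Arc<S>,
    batches: impl IntoIterator<Item = B>,
    scan: fn(&S, B) -> usize,
) -> usize
where
    S: Send + Sync + 'static,
    B: Send + 'static,
{
    // Capacity 1: at most one completed batch waits while another is scanned.
    let (tx, rx) = sync_channel::<B>(1);
    let worker = thread::spawn(move || {
        rx.into_iter().map(|batch| scan(&*scanner, batch)).sum::<usize>()
    });
    for batch in batches {
        // `send` blocks while the slot is full, which bounds memory use.
        if tx.send(batch).is_err() {
            break; // scanner thread has gone away
        }
    }
    drop(tx); // closing the channel lets the scanner loop finish
    worker.join().expect("scanner thread panicked")
}
```

While the worker is inside scan, the producer is free to build the next batch, which is where the max(read, scan) wall-time behaviour comes from.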

Stage D - mmap compressed reads. ziftsieve only takes a contiguous &[u8] so streaming decompression isn’t on the menu, but mmap’ing the compressed file lets us hand it the whole input without a corresponding heap allocation. A 1 GiB .zst previously manifested as a 1 GiB Vec<u8> before decompression began. New FileBytes enum (Mmap | Owned) with size-cap gating; falls back to fs::read only on mmap refusal.

Stage E - per-platform mmap threshold. Lowered to 64 KiB on Unix where mmap setup is sub-microsecond and avoids the page cache → userland buffer copy. Held at 1 MiB on Windows where MapViewOfFile carries section-object + security-token costs that buffered ReadFile doesn’t pay.

Latent bug fixed alongside Stage D. gz and zst were in SKIP_EXTENSIONS, so the extract_compressed_chunks dispatch arm in the FilesystemSource iterator was actually unreachable - compressed files were silently being skipped on every scan. Removed those entries (the gz/zst handler now actually runs).

Tests. ~55 new tests covering: 13 merkle_index v2 unit, 12 window-slicing pure-helper unit, 4 FileBytes/mmap-or-bytes unit, 6 pipeline orchestrator unit (including a 6000-chunk recall floor that proves the threading doesn’t drop batches), 9 FilesystemSource integration covering the windowed path, merkle skip, and gz end-to-end. Existing 53 scanner lib + 31 sources read unit + 20 filesystem integration all still green on both Windows and Linux.

Code cleanup. Removed dead detector_to_patterns field + helper from the scanner (unused since the v0.5.2 perf trim). Tightened the Arc import gate in crates/sources/src/lib.rs so docker-only builds no longer warn about unused imports.

v0.5.2 - 2026-05-06

Reconciliation pass against the parallel Legendary Hardening line (v0.3.0 → v0.4.0 → v0.5.0) that lived only on the work-linux clone and was never pushed. Both lines diverged at 013257e (CI fmt scope) and independently arrived at near-identical scanner/sources state.

Reviewed every file the work-linux line touched; no salvageable code was missing from this branch:

  • SensitiveString migration, MADV_DONTDUMP zero-leak buffers, proximity-aware multiline reassembly, hardened ratelimiter, AC prefilter for has_secret_keyword_fast - already present here, fmt-clean, with the no-default-features feature gates the v0.6.x pass added.
  • The 6 secret-laden boundary-test fixtures (test.txt, boundary_test.txt, etc.) accidentally committed in work-linux’s v0.4.0-finalize commit are intentionally not brought in: they trip GitHub push-protection and the boundary test that needed them was rewritten to use a synthetic XX_FAKE_* shape in v0.6.1.
  • crates/sources/src/slack.rs:54 data: T.into() syntax bug that still exists on the work-linux line was already fixed here in v0.6.0.

Net new: version bump only. No code regressions, no losses.

vendor/vyre is untouched - separate project with its own versioning.

v0.6.1 - 2026-05-06

Perfection pass on top of v0.6.0.

Fixed

  • crates/sources/src/binary/{mod,sections}.rs: 5 type errors (the extract_printable_strings wrapper claimed Vec<String> while the underlying call returned Vec<SensitiveString>). Any build with --features binary previously failed to compile.
  • aws-access-key.toml: dropped required = true from the secret_key companion. A leaked AKIA on its own is still a reportable finding; verification correctly downgrades to “unverified” when no co-located secret is found instead of silently dropping the match.
  • crates/core/tests/unit/spec.rs: the no_detector_uses_singular_companion_table test now mirrors crates/core/build.rs’s symlink fallback so it works on Windows checkouts where crates/core/detectors lands as a literal file containing the link target.
  • crates/scanner/tests/performance_regression.rs: replaced the CRC32-invalid ghp_ABCDEF… synthetic with an AKIA-shape fixture so the test exercises the no-default-features build (where checksum validation fails closed).
  • 3 adversarial tests gated behind the features they exercise (ml, multiline, decode); previously they ran under --no-default-features and asserted behavior that requires those features.

Hygiene

  • cargo clippy --workspace --no-default-features --all-targets clean (zero warnings) under both --no-default-features and the default-minus-simd matrix.
  • cargo fmt --check clean.
  • 596/596 tests pass under both feature configurations.

v0.6.0 - 2026-05-06

Out-of-band callback verification + broad robustness/detector fixes.

Added

  • OOB verification (--verify-oob): RSA-2048 + AES-256-CFB interactsh client (oast.fun by default; --oob-server HOST to self-host). Detector TOML gains a [detector.verify.oob] block with protocol={dns,http,smtp,any}, policy={oob_and_http,oob_only,oob_optional}, and accept={dns,http,smtp,any}. Probe payloads can interpolate {{interactsh_url}}, {{interactsh_host}}, and {{interactsh_id}} to embed a unique callback URL per probe; the session waits for a matching hit before declaring the credential live. Documented in docs/OOB.md.
  • keyhog_core::spec::validate now audits companion-substitution capture groups, reserved companion names (__keyhog_oob_*), and that every {{companion.X}} / auth-field reference resolves to a declared companion.
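
The companion-reference audit can be pictured with a small sketch. This is not the code in keyhog_core::spec::validate; the helper name undeclared_companions and its signature are hypothetical, and only the {{companion.X}} placeholder shape is taken from the entry above.

```rust
use std::collections::HashSet;

/// Collect every `{{companion.NAME}}` reference in `template` whose NAME is
/// not in `declared`. Hypothetical helper for illustration only.
fn undeclared_companions(template: &str, declared: &HashSet<&str>) -> Vec<String> {
    const OPEN: &str = "{{companion.";
    let mut missing: Vec<String> = Vec::new();
    let mut rest = template;
    while let Some(start) = rest.find(OPEN) {
        let after = &rest[start + OPEN.len()..];
        match after.find("}}") {
            Some(end) => {
                let name = after[..end].trim();
                // Report each unknown name once, in order of first appearance.
                if !declared.contains(name) && !missing.iter().any(|m| m == name) {
                    missing.push(name.to_string());
                }
                rest = &after[end + 2..];
            }
            // Unterminated placeholder: nothing more to extract.
            None => break,
        }
    }
    missing
}

fn main() {
    let declared: HashSet<&str> = ["secret_key"].iter().copied().collect();
    let template = "Basic {{companion.account_sid}}:{{companion.secret_key}}";
    // Only `account_sid` is reported, because `secret_key` is declared.
    println!("{:?}", undeclared_companions(template, &declared));
}
```

A validator built this way would reject the spec at load time instead of letting an unresolved placeholder reach a live probe.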

Fixed

  • extract_grouped_matches (scanner): zero-width regex hits no longer infinite-loop the matcher; capture-group walk reuses a single CaptureLocations and aligns to UTF-8 boundaries; out-of-range detector index now fails closed instead of panicking.
  • Required companions (required = true) actually short-circuit: prior unwrap_or_default() swallowed the “missing required companion” signal and shipped the finding anyway.
  • OobSession::wait_for race: registers the Notified waiter via Notified::enable() before checking observations, so notifications fired between the check and the await no longer get lost.
  • 8 detector verify specs referenced undeclared companions or used template strings in the auth-field slot, so every probe returned 401 (Twilio IoT, Akoya, Razorpay, Braintree sandbox, etc.). Each now declares the companion it references.
  • Look-behind regex assertions ((?<=, (?<!) are no longer misclassified as named capture groups by the spec validator.
  • crates/sources/src/slack.rs: data: T.into() syntax error in SlackResponse<T> would have failed any build that exercised the slack feature.
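
For readers unfamiliar with the zero-width failure mode, here is a minimal sketch of the usual remedy: when a match consumes no input, advance the cursor by one whole UTF-8 character before searching again. It does not reproduce extract_grouped_matches; the toy matcher match_digits_at stands in for a regex such as `\d*`, and both function names are hypothetical.

```rust
/// Toy stand-in for a regex like `\d*`: the longest run of ASCII digits
/// starting exactly at byte offset `at`. The run may be empty (zero-width).
fn match_digits_at(haystack: &str, at: usize) -> (usize, usize) {
    let len = haystack[at..]
        .bytes()
        .take_while(|b| b.is_ascii_digit())
        .count();
    (at, at + len)
}

/// Collect all non-empty matches without stalling on zero-width hits.
fn all_matches(haystack: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut cursor = 0;
    while cursor <= haystack.len() {
        let (start, end) = match_digits_at(haystack, cursor);
        if end > start {
            out.push((start, end));
            cursor = end;
        } else {
            // Zero-width hit: step over one whole character so the cursor
            // stays on a UTF-8 boundary and the loop always makes progress.
            cursor += haystack[cursor..]
                .chars()
                .next()
                .map_or(1, |c| c.len_utf8());
        }
    }
    out
}

fn main() {
    // 'é' is two bytes, so the second match starts at byte offset 5.
    println!("{:?}", all_matches("a12\u{e9}345"));
}
```

Without the else branch the cursor would never move past a position where the pattern matches the empty string, which is the infinite loop described above.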

Performance

  • Aho-Corasick prefilter for has_secret_keyword_fast and has_generic_assignment_keyword (single-pass).
  • extract_inner_literals AST walker promotes inner literals into the prefilter alphabet (corpus coverage test pins ≥3 patterns promoted).
  • find_companion splits into a capture-group-free fast path (find_iter) and a grouped path that reuses CaptureLocations.
  • Active-fallback bitmap precomputed at scanner construction; per-chunk thread-local ACTIVE_PATTERNS_POOL avoids reallocation.
  • Filesystem reader: two-sided looks_binary early exit, streaming UTF-16 decode, valid-UTF-8 fast path.
  • Slack source fetches per-channel history concurrently (rayon, 8 threads).

Hardening

  • looks_binary short-circuit verified against full-scan baseline across page-boundary cases.
  • open_file_safe rejects symlinks on Windows (Unix already enforced).
  • Self-suppression list rewritten with concat!() to keep example credentials out of the repo’s literal string table.
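
The concat!() trick is easy to show in isolation. The sketch below assumes the suppression entry is the well-known AWS documentation sample key; the constant and function names are hypothetical. The source file never contains the key as one contiguous literal, so a scan of the repository has nothing to match, while the compiler still produces a single &'static str.

```rust
// Joined at compile time; the contiguous key never appears in the source text.
const AWS_DOCS_EXAMPLE_KEY: &str = concat!("AKIA", "IOSFODNN7", "EXAMPLE");

/// Hypothetical check: is this candidate one of the known example credentials?
fn is_self_suppressed(candidate: &str) -> bool {
    candidate == AWS_DOCS_EXAMPLE_KEY
}

fn main() {
    // Assemble the candidate at runtime for the same reason.
    let candidate = ["AKIA", "IOSFODNN7", "EXAMPLE"].concat();
    println!("{}", is_self_suppressed(&candidate));
}
```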

v0.3.0 - 2026-05-01

The “legendary” wave: 18 Tier-A perf wins + 12 Tier-B moat innovations from the 2026-04-26 deep audits, plus a perfection pass that hardened GPU/CPU auto-routing across every supported OS. Build is green, scanner test suite 229+/0, core 33+/0, hw_probe routing 11/0, doctests 38/0.

Hardware routing & GPU/CPU saturation (perfection pass)

  • KEYHOG_BACKEND={gpu,simd,cpu} env var force-pins the scan backend at the highest routing priority, used by CI matrix builds and benchmarks to assert backend-specific code paths actually run (ba0e3fc).
  • KEYHOG_THREADS=N env var sets the rayon pool size; --threads takes absolute priority, and the physical-core count is the automatic fallback (3c4924c).
  • Per-OS wgpu adapter preference replaces Backends::all(): Windows → DX12 + Vulkan, macOS/iOS → Metal, Linux/BSD → Vulkan + GL - each platform gets its first-class native API (ba0e3fc).
  • Public hw_probe::thresholds module exposes the routing crossovers (GPU_MIN_BYTES=64 MiB, GPU_PATTERN_BREAKEVEN=2000, GPU_BYTES_BREAKEVEN_SOLO=256 MiB) for benchmarks and the inspector subcommand to reference one source of truth (ba0e3fc).
  • 11 routing unit tests pin every documented threshold + the env-override branch + the software-renderer skip. Tests serialize through a Mutex guard since they mutate process env (ba0e3fc, 3c4924c).
  • keyhog backend subcommand: dumps detected hardware, the active backend, the env override (if set), and a routing decision matrix at every documented threshold; --probe-bytes and --patterns for what-if simulation (ba0e3fc).
  • GPU init now requests the adapter’s full limits (was capped at wgpu Limits::default()’s 128 MiB storage-buffer ceiling; an RTX 5090 had its batch size throttled to 0.4% of physical capacity) (e182938).
  • GPU init rejects device_type == Cpu adapters at the wgpu layer too (catches future software fallbacks not in the llvmpipe/lavapipe name list) (3c4924c).
  • Per-scan tracing::info! logs the selected backend; per-chunk tracing::trace! on keyhog::routing for full audit trails (3c4924c, ba0e3fc).
  • Verifier gained danger_allow_http opt-in flag to support HTTP test mocks while keeping production HTTPS-only (0da1f94).
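
The thread-count precedence (--threads, then KEYHOG_THREADS, then an automatic fallback) can be sketched as a pure function. The function name resolve_threads is hypothetical, skipping zero or unparseable values is an assumption, and the standard library reports logical parallelism rather than the physical-core count the release uses.

```rust
use std::thread;

/// Resolve the worker count: CLI flag first, then the env var, then the
/// automatic fallback. Zero or unparseable values are skipped (assumption).
fn resolve_threads(cli: Option<usize>, env: Option<&str>, auto: usize) -> usize {
    cli.filter(|&n| n > 0)
        .or_else(|| {
            env.and_then(|v| v.trim().parse::<usize>().ok())
                .filter(|&n| n > 0)
        })
        .unwrap_or_else(|| auto.max(1))
}

fn main() {
    let env = std::env::var("KEYHOG_THREADS").ok();
    // Assumption: logical parallelism stands in for the physical-core count.
    let auto = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("{}", resolve_threads(None, env.as_deref(), auto));
}
```

Keeping the resolution free of process state makes each precedence branch testable without mutating the environment.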

Performance - CPU saturation

  • scan_chunks_with_backend_internal now uses rayon::par_iter on the non-GPU paths - was serial, pinned to a single core even on 32-core boxes (a693ba2).
  • scan_coalesced parallelizes its #[cfg(not(feature = "simd"))] and Hyperscan-init-failure fallbacks; multi-core builds without Hyperscan now saturate cores (27caaf9).
  • [profile.release] pinned: opt-level=3 + lto=fat + codegen-units=1 + panic=abort + strip - was using cargo defaults; the new profile yields ~10-20% more throughput on hot paths via cross-crate inlining (3c4924c).
  • [profile.release-fast] (thin LTO, 16 codegen-units) for sub-minute CI builds; [profile.bench] keeps line-tables for flamegraph attribution.

Performance - Tier-A perf wins (~constant-factor allocations on the hot path)

  • Cow-borrowed normalize_homoglyphs and prepare_chunk - ASCII fast path no longer clones (7e7cd55).
  • post_process_matches dedup keys are Arc<str>, not String (7e7cd55).
  • Thread-local trigger-bitmask pool - drops ~2.4M allocs on a 100k-file scan (7e7cd55).
  • Phase-1 returns Option<Vec<u64>> so empty chunks never allocate (7e7cd55).
  • BTreeMap dedup → indexmap::IndexMap for O(1) lookups with deterministic (insertion) ordering (d3b6721).
  • Streaming SARIF reporter - peak memory drops from O(N findings) to O(rules) (3a15fd0).
  • Batched-streaming orchestrator - 4096 chunks / 256 MiB per batch caps peak memory on giant scans (a6c88b2).
  • Sharded DashMap for verifier VerificationCache, RateLimiter, and in-flight map (no more global RwLock contention) (d3b6721).
  • Concurrent rayon-parallel S3 / GitHub-org / Slack source backends (8–16 in-flight) (d3b6721).
  • Shared Arc<Regex> compile cache via shared_regex() - same regex across detectors compiles once (a38e79c).
  • Pre-built index_set once on Baseline::load via OnceLock (d3b6721).
  • Bigram bloom prefilter (Layer 0.5) - gates chunks ≥64 bytes before Hyperscan (3a15fd0).
  • Dropped io_uring single-op path (latency regression, kept the multi-op batch path) (d3b6721).
  • Decode-bomb time budget - per-chunk wall-clock ceiling on decode_chunk (20d3ef8).
  • Probabilistic gate filled in: distinct-bigram density via FNV-512 (20d3ef8).
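
The Cow-borrowed ASCII fast path follows a common Rust pattern, sketched here under stated assumptions: the three-entry mapping table is invented for illustration, and the real normalize_homoglyphs may differ in signature and coverage.

```rust
use std::borrow::Cow;

/// Map a few look-alike characters to ASCII. Pure-ASCII input is returned
/// borrowed, so the common case performs no allocation or copy.
fn normalize_homoglyphs(input: &str) -> Cow<str> {
    if input.is_ascii() {
        return Cow::Borrowed(input); // fast path: nothing to rewrite
    }
    // Slow path: build an owned copy with look-alikes replaced.
    let mapped: String = input
        .chars()
        .map(|c| match c {
            '\u{0410}' => 'A', // Cyrillic capital A
            '\u{0406}' => 'I', // Cyrillic capital Byelorussian-Ukrainian I
            '\u{041A}' => 'K', // Cyrillic capital Ka
            other => other,
        })
        .collect();
    Cow::Owned(mapped)
}

fn main() {
    let ascii = normalize_homoglyphs("AKIA1234");
    let mixed = normalize_homoglyphs("\u{0410}K\u{0406}A1234");
    println!("{} {}", ascii, mixed);
}
```

Callers that only read the result can treat both variants as &str, so the borrowed case costs nothing beyond the is_ascii check.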

Innovations - Tier-B moat features

  • Bayesian Beta(α,β) confidence calibration - per-detector posterior updated from observed TP/FP, multiplier wired into the live scoring path, CLI surface (keyhog calibrate --tp/--fp/--show) (34deeb0, d5d447e).
  • Incremental scan via persisted BLAKE3 Merkle index - unchanged files skip the scanner entirely on CI re-runs (57c4cc8).
  • Cross-detector dedup at emit - one secret matched by N detectors collapses to one finding with N ranked service guesses (eab71b2).
  • Diff-aware severity - git source pre-walks HEAD’s tree, tags chunks git/head vs git/history, and the latter’s findings drop one severity tier (410dc0e).
  • JWT structural validation - header.payload decode with alg/typ/exp inspection and alg=none anomaly detection (43092b6).
  • CWE-798 + OWASP A07:2021 SARIF taxa - compliance-grade reporting (5462625).
  • SARIF v2.2 fixes[] with deletedRegion/insertedContent and env-var-name auto-fix suggestions (650e599).
  • Allowlist governance metadata - ; reason="…" ; expires=YYYY-MM-DD ; approved_by="…" per entry, expired entries auto-drop (32ff3a8).
  • keyhog explain <detector-id> - full spec dump, regex breakdown, and rotation-guide URLs for major providers (f56f97e).
  • keyhog diff <before.json> <after.json> - NEW / RESOLVED / UNCHANGED set diff for CI regression detection (52d7242).
  • keyhog watch <path> - daemon mode with notify-based file watcher, compile-once-scan-many on saves; sub-100ms re-scan (56c61d6).
  • keyhog calibrate - α/β counter management with posterior-mean bar visualization (34deeb0).
  • keyhog detectors --search <query> --verbose - case-insensitive filter against id/name/service/keywords; verbose dumps full spec (5951a14).
  • keyhog completion <shell> - bash, zsh, fish, powershell, elvish (8ab105f).
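
The Beta(α,β) calibration rests on simple arithmetic: each confirmed true positive adds one to α, each false positive adds one to β, and the posterior mean is α / (α + β). The sketch below assumes a uniform Beta(1, 1) prior and a hypothetical BetaPosterior type; how the shipped scorer turns the posterior into a confidence multiplier is not shown here.

```rust
/// Beta(alpha, beta) belief about one detector's precision.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BetaPosterior {
    alpha: f64, // prior pseudo-count plus observed true positives
    beta: f64,  // prior pseudo-count plus observed false positives
}

impl BetaPosterior {
    /// Uniform Beta(1, 1) prior. This is an assumption for the sketch.
    fn uniform() -> Self {
        Self { alpha: 1.0, beta: 1.0 }
    }

    /// Fold observed true-positive and false-positive counts into the posterior.
    fn observe(self, tp: u32, fp: u32) -> Self {
        Self {
            alpha: self.alpha + f64::from(tp),
            beta: self.beta + f64::from(fp),
        }
    }

    /// Posterior mean: alpha / (alpha + beta).
    fn mean(self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }
}

fn main() {
    // 18 true positives and 2 false positives give a mean of 19/22.
    let posterior = BetaPosterior::uniform().observe(18, 2);
    println!("{:.3}", posterior.mean());
}
```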

Adversarial coverage

  • Reverse-string decoder for tokens stored backwards as evasion (c462e9c).
  • Caesar / ROT-N decoder for ROT13’d configs (c462e9c).
  • Hex _ separator stripping (firmware dumps, embedded configs use A1_B2_C3_…) (2980284).
  • Comment-suffix disclaimer suppression - // not a real key, # fake credential, etc. (2980284).
  • Cross-detector dedup also handles 2-fragment AWS reassembly with no-shared-prefix var names (3327b39).
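
The reverse-string and ROT-N decoders amount to generating candidate plaintexts for the detectors to re-scan. A minimal sketch, assuming all 25 non-trivial rotations are tried (the shipped decoder may restrict this) and using hypothetical function names:

```rust
/// Reverse a string by Unicode scalar values.
fn reverse(s: &str) -> String {
    s.chars().rev().collect()
}

/// Shift ASCII letters forward by `n` positions, wrapping within each case.
/// Digits, punctuation, and non-ASCII characters pass through unchanged.
fn rot_n(s: &str, n: u8) -> String {
    let n = n % 26;
    s.chars()
        .map(|c| match c {
            'a'..='z' => (((c as u8 - b'a' + n) % 26) + b'a') as char,
            'A'..='Z' => (((c as u8 - b'A' + n) % 26) + b'A') as char,
            _ => c,
        })
        .collect()
}

/// Candidate plaintexts: the reversed input, then rotations 1 through 25.
fn candidates(s: &str) -> Vec<String> {
    let mut out = vec![reverse(s)];
    out.extend((1..26).map(|n| rot_n(s, n)));
    out
}

fn main() {
    // "NXVN" is the ROT13 form of the AKIA prefix.
    let found = candidates("NXVN").iter().any(|c| c == "AKIA");
    println!("{}", found);
}
```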

Architecture

  • GPU auto-routing - runtime probe selects GPU vs CPU based on adapter type, workload size, and pattern count; mandatory build-time presence (no more feature gate) (7feb723).
  • Filesystem source: per-archive-entry uncompressed-size cap; ziftsieve gzip/zstd/lz4 4× decompressed-byte budget (5cc3906).
  • Verifier hardening: SSRF DNS-rebinding defeated via tokio::net::lookup_host post-resolve check; HTTPS-only, with no localhost exception (7feb723).
  • AWS SigV4 dates derived from SystemTime::now via Howard-Hinnant civil arithmetic (no chrono runtime cost) (7feb723).
  • fragment_cache module relocated under multiline/ where every call site lives; re-exported at the crate root for back-compat (70e35a8).
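
The civil-date arithmetic behind the SigV4 dates can be sketched with the standard library alone. The conversion follows Howard Hinnant's published civil_from_days algorithm; the function names and the exact shape of the verifier's code are assumptions. SigV4 needs two strings: the YYYYMMDD'T'HHMMSS'Z' timestamp and the YYYYMMDD credential-scope date.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Convert days since 1970-01-01 to (year, month, day) in the proleptic
/// Gregorian calendar, after Howard Hinnant's `civil_from_days`.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // 0..=365, year starts 1 March
    let mp = (5 * doy + 2) / 153; // month index counted from March, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Build the SigV4 timestamp and credential-scope date for a Unix time.
fn sigv4_dates(unix_secs: u64) -> (String, String) {
    let secs = unix_secs as i64;
    let (y, m, d) = civil_from_days(secs.div_euclid(86_400));
    let tod = secs.rem_euclid(86_400); // seconds since midnight UTC
    let date = format!("{:04}{:02}{:02}", y, m, d);
    let stamp = format!(
        "{}T{:02}{:02}{:02}Z",
        date,
        tod / 3_600,
        tod % 3_600 / 60,
        tod % 60
    );
    (stamp, date)
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before 1970")
        .as_secs();
    let (stamp, date) = sigv4_dates(now);
    println!("{} {}", stamp, date);
}
```

The whole conversion is a handful of integer divisions, which is why it can replace a date library on this path.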

Tests

  • Wired adversarial fixtures into cargo test (no more skipped corpus) (5cc3906).
  • Aligned gitleaks_hash_* allowlist tests with the hardened is_hash_allowed API (no plaintext fallback) (b2b405d).
  • Wrapped ?-using doctests in explicit fn main() -> Result so the E0277 wave is gone (19ce4f5).
  • 229 scanner tests / 33 core unit tests / 38 doctests, 0 failed.

Detector corpus

  • Brutal audit of all 896 detectors found schema decay; corrupted entries removed, broken logic flagged (e934144).
  • Schema rename (kimi automated): aligned every detector to the post-audit field set (826d54f).
  • Verifier auth wiring fixes for the corpus (826d54f).
  • 859 valid detectors after the gate; ~30 still flagged for pure-character-class companions (tracked separately).

v0.2.1 - 2026-04-04

Maintenance release: production-readiness fixes, dependency updates, agent sweeps. See git log v0.2.0..v0.2.1 for the commit list.

v0.2.0 - 2026-03-30

The fastest, most accurate secret scanner.

First “legendary bar” release. Highlights:

  • Embedded 888-detector corpus (no separate detectors/ directory needed).
  • Hyperscan SIMD regex with disk-cached compiled DB.
  • Aho-Corasick literal prefilter feeding into the regex layer.
  • ML-based confidence scoring (MoE classifier with per-detector calibration).
  • Decode-through pipeline: base64, hex, URL, MIME, HTML entities, Z85, unicode/octal escapes, quoted-printable.
  • Multiline secret reassembly across line-continuation patterns in a dozen languages.
  • Sources: filesystem, git history, git diff, GitHub orgs, S3, Docker images, web URLs (JS/sourcemap/WASM), Slack (admin export).
  • Verifier framework with TOML-defined live verification per detector.
  • SARIF v2.1.0 + JSON + JSONL + plain-text reporters.

v0.1.0 - 2026-03-26

  • First public release of the KeyHog workspace.
  • Production-readiness cleanup for docs, examples, README guidance, and release metadata.
  • Verified cargo check, cargo test, and cargo clippy --workspace -- -D warnings.