How detection works
A KeyHog scan is a pipeline. Files come in one side, findings go out the other. In between, four stages:
files → [chunker] → [prefilter] → [detector match] → [post-process] → findings
Each stage is a hard filter - if a chunk fails the prefilter, no detector ever runs on it. That’s where the speed comes from: the expensive regex evaluation only sees chunks that already plausibly contain something.
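The hard-filter behavior can be sketched in a few lines of Python. This is an illustrative model, not KeyHog's code: `run_pipeline`, the `key=` prefilter, and the toy regex are all assumptions, chosen only to show that the expensive matcher never sees a chunk the prefilter rejected.

```python
import re
from typing import Callable, Iterable, Iterator

def run_pipeline(
    chunks: Iterable[str],
    prefilter: Callable[[str], bool],
    match: Callable[[str], list],
    keep: Callable[[str], bool],
) -> Iterator[str]:
    """Yield findings; each stage can drop work before the next one runs."""
    for chunk in chunks:
        if not prefilter(chunk):        # cheap gate: rejected chunks stop here
            continue
        for candidate in match(chunk):  # expensive regex stage
            if keep(candidate):         # post-process suppression
                yield candidate

seen_by_regex = []  # records which chunks reached the expensive stage

def toy_match(chunk: str) -> list:
    """Hypothetical stand-in for a detector regex."""
    seen_by_regex.append(chunk)
    return re.findall(r"key=(\w+)", chunk)

findings = list(run_pipeline(
    ["plain prose, nothing here", "key=abc123", "key=EXAMPLE"],
    prefilter=lambda c: "key=" in c,
    match=toy_match,
    keep=lambda s: "EXAMPLE" not in s,
))
```

Of the three chunks, only two reach `toy_match`, and only one candidate survives the post-process check.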
Stage 1 - chunker
A file becomes one or more chunks. A chunk is {data: str, metadata: {source_type, path, line_offsets, …}}. The chunker:
- Skips obvious binaries via magic-byte sniffing (PDF, PNG, zip, …).
- Skips files matching `is_default_excluded` (`node_modules`, `.min.js`, `build/`, etc.).
- Splits files larger than 64 MiB into overlapping windows so a single giant log file doesn’t blow scratch memory. Cross-window secrets are reassembled in stage 4.
- Decodes UTF-16 BOM files into UTF-8 (PowerShell / .NET configs).
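The chunker steps above could look roughly like the following. This is a minimal sketch under stated assumptions: the function names are hypothetical, the magic-byte list covers only the three formats named above, and the overlap size is a free parameter because the text does not state one.

```python
import codecs
from typing import Iterator, Tuple

# Well-known file signatures for PDF, PNG, and zip.
MAGIC_PREFIXES = (b"%PDF-", b"\x89PNG\r\n\x1a\n", b"PK\x03\x04")

def looks_binary(head: bytes) -> bool:
    """Magic-byte sniff on the first bytes of a file."""
    return head.startswith(MAGIC_PREFIXES)

def decode_text(raw: bytes) -> str:
    """Decode UTF-16 when a BOM is present, otherwise treat as UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # the codec consumes the BOM
    return raw.decode("utf-8", errors="replace")

def split_windows(data: bytes, window: int, overlap: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, bytes) pairs; consecutive windows share `overlap` bytes."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    offset = 0
    while True:
        yield offset, data[offset:offset + window]
        if offset + window >= len(data):
            break
        offset += step

windows = list(split_windows(b"abcdefghij", window=4, overlap=1))
```

With a 4-byte window and 1-byte overlap, the 10-byte input yields three windows starting at offsets 0, 3, and 6. A file smaller than the window comes back as a single chunk.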
Specialized chunkers run too:
- Git history → one chunk per (commit × file × diff line)
- Docker images → one chunk per layer × file
- Web URLs → one chunk per response body / sourcemap / WASM strings
- S3 buckets → one chunk per object body
Stage 2 - prefilter (the cheap pass)
Three gates, in order, each cheaper than the next:
- Alphabet screen. A 256-bit mask of which bytes the corpus’s detectors care about. If a chunk doesn’t contain ANY byte in the mask, it’s discarded. Most random-binary chunks fail here.
- Bigram bloom. A 4096-bit bloom filter of 2-byte sequences from detector keyword prefixes. If a chunk has no overlapping bigram, discard. Catches the “this is a Go source file with no `key=` anywhere” case in microseconds.
- SIMD prefilter (Hyperscan). A multi-pattern SIMD regex scanner. The detector corpus is compiled to a single Hyperscan database; one scan call returns “which detector IDs have a candidate match.” On AVX-512 hardware this runs at ~3 GB/s.
When a GPU is available and the input exceeds the breakeven threshold (2 MiB on 5090-class, 16 MiB on 4090-class), the prefilter switches to a CUDA literal-set scan via vyre - same patterns, parallelized across thousands of cores.
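The first two gates are simple enough to model in plain Python. The sketch below is an assumption-heavy illustration: the keyword list is hypothetical, the bigram hash `(a * 251 + b) % 4096` is a stand-in because the real hash is not described here, and the bitsets are Python integers rather than vectorized tables. The Hyperscan gate is not modeled.

```python
from typing import Iterable

KEYWORDS = [b"AWS_SECRET", b"sk_live_", b"password"]  # hypothetical keyword prefixes
BLOOM_BITS = 4096

def build_alphabet_mask(keywords: Iterable[bytes]) -> int:
    """256-bit mask with bit b set when byte value b appears in any keyword."""
    mask = 0
    for kw in keywords:
        for byte in kw:
            mask |= 1 << byte
    return mask

def passes_alphabet(chunk: bytes, mask: int) -> bool:
    """Keep the chunk if at least one of its bytes is in the mask."""
    return any((mask >> byte) & 1 for byte in chunk)

def bigram_slot(a: int, b: int) -> int:
    """Assumed hash from a 2-byte sequence to a bloom bit index."""
    return (a * 251 + b) % BLOOM_BITS

def build_bigram_bloom(keywords: Iterable[bytes]) -> int:
    bloom = 0
    for kw in keywords:
        for a, b in zip(kw, kw[1:]):
            bloom |= 1 << bigram_slot(a, b)
    return bloom

def passes_bloom(chunk: bytes, bloom: int) -> bool:
    """Keep the chunk if any of its bigrams may belong to a keyword."""
    return any((bloom >> bigram_slot(a, b)) & 1 for a, b in zip(chunk, chunk[1:]))

MASK = build_alphabet_mask(KEYWORDS)
BLOOM = build_bigram_bloom(KEYWORDS)

def prefilter(chunk: bytes) -> bool:
    return passes_alphabet(chunk, MASK) and passes_bloom(chunk, BLOOM)
```

With these keywords, `b"func add()"` passes the alphabet screen (it contains `a` and `d`) but fails the bigram bloom, so it is dropped before any regex runs. As with any bloom filter, a pass can be a false positive; a fail cannot.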
Stage 3 - detector match
For each detector that the prefilter flagged, the FULL regex evaluates.
The regex is what’s in the .toml file - detector.patterns[].regex.
The capture group becomes the candidate credential.
A detector’s .toml carries:
- `id`, `name`, `service`, `severity`, `keywords`
- one or more `patterns`, each with `regex` + `group` + optional `description`
- optional `companions` (e.g. AWS access key needs the secret key nearby)
- optional `verify` block - HTTP method, URL template, auth scheme, success status
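To make the `regex` + `group` relationship concrete, here is a small Python sketch. The dict stands in for a parsed detector `.toml`; the field values for `name`, `service`, `severity`, and `keywords` are hypothetical, and the regex is the AWS pattern quoted in the walkthrough at the end of this page with its elided alternatives left out.

```python
import re

# Stand-in for a parsed detector file; values other than the regex are assumed.
detector = {
    "id": "aws-secret-access-key",
    "name": "AWS Secret Access Key",   # hypothetical
    "service": "aws",                  # hypothetical
    "severity": "critical",            # hypothetical
    "keywords": ["AWS_SECRET"],        # hypothetical
    "patterns": [
        {
            "regex": r'''(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)''',
            "group": 1,  # which capture group holds the candidate credential
        }
    ],
}

def candidates(det: dict, text: str) -> list:
    """Run every pattern of one detector and return the captured credentials."""
    found = []
    for pattern in det["patterns"]:
        for m in re.finditer(pattern["regex"], text):
            found.append(m.group(pattern["group"]))
    return found

line = "AWS_SECRET_ACCESS_KEY=ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1"
hits = candidates(detector, line)
```

Only the capture group is reported, so the keyword and the `=` separator never end up in the finding.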
Detectors fall into two camps:
- Service-anchored. Regex requires a service-specific keyword (`AWS_SECRET_ACCESS_KEY=`, `stripe.com/v1/`, `dn_` Deepnote prefix). These have HIGH precision: the keyword itself is positive evidence, not just a hint.
- Generic / entropy fallback (`generic-password`, `entropy-api-key`, `entropy-token`). Triggered by entropy + assignment shape only - `password = "..."`, `secret: "..."`, JSON `{ "token": "..." }`. Lower precision; suppression filters do most of the work.
The split matters for the post-process stage.
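A generic detector of the "entropy + assignment shape" kind might be approximated as below. Everything here is an assumption for illustration: the assignment regex, the key names it accepts, and the 3.0 bits-per-character threshold are not taken from KeyHog. Shannon entropy itself is the standard definition.

```python
import math
import re
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# Assumed assignment shapes: key = "value", key: "value", "key": "value".
ASSIGNMENT = re.compile(
    r"""(?i)["']?(?:password|secret|token)["']?\s*[=:]\s*["']([^"']+)["']"""
)
MIN_ENTROPY = 3.0  # hypothetical threshold

def generic_candidates(text: str) -> list:
    """Return quoted values in assignment shape whose entropy clears the bar."""
    return [
        m.group(1)
        for m in ASSIGNMENT.finditer(text)
        if shannon_entropy(m.group(1)) >= MIN_ENTROPY
    ]
```

A value such as `"aaaaaaaaaaaaaaaa"` has zero entropy and is dropped, while a 16-character value with no repeated characters scores 4.0 bits and is kept. Nothing here proves the kept value is a secret, which is why these detectors rely on the suppression filters.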
Stage 4 - post-process
Even a regex match isn’t always a credential. Stage 4 filters:
- Known example fixtures (Stripe docs key, AWS docs key, RFC 7519 specimen JWT).
- Placeholder language - credentials containing `YOUR_`, `INSERT`, `EXAMPLE`, `PLACEHOLDER`, `TODO`, `FIXME`, etc.
- Shape gates.
  - Universal: `punctuation_decorated_identifier` - credentials starting with `--`, `&`, `@`, `!`, `/`, `$` (CLI flags, pointers, SQL vars, shell vars, GraphQL refs) or ending in `:` / `!` (UI labels, TypeScript non-null assertions).
  - Generic / entropy only: `pure_identifier`, `word_separated_identifier`, `scheme_prefixed_uri`, `url_or_path_segment`, `contains_uuid_v4_substring`. These shapes CAN be real credentials when paired with a service anchor (PowerBI client_id is a UUID, mongodb-atlas is a URI), so we only apply them to anchorless detectors.
- Path-based suppressions - vendored bundles (`node_modules/`, `wp-includes/`, `bower_components/`), CI workflow files (where `${{ secrets.NAME }}` references are syntactic, not credentials), i18n translation files, secret-scanner source files (the file IS a scanner; its regex literals shouldn’t fire on itself).
- Cross-chunk reassembly. A secret split across window boundaries gets reassembled from the tail of chunk N + the head of chunk N+1.
A finding that survives stage 4 makes it to output.
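The split between universal and generic-only gates can be illustrated with a short Python sketch. The gate names come from the list above, but the logic is a guess at their behavior: the placeholder check is assumed to be case-insensitive, `pure_identifier` is approximated as "letters and underscores only", and `suppression_reason` is a hypothetical helper.

```python
import re
from typing import Optional

PLACEHOLDER_MARKERS = ("YOUR_", "INSERT", "EXAMPLE", "PLACEHOLDER", "TODO", "FIXME")
DECORATION_PREFIXES = ("--", "&", "@", "!", "/", "$")
DECORATION_SUFFIXES = (":", "!")

def suppression_reason(credential: str, anchored: bool) -> Optional[str]:
    """Return the name of the first gate that rejects the candidate, else None."""
    upper = credential.upper()
    if any(marker in upper for marker in PLACEHOLDER_MARKERS):
        return "placeholder"
    # Universal gate: applies to every detector.
    if credential.startswith(DECORATION_PREFIXES) or credential.endswith(DECORATION_SUFFIXES):
        return "punctuation_decorated_identifier"
    # Generic-only gate: skipped when a service anchor vouches for the match.
    if not anchored and re.fullmatch(r"[A-Za-z_]+", credential):
        return "pure_identifier"
    return None
```

Under these assumptions `getParameter` is suppressed for an anchorless detector but would pass for a service-anchored one, while `--api-secret` and `Password:` are rejected regardless of which detector produced them.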
Where the speed comes from
| Stage | Throughput on a modern laptop |
|---|---|
| Chunker | ~5 GB/s (mmap + magic-byte sniff) |
| Alphabet screen | ~12 GB/s (256-bit table lookup, vectorized) |
| Bigram bloom | ~8 GB/s (4096-bit table, vectorized) |
| Hyperscan SIMD | ~3 GB/s (multi-pattern regex) |
| Per-detector regex | ~150 MB/s × detectors flagged |
| Post-process | ~200 MB/s |
The end-to-end number on the dogfood corpus is ~800 MB/s sustained. Hardware acceleration (AVX-512, CUDA) raises the SIMD-prefilter ceiling substantially on big inputs; small inputs (< 100 KB) bottleneck on the chunker and post-process, not the regex.
Where the precision comes from
| Filter | What it catches |
|---|---|
| Known example fixtures | Stripe docs key, AWS docs key, RFC 7519 JWT |
| `pure_identifier` | getParameter, Benutzername, auth_decoders |
| `word_separated_identifier` | s3_secret_access_key (function name) |
| `scheme_prefixed_uri` | urn:foo:bar (URI literal, not creds) |
| `url_or_path_segment` | /api/v1/users/123 (REST path) |
| `contains_uuid_v4_substring` | TOKEN_LIST=636765a9-… (UUID identifier) |
| `punctuation_decorated_identifier` | --api-secret, &password, Password: |
| Vendored-minified-path | node_modules/jquery-3.6.0.min.js |
| CI workflow path | .github/workflows/ci.yml - ${{ secrets.X }} |
| i18n translation path | locale/de.po - translated password word |
Each filter targets a known cluster of false positives (FPs) that it was built to defuse. The Suppressions page enumerates them with examples.
What this looks like for one finding
file.env contains: AWS_SECRET_ACCESS_KEY=ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1
stage 1 - chunker: emit chunk{ path: "file.env", data: "AWS_SECRET..." }
stage 2 - alphabet: PASS (chunk has `=`, alphanumerics from the corpus)
stage 2 - bigram bloom: PASS (`AW`, `WS`, `_S` are in the bloom)
stage 2 - Hyperscan: MATCH → triggers `aws-secret-access-key` + `generic-password`
stage 3 - regex eval:
`aws-secret-access-key` regex `(?i)(?:AWS[_-]?SECRET[_-]?ACCESS[_-]?KEY|...)[=:\s"']+([0-9a-zA-Z/+=]{40})(?:[^0-9a-zA-Z/+=]|$)`
captures `ev0BsFtSD7S/4VWYObxiEhME3hJBXeYzR43jgiB1`
`generic-password` regex doesn't match (no `_password`/`_pwd` substring)
stage 4 - post-process:
known-example check: no
`looks_like_pure_identifier`: false (has digits + /)
`looks_like_punctuation_decorated_identifier`: false
→ EMIT
That’s one finding’s life. Multiply by 10⁶ files and it becomes clear why the throughput of each stage matters.