
Add optional LLM tokenizer-based false positive filtering with per-detector opt-out support#4935

Draft
MuneebUllahKhan222 wants to merge 2 commits into trufflesecurity:main from MuneebUllahKhan222:tokenizer-fp

Conversation

Contributor

MuneebUllahKhan222 commented Apr 29, 2026

Description

This PR introduces LLM tokenizer-based filtering as an optional mechanism to reduce false positives during scans.

A new CLI flag --filter-tokenize enables this feature, allowing users to run scans with tokenizer-based heuristics applied to unverified results.


What’s Changed

1. CLI Flag for Tokenizer Filtering

  • Added a new flag: --filter-tokenize
  • When enabled, unverified results are passed through an LLM tokenizer-based filter to reduce false positives.

2. Tokenizer-Based Heuristic

Uses an LLM tokenizer (e.g., cl100k_base) to compute a token-to-character ratio.

How it works

An LLM tokenizer breaks a string into smaller units called tokens.
These tokens are not just words—they can be:

  • whole words ("connection")
  • subwords ("connect", "ion")
  • or even small fragments ("x7", "Ab")

The tokenizer is trained on large amounts of natural language and code, so it tries to split text into meaningful or commonly seen patterns.


Step-by-step

  1. Take the candidate string (e.g., a suspected secret)
  2. Pass it through the tokenizer → get a list of tokens
  3. Compute:
token_to_char_ratio = number_of_tokens / number_of_characters

Intuition

  • Natural language / structured text

    • Tokenizer recognizes patterns and merges efficiently
    • Fewer tokens → lower ratio
  • Random or high-entropy strings

    • Tokenizer cannot find meaningful patterns
    • Splits into many small pieces → higher ratio

Threshold

token_to_char_ratio > 0.39 → valid (kept)
token_to_char_ratio ≤ 0.39 → filtered out as false positive
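The ratio and threshold above can be sketched in Go. This is a minimal illustration: the token counts passed in are hypothetical stand-ins, since a real implementation would obtain them from an LLM tokenizer (e.g. tiktoken-go's cl100k_base encoder).

```go
package main

import "fmt"

// tokenCharRatio computes number_of_tokens / number_of_characters.
func tokenCharRatio(numTokens, numChars int) float64 {
	if numChars == 0 {
		return 0
	}
	return float64(numTokens) / float64(numChars)
}

// keepAsCandidate applies the 0.39 threshold from the PR: higher
// ratios look random enough to keep as potential secrets.
func keepAsCandidate(ratio float64) bool {
	return ratio > 0.39
}

func main() {
	// Hypothetical token counts for the two example strings.
	structured := tokenCharRatio(3, 14) // "getUserAccount"
	random := tokenCharRatio(9, 16)     // "QWbToc7xu15O5oDf"
	fmt.Println(keepAsCandidate(structured)) // false: filtered as likely FP
	fmt.Println(keepAsCandidate(random))     // true: kept as likely secret
}
```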

Example (conceptual)

| String | Tokens | Length | Ratio | Interpretation |
| --- | --- | --- | --- | --- |
| getUserAccount | few | medium | low | structured code (likely FP) |
| QWbToc7xu15O5oDf | many | medium | high | random-looking (likely secret) |

What this helps filter out

  • CamelCase identifiers
  • Code fragments
  • Structured non-secret strings

Important Note

This is a heuristic, not a guarantee:

  • Tokenizer measures linguistic structure, not true randomness
  • Used in combination with other signals (e.g., regex, verification)

3. Per-Detector Opt-Out Mechanism

  • Introduced interface:
type TokenizerFalsePositiveChecker interface {
    IsTokenizerFpDisabled() bool
}
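As an illustration, a detector opts out by implementing the interface above. The `Scanner` struct below is a simplified, hypothetical stand-in for a real detector such as Postgres:

```go
package main

import "fmt"

// TokenizerFalsePositiveChecker is the opt-out interface from the PR.
type TokenizerFalsePositiveChecker interface {
	IsTokenizerFpDisabled() bool
}

// Scanner is a simplified stand-in for a detector whose credentials
// are often human-readable (e.g. Postgres usernames and passwords).
type Scanner struct{}

// IsTokenizerFpDisabled opts this detector out of tokenizer filtering.
func (Scanner) IsTokenizerFpDisabled() bool { return true }

func main() {
	// The engine can type-assert any detector against the interface;
	// detectors that don't implement it keep the default behavior.
	var detector any = Scanner{}
	checker, ok := detector.(TokenizerFalsePositiveChecker)
	fmt.Println(ok && checker.IsTokenizerFpDisabled()) // true: opted out
}
```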

4. Initial Detector Support

Implemented opt-out for:

  • Postgres
  • MongoDB

Reason:

  • These detectors often contain human-readable usernames and passwords
  • Tokenizer-based filtering may incorrectly classify such values as false positives

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

Note

Medium Risk
Changes result filtering behavior and adds a new heuristic dependency, which could hide legitimate unverified findings when enabled or impact scan performance.

Overview
Introduces an optional tokenizer-based false-positive filter for unverified findings, enabled via the new CLI flag --filter-tokenize and wired through engine.Config/Engine to suppress results flagged by this heuristic.

Adds tokenizer FP plumbing (ResultWithMetadata.IsTokenizerFalsePositive, TokenizerFalsePositiveChecker opt-out) and implements the heuristic using tiktoken-go (token-to-character ratio) with initial opt-outs for the MongoDB and Postgres detectors. Also extends FalsePositiveInfo protobuf with low_token_char_ratio, and logs a new total_results_found metric derived from Engine.NumFoundResults().

Reviewed by Cursor Bugbot for commit 54a5243. Bugbot is set up for automated code reviews on this repo. Configure here.

@MuneebUllahKhan222 MuneebUllahKhan222 requested a review from a team April 29, 2026 18:26
@MuneebUllahKhan222 MuneebUllahKhan222 requested review from a team as code owners April 29, 2026 18:26
} else {
ctx.Logger().Info("Filtered out result with low token-to-character ratio", "detector", res.DetectorType.String(), "ratio", ratio, "result", string(res.Raw))
return false
}

Inverted boolean logic filters valid secrets, keeps false positives

High Severity

IsTokenizerFalsePositive returns true when ratio > 0.39 and false when ratio ≤ 0.39, but according to the PR description, ratio > 0.39 means the result is valid (should be kept) and ratio ≤ 0.39 means it's a false positive (should be filtered). In notifierWorker, IsTokenizerFalsePositive == true causes the result to be skipped. This means valid secrets get discarded and actual false positives are retained — the exact opposite of the intended behavior.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit af6a7ba. Configure here.
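If the semantics from the PR description are taken as intended, the predicate should flag low-ratio strings as false positives. A corrected sketch (the function name and signature here are hypothetical simplifications of the PR's code):

```go
package main

import "fmt"

// isTokenizerFalsePositive returns true for strings whose
// token-to-character ratio is at or below 0.39, i.e. structured
// text that should be filtered out per the PR description.
func isTokenizerFalsePositive(numTokens, numChars int) bool {
	if numChars == 0 {
		return true // empty candidates carry no signal; treat as FP
	}
	return float64(numTokens)/float64(numChars) <= 0.39
}

func main() {
	fmt.Println(isTokenizerFalsePositive(3, 14)) // true: structured, filter out
	fmt.Println(isTokenizerFalsePositive(9, 16)) // false: random-looking, keep
}
```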

Comment thread pkg/engine/engine.go Outdated
Comment thread pkg/detectors/falsepositives.go
Comment thread pkg/handlers/archive.go Outdated
maxDepth = 5 * 2
maxSize = 2 << 30 // 2 GB
maxTimeout = time.Duration(60) * time.Second
maxTimeout = time.Duration(100) * time.Second

Unrelated archive timeout increase shipped with PR

Low Severity

maxTimeout was changed from 60 to 100 seconds for archive processing. This is unrelated to the tokenizer-based filtering feature described in the PR and changes behavior for all users — archive extraction now waits 67% longer before timing out. This looks like it was accidentally included in the commit.


Reviewed by Cursor Bugbot for commit af6a7ba. Configure here.


cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 54a5243. Configure here.

if ok {
return func(ctx context.Context, res Result) bool {
return checker.IsTokenizerFpDisabled()
}

Opt-out detectors have all results incorrectly filtered

High Severity

GetTokenizerFalsePositiveCheck returns a function that directly returns checker.IsTokenizerFpDisabled() for detectors implementing the opt-out interface. Since Postgres and MongoDB return true from IsTokenizerFpDisabled(), this marks every unverified result from these detectors as a tokenizer false positive, causing all their results to be filtered out — the exact opposite of opting out. The returned function for opted-out detectors needs to return false (not a false positive) to prevent filtering.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 54a5243. Configure here.
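Following the review's suggestion, for opted-out detectors the returned check must always report "not a false positive". A corrected sketch, where `Result` and the surrounding types are simplified stand-ins for the engine's real ones:

```go
package main

import "fmt"

// Result is a simplified stand-in for the engine's result type.
type Result struct{ Raw string }

// TokenizerFalsePositiveChecker is the opt-out interface from the PR.
type TokenizerFalsePositiveChecker interface {
	IsTokenizerFpDisabled() bool
}

type optedOutDetector struct{}

func (optedOutDetector) IsTokenizerFpDisabled() bool { return true }

// getTokenizerFalsePositiveCheck returns the tokenizer heuristic for
// normal detectors, but a constant "not a false positive" check for
// opted-out detectors, so their results are never filtered.
func getTokenizerFalsePositiveCheck(detector any, heuristic func(Result) bool) func(Result) bool {
	if c, ok := detector.(TokenizerFalsePositiveChecker); ok && c.IsTokenizerFpDisabled() {
		return func(Result) bool { return false }
	}
	return heuristic
}

func main() {
	// Even a heuristic that flags everything is bypassed for an
	// opted-out detector.
	flagAll := func(Result) bool { return true }
	check := getTokenizerFalsePositiveCheck(optedOutDetector{}, flagAll)
	fmt.Println(check(Result{Raw: "user:password"})) // false: result kept
}
```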

}

func getTokens(text, encoding string) []int {
tke, err := tiktoken.GetEncoding(encoding)

Contributor Author

MuneebUllahKhan222, Apr 29, 2026


Yeah, that's the intent for it later on.

@mustansir14 mustansir14 marked this pull request as draft April 30, 2026 08:13
@mustansir14 mustansir14 marked this pull request as ready for review April 30, 2026 08:14
@MuneebUllahKhan222 MuneebUllahKhan222 marked this pull request as draft May 4, 2026 06:47

Labels: none yet
Projects: none yet
3 participants