Methodology: Tokenization Bias Measurement Across 8 LLM Models

By Luis Méndez

June 2026

Read the full article: Leeuwwolk Blog · Medium

Objective

Measure the token overhead that Spanish legal text incurs compared to semantically equivalent English text, across 8 large language models using different tokenizer architectures.

Models tested

Model Tokenizer type Vocab size Test method Baseline
GPT-4 (cl100k_base) BPE ~100K Offline (gpt-tokenizer npm) None needed
GPT-4o (o200k_base) BPE ~200K Offline (gpt-tokenizer npm) None needed
Llama 2 SentencePiece BPE 32K Offline (llama-tokenizer-js npm) None needed
Qwen 3.5 9B BPE ~150K Offline (@cyberlangke/tokkit-qwen npm) + local llama-server /tokenize None needed (local)
Llama 3.3 70B SentencePiece BPE 128K Groq API (usage.prompt_tokens) 36 tokens
Qwen3 32B BPE ~150K Groq API (usage.prompt_tokens) 9 tokens
GPT-OSS 120B BPE (custom) ~200K Groq API (usage.prompt_tokens) 72 tokens
Claude Sonnet 4 BPE (proprietary) Unknown Anthropic API (usage.input_tokens) 4,116 tokens

Test texts

Three pairs of semantically equivalent legal clauses were used. The Spanish versions are not literal translations — they use the standard phrasing a Mexican lawyer would use for the same legal concept.

NDA clause

English (300 chars): "The Receiving Party agrees to hold in strict confidence all Confidential Information disclosed by the Disclosing Party and shall not disclose such information to any third party without prior written consent. This obligation shall survive the termination of this agreement for a period of five years."

Spanish (315 chars): "La Parte Receptora se obliga a mantener en estricta confidencialidad toda la Información Confidencial revelada por la Parte Divulgante y no divulgará dicha información a terceros sin el consentimiento previo por escrito. Esta obligación sobrevivirá la terminación del presente contrato por un período de cinco años."

Incorporation clause

English (361 chars): Standard Delaware corporation formation clause covering purpose, authorized shares, and board composition.

Spanish (328 chars): Equivalent clause under Mexico's Ley General de Sociedades Mercantiles.

Tax/Fiscal clause

English (243 chars): Standard IRS filing requirement with employer identification number reference.

Spanish (317 chars): Equivalent SAT filing requirement with RFC and CFDI references.

Individual terms (20 pairs)

Standard legal/business terminology pairs: contract/contrato, shareholders/accionistas, articles of incorporation/acta constitutiva, tax identification number/registro federal de contribuyentes, etc.

Mexican-specific terms (8 terms, no English equivalent)

fideicomiso, escritura pública, acta constitutiva, protocolo notarial, Sociedad Anónima Promotora de Inversión de Capital Variable, Registro Federal de Contribuyentes, comprobantes fiscales digitales por Internet, asamblea general ordinaria de accionistas.

Baseline correction methodology

The problem

When measuring tokens via API (Groq, Anthropic), the reported token count includes system prompt overhead — tokens consumed by the model's internal system prompt and chat template formatting. This overhead is constant regardless of the input text language, so it dilutes the apparent overhead percentage.

The dilution is asymmetric across models because system prompt sizes vary dramatically:

Model System prompt baseline Dilution effect on NDA overhead
Qwen3 32B (Groq) 9 tokens Minimal dilution
Llama 3.3 70B (Groq) 36 tokens Moderate dilution
GPT-OSS 120B (Groq) 72 tokens Significant dilution
Claude Sonnet 4 4,116 tokens Extreme dilution

Example: Claude's NDA overhead appears as +0.6% with baseline included (4195/4171), but the actual text overhead is +43.6% (79/55) after subtracting the 4,116-token baseline.

Correction procedure

  1. Send the word "test" to the model API and record usage.prompt_tokens (or usage.input_tokens for Anthropic). This is the baseline: system prompt + chat template + 1 token for "test".
  2. Send each test text and record prompt tokens.
  3. Subtract baseline from each measurement to isolate the actual text token count.
  4. Compute overhead as: (ES_text_tokens / EN_text_tokens - 1) × 100

When correction is not needed

Offline tokenizer libraries (gpt-tokenizer, llama-tokenizer-js, tokkit-qwen) return raw token counts with no system prompt overhead. The local llama-server /tokenize endpoint also returns raw token arrays. These measurements need no correction.

Tools and reproduction

Offline tokenizers (Node.js)

npm install gpt-tokenizer llama-tokenizer-js @cyberlangke/tokkit-qwen

Local model (llama-server)

# Requires llama-server running with your model
python3 test_local_tokenizer.py --host YOUR_HOST --port YOUR_PORT

Groq API

GROQ_API_KEY=your-key python3 test_groq_tokenizer.py --model MODEL_NAME

Anthropic API

ANTHROPIC_API_KEY=your-key python3 test_claude_tokenizer.py

All scripts are open source. Each script outputs raw token counts and computed overhead percentages.

Known limitations

  1. Groq baseline subtraction assumes constant overhead. The system prompt overhead should be constant regardless of user message length, but there could be minor variations from chat template tokens that scale with message structure (e.g., role tags). We verified this by testing multiple message lengths and confirming the baseline remains stable within ±1 token.

  2. Claude's large baseline reduces measurement precision. With a 4,116-token baseline, the signal-to-noise ratio for short texts is lower. The NDA clause measurements (55 EN / 79 ES tokens after correction) are reliable, but individual word-level measurements have higher relative uncertainty.

  3. Groq model versions may differ. Groq may serve quantized or modified versions of models. The tokenizer should be identical to the base model, but we cannot independently verify this without access to Groq's deployment configuration.

  4. Semantic equivalence, not literal translation. The test texts were written to express the same legal concepts using natural phrasing in each language. This means the Spanish texts sometimes contain more words (e.g., "registro federal de contribuyentes" vs "tax identification number"), which accounts for some of the overhead. However, this reflects real-world document processing — legal documents in Spanish ARE longer for the same concepts, and that additional length IS tokenized and billed.

  5. Limited to legal/business domain. The overhead percentages are specific to legal and fiscal vocabulary. Casual Spanish text would show lower overhead because common words (de, el, la, que) tokenize efficiently. The bias is most pronounced in specialized terminology that rarely appears in English-dominated training corpora.

  6. Single tokenizer version per model. Tokenizer updates between model versions (e.g., Qwen3 vs Qwen 3.5) can change results significantly. The results reflect the tokenizer state as of June 2026.

Summary of corrected results

Model NDA EN tokens NDA ES tokens Overhead
Qwen 3.5 9B (local) 50 61 +22.0%
GPT-4 / GPT-4o (offline) 49 61 +24.5%
Llama 2 32K (offline) 59 79 +33.9%
GPT-OSS 120B (Groq, corrected) 32 43 +34.4%
Claude Sonnet 4 (API, corrected) 55 79 +43.6%
Llama 3.3 70B (Groq, corrected) 33 55 +66.7%
Qwen3 32B (Groq, corrected) 33 55 +66.7%

By Luis Méndez · June 2026

This methodology document accompanies the blog post "Your AI charges you up to 67% more for not speaking English" published on Leeuwwolk and Medium. Contact: [email protected]

Reproduction scripts: test_local_tokenizer.py · test_groq_tokenizer.py · test_claude_tokenizer.py · test_offline_tokenizers.mjs

Complete test clauses

All clauses used in this study are reproduced below for full reproducibility. Spanish versions use standard Mexican legal phrasing, not literal translations.

Clause 1: NDA / Confidentiality

English (300 chars, 49-55 tokens depending on model):

The Receiving Party agrees to hold in strict confidence all Confidential Information disclosed by the Disclosing Party and shall not disclose such information to any third party without prior written consent. This obligation shall survive the termination of this agreement for a period of five years.

Spanish (315 chars, 61-79 tokens depending on model):

La Parte Receptora se obliga a mantener en estricta confidencialidad toda la Información Confidencial revelada por la Parte Divulgante y no divulgará dicha información a terceros sin el consentimiento previo por escrito. Esta obligación sobrevivirá la terminación del presente contrato por un período de cinco años.

Observation: The Spanish version is only 5% longer in characters but consumes 22-44% more tokens. The overhead comes from words like "confidencialidad" (3 tokens), "consentimiento" (3 tokens), and "sobrevivirá" (3 tokens), which have no single-token equivalent.


Clause 2: Incorporation / Corporate formation

English (361 chars, 66 tokens on GPT-4):

This Corporation is organized for the purpose of engaging in any lawful act or activity for which corporations may be organized under the General Corporation Law of the State of Delaware. The total number of shares of stock which the corporation shall have authority to issue is ten million shares of common stock, each having a par value of one cent per share.

Spanish (328 chars, 64 tokens on GPT-4):

La Sociedad tiene por objeto la realización de cualquier acto o actividad lícita para la cual las sociedades pueden organizarse conforme a la Ley General de Sociedades Mercantiles del Estado Mexicano. El capital social autorizado será de diez millones de acciones ordinarias, cada una con valor nominal de un centavo por acción.

Observation: This is the one case where Spanish is slightly more efficient (-3% on GPT-4). The Spanish text is 33 characters shorter and uses more common Romance-language words that tokenize well. This demonstrates that the bias is not universal — it depends on vocabulary specificity.


Clause 3: Tax / Fiscal

English (243 chars, 38-40 tokens depending on model):

The taxpayer shall file annual returns with the Internal Revenue Service for each taxable year. All digital tax receipts must comply with current regulations and include the employer identification number assigned by the federal tax authority.

Spanish (317 chars, 52-78 tokens depending on model):

El contribuyente deberá presentar declaraciones anuales ante el Servicio de Administración Tributaria por cada ejercicio fiscal. Todos los comprobantes fiscales digitales por Internet deberán cumplir con la normatividad vigente e incluir el Registro Federal de Contribuyentes asignado por la autoridad fiscal federal.

Observation: This clause shows the highest overhead across all models (+36.8% to +95.0%). The Spanish text is 30% longer in characters, but the token overhead far exceeds the character difference. Key penalty terms: "Administración Tributaria" (4 tokens), "comprobantes fiscales digitales" (5 tokens), "Contribuyentes" (3 tokens). These are SAT/CFDI-specific terms that no tokenizer has seen enough to compress.


Clause 4: SAPI de CV formation (Spanish only, no English equivalent)

Spanish (337 chars, 78-109 tokens depending on model):

PRIMERA.- DENOMINACIÓN, TIPO Y DURACIÓN. La Sociedad se denominará Ejemplo Tecnológico, Sociedad Anónima Promotora de Inversión de Capital Variable. La Sociedad tendrá una duración indefinida y se regirá por lo dispuesto en la Ley General de Sociedades Mercantiles, la Ley del Mercado de Valores y demás disposiciones legales aplicables.

Observation: This clause has no direct English equivalent because the SAPI de CV corporate structure is specific to Mexican law. "Sociedad Anónima Promotora de Inversión de Capital Variable" alone consumes 15-17 tokens across all models. This full legal name appears in every notarial document, every contract, and every regulatory filing for thousands of Mexican companies.


Individual term pairs (20 pairs)

English Tokens (GPT-4) Spanish Tokens (GPT-4) Δ
contract 1 contrato 2 +1
agreement 1 acuerdo 2 +1
corporation 2 corporación 2 0
shareholders 2 accionistas 2 0
liability 2 responsabilidad 2 0
amendment 3 modificación 2 -1
compliance 2 cumplimiento 3 +1
notarization 3 protocolización 2 -1
incorporation 3 constitución 2 -1
bylaws 2 estatutos 2 0
board of directors 3 consejo de administración 5 +2
power of attorney 3 poder notarial 4 +1
due diligence 2 debida diligencia 4 +2
articles of incorporation 3 acta constitutiva 5 +2
tax identification number 3 registro federal de contribuyentes 6 +3
artificial intelligence 3 inteligencia artificial 4 +1
machine learning 2 aprendizaje automático 4 +2
data sovereignty 2 soberanía de datos 5 +3
retrieval augmented generation 4 generación aumentada por recuperación 6 +2

Aggregate: 41 EN tokens vs 58 ES tokens = +41.5% overhead at the vocabulary level.

Note: Some Spanish terms tokenize MORE efficiently than English (protocolización, constitución, modificación). The bias is not absolute — it depends on whether the specific word appeared frequently enough in the BPE training corpus to be merged into a compact token.


Mexican-specific terms (no English equivalent)

Term Claude tokens Qwen 3.5 tokens Description
fideicomiso 4 4 Trust (financial instrument)
escritura pública 4 4 Public deed (notarial document)
acta constitutiva 5 5 Articles of incorporation
protocolo notarial 5 4 Notarial protocol/record
Registro Federal de Contribuyentes 8 6 Federal taxpayer registry (RFC)
comprobantes fiscales digitales por Internet 9 7 Digital tax receipts (CFDI)
asamblea general ordinaria de accionistas 10 8 Ordinary shareholders' meeting
Sociedad Anónima Promotora de Inversión de Capital Variable 17 15 Variable capital investment promotion corp (SAPI de CV)

Key insight: These terms appear in virtually every Mexican corporate document. A single acta constitutiva can contain "Sociedad Anónima Promotora de Inversión de Capital Variable" dozens of times, consuming 15-17 tokens each occurrence. Over a corpus of thousands of documents, this compounds into significant cost when using token-priced APIs.