By Luis Méndez
June 2026
Read the full article: Leeuwwolk Blog · Medium
Measure the token overhead that Spanish legal text incurs compared to semantically equivalent English text, across 8 large language models using different tokenizer architectures.
| Model | Tokenizer type | Vocab size | Test method | Baseline |
|---|---|---|---|---|
| GPT-4 (cl100k_base) | BPE | ~100K | Offline (gpt-tokenizer npm) | None needed |
| GPT-4o (o200k_base) | BPE | ~200K | Offline (gpt-tokenizer npm) | None needed |
| Llama 2 | SentencePiece BPE | 32K | Offline (llama-tokenizer-js npm) | None needed |
| Qwen 3.5 9B | BPE | ~150K | Offline (@cyberlangke/tokkit-qwen npm) + local llama-server /tokenize | None needed (local) |
| Llama 3.3 70B | SentencePiece BPE | 128K | Groq API (usage.prompt_tokens) | 36 tokens |
| Qwen3 32B | BPE | ~150K | Groq API (usage.prompt_tokens) | 9 tokens |
| GPT-OSS 120B | BPE (custom) | ~200K | Groq API (usage.prompt_tokens) | 72 tokens |
| Claude Sonnet 4 | BPE (proprietary) | Unknown | Anthropic API (usage.input_tokens) | 4,116 tokens |
Three pairs of semantically equivalent legal clauses were used. The Spanish versions are not literal translations — they use the standard phrasing a Mexican lawyer would use for the same legal concept.
English (300 chars): "The Receiving Party agrees to hold in strict confidence all Confidential Information disclosed by the Disclosing Party and shall not disclose such information to any third party without prior written consent. This obligation shall survive the termination of this agreement for a period of five years."
Spanish (315 chars): "La Parte Receptora se obliga a mantener en estricta confidencialidad toda la Información Confidencial revelada por la Parte Divulgante y no divulgará dicha información a terceros sin el consentimiento previo por escrito. Esta obligación sobrevivirá la terminación del presente contrato por un período de cinco años."
English (361 chars): Standard Delaware corporation formation clause covering purpose, authorized shares, and board composition.
Spanish (328 chars): Equivalent clause under Mexico's Ley General de Sociedades Mercantiles.
English (243 chars): Standard IRS filing requirement with employer identification number reference.
Spanish (317 chars): Equivalent SAT filing requirement with RFC and CFDI references.
Standard legal/business terminology pairs: contract/contrato, shareholders/accionistas, articles of incorporation/acta constitutiva, tax identification number/registro federal de contribuyentes, etc.
fideicomiso, escritura pública, acta constitutiva, protocolo notarial, Sociedad Anónima Promotora de Inversión de Capital Variable, Registro Federal de Contribuyentes, comprobantes fiscales digitales por Internet, asamblea general ordinaria de accionistas.
When measuring tokens via API (Groq, Anthropic), the reported token count includes system prompt overhead — tokens consumed by the model's internal system prompt and chat template formatting. This overhead is constant regardless of the input text language, so it dilutes the apparent overhead percentage.
The dilution is asymmetric across models because system prompt sizes vary dramatically:
| Model | System prompt baseline | Dilution effect on NDA overhead |
|---|---|---|
| Qwen3 32B (Groq) | 9 tokens | Minimal dilution |
| Llama 3.3 70B (Groq) | 36 tokens | Moderate dilution |
| GPT-OSS 120B (Groq) | 72 tokens | Significant dilution |
| Claude Sonnet 4 | 4,116 tokens | Extreme dilution |
Example: Claude's NDA overhead appears as +0.6% with baseline included (4195/4171), but the actual text overhead is +43.6% (79/55) after subtracting the 4,116-token baseline.
usage.prompt_tokens (or usage.input_tokens for Anthropic). This is the baseline: system prompt + chat template + 1 token for "test".(ES_text_tokens / EN_text_tokens - 1) × 100Offline tokenizer libraries (gpt-tokenizer, llama-tokenizer-js, tokkit-qwen) return raw token counts with no system prompt overhead. The local llama-server /tokenize endpoint also returns raw token arrays. These measurements need no correction.
npm install gpt-tokenizer llama-tokenizer-js @cyberlangke/tokkit-qwen
# Requires llama-server running with your model
python3 test_local_tokenizer.py --host YOUR_HOST --port YOUR_PORT
GROQ_API_KEY=your-key python3 test_groq_tokenizer.py --model MODEL_NAME
ANTHROPIC_API_KEY=your-key python3 test_claude_tokenizer.py
All scripts are open source. Each script outputs raw token counts and computed overhead percentages.
Groq baseline subtraction assumes constant overhead. The system prompt overhead should be constant regardless of user message length, but there could be minor variations from chat template tokens that scale with message structure (e.g., role tags). We verified this by testing multiple message lengths and confirming the baseline remains stable within ±1 token.
Claude's large baseline reduces measurement precision. With a 4,116-token baseline, the signal-to-noise ratio for short texts is lower. The NDA clause measurements (55 EN / 79 ES tokens after correction) are reliable, but individual word-level measurements have higher relative uncertainty.
Groq model versions may differ. Groq may serve quantized or modified versions of models. The tokenizer should be identical to the base model, but we cannot independently verify this without access to Groq's deployment configuration.
Semantic equivalence, not literal translation. The test texts were written to express the same legal concepts using natural phrasing in each language. This means the Spanish texts sometimes contain more words (e.g., "registro federal de contribuyentes" vs "tax identification number"), which accounts for some of the overhead. However, this reflects real-world document processing — legal documents in Spanish ARE longer for the same concepts, and that additional length IS tokenized and billed.
Limited to legal/business domain. The overhead percentages are specific to legal and fiscal vocabulary. Casual Spanish text would show lower overhead because common words (de, el, la, que) tokenize efficiently. The bias is most pronounced in specialized terminology that rarely appears in English-dominated training corpora.
Single tokenizer version per model. Tokenizer updates between model versions (e.g., Qwen3 vs Qwen 3.5) can change results significantly. The results reflect the tokenizer state as of June 2026.
| Model | NDA EN tokens | NDA ES tokens | Overhead |
|---|---|---|---|
| Qwen 3.5 9B (local) | 50 | 61 | +22.0% |
| GPT-4 / GPT-4o (offline) | 49 | 61 | +24.5% |
| Llama 2 32K (offline) | 59 | 79 | +33.9% |
| GPT-OSS 120B (Groq, corrected) | 32 | 43 | +34.4% |
| Claude Sonnet 4 (API, corrected) | 55 | 79 | +43.6% |
| Llama 3.3 70B (Groq, corrected) | 33 | 55 | +66.7% |
| Qwen3 32B (Groq, corrected) | 33 | 55 | +66.7% |
By Luis Méndez · June 2026
This methodology document accompanies the blog post "Your AI charges you up to 67% more for not speaking English" published on Leeuwwolk and Medium. Contact: [email protected]
Reproduction scripts: test_local_tokenizer.py · test_groq_tokenizer.py · test_claude_tokenizer.py · test_offline_tokenizers.mjs
All clauses used in this study are reproduced below for full reproducibility. Spanish versions use standard Mexican legal phrasing, not literal translations.
English (300 chars, 49-55 tokens depending on model):
The Receiving Party agrees to hold in strict confidence all Confidential Information disclosed by the Disclosing Party and shall not disclose such information to any third party without prior written consent. This obligation shall survive the termination of this agreement for a period of five years.
Spanish (315 chars, 61-79 tokens depending on model):
La Parte Receptora se obliga a mantener en estricta confidencialidad toda la Información Confidencial revelada por la Parte Divulgante y no divulgará dicha información a terceros sin el consentimiento previo por escrito. Esta obligación sobrevivirá la terminación del presente contrato por un período de cinco años.
Observation: The Spanish version is only 5% longer in characters but consumes 22-44% more tokens. The overhead comes from words like "confidencialidad" (3 tokens), "consentimiento" (3 tokens), and "sobrevivirá" (3 tokens), which have no single-token equivalent.
English (361 chars, 66 tokens on GPT-4):
This Corporation is organized for the purpose of engaging in any lawful act or activity for which corporations may be organized under the General Corporation Law of the State of Delaware. The total number of shares of stock which the corporation shall have authority to issue is ten million shares of common stock, each having a par value of one cent per share.
Spanish (328 chars, 64 tokens on GPT-4):
La Sociedad tiene por objeto la realización de cualquier acto o actividad lícita para la cual las sociedades pueden organizarse conforme a la Ley General de Sociedades Mercantiles del Estado Mexicano. El capital social autorizado será de diez millones de acciones ordinarias, cada una con valor nominal de un centavo por acción.
Observation: This is the one case where Spanish is slightly more efficient (-3% on GPT-4). The Spanish text is 33 characters shorter and uses more common Romance-language words that tokenize well. This demonstrates that the bias is not universal — it depends on vocabulary specificity.
English (243 chars, 38-40 tokens depending on model):
The taxpayer shall file annual returns with the Internal Revenue Service for each taxable year. All digital tax receipts must comply with current regulations and include the employer identification number assigned by the federal tax authority.
Spanish (317 chars, 52-78 tokens depending on model):
El contribuyente deberá presentar declaraciones anuales ante el Servicio de Administración Tributaria por cada ejercicio fiscal. Todos los comprobantes fiscales digitales por Internet deberán cumplir con la normatividad vigente e incluir el Registro Federal de Contribuyentes asignado por la autoridad fiscal federal.
Observation: This clause shows the highest overhead across all models (+36.8% to +95.0%). The Spanish text is 30% longer in characters, but the token overhead far exceeds the character difference. Key penalty terms: "Administración Tributaria" (4 tokens), "comprobantes fiscales digitales" (5 tokens), "Contribuyentes" (3 tokens). These are SAT/CFDI-specific terms that no tokenizer has seen enough to compress.
Spanish (337 chars, 78-109 tokens depending on model):
PRIMERA.- DENOMINACIÓN, TIPO Y DURACIÓN. La Sociedad se denominará Ejemplo Tecnológico, Sociedad Anónima Promotora de Inversión de Capital Variable. La Sociedad tendrá una duración indefinida y se regirá por lo dispuesto en la Ley General de Sociedades Mercantiles, la Ley del Mercado de Valores y demás disposiciones legales aplicables.
Observation: This clause has no direct English equivalent because the SAPI de CV corporate structure is specific to Mexican law. "Sociedad Anónima Promotora de Inversión de Capital Variable" alone consumes 15-17 tokens across all models. This full legal name appears in every notarial document, every contract, and every regulatory filing for thousands of Mexican companies.
| English | Tokens (GPT-4) | Spanish | Tokens (GPT-4) | Δ |
|---|---|---|---|---|
| contract | 1 | contrato | 2 | +1 |
| agreement | 1 | acuerdo | 2 | +1 |
| corporation | 2 | corporación | 2 | 0 |
| shareholders | 2 | accionistas | 2 | 0 |
| liability | 2 | responsabilidad | 2 | 0 |
| amendment | 3 | modificación | 2 | -1 |
| compliance | 2 | cumplimiento | 3 | +1 |
| notarization | 3 | protocolización | 2 | -1 |
| incorporation | 3 | constitución | 2 | -1 |
| bylaws | 2 | estatutos | 2 | 0 |
| board of directors | 3 | consejo de administración | 5 | +2 |
| power of attorney | 3 | poder notarial | 4 | +1 |
| due diligence | 2 | debida diligencia | 4 | +2 |
| articles of incorporation | 3 | acta constitutiva | 5 | +2 |
| tax identification number | 3 | registro federal de contribuyentes | 6 | +3 |
| artificial intelligence | 3 | inteligencia artificial | 4 | +1 |
| machine learning | 2 | aprendizaje automático | 4 | +2 |
| data sovereignty | 2 | soberanía de datos | 5 | +3 |
| retrieval augmented generation | 4 | generación aumentada por recuperación | 6 | +2 |
Aggregate: 41 EN tokens vs 58 ES tokens = +41.5% overhead at the vocabulary level.
Note: Some Spanish terms tokenize MORE efficiently than English (protocolización, constitución, modificación). The bias is not absolute — it depends on whether the specific word appeared frequently enough in the BPE training corpus to be merged into a compact token.
| Term | Claude tokens | Qwen 3.5 tokens | Description |
|---|---|---|---|
| fideicomiso | 4 | 4 | Trust (financial instrument) |
| escritura pública | 4 | 4 | Public deed (notarial document) |
| acta constitutiva | 5 | 5 | Articles of incorporation |
| protocolo notarial | 5 | 4 | Notarial protocol/record |
| Registro Federal de Contribuyentes | 8 | 6 | Federal taxpayer registry (RFC) |
| comprobantes fiscales digitales por Internet | 9 | 7 | Digital tax receipts (CFDI) |
| asamblea general ordinaria de accionistas | 10 | 8 | Ordinary shareholders' meeting |
| Sociedad Anónima Promotora de Inversión de Capital Variable | 17 | 15 | Variable capital investment promotion corp (SAPI de CV) |
Key insight: These terms appear in virtually every Mexican corporate document. A single acta constitutiva can contain "Sociedad Anónima Promotora de Inversión de Capital Variable" dozens of times, consuming 15-17 tokens each occurrence. Over a corpus of thousands of documents, this compounds into significant cost when using token-priced APIs.