Nicolas Dabene
Back to blog
27 June 2026 Nicolas Dabène — Développeur Full Stack & Orchestrateur IA chez Profileo 4 min

Why You Should Almost Never Send a PDF, DOCX, or Image to ChatGPT or Claude

TL;DR

Intelligence artificielle Agents IA API Automatisation LLM & modeles
Why You Should Almost Never Send a PDF, DOCX, or Image to ChatGPT or Claude

TL;DR

Most users try to reduce their AI bill by switching models.

GPT or Claude?

Pro version or API?

Smaller or larger context?

Yet in many cases, the real problem isn't the model.

The problem is what you send to it.

PDF, DOCX, PowerPoint, screenshots, images... all these formats were designed for humans. Not for LLMs.

Result:

  • more tokens consumed;
  • more context wasted;
  • higher costs;
  • sometimes even less relevant responses.

The best optimization is often not changing the model.

It's preparing documents better before sending them.


🚨 The Hidden Token Tax

When you send a document to an LLM, it doesn't "read" the file like you do.

Before even analyzing the content, it must understand:

  • where the text begins;
  • where the text ends;
  • which elements are headings;
  • which elements are paragraphs;
  • which areas represent a table;
  • which parts are decorative;
  • in what order the elements should be read.

In other words, some of your tokens are used solely to reconstruct the document's structure.

So you're paying for noise.


📄 Why PDFs Are Particularly Inefficient

PDF is probably the world's most popular document format.

But it was designed for printing.

Not for artificial intelligence.

A PDF mainly describes:

  • positions;
  • coordinates;
  • graphic blocks;
  • layout elements.

For a human, this is perfect.

For a model, it often requires preliminary reconstruction work:

  • determining the reading order;
  • reconstructing tables;
  • identifying headings;
  • understanding columns;
  • linking captions to images.

Some of the context is consumed before the business content is even analyzed.


📝 DOCX Files Aren't Much Better

Many think DOCX is naturally suited for LLMs.

That's not really the case.

A DOCX file contains:

  • text;
  • styles;
  • metadata;
  • layout information;
  • XML representation elements.

Even if the cost is generally lower than that of a complex PDF, a significant portion of the data sent doesn't directly contribute to understanding the document's business content.


🖼️ Images Are Often the Worst Case Scenario

The situation becomes even more costly when sending:

  • screenshots;
  • photos;
  • scanned documents;
  • slides exported as images.

Before understanding the content, the model must:

  • detect text;
  • perform OCR;
  • identify important areas;
  • reconstruct the logical structure.

You're asking the model to solve a vision problem before even addressing the business problem.

It's like photographing a sheet of paper and then asking someone to copy it before reading it.


📊 Not All Formats Are Equal

If the goal is to efficiently transmit information to an LLM, some formats are much better suited than others.

Format Useful Signal Structural Noise Token Impact
Markdown Very High Low Excellent
TXT High Very Low Excellent
DOCX Medium Medium Fair
Native PDF Medium High Poor
Scanned PDF Low Very High Very Poor
Image Low Extremely High Catastrophic

This table summarizes a simple principle:

The closer a format is to structured plain text, the more efficient it is for an LLM.


💰 The False Debate: Changing Models

Most discussions revolve around questions like:

Which model consumes the least?

Which provider is the cheapest?

Which subscription offers the most context?

These questions matter.

But they often come too early.

Because before optimizing the engine, you need to optimize the fuel.

Reducing a model's cost by 30% is interesting.

Reducing the tokens sent with each request by 30 to 40% is often much more cost-effective.

And it generally improves response quality.


🚀 A Smarter Approach: Preparing Documents for LLMs

For a few months now, a new category of tools has begun to emerge.

Their goal isn't to create a new model.

Their goal is to better represent documents for existing models.

This is exactly the philosophy behind the DocLang project:

https://github.com/doclang-project/doclang

DocLang is an open-source project supported by the Linux Foundation AI & Data that starts with a simple observation:

  • PDF was designed for printing;
  • DOCX for editing;
  • HTML for display;
  • none of these formats were designed for LLMs.

DocLang's goal is therefore to provide an intermediate representation better suited to artificial intelligence.

Specifically, DocLang preserves:

  • document hierarchy;
  • headings;
  • sections;
  • tables;
  • metadata;
  • logical relationships between elements.

While removing much of the noise related to visual presentation.

The result:

  • more compact documents;
  • fewer tokens consumed;
  • better content understanding;
  • more effective context for AI agents and RAG pipelines.

According to figures published by the project, this approach significantly reduces the volume of tokens needed to represent a document while preserving its semantic structure.


🎯 The Real AI Challenge in 2026

For a long time, we optimized infrastructure.

Then we optimized models.

The next step is to optimize data.

In many AI projects, the biggest waste doesn't come from the model used.

It comes from how documents are prepared before ingestion.

Before searching for the next miracle model:

  • look at your prompts;
  • look at your workflows;
  • look at the files you send.

Because the easiest way to reduce your AI bill is often not to change models.

It's to stop sending it noise.

Nicolas Dabène

Author

Nicolas Dabène

Développeur Full Stack & Orchestrateur IA chez Profileo

Senior PHP/Laravel developer with 12+ years of experience in e-commerce. Specialised in PrestaShop architecture, AI agents and automation.

RSS

Follow this blog

Subscribe to the RSS feed to never miss an article.

LinkedIn

Follow my AI and e-commerce analysis

I share practical notes on AI agents, PrestaShop architecture, MCP and automation for e-commerce teams.

Follow on LinkedIn