Token counts for image processing inside PDF documents

December 17, 2025, 6:52am

(Context) I am building a RAG system where I ingest PDF documents containing both text and images. My goal is to convert these PDFs into markdown and use Gemini to explain/describe the images embedded within the documents.

(Question) I need clarification on how the Gemini API counts input tokens for these PDFs, specifically regarding the images:

1. Tokenization Method: When I send a PDF to the API, are the embedded images converted to a base64 text string and tokenized as characters (which would be huge), or are they processed as native image tokens (visual embeddings)?
2. Quota Limits: I know a single high-res image can exceed 1,000,000 characters when base64-encoded. If the API treated this as text, I would hit the token limit almost immediately. However, the documentation mentions a limit of 3,000 images per prompt. Are image tokens counted separately from the 1M-token text context window?
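
To make the size concern concrete, here is a rough sketch of the base64 arithmetic behind the "1,000,000+ characters" figure. It only measures how long the encoded string would be; it says nothing about how the Gemini API actually tokenizes images. The 750,000-byte payload and the helper name `base64_length` are assumptions chosen for illustration, and random bytes stand in for real image data.

```python
import base64
import os


def base64_length(num_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of num_bytes bytes.

    Base64 maps every 3 input bytes to 4 output characters and pads the
    final group, so the length is 4 * ceil(num_bytes / 3).
    """
    return 4 * ((num_bytes + 2) // 3)


# Hypothetical image payload: 750,000 random bytes standing in for a
# high-res image (an assumption, not a measurement of a real file).
raw = os.urandom(750_000)
encoded = base64.b64encode(raw)

print(len(encoded))              # measured length of the encoded string
print(base64_length(len(raw)))   # length predicted by the formula
```

Under these assumptions, an image of roughly 750 KB already comes out at 1,000,000 base64 characters, which is why character-level tokenization of the encoded string would use up a 1M-token window so quickly.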