AI OCR feature in Zammad 7 vs. Tesseract

chrisl · June 22, 2026, 6:56am

Have you successfully tested the OCR feature with any providers or vision LLM models? Which models did you try can you share thoughts on results?

On zammad v7.1 I connected a local Gemma4 31B and Qwen3.6 35B, both are vision capable, unfortunately the zammad documentation is very sparse Provider — Zammad Admin Documentation documentation
With OCR toggled enabled and enabling all Ticket Summary Services Generation I tested summary of a few emails with photo images attached or inline ie jpg of objects and people and after AI summary generation do not see any mention of photo images. Next I need to try raster images of text documents.

@dominikklein
Does the current enabling “Recognize image text (OCR)” send all file attachments to LLM? Or is zammad restricting file name extension ie only jpg png and restricting to 1 or more file attachments? Including a few additional information in docs can help users properly set expectations and configurations.

Unless I am confused, I was surprised Zammad appears to only support a single Provider configuration, and also prevent enabling more than one Model from a single Provider.

A simple alternative solution that Zammad may support in future ie feature request Support multiple AI LLM providers and/or multiple AI LLM model names are allowing you to configure more than one provider and/or more than one model and allow specifying which particular model to be used for which types of inference processing. A simple approach to keeping low resource footprint would be using focused LLM model for particular task, ie a small purpose-built OCR vision LLM model that excels for recognizing image text, a different medium model for performing trivial rote tasks ( categorization, tagging etc) , a different larger model performing more complex tasks like writing assistance etc.

Another alternative you could evaluate LLM router proxies; you specify a single Provider in Zammad and that provider uses a smart router LLM analyzes your multi modal text and image context and intelligently routes the context to best model and engine for actual final processing. This would allow you to direct and control your resource usage and maybe keep lower resource footprint.