We updated to Zammad 7 and, since we already run Ollama locally, we tried the new AI features. However, the ticket summaries mostly do not work because the models misbehave: there is a request, we always see GPU usage, and we sometimes also see a response, but Zammad cannot use it because the JSON output is not as requested or is sometimes completely empty. I tried the following models:
gpt-oss:120b
llama4:latest
Are there any suggestions on which model works best in this case? Our Ollama context size is 128K.
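Not a model recommendation, but a debugging aid: a common failure mode is that the model does return JSON, just wrapped in a markdown fence or surrounded by prose, so the client rejects it. (Ollama's API also accepts a `format: "json"` option that constrains output to valid JSON, which is worth enabling when testing.) A small, hypothetical helper for inspecting raw responses — my own sketch, not part of Zammad:

```python
import json
import re

def extract_json(raw: str):
    """Try to pull a JSON object out of a raw LLM response.

    Handles three common misbehaviours: markdown code fences,
    leading/trailing prose, and completely empty output.
    Returns the parsed object, or None if nothing parseable is found.
    """
    if not raw or not raw.strip():
        return None  # completely empty response
    text = raw.strip()
    # Strip a ```json ... ``` (or plain ```) fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # First try the whole string as JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, in case of surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None

print(extract_json('```json\n{"summary": "ok"}\n```'))  # {'summary': 'ok'}
```

Running candidate models' raw output through something like this quickly shows whether a model is producing broken JSON or merely badly packaged JSON.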
As for llama4:latest: we did not test this one ourselves, but normally I would expect it to work without problems, as long as it has no general issues producing JSON.
But of course it is always difficult to say; prompts behave somewhat differently across different LLMs.
I tested Qwen3-Coder-30B-A3B-Instruct with my local llama.cpp server. Both Polish and English worked well and were quite fast. Most importantly, all data stays on my device.
Which models were tested? I’m using llama3.2 since it was the pre-filled option when setting up AI in Zammad, but I have enough performance in our host to use something else if there is a better option.
I tested llama4:latest, llama3.2:3b, qwen3.5:122b, and gpt-oss:120b and had the best results with llama3.2:3b. I have seen that there is a fix in the next Zammad version so that reasoning models also work more reliably. I think I will then try gpt-oss:120b, as it gives us the best results for our tech-support purposes compared to llama4 and qwen3.5.
We have now switched to mistral-small3.2:24b and that works really well in terms of results. Performance is OK; summaries need approx. 10-20 seconds, depending on the ticket size. Title rewriting works much better than with the llama models.
And how was the execution time when you used some of the bigger models? For us it did not really work out, because the execution time grew too long.
Yes bigger models need more time. With the same ticket:
llama3.2:3b - 6s
mistral-small3.2:24b - 15s
llama4:latest - 31s
However, the results from mistral-small3.2:24b are much better than what llama3.2:3b produces, especially regarding the title rewriter, where we see huge improvements. This really saves us time now and will in the future, since our reporting depends on the ticket title and we have customers complaining about the titles they originally gave their tickets. So that's a huge win, which definitely saves time. Summaries are not our main use of the tools.
I just switched from llama3.2 to Gemma3:latest and am getting better results on ticket renaming. I have a smallish GPU on the Ollama host, so I only have 8 GB of VRAM to work with, but this seems to be the better of the two in my very limited testing.
I am currently using Ollama with qwen32b-ctx16k, which works okay-ish on a cluster of three Tesla P40s we had available. It is generally too slow… and about 30% of requests fail because of timeouts.
Since I got a bit fed up with the local model, I went ahead and created a proxy for Claude which strips the PII from the request. So it should be much more compliant with GDPR regulations…
Instead of the provider address, you put the proxy address into Zammad's AI config, and that's it. No PII leaves the house. The answer from the LLM is then de-anonymized. This seems to work quite well.
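To illustrate the general idea (my own minimal sketch, not the author's actual proxy): the core is a reversible substitution — detected PII is replaced with placeholders before the request is forwarded, the mapping is kept locally, and the placeholders in the LLM's answer are swapped back. A trivial email-only version in Python:

```python
import re

# Very rough email matcher for illustration only; a real proxy would
# use proper NER / PII detection, not a single regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def anonymize(text: str):
    """Replace each email address with a placeholder and return the
    redacted text plus the mapping needed to undo the substitution."""
    mapping = {}
    def repl(match):
        placeholder = f"<PII_{len(mapping)}>"
        mapping[placeholder] = match.group(0)
        return placeholder
    return EMAIL_RE.sub(repl, text), mapping

def deanonymize(text: str, mapping: dict) -> str:
    """Restore the original PII in the LLM's answer."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

# Outbound: redact before the text leaves the house.
redacted, mapping = anonymize("Please reply to jane.doe@example.com")
print(redacted)  # Please reply to <PII_0>

# Inbound: restore placeholders in the model's answer.
answer = "I contacted <PII_0> as requested."
print(deanonymize(answer, mapping))  # I contacted jane.doe@example.com as requested.
```

In an actual proxy this pair would wrap the HTTP request and response bodies of the provider API, so Zammad only ever sees the de-anonymized result.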
Is anything like that planned for Zammad? I figured it might be good to have it as a proxy service, since it is universally usable and also works with zammad-mcp.
I haven't set up a GitHub repository for this yet, but if people are interested in testing and/or contributing, I might do so.
I think there is currently no plan in this direction for the short term, because you can decide on your own which provider should be used, or whether you really need a local solution.
But maybe at some point it could be added as an opt-in, with some additional functionality (we would need to check whether there is a good way to do this directly in Ruby, or whether an additional service is always needed).