We updated to Zammad 7 and, since we already run Ollama locally, we tried the new AI features. However, the ticket summaries mostly do not work because the models misbehave: there is a request, we always see GPU usage, and we sometimes also see a response, but Zammad cannot use it because the JSON output is not as requested or is sometimes completely empty. I tried the following models:
gpt-oss:120b
llama4:latest
Are there any suggestions on which model works best in this case? Our Ollama context size is 128K.
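Not a model recommendation, but a debugging aid: a common failure mode is that the model does return JSON, just wrapped in a markdown fence or surrounded by prose, so the client rejects it. (Ollama's API also accepts a `format: "json"` option that constrains output to valid JSON, which is worth enabling when testing.) A small, hypothetical helper for inspecting raw responses — my own sketch, not part of Zammad:

```python
import json
import re

def extract_json(raw: str):
    """Try to pull a JSON object out of a raw LLM response.

    Handles three common misbehaviours: markdown code fences,
    leading/trailing prose, and completely empty output.
    Returns the parsed object, or None if nothing parseable is found.
    """
    if not raw or not raw.strip():
        return None  # completely empty response
    text = raw.strip()
    # Strip a ```json ... ``` (or plain ```) fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # First try the whole string as JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, in case of surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None

print(extract_json('```json\n{"summary": "ok"}\n```'))  # {'summary': 'ok'}
```

Running candidate models' raw output through something like this quickly shows whether a model is producing broken JSON or merely badly packaged JSON.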
As for llama4:latest: we did not test this one ourselves, but normally I would expect it to work without problems, as long as it has no general issues producing JSON.
But of course it is always difficult to say; prompts behave somewhat differently across different LLMs.
I tested Qwen3-Coder-30B-A3B-Instruct with my local llama.cpp server. Both Polish and English worked well and were quite fast. Most importantly, all data stays on my device.
Which models were tested? I’m using llama3.2 since it was the pre-filled option when setting up AI in Zammad, but I have enough performance in our host to use something else if there is a better option.
I tested llama4:latest, llama3.2:3b, qwen3.5:122b, and gpt-oss:120b and had the best results with llama3.2:3b. I have seen that there is a fix in the next Zammad version so that reasoning models also work more reliably. I think I will then try gpt-oss:120b, as it gives us the best results for our tech-support purposes compared to llama4 and qwen3.5.
We have now switched to mistral-small3.2:24b and that works really well in terms of results. Performance is OK; summaries need approx. 10-20 seconds, depending on the ticket size. Title rewriting works much better than with the llama models.
And how was the execution time when you used some of the bigger models? For us it did not really work out, because the execution time grew too long.
Yes bigger models need more time. With the same ticket:
llama3.2:3b - 6s
mistral-small3.2:24b - 15s
llama4:latest - 31s
However, the results from mistral-small3.2:24b are much better than what llama3.2:3b produces, especially regarding the title rewriter, where we see huge improvements. This really saves us time now and will in the future, since our reporting depends on the ticket title and we have customers complaining about the titles they originally gave their tickets. So that's a huge win, which definitely saves time. Summaries are not our main use of the tools.
I just switched from llama3.2 to Gemma3:latest and am getting better results on ticket renaming. I have a smallish GPU on the Ollama host, so I only have 8 GB of VRAM to work with, but this seems to be the better of the two in my very limited testing.
I am currently using Ollama with qwen32b-ctx16k, which works okay-ish on a cluster of three Tesla P40s we had available. It is generally too slow… and about 30% of requests fail because of timeouts.
Since I got a bit fed up with the local model, I went ahead and created a proxy for Claude which strips the PII from the request. So it should be much more compliant with GDPR regulations…
Instead of the provider address, you put the proxy address into Zammad's AI config, and that's it. No PII leaves the house. The answer from the LLM is then de-anonymized. This seems to work quite well.
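To illustrate the general idea (my own minimal sketch, not the author's actual proxy): the core is a reversible substitution — detected PII is replaced with placeholders before the request is forwarded, the mapping is kept locally, and the placeholders in the LLM's answer are swapped back. A trivial email-only version in Python:

```python
import re

# Very rough email matcher for illustration only; a real proxy would
# use proper NER / PII detection, not a single regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def anonymize(text: str):
    """Replace each email address with a placeholder and return the
    redacted text plus the mapping needed to undo the substitution."""
    mapping = {}
    def repl(match):
        placeholder = f"<PII_{len(mapping)}>"
        mapping[placeholder] = match.group(0)
        return placeholder
    return EMAIL_RE.sub(repl, text), mapping

def deanonymize(text: str, mapping: dict) -> str:
    """Restore the original PII in the LLM's answer."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

# Outbound: redact before the text leaves the house.
redacted, mapping = anonymize("Please reply to jane.doe@example.com")
print(redacted)  # Please reply to <PII_0>

# Inbound: restore placeholders in the model's answer.
answer = "I contacted <PII_0> as requested."
print(deanonymize(answer, mapping))  # I contacted jane.doe@example.com as requested.
```

In an actual proxy this pair would wrap the HTTP request and response bodies of the provider API, so Zammad only ever sees the de-anonymized result.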
Is anything like that planned for Zammad? I figured it might be good to have it as a proxy service, since it is universally usable and also works with zammad-mcp.
I haven't set up a GitHub repository for this yet, but if people are interested in testing and/or contributing, I might do so.
I think there is currently no plan in this direction for the short term, because you can decide on your own which provider should be used, or whether you really need a local solution.
But maybe at some point it could be added as an opt-in, with some additional functionality (we would need to check whether there is a good way to do this directly in Ruby, or whether an additional service is always needed).