Zammad outage spikes

  • Used Zammad version: Zammad version 6.5.2-1769767752.d278bdb0.noble
  • psql (PostgreSQL) 16.11
    elasticsearch 8.19.10
  • Used Zammad installation type: package
  • Operating system:
    Server 1 (Zammad app): OS: Ubuntu 24.04.3 LTS
    Server 2 (Postgres DB): OS: Ubuntu 24.04.3 LTS
    Server 3 (ElasticSearch): OS: Ubuntu 24.04.3 LTS
  • Browser + version: [Brave 1.86.148 (Official Build) (64-bit)]

Hello,

I’m experiencing intermittent spikes in Zabbix monitoring related to Zammad Web UI availability.

I configured a Web Scenario in Zabbix to monitor the Zammad web interface and alert when the site becomes unavailable. Periodically, Zabbix reports “Zammad Web UI is down”. At the exact moment of the alert, the web interface noticeably freezes or lags for a few seconds, then recovers immediately and continues to operate normally.

What makes this difficult to diagnose:

  • No errors in Zammad logs
  • No Nginx or application errors
  • No CPU, RAM, or I/O spikes
  • No network drops or connectivity issues
  • No visible service restarts
  • Zabbix server itself shows no resource spikes

The behavior looks like a short application stall (2–5 seconds), not a full outage.
The frequency varies: sometimes once per day, sometimes every 30 minutes.

Has anyone encountered similar short “micro-outages” or brief UI stalls without corresponding log errors?
What would be the best way to instrument or test this further to identify the root cause (request timing, upstream latency, Puma workers, DB waits, etc.)?
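One way to get independent request timing is a small probe that runs next to the Zabbix check and logs only the slow or failed requests with a timestamp, so the stalls can be lined up against Nginx, Puma, and PostgreSQL logs. The following is a minimal sketch using only the Python standard library. The URL, the 2-second threshold, the 5-second interval, and the function names `probe`, `is_stall`, and `run` are all assumptions for illustration, not anything from Zammad or Zabbix.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=10.0):
    """Fetch url once and return (elapsed_seconds, http_status).

    http_status is None when no HTTP response was received at all
    (connection refused, DNS failure, timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        status = None
    return time.monotonic() - start, status


def is_stall(elapsed, status, threshold=2.0):
    """Treat a request as a stall if it failed or took too long."""
    return status is None or elapsed >= threshold


def run(url, interval=5.0, threshold=2.0):
    """Probe forever and print one line per stalled request."""
    while True:
        elapsed, status = probe(url)
        if is_stall(elapsed, status, threshold):
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} status={status} elapsed={elapsed:.2f}s", flush=True)
        time.sleep(interval)
```

Calling `run("https://zammad.example.com/")` (a placeholder hostname) from the Zabbix server and, separately, from the app server itself would help show whether the delay appears in front of Nginx or behind it.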

With that heavy separation, I’d assume you have quite a lot of users.
So you might want to share your number of concurrent users and your performance tuning settings as well.

That is extremely relevant to what anyone could answer here.

I would say approx. 20 agents.
Around 200 customers, but maybe 20 of them use the portal to open tickets; the rest use email and never visit the portal.

We are looking to expand to about 45 agents later, but I want to solve this before that if possible.

We would also have a lot more customers when we expand to 45 agents. So far we are internal only, before we move on to outside customers.

So you have zero performance tunings active?

Yes, we have not tuned anything.
Mainly because the system is working stable, no spikes in usage, nothing.
App server: CPU stable at 16%, RAM 8%.
DB server: CPU 2%, RAM 3%.
Elasticsearch server: CPU 7%, RAM 63%.

Unless you think I should do some tuning?

Well yes, no wonder the web interface begins to be slow and unresponsive when people are using it.

You can learn more in the documentation

or… if you prefer video content, I am explaining that stuff in this video.

Thank you for this. I will set it up and get back to you with feedback.

I did some performance tuning.
It has barely touched the server load. We have planned for an extreme scale, so we will have to keep an eye on this.

I will let you know whether this has fixed the issue.

Yes, tinkering around will take some iterations if you ramp up the traffic at some point. But it’s definitely manageable.