Zammad is exhausting resources

Infos:

  • Used Zammad version: 3.5.0
  • Used Zammad installation source: zammad-helm
  • Operating system: (Docker on Container-Optimized OS on k8s 1.16 in GKE)
  • Browser + version: (any)

Expected behavior:

  • Run a Zammad installation for 20 users with reasonable compute resources.

Actual behavior:

Zammad keeps consuming more and more resources, no matter how much we give it. https://docs.zammad.org/en/latest/appendix/configure-env-vars.html recommends a Ruby command to count the number of active sessions, and for us it's nearly 600:

zammad@zammad-0:~$ bin/rails r "p Sessions.list.uniq.count"
I, [2020-11-06T14:07:37.334189 #207-47006180533700]  INFO -- : Setting.set('models_searchable', ["Chat::Session", "KnowledgeBase::Answer::Translation", "Organization", "Ticket", "User"])
573

It is absolutely impossible that 570 people are using our Zammad, let alone have it currently opened in their browsers.

We can also see hundreds of “invalid client_id receive!” errors in the logs because of hundreds of bogus (?) /api/v1/message_receive HTTP requests.

10.52.3.2 - - [06/Nov/2020:14:03:30 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:32 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:33 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
10.52.3.2 - - [06/Nov/2020:14:03:33 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
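To get a feel for how fast these rejected requests arrive, the access log can be tallied per second. The following is only a rough sketch in plain Ruby: it assumes the log format shown in the lines above, and the helper name `rejected_per_second` is made up for illustration.

```ruby
# Matches access-log lines in the format shown above:
# client IP, timestamp in brackets, quoted request line, status code.
LOG_LINE = /\A(?<ip>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) [^"]*" (?<status>\d+) /

# Counts POSTs to /api/v1/message_receive that were answered with 422,
# grouped by timestamp (one-second resolution in this log format).
def rejected_per_second(lines)
  counts = Hash.new(0)
  lines.each do |line|
    m = LOG_LINE.match(line)
    next unless m
    next unless m[:path] == "/api/v1/message_receive" && m[:status] == "422"
    counts[m[:time]] += 1
  end
  counts
end

# Shortened sample lines (user agent trimmed for readability).
SAMPLE_LINES = [
  '10.52.3.2 - - [06/Nov/2020:14:03:30 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0"',
  '10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0"',
  '10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "POST /api/v1/message_receive HTTP/1.1" 422 92 "https://example.com/" "Mozilla/5.0"',
  '10.52.3.2 - - [06/Nov/2020:14:03:31 +0000] "GET /api/v1/signshow HTTP/1.1" 200 92 "https://example.com/" "Mozilla/5.0"'
].freeze

rejected_per_second(SAMPLE_LINES).each do |time, count|
  puts "#{time}  #{count}"
end
```

Against a real log, the same helper could be fed with `File.foreach` on whatever path the proxy writes to. Grouping by the `ip` capture instead of `time` would only help if the proxy logs the real client address rather than an internal one like `10.52.3.2`.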

All these sessions then also consume hundreds of Postgres database connections, no matter what we do. If we configure the Rails DB pool, the Zammad scheduler keeps crashing because it needs more connections than the pool allows (currently we have set the pool to 380 connections). We can't use PgBouncer because its isolation level breaks Zammad and keeps it from starting at all.

Please give me hints on where to look for the root cause and what to do to run a small-scale Zammad installation without spending thousands of dollars per month on idle database connections. As everyone else is apparently running Zammad with no issues, we must be doing something terribly wrong.

In the meantime I tried

  • bin/rails db:sessions:trim
  • bin/rails tmp:clear
  • manually removing 5,000 sessions from the DB: delete from sessions where data LIKE 'BAh7AA==%';
  • disabled the Zammad Prometheus exporter, fearing its requests to the health endpoint might cause issues

This seems to have stabilized the number of open DB connections (I might have also just kicked a bunch of users out of the system :smiley: ).
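As background on the `BAh7AA==` prefix in the DELETE statement above: it is what an empty Ruby hash looks like after `Marshal.dump` and Base64 encoding, so the deleted rows were most likely sessions with no data in them. A small sketch to check this, assuming the session store uses the marshal-plus-Base64 serialization (the helper `empty_session?` is hypothetical):

```ruby
# Marshal.dump({}) is the 4 bytes 04 08 7B 00; pack("m") Base64-encodes
# them (this is what Base64.encode64 does) and appends a newline.
EMPTY_SESSION = [Marshal.dump({})].pack("m")

# True if a stored session payload is just the empty hash.
# Assumption: the data column holds Base64-encoded Marshal output.
def empty_session?(data)
  data.strip == EMPTY_SESSION.strip
end

puts EMPTY_SESSION.inspect
puts empty_session?("BAh7AA==\n")
puts empty_session?([Marshal.dump({ "user_id" => 1 })].pack("m"))
```

The trailing newline added by the encoding would also explain why the SQL uses `LIKE 'BAh7AA==%'` rather than an exact match.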

I can see that Zammad itself cleans up on a regular basis: https://github.com/zammad/zammad/blob/43b6374d163fc893566f26306010ec7fdcd5e78a/lib/websocket_server.rb#L30-L37 – maybe these cleanups are missing in our setup somehow?

Looking at numerous Zammad issues, it seems that we do have some users who (for whatever reason) fall back to long polling instead of websockets. It makes sense that this causes way more load, and we need to investigate why these users cannot or do not use websockets.

Just a side note:
The higher you configure the pool within Zammad's db-config file, the more connections Zammad will consume.

By default we pool up to 50 connections per process.
This means that, by default, Zammad will use roughly 200 connections.

This should be more than enough for 20 users.
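The arithmetic behind that estimate can be sketched as follows. The process count of 4 is an assumption, inferred only from "50 per process" and "around 200" above; the actual number depends on how the installation is deployed.

```ruby
# Upper bound on database connections: every process keeps its own pool,
# so the worst case is pool size multiplied by the number of processes.
def max_db_connections(pool_size:, processes:)
  pool_size * processes
end

ASSUMED_PROCESSES = 4 # assumption, not a documented Zammad value

puts max_db_connections(pool_size: 50, processes: ASSUMED_PROCESSES)   # default pool
puts max_db_connections(pool_size: 380, processes: ASSUMED_PROCESSES)  # pool size from the original post
```

Under the same assumption, a pool of 380 allows up to 1520 connections, which fits the observation that raising the pool raises consumption.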