Performance problems with 30 employees

swegscheider · October 12, 2022, 6:54am

Hello,
We have been testing Zammad with 3 employees for several months.

In the future, around 60-70 employees will work with it - around 30-40 of them at the same time.

Infos:

Used Zammad version: 5.2.3
Used Zammad installation type: Package
Operating system: Ubuntu 20.04
Browser + version: Different (Chrome, Firefox, Edge)

Expected behavior:

Staff actions like ticket creation should be done within 5 seconds.

Actual behavior:

During setup and testing, we primarily tested with 1-3 people and everything worked as expected - the performance was okay (ticket creation took about 2-4 seconds).

Yesterday we made a performance test (ticket creation and ticket updating at the same time) with 30 employees, unfortunately there were extreme delays (waiting time for e.g. ticket creation: 30-90 seconds).

The server was not busy at the time of the test, CPU about 50% - RAM usage about 41 GB. See screenshot:

Information about the server:
CPU: 12 Core x 3.1 GHz (AMD Ryzen 9 Pro 3900) - 24 threads
RAM: 128GB

Server / Zammad configurations and settings:
zammad config:set ZAMMAD_SESSION_JOBS_CONCURRENT=14
zammad config:set WEB_CONCURRENY=12
zammad config:set ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS=6
zammad config:set ZAMMAD_PROCESS_SCHEDULED_JOBS_WORKERS=1

PostGreSQL (/etc/postgresql/12/main/postgresql.conf):
shared_buffers = 8GB
temp_buffers = 256MB
work_mem = 10MB
max_stack_depth = 5MB

RAM disk:
tmpfs /opt/zammad/tmp tmpfs defaults,size=8000M 0 0

Information about Zammad instance:
14 triggers
11 automations
21 individual objects
32 core workflows
About 16,000 users
About 6000 organizations
27 overviews (maximum 19 for staff)
0 tickets

Active Integrations: CTI

Steps to reproduce the behavior:

~30 Staff create a new ticket at the same time.

I think the server is fast enough. But we can’t find the bottleneck.

Does somebody has any idea?

Thank You & Regards

MrGeneration · October 12, 2022, 8:23am

max_connections is set how high?

swegscheider · October 12, 2022, 9:05am

Hi MrGeneration,
during the test we had 2000.
We changed it to 8000, but we haven’t done any further testing since then.

Is there a maximum or does it depend on the CPU(?) load?

dvanzuijlekom · October 12, 2022, 11:17am

If you’re hitting the PostgreSQL max_connections limit, you should see this in your PostgreSQL logs, e.g. something like:

FATAL: remaining connection slots are reserved for non-replication superuser connections

So inspect your database logs for errors like this. Blindly raising max_connections to an arbitrary high number is a bit of a dumb idea. Here is a pretty good write-up on the subject.

I would start looking for I/O bottlenecks, e.g. disk or database. The data you’ve supplied doesn’t mention this and your platform’s resources seem like quite a bit of overkill, the problem isn’t likely to be in the number of CPU cores or the memory, when looking at the data, although you might need to convince the middleware (Apache/NginX, PostgreSQL, ElasticSearch) to actually utilize it, by tuning it a bit more to the massive specs. Another issue could be deadlocks in the database, due to the high number of concurrent users all wanting to write to the database tables. Configure PostgreSQL to actually start logging information about locking issues (log_lock_waits = on) and Inspect your database logs.

Regarding PostgreSQL tuning: we have set maintenance_work_mem and autovacuum_work_mem to higher values. We have turned off synchronous_commit and you should consider configuring effective_cache_size. Here’s a bit of a write-up on that subject.

swegscheider · October 17, 2022, 6:44am

hey dvanzuijlekom,
thanks a lot!

Regarding to Your reply we made some further changes within postgresql.conf…:

shared_buffers = 48GB
temp_buffers = 4096 MB
max_connection = 2000 → 200 (errors occured) → 800
#log_lock_waits = off → log_lock_waits = on
work_mem = 5MB → 4096MB
#effective_cache_size = 4GB → effective_cache_size = 24GB
#synchronous_commit = on → synchronous_commit = off
#log_min_duration_statement = -1 → log_min_duration_statement = 200
#log_checkpoints = off → log_checkpoints = on
#log_connections = off → log_connections = on
#log_disconnections = off → log_disconnections = on
#log_statement = ‘none’ → log_statement = ‘ddl’
#log_temp_files = -1 → log_temp_files = 0
maintenance_work_mem = 4500MB
autovacuum_work_mem = 64MB

zammad config:get WEB_CONCURRENCY=12
zammad config:set ZAMMAD_SESSION_JOBS_CONCURRENT=16

After restarting postgresql & zammad, the performance seems to be better.

We will launch Zammad in november - I’ll report back if the configuration works.

Thanks again!

dvanzuijlekom · October 17, 2022, 12:42pm

Take care when setting work_mem to such a massively high value, as this setting is per individual query operation. This is fine when you’re hosting an application where only a few queries run at specific moments, which could yield huge result-sets, e.g. data-warehouse analytics queries. If the queries for basic/common Zammad functionality or daily operations need these amounts of memory, you will likely a) run into out-of-memory situations very fast and b) you’ll also need to figure out what is wrong with your data, result-sets and/or queries, as this really shouldn’t be happening.

Your effective_cache_size setting seems wrong, as this should at least include the amount of memory dedicated via shared_buffers plus the expected/estimated operating system caches/buffers available for PostgreSQL. Based on your specs, you could probably safely set this to about 64GB instead of 24GB. Please see here for more information on tuning parameters such as these. Keep in mind that if you’re running PostgreSQL on the same server as the rest of the Zammad stack, you should be conservative here.

Also, keep checking your PostgreSQL logs for anomalies during day-to-day operations when applying tuning like this, I can’t stress this enough. The tuning I did was based on daily plan-do-check-adjust cycles based on I/O statistics, realtime monitoring/logging as well as user feedback/experience.

system · February 14, 2023, 12:42pm

This topic was automatically closed 120 days after the last reply. New replies are no longer allowed.