System performance very limited without obvious reason

Infos:

  • Used Zammad version: 4.1
  • Used Zammad installation type: (source, package, docker-compose, …) Package
  • Operating system: Ubuntu 20
  • Browser + version: All

Expected behavior:

  • Our system, although far from being heavily utilized, is extremely slow. When we started on a fresh install performance was fantastic, and a ticket would load immediately.
    Now any ticket would take 3 to 10 sec to load, and even logging in takes time after login / pwd are entered.
    Even when connecting for an SSH session the prompt would take 10 to 15 second.
    The system is obviously overload, but why is the question… we suspect Elasticsearch…
    We are running a VM with 16Gb of RAM and 16 Core, on a iSCSI drive over a 10gb/s dedicated link. Our system has 5 to 10 simultaneous users (agents+customers).

See how bad it is :slight_smile:

Actual behavior:

  • Very slow system.

Steps to reproduce the behavior:

  • We tried to install a fresh OS and restored a Zammad backup, and end-up in the same state.

That seems like a beefy VM, but does it actually get all the CPU cycles from the host OS, or is it capped? It might also be slow due to CPU over-commitment on the hypervisor or combined with a very busy virtualization platform.

Another reason for feeling sluggish could be due to high I/O wait (slow storage), which isn’t visible on your screenshot. Maybe post the results of top or check the output of iotop for possible storage latency/bandwidth issues?

Thanks for taking the time to share your thoughts.
No the VM isn’t capped - on the contrary the physical machine has lots of unused CPU/RAM and it not limiting VMs.

top/iotop look like this:


Can you see like me that Elasticsearch and postgre are kind of taking too much room on the memory?
I was also thinking about perhaps the system buffering more than needed (rely too much on the HDD buffer file as opposed to using actual RAM)

Again thanks for your help,
A.

Maybe this can help as well (Disk latency viewed from the hypervisor):

The top capture shows a system which is barely under load (but this is only a snapshot). The disk latency graph shows the problem (if I’m reading it correctly). The frequent ~20ms delays, with spikes to 40ms and up to 60ms would be very noticeable on the system and indicate storage performance issues. You should probably focus on that.

Thanks, I am looking into this on our NAS.
I wonder why Zammad is so sensitive to latency (our other services are running fine on the same setup) - is it because of Elasticsearch needing constant/frequent access to HDD?

A.

Your performance may be poor if you haven’t tuned Zammad out of the box. This can become more apparent after you’ve hit a certain number of tickets in your setup, or simultaneous agents working on those tickets. Your really should look into performance tuning and start by gathering metrics on CPU usage, load, disk I/O, network traffic, etc. The basics. Monitoring the database might be a good start, as it is constantly being used on a busy system. If it is improperly tuned, most of that I/O might hit the disks instead of RAM, killing performance, unless you have a very high performance storage backend.

Thanks Dennis. We have ordered an MVNe PCI card to be used as buffer on the NAS. We’ll see how this affects overall perf.

To be honest my “naive” approach was that, because we only have 3/4 concurrent access, tuning wasn’t critical. But I’ll follow your advice and look into this after the NAS has been upgraded.

A.

Quick update, and hopefully some insight for people facing similar behavior:

  • The ssd cache did help a little, but I was still seing high latency that translated into slow GUI.
  • So I looked into vSphere documentation and realize the SCSI controller I was using for the Zammad VM wasn’t perhaps the best one → Changed to ‘VMware Paravirtual SCSI’, and the latency totally went down (less than 10ms)…

It could have been the end of if, but unfortunately the overall system is still very slow.
Here is my config: 16 Cores, 16 Gb of RAM, 500 Gb HDD

One of the most frustrating behavior is logging in the system, and having to wait 10s in front of a white page, before the Zammad GUI would show…

Any suggestion would be more than Welcome :slight_smile:

A.

What I’ve said before and what you seem to be ignoring: start by monitoring the basics. Start by graphing performance metrics over time, inside the OS as well as at the hypervisor level. I’m guessing your VM packs a bit of a punch, but the software just isn’t using all that power because it isn’t tuned appropriately. Adding RAM does nothing if you aren’t configuring your applications to actually use it. Giving the VM a lot of CPU’s does nothing if the applications aren’t instructed to start more worker threads to distribute the load. Performance metrics will probably point to a possible problem area to focus on, allowing for some quick wins.

  • What is the output of this database query?
    SELECT * FROM pg_settings WHERE source != 'default';

  • What is the output of this shell command (run while the system is feeling sluggish):
    zammad run rails r 'p Delayed::Job.count'

  • What is the output of these shell commands?
    zammad config:get WEB_CONCURRENCY
    zammad config:get ZAMMAD_SESSION_JOBS_CONCURRENT

I have the same impression: the power isn’t used by the VM.
Please see bellow the output of the database:

name setting unit
application_name pgAdmin 4 - CONN:4532221 NULL
bytea_output hex NULL
client_encoding UNICODE NULL
client_min_messages notice NULL
cluster_name 12/main NULL
data_checksums off NULL
DateStyle ISO, MDY NULL
default_text_search_config pg_catalog.english NULL
dynamic_shared_memory_type posix NULL
lc_collate en_US.UTF-8 NULL
lc_ctype en_US.UTF-8 NULL
lc_messages en_US.UTF-8 NULL
lc_monetary en_US.UTF-8 NULL
lc_numeric en_US.UTF-8 NULL
lc_time en_US.UTF-8 NULL
listen_addresses * NULL
log_line_prefix %m [%p] %q%u@%d NULL
log_timezone Etc/UTC NULL
max_connections 2000 NULL
max_stack_depth 5120 kB
max_wal_size 1024 MB
min_wal_size 80 MB
port 5432 NULL
server_encoding UTF8 NULL
shared_buffers 262144 8kB
ssl on NULL
ssl_cert_file /etc/ssl/certs/ssl-cert-snakeoil.pem NULL
ssl_key_file /etc/ssl/private/ssl-cert-snakeoil.key NULL
temp_buffers 32768 8kB
TimeZone Etc/UTC NULL
transaction_deferrable off NULL
transaction_isolation read committed NULL
transaction_read_only off NULL
wal_buffers 2048 8kB
wal_segment_size 16777216 B
work_mem 10240 kB

run rails r ‘p Delayed::Job.count’ → returned 0
zammad config:get WEB_CONCURRENCY → returned nothing
zammad config:get ZAMMAD_SESSION_JOBS_CONCURRENT → returned nothing

Thanks for your support,
A.

Justin case someone has an idea - I see Elasticsearch crashing quite often as well…

elasticsearch.service - Elasticsearch
Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2021-10-13 00:01:47 CEST; 15h ago
Docs: https://www.elastic.co
Process: 1501 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=143)
Main PID: 1501 (code=exited, status=143)
Oct 13 00:00:32 support systemd[1]: Starting Elasticsearch…
Oct 13 00:01:47 support systemd[1]: elasticsearch.service: start operation timed out. Terminating.
Oct 13 00:01:47 support systemd[1]: elasticsearch.service: Failed with result ‘timeout’.
Oct 13 00:01:47 support systemd[1]: Failed to start Elasticsearch.