Infos:
- Used Zammad version: 5.2.1
- Used Zammad installation type: package
- Operating system: Previously Ubuntu, now Alma Linux
- Browser + version: Edge 103.0.1264.49
- Maximum 15 concurrent users
Expected behavior:
System should be responsive with minimal delays
Actual behavior:
Regular white screens while loading tickets, and tickets slow to update
Steps to reproduce the behavior:
Refresh the page, or update a ticket.
It may help if I give a timeline of the events that led up to this issue, because the system worked perfectly for two years until a couple of weeks ago.
The first sign of trouble was when Office 365 stopped accepting IMAP logins, which Zammad used for several email accounts. We verified this to be an Office 365 issue and not related to Zammad. I then set up the Microsoft 365 channel type in Zammad to import emails, and tickets started coming in again. However, performance was terrible: every screen and everything I clicked on took 30-40 seconds to complete. I assumed the system was catching up with itself and would improve after a few hours, but it did not. The load average of the server was consistently above 5, and CPU was maxed out at 100% on all cores by a combination of Puma and background_worker processes, which took it in turns to hammer the system.
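In case it helps, this is roughly how I have been checking whether the background worker is simply drowning in queued jobs. It assumes Zammad's delayed_job backend and a package install where the zammad wrapper is available, so the exact invocation may differ on other setups:

# Count queued background jobs (Zammad's delayed_job backend)
zammad run rails r 'puts Delayed::Job.count'
# Show how old the oldest queued job is, to see how far behind the worker runs
zammad run rails r 'j = Delayed::Job.order(:created_at).first; puts j ? j.created_at : "queue empty"'

When the count stays in the thousands and the oldest job keeps getting older, I take it that the worker is not keeping up.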
I started checking logs and found that the most prevalent issue was database deadlocks; every deadlock appeared to involve a taskbar-related query:
2022-07-08 10:51:40.578 UTC [1043561] ERROR: deadlock detected
2022-07-08 10:51:40.578 UTC [1043561] DETAIL: Process 1043561 waits for ShareLock on transaction 805640; blocked by process 1048434.
Process 1048434 waits for ShareLock on transaction 805639; blocked by process 1043561.
Process 1043561: SELECT "taskbars".* FROM "taskbars" WHERE "taskbars"."id" = $1 LIMIT $2 FOR UPDATE
Process 1048434: SELECT "taskbars".* FROM "taskbars" WHERE "taskbars"."id" = $1 LIMIT $2 FOR UPDATE
2022-07-08 10:51:40.578 UTC [1043561] HINT: See server log for query details.
2022-07-08 10:51:40.578 UTC [1043561] CONTEXT: while locking tuple (101,2) in relation "taskbars"
2022-07-08 10:51:40.578 UTC [1043561] STATEMENT: SELECT "taskbars".* FROM "taskbars" WHERE "taskbars"."id" = $1 LIMIT $2 FOR UPDATE
2022-07-08 10:51:40.579 UTC [1043561] ERROR: current transaction is aborted, commands ignored until end of transaction block
2022-07-08 10:51:40.579 UTC [1043561] STATEMENT: DEALLOCATE a1
2022-07-08 10:51:40.580 UTC [1043561] ERROR: current transaction is aborted, commands ignored until end of transaction block
2022-07-08 10:51:40.580 UTC [1043561] STATEMENT: DEALLOCATE a2
2022-07-08 10:51:40.581 UTC [1043561] ERROR: current transaction is aborted, commands ignored until end of transaction block
2022-07-08 10:51:40.581 UTC [1043561] STATEMENT: DEALLOCATE a3
2022-07-08 10:51:40.582 UTC [1043561] ERROR: current transaction is aborted, commands ignored until end of transaction block
2022-07-08 10:51:40.582 UTC [1043561] STATEMENT: DEALLOCATE a4
2022-07-08 10:51:40.590 UTC [1043561] ERROR: current transaction is aborted, commands ignored until end of transaction block
The DEALLOCATE lines above repeat for many pages. I tried looking up some of the process IDs, but the processes no longer existed by the time I searched. The maximum number of database connections was set to 500 at the time, with about 430 in use at the busiest times; I have since increased it to 2000.
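For reference, this is the sort of thing I have been running in psql to see what is holding things up while the UI hangs. These are standard Postgres views; my database is called zammad_production, so adjust the name if yours differs:

# Show non-idle sessions, how long they have been running and what they are waiting on
sudo -u postgres psql zammad_production -c "
  SELECT pid, state, wait_event_type, wait_event,
         now() - query_start AS runtime, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE datname = current_database() AND state <> 'idle'
  ORDER BY runtime DESC;"
# Confirm the configured connection limit (I raised mine from 500 to 2000)
sudo -u postgres psql -c "SHOW max_connections;"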
During troubleshooting, given the high CPU usage, I decided to migrate the system to a more powerful server to see if that would help, so I have now moved it to Amazon EC2. I went with a t3a.2xlarge instance type with 8 vCPUs and 32 GB RAM, which performed the best of the several types I tried, but it is still not good. For the migration I used a pre-defined AWS image from the marketplace to ensure everything was set up correctly, then took a backup from the old server, restored it to the new one, and reinstalled the packages; the new system came right up and loaded without error. For the record, the AWS image uses Alma Linux as the OS, whereas my previous in-house server used Ubuntu.
Since moving to AWS, CPU usage has settled down considerably to 1-3% on average, but load times are still awful and I still see deadlock errors in the database. They occur much less frequently now, but they still happen and are still mostly related to the taskbar. I note that there was a previous bug related to taskbar deadlocks, but I understand the fix should already be included in my version, 5.2.1? Search taskbars cause DeadLocks · Issue #3087 · zammad/zammad · GitHub
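To get a feel for how often the deadlocks still happen, I have just been counting them per day in the Postgres log. The log path below is my guess for the Alma Linux package layout, so adjust it for your setup:

# Rough count of "deadlock detected" errors per day (path is a guess for Alma Linux)
grep -h "deadlock detected" /var/lib/pgsql/data/log/postgresql-*.log | cut -d' ' -f1 | sort | uniq -c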
How can I further diagnose this issue, and which logs should I be looking at? At the moment I am mostly checking the PostgreSQL log and Zammad's own log; is there something else I should be looking at as well? The main trigger for all of this seems to have been the switch from IMAP accounts to Microsoft 365, and performance has been incredibly bad since then, but I can't find much in the logs to indicate why or what is causing it.
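If it would help, I can also enable more detailed statement and lock logging on the Postgres side. This is a sketch of what I was planning to try; the threshold value is just my guess, not a recommendation:

# Log any statement slower than 1 second and any lock wait longer than
# deadlock_timeout, then reload the configuration
sudo -u postgres psql -c "ALTER SYSTEM SET log_min_duration_statement = 1000;"
sudo -u postgres psql -c "ALTER SYSTEM SET log_lock_waits = on;"
sudo -u postgres psql -c "SELECT pg_reload_conf();"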