GUI sometimes very slow, high CPU usage, closed tickets showing up as new

Infos:

  • Used Zammad version: 3.1.0
  • Used Zammad installation source: packager.io stable branch
  • Operating system: Debian 9
  • Browser + version: current Chrome, Firefox, Edge

Expected behavior:

  • Anything I do should happen without lag

Actual behavior:

  • Changing owner, state, or group is sometimes very slow.
    We have ~5-6 concurrent agents.
    Our system specs: VMware vSphere host with an AMD EPYC 7351P and 128 GB RAM; iSCSI target: Synology FS1018 with 6 Samsung SM863a SSDs. The Zammad VM has 4 cores, 24 GB RAM, and a 50 GB SSD. We see 4x100% CPU load (each core at 100%) 40-50% of the time, especially when:
  1. a new ticket is created via mail
  2. a user is changing the overview
  3. a user is changing the ticket state
  4. a user is adding an article to a ticket
    etc.
    Every time we have these “spikes” (lasting 10-20 seconds), several Ruby processes each eat up 100% of a CPU core.

We have set WEB_CONCURRENCY=4, which helped a bit; after that we raised the number of DB connections because of timeouts. We have also changed attachment storage from DB to filesystem, which boosted performance marginally.
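For reference, on our packager.io install these changes are applied roughly like this (a sketch; the pool value of 50 is only an example, not our exact setting):

zammad config:set WEB_CONCURRENCY=4
# then raise the connection pool in /opt/zammad/config/database.yml, e.g. pool: 50
systemctl restart zammad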

Every time the system becomes sluggish, Delayed::Job.count rises to something like 200-300; the jobs are processed slowly, and once the count is back down to around 100, the system is responsive again.
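To keep an eye on this, something like the following works (just a sketch; the 30-second interval is arbitrary):

zammad run rails r "loop { puts format('%s  %d', Time.now, Delayed::Job.count); sleep 30 }"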

Steps to reproduce the behavior:

  • use zammad :wink:

Do I understand correctly that you’re running Zammad’s storage on a Synology iSCSI?

More or less - the Synology iSCSI is mounted as a datastore in vSphere and holds the VMDK files, but it’s connected via 10GbE, and I have LOTS of IOPS (something like 50k random) and 5 Gbps write speed, so that shouldn’t be the problem…


A quite strange issue: this ticket is shown in the “closed” overview, but its history looks like this:

It seems as if the GUI needs some time to display all ticket states (in this case, 2 hours). I cannot find any way to reproduce it.
The strange thing is that this only happens with one particular agent.

Here I grepped for that particular ticket number in production.log:

root@ticket:~# cat /var/log/zammad/production.log |grep '91011070'
I, [2019-10-01T14:22:21.726120 #32387-47231808056000]  INFO -- : Send notification to: 
idoit@customer.de (from:Ticketsystem customer GmbH <ticket@customer.de>/subject:Neues Ticket 
(Message Notification from VPS, CUSTOMERPHONE1) [Ticket#91011070])
I, [2019-10-01T14:27:00.968245 #32487-47451771571100]  INFO -- :   Parameters: 
{"number"=>"91011070", "title"=>"Message Notification from VPS, CUSTOMERPHONE1", "group_id"=>5, "owner_id"=>26, "customer_id"=>118, "state_id"=>1, "priority_id"=>2, "updated_at"=>"2019-10-01T12:22:04.485Z", "preferences"=>{"channel_id"=>4}, "pending_time"=>nil, "postfach_angelegt"=>nil, "user_angelegt"=>false, "id"=>"11657"}
I, [2019-10-01T14:27:53.058640 #32487-69867408081680]  INFO -- :   Parameters: {"number"=>"91011070", "title"=>"Message Notification from VPS, CUSTOMERPHONE2", "group_id"=>5, "owner_id"=>26, "customer_id"=>118, "state_id"=>1, "priority_id"=>2, "updated_at"=>"2019-10-01T12:27:01.270Z", "preferences"=>{"channel_id"=>4}, "pending_time"=>nil, "postfach_angelegt"=>nil, "user_angelegt"=>false, "id"=>"11657"}
I, [2019-10-01T14:28:09.222633 #32475-69867744288580]  INFO -- :   Parameters: {"number"=>"91011070", "title"=>"Message Notification from VPS, CUSTOMERPHONE2", "group_id"=>"5", "owner_id"=>"26", "customer_id"=>118, "state_id"=>"4", "priority_id"=>"2", "updated_at"=>"2019-10-01T12:27:53.083Z", "preferences"=>{"channel_id"=>4}, "pending_time"=>nil, "postfach_angelegt"=>false, "user_angelegt"=>false, "id"=>"11657", "all"=>"true"}

Here we have 2 different emails that became one ticket, if I read this correctly (I replaced the phone numbers in the titles with CUSTOMERPHONE1 and CUSTOMERPHONE2).

I am really confused :confused:

If this only happens to one agent, you might want to check that user’s local machine.
Probably they have a huge number of ticket tabs open on the left side, which will slow down the browser drastically.

The mentioned specs should be fair enough, even though I have a bad gut feeling about the iSCSI - however, I trust you if you say read and write speed shouldn’t be a problem.

It would be very interesting to know what exactly these delayed jobs are about - like so:

Delayed::Job.first.handler
Delayed::Job.second.handler
Delayed::Job.third.handler

Is it about sending mails or is it about search indexing?

This will help us decide where to look.
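If there are many jobs queued, a rough tally by job class also helps (a sketch, not official tooling - the first line of each job’s YAML handler usually names the job class):

zammad run rails r "counts = Hash.new(0); Delayed::Job.pluck(:handler).each { |h| counts[h.to_s.lines.first.to_s.strip] += 1 }; counts.sort_by { |_, n| -n }.each { |k, n| puts format('%5d  %s', n, k) }"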

Side note from my personal experience: whenever I had to deal with Synology iSCSI and VMware, I found the combination to be extremely slow (even when the network around it was more than fast enough). This experience is a bit older by now, so things might have improved. Just so you understand where my bad gut feeling comes from. :wink:

Having this strange issue right now without any delayed jobs - closed tickets appear as “new” in an overview that only contains closed tickets. Also, when I open the ticket, it appears as “new”, but it was already closed by an agent.


I just took the first ticket that shows up - this is what I see when I open the ticket history:

Any clue? Delayed::Job.count shows 0… This does happen with several agents, all having maybe 2 or 3 tickets in their left sidebar…

that’s the overview:

When I close the tickets “again”, they show up as closed, but with a very funny history:

Your overviews on the left side would interest me far more.
It looks like you’re putting your scheduler under high load with too many overviews and too many tickets.
One of your screenshots shows at least 2100 tickets, most of which are closed. Is this really necessary?

Please reduce your overviews to 15-20, make them as universal as possible, and avoid big overviews. In my opinion you don’t need an overview that shows your closed tickets - if you do, you might want to limit the entries to the last n days and, for anything older, use the search function instead.

(see: https://admin-docs.zammad.org/en/latest/manage-overviews.html )

Thanks - see the screenshot above - the overview “Erledigt AB” covers the last 7 days only… I changed this to the last 2 days now; the overview contains ~1000 tickets. Normally it will contain ~300 tickets… Maybe this will reduce the load.
Gonna check what happens next…


After one week with reduced overviews we still have the problem that closed tickets sometimes show up as new in the GUI, but when I check the ticket’s history, it is closed. Then we had the situation that the particular person who always complains about this was on vacation for 2 days, and boom! The problem was gone. Now she’s back, and the issue is back.
She can’t tell what she is doing differently from the others… I have no clue either. Is it a layer 8 problem, or could it be something technical in Zammad?

Does this only affect one person?
If so, you might want to ensure that the user does not have more than 30 tabs open within Zammad.

If the user has more, this is likely your problem.
It should pair up with high CPU usage of the user’s browser on the client.

Technically this is a client / layer 8 issue, because the workload gets too high for the browser.
This is the reason why we limit the objects you can retrieve within Zammad.

This issue mostly happens if you update ticket attributes while the user has tabs open.
Normally Zammad removes the oldest tab to make room for a new one, but not if you have 30 tickets in “edit” state. Zammad won’t and can’t decide whether an edit/draft may be discarded.

Yes, this only affects one person. Everyone has 1 or 2 tickets open in the left sidebar, so that can’t be the issue. We tested Firefox and Chrome to rule out the local browser. The users are instructed to work on only one ticket at a time: they just listen to voicemails (open the ticket, the ticket is auto-assigned, download the .wav attachment, listen to it, note what is happening in another system, close the ticket with a macro that closes the ticket and the tab).
A veeeeery strange issue, all of this - I can’t really understand how it happens…

Does this issue still appear if the same user changes machine and/or location?

We need to try that; we haven’t had the chance yet. Thanks for the hint though, I nearly forgot about that…


Tried with another machine - same.
We have now had other users with the same issue, though not as frequently as that particular user, and we found out why: she works quicker and closes more tickets…
So this seems to be a general issue.
We checked that nobody has more than 2 or 3 tickets showing up in their left column (most only have the current ticket). To say it like Hubert Aiwanger or GĂĽnther Oettinger: I am with my Latin at the end :wink:

Thanks

Babak

Could you please check if these users are members of organizations? If so, do the following for some stat-magic:

zammad run rails r "p User.where(organization_id: User.find_by(email: '{affected-users-mail}').organization_id).count"

The above will check how many users are inside that organization.
It’s possible that you’re affected by the following:

This might also affect agents if the customer is a member of a fairly big organization.
This would also explain why your delayed-job count jumps up so high “instantly”.
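If you want the same overview across all organizations, something like this should work as well (a sketch; the threshold of 20 is arbitrary):

zammad run rails r "Organization.find_each { |o| n = User.where(organization_id: o.id).count; puts format('%4d  %s', n, o.name) if n > 20 }"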

Thanks!

Tried it, but I get an error message:

irb(main):001:0> p User.where(organization_id: User.find_by(email: '{affected.user@domain.tld}').organization_id).count
Traceback (most recent call last):
        1: from (irb):1
NoMethodError (undefined method 'organization_id' for nil:NilClass)

Then I tried:
irb(main):001:0> p User.where(organization_id: User.find_by(email: '{affected.user@domain.tld}')).count

Gives me a count of 28

Then I tried:
irb(main):001:0> p User.where(organization_id: User.find_by(email: '{affected.user@domain.tld}').organization_id)

This shows me users that are not tied to an organization (organization_id: nil). The affected user, though, is a member of an organization.
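By the way, the NoMethodError above just means that User.find_by returned nil - probably because I left the curly braces of the placeholder in the address on that attempt. A nil-safe variant of the command (a sketch; the address is a placeholder):

zammad run rails r "u = User.find_by(email: 'affected.user@domain.tld'); abort('no such user') unless u; p User.where(organization_id: u.organization_id).count"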

The organization has something like 60 members. All agents are members of this organization, and the voicemail user is also a member.

There are no SLAs defined

This phenomenon seems to happen as soon as the ticket volume rises. Right now, something like 1 ticket every few minutes is created, and the issue isn’t showing up. Some weeks ago, there were 3 or 4 new tickets every minute, and the issue started showing up…

I believe that you’re - at least partly - affected by the following bug:

Currently the only option to reduce the load in this regard is to remove the agents from organizations.
This can make a difference.
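To see which agents that would concern, something like this should list all agents that currently belong to an organization (a sketch; it assumes the default “Agent” role name):

zammad run rails r "p User.joins(:roles).where(roles: { name: 'Agent' }).where.not(organization_id: nil).distinct.pluck(:email)"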

Still odd, because some 6 concurrent agents with 2-4 updates per second shouldn’t be such a big deal, to be honest.
