[Solved] Emails delayed, high delayed job count, nothing useful in scheduler_out.log

Currently, we are suffering from very long delays between answering a ticket and sending out the actual email.
We already moved the tmp folder to a ramdisk and raised the concurrent webserver limit (High Latency, Timeouts, tickets twice), and we recreated the search index (Email notification delay about 5 hours, CPU Usage 100%).

I can see many scheduler.rb start -t processes running, one of which consumes 100% CPU.
The Delayed::Job.count rises constantly; it is currently at 1733 and growing by 2 or 3 every few seconds.
The scheduler_out.log doesn't show anything meaningful: a few lines of "log writing failed. execution expired" and a few lines of "output error : unknown encoding ASCII-8BIT".
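
For reference, the count above comes from the Rails console; a minimal way to watch it over time is something like the following (a sketch: zammad run rails r is the standard Zammad command wrapper, and since each call boots Rails, a 10-second interval is gentler than a 1-second one):

# print the current Delayed::Job backlog every 10 seconds
watch -n 10 "zammad run rails r 'puts Delayed::Job.count'"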

Obviously something is blocking the scheduler process, but I don't know what or how to fix it.

System Specs:
8 core CPU (virtualized)
16 GB RAM (6 GB used)
Load average 1.01
No swapping

Expected behavior:

  • Email is sent out to the customer on ticket answer and/or as the automatic initial email

Actual behavior:

  • Emails are sent out with a delay of several hours

Steps to reproduce the behavior:

  • Answer a ticket (or write an email to receive the initial automated email)

This is something I've been battling with. Although the scheduler has 10 threads, only 1 thread works on delayed_jobs, and some of those delayed_jobs take 10 seconds to process. So 1700 delayed jobs, with no extras being added, might take half an hour to clear (at an average of roughly 1 second per job, 1700 jobs is about 28 minutes).

I've been documenting the things I've tried here:
[SOLVED] Zammad is hammering elasticsearch to reindex the same 10 users (it’s not really solved, I should change the subject line)

Hi chesty, thanks for your input. You really dug deep into the topic. I might try some of your changes in the linked thread, although you already said it didn't work out too well.
Your observations match mine: the delayed job queue fills up as soon as multiple agents log in to the system. This creates a backlog of unprocessed jobs which shrinks again as soon as the workday is over and agents start to leave the system.

Are your servers running in a different timezone than your desktops?

Actually, it doesn't seem to matter. I set my Docker containers to +10, reindexed Elasticsearch, and Zammad is still using Postgres for ticket searches. It still could be timezone related, though; I reported a bug and they seem to have been working on it already.

Forget the timezone question; that was a dead end.

The problem for me is that the queue is processed by a single thread and my cores are slow. I confirmed it by cranking up the CPU governor to performance, which halved the time it takes to process an API request.
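
For anyone wanting to try the same experiment, this is roughly what changing the governor looks like on a typical Linux host (a sketch; the sysfs paths are the standard cpufreq interface and cpupower may not be installed, neither is anything Zammad-specific):

# show the current governor for every core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# switch to the performance governor, either via cpupower ...
sudo cpupower frequency-set -g performance
# ... or by writing to sysfs directly
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor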

I'm upgrading the server; the new one has double the clock speed and half the cores.

Yes, it seems to be bound by a single scheduler process. I graphed my process run times from the production log, and I can clearly see how run times grow as soon as more agents log on. During quiet times all jobs are processed in about 0.3 s, but as soon as people start to use the system these times pile up and can run over 300 s at peak.
So the question remains: why does this happen with Zammad 2.5 (I didn't hear any complaints before)? Why don't the other scheduler.rb processes take over some tasks?

Each thread of the scheduler has a specific task: 10 threads, 10 specific tasks. One task is to process delayed_jobs. I noticed the delayed_jobs table has lock columns which it uses, so maybe it's possible now, or will be in the future, to run more than one scheduler? A quick search on my part didn't turn up whether it's possible.
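
For reference, assuming Zammad's delayed_jobs table follows the standard delayed_job schema (locked_at/locked_by columns), you can see which worker is currently claiming jobs with a sketch like this, using Zammad's Rails runner:

# list how many jobs each worker currently has locked (empty hash = nothing locked)
zammad run rails r 'p Delayed::Job.where.not(locked_at: nil).group(:locked_by).count'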

From what I read, they improved it in 2.4, so it was worse in 2.3.

For me, user jobs take 1.5 seconds to process and ticket jobs take 5 or longer. If you're seeing a single user job take longer to process at peak time, you might be short on cores. My jobs take about the same time to process during peak and idle, but if I get more than 2 ticket and 30 user jobs a minute, the queue grows (2 × 5 s + 30 × 1.5 s is already about 55 s of work per minute for the single thread). I get something like 200 jobs added at once at regular intervals, not sure if it's every hour or more frequent, and it takes 2 minutes to clear.

With the new CPUs, I guess ticket jobs will take about 1.5 seconds to process and user jobs about 0.5 seconds. Still, multiple threads servicing the queue would be handy.

What a constructive discussion - Kudos for that!

How many agents/users/tickets/etc. do you have in your system? There is a performance issue with a single scheduler process and a lot of simultaneous users. I can post some tweaks depending on the size of your system.

PS: @chesty Regarding the reported Timezone issue - we are already working on it :+1:

I guess this wasn’t directed at me, but I’ll give my stats anyway.

We've got 10 simultaneous agents at the moment; we might add a few other departments, maybe an extra 10, but it won't grow much beyond that for a long time.
Number of users: 3000
Number of tickets: 11000, and I think we add about 1000 a week.
ticket_articles: 30000
histories: 340000

Would you mind sharing your tweaks for that?

As far as Postgres goes, I've looked over the slow query log and I think I've got it tuned about as well as I can. I bumped up work_mem so sorts and the like all happen in memory. I saw the connection pool was 50, but the chance of 50 simultaneous queries, let alone 50 simultaneous queries all using max work_mem at the same time, is slim. A lot of queries take 2 ms and use few resources.
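
For anyone wanting to do the same, this is roughly what bumping work_mem looks like (a sketch; the 32MB value is only an illustration, and since work_mem is allocated per sort per connection, weigh it against the connection pool size and available RAM):

# check the current value, raise it cluster-wide, then reload the config
sudo -u postgres psql -c "SHOW work_mem;"
sudo -u postgres psql -c "ALTER SYSTEM SET work_mem = '32MB';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"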

The updates to the updated_at column on various tables often take 500 ms. I think that's because it's updated so often by a lot of threads at the same time that there's a queue waiting for a lock, not because the UPDATE's WHERE-clause plan is inefficient. That might be something you could look to optimise; I don't think, for example, that the organisations table's updated_at needs to be updated 10 times a second.
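
One way to check whether it really is lock contention is to look at what active sessions are waiting on while the system is busy (a sketch; pg_stat_activity exposes wait_event info on PostgreSQL 9.6+, and the database name here is an assumption, so adjust it to your install):

# summarise what active backends are currently waiting on
sudo -u postgres psql zammad_production -c \
  "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state = 'active' GROUP BY 1, 2 ORDER BY 3 DESC;"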

I haven't done any profiling, so I'm guessing the reason ticket jobs take many seconds to complete is that Ruby is doing a lot of processing rather than a single slow query; or maybe there are hundreds of really quick queries for one ticket job, run one after the other, and the time adds up.

We have 45 agents in the system, of which about 20 are online during the day.
User.count = 10415
Ticket.count = 27677

Would love to hear about those performance tweaks.

We were able to narrow the problem down with the help of the fantastic Zammad support!
As described in Issue #2108, the scheduler gets blocked by recreating the overviews for all agents as soon as they hit a critical limit (about 30 active agents, each having 30 or more overviews).

To circumvent the problem one can disable the generation of overviews in the main scheduler:
systemctl stop zammad-worker
zammad run rails r 'Scheduler.find_by(method: "Sessions.jobs").update(active: false)'
systemctl start zammad-worker

Afterwards it is possible to start a separate job to recreate the overviews; this way the process doesn't interfere with the regular job queue. Best done in a screen/tmux session to keep it running.
zammad run rails r 'Sessions.jobs'

With this, my delayed job count went down to 0 in a few minutes!
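
For completeness, the scheduler entry disabled above can later be re-enabled by mirroring that command with active set back to true (a sketch; whether and when to turn it back on depends on whether the overview load described in Issue #2108 is still a problem for you):

zammad run rails r 'Scheduler.find_by(method: "Sessions.jobs").update(active: true)'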

In my case it was the heap size and max heap size of the JVM. The default values were far below my server's RAM, and that was killing Elasticsearch. As soon as I set them to realistic values, Zammad cleared the queue. The options you are looking for are in:

/etc/elasticsearch/jvm.options
-Xms2g 
-Xmx2g

According to the Elasticsearch documentation, these values should be equal; they should never be greater than half your memory, and never greater than 32 GB, or performance won't be optimal. See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html.
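
To confirm the new limits were actually picked up after restarting Elasticsearch, something like this works (a sketch; it assumes Elasticsearch is listening on localhost:9200):

# show current, percent-used and max heap per node via the _cat API
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max'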

Hi, thanks for the info. We ran into the same situation over the last few days, and this helped us get the system back to normal. But: the "zammad run rails r 'Sessions.jobs'" command has been running for minutes now and doesn't seem to come back. Can someone tell me what it does in the background, and whether we should set active back to true to let the scheduler kick in again?

Should we open an issue to let the devs have a deeper look (as it seems not to be solved as described in Issue #2108)?

For info: disabling several overviews containing many tickets solved the issue as well.
