Big issue with background jobs in queue

  • Used Zammad version: 4.0.x
  • Used Zammad installation type: (source, package, docker-compose, …) : package on Ubuntu 20.04
  • Operating system: macOS Catalina 10.15.7
  • Browser + version: Google Chrome 117.0.5938.149
  • Number of staff using Zammad: 40
  • Number of emails received and sent per day: 300 to 400

Expected behavior:

  • Background jobs that concern emails should be processed within approximately 10 to 30 minutes in Zammad

Actual behavior:

  • We are facing a large number of background jobs in the queue; they have been processed very slowly for the last two days, and emails are no longer being sent

Steps to reproduce the behavior:

  • Update multiple articles in different tickets within a very short period of time. This creates a number of background jobs.
    Send emails via Zammad.
    Result: emails do not go out until the other background jobs are completed.

Is there somewhere we can see the list of queued background jobs (and with which command)? Is there a command to prioritize the email-sending jobs? And can we do anything to make background job processing faster?

Thanks for your help in solving this matter!

Kind regards.

Zammad 4 is several years old.
Before anything else, please upgrade to Zammad 6.1 (current stable), via Zammad 5.0 and 6.0.

Make sure to also read the release notes for relevant changes etc.

Thanks MrGeneration for your answer.
Yes, we installed it 2 years ago.
We are going to perform these updates during non-working days at the end of this week, but is there any chance of solving our problem, or of getting information about prioritizing the email-sending background jobs, before that?

Anything that would help you is not available in your version. Keeping Zammad up to date would help you rule out these issues in the future.

Hi MrGeneration, we have finally completed this update last night. We now have this configuration:

  • Used Zammad version: 6.1
  • Used Zammad installation type: (source, package, docker-compose, …) : package on Ubuntu 20.04
  • Operating system: macOS Catalina 10.15.7
  • Browser + version: Google Chrome 117.0.5938.149
  • Number of staff using Zammad: 40
  • Number of emails received and sent per day: 300 to 400

We are facing these issues.
The monitoring system is now saying:

  • Channel: Email::Account in is active but not fetched for about 7 hours - We have tested the email connection and everything works fine. Zammad says that it found x emails and that it will download them into Zammad, but nothing happens.
  • We still have 14532 background jobs in the queue, and they are being processed very slowly.
  • scheduler may not run (last execution of Stats.generate 20 minutes ago) - please contact your system administrator

Can we get a list of these background jobs and prioritize the email jobs? (Sending and receiving emails is the most important thing we need to solve.)

Thanks for your help.

You will probably want to look into performance tuning of your instance. You might want to (temporarily?) raise the hardware capacity of your machine/VM to process the huge backlog of jobs.

Read the Hardware and Configuration via Environment Variables chapters of the Zammad System Documentation first; you will probably want to focus on tuning WEB_CONCURRENCY, ZAMMAD_SESSION_JOBS_CONCURRENT and especially ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS.
Reconfigure these settings and allow Zammad to process the jobs.
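For a package installation, one way to apply these is via the zammad wrapper command that packager.io-based packages usually provide. A rough sketch only; the values below are placeholders, and the linked documentation remains the authoritative reference for how (and how much) to tune:

zammad config:set WEB_CONCURRENCY=4
zammad config:set ZAMMAD_SESSION_JOBS_CONCURRENT=4
zammad config:set ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS=4
systemctl restart zammad

Size the numbers to your CPU count as described in the documentation rather than copying them verbatim.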

Keep an eye on the performance of the machine and check if you are experiencing bottlenecks other than Zammad itself (e.g. disk latency, CPU/memory usage).
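If you want a rough view of what is actually sitting in the backlog, you can inspect the delayed jobs table from a Rails console. A minimal sketch, assuming a package install (where zammad run rails c opens the console) and Zammad's Delayed::Job-backed queue:

zammad run rails c

# inside the console:
Delayed::Job.count                             # total number of queued jobs
Delayed::Job.group(:queue).count               # jobs per queue
Delayed::Job.where.not(last_error: nil).count  # jobs that have already failed at least once
Delayed::Job.order(:created_at).first          # the oldest job still waiting

As far as I know there is no officially documented command to bump individual email jobs to the front of the queue, so treat this as read-only inspection.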

Hi dvanzuijlekom,
Thanks for your answer!
We have already changed those environment variables:
ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS is set to 16
WEB_CONCURRENCY is set to 8
ZAMMAD_SESSION_JOBS_CONCURRENT is set to 8

Our server has 20 cores and 80 GB of RAM. We haven't seen any performance problem on the server; CPU usage is very low (6%).

Any other ideas? We would just like to know whether we can list those jobs and whether we can do anything to prioritize our emails, because some of them have been waiting for 2 days…

Those are some big numbers for just 20 cores.
You're theoretically overloading the host, because your performance tuning alone requires 32 cores (16 + 8 + 8) if those processes ever run at full 100% usage.

Double-check your database server configuration; max_connections in particular is crucial in this scenario.
See: Configure Database Server — Zammad System Documentation

Also note that if Zammad claims it hasn't fetched for 7+ hours, then your background worker is most likely dead for whatever reason. Double-check that, restart zammad-worker, and keep an eye on it.
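A rough sketch of both checks (the unit name assumes a standard package install; the psql commands are generic PostgreSQL):

# Is the background worker actually alive, and what has it logged recently?
systemctl status zammad-worker.service
journalctl -u zammad-worker.service --since "1 hour ago"

# Compare the configured connection limit with the current usage
sudo -u postgres psql -c "SHOW max_connections;"
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"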


OK, thanks. We have lowered the numbers like this:
ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS is set to 8
WEB_CONCURRENCY is set to 6
ZAMMAD_SESSION_JOBS_CONCURRENT is set to 6

We have also restarted the background worker. The message disappeared from the Monitoring tab, but emails are still not coming into Zammad.

Concerning the PostgreSQL database, the config file shows that max_connections is set to 300. Do I have to increase that?

Absolutely, as stated in the documentation.

max_connections is now set to 900

The documentation states 2000.
You need 50 connections per process by default.

Don’t forget systemctl restart postgresql zammad after adjusting.
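For reference, a sketch of that change on Ubuntu 20.04 (the path assumes the default PostgreSQL 12 cluster; adjust it to your installed version):

# /etc/postgresql/12/main/postgresql.conf
max_connections = 2000

# apply and verify
systemctl restart postgresql zammad
sudo -u postgres psql -c "SHOW max_connections;"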

OK, max_connections is now set to 2000.
I have indeed run systemctl restart postgresql zammad after adjusting.

Just to report that emails are still not coming into Zammad and the number of background jobs keeps increasing.

Just to share the latest information from production.log:
I, [2023-10-12T00:04:30.671629 #290992-47137931778480] INFO -- : Setting.set('models_searchable', ["Chat::Session", "Ticket", "KnowledgeBase::Answer::Translation", "Organization", "User"])
I, [2023-10-12T00:04:30.690980 #291003-47311419947420] INFO -- : Setting.set('models_searchable', ["Chat::Session", "Ticket", "KnowledgeBase::Answer::Translation", "Organization", "User"])
I, [2023-10-12T00:04:30.699921 #291009-46997840492960] INFO -- : Setting.set('models_searchable', ["Chat::Session", "Ticket", "KnowledgeBase::Answer::Translation", "Organization", "User"])
I, [2023-10-12T00:04:31.886557 #291009-46997840492960] INFO -- : Scheduler started.
I, [2023-10-12T00:04:31.899362 #291009-46997840492960] INFO -- : Cleanup of left over locked delayed jobs 2023-10-12 00:04:31 UTC started.
I, [2023-10-12T00:04:31.929688 #291009-46997840492960] INFO -- : Checking left over delayed job #<Delayed::Backend::ActiveRecord::Job id: 36045563, priority: 200, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...">
W, [2023-10-12T00:04:31.934816 #291009-46997840492960] WARN -- : Destroyed locked delayed job: TransactionJob [ee8235bd-394e-4463-bf2e-7d24e0a73552] from DelayedJob(default) with arguments: [{"object"=>"Ticket", "object_id"=>36288, "user_id>
I, [2023-10-12T00:04:31.934858 #291009-46997840492960] INFO -- : Checking left over delayed job #<Delayed::Backend::ActiveRecord::Job id: 36045563, priority: 200, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...">
I, [2023-10-12T00:04:31.934874 #291009-46997840492960] INFO -- : Cleanup of left over locked delayed jobs 2023-10-12 00:04:31 UTC finished.
I, [2023-10-12T00:04:31.934899 #291009-46997840492960] INFO -- : Cleanup of left over import jobs 2023-10-12 00:04:31 UTC started.
I, [2023-10-12T00:04:31.935954 #291009-46997840492960] INFO -- : Cleanup of left over import jobs 2023-10-12 00:04:31 UTC finished.
I, [2023-10-12T00:04:31.936091 #291009-46997840492960] INFO -- : Scheduler running...
I, [2023-10-12T00:04:31.962174 #291009-46997905439220] INFO -- : execute Ticket.process_pending (try_count 0)...
I, [2023-10-12T00:04:31.971246 #291009-46997905439220] INFO -- : ended Ticket.process_pending took: 0.023433465 seconds.
I, [2023-10-12T00:04:32.097855 #291118-47138006120060] INFO -- : Started GET "/api/v1/monitoring/health_check?_=1697067711728" for 91.182.234.172 at 2023-10-12 00:04:32 +0000
I, [2023-10-12T00:04:32.126517 #291118-47138006120060] INFO -- : Processing by MonitoringController#health_check as JSON
I, [2023-10-12T00:04:32.126575 #291118-47138006120060] INFO -- :   Parameters: {"_"=>"1697067711728"}

It seems that nothing has happened in the system since 4:30 AM.

I'm sorry, but I'm currently unable to help you within a reasonable time frame due to customer requests.
If you need commercial-grade support, you may want to consider a support contract for your process-critical installation.

Sorry that I have no better news for you.


You should really be seeing lots of continuous activity in your production.log, as background jobs should be processed and lots of schedules run automatically. If they don’t, your background worker probably isn’t running, as MrGeneration stated earlier.

Maybe start by checking the status of individual Zammad components: systemctl status zammad*

Everything should really be in the green.

Check all related and dependent services: systemctl status elasticsearch postgresql redis

And maybe check Apache or nginx as well, depending on what you're running.
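For example (unit names assumed for Ubuntu; only the one you actually use will exist):

systemctl status apache2
systemctl status nginx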

This is a (short) example of an active production.log:

root@zammad:~# tail -f /var/log/zammad/production.log
I, [2023-10-12T14:56:24.952606#14867-64680]  INFO -- : ProcessScheduledJobs running...
I, [2023-10-12T14:56:24.954140#14867-64680]  INFO -- : Running job thread for 'Generate 'Session' data.' (Sessions.jobs) status is: sleep
I, [2023-10-12T14:56:24.954337#14867-64680]  INFO -- : Running job thread for 'Check channels.' (Channel.fetch) status is: sleep
I, [2023-10-12T14:56:24.954461#14867-64680]  INFO -- : Running job thread for 'Check 'Channel' streams.' (Channel.stream) status is: sleep
I, [2023-10-12T14:56:24.954583#14867-64680]  INFO -- : Running job thread for 'Process ticket escalations.' (Ticket.process_escalation) status is: sleep
I, [2023-10-12T14:56:24.954979#14867-64680]  INFO -- : Running job thread for 'Execute planned jobs.' (Job.run) status is: sleep

If you're not seeing any activity here, even after restarting Zammad, maybe check whether your logging partition/disk is full.
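For example (paths assume a package install):

df -h /var/log /opt/zammad    # free disk space on the log and Zammad partitions
df -i /var/log                # free inodes, which can also run out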

Check for errors up to the point where things started going wrong, e.g. grep ERROR production.log, or in one of the rotated log files. Maybe you'll find clues.
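A quick sketch (file names are assumptions; adjust to what logrotate actually produced on your system):

cd /var/log/zammad
grep -c ERROR production.log              # how many errors are in the current log
grep ERROR production.log | tail -n 20    # the most recent ones
zgrep ERROR production.log.*.gz           # rotated logs, if they were compressed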

Maybe also check your PostgreSQL logs for information. If you're posting logs here for us to analyze, remember that they can contain sensitive information. Also, please use the formatting tools to make them more readable (use preformatted text etc.).
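The PostgreSQL log location is distribution-specific; on Ubuntu 20.04 it is typically something like (version number assumed):

tail -n 100 /var/log/postgresql/postgresql-12-main.log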

Thanks!
I have run systemctl status zammad*: everything is green.
I have also checked all related dependent services with systemctl status elasticsearch postgresql redis, and everything is green too.

We are using Apache and it is working fine, as we are logging in to the interface successfully.

Any other command to see where the system is blocked?

Here is the output of the systemctl status elasticsearch postgresql redis command

Here is the output of the systemctl status zammad* command