Big issue with background jobs in queue

That’s weird, the screenshots of the services appear to be fine. The only odd thing was the error message regarding the Redis PID file not being writable. You should probably look into that.

You might want to apply the missing OS updates and perform the release upgrade.

You may need to delve deeper and check your machine utilization, use htop and top or iotop and iostat to see if a specific process is struggling, or if your I/O is causing bottlenecks. The system information lines on your first screenshot reveal some health information, which doesn’t really look out of the ordinary other than it should be really busy processing the background jobs now, but it doesn’t look very busy at all, so that’s a bit odd.

You should check your logfiles too, especially the PostgreSQL database log, but also Elasticsearch. It is just very weird that your production.log isn’t being written to, could you double-check it? Maybe also check permissions on the file to confirm Zammad can actually write to it? Then check the journald logging, e.g. journalctl -u zammad* -S 2023-10-01 for all logging since October 1st, maybe that shows problems in the past?

If you want to see what your database is doing, issue this command from an interactive client, which should show you running queries and connected clients (probably quite a lot more than in this example):

SELECT * FROM pg_stat_activity;
zammad=# SELECT * FROM pg_stat_activity;
 datid | datname |  pid  | leader_pid | usesysid | usename  |      application_name       | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |     wait_event      | state  | backend_xid | backend_xmin | query_id |                                                                                                                                                                                                           query                                                                                                                                                                                                           |         backend_type         
-------+---------+-------+------------+----------+----------+-----------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+--------+-------------+--------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------
       |         |   761 |            |          |          |                             |             |                 |             | 2023-10-12 03:14:23.632241+02 |                               |                               |                               | Activity        | AutoVacuumMain      |        |             |              |          |                                                                                                                                                                                                                                                                                                                                                                                                                           | autovacuum launcher
       |         |   762 |            |       10 | postgres |                             |             |                 |             | 2023-10-12 03:14:23.633228+02 |                               |                               |                               | Activity        | LogicalLauncherMain |        |             |              |          |                                                                                                                                                                                                                                                                                                                                                                                                                           | logical replication launcher
 16389 | zammad  | 15323 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:01:00.913412+02 |                               | 2023-10-12 15:01:00.93375+02  | 2023-10-12 15:01:00.937348+02 | Client          | ClientRead          | idle   |             |              |          | COMMIT                                                                                                                                                                                                                                                                                                                                                                                                                    | client backend
 16389 | zammad  | 15324 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:01:00.917678+02 |                               | 2023-10-12 18:38:44.977045+02 | 2023-10-12 18:38:44.977106+02 | Client          | ClientRead          | idle   |             |              |          | SELECT "schedulers".* FROM "schedulers" WHERE "schedulers"."active" = $1 ORDER BY "schedulers"."prio" ASC                                                                                                                                                                                                                                                                                                                 | client backend
 16389 | zammad  | 15338 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:01:01.932254+02 |                               | 2023-10-12 15:01:01.945388+02 | 2023-10-12 15:01:01.945406+02 | Client          | ClientRead          | idle   |             |              |          | SELECT "channels".* FROM "channels" WHERE (active = TRUE AND area LIKE '%::Account')                                                                                                                                                                                                                                                                                                                                      | client backend
 16389 | zammad  | 15327 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:01:00.934986+02 |                               | 2023-10-12 18:38:49.408377+02 | 2023-10-12 18:38:49.408416+02 | Client          | ClientRead          | idle   |             |              |          | UPDATE "delayed_jobs" SET locked_at = '2023-10-12 16:38:49.407870', locked_by = 'host:zammad pid:15231' WHERE id IN (SELECT "delayed_jobs"."id" FROM "delayed_jobs" WHERE (((run_at <= '2023-10-12 16:38:49.407571' AND (locked_at IS NULL OR locked_at < '2023-10-05 16:38:49.407582')) OR locked_by = 'host:zammad pid:15231') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING * | client backend
 16389 | zammad  | 15350 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:01:12.935037+02 |                               | 2023-10-12 18:38:44.358914+02 | 2023-10-12 18:38:44.358933+02 | Client          | ClientRead          | idle   |             |              |          | SELECT "schedulers".* FROM "schedulers" WHERE "schedulers"."id" = $1 LIMIT $2                                                                                                                                                                                                                                                                                                                                             | client backend
 16389 | zammad  | 15376 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:03:03.985514+02 |                               | 2023-10-12 18:34:09.406346+02 | 2023-10-12 18:34:09.406364+02 | Client          | ClientRead          | idle   |             |              |          | SELECT "schedulers".* FROM "schedulers" WHERE "schedulers"."id" = $1 LIMIT $2                                                                                                                                                                                                                                                                                                                                             | client backend
 16389 | zammad  | 15405 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 15:04:09.020518+02 |                               | 2023-10-12 18:34:10.328269+02 | 2023-10-12 18:34:10.328287+02 | Client          | ClientRead          | idle   |             |              |          | SELECT "schedulers".* FROM "schedulers" WHERE "schedulers"."id" = $1 LIMIT $2                                                                                                                                                                                                                                                                                                                                             | client backend
 16389 | zammad  | 18632 |            |    16388 | zammad   | script/background-worker.rb |             |                 |          -1 | 2023-10-12 18:11:47.323089+02 |                               | 2023-10-12 18:35:13.931078+02 | 2023-10-12 18:35:13.934793+02 | Client          | ClientRead          | idle   |             |              |          | COMMIT                                                                                                                                                                                                                                                                                                                                                                                                                    | client backend
 16389 | zammad  | 19149 |            |       10 | postgres | psql                        |             |                 |          -1 | 2023-10-12 18:38:32.28725+02  | 2023-10-12 18:38:50.831266+02 | 2023-10-12 18:38:50.831266+02 | 2023-10-12 18:38:50.831268+02 |                 |                     | active |             |      1059423 |          | SELECT * FROM pg_stat_activity;                                                                                                                                                                                                                                                                                                                                                                                           | client backend
       |         |   754 |            |          |          |                             |             |                 |             | 2023-10-12 03:14:23.610342+02 |                               |                               |                               | Activity        | BgWriterHibernate   |        |             |              |          |                                                                                                                                                                                                                                                                                                                                                                                                                           | background writer
       |         |   753 |            |          |          |                             |             |                 |             | 2023-10-12 03:14:23.610605+02 |                               |                               |                               | Activity        | CheckpointerMain    |        |             |              |          |                                                                                                                                                                                                                                                                                                                                                                                                                           | checkpointer
       |         |   760 |            |          |          |                             |             |                 |             | 2023-10-12 03:14:23.632493+02 |                               |                               |                               | Activity        | WalWriterMain       |        |             |              |          |                                                                                                                                                                                                                                                                                                                                                                                                                           | walwriter
(14 rows)

Thanks for your help.
I looked for all the logs.
I can see that production.log is now showing something.
I also run the queries for postgresql, but it seems that a very low number of queries are running


Here are also the last output lines in production.log :

I, [2023-10-12T21:52:59.791286#327445-65080]  INFO -- : Completed 401 Unauthorized in 6ms (Views: 0.1ms | ActiveRecord: 1.2ms | Allocations: 2423)
I, [2023-10-12T21:53:01.763340#439657-5560]  INFO -- : Cleanup of left over locked delayed jobs 2023-10-12 21:53:01 UTC started.
I, [2023-10-12T21:53:01.801442#439657-5560]  INFO -- : Checking left over delayed job #<Delayed::Backend::ActiveRecord::Job id: 36051182, priority: 200, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...", last_error: nil, run_at: "2023-10-10 20:04:25.167000000 +0000", locked_at: "2023-10-12 01:54:45.199000000 +0000", failed_at: nil, locked_by: "host:TeamCom pid:307644", queue: "default", created_at: "2023-10-10 20:04:15.173000000 +0000", updated_at: "2023-10-10 20:04:15.173000000 +0000"> started.
I, [2023-10-12T21:53:03.635189 #439707]  INFO -- : ActionCable is using the redis instance at redis://localhost:6379.
I, [2023-10-12T21:53:03.648835#439707-5560]  INFO -- : Using Zammad's file store as Rails cache store.
I, [2023-10-12T21:53:03.648926#439707-5560]  INFO -- : Using the File back end for Zammad's web socket session store.
I, [2023-10-12T21:53:05.242116#327445-65140]  INFO -- : Started POST "/api/v1/message_send" for 86.193.86.63 at 2023-10-12 21:53:05 +0000
I, [2023-10-12T21:53:05.246970#327445-65140]  INFO -- : Processing by LongPollingController#message_send as JSON
I, [2023-10-12T21:53:05.247032#327445-65140]  INFO -- :   Parameters: {"data"=>{"event"=>"login"}}
I, [2023-10-12T21:53:05.251896#327445-65140]  INFO -- : CSRF token verification failed
I, [2023-10-12T21:53:05.252025#327445-65140]  INFO -- : CSRF token verification failed! (Exceptions::NotAuthorized)
app/controllers/application_controller/prevents_csrf.rb:36:in `verify_csrf_token'
app/controllers/application_controller/has_download.rb:17:in `block (4 levels) in <module:HasDownload>'
app/controllers/application_controller/has_download.rb:16:in `block (3 levels) in <module:HasDownload>'
app/controllers/application_controller/has_download.rb:15:in `block (2 levels) in <module:HasDownload>'
app/controllers/application_controller/handles_transitions.rb:16:in `handle_transaction'
I, [2023-10-12T21:53:05.253108#327445-65140]  INFO -- : Completed 401 Unauthorized in 6ms (Views: 0.2ms | ActiveRecord: 1.3ms | Allocations: 2423)
I, [2023-10-12T21:53:07.147418#439707-5560]  INFO -- : Cleanup of left over locked delayed jobs 2023-10-12 21:53:07 UTC started.
I, [2023-10-12T21:53:07.148222#439707-64680]  INFO -- : ProcessScheduledJobs running...
I, [2023-10-12T21:53:07.182122#439707-5560]  INFO -- : Checking left over delayed job #<Delayed::Backend::ActiveRecord::Job id: 36051182, priority: 200, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...", last_error: nil, run_at: "2023-10-10 20:04:25.167000000 +0000", locked_at: "2023-10-12 01:54:45.199000000 +0000", failed_at: nil, locked_by: "host:TeamCom pid:307644", queue: "default", created_at: "2023-10-10 20:04:15.173000000 +0000", updated_at: "2023-10-10 20:04:15.173000000 +0000"> started.
I, [2023-10-12T21:53:09.097999#327445-65200]  INFO -- : Started GET "/api/v1/monitoring/health_check?_=1697140031019" for 91.182.234.172 at 2023-10-12 21:53:09 +0000
I, [2023-10-12T21:53:09.102529#327445-65200]  INFO -- : Processing by MonitoringController#health_check as JSON
I, [2023-10-12T21:53:09.102605#327445-65200]  INFO -- :   Parameters: {"_"=>"1697140031019"}
I, [2023-10-12T21:53:09.136975#327445-65200]  INFO -- : Completed 200 OK in 34ms (Views: 0.3ms | ActiveRecord: 16.6ms | Allocations: 11001)
I, [2023-10-12T21:53:09.179487 #439881]  INFO -- : ActionCable is using the redis instance at redis://localhost:6379.
I, [2023-10-12T21:53:09.192780#439881-5560]  INFO -- : Using Zammad's file store as Rails cache store.
I, [2023-10-12T21:53:09.192868#439881-5560]  INFO -- : Using the File back end for Zammad's web socket session store.
I, [2023-10-12T21:53:10.790346#327445-64660]  INFO -- : Started POST "/api/v1/message_send" for 141.255.130.16 at 2023-10-12 21:53:10 +0000
I, [2023-10-12T21:53:10.795293#327445-64660]  INFO -- : Processing by LongPollingController#message_send as JSON
I, [2023-10-12T21:53:10.795338#327445-64660]  INFO -- :   Parameters: {"data"=>{"event"=>"login"}}
I, [2023-10-12T21:53:10.800312#327445-64660]  INFO -- : CSRF token verification failed
I, [2023-10-12T21:53:10.800475#327445-64660]  INFO -- : CSRF token verification failed! (Exceptions::NotAuthorized)
app/controllers/application_controller/prevents_csrf.rb:36:in `verify_csrf_token'
app/controllers/application_controller/has_download.rb:17:in `block (4 levels) in <module:HasDownload>'
app/controllers/application_controller/has_download.rb:16:in `block (3 levels) in <module:HasDownload>'
app/controllers/application_controller/has_download.rb:15:in `block (2 levels) in <module:HasDownload>'
app/controllers/application_controller/handles_transitions.rb:16:in `handle_transaction'
I, [2023-10-12T21:53:10.801440#327445-64660]  INFO -- : Completed 401 Unauthorized in 6ms (Views: 0.1ms | ActiveRecord: 1.4ms | Allocations: 2423)
I, [2023-10-12T21:53:12.554668#439881-5560]  INFO -- : Cleanup of left over locked delayed jobs 2023-10-12 21:53:12 UTC started.
I, [2023-10-12T21:53:12.555381#439881-64680]  INFO -- : ProcessScheduledJobs running...
I, [2023-10-12T21:53:12.589819#439881-5560]  INFO -- : Checking left over delayed job #<Delayed::Backend::ActiveRecord::Job id: 36051182, priority: 200, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...", last_error: nil, run_at: "2023-10-10 20:04:25.167000000 +0000", locked_at: "2023-10-12 01:54:45.199000000 +0000", failed_at: nil, locked_by: "host:TeamCom pid:307644", queue: "default", created_at: "2023-10-10 20:04:15.173000000 +0000", updated_at: "2023-10-10 20:04:15.173000000 +0000"> started.

Here is also the htop screen :

Okay, so a couple of things: you have 20 CPU cores, of which 19 are basically idle and only one is currently utilized. This matches up with the system information in one of your previous screenshots, showing a system load of 1.01. The entire machine, I’m guessing a VM on the Microsoft Azure Cloud, is basically doing nothing, apart from the one background worker process. This could mean either the background worker tuning isn’t working (I would have expected more active workers processes), or the background workers are hindered by a locked job preventing them from processing any delayed jobs. The latter appears to be the case looking at the production.log output.

Regarding the possible locked jobs: there are rails console commands for getting this output. I prefer getting it via the database, because I am more comfortable with it. You will want to check what your delayed jobs currently are and which one is possibly blocking the rest from running, which I think the production.log hints at in the Checking left over delayed job... output.

From the rails console, you can run the following commands, in my instance, Delayed::Job.count is zero because I don’t have delayed jobs, Delayed::Job.first is also nil because of that. You might have quite a high number by now and if you run the command a couple of times, it’ll likely be only increasing:

root@zammad:~# zammad run rails c
Loading production environment (Rails 6.1.7.6)
irb(main):001:0> Delayed::Job.count
=> 0
irb(main):002:0> Delayed::Job.first
=> nil
irb(main):003:0>

Alternatively, you can run these commands without invoking an interactive Rails console:

root@zammad:~# zammad run rails r 'p Delayed::Job.count'
0                                                                               
root@zammad:~# zammad run rails r 'p Delayed::Job.first'                   
nil                                                                             

You can also query it from the zammad database, the first query will show a count of current delayed job records, the second will show all details of the first 10 rows. I think I should warn you that the folks at Zammad don’t really like it (and for good reason) when you query the database directly:

SELECT COUNT(*) FROM delayed_jobs;
SELECT * FROM delayed_jobs ORDER BY created_at LIMIT 10;

It will be interesting to know what the blocking job(s) is/are.

I think the CSRF messages are a (literal) red herring due to someone trying to log in to the Zammad webinterface and failing (401 unauthorized), And I think you can safely ignore those messages for now.

The currently running queries in the database don’t really show a problem, but that is only a snapshot at a certain point in time.

Yes, you are right.
I already run these commands yesterday.

Delayed::Job.count is increasing. We are now at 15812

Delayed::Job.first give me yesterday the following output :

Last login: Thu Oct 12 20:48:51 2023 from 91.182.234.172
administrateur@TeamCom:~$ sudo zammad run rails r "p Delayed::Job.first"
#<Delayed::Backend::ActiveRecord::Job id: 36015761, priority: 300, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...", last_error: nil, run_at: "2023-10-09 08:06:34.026000000 +0000", locked_at: nil, failed_at: nil, locked_by: nil, queue: "default", created_at: "2023-10-09 08:06:34.026000000 +0000", updated_at: "2023-10-09 08:06:34.026000000 +0000">

Delayed::Job.first give me today the following output :

#<Delayed::Backend::ActiveRecord::Job id: 36015761, priority: 300, attempts: 0, handler: "--- !ruby/object:ActiveJob::QueueAdapters::Delayed...", last_error: nil, run_at: "2023-10-09 08:06:34.026000000 +0000", locked_at: nil, failed_at: nil, locked_by: nil, queue: "default", created_at: "2023-10-09 08:06:34.026000000 +0000", updated_at: "2023-10-09 08:06:34.026000000 +0000">

It seem that this job is blocking everythings. How can I delete it without deleting the other jobs? Or Do I have another solution?

You can remove individual jobs, but I don’t know if this is the best possible solution in this case (I don’t know what the QueueAdapters job does in your case). You could try the following console command to remove only the job which appears to be blocking the rest:

Not completely sure if this will be safe for you
root@zammad:~# zammad run rails c
Loading production environment (Rails 6.1.7.6)                                  
irb(main):001:0> Delayed::Job.find(36015761).destroy

You might want to get all the job details first, as I don’t know if the console command shows all information (I think it cuts off the details):

SELECT * FROM delayed_jobs WHERE id = 36015761 ;

Also be warned that removing the job could mean the system becomes very active in processing all remaining jobs and the load increases severely. It could also mean it’ll hit a later similar job in the queue which will start blocking again. You’ll have to check this regularly if the queue is still being processed and the number of delayed jobs keeps actively decreasing.

I’m getting back to maxa in his ticket right now.

Took us some time to find the issue.
It was caused by a delayed job like Dennis guessed correctly but actually wasn’t the first delayed job in list. Due to Updating from Zammad 4 to 6 (with steps) the instance might not have been able to work on all delayed jobs in time which caused the background-worker process to die due to no longer existing code commands.

This was only visible when looking in the systems syslog which provided the name of the command being at fault.

We then identified the jobs by running Delayed::Job.where("handler LIKE '%TicketCreateScreenJob%').ids in a rails console and were forced to delete the returned ids directly via psql².

Instance is now back to purring.

² This is not suggested at any time. In this case it wasn’t possible in another way.

1 Like

dvanzuijlekom, Just want to thank you very much for your help. MrGeneration worked on this and we finally find the issue! Problem is now solved.
Best,

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.