[SOLVED] High number of delayed_jobs

  • Used Zammad version: 2.5
  • Used Zammad installation source: zammad-docker-compose

I don’t know if it’s related to the problems we had yesterday with deadlocks, but for the last 12 hours Zammad has been hammering Elasticsearch to reindex the same 10-odd users.

The index was broken for a lot of yesterday and we had 5 ‘BackgroundJobSearchIndex’ jobs.
I upgraded to 2.5 and since then it’s been constantly reindexing the same 10-odd users.

I’ve run rake searchindex:rebuild and that finished successfully in about 2,500 seconds.
For the last 8 hours,

Delayed::Job.all.each { |job|
  Delayed::Worker.new.run(job)
}

has been running, and the BackgroundJobSearchIndex count is down to 3.

I’m hoping it will fix itself, but if you know how to fix it quickly that would be sweet.

oh wow,

select count(*) from delayed_jobs;

 count
--------
 647899
(1 row)
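To see what’s actually filling the queue, grouping on the first line of the YAML in the handler column should work (a rough sketch; the handler text starts with the job class, at least for the jobs I’ve been seeing):

-- count queued jobs per job class (first line of the YAML handler)
select split_part(handler, E'\n', 1) as job_class, count(*) as jobs
from delayed_jobs
group by 1
order by 2 desc;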

I worked out some of the puzzle. I was running a cron job every 5 minutes that went through the IMAP inbox and moved emails into a folder once they had been assigned. There were about 40 old emails in there that aren’t in Zammad, so every 5 minutes that’s a minimum of 40 Elasticsearch searches for message IDs, plus the new emails that arrive, though those ones are eventually moved out of the inbox.

That explained why it was hammering the same 10 users, I think. I need to look at Elasticsearch to see if I can tune it.

I’m pretty sure it’s the scheduler that’s the bottleneck. Elasticsearch is returning queries in 100 ms, while the scheduler is pegging the CPU. It seems to be barely keeping up with 10 agents even without the cron job.

I’m going to use CPU affinity and bump up the priority of the Ruby scheduler, and we’re going to replace the server (which has 2.2 GHz CPUs) with one with 3.4 GHz CPUs.

Man, I really hammered out the messages over the last few days, but it was stressful not having a handle on what was going on, and slowing down and typing it all out really helps.

So everything is back to normal and I’ve had some time away from thinking about Zammad to let all the new information I’ve taken in over the last few days process in the background.

My last thought was that the scheduler was CPU-bound and couldn’t keep up, but thinking about it now: it had 10 threads, yet the process never got much above 103% CPU utilisation. I would have thought that if it was CPU-bound, with 10 threads all running at 100%, the process would show 1000%? Is that how threads work?

Taking that into account and thinking back to the Postgres logs, I think the bottleneck is the database: so many updates to (e.g.) the users table to update last_login. Looking at the logs after everything settled down, this query turns up a few times a minute in the slow log, which logs all queries that take longer than 50 ms:

2018-06-13 00:12:05 UTC [2974]: [183-1] user=postgres,db=zammad_production,app=script/scheduler.rb,client=172.24.0.9 LOG:  duration: 51.681 ms  execute <unnamed>: SELECT  "activity_streams".* FROM "activity_streams" WHERE ((permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id is NULL) OR (permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id IN (1,2,3,6)) OR (permission_id is NULL AND group_id IN (1,2,3,6))) ORDER BY created_at DESC, id DESC LIMIT $1

Running EXPLAIN ANALYZE shows:

 Limit  (cost=5437.01..5437.13 rows=50 width=44) (actual time=56.658..56.672 rows=50 loops=1)
   ->  Sort  (cost=5437.01..5546.49 rows=43792 width=44) (actual time=56.656..56.658 rows=50 loops=1)
         Sort Key: created_at DESC, id DESC
         Sort Method: top-N heapsort  Memory: 28kB
         ->  Seq Scan on activity_streams  (cost=0.00..3982.27 rows=43792 width=44) (actual time=0.042..46.176 rows=45111 loops=1)
                Filter: (((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id IS NULL)) OR ((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id = ANY ('{1,2,3}'::integer[]))) OR ((permission_id IS NULL) AND (group_id = ANY ('{1,2,3}'::integer[]))))
               Rows Removed by Filter: 1398
 Planning time: 0.507 ms
 Execution time: 56.731 ms

I see there’s a sequential scan, so an index might help:

create index index_activity_streams_created_at_id_desc on activity_streams (created_at desc, id desc);

Now rerunning the EXPLAIN says:

 Limit  (cost=0.29..5.40 rows=50 width=44) (actual time=0.075..0.183 rows=50 loops=1)
   ->  Index Scan using index_activity_streams_created_at_id_desc on activity_streams  (cost=0.29..4471.70 rows=43792 width=44) (actual time=0.073..0.171 rows=50 loops=1)
         Filter: (((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id IS NULL)) OR ((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id = ANY ('{1,2,3}'::integer[]))) OR ((permission_id IS NULL) AND (group_id = ANY ('{1,2,3}'::integer[]))))
 Planning time: 1.845 ms
 Execution time: 0.236 ms

Execution time dropped from 56 ms to 0.2 ms.
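To double-check the new index actually gets used by the live workload and not just by my hand-run EXPLAIN, the scan counters on the table can be watched (standard statistics view, just a sketch):

-- idx_scan should climb if the planner is picking the new index
select indexrelname, idx_scan, idx_tup_read
from pg_stat_user_indexes
where relname = 'activity_streams';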

So we’ll see how it goes tomorrow.

I also adjusted some vacuum settings, which are supposed to make it vacuum more frequently. I figured that, due to the constant updates to the users and other tables, more frequent would be better.

autovacuum_analyze_scale_factor = 0.002
autovacuum_vacuum_scale_factor = 0.001
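To check whether autovacuum is now keeping up on the hot tables, the dead-tuple counts and last autovacuum times can be watched (again the standard stats view, just a sketch):

-- tables with the most dead rows, and when autovacuum last touched them
select relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
from pg_stat_user_tables
order by n_dead_tup desc
limit 10;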

Just a small thought of mine:
If you have 10,000% CPU usage, this would mean that you have 10 cores/threads running at full CPU speed. You could also write it like 10.0% (Linux).

So yes, this would be how threads work (if I understood you correctly)

I think we are saying the same thing, but you might have an extra 0; 100 * 10 is 1000.

So, digging into the logs some more. This is from the railsserver:

I, [2018-06-13T12:44:34.419307 #19]  INFO -- : Started GET "/api/v1/users/26?full=true&_=1528844356856" for ::ffff:172.24.0.10 at 2018-06-13 12:44:34 +0000
I, [2018-06-13T12:44:34.439478 #19]  INFO -- : Processing by UsersController#show as JSON
I, [2018-06-13T12:44:34.439563 #19]  INFO -- :   Parameters: {"full"=>"true", "_"=>"1528844356856", "id"=>"26"}
I, [2018-06-13T12:44:34.631249 #19]  INFO -- : Started GET "/api/v1/users/242?full=true&_=1528774405558" for ::ffff:172.24.0.10 at 2018-06-13 12:44:34 +0000
I, [2018-06-13T12:44:34.639965 #19]  INFO -- : Started GET "/api/v1/users/135?full=true&_=1528774405559" for ::ffff:172.24.0.10 at 2018-06-13 12:44:34 +0000
I, [2018-06-13T12:44:34.646527 #19]  INFO -- : Started GET "/api/v1/users/291?full=true&_=1528774405560" for ::ffff:172.24.0.10 at 2018-06-13 12:44:34 +0000
I, [2018-06-13T12:44:34.656248 #19]  INFO -- : Processing by UsersController#show as JSON
I, [2018-06-13T12:44:34.656551 #19]  INFO -- :   Parameters: {"full"=>"true", "_"=>"1528774405559", "id"=>"135"}
I, [2018-06-13T12:44:34.660458 #19]  INFO -- : Processing by UsersController#show as JSON
I, [2018-06-13T12:44:34.660541 #19]  INFO -- :   Parameters: {"full"=>"true", "_"=>"1528774405558", "id"=>"242"}
I, [2018-06-13T12:44:34.664072 #19]  INFO -- : Processing by UsersController#show as JSON
I, [2018-06-13T12:44:34.664169 #19]  INFO -- :   Parameters: {"full"=>"true", "_"=>"1528774405560", "id"=>"291"}
I, [2018-06-13T12:44:34.831800 #23]  INFO -- : Completed 200 OK in 8408ms (Views: 261.9ms | ActiveRecord: 39.6ms)
I, [2018-06-13T12:44:34.969928 #19]  INFO -- : Completed 200 OK in 4374ms (Views: 83.0ms | ActiveRecord: 533.7ms)
I, [2018-06-13T12:44:35.081078 #23]  INFO -- : Completed 200 OK in 8650ms (Views: 121.9ms | ActiveRecord: 38.6ms)
I, [2018-06-13T12:44:35.138607 #19]  INFO -- : Completed 200 OK in 5716ms (Views: 137.5ms | ActiveRecord: 589.6ms)

I don’t understand it fully. There are 4 GETs and 4 Completeds; I don’t know if those are matching, though, as the Completed lines say they took from 4 to 8 seconds, but the time between the GETs and the Completeds is about 1 second. Take the last Completed: it took 5.7 seconds, while Views and ActiveRecord took 0.1 and 0.5 seconds. I don’t know what all those numbers mean. Could the 5.7 seconds be the time the query was completed minus the time it was added to the delayed_jobs table?

And here are some logs from the scheduler showing Elasticsearch queries. These don’t match up with the above, they’re just some random ones, but all the searches look exactly the same in terms of time to complete.

I, [2018-06-13T12:45:37.221626 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710646) RUNNING
I, [2018-06-13T12:45:37.281485 #1]  INFO -- : # curl -X POST "http://zammad-elasticsearch:9200/zammad_production/User/1168?pipeline=zammad813516161384" \
I, [2018-06-13T12:45:37.292369 #1]  INFO -- : # 200
I, [2018-06-13T12:45:37.296800 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710646) COMPLETED after 0.0750
I, [2018-06-13T12:45:37.301547 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710647) RUNNING
I, [2018-06-13T12:45:37.363320 #1]  INFO -- : # curl -X POST "http://zammad-elasticsearch:9200/zammad_production/User/1461?pipeline=zammad813516161384" \
I, [2018-06-13T12:45:37.375401 #1]  INFO -- : # 200
I, [2018-06-13T12:45:37.381427 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710647) COMPLETED after 0.0797
I, [2018-06-13T12:45:37.389300 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710648) RUNNING
I, [2018-06-13T12:45:37.449282 #1]  INFO -- : # curl -X POST "http://zammad-elasticsearch:9200/zammad_production/User/708?pipeline=zammad813516161384" \
I, [2018-06-13T12:45:37.462481 #1]  INFO -- : # 200
I, [2018-06-13T12:45:37.467348 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710648) COMPLETED after 0.0779
I, [2018-06-13T12:45:37.472651 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710649) RUNNING
I, [2018-06-13T12:45:37.536060 #1]  INFO -- : # curl -X POST "http://zammad-elasticsearch:9200/zammad_production/User/2084?pipeline=zammad813516161384" \
I, [2018-06-13T12:45:37.546688 #1]  INFO -- : # 200
I, [2018-06-13T12:45:37.551357 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710649) COMPLETED after 0.0782
I, [2018-06-13T12:45:37.556451 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710650) RUNNING
I, [2018-06-13T12:45:37.619907 #1]  INFO -- : # curl -X POST "http://zammad-elasticsearch:9200/zammad_production/User/1386?pipeline=zammad813516161384" \
I, [2018-06-13T12:45:37.630978 #1]  INFO -- : # 200
I, [2018-06-13T12:45:37.634890 #1]  INFO -- : 2018-06-13T12:45:37+0000: [Worker(host:9909885fd80e pid:1)] Job BackgroundJobSearchIndex (id=5710650) COMPLETED after 0.0783

They all complete in 0.07 (seconds, I guess).

So what does that mean? I think Elasticsearch is super quick and not the problem. The last Completed took 5.7 seconds; I don’t know what it’s doing in those 5.7 seconds. Is it taking a long time to process the JSON? Is it waiting for a lock on the database? I don’t know how to find out either.

Sorry, I have no clue where those 5.7 seconds come from.

Maybe @thorsteneckel or @martini could make it a bit clearer?

Just thinking out loud.

I ran the inbox processing script after hours; it processed 138 emails. The script finished maybe 5 minutes ago and I just checked the delayed_jobs table: 16,500 jobs in the queue. The script runs

tickets = client.ticket.search({'query': 'article.message_id:"%s"' % message_id})

138 times, and

ticket = client.ticket.find(ticket_id)

138 times. The client.ticket.find(ticket_id) call submits /api/v1/tickets/17558?all=true&_=1528956813594.
I think the all=true means that one query creates 100s of delayed_jobs.
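A rough way to sanity-check that, assuming the script was the only real activity in the window and that the table has the usual Rails created_at column (the server clock is UTC, going by the logs):

-- jobs queued in the last half hour
select count(*)
from delayed_jobs
where created_at > now() - interval '30 minutes';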

These two kinds of delayed jobs take 5 or more seconds each to process, and there are a lot of the User ones in the queue, probably the majority of them.

--- !ruby/object:BackgroundJobSearchIndex
object: User
o_id: 177 

--- !ruby/object:BackgroundJobSearchIndex
object: Ticket
o_id: 17629
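To confirm the User ones really are the majority, the handler YAML can be pattern-matched against the two shapes above (slow on a big table, but fine for a one-off check):

-- breakdown of the reindex jobs by object type
select
  count(*) filter (where handler like '%object: User%')   as user_jobs,
  count(*) filter (where handler like '%object: Ticket%') as ticket_jobs,
  count(*)                                                as total
from delayed_jobs
where handler like '%BackgroundJobSearchIndex%';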

Thinking out loud again.

While the script is running, railsserver is gobbling up CPU time. At the moment I’ve got
- WEB_CONCURRENCY=16
- MAX_THREADS=16
- MIN_THREADS=16

Load average peaked at 20 (I have 32 cores).

Now that the script has finished, it has left behind 13,000 jobs in the delayed_jobs queue, and the scheduler is barely using any CPU:

root@50cfa9275043:/opt/zammad# ps -emo pid,tid,fname,user,psr,pcpu
  PID   TID COMMAND  USER     PSR %CPU
    1     - bundle   zammad     - 89.6
    -     1 -        zammad    20  0.5
    -    17 -        zammad     1  0.1
    -    18 -        zammad     8 60.8
    -    19 -        zammad    27  0.0
    -    20 -        zammad    27  0.0
    -    22 -        zammad    25  7.9
    -    23 -        zammad    27  8.0
    -    24 -        zammad    10  0.0
    -    25 -        zammad    31  7.7
    -    26 -        zammad     9  0.7
    -    33 -        zammad    24  0.0
    -    47 -        zammad    25  0.0
    - 50010 -        zammad    23  0.0
48740     - bash     root       -  0.2
    - 48740 -        root       1  0.2
50011     - ps       root       -  0.0
    - 50011 -        root      26  0.0
root@50cfa9275043:/opt/zammad# ps -emo pid,tid,fname,user,psr,pcpu
  PID   TID COMMAND  USER     PSR %CPU
    1     - bundle   zammad     - 89.6
    -     1 -        zammad     8  0.5
    -    17 -        zammad     9  0.1
    -    18 -        zammad    13 60.8
    -    19 -        zammad    27  0.0
    -    20 -        zammad    27  0.0
    -    22 -        zammad    25  7.9
    -    23 -        zammad    27  8.0
    -    24 -        zammad    10  0.0
    -    25 -        zammad    10  7.7
    -    26 -        zammad    19  0.7
    -    33 -        zammad    24  0.0
    -    47 -        zammad    25  0.0
    - 50052 -        zammad    13  0.0
    - 50053 -        zammad    23  0.0
    - 50055 -        zammad     8  0.0
48740     - bash     root       -  0.2
    - 48740 -        root      27  0.2
50054     - ps       root       -  0.0
    - 50054 -        root      21  0.0

It looks like only 1 thread is processing the queue, and it seems to be waiting for something 40% of the time: only 60% CPU utilisation.

Thinking out loud.

I saw this in the logs, which confirmed there is only one thread processing the job queue:

zammad-scheduler_1      | I, [2018-06-14T12:31:53.704960 #1]  INFO -- : Running job thread for 'Check streams for Channel' (Channel.stream) status is: sleep
zammad-scheduler_1      | I, [2018-06-14T12:31:53.705165 #1]  INFO -- : Running job thread for 'Process escalation tickets' (Ticket.process_escalation) status is: sleep
zammad-scheduler_1      | I, [2018-06-14T12:31:53.705304 #1]  INFO -- : Running job thread for 'Import OTRS diff load' (Import::OTRS.diff_worker) status is: sleep
zammad-scheduler_1      | I, [2018-06-14T12:31:53.705443 #1]  INFO -- : Running job thread for 'Generate Session data' (Sessions.jobs) status is: sleep
zammad-scheduler_1      | I, [2018-06-14T12:31:53.706891 #1]  INFO -- : Running job thread for 'Check Channels' (Channel.fetch) status is: sleep
zammad-scheduler_1      | I, [2018-06-14T12:31:53.707629 #1]  INFO -- : Running job thread for 'Execute jobs' (Job.run) status is: sleep

So right now I believe railsserver is fine and is able to make use of most of the CPU resources available.

When it’s really busy, I see in the Postgres logs updates to “sessions” taking 500 ms or more, constantly, one after the other, for minutes at a time. That table is getting hammered.
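To put a number on “hammered”, the per-table write counters can be checked (cumulative since the last stats reset, so it’s the change over time that matters):

-- update volume and dead rows on the busiest tables
select relname, n_tup_upd, n_tup_hot_upd, n_dead_tup
from pg_stat_user_tables
where relname in ('sessions', 'users');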

The job processing is the biggest problem now, with User and Ticket jobs taking 5, 10 or more seconds each to process. During the day I saw 200 jobs in the queue and it took a minute or more to clear. Maybe part of it is waiting for the updates to the sessions table to complete?

I don’t know Ruby, but I think I might have to profile it to see what it’s doing in those 5 seconds.
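Before profiling the Ruby side, one quick thing to check from the Postgres side is whether the scheduler’s queries are sitting behind a lock while the jobs run. A sketch, assuming PostgreSQL 9.6 (pg_blocking_pids() was added in 9.6); the scheduler shows up as application_name 'script/scheduler.rb' in the log lines above:

-- sessions that are currently blocked by another backend
select pid, application_name, pg_blocking_pids(pid) as blocked_by,
       wait_event_type, wait_event, left(query, 80) as query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;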

Since I’ve been hammering out the messages, I may as well continue.

I still haven’t done any profiling, but after a break and a sleep I thought that, since updates to updated_at in the sessions and users tables show up repeatedly in the slow log, I could optimise PostgreSQL for writes.

My PostgreSQL config now looks like:

max_connections = 300 
shared_buffers = 15GB
effective_cache_size = 45GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 3276kB
max_worker_processes = 32
max_parallel_workers_per_gather = 16
min_wal_size = 1GB
max_wal_size = 2GB
wal_buffers = 16MB

# the above block from https://pgtune.leopard.in.ua/#/
# using https://stackoverflow.com/questions/5131266/increase-postgresql-write-speed-at-the-cost-of-likely-data-loss and https://www.postgresql.org/docs/9.6/static/runtime-config-wal.html as references,

synchronous_commit = off
# setting synchronous_commit to off sounded scary, but it's not too bad, 
# the database isn't at risk of getting corrupted, but it is at risk of losing
# the last wal_writer_delay worth of commits. I can live with that.

wal_writer_delay = 1000 # ms
commit_delay = 5000 # us
# if there are more than 5 open transactions it delays commits by 5 ms
# to try and group them into a single WAL flush

deadlock_timeout = 15000 
# default is 1000 ms which is OK when everything is working normally.
# when it's way overloaded it causes ldap sync issues and unprocessable mails, etc.

autovacuum_analyze_scale_factor = 0.002
autovacuum_vacuum_scale_factor = 0.001
# vacuum more often due to the constant updates of updated_at in the sessions and users tables.
logging_collector = on
log_destination = 'stderr'
log_directory = log
log_rotation_age = 1d
log_min_duration_statement = 50
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_error_verbosity = default
lc_messages='C'
# handy logging, this works well with pgbadger, I think that's where I copied it from.

The result? Hmm, just eyeballing it, there are far fewer slow-log entries between 7 past and the top of the hour, but it looks like there are hourly jobs starting at xx:00 and running until xx:07, and in that window there are still heaps of slow updates, and they might even be slower? But that doesn’t make any sense. I’m going to lower the commit_delay to 1000 us and see how it goes.
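Checkpoint pressure is one classic cause of periodic bursts of slow writes, so before blaming commit_delay I want to rule that out. log_checkpoints is already on, so the log shows when they run, and the cumulative counters give a rough idea whether checkpoints are being forced by WAL volume rather than by the timeout (just a sketch):

-- timed vs requested checkpoints since the last stats reset
select checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend
from pg_stat_bgwriter;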

I’ve gone as far as I can with optimising Postgres; it didn’t fix it, but it was still a worthwhile exercise.

Now I’ve discovered Elasticsearch is only being used for user queries; ticket queries are being sent to PostgreSQL.

Any idea how to debug that?

Oh noes, I think it’s a timezone issue. I think Zammad assumes the server and browser are running in the same timezone? Can that be? The created_at and updated_at fields, etc., are timestamp (as opposed to timestamptz), and the times in them are (for me) UTC with no timezone info, whereas desktops and browsers are running in +10. I guess email timestamps would be +10 too, with timezone info, so 2018-06-17 10:00:00 +10, but the time that gets stored in Postgres would just be 2018-06-17 10:00:00 (no +10), while the created_at would be 2018-06-17 00:00:00.

I think Elasticsearch stores the equivalent of timestamptz, so when Zammad searches Elasticsearch for a ticket within a narrow time period, the times are 10 hours out and the ticket isn’t found.
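A quick way to see the difference between the two column types (just an illustration of the theory, not proof of what Zammad does; assumes the server is on UTC, as the logs suggest):

show timezone;
-- the +10 offset is honoured by timestamptz and silently dropped by timestamp
select '2018-06-17 10:00:00 +10'::timestamptz as with_tz,
       '2018-06-17 10:00:00 +10'::timestamp   as without_tz;
-- with_tz:    2018-06-17 00:00:00+00
-- without_tz: 2018-06-17 10:00:00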

I’m not 100% sure though; I might submit a bug.

There is a timezone issue and it’s being looked at, but I’m not sure that’s the cause of Elasticsearch not being used. It looks like everything through the search bar uses Elasticsearch, but I couldn’t see any Elasticsearch queries from an API request; it’s using Postgres to handle those.

I need to do some more testing after hours when no one is logged in to be sure.

Does anyone know if API requests are supposed to use Elasticsearch?

More digging: Elasticsearch isn’t being used for API queries, but Postgres is returning them in 2 ms or less. I’m going to conclude it’s because the queue is being processed by a single thread and my cores aren’t the fastest.

I’ve got a new server on the way; I think it’s 3.2 GHz base and 4 GHz turbo, 16 cores. Wow, the old CPU is 1.2 GHz base and 2.2 GHz turbo. I thought it was faster than that.

After looking at /proc/cpuinfo, no cores were running at 2.2; they were all at 1.2 or 1.3. I ran:

for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done

to change the governor to performance, and it halved the time it took for an API query, from 4 seconds to 2 seconds. So that confirms it’s a CPU bottleneck (and a single thread processing the queue doesn’t help).

What a ride. I’ve been back and forth with theories. I learned a lot, though, and thoroughly enjoyed the ride.

Thanks for going into so much detail on this topic. I’m sure it will help other people, too.
Plus it’s super handy and good to know where all those evil things are lurking. :smiley:

And I just realised that, for the API request I was testing with, using Postgres was the right thing to do, as I was requesting a ticket by ID, and that’s SQL’s bread and butter. Elasticsearch would be used more for searching the body of emails, i.e. full-text search.
