Man, I really hammered out the messages over the last few days, but it was stressful not having a handle on what was going on, and slowing down and typing it all out really helps.
So everything is back to normal, and I’ve had some time away from thinking about Zammad to let all the new information I’ve taken in over the last few days process in the background.
My last thought was that the scheduler was CPU bound and couldn’t keep up, but thinking about it now: it had 10 threads, yet the process never got much above 103% CPU utilization. I would have thought that if it were CPU bound with 10 threads all running at 100%, the process would show 1000%. Is that how threads work? (If the scheduler runs on MRI Ruby, that would actually explain it: the global VM lock means only one thread executes Ruby code at a time, so the process tops out around one core’s worth of CPU no matter how many threads it has.)
Taking that into account and thinking back to the Postgres logs, I think the bottleneck is the database: so many updates to, for example, the users table to update last_login. Looking at the logs after everything settled down, this query turns up a few times a minute in the slow log, which records every query that takes longer than 50ms:
2018-06-13 00:12:05 UTC [2974]: [183-1] user=postgres,db=zammad_production,app=script/scheduler.rb,client=172.24.0.9 LOG: duration: 51.681 ms execute <unnamed>: SELECT "activity_streams".* FROM "activity_streams" WHERE ((permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id is NULL) OR (permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id IN (1,2,3,6)) OR (permission_id is NULL AND group_id IN (1,2,3,6))) ORDER BY created_at DESC, id DESC LIMIT $1
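For reference, that 50ms threshold is Postgres’s log_min_duration_statement. A minimal sketch of the postgresql.conf settings that would produce log lines in this format (the log_line_prefix here is my best guess at what matches the line above):

# postgresql.conf
log_min_duration_statement = 50   # log any statement that runs longer than 50ms
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '   # timestamp, pid, session line, user/db/app/client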
To dig in, I ran EXPLAIN ANALYZE on the query.
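(This is a reconstruction rather than a copy-paste: I substituted concrete values for the $1 bind parameter. LIMIT 50 is inferred from rows=50 in the plan below, and the group IDs follow the plan’s filter rather than the log line above, since they vary by user.)

EXPLAIN ANALYZE SELECT "activity_streams".* FROM "activity_streams" WHERE ((permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id IS NULL) OR (permission_id IN (7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50) AND group_id IN (1,2,3)) OR (permission_id IS NULL AND group_id IN (1,2,3))) ORDER BY created_at DESC, id DESC LIMIT 50;

That gives: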
Limit (cost=5437.01..5437.13 rows=50 width=44) (actual time=56.658..56.672 rows=50 loops=1)
-> Sort (cost=5437.01..5546.49 rows=43792 width=44) (actual time=56.656..56.658 rows=50 loops=1)
Sort Key: created_at DESC, id DESC
Sort Method: top-N heapsort Memory: 28kB
-> Seq Scan on activity_streams (cost=0.00..3982.27 rows=43792 width=44) (actual time=0.042..46.176 rows=45111 loops=1)
Filter: (((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id IS NULL)) OR ((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id = ANY ('{1,2,3}'::integer[]))) OR ((permission_id IS NULL) AND (group_id = ANY ('{1,2,3}'::integer[]))))
Rows Removed by Filter: 1398
Planning time: 0.507 ms
Execution time: 56.731 ms
I see there’s a sequential scan feeding a top-N heapsort, so an index might help. The key detail is the ORDER BY created_at DESC, id DESC combined with a LIMIT: an index in that order lets Postgres read rows already sorted and stop after the first 50 that pass the filter, instead of scanning and sorting all ~45k rows. So:
create index index_activity_streams_created_at_id_desc on activity_streams (created_at desc, id desc);
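One caveat for anyone doing this on a live instance: a plain CREATE INDEX takes a lock that blocks writes to the table while the index builds. Postgres can build the same index without blocking writes, at the cost of a slower build and not being able to run inside a transaction. A sketch of the concurrent variant:

create index concurrently index_activity_streams_created_at_id_desc on activity_streams (created_at desc, id desc);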
Rerunning the explain now says:
Limit (cost=0.29..5.40 rows=50 width=44) (actual time=0.075..0.183 rows=50 loops=1)
-> Index Scan using index_activity_streams_created_at_id_desc on activity_streams (cost=0.29..4471.70 rows=43792 width=44) (actual time=0.073..0.171 rows=50 loops=1)
Filter: (((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id IS NULL)) OR ((permission_id = ANY ('{7,9,33,34,35,36,37,38,39,40,41,42,43,45,48,50}'::integer[])) AND (group_id = ANY ('{1,2,3}'::integer[]))) OR ((permission_id IS NULL) AND (group_id = ANY ('{1,2,3}'::integer[]))))
Planning time: 1.845 ms
Execution time: 0.236 ms
Execution time dropped from 56ms to 0.2ms, and the Sort step is gone entirely, since the index scan returns rows already in order.
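To confirm the planner keeps picking the new index once real traffic hits it, Postgres tracks per-index scan counts in a built-in statistics view. A quick check I can run tomorrow (a sketch):

select relname, indexrelname, idx_scan from pg_stat_user_indexes where indexrelname = 'index_activity_streams_created_at_id_desc';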
So we’ll see how it goes tomorrow.
I also adjusted some autovacuum settings, which should make it vacuum and analyze more frequently. I figured that with the constant updates to the users and other tables, more frequent would be better:
autovacuum_analyze_scale_factor = 0.002
autovacuum_vacuum_scale_factor = 0.001
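For context on what those numbers mean: autovacuum kicks in once dead tuples exceed roughly autovacuum_vacuum_threshold plus the scale factor times the table’s row count, so shrinking the scale factors this far makes busy tables get vacuumed after far fewer changes. If the global setting turns out to be too aggressive, the same knobs can be set per table instead. A sketch, using the busy users table as an example:

alter table users set (autovacuum_vacuum_scale_factor = 0.001, autovacuum_analyze_scale_factor = 0.002);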