Scheduler not connecting to database

  • Used Zammad version: 6.3.x
  • Used Zammad installation type: Docker Compose
  • Operating system: Ubuntu
  • Browser + version: Firefox

Expected behavior:

  • scheduler should run

Actual behavior:

  • exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds

Steps to reproduce the behavior:

Hi, I just try to fix this issue with my scheduler which is not running.

The errors I get in Zammad monitoring are:

emails that could not be processed: 5
Channel: Email::Notification out No such file or directory - /usr/sbin/sendmail
scheduler may not run (last execution of CacheClearJob.perform_now 8 days ago) - please contact your system administrator 

I use the standard docker compose file but for 8 days the scheduler is not really working correctly anymore with the described errors. I don’t know where to look anymore and I am clueless what the issue is.

I’ve already checked the active connections in postgres which were active 1 and idle 15 which sound okay and I’ve also increased the limit to 500 concurrent connections just in case but that did not help.

Any pointer in the right direction would be of enourmous help. Otherwise I will have to restore from backup and try without docker.

This is the postgres log:

PostgreSQL Database directory appears to contain a database; Skipping initialization

2024-05-14 08:22:28.645 UTC [1] LOG:  starting PostgreSQL 15.3 on x86_64-pc-linux-musl, compiled by gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924, 64-bit

2024-05-14 08:22:28.645 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432

2024-05-14 08:22:28.645 UTC [1] LOG:  listening on IPv6 address "::", port 5432

2024-05-14 08:22:28.660 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

2024-05-14 08:22:28.685 UTC [24] LOG:  database system was shut down at 2024-05-14 08:22:25 UTC

2024-05-14 08:22:28.721 UTC [1] LOG:  database system is ready to accept connections

2024-05-14 08:22:29.010 UTC [28] LOG:  incomplete startup packet

2024-05-14 08:27:28.780 UTC [22] LOG:  checkpoint starting: time

2024-05-14 08:27:30.534 UTC [22] LOG:  checkpoint complete: wrote 19 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.714 s, sync=0.014 s, total=1.754 s; sync files=17, longest=0.007 s, average=0.001 s; distance=157 kB, estimate=157 kB

2024-05-14 08:32:28.604 UTC [22] LOG:  checkpoint starting: time

2024-05-14 08:32:30.957 UTC [22] LOG:  checkpoint complete: wrote 24 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.311 s, sync=0.014 s, total=2.353 s; sync files=20, longest=0.007 s, average=0.001 s; distance=55 kB, estimate=147 kB

Same issue which encompasses a wide variety of jobs:

E, [2024-05-14T21:57:46.447411#1-150100] ERROR -- : execute Channel.fetch (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000439031 seconds.
E, [2024-05-14T21:57:47.446934#1-150120] ERROR -- : execute SessionTimeoutJob.perform_now (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000486994 seconds.
E, [2024-05-14T21:57:48.447597#1-150140] ERROR -- : execute Job.run (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000497121 seconds.
E, [2024-05-14T21:57:48.544974#1-150520] ERROR -- : execute ExternalCredential::Exchange.refresh_token (try_count 0) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.00036161 seconds.
E, [2024-05-14T21:57:49.448330#1-150160] ERROR -- : execute Stats.generate (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000454126 seconds.
E, [2024-05-14T21:57:50.448750#1-150180] ERROR -- : execute Chat.cleanup_close (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000459974 seconds.
E, [2024-05-14T21:57:51.448111#1-150200] ERROR -- : execute DataPrivacyTaskJob.perform_now (try_count 10) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000463803 seconds.
E, [2024-05-14T21:57:51.448506#1-150100] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use>
E, [2024-05-14T21:57:51.448729#1-150100] ERROR -- : Failed to run Channel.fetch after 11 tries.
E, [2024-05-14T21:57:52.447392#1-150120] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T21:57:52.447686#1-150120] ERROR -- : Failed to run SessionTimeoutJob.perform_now after 11 tries.
E, [2024-05-14T21:57:53.447999#1-150140] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T21:57:53.448259#1-150140] ERROR -- : Failed to run Job.run after 11 tries.
E, [2024-05-14T21:57:53.473147#1-150220] ERROR -- : execute Ticket.process_escalation (try_count 8) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000269833 seconds.
E, [2024-05-14T21:57:53.545319#1-150520] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T21:57:54.448716#1-150160] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T21:57:54.448929#1-150160] ERROR -- : Failed to run Stats.generate after 11 tries.

I also had to modify docker-compose.yml from the one included on the docker-compose github because it’s missing the “/opt/zammad/var” folder which [POSTGRES_VERSION=15.6-alpine] requires. That version is specified in the github .env file.

Thank you very much, this makes sense and would’ve propably solved my issue but in the meantime I’ve migrated to a dedicated machine due to the better support.
I’ve had to reg-register elastics trough the rails console but after that it was a straight forward migration without big issues and it at least feels quicker than dockerized.

In my case it only allowed me to get postgres running, but didn’t help with the connectivity issues.

@n0xlf , Looks like this issue I’m also having:

https:// community.zammad .org/t/scheduler-fails-with-activerecord-could-not-obtain-a-connection-from-the-pool-within-5-000-seconds-waited-5-000-seconds-all-pooled-connections-were-in-use/14320/3
(Can’t put properly formed links in posts here)

I’ve tried different things as described in that ticket but can’t put my finger on what the actual fix is.

I rad your thread too and have also tried various things with no success. Apparently mine has been broken for at least 22 days :smile:

Current Status

  • Channel: Microsoft365::Account in is active but not fetched for about 3 hours
  • scheduler may not run (last execution of TaskbarCleanupJob.perform_now 22 days ago) - please contact your system administrator

This wasn’t affecting the MS365/email side before but now has a real world impact that tickets aren’t being processed in a timely fashion.

My installation is very small with only a handful of users and maybe a few tickets a day, so this kind of behavior from it is odd.

that is not the same issue and thus doesn’t belong into this thread.

Thought I saw that in my logs related to the postgres timeout but maybe not - Great :frowning: