Scheduler fails with "ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use"

Infos:

  • Used Zammad version: 6.3.0
  • Used Zammad installation type: docker-compose
  • Operating system: Ubuntu 22.04.4 LTS (jammy)
  • Browser + version: Chrome 124.0.6367.201

Expected behavior:

The scheduler is able to finish its tasks.

Actual behavior:

Some scheduled tasks never finish (mostly Channel.stream and Sessions.jobs, but also others), and errors show up in the zammad-scheduler container.

Yet some tasks seem to complete just fine. (The LDAP sync and Office 365 integrations do seem to update).

Monitor tab shows: Scheduler may not run (last execution of Chat.cleanup 5 days ago) - please contact your system administrator

The most descriptive error in the logs is: ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use.

Steps to reproduce the behavior:

docker-compose stop && docker-compose up -d

What I’ve tried to troubleshoot:

  • Created a copy of /opt/zammad/config/database/database.yml, increased the pool size to 500 and mounted it into all Zammad containers via the shared-service YAML block (see the sketch after this list)
  • Set the environment variables ZAMMAD_MAX_THREADS, MAX_THREADS and RAILS_MAX_THREADS to 100 in docker-compose.yml
  • Changed the postgresql.conf max_connections parameter to 500
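
For reference, the override from the first two attempts looked roughly like this (a sketch, not a verbatim copy of my compose file; ./database.yml is just a placeholder name for my copy with pool: 500):

x-shared:
  zammad-service: &zammad-service
    environment: &zammad-environment
      # attempt: raise the Rails/Puma thread limits
      ZAMMAD_MAX_THREADS: "100"
      MAX_THREADS: "100"
      RAILS_MAX_THREADS: "100"
    volumes:
      # attempt: mount the modified copy back over the original location
      - ./database.yml:/opt/zammad/config/database/database.yml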

Full scheduler log:

I, [2024-05-14T16:08:01.944727#1-149060]  INFO -- : ProcessScheduledJobs running...
I, [2024-05-14T16:08:01.946071#1-149060]  INFO -- : Running job thread for 'Process pending tickets.' (Ticket.process_pending) status is: sleep
I, [2024-05-14T16:08:01.946116#1-149060]  INFO -- : Running job thread for 'Check 'Channel' streams.' (Channel.stream) status is: sleep
I, [2024-05-14T16:08:01.946133#1-149060]  INFO -- : Running job thread for 'Execute import jobs.' (ImportJob.start_registered) status is: sleep
I, [2024-05-14T16:08:01.946153#1-149060]  INFO -- : Running job thread for 'Generate 'Session' data.' (Sessions.jobs) status is: sleep
I, [2024-05-14T16:08:01.946169#1-149060]  INFO -- : Running job thread for 'Update exchange oauth 2 token.' (ExternalCredential::Exchange.refresh_token) status is: sleep
I, [2024-05-14T16:08:01.946192#1-149060]  INFO -- : Running job thread for 'Check channels.' (Channel.fetch) status is: sleep
I, [2024-05-14T16:08:01.946209#1-149060]  INFO -- : Running job thread for 'Process automatic ticket unassignments.' (Ticket.process_auto_unassign) status is: sleep
I, [2024-05-14T16:08:01.946223#1-149060]  INFO -- : Running job thread for 'Process ticket escalations.' (Ticket.process_escalation) status is: sleep
I, [2024-05-14T16:08:01.946240#1-149060]  INFO -- : Running job thread for 'Handle data privacy tasks.' (DataPrivacyTaskJob.perform_now) status is: sleep
I, [2024-05-14T16:08:01.947104#1-149060]  INFO -- : Running job thread for 'Clean up dead sessions.' (SessionTimeoutJob.perform_now) status is: sleep
I, [2024-05-14T16:08:01.947168#1-149060]  INFO -- : Running job thread for 'Delete old online notification entries.' (OnlineNotification.cleanup) status is: sleep
I, [2024-05-14T16:08:01.947228#1-149060]  INFO -- : Running job thread for 'Generate user-based stats.' (Stats.generate) status is: sleep
I, [2024-05-14T16:08:01.947540#1-149060]  INFO -- : Running job thread for 'Close chat sessions where participants are offline.' (Chat.cleanup_close) status is: sleep
I, [2024-05-14T16:08:01.947646#1-149060]  INFO -- : Running job thread for 'Execute planned jobs.' (Job.run) status is: sleep
E, [2024-05-14T16:08:02.918660#1-159100] ERROR -- : execute Ticket.process_auto_unassign (try_count 8) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000310106 seconds.
E, [2024-05-14T16:08:02.944379#1-159200] ERROR -- : execute SessionTimeoutJob.perform_now (try_count 1) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000251807 seconds.
E, [2024-05-14T16:08:03.944897#1-159220] ERROR -- : execute OnlineNotification.cleanup (try_count 1) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000280882 seconds.
E, [2024-05-14T16:08:04.943091#1-159140] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:04.944778#1-159240] ERROR -- : execute Stats.generate (try_count 1) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.00021298 seconds.
E, [2024-05-14T16:08:05.907592#1-159080] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:05.907744#1-159080] ERROR -- : Failed to run ImportJob.start_registered after 11 tries.
E, [2024-05-14T16:08:05.940307#1-159120] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:05.943204#1-159160] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:05.945166#1-159260] ERROR -- : execute Chat.cleanup_close (try_count 1) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000192878 seconds.
E, [2024-05-14T16:08:06.943661#1-159180] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:06.945652#1-159280] ERROR -- : execute Job.run (try_count 1) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000225233 seconds.
E, [2024-05-14T16:08:07.918939#1-159100] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:07.944594#1-159200] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:08.945150#1-159220] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:09.945021#1-159240] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:10.908190#1-159080] ERROR -- : could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)
lib/background_services/service/process_scheduled_jobs/job_executor.rb:95:in `retry_limit_reached'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:82:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:88:in `retry_execution'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:34:in `rescue in execute'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:24:in `execute'
lib/background_services/service/process_scheduled_jobs/job_executor/one_time.rb:6:in `run'
lib/background_services/service/process_scheduled_jobs/job_executor.rb:10:in `run'
lib/background_services/service/process_scheduled_jobs/manager.rb:84:in `block in start_in_thread'
lib/application_handle_info.rb:19:in `use'
lib/background_services/service/process_scheduled_jobs/manager.rb:82:in `start_in_thread'
lib/background_services/service/process_scheduled_jobs/manager.rb:73:in `block (2 levels) in start'
lib/background_services/service/process_scheduled_jobs/manager.rb:72:in `block in start'
E, [2024-05-14T16:08:10.945404#1-159260] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>
E, [2024-05-14T16:08:11.943715#1-159140] ERROR -- : execute ExternalCredential::Exchange.refresh_token (try_count 2) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000303536 seconds.
E, [2024-05-14T16:08:11.945841#1-159280] ERROR -- : Can't reconnect to database #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use>

You do know that the pool size counts per process, and thus in your case would mean 4 × 500 potential connections without any further tuning?

This is an absolutely over-the-top and unneeded configuration. The default of 50 should be more than enough in most scenarios. You’ll need at least 200 max_connections on Postgres; 2000 will be better for tuning etc.

Thanks, I’m not exactly familiar with how all the max-thread limits affect each other.
So I reverted all environment settings back to their defaults and increased the PostgreSQL limit, but the issue still shows up. What could be the next debugging step, or what should I try now?

PS.
From what I’ve read (link below), this issue likely has to do not with PostgreSQL but with the Rails configuration. That’s why I tried raising/changing the Rails thread-limit environment variables.

https://judoscale.com/blog/active-record-connection-pool

root@svrdocker01:/home/username/docker_zammad# docker-compose up -d
Starting docker_zammad_zammad-elasticsearch_1 ... done
Starting docker_zammad_zammad-postgresql_1    ... done
Starting docker_zammad_zammad-redis_1         ... done
Starting docker_zammad_zammad-memcached_1     ... done
Starting docker_zammad_zammad-init_1          ... done
Starting docker_zammad_zammad-scheduler_1     ... done
Starting docker_zammad_zammad-railsserver_1   ... done
Starting docker_zammad_zammad-websocket_1     ... done
Starting docker_zammad_zammad-nginx_1         ... done

root@svrdocker01:/home/username/docker_zammad# docker logs -f docker_zammad_zammad-scheduler_1
[..snip..]
E, [2024-05-15T10:24:08.457665#1-150100] ERROR -- : execute Channel.fetch (try_count 0) exited with error #<ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use> in: 5.000267993 seconds.

username@svrdocker01:~/docker_zammad$ docker exec -it  docker_zammad_zammad-scheduler_1 /bin/bash
zammad@0be657ce42d3:~$ cat config/database/database.yml
default: &default
  pool: 50
  timeout: 15000
  encoding: utf8
  adapter: postgresql

production:
  <<: *default
  pool: 50
  timeout: 15000
  database: zammad_production

username@svrdocker01:~/docker_zammad$ docker exec -it  docker_zammad_zammad-postgresql_1 /bin/bash
15d11848aeb6:/# cat ./var/lib/postgresql/data/postgresql.conf | grep max
max_connections = 5000                  # (change requires restart)

This is my .env (possibly relevant bits):

~/docker_zammad$ cat .env
ELASTICSEARCH_VERSION=8.12.2
IMAGE_REPO=ghcr.io/zammad/zammad
MEMCACHE_SERVERS=zammad-memcached:11211
MEMCACHE_VERSION=1.6.25-alpine
POSTGRES_VERSION=15.6-alpine
REDIS_VERSION=7.2.4-alpine
RESTART=always
VERSION=6.3.0       # including minor updates

And docker-compose.yml (should be default):

x-shared:
  zammad-service: &zammad-service
    environment: &zammad-environment
      MEMCACHE_SERVERS: ${MEMCACHE_SERVERS}
      POSTGRESQL_DB: ${POSTGRES_DB}
      POSTGRESQL_HOST: ${POSTGRES_HOST}
      POSTGRESQL_USER: ${POSTGRES_USER}
      POSTGRESQL_PASS: ${POSTGRES_PASS}
      POSTGRESQL_PORT: ${POSTGRES_PORT}
      REDIS_URL: ${REDIS_URL}
      # Allow passing in these variables via .env:
      AUTOWIZARD_JSON:
      AUTOWIZARD_RELATIVE_PATH:
      ELASTICSEARCH_ENABLED:
      ELASTICSEARCH_HOST:
      ELASTICSEARCH_PORT:
      ELASTICSEARCH_SCHEMA:
      ELASTICSEARCH_NAMESPACE:
      ELASTICSEARCH_REINDEX:
      ELASTICSEARCH_SSL_VERIFY:
      NGINX_PORT:
      NGINX_SERVER_NAME:
      NGINX_SERVER_SCHEME:
      POSTGRESQL_DB_CREATE:
      POSTGRESQL_OPTIONS:
      RAILS_TRUSTED_PROXIES:
      ZAMMAD_WEB_CONCURRENCY:
      ZAMMAD_SESSION_JOBS:
      ZAMMAD_PROCESS_SCHEDULED:
      ZAMMAD_PROCESS_DELAYED_JOBS_WORKERS:

Just for tracking, this is related:

https://community.zammad.org/t/scheduler-not-connecting-to-database/14296/4

So I’ve spent another hour or so trying to get the combination of settings in puma.rb / database.yml / postgresql.conf right, without any success unfortunately.

A postgresql.conf setting of max_connections = 5000, combined with pool: 50 in database.yml and workers 1 / threads 5, 10 in puma.rb, still results in the same ‘could not obtain a connection’ error.

From my understanding, this should have the effect that each Puma worker (and only workers 1 is allowed here) has a pool of pool: 50 available database connections. Each Puma worker is allowed to scale its thread count within the configured threads 5, 10 range depending on the server load (the default is between 5 and 30 threads).

If the server load goes up, Puma will create at most 10 threads with these settings, so there should never be more than 10 threads competing for a database connection. Since the pool size is 50 (1 worker × 10 threads = at most 10 simultaneous checkouts), there shouldn’t be any waiting for a connection at all. Yet this does happen. Possible reasons could be connections not being returned to the pool, or connections being made outside of the pool context, but that goes a bit over my head to troubleshoot any further.
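
To keep my own head straight, this is how I understood the three limits to relate (a summary sketch in YAML, not an actual config file; the top-level keys are made up, the values are the ones described above, and the comments reflect my possibly flawed understanding):

# postgresql.conf
postgresql:
  max_connections: 5000   # hard upper limit on the PostgreSQL server side
# config/database/database.yml
rails:
  pool: 50                # max connections one Rails process may check out
# config/puma.rb
puma:
  workers: 1              # a single forked Puma worker
  threads: [5, 10]        # 5 to 10 threads per worker, each holding at most one connection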

Am I right that the settings from database.yml do not seem to propagate?

The pool size seems to be set to the default value of 5 at runtime, rather than the 50 configured in database.yml:

$ docker compose run --rm zammad-scheduler rails r "p ActiveRecord::Base.connection_pool.stat"
[+] Creating 3/0
 ✔ Container docker_zammad-zammad-memcached-1   Running                                                                                            0.0s
 ✔ Container docker_zammad-zammad-redis-1       Running                                                                                            0.0s
 ✔ Container docker_zammad-zammad-postgresql-1  Running                                                                                            0.0s
I, [2024-05-17T08:31:11.773981 #1]  INFO -- : ActionCable is using the redis instance at redis://zammad-redis:6379.
I, [2024-05-17T08:31:11.824991#1-6060]  INFO -- : Using memcached as Rails cache store.
I, [2024-05-17T08:31:11.825266#1-6060]  INFO -- : Using the Redis back end for Zammad's web socket session store.
{:size=>5, :connections=>0, :busy=>0, :dead=>0, :idle=>0, :waiting=>0, :checkout_timeout=>5.0}
$ docker exec -it  docker_zammad_zammad-scheduler_1 /bin/bash
$ cat /opt/zammad/config/database/database.yml | grep pool
  pool: 50
$ cat /opt/zammad/config/database.yml | grep pool
cat: /opt/zammad/config/database.yml: No such file or directory

In the Docker container, database.yml only exists at ~/config/database/database.yml; there is no copy or symlink at ~/config/database.yml.

So I managed to work around the issue and verified that what I posted before is indeed the problem: database.yml is in the wrong place. It should be in /opt/zammad/config/, not only in /opt/zammad/config/database/.

I copied the default database.yml file and mounted it in the Docker template like so:

x-shared:
  zammad-service: &zammad-service
    environment: &zammad-environment
      [..snip..]
    volumes:
      - ./tempdb.yml:/opt/zammad/config/database.yml

Restarted the containers and the problem is solved. No more connection errors in the log.

$ docker compose run --rm zammad-scheduler rails r "p ActiveRecord::Base.connection_pool.stat"
[+] Creating 3/3
 ✔ Container docker_zammad_zammad-memcached_1   Recreated                                                                                                                                                                                   0.5s
 ✔ Container docker_zammad_zammad-redis_1       Recreated                                                                                                                                                                                   0.3s
 ✔ Container docker_zammad_zammad-postgresql_1  Recreated                                                                                                                                                                                   0.4s
[+] Running 3/3
 ✔ Container docker_zammad-zammad-memcached-1   Started                                                                                                                                                                                     0.5s
 ✔ Container docker_zammad-zammad-postgresql-1  Started                                                                                                                                                                                     0.4s
 ✔ Container docker_zammad-zammad-redis-1       Started                                                                                                                                                                                     0.3s
I, [2024-05-17T10:53:21.527187 #1]  INFO -- : ActionCable is using the redis instance at redis://zammad-redis:6379.
I, [2024-05-17T10:53:21.578017#1-6060]  INFO -- : Using memcached as Rails cache store.
I, [2024-05-17T10:53:21.578302#1-6060]  INFO -- : Using the Redis back end for Zammad's web socket session store.
{:size=>50, :connections=>0, :busy=>0, :dead=>0, :idle=>0, :waiting=>0, :checkout_timeout=>5.0}

That shouldn’t be required at all, because Zammad’s init container does touch that file etc. What’s different in your approach now?

Where does it touch that file or create the symlink?
From what I can see, contrib/docker/docker-entrypoint.sh doesn’t touch database.yml, but I might be missing something.

The only change I made that resulted in Rails applying the database.yml was moving it from /opt/zammad/config/database/ to /opt/zammad/config/.

@JasperE Great find! Fixed all of my DB/scheduler maintenance issues.

I had this issue as well when pulling the zammad/zammad-docker-compose project.

Turns out that when updating the zammad-docker-compose project to the settings for Zammad 6.3.0, specifically:

POSTGRESQL_OPTIONS=?pool=50

It is not being propagated through docker-compose.yml into the containers. For me, changing POSTGRESQL_OPTIONS: in docker-compose.yml to POSTGRESQL_OPTIONS: ${POSTGRESQL_OPTIONS} fixed it.
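
In other words, the relevant part of the x-shared block now looks roughly like this (a sketch; only the changed line is shown, everything else stays as in the stock docker-compose.yml):

x-shared:
  zammad-service: &zammad-service
    environment: &zammad-environment
      # [..snip..]
      # was: a bare "POSTGRESQL_OPTIONS:" entry, which did not pick up the value from .env in my setup
      POSTGRESQL_OPTIONS: ${POSTGRESQL_OPTIONS}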

I didn’t try removing the environment entry from the docker-compose.yml file altogether, but that might work as well, since the bare entry may be overriding the value from the .env file.

That fixes it for me, as well.

The supported way to fix this issue is to set the environment variable POSTGRESQL_OPTIONS=?pool=50 in .env.

This variable was recently added to .env.

Check this out for more details: database.yml is in the wrong location for docker containers · Issue #5179 · zammad/zammad · GitHub
