DB corrupted, how to fix?

Infos:

  • Used Zammad version: 4.0.0
  • Used Zammad installation source: docker-compose (with persistent storage in a GitLab CI implementation)
  • Operating system: Linux
  • Browser + version: Firefox 86.0.1

Expected behavior:

No error, and to be able to change the user's permissions

Actual behavior:

When trying to change a specific (LDAP) user's permissions, I get this:
PG::DataCorrupted: ERROR: could not read block 46 in file "base/16385/17391": read only 0 of 8192 bytes : INSERT INTO "active_job_locks" ("lock_key", "active_job_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id"

Steps to reproduce the behavior:

  1. Open the specific user (bartas), change their permissions, and click save.
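
For reference, a minimal sketch of how one could map the file named in the error back to a table, assuming psql access to the PostgreSQL container (the container and database names below are placeholders/assumptions, only the zammad user comes from the .env further down):

# Sketch: "base/16385/17391" means database OID 16385, relation filenode 17391.
# pg_filenode_relation() resolves a filenode back to a relation name;
# 0 stands for the database's default tablespace.
# <postgresql-container> and the database name are placeholders.
docker exec -it <postgresql-container> \
  psql -U zammad -d zammad_production \
  -c "SELECT pg_filenode_relation(0, 17391);"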

Some background

We have set up Zammad in a GitLab CI environment with persistent storage on an NFS server in the same subnet.

swarm-compose.yml

version: '3.7'

services:

  zammad-backup:
    command: ["zammad-backup"]
    depends_on:
      - zammad-railsserver
    entrypoint: /usr/local/bin/backup.sh
    environment:
      - BACKUP_SLEEP=86400
      - HOLD_DAYS=10
      - POSTGRESQL_USER=${POSTGRES_USER}
      - POSTGRESQL_PASSWORD=${POSTGRES_PASS}
    image: ${IMAGE_REPO}:zammad-postgresql${VERSION}
    links:
      - zammad-postgresql
    restart: ${RESTART}
    volumes:
      - zammad-backup:/var/tmp/zammad
      - zammad-data:/opt/zammad

  zammad-elasticsearch:
    environment:
      - discovery.type=single-node
    image: ${IMAGE_REPO}:zammad-elasticsearch${VERSION}
    restart: ${RESTART}
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

  zammad-init:
    command: ["zammad-init"]
    depends_on:
      - zammad-postgresql
    environment:
      - POSTGRESQL_USER=${POSTGRES_USER}
      - POSTGRESQL_PASS=${POSTGRES_PASS}
    image: ${IMAGE_REPO}:zammad${VERSION}
    links:
      - zammad-elasticsearch
      - zammad-postgresql
    restart: on-failure
    volumes:
      - zammad-data:/opt/zammad

  zammad-memcached:
    command: memcached -m 256M
    image: memcached:1.6.9-alpine
    restart: ${RESTART}

  zammad-nginx:
    command: ["zammad-nginx"]
    depends_on:
      - zammad-railsserver
    image: ${IMAGE_REPO}:zammad${VERSION}
    links:
      - zammad-railsserver
      - zammad-websocket
    restart: ${RESTART}
    volumes:
      - zammad-data:/opt/zammad
    environment:
      - VIRTUAL_HOST=${SWARM_PROD_URL}
      - NGINX_SERVER_SCHEME=https
    networks:
      - default
      - ${PROXY_NETWORK}
    deploy:
        resources:
          limits:
            cpus: '4.0'
            memory: 8GB
        labels:
          - traefik.backend.loadbalancer.stickiness=true
          - traefik.frontend.rule=Host:${SWARM_PROD_URL}
          - traefik.port=8080
        restart_policy:
          condition: any
        update_config:
          parallelism: 1
          order: start-first
        replicas: 1

  zammad-postgresql:
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASS}
    image: ${IMAGE_REPO}:zammad-postgresql${VERSION}
    restart: ${RESTART}
    volumes:
      - postgresql-data:/var/lib/postgresql/data

  zammad-railsserver:
    command: ["zammad-railsserver"]
    depends_on:
      - zammad-memcached
      - zammad-postgresql
    image: ${IMAGE_REPO}:zammad${VERSION}
    links:
      - zammad-elasticsearch
      - zammad-memcached
      - zammad-postgresql
    restart: ${RESTART}
    volumes:
      - zammad-data:/opt/zammad

  zammad-scheduler:
    command: ["zammad-scheduler"]
    depends_on:
      - zammad-memcached
      - zammad-railsserver
    image: ${IMAGE_REPO}:zammad${VERSION}
    links:
      - zammad-elasticsearch
      - zammad-memcached
      - zammad-postgresql
    restart: ${RESTART}
    volumes:
      - zammad-data:/opt/zammad

  zammad-websocket:
    command: ["zammad-websocket"]
    depends_on:
      - zammad-memcached
      - zammad-railsserver
    image: ${IMAGE_REPO}:zammad${VERSION}
    links:
      - zammad-postgresql
      - zammad-memcached
    restart: ${RESTART}
    volumes:
      - zammad-data:/opt/zammad

volumes:
  elasticsearch-data:
    driver: local
    driver_opts:
     type: nfs
     o: addr=192.168.7.157,rw,soft,nolock,vers=4.1
     device: ":/storage/${SWARM_PROD_URL}/elasticsearch-data"
  postgresql-data:
    driver: local
    driver_opts:
     type: nfs
     o: addr=192.168.7.157,rw,soft,nolock,vers=4.1
     device: ":/storage/${SWARM_PROD_URL}/postgresql-data"
  zammad-backup:
    driver: local
    driver_opts:
     type: nfs
     o: addr=192.168.7.157,rw,soft,nolock,vers=4.1
     device: ":/storage/${SWARM_PROD_URL}/zammad-backup"
  zammad-data:
    driver: local
    driver_opts:
     type: nfs
     o: addr=192.168.7.157,rw,soft,nolock,vers=4.1
     device: ":/storage/${SWARM_PROD_URL}/zammad-data"


networks:
     traefik-net:
         external: true

.env

# don't forget to add the minus before the version
VERSION=-4.0.0-7
IMAGE_REPO=zammad/zammad-docker-compose
POSTGRES_PASS=zammad
POSTGRES_USER=zammad
RESTART=always
PROXY_NETWORK=traefik-net
NGINX_SERVER_SCHEME=https
SWARM_PROD_URL=zammad01.domain.com

.gitlab-ci.yml

image: docker:stable

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"

services:
    - name:  docker:dind

stages:
  - deploy

Deploy Zammad:
  stage: deploy
  script:
    - echo "Deploy Zammad to production"
    - set -a && . .env && set +a
    - docker stack deploy --compose-file swarm-compose.yml zammad01-domain-com
  when: always

Is there a way around this, or do we need to redeploy the whole thing? Also, are there any recommendations regarding persistent storage, especially for the DB?

Same thing happens for the LDAP sync (bartas is an LDAP user):

An error occurred: PG::DataCorrupted: ERROR: could not read block 46 in file "base/16385/17391": read only 0 of 8192 bytes : SELECT "active_job_locks".* FROM "active_job_locks" WHERE "active_job_locks"."lock_key" = $1 LIMIT $2 FOR UPDATE

Usually this isn’t fixable and you should recover from a backup. PostgreSQL is quite unforgiving about corrupted storage files. Hosting database files on a network share can be tricky business and should be tested thoroughly; fsync and allocation calls in particular can be mangled by the network share. Otherwise it could be that you have an underlying hardware issue, such as a broken disk. Please also see the discussion here: Issue with LDAP
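
If you want to double-check the fsync concern and see how much data is still readable before redeploying, a hedged sketch using tools that ship with PostgreSQL (paths and the database name are assumptions, not taken from this setup):

# Sketch: probe fsync behaviour of the NFS-backed data directory.
# pg_test_fsync ships with PostgreSQL; point it at a file on the NFS volume.
pg_test_fsync -f /var/lib/postgresql/data/fsync_probe

# Sketch: try to salvage whatever is still readable into a plain dump before
# redeploying; the dump will abort when it hits the corrupted blocks.
pg_dump -U zammad -d zammad_production > /var/tmp/zammad/salvage.sql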

This is a fresh install for testing, so I guess we’ll come up with some other plan for the PostgreSQL database in our swarm.

The problem is that we can’t keep the data locally, since it wouldn’t be persistent in the swarm - a restart would lose all data. Maybe that’s fixable; I don’t know right now and would have to investigate it. Do you have any suggestions? I mean, is it possible to tune NFS to suit PostgreSQL? For example, is sync better than async? Are there any other export options that would be good to set?
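
For context, the sync/async question concerns the export options on the NFS server, and PostgreSQL's documentation additionally recommends hard mounts on the client rather than soft ones. A hedged sketch of what that could look like (the export path and server address come from the compose file above; the client subnet and the exact option set are assumptions that would still need testing):

# Sketch: server-side /etc/exports entry using sync, so writes reach stable
# storage before the NFS server acknowledges them (<client-subnet> is a placeholder):
#   /storage/zammad01.domain.com/postgresql-data <client-subnet>(rw,sync,no_subtree_check)
#
# Sketch: client-side test mount with a hard mount instead of soft,nolock;
# the equivalent compose driver_opts string would be roughly
# "addr=192.168.7.157,rw,hard,vers=4.1".
mkdir -p /mnt/pgtest
mount -t nfs -o rw,hard,vers=4.1 192.168.7.157:/storage/zammad01.domain.com/postgresql-data /mnt/pgtest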

Thanks for helping out here!

Well, PostgreSQL itself advises at least running in non-cached mode (see: PostgreSQL: Documentation: 9.0: Creating a Database Cluster), but it probably requires some more tuning than that. I’m running on a GlusterFS filesystem, with PostgreSQL on a volume configured for direct I/O, which runs fine apart from a couple of minor issues I can live with, and those should be fixed in the latest version (9). You could also investigate other distributed filesystems; there are many.
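
For reference, a rough sketch of what "a volume configured for direct I/O" can look like with a GlusterFS FUSE mount (host and volume names are placeholders; verify the options against your Gluster release):

# Sketch: mount the Gluster volume holding the PostgreSQL data with direct I/O
# enabled on the FUSE client ("gluster-node1" and "pgvol" are placeholder names).
mount -t glusterfs -o direct-io-mode=enable gluster-node1:/pgvol /mnt/pgdata
# Gluster also has volume-level performance/caching settings that are commonly
# tuned down for database workloads; check the Gluster docs for your version.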

Just talked to my CTO, we will run it on iSCSI instead.

Thanks a lot for your help!
