Server regularly stops responding

Infos:

  • Zammad Version 3.6
  • Installed from apt repository
  • Ubuntu Server (one instance on 18.04.5, the other on 20.04.1)
  • Firefox 84 (issue seen in all browsers)
  • LDAP & Microsoft 365 login integration
  • Email fetched from Microsoft 365; this was IMAP until recently, with the same behaviour
  • No other integrations enabled; no incoming API use

Expected behaviour:

  • Server works (!)

Actual behaviour:

  • After some time the web service stops responding. Sometimes the page loads but with no content (a plain white page); I suspect browser caching gets me that far, and a reload normally gives an nginx error (502 Bad Gateway). Occasionally, if the page is already open, I get a 'could not connect' message, but I think this is the same thing: a reload at that point tends to give the nginx gateway error (see the quick check below).
  • Sometimes emails stop being collected (not always; maybe 50/50). I can see the incoming mailbox growing, so I know they're not being fetched: messages are normally removed from the mailbox once imported.
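
When it's in that state, a quick way to tell whether nginx or the application has died is to hit the Rails backend directly on the box (127.0.0.1:3000 is the package default upstream; check the upstream in your own nginx vhost):

    # the packaged nginx config proxies to the Zammad web server on localhost:3000
    curl -sI http://127.0.0.1:3000/ | head -n 1
    # no answer here while nginx itself still responds = the app side has hung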

Steps to reproduce the behaviour:

  • On my system (mostly just me using it), it's always broken first thing in the morning. The user system (15 users, 3 or 4 active at most) just stops at irregular intervals during the day.
  • It's possible that it's triggered by closing the browser session. People come and go more often on the user system; I'm logged on all day and only close it in the evening. I think I once noticed it stop when I restarted the browser during the day, but I was fixing something else at the time and wasn't paying attention.
  • htop shows no real difference once the system has died: 2.0-2.8 GB of memory used (out of 8 GB), and CPU use is always low.

Workaround:

  • Restarting the zammad service from the shell clears the problem, but adding a regular restart to the crontab doesn't seem to help. I've also added a restart for Elasticsearch, to no effect.
  • I found that restarting PostgreSQL will also clear the problem, but it sometimes leaves odd effects in the browser. Restarting Postgres and Zammad from the crontab (entries as sketched below) seems to have helped the user system somewhat, but hasn't entirely removed the problem. It didn't help my system.
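
For reference, the crontab additions are just scheduled service restarts, something like this (in root's crontab; the times are arbitrary, I picked a quiet period):

    # restart PostgreSQL, then Zammad, early each morning
    30 5 * * * systemctl restart postgresql
    35 5 * * * systemctl restart zammad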

I’m not sure which logs I would need to look at, but I can’t find anything out of the ordinary in those I’ve looked through.
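
In case anyone can point me in the right direction, these are the places I've been checking (paths as on a standard apt install):

    # Zammad application log (package install default)
    tail -f /var/log/zammad/production.log

    # nginx errors; the bad gateway entries show up here
    tail -f /var/log/nginx/error.log

    # systemd messages for the Zammad services
    journalctl -u zammad -e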

Well, I've got somewhere: it seems to be related to having a proxy in front of it (in addition to the local nginx proxy). I had it running through Cloudflare, as that adds protection and gives an easy way of using the routable IPv6 address without the service being available only to IPv6-enabled users.

I swapped this for an HAProxy server (to do SNI resolution on a single IPv4 address; roughly the setup sketched below), and one of the servers seemed to stay up that day. The other still went down overnight.
Yesterday I changed the DNS so that all clients go through this proxy (previously anyone in the office or on the VPN was accessing the nginx service directly), and both seem to have stayed up since then.
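
For what it's worth, the HAProxy side is nothing special: plain TLS passthrough in TCP mode, routing on the SNI name. Hostnames and backend addresses below are placeholders, not our real ones:

    frontend https_in
        bind :443
        mode tcp
        # wait for the TLS ClientHello so the SNI name is available
        tcp-request inspect-delay 5s
        tcp-request content accept if { req_ssl_hello_type 1 }
        # route on the requested hostname
        use_backend zammad_a if { req_ssl_sni -i helpdesk.example.com }
        use_backend zammad_b if { req_ssl_sni -i support.example.org }

    backend zammad_a
        mode tcp
        server a 192.0.2.10:443 check

    backend zammad_b
        mode tcp
        server b 192.0.2.11:443 check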

There’s obviously a problem somewhere, but at least I seem to have figured out the trigger now and so can avoid it.

TL;DR: it seems to be caused by a mix of direct and proxied traffic to the server.

The web side seems to have been fixed by the change to the proxying, but the scheduler was still hanging constantly. I was on the verge of deciding there were too many problems with Zammad as a whole and was looking at alternatives (again).
The web frontend was hanging, the scheduler was giving up, no emails were coming in, and LDAP was having trouble, disabling accounts all over the place. It was a mess.

As a last resort, I decided to reinstall on a completely fresh VM, and after one day I'm not seeing any of the issues we had with the previous installs. I did a fresh install and went through the instructions again (twice, once for each system), and it's all working fine now on both sides.
I'll have to keep a close eye on it, but at the moment it seems like something was badly messed up somewhere on the old systems and a clean install has got rid of it.
Fortunately, whatever it was didn't carry through with the backups: taking a backup and moving it to the new system (roughly the steps below) went as smoothly as you could wish for and was done in ten minutes.
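
For anyone doing the same move, I used the backup scripts shipped with the package (paths as in the contrib/backup directory of a standard install; check the docs for your version, and note the restore script prompts for which backup to use):

    # on the old system: dump the files and the database
    /opt/zammad/contrib/backup/zammad_backup.sh
    # archives land in the directory set in contrib/backup/config/backup.config

    # copy the archives to the same location on the new VM, then:
    systemctl stop zammad
    /opt/zammad/contrib/backup/zammad_restore.sh
    systemctl start zammad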
