[solved] Zammad memory leak?

Maybe web_concurrency=4 is a bit much for only 4 cores?
That doesn’t address the RAM usage, but it might lead to performance issues (as you have other services that need some resources too).
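If you want to sanity-check that: a common rule of thumb is roughly one Puma worker per CPU core, leaving some headroom for Elasticsearch and PostgreSQL on the same host. A quick sketch (assuming the settings live in the Zammad environment file as shell exports):

# number of available CPU cores
nproc
# e.g. with 4 cores, leave one worker's worth of headroom for Elasticsearch/PostgreSQL
export WEB_CONCURRENCY="3"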

Maybe @thorsteneckel has an idea regarding the memory usage of Puma.

Hi there - have you tried setting the Ruby MALLOC vars for your memory-hungry processes?

Rails’ ActiveRecord (the ORM) holds on to certain object instances for quite a long time. With these ENVs they get cleaned up more frequently.

Hello Thorsten,

Yes, I saw that post from Johannes before.
These are our Zammad config settings:

export MIN_THREADS="6"
export MAX_THREADS="30"
export WEB_CONCURRENCY="4"
export RUBY_GC_MALLOC_LIMIT="1077216"
export RUBY_GC_MALLOC_LIMIT_MAX="2177216"
export RUBY_GC_OLDMALLOC_LIMIT="2177216"
export RUBY_GC_OLDMALLOC_LIMIT_MAX="3000100"

So the first two malloc values are already the same.
The last two are even lower than the ones suggested. Wouldn’t that mean that garbage collection runs even more often, or am I misinterpreting the values here? Would changing them to the suggested ones make a difference?
How are MIN_THREADS and MAX_THREADS to be interpreted? As the actual number of available CPU threads, or in the sense that Zammad can open up to that many threads on the OS for processing?

BR,
Nino

Hello,

We have now tested what happens to the system when RAM runs out…
The result is not pleasing. Basically, all “unnecessary” processes get killed. About 70% of all PostgreSQL processes were dropped, and Elasticsearch was killed completely, leaving the search in the GUI unresponsive; Zammad was also not able to query users or tickets from the GUI. Swap was only about 40% full.
Luckily incoming tickets were not blocked, as PostgreSQL was still running. Very slowly, but still running, so nothing was lost.

Our “workaround” is to restart the server daily with a cronjob… This is not great, but it is the only way we currently see to keep the system stable.
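For reference, the daily restart could look like this in /etc/crontab (a sketch only; the service name and the time are assumptions, and only one of the two lines would be used):

# restart the Zammad services every night at 04:00
0 4 * * * root systemctl restart zammad
# or, alternatively, reboot the whole machine
0 4 * * * root /sbin/shutdown -r now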

We would really appreciate it if you could guide us towards a proper setup here, so we do not risk running out of RAM again after about a week.

Br,
Nino

That’s a very bad workaround and doesn’t solve the underlying problem.
If this really is a Zammad issue, it needs testing, as it might hit anyone.

The problem is, dozens of people use Zammad in production with even more concurrently active agents.

Small hint regarding the killing of processes: this is Linux (OOM killer) behavior that you can configure.
And another thing: you should always try to avoid using swap, as your system’s performance will decrease significantly.
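For example, you can tell the OOM killer to prefer other processes over PostgreSQL and make the system less eager to swap. A rough sketch as root, not tuned to your setup (paths and values are just examples):

# make the OOM killer pick PostgreSQL last (systemd drop-in)
mkdir -p /etc/systemd/system/postgresql.service.d
cat > /etc/systemd/system/postgresql.service.d/oom.conf <<'EOF'
[Service]
OOMScoreAdjust=-900
EOF
systemctl daemon-reload

# only fall back to swap when memory is nearly exhausted
echo "vm.swappiness=10" > /etc/sysctl.d/99-swappiness.conf
sysctl --system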

How many max_connections does your PostgreSQL allow? It sounds like you’re allowing too much RAM somewhere.

Hi,

Agreed that it is not a solution but merely a workaround.
The only change from the defaults in our PostgreSQL config is max_connections, raised from the default 100 to 200, because we got internal 500 errors when increasing WEB_CONCURRENCY.
See here for more info: [SOLVED] Internal server error 500 when WEB_CONCURRENCY is set >1
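For context: if I understand it correctly, each Puma worker can open up to roughly one database connection per thread, so WEB_CONCURRENCY=4 x MAX_THREADS=30 already means up to ~120 connections, which would explain why the default of 100 was not enough. The corresponding line in postgresql.conf is simply:

# postgresql.conf
max_connections = 200    # default: 100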

However, these two changes started the slow but constant increase in memory usage.

Do they have regular maintenance periods where servers or services might be restarted?
Or do they run zero-maintenance, 100% uptime setups?

Fair enough, but it should never have gotten to that point. The swappiness of the system is currently set so that it starts using swap when less than 10% of memory is available.

Br,
Nino

If this happened after increasing max_connections, it could be that Postgres is configured to use more memory than is available, but not necessarily; it might be WEB_CONCURRENCY making the difference. Whereas previously roughly 100 threads could open a connection to Postgres, now there is effectively no such limit.

I would confirm that Postgres can’t use more than “(total memory - Elasticsearch memory) / 2” with max_connections fully in use, and then lower MAX_THREADS, maybe halve it. This is just me experimenting: making a change that makes sense without knowing for sure what the exact problem is.
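A very rough way to estimate the worst case for Postgres (a simplification; work_mem can apply several times per query, so treat it as a ballpark) is shared_buffers + max_connections * work_mem, plus maintenance_work_mem. The values can be pulled from the config, assuming a Debian-style layout; the example numbers below are placeholders:

# very rough worst case: shared_buffers + max_connections * work_mem (+ maintenance_work_mem)
# e.g. 2GB + 200 * 4MB + 64MB ≈ 2.9 GB
grep -E "shared_buffers|work_mem|max_connections" /etc/postgresql/*/main/postgresql.conf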

Hi,

Thank you for the input.
We will try out some of the suggestions. PostgreSQL currently has ~150 processes running on average, so lifting max_connections to 200 was definitely the right move.
However, testing different parameters is a bit tedious, as we can only tell after 4-5 days of running whether we are heading into a memory issue again or whether memory usage plateaus.
Over the Christmas holidays we will stick with the restart strategy and try different setups once our IT team is complete again.
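To keep an eye on the connections during those test runs, the live count can also be queried directly instead of counting processes (assuming local access as the postgres user):

# current number of connections vs. the configured limit
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
sudo -u postgres psql -c "SHOW max_connections;"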

We will try setting MAX_THREADS to a lower number first and see if it makes a difference.
If not, we will check the PostgreSQL config and try to fine-tune it.
Any other suggestions are welcome as well.

Best regards,
Nino

The way I see it, you’re going to have to restart every few days until it’s fixed anyway, so adjusting MAX_THREADS blindly wouldn’t be a problem for me; I’d do it and see what happens.

All it’s going to do is lower the maximum amount of memory Rails will use; I’m assuming the more threads, the more memory it can use. It will still start low and grow over time, but the potential maximum it could grow to will be lower. I don’t think the growing memory use is necessarily a memory leak.

Also, whatever created that graph you posted might be able to create graphs of processes and threads, so you might be able to narrow down exactly where the memory is being used. If not, you can use the ps command.
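For example, something along these lines shows the biggest memory consumers and the per-process thread counts (plain ps; the grep pattern assumes the Rails processes show up as “puma”):

# top memory consumers by resident set size (RSS, in KB)
ps -eo pid,rss,cmd --sort=-rss | head -n 15
# thread count and memory per Puma worker
ps -eo pid,nlwp,rss,cmd | grep '[p]uma'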

The thing is that this is probably not a memory leak by definition. A memory leak is indicated by a continuous and steep rise in memory usage. Here is a picture that shows it pretty well:

While it’s not exactly the same as yours, the image in the lower right is what we face here. We already analyzed this earlier for one of our customers and found out that Rails holds on to the ActiveRecord instances it creates (which are quite a few in your system, according to the key figures you provided). We then found the ENVs posted earlier and they did the trick for us/the customer. Since then we have used/recommended them a couple of times and they did their job.

So I think the question is more why the ENVs aren’t working as expected. I think it’s the combination of WEB_CONCURRENCY=4 and the 2 GB RAM limit per Ruby process. Quick math: 4 * 2 = 8. This means that your Puma process is allowed to take 8 GB of RAM, plus 2 GB for the Scheduler and 2 GB for the Websocket server, which adds up to 12 GB of RAM allowed for the Ruby processes alone - Elasticsearch and PostgreSQL not included :scream:

Conclusion: Could you please try to limit the allowed memory size to 1 or 1.5 GB of RAM via the ENVs?

Hi @thorsteneckel,

Thank you for your answer.
For clarification: with which ENV can I set this limit?
Is it one of these?
MIN_THREADS="6"
MAX_THREADS="30"
RUBY_GC_MALLOC_LIMIT="1077216"
RUBY_GC_MALLOC_LIMIT_MAX="2177216"
RUBY_GC_OLDMALLOC_LIMIT="2177216"
RUBY_GC_OLDMALLOC_LIMIT_MAX="3000100"

Or is it a completely separate value?

BR,
Nino

Hi @thorsteneckel,

Could you point me to the correct ENV that needs to be configured to limit the ruby process to 1-1.5 GB?

BR,
Nino

Hi @SEGGER-NV - sorry for my late response. We had some internal technical issues I needed to resolve first. However, I can only tell you from our experience: we maintain systems of a similar or bigger size than yours with those ENVs set, without (memory) problems. As mentioned earlier, applying them back when we faced memory issues resolved those issues. I’m afraid I can’t help you any further within the scope of free community support, because this seems to be a specific error in your installation/setup. We focus our community work on resolving issues that affect the majority. However, we’re happy to help you through our commercial enterprise services.

Hello,

This is quite a weird statement as you wrote earlier:

So this statement raised the questions:

  • Which ENVs exactly?
  • Where do you get your “quick math” values from?
  • Is this somewhere documented?

Could you answer these? It seems that information would be quite valuable to the community, seeing all the threads popping up discussing server setups.

It is weird to tease an “obviously simple answer” and then not specify it. At least that is the vibe I got from that post.
I don’t mind getting my hands dirty, but I would like to know what I am up against without an avoidable trial-and-error spiral.

BR,
Nino

Hi Nino,

once again: sorry for my late response, and sorry for the confusion. To be honest, I wasn’t directly involved in the task and this is all the information I could gather. To provide you with a proper answer I’d have to invest quite an amount of time that I currently don’t have, due to the Knowledge Base development and support for our customers. However, I’ll provide the information as soon as I can (which might take quite a while). Unlike with proprietary software, we finance Zammad through services, so we have less money at our disposal to push Zammad forward.

If you would like to support Zammad, you can do so via Zammad Support: https://zammad.com/pricing#selfhosted

Thank you for your understanding.

Hi Thorsten,

Thank you for your elaborate response.
It seems it was a misunderstanding on my part, then, that the setup could be “easily” fixed.
I can totally understand that you have to tend to your paid projects first.
In the meantime we will see if we can find a solution ourselves with some trial and error, and keep you posted as well.
Should you find the time to investigate this further yourself, any hints are welcome.

Best regards,
Nino

Little update.
We have now reduced WEB_CONCURRENCY to 3 and MAX_THREADS to 16.
This seems to have drastically reduced memory usage while keeping the speed benefits from earlier.
After 5 days of testing, instead of an overflow we are currently hovering at around 58% active memory usage.
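For reference, the relevant lines in our environment file now read:

export WEB_CONCURRENCY="3"
export MAX_THREADS="16"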

We will keep tracking this a bit longer and see if we run into memory issues again or if we are good now.
Keep you posted.

OK, we seem to be stable now. After 13 days of uptime we hover at around 65% usage of the 16 GB of RAM.
You can mark this thread as solved now.
Finally we can give the memory-hungry Elasticsearch a bit more RAM to play with :wink:

BR,
Nino

Hey there,

thank you very much for your feedback, glad you could get it to a stable state!

Hi Nino! Thanks for your feedback. I’m glad that you managed to find a way - kudos for that! I just read a great article about the memory consumption of Ruby applications. They have already created a testable patch that will probably get evaluated over at GitHub, Discourse, etc. We’ll keep an eye on that.
However, the article mentions a workaround: setting the ENV MALLOC_ARENA_MAX=2 (for the session of the Ruby process). This should reduce memory consumption a lot, but comes at the cost of higher CPU usage. We haven’t tested it yet, because the memory consumption of our hosted instances is quite good. Anyhow, I just wanted to share the knowledge in this context so that maybe someone can profit from it.
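If someone wants to try it, it is just another environment variable for the Ruby processes, e.g. set next to the other exports and followed by a restart of the Zammad services; whether it helps in a given setup is untested on our side:

# glibc: limit malloc to 2 arenas per process - less memory, slightly more CPU
export MALLOC_ARENA_MAX=2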