Failed Health Check due to SMTP Driver Read Timeout Error

asqdmin · August 4, 2023, 12:56pm

Infos:

Used Zammad version: 5.4.1
Used Zammad installation type: docker-compose
Operating system: debian 11

Expected behavior:

If we can ignore the message, the health check should ignore it too

Actual behavior:

The health check fails

Steps to reproduce the behavior:

Adding M365 as an E-Mail channel and waiting

Hi,

Our health check fails multiple times a week with the error message SMTP Driver Read Timeout Error.

We’ve set up M365 as an E-Mail channel and haven’t noticed any problems with fetching or sending mails.

But we monitor the health check via the API at https://our.zammad.instance/api/v1/monitoring/health_check?token=1234, and this is resulting in multiple alerts.
We could relax our monitoring triggers to only alert if the health check fails 10 times in a row, but this doesn’t seem like a good solution…

There was a topic with the same error message in 2020:

If the answer is still that we can ignore the error, shouldn’t the health check ignore it too?

A health check that fails because of an error that can be ignored does not make sense to me.

Best regards!

MrGeneration · August 5, 2023, 12:56pm

That’s a regular error on Microsoft 365 that, at least for us, appears not daily but at least weekly. For some time the SMTP service (sometimes IMAP as well) just doesn’t want to talk to us. That usually self-heals within 30 minutes or less.

Channels in Zammad are self-healing. That means if sending an email fails and you don’t try again for, let’s say, one day, the error will persist until you try again after those 24 hours (provided that the sendout then succeeds).

Timeout errors can occur - that’s perfectly normal unless they persist for several hours or so. You’re generally at the mercy of Microsoft’s uptime and service quality in that regard. Of the big players, Microsoft 365 is the only one that regularly pops up in our monitoring with connectivity issues. It might be load balancing and outdated DNS, or a service being temporarily out of service. But oh well, what can you do.

Yes and no.
For this, please see my answer from back then in context:

The monitoring endpoint returns the current as-is situation. That’s what you’d usually want anyway. It’s not designed to decide which errors are okay to ignore, because admins may see that differently. It would also mean the endpoint would have to track how long a specific issue has been present, which it does not do.
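Since the endpoint itself does not track how long an issue has been present, that bookkeeping would have to live on the monitoring side. The following is only a rough sketch of the idea, not anything Zammad ships: `IssueTracker` and its `grace_seconds` parameter are made-up names, and it assumes your monitoring script already has the current issue messages as a list of strings. The 30-minute default just mirrors the self-healing window mentioned above.

```python
import time
from typing import Dict, Iterable, List, Optional


class IssueTracker:
    """Hypothetical helper: remembers when each issue message was first seen."""

    def __init__(self, grace_seconds: float = 1800.0) -> None:
        # 1800 s = 30 minutes; an assumption, tune it to your own tolerance.
        self.grace_seconds = grace_seconds
        self.first_seen: Dict[str, float] = {}

    def update(self, issues: Iterable[str], now: Optional[float] = None) -> List[str]:
        """Record the current issues and return those older than the grace period."""
        now = time.time() if now is None else now
        current = set(issues)

        # Forget issues that have healed in the meantime.
        for issue in list(self.first_seen):
            if issue not in current:
                del self.first_seen[issue]

        # Remember the first time each still-present issue showed up.
        for issue in current:
            self.first_seen.setdefault(issue, now)

        return sorted(
            issue
            for issue, seen in self.first_seen.items()
            if now - seen >= self.grace_seconds
        )
```

Calling `update()` on every poll returns an empty list while an issue is younger than the grace period, and the issue's timer resets as soon as it disappears from one poll.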

You can fine-tune your individual monitoring triggers for specific scenarios if they don’t satisfy your internal needs. But that’s most likely not relevant for the broad mass of people.
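As one possible way to tune such a trigger, here is a minimal sketch that only alerts after several failed polls in a row. It is an illustration under assumptions, not a recommended setup: `ConsecutiveFailureTrigger`, `fetch_health` and the threshold of 3 are hypothetical, and the code assumes the health check response is JSON with a boolean `healthy` field.

```python
import json
import urllib.request


def fetch_health(url: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the health check response (assumed to be JSON)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


class ConsecutiveFailureTrigger:
    """Hypothetical trigger: fires once `threshold` polls in a row were unhealthy."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def observe(self, healthy: bool) -> bool:
        """Feed one poll result; return True if an alert should be raised."""
        if healthy:
            # Any healthy poll resets the streak.
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

In a polling loop this could be used as `trigger.observe(bool(fetch_health(url).get("healthy")))`, with `url` being the health check URL including your token. The trade-off is that genuine outages are reported later, by the polling interval times the threshold.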

I, for example, always want to see the full extent of issues.

asqdmin · August 7, 2023, 6:00am

Thank you for your fast response!

I will try to tweak my monitoring.

system · August 14, 2023, 6:01am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.