If we can ignore the message, the health check should ignore it too
Actual behavior:
Health check fails
Steps to reproduce the behavior:
Add M365 as an email channel and wait
Hi,
our health check fails multiple times a week with the error message SMTP Driver Read Timeout Error.
We've set up M365 as an email channel and we haven't noticed any problems with fetching and sending mails.
But we are monitoring the health check through the API via https://our.zammad.instance/api/v1/monitoring/health_check?token=1234 and this is resulting in multiple alerts.
We could relax our monitoring triggers to only alert if the health check fails 10 times in a row. But this doesn't seem to be a good solution…
There was a topic with the same error message in 2020:
If the answer is still that we can ignore the error, shouldn't the health check ignore it too?
A health check that fails because of an error which can be ignored does not make any sense to me.
That's a regular error on Microsoft 365 that, at least for us, appears not daily but at least weekly. For some time the SMTP services (sometimes IMAP as well) just don't want to talk to us. That usually self-heals within 30 minutes or less.
Channels in Zammad are self-healing. That means if you can't send an email and don't try again for, let's say, one day, the error will persist until you try again after 24 hours (given that the sendout is then okay).
Timeout errors can occur - that's perfectly normal unless they're persistent for several hours etc. You're generally at the mercy of Microsoft's uptime and service quality in that regard. Of the big players, Microsoft 365 is the only one that regularly pops up in our monitoring with connectivity issues. Might be load balancing and outdated DNS or a service being out of service temporarily. But oh well, what can you do.
Yes and no.
For this, please see my answer from back then in context:
The monitoring endpoint returns the current situation as is. That's what you'd usually want to have anyway. It's not designed to decide "what's an okay-to-ignore error" because admins may see that differently in general. It would also have to track how long a specific issue has been present, which it does not do.
You can fine-tune your individual monitoring triggers for specific scenarios if they don't satisfy your internal needs. But that's most likely not relevant for the broad mass of people.
I, for example, always want to see the full extent of issues.
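For anyone who wants to do that fine-tuning on the monitoring side, here is a minimal sketch of a time-based trigger, not an official Zammad tool. It alerts only once the health check has been failing for longer than a grace period, so short outages that self-heal stay quiet. The names `FailureTracker`, `fetch_health` and `GRACE_SECONDS` are made up for this example, the 30 minute grace period is just the figure mentioned above, and the code assumes the endpoint answers with a JSON body that has a boolean `healthy` field and a `message` field. Please verify that against your own instance.

```python
import json
import urllib.request

# Hypothetical values: adjust to your own instance and tolerance.
HEALTH_URL = "https://our.zammad.instance/api/v1/monitoring/health_check?token=1234"
GRACE_SECONDS = 30 * 60  # tolerate failures that self-heal within 30 minutes


class FailureTracker:
    """Remembers when the current run of failed checks started."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.failing_since = None  # timestamp of the first failure in this run

    def should_alert(self, healthy, now):
        """Return True only once a failure has lasted at least the grace period."""
        if healthy:
            self.failing_since = None  # a recovery resets the timer
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.grace_seconds


def fetch_health(url=HEALTH_URL, timeout=10):
    """Query the endpoint; assumes a JSON body with a boolean "healthy" field."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.load(response)
    except (OSError, ValueError) as exc:
        # Network errors and unparsable bodies are treated as unhealthy.
        return False, str(exc)
    return bool(body.get("healthy")), str(body.get("message", ""))
```

To use it, call `fetch_health()` from your scheduler (for example once per minute), pass the result together with `time.time()` to a long-lived tracker's `should_alert`, and only raise the alert when it returns `True`. The endpoint itself still reports the current state; the tracking of how long an issue has been present lives entirely in your monitoring.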