When we introduced our FAQ in November last year, we did so with the intention of adding questions and answers to it on an ongoing basis.
To highlight some of the most common issues reported by you, our customers, we will also publish some of the additions here on the Royal Pingdom blog. This is the first such entry.
Pingdom says my site is down, but it is not
You may find yourself in a situation where our monitoring service reports that your site or server is unavailable, even though you know it’s up and running.
Several things can explain such a discrepancy, and we want to give you some idea of where to look to figure it out.
First of all, please note that Pingdom is an external monitoring service. This means that our probe servers, located around the world, connect to your site or server from outside the local network where it’s hosted. Your site or server may therefore still be accessible locally even though Pingdom reports it as down.
It is also important to understand that when one of our probe servers cannot connect to a site or server, Pingdom’s system automatically asks another probe server to try the same connection. We call this a “second opinion.” Your check (site or server) will only be marked as confirmed down if the second test also fails.
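As a rough illustration, the second-opinion rule can be sketched in Python as follows. This is only a minimal sketch, not Pingdom's actual implementation: the function name confirmed_down and the idea of passing each probe as a callable that returns True when it can connect are assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical probe type: takes a target and returns True if it could connect.
Probe = Callable[[str], bool]


def confirmed_down(target: str, probes: Sequence[Probe]) -> bool:
    """Return True only if the first probe AND a second probe both fail."""
    if len(probes) < 2:
        raise ValueError("a second opinion needs at least two probes")
    first, second = probes[0], probes[1]
    if first(target):
        return False  # the first probe connected, so the check is up
    # The first probe failed: ask a different probe for a second opinion.
    return not second(target)
```

With this rule, a failure seen by only one probe never produces a confirmed outage; both attempts have to fail.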
To find out what caused an outage, we recommend looking at the Root Cause Analysis and the Test Result Log, which show further details about the outage.
If the outage was short (lasting no more than a few minutes) or intermittent, it was most likely caused by an occasional issue somewhere between the probe server locations and your site or server. This makes it very hard to determine the exact cause of the problem. Please note, though, that because the check was seen as down from at least two of our probe server locations, the cause of the issue may be located close to your server.
Besides this, less common causes include firewalls that block our probe servers, blacklists, and cached DNS records.
If you filter requests with the help of a blacklist, or use a firewall in front of your site or server, then please make sure that you whitelist our probe servers. Also, keep the whitelist up to date in case new probe servers are added to the Pingdom service, or details of existing probe servers change. We always announce new probe servers as well as changes to existing ones several days before deployments are made. Please read about how to find a list of our probe servers and their details here.
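To make the whitelist idea concrete, here is a minimal Python sketch of checking a connecting address against a list of allowed probe addresses. The addresses below are placeholders from the reserved documentation ranges, not real Pingdom probe servers, and the names PROBE_WHITELIST and is_whitelisted are made up for this example. In practice the rule would live in your firewall or filter configuration, fed from the published probe server list.

```python
import ipaddress

# Placeholder entries from reserved documentation ranges.
# These are NOT real Pingdom probe servers.
PROBE_WHITELIST = ["192.0.2.10", "198.51.100.0/28"]


def is_whitelisted(client_ip: str, whitelist=PROBE_WHITELIST) -> bool:
    """Return True if client_ip matches a single address or a network in whitelist."""
    addr = ipaddress.ip_address(client_ip)
    for entry in whitelist:
        # A bare address such as "192.0.2.10" is treated as a one-host network.
        network = ipaddress.ip_network(entry, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False
```

Keeping the entries in one list like this makes it easier to update them when probe servers are added or changed.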
Are the errors for the outage in the Test Result Log “Unknown target” or “DNS error”? In such cases, the outage is most likely related to the propagation of new DNS records or to cached NXDOMAIN records. Each of our probe servers runs its own Bind9 caching DNS server as its DNS resolver, so DNS records are cached. If a lookup returned invalid records, an NXDOMAIN record will be cached for that domain. Unfortunately, in that case you have to wait until the cached records have expired.
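The effect of a cached negative answer can be illustrated with a small toy model in Python. This is not how Bind9 is implemented. The class name NegativeCache, the lookup callable, and the 300-second default lifetime are all assumptions made for the example; a real resolver takes the lifetime of a negative answer from the DNS zone itself.

```python
import time


class NegativeCache:
    """Toy model of negative (NXDOMAIN) caching in a resolver."""

    def __init__(self, lookup, negative_ttl=300.0, clock=time.monotonic):
        self._lookup = lookup      # callable: name -> address, or None for NXDOMAIN
        self._ttl = negative_ttl   # assumed lifetime of a cached negative answer
        self._clock = clock
        self._nx_until = {}        # name -> time at which the cached NXDOMAIN expires

    def resolve(self, name):
        now = self._clock()
        expires = self._nx_until.get(name)
        if expires is not None:
            if now < expires:
                return None        # still serving the cached NXDOMAIN
            del self._nx_until[name]
        answer = self._lookup(name)
        if answer is None:
            # Remember the negative answer until it expires.
            self._nx_until[name] = now + self._ttl
        return answer
```

In this model, a record created right after a failed lookup stays invisible until the cached negative answer expires, which matches the waiting period described above.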
The FAQ keeps growing
We hope that this new FAQ entry can help you and others figure out what really happened when you receive an alert about your site or server being down. As we’ve said, we’ll continue publishing new FAQ entries and will highlight at least some of them here on the blog as well.
Do you have a suggestion for something you think we should definitely add to the FAQ, which is not there today? Let us know in the comments below.