The infamous “Couldn’t fetch sitemap”

I was having this problem for some time. After submitting this site’s sitemap.xml to Google Search Console, I was getting a red warning that it couldn’t fetch the sitemap because of some unspecified HTTP error. Not very helpful. Plus, there was no sign in the server logs that it was even attempting to fetch it.

Apparently this is a fairly common problem and there can be many causes for it:

  • A malformed sitemap.xml
  • robots.txt or X-Robots-Tag blocking access to crawlers (see the sketch after this list)
  • Connectivity problems, such as broken IPv6 reachability
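
For the robots.txt / X-Robots-Tag cause, it’s worth checking the server configuration for anything that tells crawlers to stay away or rejects their user agents outright. A hypothetical nginx illustration of the kind of thing to look for (none of this is from my actual configuration):

server {
    # ...

    # A blanket header like this tells crawlers not to index anything:
    add_header X-Robots-Tag "noindex, nofollow";

    # ...and a user-agent check like this would block the fetch outright:
    if ($http_user_agent ~* "googlebot") {
        return 403;
    }
}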

I double-checked that the sitemap was valid and that it could be fetched by others. Bing Webmaster Tools had no problem accessing and parsing it either.

Some people advise just waiting it out, saying it will resolve itself, but after waiting for a few weeks things hadn’t changed. The cause had to be something more obscure. A hint was in nginx’s error.log:

2021/09/11 15:15:33 [crit] 35499#35499: *136 SSL_do_handshake() failed (SSL: error:141CF06C:SSL routines:tls_parse_ctos_key_share:bad key share) while SSL handshaking, client: 123.123.123.123, server: 0.0.0.0:443

geoiplookup told me that the client’s IP belongs to Google.

After some more investigation with tshark and Wireshark, I suspected that the cause might be some TLS cipher incompatibility. So I tried changing this setting in nginx:

ssl_prefer_server_ciphers on;

which Let’s Encrypt sets to off. And it worked, at least well enough to let me submit the sitemap. Although the initial response was that it still couldn’t fetch the file, I could see in the logs that it was being fetched, and the next time I checked I saw the green success sign.
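
For context, with a certbot-managed nginx setup these directives usually live in the included options-ssl-nginx.conf. A minimal sketch of the change, assuming fairly typical Let’s Encrypt defaults around it (illustrative, not an exact copy of that file):

# Typical Let's Encrypt TLS settings (approximate):
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305";

# certbot sets this to off by default; switching it to on is the workaround described above:
ssl_prefer_server_ciphers on;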

However, this is more of a workaround. I still get such TLS error messages in the logs from Google’s domains, although the crawler works properly. I am convinced it isn’t really a problem with my configuration, since I saw it with both vanilla Let’s Encrypt and Caddy’s automatic HTTPS configuration.

The real cause is probably that Googlebot, having to index the whole Internet, supports a large number of old and insecure ciphers, and some of them may be causing problems with newer configurations. I could possibly identify exactly which cipher causes the problem and disable it, but I would rather have Google take a look at this issue, which may be affecting others as well.
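
For what it’s worth, if the offending cipher were ever identified, disabling just that one in nginx would look something like the following. AES128-SHA is purely a placeholder name; I never pinned down the real culprit, and the base string is nginx’s documented default rather than anything from my configuration.

# nginx's default cipher string is "HIGH:!aNULL:!MD5"; appending
# !AES128-SHA (a placeholder) would exclude that single cipher:
ssl_ciphers "HIGH:!aNULL:!MD5:!AES128-SHA";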

Redirecting a domain and Bing

Bing does things a bit differently. For example, it doesn’t distinguish between http:// and https:// domains and, even more confusingly, it doesn’t distinguish between www and non-www domains. At least that’s my understanding. I wonder how it handles multiple subdomains.

It also seems to get confused by redirections of files that really belong to a single domain, such as robots.txt and sitemap.xml. If I understand correctly, Googlebot ignores such redirections (at least it does for robots.txt), while Bingbot follows them and…

I was trying to redirect this site from its old domain to its new domain. The first approach was to simply 301 redirect everything. Using Change of Address in Google Search Console was sufficient to redirect the old pages to their new location and to add new content as well. Bing, on the other hand, has removed its equivalent Site Move Tool. The old pages weren’t being recrawled to discover the redirections, and new pages on the new domain were ignored.

After further investigation/hair pulling/trying various things, I came to the conclusion that this was caused by the redirection of sitemap.xml and robots.txt (which includes the URL of sitemap.xml). So I tried redirecting everything except for these two files, which kept being served from the old domain as before. Here’s an example for nginx:

# Redirect old domain: https://www.olddomain.com -> https://www.newdomain.com (canonical)
# except for robots.txt and sitemap.xml
server {
    server_name www.olddomain.com;

    access_log /var/log/nginx/olddomain.log;

    location / {
        return 301 https://www.newdomain.com$request_uri;
    }

    location = /robots.txt {
        root /var/www/www.olddomain.com;
    }

    location = /sitemap.xml {
        root /var/www/www.olddomain.com;
    }

    # ...
}
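
The robots.txt that keeps being served from the old domain still points to the old sitemap, along these lines (hypothetical contents, matching the example domains above):

# /var/www/www.olddomain.com/robots.txt
User-agent: *
Disallow:
Sitemap: https://www.olddomain.com/sitemap.xml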

It gets more complicated because of the http://, https://, non-www and www variants, but I decided to redirect every old variant to the old canonical and from there to the new canonical, with the exceptions mentioned above.
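
A minimal sketch of that first hop, using the same hypothetical domain names as above (listen directives simplified, certificates omitted):

# Every non-canonical variant of the old domain -> old canonical,
# which then redirects to the new canonical as shown earlier.
server {
    listen 80;
    server_name olddomain.com www.olddomain.com;
    return 301 https://www.olddomain.com$request_uri;
}

server {
    listen 443 ssl;
    server_name olddomain.com;
    # ssl_certificate / ssl_certificate_key for the old domain go here
    return 301 https://www.olddomain.com$request_uri;
}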

So we have two versions of sitemap.xml: one served from the old domain with the old URLs, and one served from the new domain with the new URLs. After requesting reindexing of a few pages and waiting, things seem to have got unstuck. The old pages have been recrawled, Bingbot found the redirections and dropped the old URLs from its index, and the new pages are slowly getting indexed as well. I will have to wait a bit more to be entirely sure because, apparently, Bingbot crawls slowly and selectively.