A crawler periodically checks whether your published posts have changed so it can update its search engine index. In this post I want to show how to find out which bots are crawling your feed.

I was wondering which bots are reading/crawling my RSS feed or sitemap.xml. To find out, I wrote a simple command:

grep '/feed/' access_log | cut -d ' ' -f4,5,12-

grep first filters the access log, keeping only the lines that contain '/feed/'. cut then splits each of those lines on spaces and prints fields 4, 5, and 12 onward: the timestamp, the timezone offset, and the user agent string. The result looks like this:

[29/Jul/2017:17:07:08 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[29/Jul/2017:17:38:45 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[29/Jul/2017:17:38:48 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[29/Jul/2017:17:39:06 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[30/Jul/2017:18:53:26 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[31/Jul/2017:03:28:35 +0200] "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
[31/Jul/2017:09:34:43 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[31/Jul/2017:17:30:02 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[31/Jul/2017:22:38:05 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
...
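To see at a glance which user agents hit the feed most often, the filtered lines can be aggregated with sort and uniq. The snippet below demonstrates this on a small inline sample so it can be tried standalone; against a real server you would read access_log directly. The field positions assume Apache's combined log format, where the user agent is the sixth double-quote-delimited field:

```shell
# Inline sample in Apache combined log format (stand-in data so the
# pipeline can be tried without a real server log).
cat > /tmp/sample_access_log <<'EOF'
1.2.3.4 - - [29/Jul/2017:17:07:08 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
5.6.7.8 - - [31/Jul/2017:03:28:35 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
1.2.3.4 - - [31/Jul/2017:17:30:02 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Count feed requests per user agent: split on double quotes and take
# the 6th field (the user agent), then count and sort by frequency.
grep '/feed/' /tmp/sample_access_log | cut -d'"' -f6 | sort | uniq -c | sort -rn
```

For the sample data this prints two lines, with Googlebot (2 hits) ranked above bingbot (1 hit).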

As you can see in the result, bots like Googlebot and bingbot show up, alongside regular browser user agents such as Firefox (the Gecko entries). You can store these lines in a database by running the command from a cronjob; before you do that, you should strip any unnecessary information. The collected data can then be analyzed for long-term trends.
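A collection script run from cron could look like the sketch below. The paths and the stand-in log entry are placeholders so the demo runs on its own; in practice you would point ACCESS_LOG at the real Apache log (e.g. /var/log/httpd/access_log on CentOS 7) and drop the echo line:

```shell
#!/bin/sh
# Sketch of a cron-driven collection script.  ACCESS_LOG and OUT are
# example paths -- adjust them to your setup.
ACCESS_LOG=/tmp/demo_access_log
OUT=/tmp/feed_crawler_hits.log

# Stand-in log entry so this demo works without a real server log.
echo '1.2.3.4 - - [29/Jul/2017:17:07:08 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' > "$ACCESS_LOG"

# Keep only the fields worth storing: timestamp, timezone offset,
# and user agent (fields 4, 5, and 12 onward when split on spaces).
grep '/feed/' "$ACCESS_LOG" | cut -d ' ' -f4,5,12- >> "$OUT"
```

A crontab entry such as `0 * * * * /usr/local/bin/collect-feed-hits.sh` (a made-up path) would run it hourly. Note that this naive version re-reads the whole log on every run, so either deduplicate during analysis or rely on log rotation to keep the input small.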


Tested on:

  • OS: CentOS 7
  • Web server: Apache httpd 2.4.6

Credits: