A crawler periodically checks whether your published posts have changed so it can update its search engine index. In this post I want to show how to find out which bots are crawling your feed.

I was wondering which bots are reading/crawling my RSS feed or sitemap.xml. To find out, I wrote a simple command:

grep '/feed/' access_log | cut -d ' ' -f4,5,12-

grep first filters the access log, keeping only the lines that contain '/feed/'. cut then splits each of those lines on spaces and prints fields 4, 5, and 12 onward: the timestamp, the timezone offset, and the user agent string. The result looks like this:

[29/Jul/2017:17:07:08 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[29/Jul/2017:17:38:45 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[29/Jul/2017:17:38:48 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[29/Jul/2017:17:39:06 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[30/Jul/2017:18:53:26 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[31/Jul/2017:03:28:35 +0200] "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
[31/Jul/2017:09:34:43 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
[31/Jul/2017:17:30:02 +0200] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[31/Jul/2017:22:38:05 +0200] "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0"
...
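To see at a glance which user agents hit the feed most often, the filtered lines can be aggregated with sort and uniq. The snippet below demonstrates this on a small inline sample so it can be tried standalone; against a real server you would read access_log directly. The field positions assume Apache's combined log format, where the user agent is the sixth double-quote-delimited field:

```shell
# Inline sample in Apache combined log format (stand-in data so the
# pipeline can be tried without a real server log).
cat > /tmp/sample_access_log <<'EOF'
1.2.3.4 - - [29/Jul/2017:17:07:08 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
5.6.7.8 - - [31/Jul/2017:03:28:35 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
1.2.3.4 - - [31/Jul/2017:17:30:02 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Count feed requests per user agent: split on double quotes and take
# the 6th field (the user agent), then count and sort by frequency.
grep '/feed/' /tmp/sample_access_log | cut -d'"' -f6 | sort | uniq -c | sort -rn
```

For the sample data this prints two lines, with Googlebot (2 hits) ranked above bingbot (1 hit).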

As you can see in the result, bots like Googlebot and bingbot show up, alongside regular browser user agents such as Firefox (the Gecko entries). You can store these lines in a database by running the command from a cronjob; before you do that, you should strip any unnecessary information. The collected data can then be analyzed for long-term trends.
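A collection script run from cron could look like the sketch below. The paths and the stand-in log entry are placeholders so the demo runs on its own; in practice you would point ACCESS_LOG at the real Apache log (e.g. /var/log/httpd/access_log on CentOS 7) and drop the echo line:

```shell
#!/bin/sh
# Sketch of a cron-driven collection script.  ACCESS_LOG and OUT are
# example paths -- adjust them to your setup.
ACCESS_LOG=/tmp/demo_access_log
OUT=/tmp/feed_crawler_hits.log

# Stand-in log entry so this demo works without a real server log.
echo '1.2.3.4 - - [29/Jul/2017:17:07:08 +0200] "GET /feed/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' > "$ACCESS_LOG"

# Keep only the fields worth storing: timestamp, timezone offset,
# and user agent (fields 4, 5, and 12 onward when split on spaces).
grep '/feed/' "$ACCESS_LOG" | cut -d ' ' -f4,5,12- >> "$OUT"
```

A crontab entry such as `0 * * * * /usr/local/bin/collect-feed-hits.sh` (a made-up path) would run it hourly. Note that this naive version re-reads the whole log on every run, so either deduplicate during analysis or rely on log rotation to keep the input small.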


Tested on:

  • OS: CentOS 7
  • Web server: Apache httpd 2.4.6

Credits: