Crawling Web 2.0: Content Hunt
Pavel Lebedev | 2009-06-26

Anyone who has tracked search engine bots (crawlers) over the last few years has certainly noticed dramatic changes in crawling activity.

First, it used to take a relatively long time for a new web site to get into the indexes of the major search engines; nowadays it is a matter of days. Web analytics often detects the first visit by a search engine crawler within days of a brand new site being submitted (or linked from an already indexed site).

Second, the frequency with which the major search engines crawl a site has increased substantially. It is not uncommon to see crawlers revisiting the same web page several times a day. There is now intense crawler activity on web sites (activity that is often invisible in web analytics or, at least, not transparent to web site owners). Furthermore, once retrieved by a crawler, a new or updated web page becomes searchable by the entire world (is updated in the search index) within days.

All these are signs of the Web 2.0 era we now live in. Before Web 2.0, content was mostly static: there was no need for crawlers to revisit the same web page (recrawl the site) often, since in most cases it had not changed anyway. Today things are entirely different. Plenty of web publishing technologies are available that make content dynamic and easy to change and update. Publishing online is easier than ever before. In particular, the advent of blogging technology (you are reading a blog post now) has contributed to the explosive growth of dynamic content. In addition, the adoption of syndication feed technology (Atom, RSS) has substantially increased the reach of, and hence the demand for, dynamic content. The supply has followed. As a result, the web is now flooded with dynamic content.

On the other hand, the web has become a true commercial venue, unleashing more and more of its economic potential. As a result, competition between search engines has increased.
To stay competitive, search engines need to keep their indexes up to date with that flood of dynamic Internet content, so they have increased their crawling activity substantially. As a result, in the Web 2.0 era search engine bots are constantly crawling sites, hunting for new and updated content.

Anecdotes like the following real one (from a client) should no longer surprise anyone:

"I recently emailed a link pointing to my site. The link is unique in that it was never published on the web, and somewhat weird too. The people I sent the link to somehow leaked it to the web the same day I sent it, and, again the same day, I was so VERY surprised to see Google's crawler (Googlebot) hitting that link. That's it: after the link was leaked to the web, it was immediately spotted and hit by a search engine crawler. When I saw it in the crawler tracking report, I at first could not imagine how it could happen that a link I had just written in an email (not even a web page) almost immediately hammered my door back from the web. All the same day, even within hours!"
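
Crawler visits are recorded in the web server access log even when they do not show up in analytics reports, so a rough crawler tracking report of the kind the client mentions can be put together from the log alone. The sketch below is only an illustration under stated assumptions: it assumes the log uses the common "combined" format, the function name `crawler_hits` is hypothetical, and the crawler list is a small sample of User-Agent tokens, not a complete one. A User-Agent header can also be forged, so a serious report would verify the visitor as well (for example, by reverse DNS lookup).

```python
import re
from collections import Counter

# Assumption: the server writes the "combined" access log format.
# Adjust the pattern if your server logs a different layout.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# A small, incomplete sample of tokens found in crawler User-Agent headers.
CRAWLER_TOKENS = ("Googlebot", "bingbot", "Slurp")


def crawler_hits(lines):
    """Count crawler requests per (day, crawler token, requested path)."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent:
                key = (match.group("day"), token, match.group("path"))
                hits[key] += 1
                break  # count each request once, under the first matching token
    return hits
```

Feeding it the lines of an access log (for instance `crawler_hits(open("access.log"))`, where the file name is a placeholder) returns a counter keyed by day, crawler, and path. A path with several hits from the same crawler on one day is the "several times a day" revisiting described above, and a hit on a link that was never published is the situation in the client's story.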

