Monday, November 10, 2008

How Search Engine Robots Work

A search engine robot, also called a web crawler (Web spider or Web robot), is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000). They are the seekers of web pages.
Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Robots are mainly used to create a copy of all the pages they visit for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Robots can also be used to gather specific types of information from web pages, such as harvesting e-mail addresses (usually for spam).

Search engine robots have only basic functionality; there are certain things they simply cannot do. Robots don't understand frames, Flash movies, images, or JavaScript. They can't enter password-protected areas, and they can't click the buttons on your website. They can be stopped cold while indexing a dynamically generated URL and slowed to a stop by JavaScript navigation.

When they arrive at a website, automated robots first check for a robots.txt file. This file tells robots which areas of the site are off-limits to them. Robots collect links from each page they visit and later follow those links to other pages. In this way, they essentially follow links from one page to another. The entire World Wide Web is made up of links; the original idea was that you could follow links from one place to another. This is how robots get around.
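As a rough sketch of this robots.txt check, Python's standard library includes a parser for the file; the site URL, crawler name, and page path below are placeholders, not real values:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, as a polite robot would
# before crawling anything else.
rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given page.
if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
    print("Allowed to crawl this page")
else:
    print("Off-limits according to robots.txt")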

When a search engine robot visits a page, it looks at the visible text, the content of the various tags in the page's source code (the title tag, meta tags, etc.), and the hyperlinks on the page. From the words and links the robot finds, the search engine decides what the page is about. Depending on how the robot is set up by the search engine, the information is indexed and then delivered to the search engine's database.
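To make this concrete, here is a minimal sketch of how a robot might pull the title, meta tags, and links out of a page's source, using Python's built-in HTML parser; the sample HTML is invented for illustration:

from html.parser import HTMLParser

class PageScanner(HTMLParser):
    # Collects roughly what a robot reads: the title, meta tags, and links.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

scanner = PageScanner()
scanner.feed('<html><head><title>Demo</title>'
             '<meta name="description" content="A sample page"></head>'
             '<body><a href="/about.html">About</a></body></html>')
print(scanner.title)   # Demo
print(scanner.meta)    # {'description': 'A sample page'}
print(scanner.links)   # ['/about.html']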

The information delivered to the database then becomes part of the search engine and directory ranking process. When a search engine visitor submits a query, the engine digs through its database to produce the final listing displayed on the results page.
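A common structure behind this lookup is an inverted index, which maps each word to the pages containing it. The following toy sketch (with made-up pages and text) shows the idea; real engines add ranking on top:

# Build a tiny inverted index: word -> set of pages containing it.
pages = {
    "page1.html": "search engine robots crawl the web",
    "page2.html": "robots follow links between pages",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Answer a query by intersecting the page sets of its words.
query = "robots links"
results = set.intersection(*(index.get(w, set()) for w in query.split()))
print(results)  # {'page2.html'}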

You can see which pages the search engine robots have visited on your site by looking at your server logs or the output of a log statistics program. Some robots are readily identifiable by their user agent names, like Google's "Googlebot"; others are a bit more obscure, like Inktomi's "Slurp".
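As a simple sketch of spotting robots in a log, the snippet below scans a server access log for lines mentioning well-known robot user agents; the log file name is assumed, and the agent list is only a small sample:

# Print access-log lines produced by known search engine robots.
robot_agents = ("Googlebot", "Slurp", "msnbot")

with open("access.log") as log:
    for line in log:
        if any(agent in line for agent in robot_agents):
            print(line.strip())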
