1. Google spider names
1) Googlebot: crawls web pages for Google's web index and news index
2) Googlebot-Mobile: crawls web pages for Google's mobile index
3) Googlebot-Image: crawls web pages for Google's image index
4) Mediapartners-Google: crawls web pages to determine AdSense ad content. Google will use this bot to crawl your site only if AdSense ads are displayed on it.
5) AdsBot-Google: crawls web pages to measure the quality of AdWords landing pages. Google only uses this bot if you advertise your website with Google AdWords.
2. Baidu spider name:
The first letter B of Baiduspider is capitalized, and the rest are lowercase
3. Yahoo (Yahoo!) spider name:
1) Yahoo! search spider: Yahoo! Slurp.
2) Yahoo! search engine advertising spider: Yahoo!-AdCrawler. Used to crawl Yahoo! search engine advertising landing pages.
4. Youdao spider name:
YodaoBot
5. Tencent Soso spider name:
The first letter of Sosospider is capitalized, and the rest are lowercase
6. Sogou (sogou) spider name:
sogou spider
7. Live spider names
1) MSNBot: Main web crawler (www.live.com)
2) MSNBot-Media: Images & all other media (images.live.com)
3) MSNBot-NewsBlogs: News and blogs (search.live.com/news)
4) MSNBot-Products: Products & shopping (products.live.com)
5) MSNBot-Academic: Academic search (academic.live.com)
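To make these names practical, here is a small Python sketch that matches a visitor's User-Agent string against the spiders listed above (the signature substrings and descriptions are illustrative; real user-agent strings carry extra version and URL text, which is why substring matching is used):

```python
# A minimal sketch: map User-Agent substrings to the spiders listed above.
SPIDER_SIGNATURES = {
    # More specific Google names must come before plain "Googlebot".
    "Googlebot-Mobile": "Google mobile index",
    "Googlebot-Image": "Google image index",
    "Mediapartners-Google": "Google AdSense",
    "AdsBot-Google": "Google AdWords landing pages",
    "Googlebot": "Google web/news index",
    "Baiduspider": "Baidu search",
    "Yahoo! Slurp": "Yahoo! search",
    "Yahoo!-AdCrawler": "Yahoo! advertising",
    "YodaoBot": "Youdao search",
    "Sosospider": "Tencent Soso",
    "sogou spider": "Sogou search",
    "msnbot": "Live search",  # all MSNBot variants contain "msnbot"
}

def identify_spider(user_agent: str) -> str | None:
    """Return a description of the spider, or None for ordinary visitors."""
    ua = user_agent.lower()
    for signature, description in SPIDER_SIGNATURES.items():
        if signature.lower() in ua:
            return description
    return None

print(identify_spider(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # -> Google web/news index
```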
Extended reading: an analysis of the rules search engine spiders follow when crawling web pages
1. Crawler framework
We can think of the web's pages as the spider's dinner, which includes:
Downloaded web pages: content the spider has already crawled and placed in its stomach.
Expired web pages: the spider crawls a great many pages, and some copies in its stomach have already gone stale.
Web pages to be downloaded: when the spider sees this food, it will grab it.
Knowable web pages: not yet discovered or downloaded, but the spider can sense them through links and will grab them sooner or later.
Unknowable web pages: the Internet is so big that there are many pages the spider cannot find, and may never find. These account for a high proportion.
These divisions make clear what search engine spiders do and the challenges they face. Most spiders crawl within this framework, but not all of them: every case has its particulars, and spider systems differ somewhat according to their function.
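As a rough illustration of this framework, the sketch below models the five page categories as states (the state names and transitions are my own labels, not any engine's actual implementation):

```python
from enum import Enum

class PageState(Enum):
    DOWNLOADED = "downloaded"    # crawled; content sits in the spider's stomach
    EXPIRED = "expired"          # crawled, but the live page has since changed
    TO_DOWNLOAD = "to_download"  # discovered and queued; the spider will grab it
    KNOWABLE = "knowable"        # not found yet, but reachable through links
    UNKNOWABLE = "unknowable"    # the spider may never find these

# A toy snapshot of the spider's view of a site.
frontier = {
    "http://example.com/": PageState.DOWNLOADED,
    "http://example.com/old-news": PageState.EXPIRED,
    "http://example.com/about": PageState.TO_DOWNLOAD,
}

def mark_expired(url: str) -> None:
    """When the live page changes, the stored copy goes stale."""
    if frontier.get(url) is PageState.DOWNLOADED:
        frontier[url] = PageState.EXPIRED

mark_expired("http://example.com/")
print(frontier["http://example.com/"])  # -> PageState.EXPIRED
```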
2. Types of crawlers
1. Batch spiders
This type of spider has a clear crawling range and goal, and it stops crawling once the goal and task are completed. What is the specific goal? It may be the number of pages crawled, the size of the pages, the crawling time, and so on.
2. Incremental spiders
This type of spider differs from batch spiders: it keeps crawling, and regularly re-crawls and updates pages it has already fetched. Because web pages on the Internet are updated all the time, incremental spiders need to be able to reflect those updates.
3. Vertical spiders
This kind of spider focuses only on a specific topic or a specific industry's pages. Taking health websites as an example, such a specialized spider will crawl only health-related topics and skip pages on other topics. The difficulty for this spider lies in identifying accurately which industry a page's content belongs to. At present, many vertical industry websites need this kind of spider to crawl them.
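As a toy illustration of that identification problem, this sketch scores a page against a hand-picked health keyword list (the keywords and threshold are invented; real vertical spiders use proper text classifiers):

```python
# A toy topic filter for a health-focused vertical spider.
HEALTH_KEYWORDS = {"health", "nutrition", "disease", "exercise", "medicine"}

def should_crawl(page_text: str, threshold: int = 2) -> bool:
    """Crawl a page only if enough health keywords appear in its text."""
    words = set(page_text.lower().split())
    return len(words & HEALTH_KEYWORDS) >= threshold

print(should_crawl("Nutrition and exercise tips for better health"))  # True
print(should_crawl("Latest football transfer rumours"))               # False
```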
3. Crawling strategies
Spiders start from seed URLs and list a large number of URLs to be crawled. But that list is huge, so how does a spider decide the crawl order? There are many crawling strategies, but the ultimate goal is the same: crawl important pages first. To judge whether a page is important, spiders weigh the originality of its content, link weight analysis, and many other factors. The more representative crawling strategies are as follows:
1. Breadth-first strategy
Breadth-first means that after the spider crawls a page, it queues the pages linked from that page and crawls them in order. The idea seems simple, but it is very practical: most pages arrange links by priority, so important pages tend to be recommended first on the page.
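A minimal breadth-first crawl loop might look like this sketch, where an in-memory link graph stands in for actual page downloads (the URLs and the `fetch_links` helper are invented for illustration):

```python
from collections import deque

# A toy in-memory link graph standing in for real page downloads.
LINK_GRAPH = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def fetch_links(url: str) -> list[str]:
    """Stand-in for downloading `url` and parsing out its links."""
    return LINK_GRAPH.get(url, [])

def bfs_crawl(seeds: list[str], limit: int = 1000) -> list[str]:
    """Crawl pages in breadth-first order starting from the seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    crawled = []
    while queue and len(crawled) < limit:
        url = queue.popleft()          # oldest discovered page first
        crawled.append(url)
        for link in fetch_links(url):  # newly found links join the back
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

print(bfs_crawl(["http://a.example/"]))
# -> ['http://a.example/', 'http://b.example/', 'http://c.example/']
```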
2. PageRank strategy
PageRank is a very famous link analysis method, mainly used to measure the weight of web pages. Google's PR is a typical application of the PageRank algorithm. Through PageRank we can find out which pages are more important, and the spider then crawls those important pages first.
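For intuition, here is a small power-iteration implementation of PageRank over a toy link graph (the graph, damping factor, and iteration count are illustrative choices, not Google's actual parameters):

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Compute PageRank by power iteration over an adjacency-list graph."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:  # spread this page's rank across its outlinks
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:      # dangling page: spread its rank over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: B and C both link to A, so A should come out most important.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # -> A
```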
3. Large website priority strategy
This one is easy to understand: large websites usually have more content pages, and the quality tends to be higher. The spider first analyzes the website's classification and attributes; if the site already has many pages indexed, or carries a high weight in the search engine's system, it gets crawled and included first.
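One way to sketch this prioritization is a heap keyed on a site-weight score (the scores here are placeholders for whatever the engine's own site analysis produces):

```python
import heapq

def crawl_order(urls_with_weight: list[tuple[float, str]]) -> list[str]:
    """Pop URLs in descending site-weight order (heapq is a min-heap,
    so weights are negated)."""
    heap = [(-weight, url) for weight, url in urls_with_weight]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Higher-weight (large, well-indexed) sites get crawled first.
print(crawl_order([
    (0.3, "http://smallblog.example/"),
    (0.9, "http://bigportal.example/"),
]))  # -> ['http://bigportal.example/', 'http://smallblog.example/']
```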
4. Web page updates
Most pages on the Internet keep being updated, which requires the spider's stored copies to be updated in time to stay consistent. To use an analogy: suppose a page used to rank well but has since been deleted; if it still ranks, the experience will be very poor. So search engines need to track these changes and update pages at any time to serve users the latest versions. There are three commonly used web page update strategies: the historical reference strategy, the user experience strategy, and the cluster sampling strategy.
1. Historical reference strategy
This update strategy rests on an assumption: if your page has been updated regularly in the past, the search engine believes it will keep being updated in the future, and the spider will visit the site to crawl on that schedule. This is why Dianshui has always emphasized that website content needs to be updated regularly.
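As a rough sketch of this idea, the code below estimates the next crawl time from the average gap between a page's past updates (the scheduling rule is a deliberate simplification of whatever model a real engine uses):

```python
from datetime import datetime, timedelta

def next_crawl_time(update_history: list[datetime]) -> datetime:
    """Assume the page keeps its past rhythm: schedule the next crawl one
    average update interval after the most recent update (needs >= 2 entries)."""
    gaps = [later - earlier
            for earlier, later in zip(update_history, update_history[1:])]
    average_gap = sum(gaps, timedelta()) / len(gaps)
    return update_history[-1] + average_gap

history = [datetime(2010, 3, 1), datetime(2010, 3, 8), datetime(2010, 3, 15)]
print(next_crawl_time(history))  # -> 2010-03-22 00:00:00, a week later
```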
2. User experience strategy
Generally speaking, users only view the first three pages of search results, and few people read beyond them. The user experience strategy means the search engine updates pages based on this trait of users. For example, a page may have been published long ago and not updated for a while, but users still find it useful and click to browse it, so the search engine can afford not to refresh such older pages immediately. This is why, in search results, the newest page does not necessarily rank higher: ranking depends more on the page's quality than on its update time.
3. Cluster sampling strategy
The two strategies above mainly rely on a page's historical information, but storing a large amount of history is a burden for search engines. Besides, when a new page is first included, there is no history to refer to. What then? The cluster sampling strategy classifies many similar pages according to the attributes they display, and all pages in a class are updated under the same rules.
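A toy sketch of that idea: group pages by a simple attribute key and give every page in a cluster the same update interval (the keys and intervals are invented for illustration):

```python
from collections import defaultdict

# Hypothetical page records: (url, attribute key), where the key summarizes
# displayed attributes such as site section or template type.
pages = [
    ("http://news.example/a", "news-article"),
    ("http://news.example/b", "news-article"),
    ("http://shop.example/item1", "product-page"),
]

clusters = defaultdict(list)
for url, key in pages:
    clusters[key].append(url)

# Every page in a cluster inherits its cluster's update interval (in days).
CLUSTER_UPDATE_DAYS = {"news-article": 1, "product-page": 7}
for key, urls in clusters.items():
    print(key, "recrawl every", CLUSTER_UPDATE_DAYS[key], "days:", urls)
```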
From understanding how search engine spiders work, we learn that the relevance of a site's content, the update patterns of the site and its pages, the distribution of links on pages, and the site's weight all affect how efficiently spiders crawl it. Know yourself and know the spider, and let it come all the more fiercely!