Python crawler: Want to listen to list songs? It only takes 14 lines of code to get it done.

Although XPath is more convenient to use than regular expressions, it is not the most convenient, only more convenient. Our BeautifulSoup library can make it easier for you to climb what you want.

Before use, install the BeautifulSoup library as usual. The explanation is as follows:

Its Chinese development documents:

BeautifulSoup library is a powerful XML and HTML parsing library in Python language. It provides some simple functions to handle functions such as navigation, searching and modifying the analysis tree.

The BeautifulSoup library can also automatically convert input documents into Unicode encoding and output documents into UTF-8 encoding.

Therefore, in the process of using the BeautifulSoup library, you don't need to consider the coding problem in development, unless the document you parse itself doesn't specify the coding method, which requires coding processing in development.

Below, let's introduce the usage rules of BeautifulSoup library in detail.

Below, let's introduce the key knowledge of BeautifulSoup library in detail.

First of all, an important concept in the BeautifulSoup library is to choose an interpreter. Because its bottom depends on all these interpreters, it is necessary for us to know them. The blogger specially listed a table:

From the above table, we generally use the lxml HTML parser of crawler, which is not only fast, but also very compatible, except for the shortcomings of installing the C language library (it should be called trouble instead of shortcomings).

To use the BeautifulSoup library, you need to import it like other libraries, but although beautifulsoup4 is installed, the name of the import is not beautifulsoup4, but bs4. Usage is as follows:

After running, the output text is as follows:

The basic usage is very simple, so I won't go into details here. From now on, let's learn all the important knowledge points of the BeautifulSoup library in detail. The first is the node selector.

The so-called node selector is to select a node directly by name, and then use the string property to get the text in the node, which is the fastest way to get it.

For example, in the basic usage, we use h 1 to get the node h 1 directly, and then we can get its text through h 1.string, but this usage has an obvious disadvantage, that is, it is not suitable for complex levels.

Therefore, we need to shrink the document before using the node selector. For example, a document is very large, but all we get is the blog id in P, so it is very appropriate for us to get this P first and then use the node selector in P.

HTML sample code:

In the following example, we still use this HTML code to explain the node selector.

Let's teach you how to get the name, attributes and contents of a node first. Examples are as follows: