Before using the BeautifulSoup library, install it in the usual way. The details are as follows:
Its Chinese development documents:
The BeautifulSoup library is a powerful XML and HTML parsing library for Python. It provides simple functions for navigating, searching, and modifying the parse tree.
The BeautifulSoup library also automatically converts input documents to Unicode and output documents to UTF-8.
Therefore, when using the BeautifulSoup library you normally don't need to think about encoding. The exception is a document that doesn't declare its own encoding, in which case you have to handle the encoding yourself.
Below, let's go through the usage rules and key points of the BeautifulSoup library in detail.
First of all, an important concept in the BeautifulSoup library is choosing a parser. Because BeautifulSoup relies on these parsers under the hood, it is necessary for us to know them. The blogger has listed them in a table:
Based on the table above, crawlers generally use the lxml HTML parser, which is not only fast but also very tolerant of imperfect markup. Its only drawback is that it requires installing a C library (which is more of an inconvenience than a real shortcoming).
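As a rough illustration, the parser is chosen through the second argument of the BeautifulSoup constructor. The sketch below assumes a hypothetical helper, make_soup, that prefers "lxml" and falls back to Python's built-in "html.parser" when lxml is not installed; the markup string is made up for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        # Requires the lxml package (and its C library) to be installed.
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Ships with Python, so no extra installation is needed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
bold_text = str(soup.b.string)
print(bold_text)  # world
```

Falling back this way keeps a script usable on machines without lxml, at the cost of some speed.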
To use the BeautifulSoup library, you need to import it like any other library. Note that although the installed package is called beautifulsoup4, the name you import is bs4. Usage is as follows:
After running, the output text is as follows:
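A minimal sketch of this basic usage, assuming a small made-up HTML string and the built-in "html.parser" so that nothing extra has to be installed; the printed result is shown as a comment.

```python
from bs4 import BeautifulSoup  # installed as beautifulsoup4, imported as bs4

# Made-up HTML for illustration only.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello BeautifulSoup</h1></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Select the h1 node by name, then read its text via the string attribute.
title_text = soup.h1.string
print(title_text)  # Hello BeautifulSoup
```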
The basic usage is very simple, so I won't go into detail here. From here on, let's go through the important features of the BeautifulSoup library one by one. The first is the node selector.
A node selector selects a node directly by its tag name; you can then use the string attribute to get the text inside the node. This is the quickest way to get it.
For example, in the basic usage we used h1 to get the h1 node directly, and then got its text through h1.string. This approach has an obvious disadvantage, however: it is not suitable for deeply nested documents.
Therefore, we need to narrow down the document before using the node selector. For example, if the document is very large but all we want is the blog id inside a p node, it makes sense to get that p node first and then use the node selector within it.
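The idea of narrowing first and selecting second can be sketched as follows. The HTML, the id value "blog", and the b node holding the number are all assumptions made up for this example; find() is used here only to grab the p node.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two p nodes, only one of which holds the blog id.
html = """
<div><p id="intro">Welcome</p></div>
<div><p id="blog">Blog id: <b>12345</b></p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow the scope to the one p node we care about.
blog_p = soup.find("p", id="blog")

# Step 2: use the node selector inside that smaller subtree.
blog_id = blog_p.b.string
print(blog_id)  # 12345
```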
HTML sample code:
In the following example, we still use this HTML code to explain the node selector.
Let's first see how to get the name, attributes, and content of a node. Examples are as follows:
After running, the effect is as follows:
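A minimal sketch of reading a node's name, attributes, and content. The HTML is a made-up fragment, not the sample from this article; printed results are shown as comments.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration.
html = '<ul><li class="li1"><a class="aaa" href="/python">Python plate</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a  # the first a node in the document

print(link.name)     # a
print(link.attrs)    # {'class': ['aaa'], 'href': '/python'}
print(link["href"])  # /python
print(link.string)   # Python plate
```

Note that class comes back as a list, because an HTML element can carry several class names.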
Generally speaking, a node may have many child nodes, and the method above only gets the first one. If you want to get all the child nodes of a tag, there are two ways. Let's look at the code first:
After running, the effect is as follows:
As the code above shows, there are two ways to get all the child nodes: one is the contents attribute, which is a list, and the other is the children attribute, which is an iterator. Traversing either one gives the same results.
Since direct child nodes can be obtained, all descendant nodes can of course be obtained too. The BeautifulSoup library provides the descendants attribute for this. Examples are as follows:
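The two ways of getting direct child nodes can be sketched like this. The HTML is made up and written without whitespace between tags, so that the only children of ul are the three li nodes.

```python
from bs4 import BeautifulSoup

# Made-up HTML, deliberately without whitespace between the tags.
html = '<ul><li class="li1">Python</li><li class="li2">Java</li><li class="li3">Go</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Way 1: contents is a plain list of the direct children.
by_contents = [str(child.string) for child in soup.ul.contents]

# Way 2: children iterates over the same direct children.
by_children = [str(child.string) for child in soup.ul.children]

print(by_contents)  # ['Python', 'Java', 'Go']
print(by_children)  # ['Python', 'Java', 'Go']
```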
After running, the effect is as follows:
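To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment. Text nodes also appear among the descendants, so the example separates tags from text with isinstance.

```python
from bs4 import BeautifulSoup, Tag

# Made-up HTML fragment for illustration.
html = '<ul><li><a href="/python">Python</a></li><li><a href="/java">Java</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# children stops at the direct li nodes.
child_names = [c.name for c in soup.ul.children]

# descendants walks the whole subtree: li, a, text, li, a, text.
descendant_tags = [d.name for d in soup.ul.descendants if isinstance(d, Tag)]
descendant_texts = [str(d) for d in soup.ul.descendants if not isinstance(d, Tag)]

print(child_names)       # ['li', 'li']
print(descendant_tags)   # ['li', 'a', 'li', 'a']
print(descendant_texts)  # ['Python', 'Java']
```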
Similarly, in a real crawler we sometimes need to go the other way and find a parent node or a sibling node.
The BeautifulSoup library provides the parent attribute for the parent node, the next_sibling attribute for the next sibling of the current node, and the previous_sibling attribute for the previous sibling.
The sample code is as follows:
After running, the effect is as follows:
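A minimal sketch of these three attributes on a made-up list. The HTML has no whitespace between tags on purpose: with line breaks between the li nodes, next_sibling and previous_sibling would return the whitespace text nodes instead of the neighbouring tags.

```python
from bs4 import BeautifulSoup

# Made-up HTML, written without whitespace between the tags.
html = '<ul id="nav"><li class="li1">Python</li><li class="li2">Java</li><li class="li3">Go</li></ul>'
soup = BeautifulSoup(html, "html.parser")

middle = soup.find("li", class_="li2")

print(middle.parent["id"])             # nav
print(middle.previous_sibling.string)  # Python
print(middle.next_sibling.string)      # Go
```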
As the blogger explained, the node selector works well when there is little content. However, the pages a crawler actually fetches contain a lot of data, so starting with the node selector is not appropriate. Instead, the first step should go through a method selector.
The find_all() method selects all nodes that meet the given conditions on name, attributes, and text content. Its complete definition is as follows:
Let's try it on the HTML above. We look for the node with name="a", attrs={"class": "aaa"}, and text equal to "Python plate".
The sample code is as follows:
After running, the effect is as follows:
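A sketch of filtering with find_all() by name, attributes, and text, on made-up HTML. It passes the text through the string argument, which recent bs4 releases use in place of the older text argument; the limit argument caps the number of results.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<ul>
<li class="li1"><a class="aaa" href="/python">Python plate</a></li>
<li class="li2"><a class="aaa" href="/java">Java plate</a></li>
<li class="li3"><a class="bbb" href="/go">Go plate</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all(name="a")
by_attrs = soup.find_all(name="a", attrs={"class": "aaa"})
by_text = soup.find_all(name="a", attrs={"class": "aaa"}, string="Python plate")
first_two = soup.find_all("a", limit=2)  # stop after two matches

print(len(by_name))    # 3
print(len(by_attrs))   # 2
print(by_text)         # [<a class="aaa" href="/python">Python plate</a>]
print(len(first_two))  # 2
```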
The find() and find_all() methods differ only by the "all" in their names, but their results differ in two ways:
1. find() only finds the first matching node, while find_all() finds all matching nodes.
2. find() returns a bs4.element.Tag object, while find_all() returns a bs4.element.ResultSet object.
Next, let's look for the a tag in the HTML above to see how the returned results differ. Examples are as follows:
After running, the effect is as follows:
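The two differences can be checked with a short sketch on made-up HTML. It also shows what each method returns when nothing matches: find() gives None and find_all() gives an empty result.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration.
html = '<ul><li><a href="/python">Python plate</a></li><li><a href="/java">Java plate</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")
all_links = soup.find_all("a")

print(type(first).__name__)      # Tag
print(type(all_links).__name__)  # ResultSet
print(first["href"])             # /python
print(len(all_links))            # 2

# No match: find() returns None, find_all() returns an empty result.
print(soup.find("table"))        # None
print(soup.find_all("table"))    # []
```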
First, let's understand the rules of CSS selectors:
1. .classname: select the nodes whose style name is classname, that is, nodes whose class attribute value is classname.
2. #idname: select the node whose id attribute is idname.
3. nodename: select the nodes whose node name is nodename.
Generally speaking, in the BeautifulSoup library we use the select() method to work with CSS selectors. Examples are as follows:
Here, we choose the node with class equal to li1. After running, the effect is as follows:
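A minimal sketch of the three selector rules with select(), using made-up HTML in which the class li1 and the id nav are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<ul id="nav">
<li class="li1"><a href="/python">Python plate</a></li>
<li class="li2"><a href="/java">Java plate</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".li1")  # rule 1: class name
by_id = soup.select("#nav")     # rule 2: id
by_tag = soup.select("a")       # rule 3: node name

print(by_class)                             # [<li class="li1"><a href="/python">Python plate</a></li>]
print(by_id[0].name)                        # ul
print([str(a.string) for a in by_tag])      # ['Python plate', 'Java plate']
```

select() always returns a list-like result, even when only one node matches, so a single node is taken with an index such as by_id[0].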
The HTML above is not suitable for demonstrating nested CSS selectors, so here we make a slight modification, just change.