I configured the Scrapy framework in the morning, and I couldn't wait to build a small demo in the afternoon to test it. It turns out that Scrapy is really good; once I've fully mastered it, there will be no data I can't crawl, O(∩_∩)O haha~.
The following steps assume that Scrapy has already been installed and configured successfully.
1. Create a new Scrapy project.
Open the cmd console window.
Input: scrapy startproject mySpider
The following files are included in the created project:
__init__.py: the initialization file of the project;
items.py: the item (crawl target) file of the project;
pipelines.py: the pipeline file of the project;
settings.py: the settings file of the project;
spiders/: the directory for storing spider code;
spiders/__init__.py: the initialization file of the spiders package.
2. Define the Item
Items are containers for saving the scraped data. Their usage is similar to a Python dictionary, but they provide an additional protection mechanism that avoids errors caused by misspelled or undefined fields, similar to the mapping relationship in an ORM.
This is the default code in items.py:
import scrapy

class MyspiderItem(scrapy.Item):
    # name = scrapy.Field()
    pass
Let's revise it (the data I need to crawl: name, title, and details):
import scrapy

class MyspiderItem(scrapy.Item):
    # These are the fields contained in the data to be crawled.
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
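As a quick sketch of the protection mechanism mentioned above (assuming the project package is named mySpider), an item behaves like a dict for its declared fields, but assigning a field that was never declared raises a KeyError instead of silently creating a new key:

from mySpider.items import MyspiderItem

item = MyspiderItem()
item['name'] = "some name"   # a declared field works like a dict entry
print(item['name'])

try:
    item['nmae'] = "oops"    # misspelled, undeclared field
except KeyError as e:
    print("KeyError:", e)    # the Item rejects the unknown field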
3. Create a crawler file
In the spiders/ directory, create a file named demo_spider.py.
Then open the file with Notepad++ and add the following code:
import scrapy
# Reference MyspiderItem from items.py in the mySpider directory.
from mySpider.items import MyspiderItem

class DemoSpider(scrapy.Spider):
    # Spider name, which must be unique.
    name = "demo"
    # The domains the spider is allowed to crawl (I want the itcast data).
    allowed_domains = ["itcast.cn"]
    # The start URL list: the addresses of the first batch of requests.
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    # The parse method is responsible for parsing the returned data, extracting
    # the item fields, and generating further request URLs to be processed.
    def parse(self, response):
        # The collected data nodes.
        node_list = response.xpath("//div[@class='li_txt']")
        for node in node_list:
            item = MyspiderItem()
            # .extract() converts xpath objects into Unicode strings.
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            # yield: pause the loop after obtaining one item, hand it over to
            # the pipeline, then continue the loop.
            yield item
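To show what the xpath()/.extract() calls above return, here is a small standalone sketch; the HTML snippet is made up to mirror the li_txt structure of the target page:

from scrapy.selector import Selector

html = "<div class='li_txt'><h3>Some Name</h3><h4>Some Title</h4><p>Some info</p></div>"
node = Selector(text=html).xpath("//div[@class='li_txt']")[0]

print(node.xpath("./h3/text()").extract())     # ['Some Name'] - a list of strings
print(node.xpath("./h3/text()").extract()[0])  # 'Some Name'  - hence the [0] indexing above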
4. Modify the settings file
Open the settings.py file and change ROBOTSTXT_OBEY to False, to prevent some websites' robots.txt rules from blocking the crawler from fetching data.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Uncomment ITEM_PIPELINES, which defines the priority of pipelines. The smaller the value, the higher the priority.
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
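If more than one pipeline is registered, the numbers decide the order in which each item passes through them; for example (the second pipeline here is hypothetical, just to illustrate the ordering):

ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,   # smaller value: runs first
    'mySpider.pipelines.SomeOtherPipeline': 800,  # hypothetical: runs later
}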
5. Modify the pipeline execution file
This is the default pipeline file:
import json

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        pass
We revised it as follows:
import json

class MyspiderPipeline(object):
    def __init__(self):
        # This method is necessary: create and open demo.json.
        self.f = open("demo.json", "wb+")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.f.close()
Add an __init__ method: when the pipeline is started for the first time, it creates and opens the demo.json file.
Add a close_spider method to close the file when the pipeline finishes.
Modify the process_item method to save the item data obtained from DemoSpider into the demo.json file.
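Scrapy also calls an open_spider(self, spider) hook on a pipeline when the spider starts, so an equivalent sketch could open the file there instead of in __init__ and keep the open/close logic symmetrical:

import json

class MyspiderPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file here.
        self.f = open("demo.json", "wb+")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file.
        self.f.close()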
6. Start the spider
In the mySpider directory, create a data folder to store the crawled data files.
Enter: mkdir data, and then: cd data/
Use the command: scrapy crawl demo
You can see the detailed output when execution completes.
The required data is provided in the demo.json file.
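Since each item was written as one JSON object followed by ",\n", a quick sketch to load the results back (assuming the demo.json produced by the pipeline above) just strips that trailing separator line by line:

import json

results = []
with open("demo.json", "rb") as f:
    for raw in f:
        line = raw.decode("utf-8").rstrip(",\n")
        if line:
            results.append(json.loads(line))

print(len(results), "items loaded")
print(results[0]["name"] if results else "no items")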
This is a simple example of scraping website data. I believe that with deeper study, we can certainly build much more powerful scraping functionality.