Log in to the Octopus 7.0 collector → click the "+" icon in the upper-left corner → select Custom Collection (you can also click "Use Now" below Custom Collection on the home page) to enter the task configuration page. Then enter the URL → save the URL; the system opens the process design page and automatically loads the URL you just entered.
After the web page opens, you can modify the task name; if you leave it unchanged, it defaults to the page title. The task name can be changed at any time before running the collection.
Extract data
On the web page, simply click the data you want to extract; a corresponding prompt appears in the upper-right corner of the window. This tutorial uses news headlines, dates, and body text as examples; adapt the steps to whatever data you need.
After setting up the data to extract, you can click Save and start running the collection. At this point, however, the field names are generated automatically by the system. To better match your needs, click "Process" in the upper-right corner to open the process page and modify the field names. Select the field name you want to change; a drop-down box then offers alternative names that can be used directly. If none of them fits, enter a new field name. After modifying the field names, click OK to save; once saved, you can run the collection.
All editions support local collection; the flagship edition and above also support cloud collection and scheduled cloud collection. Before running a cloud collection, run a local collection first as a test. Once the task finishes, the data can be exported in Excel, CSV, HTML, and other formats, or written to a database. After exporting, you can click the link to open the storage folder and view the data; by default, files are named after the task.
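If you want to take an export one step further, the exported file can be loaded into a database with a few lines of code. The sketch below assumes a CSV export named after a task called "My News Task" with title, date, and body columns; the file name, table name, and column names are all hypothetical placeholders, not anything Octopus guarantees.

```python
# A minimal sketch of loading an Octopus CSV export into a SQLite database.
# The file name and column names below are hypothetical examples.
import csv
import sqlite3

conn = sqlite3.connect("collection.db")
conn.execute("CREATE TABLE IF NOT EXISTS news (title TEXT, date TEXT, body TEXT)")

# Octopus names exported files after the task by default.
with open("My News Task.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        conn.execute(
            "INSERT INTO news VALUES (?, ?, ?)",
            (row.get("title"), row.get("date"), row.get("body")),
        )

conn.commit()
conn.close()
```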
1. Octopus collection principle
The Octopus web data collection client is written in C# and runs on Windows. The main client program handles task configuration and management, cloud collection control, and management of cloud-collected data (export, cleaning, and publishing). The data exporter exports data to Excel, SQL, TXT, MySQL, and other formats, and supports exporting millions of records at a time. The local collection program opens, crawls, and collects web page data according to the workflow, extracting data quickly through regular expressions and XPath. The entire collection process is built on a Firefox-kernel browser that extracts page content by simulating human operations (such as opening a web page and clicking a button in it). Because the whole process is fully visual, data collection requires no programming knowledge. By precisely locating each data item's XPath in the page source, Octopus can accurately collect the data users need in batches.
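To make the XPath principle concrete, here is a minimal sketch of the same extraction idea outside Octopus, using Python's requests and lxml libraries. The URL and XPath expressions are hypothetical placeholders standing in for a real news page.

```python
# A minimal sketch of XPath-based extraction, the same principle Octopus
# applies internally. URL and XPath expressions are hypothetical.
import requests
from lxml import html

page = requests.get("https://example.com/news/article-1")  # hypothetical URL
tree = html.fromstring(page.content)

# Locate each field by its XPath in the page source.
title = tree.xpath("//h1[@class='headline']/text()")        # hypothetical path
date = tree.xpath("//span[@class='publish-date']/text()")   # hypothetical path
body = tree.xpath("//div[@class='article-body']//p/text()") # hypothetical path

print(title[0].strip() if title else "")
print(date[0].strip() if date else "")
print("\n".join(p.strip() for p in body))
```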
2. Functions of Octopus
The Octopus web data collection system is built around an independently developed distributed cloud computing platform. It can quickly obtain large volumes of standardized data from all kinds of websites and web pages, helping any customer who needs web information to automate data collection, editing, and normalization. This removes the dependence on manual searching and gathering, lowering the cost of obtaining information and improving efficiency. Its users span government, universities, enterprises, banking, e-commerce, scientific research, automotive, real estate, media, and many other industries and fields.
As a general-purpose web data collector, Octopus is not limited to any single website or industry: virtually any text visible on a web page, or present in its source code, can be collected, and Octopus can handle about 98% of the web pages on the market.
With local collection (stand-alone collection), you can not only capture the vast majority of web data but also perform preliminary cleaning during collection. For example, the built-in regular-expression tool lets you reformat data with regexes, and in the data source you can remove whitespace, filter dates, and perform similar operations (see the sketch after this paragraph). Octopus also provides a branch-judgment function that applies logical conditions to the information on a page, letting users filter for exactly the records they want.
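As an illustration of the kind of regex cleanup described above, the following sketch strips extra whitespace and reformats a date field in plain Python rather than in Octopus's built-in tool; the raw field value is a hypothetical example.

```python
# A minimal sketch of regex-based field cleanup: trim whitespace and
# normalize a scraped date into ISO form. The raw value is hypothetical.
import re

raw = "  Published:  2023 年 3 月 15 日  "  # hypothetical raw field value

# Collapse repeated whitespace and trim the ends.
text = re.sub(r"\s+", " ", raw).strip()

# Extract a Chinese-style date and reformat it as YYYY-MM-DD.
m = re.search(r"(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日", text)
if m:
    y, mo, d = m.groups()
    print(f"{y}-{int(mo):02d}-{int(d):02d}")  # -> 2023-03-15
```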
In addition to everything local collection (stand-alone collection) offers, cloud collection adds scheduled collection, real-time monitoring, automatic deduplication and storage of data, incremental collection, automatic CAPTCHA recognition, diversified data export through an API, parameter modification, and more. Because multiple cloud nodes run concurrently, collection speed far exceeds local (single-machine) collection. When a task starts, automatic switching among multiple IP addresses also helps avoid IP bans by the target website, maximizing the amount of data collected.
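As an illustration only (not Octopus's actual implementation), automatic duplicate removal of the kind mentioned above is commonly done by hashing each record's content and discarding records whose hash has already been seen:

```python
# A minimal sketch of content-hash deduplication, one common way to
# implement automatic duplicate removal. Not Octopus's internal code.
import hashlib

seen = set()

def is_new(record: dict) -> bool:
    """Return True the first time a record's content is seen."""
    key = "|".join(str(record[k]) for k in sorted(record))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

rows = [
    {"title": "Headline A", "date": "2023-03-15"},
    {"title": "Headline A", "date": "2023-03-15"},  # duplicate, dropped
    {"title": "Headline B", "date": "2023-03-16"},
]
print([r for r in rows if is_new(r)])  # keeps only the two unique rows
```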