What are the ways to collect big data?
1. Offline collection:
Tools: ETL. In the data warehouse context, ETL (extract, transform, load) is the representative approach to data collection. During the transform step, the data must be governed according to the specific business scenario: monitoring and filtering out invalid data, converting formats and standardizing values, replacing data where required, and ensuring data integrity.
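To make the ETL idea concrete, here is a minimal sketch in Python. The file names (orders_raw.csv, orders_clean.csv) and the fields (user_id, amount, country) are hypothetical; the transform step illustrates the filtering, format conversion, and standardization described above.

```python
import csv

def extract(path):
    # Extraction: read raw records from a CSV source.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transformation: filter invalid records, convert formats, standardize values.
    for row in rows:
        if not row.get("user_id"):       # monitor/filter: drop records missing a key field
            continue
        try:
            row["amount"] = float(row["amount"])   # format conversion
        except ValueError:
            continue                     # filter out illegal numeric data
        row["country"] = row["country"].strip().upper()  # standardization
        yield row

def load(rows, out_path):
    # Loading: write the cleaned records to a warehouse staging file.
    with open(out_path, "w", newline="") as f:
        writer = None
        for row in rows:
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=row.keys())
                writer.writeheader()
            writer.writerow(row)

load(transform(extract("orders_raw.csv")), "orders_clean.csv")
```

In practice this would run as a scheduled batch job, which is the key contrast with the real-time mode described next.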
2. Real-time collection:
Tools: Flume/Kafka. Real-time collection is used mainly in stream-processing business scenarios, for example to record the various operational activities of a data source: traffic management in network monitoring, stock bookkeeping in financial applications, user access behavior recorded by web servers, and so on. In a stream-processing scenario, the data collector acts as a Kafka consumer, like a dam intercepting the continuous flow of data from upstream; it then applies whatever processing the business scenario requires (de-duplication, de-noising, intermediate computation, etc.) and writes the result to the corresponding data store.
This process resembles traditional ETL, but it runs in a stream-processing mode rather than as scheduled batch jobs. These tools all adopt a distributed architecture and can collect and transmit log data at rates of hundreds of MB per second.
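As a sketch of the consumer side, here is a minimal example using the kafka-python client. The topic name access-logs, the broker address, and the unbounded in-memory de-duplication set are all simplifying assumptions for illustration.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Consume access-log events from Kafka, de-duplicate them, and hand them to storage.
consumer = KafkaConsumer(
    "access-logs",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

seen = set()  # de-duplication window (unbounded here; a TTL cache is typical in practice)
for message in consumer:
    event = message.value
    if event in seen:                    # de-duplication
        continue
    seen.add(event)
    # ... de-noising / intermediate computation per the business scenario ...
    print("store:", event)               # stand-in for writing to the data store
```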
3. Internet collection:
Tools: web crawlers, DPI (deep packet inspection), etc. A web crawler, also known as a web spider or web robot, is a program or script that automatically harvests information from the World Wide Web according to certain rules, and it supports collecting files and attachments such as images, audio, and video. (Scribe, a data collection system developed by Facebook, is another collection tool, though it aggregates log data rather than crawling the web.)
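A minimal crawler sketch using only the Python standard library follows; the regex-based link extraction and the example URL are illustrative simplifications (a production crawler would use a proper HTML parser and honor robots.txt).

```python
import re
import urllib.request

def crawl(url):
    # Fetch one page and extract outgoing links according to a simple rule.
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="(https?://[^"]+)"', html)

# Example: list the absolute links found on a page.
for link in crawl("https://example.com"):
    print(link)
```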
What is the process of big data collection?
Big data collection and processing mainly involves data collection, data preprocessing, data storage, and data processing and analysis. Data quality runs through the entire pipeline and is critical, since every processing step affects the quality of the resulting big data. The stages of the collection process and how each is handled are described below.
Data collection: during the collection stage, the data sources themselves determine the authenticity, integrity, consistency, accuracy, and security of the resulting big data.
Data preprocessing: big data collection usually draws on one or more data sources, including homogeneous or heterogeneous databases, file systems, service interfaces, and so on, which are vulnerable to noisy data, missing values, data conflicts, and the like. The collected data sets therefore need to be preprocessed to ensure the accuracy and value of subsequent big data analysis and prediction.
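As an illustration of these preprocessing steps, here is a short pandas sketch; the file collected.csv, the value column, and the 0 to 1,000,000 valid range are hypothetical assumptions for the example.

```python
import pandas as pd

# Load the merged output of the collection stage (hypothetical file/column names).
df = pd.read_csv("collected.csv")

df = df.drop_duplicates()                               # resolve duplicate/conflicting records
df["value"] = df["value"].fillna(df["value"].median())  # impute missing values
df = df[df["value"].between(0, 1_000_000)]              # drop noisy out-of-range readings

print(df.describe())  # sanity-check the cleaned data before storage and analysis
```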