1. Data collection and preprocessing: Flume NG is a real-time log collection system that lets you plug custom data senders into the logging system to collect data. ZooKeeper is an open source coordination service for distributed applications; here it provides data synchronization.
2. Data storage: Hadoop is an open source framework designed for offline, large-scale data analysis, and HDFS, its core storage engine, is widely used for data storage. HBase is a distributed, column-oriented, open source NoSQL database that runs on top of HDFS, so it can be thought of as a wrapper around HDFS; in essence it is a data store.
3. Data cleansing: MapReduce, the computation engine of Hadoop, is used for parallel processing of large-scale data sets.
4. Data query and analysis: Hive's core job is to translate SQL statements into MapReduce (MR) programs. It maps structured data onto database tables and provides HQL (Hive SQL) query functionality. Spark keeps distributed datasets in memory; besides supporting interactive queries, it can speed up iterative workloads.
5. Data visualization: connect to a BI platform to visualize the analysis results and support decision-making.
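
To make the cleansing step (item 3) more concrete, here is a minimal sketch of the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API; it only mimics the programming model on a single machine. The log format, the sample lines, and all function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw log lines. The layout (date, time, level, message)
# is an assumption for this sketch, not a format from any real system.
RAW_LOGS = [
    "2024-01-01 10:00:00 INFO user=alice action=login",
    "2024-01-01 10:00:05 ERROR user=bob action=pay",
    "",                 # blank line: dropped during cleansing
    "malformed line",   # too few fields: dropped during cleansing
    "2024-01-01 10:01:00 INFO user=alice action=logout",
]


def map_phase(line):
    """Cleanse one line and emit (level, 1); bad lines emit nothing."""
    parts = line.split()
    if len(parts) < 4:
        return []
    return [(parts[2], 1)]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(RAW_LOGS)
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on many nodes over data stored in HDFS, and the framework would handle the shuffle; the sketch only shows how cleansing (dropping bad records) and aggregation fit into the two phases.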