Crowd perspective in the era of big data

Crowd perspective, also known as crowd analysis, means selecting a specific group of people by user attributes and using big data technologies to explore what lies behind the data. Common analysis needs include observing the purchase conversion rate in a specific region, the number of new users and the conversion rate for a designated distribution channel, the retention rate of a business campaign, and so on.

Let's look at a simple example first. We create a crowd made up of yesterday's active users. Product staff want to know whether the proportion of male users is higher than that of female users, and obtain the distribution chart by using the relevant analysis techniques.

Users shown as "unknown" did not fill in their gender (gender can also be filled in automatically from ID information, or inferred by other rules and models).

After reading this, I think everyone has a rough idea of what crowd perspective is. So why do crowd perspective at all? Let me describe a scenario first. After seeing this data, we should immediately find out what caused it. First, operators check the retention rate of each channel (dividing users by acquisition channel) and find that the number of newly registered users in a certain channel is increasing rapidly while its retention rate is dropping sharply. Finally, it turns out that the channel distributor targeted advertisements at a specific group of people, but these users left because the product itself could not give them satisfaction or enjoyment.

The above is a scenario analyzed from the crowd perspective. If such a data analysis tool existed, wouldn't it be great to complete most data analysis work without a professional data analyst?

At present there are many scenarios for crowd perspective, for example analyzing and comparing the effects of different campaigns, or finding the best decision point according to a product's trend and growth path.

Readers who are familiar with big data or data warehouse modeling will also be familiar with fact tables and dimension tables. A fact table records business behavior in a specific subject area, while a dimension table records descriptive information about entities. The registration behavior table is a fact table, while the user profile table or the commodity table is a dimension table. Crowd perspective means choosing suitable values from the user attribute dimension table and then looking at how that crowd performs in the business fact table. For example, to check the distribution of consumption types among women in Hangzhou, "Hangzhou women" is a set of attributes from the dimension table, and the consumption type is a field of the consumption record fact table.
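The "Hangzhou women" example can be sketched as a join between a dimension table and a fact table. The following is only a minimal illustration using an in-memory SQLite database; the table names (user_dim, consumption_fact), the column names and the sample rows are all hypothetical and are not taken from any real schema.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with one dimension table and one fact table."""
    conn = sqlite3.connect(":memory:")
    # Dimension table: descriptive attributes of each user (hypothetical schema).
    conn.execute(
        "CREATE TABLE user_dim (user_id INTEGER PRIMARY KEY, city TEXT, gender TEXT)"
    )
    # Fact table: one row per consumption record (hypothetical schema).
    conn.execute(
        "CREATE TABLE consumption_fact (user_id INTEGER, consumption_type TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO user_dim VALUES (?, ?, ?)",
        [
            (1, "Hangzhou", "female"),
            (2, "Hangzhou", "male"),
            (3, "Beijing", "female"),
            (4, "Hangzhou", "female"),
        ],
    )
    conn.executemany(
        "INSERT INTO consumption_fact VALUES (?, ?, ?)",
        [
            (1, "food", 30.0),
            (1, "clothing", 200.0),
            (2, "food", 25.0),
            (3, "clothing", 150.0),
            (4, "food", 40.0),
        ],
    )
    return conn


def consumption_distribution(conn, city, gender):
    """Select the crowd by dimension attributes, then count fact records per type."""
    sql = """
        SELECT f.consumption_type, COUNT(*) AS records
        FROM consumption_fact AS f
        JOIN user_dim AS d ON d.user_id = f.user_id
        WHERE d.city = ? AND d.gender = ?
        GROUP BY f.consumption_type
        ORDER BY f.consumption_type
    """
    return conn.execute(sql, (city, gender)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for consumption_type, records in consumption_distribution(demo, "Hangzhou", "female"):
        print(consumption_type, records)
```

The WHERE clause touches only the dimension table (who is in the crowd), while the GROUP BY and COUNT touch only the fact table (how that crowd behaves).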

First, make clear the business indicators we want to analyze. Taking channels as an example, we want to analyze each channel's new-user growth rate and its registration and login conversion rates. Start by building a fact table, defining its storage granularity and its set of business fields.

For each channel, we need to analyze the daily clicks on its advertising slots, including the number of newly activated devices, the number of newly registered users, the device activation conversion rate, the registration activation conversion rate, the conversion rate of registered active users, and the activation cost, registration cost, activity cost and total cost of the channel.

Indicators are usually numerical, and their calculation rules should be cumulative, as with sum, max, min and count. The functions should satisfy the following relation:

F(A) = f(a, F(A - a)), where A is a set and a is an element of A; that is, the aggregate over a set can be computed iteratively, one element at a time.

By contrast, the mean and the variance are not cumulative aggregate functions.
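A small sketch may make the cumulative property more concrete. The code below keeps a partial state of count, sum, max and min that can be merged with another partial state, so a result computed on one batch of data can be combined with a later batch without rescanning. The Partial class and the function names are hypothetical and serve illustration only. Two means cannot be merged on their own, but a mean can still be derived afterwards from the cumulative sum and count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Partial aggregation state; every field can be merged cumulatively."""
    count: int
    total: float
    maximum: float
    minimum: float


def from_value(x):
    """State for a set containing the single element x."""
    return Partial(count=1, total=x, maximum=x, minimum=x)


def merge(a, b):
    """Combine two partial states without looking at the raw values again."""
    return Partial(
        count=a.count + b.count,
        total=a.total + b.total,
        maximum=max(a.maximum, b.maximum),
        minimum=min(a.minimum, b.minimum),
    )


def aggregate(values):
    """Fold a non-empty sequence of values into one partial state, element by element."""
    it = iter(values)
    state = from_value(next(it))
    for x in it:
        state = merge(state, from_value(x))
    return state


def mean(state):
    """The mean is derived from two cumulative aggregates: sum and count."""
    return state.total / state.count


if __name__ == "__main__":
    day1 = aggregate([3.0, 5.0])
    day2 = aggregate([10.0])
    print(merge(day1, day2), mean(merge(day1, day2)))
```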

Crowd perspective analysis first needs to select the crowd by attributes, which is an inverted index query. Elasticsearch is a commonly used inverted index service, so we can first use its inverted query capability to retrieve the list of user IDs quickly.
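To show the idea behind an inverted index without depending on a running search service, here is a toy in-memory version: each (attribute, value) pair maps to the set of user IDs that have it, and selecting a crowd is the intersection of those sets. This is only a sketch of the principle, not how Elasticsearch is implemented or called; the function names and sample users are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(users):
    """Map each (attribute, value) pair to the set of user IDs that carry it."""
    index = defaultdict(set)
    for user_id, attrs in users.items():
        for name, value in attrs.items():
            index[(name, value)].add(user_id)
    return index


def select_crowd(index, **conditions):
    """Return the sorted user IDs that satisfy every attribute condition."""
    posting_lists = [
        index.get((name, value), set()) for name, value in conditions.items()
    ]
    if not posting_lists:
        return []
    # A user is in the crowd only if it appears in every posting list.
    return sorted(set.intersection(*posting_lists))


if __name__ == "__main__":
    users = {
        1: {"city": "Hangzhou", "gender": "female"},
        2: {"city": "Hangzhou", "gender": "male"},
        3: {"city": "Beijing", "gender": "female"},
        4: {"city": "Hangzhou", "gender": "female"},
    }
    index = build_inverted_index(users)
    print(select_crowd(index, city="Hangzhou", gender="female"))
```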

Querying indicators is an index lookup: the corresponding records are found by user ID. Commonly used multidimensional query tools include Kylin, Druid, Presto and Elasticsearch (ES). The advantages and disadvantages of each framework are compared below.

Because most business conversions take some time to accumulate, T+1 data is enough for most queries. T+1 data can also be aggregated directly from the data warehouse. If the model of a business analysis indicator is fixed, the data can be precomputed and stored with Kylin. If the indicators to be queried are interactive and flexible, storage and query engines such as ES and Presto can be used. The channel analysis model above, for example, can be stored directly in Kylin.

Objective: support the daily analysis work of people who are not big data analysts, helping them find problems faster, identify the direction and priority of those problems, and solve them. In addition, a standard framework is provided so that users can easily import suitable analysis models.

By life cycle, the system can be divided into five layers: crowd indicator definition, data collection and processing, data storage, data query, and graphical data display.

According to the functional modules, we can get the following architecture diagram.

Scenario requirements: specify the business scenarios and data sources to be observed and analyzed. Channel registration conversion analysis, for example, needs to collect ad click tracking events, app click tracking events, user registration events, click cost statistics and so on. With the help of the data warehouse, these are finally processed into a channel conversion fact table. The channel dimension table is built in the same way.

Crowd: a collection of users who share a certain attribute value, for example the lending user group or the wealth management user group.

Crowd group: a combination of crowds. We usually start by comparing different crowds, for example loan users in Hangzhou versus loan users in Beijing. Dimensions that can be used to divide crowds include geographical location, gender and so on. A crowd group supports multidimensional analysis of its users, and the attribute columns used to divide the crowds are the dimensions. Continuous numeric attributes can be bucketed into intervals.
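Bucketing a continuous numeric attribute into intervals can be sketched as follows. The attribute (age), the boundaries and the label format are illustrative assumptions; each interval includes its lower bound and excludes its upper bound.

```python
from bisect import bisect_right
from collections import Counter


def bucket_label(value, boundaries):
    """Return the interval label for value, given sorted interval boundaries."""
    i = bisect_right(boundaries, value)
    if i == 0:
        return f"<{boundaries[0]}"
    if i == len(boundaries):
        return f">={boundaries[-1]}"
    # Interval is [lower, upper): lower bound included, upper bound excluded.
    return f"{boundaries[i - 1]}-{boundaries[i]}"


def count_by_bucket(values, boundaries):
    """Count how many values fall into each interval."""
    return dict(Counter(bucket_label(v, boundaries) for v in values))


if __name__ == "__main__":
    ages = [17, 18, 29, 30, 45, 60]
    print(count_by_bucket(ages, [18, 30, 45]))
```

Once every user carries a bucket label, the label can be used as a dimension in the same way as city or gender.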

The indicator type is usually an aggregate function of numerical values, and the aggregate function should satisfy additivity.

The most commonly used is count. The count function does not depend on any indicator column and is often used to count the number of people in a crowd.

Sum: the sum of the values in a column.

Maximum/Minimum: the maximum/minimum value of a column.

Distinct count: the number of values left after duplicates are removed.

You can select an indicator column from a table (it must be a numeric column; for count, the default value is 1), and then tick the corresponding aggregate function. Different numeric columns can be chosen from different tables.

A proportion indicator is the share of the whole crowd that falls under a selected dimension value, such as home purchases among high-income people in Hangzhou.

The query layer provides a common layer for managing databases and table structures, offers a unified SQL interface to the application layer, and translates each query into a concrete physical query plan according to the storage medium of the underlying physical table. Requests and responses of the query interface are wrapped in a unified format, so storage details are not exposed externally.
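One way to picture such a layer is a single storage-neutral query description that gets translated for different backends. The sketch below is an assumption-heavy illustration, not the system's actual design: QuerySpec, to_sql and to_es_body are hypothetical names, the SQL is a generic dialect with "?" placeholders, and the second translator only builds a request body in the shape of the Elasticsearch query DSL (a bool filter plus a metric aggregation) without sending it anywhere.

```python
from dataclasses import dataclass, field

# Aggregations this sketch supports; all three exist in SQL and in the
# Elasticsearch metric aggregations under the same names.
ALLOWED_AGGREGATIONS = {"sum", "max", "min"}


@dataclass
class QuerySpec:
    """Storage-neutral description of one indicator query (hypothetical)."""
    table: str
    metric_column: str
    aggregation: str
    filters: dict = field(default_factory=dict)


def validate(spec):
    """Reject unknown aggregations and names that are not plain identifiers."""
    if spec.aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {spec.aggregation}")
    for name in (spec.table, spec.metric_column, *spec.filters):
        if not name.isidentifier():
            raise ValueError(f"invalid identifier: {name}")


def to_sql(spec):
    """Translate the spec into a parameterized SQL string and its parameters."""
    validate(spec)
    sql = (
        f"SELECT {spec.aggregation.upper()}({spec.metric_column}) AS value "
        f"FROM {spec.table}"
    )
    if spec.filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in spec.filters)
    return sql, list(spec.filters.values())


def to_es_body(spec):
    """Translate the same spec into an Elasticsearch-style request body."""
    validate(spec)
    return {
        "size": 0,  # only the aggregation result is needed, not the documents
        "query": {
            "bool": {
                "filter": [{"term": {col: val}} for col, val in spec.filters.items()]
            }
        },
        "aggs": {"value": {spec.aggregation: {"field": spec.metric_column}}},
    }


if __name__ == "__main__":
    spec = QuerySpec("channel_fact", "total_cost", "sum", {"channel": "A"})
    print(to_sql(spec))
    print(to_es_body(spec))
```

The caller only ever builds a QuerySpec; which translator runs is decided by where the physical table is stored.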

Dashboard management lets you create analysis templates for a specified crowd, and add, modify and delete charts.

Chart types: one-dimensional and two-dimensional charts are supported. One-dimensional charts usually show quantities, such as pie charts, bar charts and gauges.

Advanced function: select a chart, then tick the dimensions (which can come from a dimension table or a fact table; time, for example, can come from the fact table) and the indicators to display, building a two-dimensional or even multidimensional chart.
