Data Exploration: Data Quality Analysis

Data quality analysis is an important part of data preparation in data mining, a prerequisite for data preprocessing, and the basis for the validity and accuracy of data mining conclusions. Without credible data, any model built through data mining is a castle in the air.

The main task of data quality analysis is to check whether the raw data contains dirty data. Dirty data generally refers to data that does not meet requirements and cannot be used directly in analysis. In common data mining work, dirty data mainly includes missing values, outliers, and inconsistent values, each of which is discussed below.

1. Missing value analysis

Missing data includes both missing records and missing values for a field within a record; either will lead to inaccurate analysis results. The causes and effects of missing values are analyzed below.

(1) Reasons for missing values

1) Some information is temporarily unobtainable, or the cost of obtaining it is too high.

2) Some information is simply missing. It may be omitted because of human factors, such as the value being judged unimportant at input time, being forgotten, or the data being misunderstood; or it may be lost for non-human reasons, such as failure of data collection equipment, storage media, or transmission media.

3) The attribute value does not exist. In some cases a missing value does not indicate an error in the data: for some objects certain attribute values simply do not exist, such as the spouse's name for an unmarried person or the fixed income of a child.

(2) Influence of missing values

1) Data mining modeling will lose a lot of useful information.

2) The uncertainty in the data mining model becomes more pronounced, and the patterns contained in the model are harder to grasp.

3) Data containing null values will confuse the modeling process and lead to unreliable output.

(3) Analysis of missing values

Simple statistical analysis can reveal which attributes contain missing values and, for each such attribute, the number of missing entries and the missing rate.
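
As a minimal sketch in pandas, the missing count and missing rate of each attribute can be computed like this (the sample table below is hypothetical; in practice, load the real dataset instead):

```python
import pandas as pd
import numpy as np

# Hypothetical sample table with some missing entries.
data = pd.DataFrame({'sales': [2618.2, np.nan, 2065.0, 3021.7],
                     'guests': [120, 98, np.nan, np.nan]})

missing_count = data.isnull().sum()        # missing entries per attribute
missing_rate = missing_count / len(data)   # missing rate per attribute
print(pd.DataFrame({'missing_count': missing_count,
                    'missing_rate': missing_rate}))
```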

2. Outlier analysis

Outlier analysis checks the data for input errors and unreasonable values. Ignoring outliers is dangerous: including them in calculations without scrutiny will distort the results. Conversely, paying attention to outliers and analyzing their causes is often an opportunity to discover problems and improve decision-making.

Outliers are individual values in a sample that deviate markedly from the rest of the observations. They are also called abnormal values, and their analysis is also called anomaly analysis.

(1) Simple statistical analysis

Descriptive statistics can be computed for each variable first to see which values are unreasonable. The most commonly used statistics are the maximum and minimum, which show whether a variable's values fall outside a reasonable range. For example, if the maximum recorded customer age is 199 years, that variable contains an abnormal value.
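
A minimal sketch of such a range check, using a hypothetical customer table:

```python
import pandas as pd

# Hypothetical customer table with one clearly impossible age.
customers = pd.DataFrame({'age': [34, 27, 199, 45, 61]})

print(customers['age'].min(), customers['age'].max())                 # max 199 is out of range
print(customers[(customers['age'] < 0) | (customers['age'] > 120)])  # flag the abnormal rows
```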

(2) The 3σ principle

If the data follow a normal distribution, then under the 3σ principle an outlier is defined as a value that deviates from the mean by more than three standard deviations. Under the normality assumption, the probability of a value falling more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, so such values are rare small-probability events.

If the data do not follow a normal distribution, outliers can instead be described in terms of how many standard deviations they lie from the mean.
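
A minimal sketch of the 3σ rule on synthetic data (the series and the injected extreme values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal daily sales with two injected abnormal values.
rng = np.random.default_rng(0)
sales = pd.Series(rng.normal(2500, 300, 200))
sales.iloc[10] = 9106.44
sales.iloc[20] = 22.0

mu, sigma = sales.mean(), sales.std()
outliers = sales[(sales - mu).abs() > 3 * sigma]  # 3-sigma rule
print(outliers)
```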

(3) Box plot analysis

A box plot provides a criterion for identifying outliers: an outlier is usually defined as a value smaller than QL − 1.5IQR or larger than QU + 1.5IQR. QL is called the lower quartile: a quarter of all observations are smaller than it. QU is called the upper quartile: a quarter of all observations are larger than it. IQR is called the interquartile range: it is the difference between the upper and lower quartiles and covers half of all observations.

A box plot is drawn from the actual data and imposes no restrictive requirements on the data (such as following a particular distribution); it simply shows the true shape of the data distribution directly and intuitively. Moreover, the box plot's criterion for judging outliers is based on the quartiles and the interquartile range, and quartiles are robust: up to 25% of the data can be moved arbitrarily far away without greatly disturbing the quartiles, so outliers cannot unduly influence this criterion. Identifying outliers with a box plot is therefore objective and has clear advantages.
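
A minimal sketch of this criterion, again on synthetic data made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sales = pd.Series(rng.normal(2500, 300, 200))  # hypothetical daily sales
sales.iloc[5] = 6607.4                          # injected extreme values
sales.iloc[15] = 51.0

ql = sales.quantile(0.25)                       # lower quartile QL
qu = sales.quantile(0.75)                       # upper quartile QU
iqr = qu - ql                                   # interquartile range
outliers = sales[(sales < ql - 1.5 * iqr) | (sales > qu + 1.5 * iqr)]
print(outliers)
```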

The daily sales data in the catering system may contain both missing values and outliers.

By inspecting the daily sales data of the catering system, some missing values can be found. But when there are many records and attributes, manual inspection is impractical, so a program is needed to detect the records and attributes with missing values, along with the number and rate of missing entries.

In Python's pandas library, you only need to read in the data and then call the describe() function to view its basic statistics.
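
A minimal sketch, assuming the daily sales data is stored in an Excel file; the file name catering_sale.xls and the index column name are assumptions to adjust for the actual dataset:

```python
import pandas as pd

# Assumed file and column names; point them at the real file.
data = pd.read_excel('catering_sale.xls', index_col='date')
print(data.describe())                  # count, mean, std, min, quartiles, max
print('number of records:', len(data))
```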

In the output of describe(), count is the number of non-null values. Since len(data) shows 201 records, the number of missing values is 1. The other statistics provided are the mean, the standard deviation (std), the minimum (min), the maximum (max), and the 1/4, 1/2, and 3/4 quantiles (25%, 50%, 75%). A more intuitive way to display the data and detect outliers is to use a box plot.
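
Continuing from the data loaded above, a sketch of such a program might be:

```python
import matplotlib.pyplot as plt

# Report the total number of missing entries, then draw the box plot.
print('the number of missing values is', int(data.isnull().sum().sum()))

plt.figure()
data.boxplot()   # pandas draws one box per numeric column
plt.show()
```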

After running the program, it prints "the number of missing values is 1" and produces the box plot of the daily sales data.

As can be seen from the box plot, the eight sales values beyond the upper and lower bounds may be outliers. Combined with the specific business context, 865, 4060.3, and 4065.2 can be classified as normal values, while 22, 51, 60, 6607.4, and 9106.44 can be classified as outliers. The filtering rule is finally determined as: a daily sales value below 400 or above 5000 is abnormal data, and a filtering program is written for subsequent processing, as sketched below.
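
A sketch of that filter, assuming the sales column is named sale (a hypothetical column name):

```python
# Apply the filtering rule derived above: keep 400 <= daily sales <= 5000.
abnormal = (data['sale'] < 400) | (data['sale'] > 5000)
clean = data[~abnormal]
print('abnormal records removed:', int(abnormal.sum()))
```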

3. Consistency analysis

Data inconsistency refers to contradictions and incompatibilities within the data. Mining inconsistent data directly may produce results that contradict reality.

In data mining, inconsistent data mainly arises during data integration, when data stored redundantly across different sources is not kept consistent. For example, two tables both store a user's phone number, but when the number changes only one table is updated, leaving the two tables inconsistent.
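
A sketch of such a consistency check in pandas, using hypothetical user tables from two source systems:

```python
import pandas as pd

# Hypothetical user tables from two different sources.
table_a = pd.DataFrame({'user_id': [1, 2], 'phone': ['555-0101', '555-0202']})
table_b = pd.DataFrame({'user_id': [1, 2], 'phone': ['555-0101', '555-9999']})

# After integration, compare the redundantly stored field.
merged = table_a.merge(table_b, on='user_id', suffixes=('_a', '_b'))
print(merged[merged['phone_a'] != merged['phone_b']])  # inconsistent records
```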