The purpose of data analysis:
This article uses the 2018 sales data of Chaoyang Hospital as an example to understand the hospital's sales in that year. By analyzing the hospital's drug sales data, we can determine the average monthly number of purchases, the average monthly consumption amount, the customer unit price, the consumption trend, and the drugs in highest demand.
The basic process of data analysis includes data collection, data cleaning, model building, data visualization and consumption trend analysis.
Data preparation
The data are stored in an Excel workbook. The pandas Excel-reading function (read_excel) loads them into memory; take care to pass the correct file name and worksheet name. After reading the data, preview them and check some basic information.
Data obtained: the workbook "Sales data of Chaoyang Hospital in 2018.xlsx" (fictitious data); extraction code: 6xm2.
Import raw data
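The import step can be sketched roughly as follows. This is a minimal sketch, assuming pandas is installed together with an Excel engine such as openpyxl. The helper names load_sales and preview are hypothetical, and the path and worksheet name are placeholders to be replaced with those of the downloaded workbook.

```python
import pandas as pd


def load_sales(path, sheet_name=0):
    """Read one worksheet of an Excel workbook into a DataFrame.

    `path` and `sheet_name` are placeholders: pass the actual file name
    and worksheet name of the downloaded workbook.
    """
    # dtype=object asks pandas to keep every column as generic objects
    # instead of inferring numeric types; types are converted during cleaning.
    return pd.read_excel(path, sheet_name=sheet_name, dtype=object)


def preview(df):
    """Print the basic information usually checked right after loading."""
    print(df.head())   # first five rows
    print(df.shape)    # (number of rows, number of columns)
    print(df.dtypes)   # data type of each column
    df.info()          # non-null counts per column, useful for spotting gaps
```

Calling preview(load_sales(path, sheet_name)) with the real file name and worksheet name would then show the first rows and the non-null count of each column.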
Data cleaning
The data cleaning process includes subset selection, column name renaming, missing data processing, data type conversion, data sorting and outlier processing.
(1) Select a subset.
The data obtained may be very large, and not every column is worth analyzing. In that case, a suitable subset should be selected from the full data set so that the analysis focuses on the most valuable part. In this example no subset needs to be selected, so this step can be skipped for now.
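Although the step is skipped in this example, subset selection could look roughly like the sketch below. The sample rows are invented for illustration, and the column names are assumptions based on the names used in the text.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions taken from the text.
sales = pd.DataFrame({
    "Sales Time": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "Commodity Name": ["Drug A", "Drug B", "Drug C"],
    "Sales Quantity": [3, 1, 2],
    "Paid-in Amount": [82.8, 28.0, 16.5],
})

# Keep the rows labelled 0 and 1 and only the two columns of interest.
# .loc slices by label and includes both endpoints.
subset = sales.loc[0:1, ["Commodity Name", "Sales Quantity"]]
print(subset)
```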
(2) Column renaming
During data analysis, some column names are ambiguous or easily confused with the data they contain, which hinders the analysis. Such columns should be given names that are easy to understand, which can be done with the rename function.
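A minimal sketch of renaming with pandas follows. The mapping from "Drug Purchase Time" to "Sales Time" is an assumption inferred from the column names that appear elsewhere in the text, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions.
sales = pd.DataFrame({
    "Drug Purchase Time": ["2018-01-01", "2018-01-02"],
    "Sales Quantity": [3, 1],
})

# Map old column names to new ones; columns not listed are left unchanged.
sales = sales.rename(columns={"Drug Purchase Time": "Sales Time"})
print(sales.columns.tolist())
```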
(3) Missing value processing
The data obtained may contain missing values. The basic information shows that the "Drug Purchase Time" and "Social Security Card Number" columns have missing values. If these are not handled, they will distort the results of the subsequent analysis.
The common ways to handle missing data are to delete the records that contain missing values or to use an algorithm to fill them in.
In this case, for convenience, we simply delete the records with missing values using the dropna function.
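Deleting incomplete records might be sketched as follows. The sample rows are invented, and the column names are assumptions based on the text.

```python
import pandas as pd

# Hypothetical sample rows with gaps in the two columns mentioned in the text.
sales = pd.DataFrame({
    "Sales Time": ["2018-01-01", None, "2018-01-03"],
    "Social Security Card Number": ["001", "002", None],
    "Sales Quantity": [3, 1, 2],
})

print("rows before:", len(sales))
# how="any" drops a row if either of the two listed columns is missing.
sales = sales.dropna(
    subset=["Sales Time", "Social Security Card Number"], how="any"
)
print("rows after:", len(sales))
```

Restricting dropna to a subset of columns keeps rows that are incomplete only in columns the analysis does not use.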
(4) Data type conversion
To keep the import from failing, all data were forced to the object type when they were read in. For the actual analysis, however, the sales quantity, receivable amount and paid-in amount columns must be floating-point numbers, and the sales time must be converted to a datetime format, so the data types need to be converted.
The astype() function converts columns to floating-point data.
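A minimal sketch of the conversion is shown below, assuming the three numeric columns hold numbers stored as text. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows: every value is a string stored with the object dtype.
sales = pd.DataFrame({
    "Sales Quantity": ["3", "1"],
    "Receivable Amount": ["82.8", "28"],
    "Paid-in Amount": ["69", "24.64"],
}, dtype=object)

numeric_cols = ["Sales Quantity", "Receivable Amount", "Paid-in Amount"]
# astype accepts a {column: dtype} mapping and returns a converted copy.
sales = sales.astype({col: "float" for col in numeric_cols})
print(sales.dtypes)
```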
The "Sales Time" column also contains the day of the week, which is not needed for the analysis. The date and the weekday in the "Sales Time" column are therefore separated with the split function; the extracted dates are returned as a Series.
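The splitting step could be sketched as below. The value format (date and weekday separated by a single space) is an assumption, as is the follow-up conversion with pd.to_datetime and the removal of unparseable dates; the sample values are invented.

```python
import pandas as pd

# Hypothetical values: date and weekday separated by one space (assumed format).
sales = pd.DataFrame({
    "Sales Time": [
        "2018-01-03 Wednesday",
        "2018-01-01 Monday",
        "2018-02-30 Friday",   # deliberately impossible date
    ],
})

# str.split returns a Series of lists; .str[0] keeps the part before the space.
date_part = sales["Sales Time"].str.split(" ").str[0]

# errors="coerce" turns impossible dates such as 2018-02-30 into NaT
# instead of raising, so they can be dropped afterwards.
sales["Sales Time"] = pd.to_datetime(
    date_part, format="%Y-%m-%d", errors="coerce"
)
sales = sales.dropna(subset=["Sales Time"])
print(sales)
```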
(5) Data sorting
At this point the records are not in chronological order, so they need to be sorted by sales time. Sorting leaves the index out of order, so the index must be reset afterwards.
In the sorting function, by specifies the column to sort by; ascending=True sorts in ascending order and ascending=False in descending order.
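Sorting and resetting the index might look like the following sketch, which uses the pandas sort_values and reset_index methods on invented rows with assumed column names.

```python
import pandas as pd

# Hypothetical rows in non-chronological order.
sales = pd.DataFrame({
    "Sales Time": pd.to_datetime(["2018-03-05", "2018-01-02", "2018-02-10"]),
    "Paid-in Amount": [24.6, 69.0, 10.0],
})

# Sort chronologically; the original index labels travel with their rows.
sales = sales.sort_values(by="Sales Time", ascending=True)
# drop=True discards the old, now out-of-order index
# instead of keeping it as an extra column.
sales = sales.reset_index(drop=True)
print(sales)
```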
(6) Outlier processing
First check the descriptive statistics of the data.
The descriptive statistics show that the minimum values of the sales quantity, receivable amount and paid-in amount columns are all negative, which is clearly unreasonable. The data contain outliers, so they must be processed further to remove the influence of these values.
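A minimal sketch of this check and filter follows. The rule of keeping only rows with a positive sales quantity is an assumption about how the outliers are removed, and the sample rows and column names are invented.

```python
import pandas as pd

# Hypothetical rows, one of which has negative values.
sales = pd.DataFrame({
    "Sales Quantity": [3.0, -10.0, 2.0],
    "Receivable Amount": [82.8, -374.0, 16.5],
    "Paid-in Amount": [69.0, -374.0, 16.5],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
print(sales.describe())

# Keep only rows whose sales quantity is positive (assumed rule).
sales = sales[sales["Sales Quantity"] > 0].reset_index(drop=True)

# The minimum of each column should now be positive.
print(sales.describe().loc["min"])
```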
After the data cleaning is completed, it is necessary to use the data to build a model (that is, calculate the corresponding business indicators) and present the results in a visual way.
Average monthly number of purchases = total number of purchases / number of months (all purchases by the same person on the same day count as one purchase).
Average monthly consumption amount = total consumption amount / number of months
Customer unit price = total consumption amount / total number of purchases
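The three indicators above can be sketched as follows. Using the paid-in amount as the consumption amount and approximating the number of months as whole 30-day periods are both assumptions; the sample rows and column names are invented.

```python
import pandas as pd

# Hypothetical cleaned rows; the first two are the same person on the same day.
sales = pd.DataFrame({
    "Sales Time": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-02-15", "2018-03-02"]
    ),
    "Social Security Card Number": ["001", "001", "002", "001"],
    "Paid-in Amount": [50.0, 30.0, 120.0, 40.0],
})

# Purchases by the same person on the same day count as one purchase.
visits = sales.drop_duplicates(
    subset=["Sales Time", "Social Security Card Number"]
)
total_visits = len(visits)

# Number of months covered, approximated as whole 30-day periods (assumption).
span_days = (sales["Sales Time"].max() - sales["Sales Time"].min()).days
months = max(span_days // 30, 1)

total_amount = sales["Paid-in Amount"].sum()

monthly_visits = total_visits / months      # average monthly number of purchases
monthly_amount = total_amount / months      # average monthly consumption amount
unit_price = total_amount / total_visits    # customer unit price

print(monthly_visits, monthly_amount, unit_price)
```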
The results show that total daily consumption varies widely. Apart from a few days with unusually large purchases, consumption mostly stays within 1,000 to 2,000 yuan.
Next, aggregate the data by sales time and then analyze it by month.
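A minimal sketch of the monthly aggregation is given below. Grouping by month number alone is enough only because all records are assumed to fall in a single year; the sample rows and column names are invented.

```python
import pandas as pd

# Hypothetical cleaned rows from a single year.
sales = pd.DataFrame({
    "Sales Time": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-02-11", "2018-03-02"]
    ),
    "Paid-in Amount": [50.0, 30.0, 120.0, 40.0],
})

# Use the sales time as the index so rows can be grouped by calendar month.
by_time = sales.set_index("Sales Time")
monthly_total = by_time.groupby(by_time.index.month)["Paid-in Amount"].sum()
print(monthly_total)
```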
The results show that consumption is lowest in July, but the July data are incomplete, so that figure has no reference value.
Monthly consumption in January, April, May and June differs little.
Consumption dropped sharply in February and March, possibly because most people went home for the Lunar New Year during those months.
D. Drug sales analysis
Aggregate the "commodity name" and "sales quantity" data into a Series, which is convenient for later statistics, and sort it in descending order.
Take the ten drugs with the largest sales volume and display the result as a bar chart.
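The aggregation and ranking could be sketched as follows, with invented sample rows and assumed column names. The chart itself is left as a comment because drawing it requires matplotlib.

```python
import pandas as pd

# Hypothetical cleaned rows; the column names are assumptions.
sales = pd.DataFrame({
    "Commodity Name": ["Drug A", "Drug B", "Drug A", "Drug C", "Drug B", "Drug A"],
    "Sales Quantity": [3.0, 1.0, 2.0, 4.0, 2.0, 1.0],
})

# Total quantity sold per drug, returned as a Series indexed by drug name.
quantity_by_drug = sales.groupby("Commodity Name")["Sales Quantity"].sum()

# Largest totals first, then keep at most the first ten entries.
top_drugs = quantity_by_drug.sort_values(ascending=False).head(10)
print(top_drugs)

# top_drugs.plot(kind="bar") would draw the bar chart if matplotlib is installed.
```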
Conclusion: The hospital should keep a close watch on the drugs with the highest sales volume so that shortages do not affect patients. Knowing the ten best-selling drugs also helps to strengthen the management of the hospital pharmacy.
Distribution of daily consumption amount: time on the horizontal axis and paid amount on the vertical axis.
Conclusion: From the scatter chart, it can be seen that the vast majority of people spend less than 500 yuan in one day, and there are also cases where the amount of consumption on individual days is very large.