Python is actually very simple. Chapter 21: Data Processing with DataFrame
Once the data has been read from Excel into a DataFrame, all kinds of processing become very convenient.

21.1 Sum between columns

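Find the total score (total score = Chinese + math + English)

For the student report card from the previous chapter, a single statement calculates the total score and fills it in:

df['total score'] = df['Chinese'] + df['math'] + df['English']

Values can also be replaced across the whole DataFrame in a single statement:

df.replace([98, 76, 99], 0, inplace=True)

This replaces the values 98, 76 and 99 everywhere in the DataFrame with 0 in one pass.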
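A minimal sketch of the column sum and the whole-frame replacement, assuming a small made-up DataFrame instead of the Excel file. The names and scores are invented for illustration; only the column names follow the report card.

```python
import pandas as pd

# Made-up sample data; not taken from student.xlsx.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "Chinese": [84, 89, 91],
    "math": [71, 89, 95],
    "English": [93, 89, 83],
})

# Element-wise sum of three columns, stored in a new column.
df["total score"] = df["Chinese"] + df["math"] + df["English"]

# Replace several values everywhere in the frame in one call.
# Without inplace=True, replace() returns a new DataFrame and leaves df untouched.
zeroed = df.replace([84, 95], 0)

print(df)
print(zeroed)
```

Assigning the result to a new variable, as here, keeps the original data available for comparison; inplace=True modifies df directly.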

21.2 Sorting

You can sort by a single column as the key, or by several columns as primary and secondary keys. The order can be ascending or descending.

The syntax of the sort_values() function is as follows:

df.sort_values(by=["col1", "col2", ..., "coln"], ascending=False)

Here by is a column name or a list of column names, and ascending sets the sort direction: True means ascending order and is the default, so it can be omitted; False means descending order.

For example:

df = df.sort_values(by=['total score'], ascending=False)

sorts from high to low by "total score".

df = df.sort_values(by=['total score', 'Chinese'], ascending=False)

sorts from high to low by "total score"; where the "total score" is the same, it sorts from high to low by the "Chinese" score.
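The primary/secondary key behaviour can be checked on a small made-up DataFrame (the data below is invented for illustration). The sketch also shows that ascending accepts a list with one flag per key, in case the two keys need different directions.

```python
import pandas as pd

# Made-up sample data with tied total scores.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Chinese": [84, 89, 91, 80],
    "total score": [248, 267, 248, 267],
})

# Primary key "total score", secondary key "Chinese", both descending.
ranked = df.sort_values(by=["total score", "Chinese"], ascending=False)

# One flag per key: "total score" descending, "Chinese" ascending.
mixed = df.sort_values(by=["total score", "Chinese"], ascending=[False, True])

print(ranked)
print(mixed)
```

sort_values() returns a new DataFrame, which is why the chapter assigns the result back to df.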

21.3 Field truncation

slice() extracts a substring from every value in a string column. The format is as follows:

slice(start, stop)

Here start is the starting position and stop is the end position; the character at position stop itself is not included.

Example:

df['grade'] = df['student number'].str.slice(0, 2)

This statement extracts the first and second characters of the student number field and assigns them to the grade field.
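A short sketch of the truncation, assuming the student numbers are stored as strings (which is what converters={'student number': str} ensures when reading the Excel file). The three numbers below are sample values.

```python
import pandas as pd

# Student numbers must be strings, otherwise the leading zero is lost
# and the .str accessor is not available.
df = pd.DataFrame({"student number": ["070101", "070204", "090802"]})

# Characters at positions 0 and 1 form the grade.
df["grade"] = df["student number"].str.slice(0, 2)

# Characters at positions 0 to 3 form the class.
df["class"] = df["student number"].str.slice(0, 4)

# The .str accessor also supports ordinary slice notation.
same_grade = df["student number"].str[:2]

print(df)
```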

21.4 Record extraction

You can extract the records that satisfy a condition.

For example, extract the records with total score > 300:

df[df['total score'] > 300]

Extract the records with a total score between 306 and 310 (inclusive):

df[df['total score'].between(306, 310)]

Extract the records whose student number contains "0803". This makes it easy to pull out the information for one class:

df[df['student number'].str.contains('0803', na=False)]

Here na=False means that missing values such as NaN are treated as non-matching.
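The three kinds of filter can be tried on a made-up DataFrame that includes one missing student number, to show the effect of na=False. All values below are invented for illustration.

```python
import pandas as pd

# Made-up sample data; the third student number is deliberately missing.
df = pd.DataFrame({
    "student number": ["070101", "080301", None, "080305"],
    "total score": [248, 306, 310, 320],
})

# Comparison produces a boolean Series that selects the matching rows.
high = df[df["total score"] > 300]

# between() includes both end points by default.
band = df[df["total score"].between(306, 310)]

# na=False turns the missing student number into a non-match
# instead of a NaN that would break the boolean selection.
class_0803 = df[df["student number"].str.contains("0803", na=False)]

print(high)
print(band)
print(class_0803)
```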

21.5 Modifying records

1. Whole-column replacement

We filled a whole column with data earlier; filling a column overwrites whatever it held before.

That is, the following statement:

df['total score'] = df['Chinese'] + df['math'] + df['English']

2. Individual modification

To replace the value 99 with the value 100, you can use the following statement:

df.replace(99, 100)

To replace a value only in specified columns, for example replacing 99 with 100 in the Chinese and English columns, you can use the following statement:

df.replace({'Chinese': 99, 'English': 99}, 100)

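The following program can be used for verification:

from pandas import read_excel

file = 'd:/student.xlsx'
# Read the student number as a string so that leading zeros are kept.
df = read_excel(file, sheet_name=0, converters={'student number': str})
# Before the replacement: rows with 99 in Chinese or English.
print(df[(df.Chinese == 99) | (df.English == 99)])
df = df.replace({'Chinese': 99, 'English': 99}, 100)
# After the replacement: the same query should find nothing.
print(df[(df.Chinese == 99) | (df.English == 99)])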

The running results are as follows:

    serial number student number           name grade class  Chinese  math  English  total score  ranking
28             29         090802  Ding Nengtong    09   NaN      119   120       99          338      NaN
29             30         090203           Shen    09   NaN      109   108       99          316      NaN
Empty DataFrame
Columns: [serial number, student number, name, grade, class, Chinese, math, English, total score, ranking]
Index: []

The first print() statement outputs two records that meet the condition "99 points in Chinese or English". After the replacement statement has run, no record in df meets that condition any more, so the second print() statement shows an empty DataFrame.
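The same check can be reproduced without the Excel file. The sketch below uses a made-up DataFrame in which the math column also contains 99, to show that the dictionary form only touches the columns it names.

```python
import pandas as pd

# Made-up sample data; 99 appears in all three columns.
df = pd.DataFrame({
    "Chinese": [99, 84, 89],
    "math": [99, 71, 89],
    "English": [83, 99, 89],
})

# Only the Chinese and English columns are searched for 99.
fixed = df.replace({"Chinese": 99, "English": 99}, 100)

# Rows that still have 99 in Chinese or English after the replacement.
remaining = fixed[(fixed.Chinese == 99) | (fixed.English == 99)]

print(fixed)
print(remaining)
```

Because math is not listed in the dictionary, its 99 is left as it is.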

21.6 Merging records

The format of the concat() function is as follows:

concat([dataFrame1, dataFrame2, ...], ignore_index=True)

Here dataFrame1 and so on are the DataFrame objects to be merged, and ignore_index=True means the result is re-indexed after merging. The return value is also a DataFrame.

The concat() function does much the same job as the DataFrame append() method, which was deprecated in pandas 1.4 and removed in pandas 2.0.

Example:

import pandas  # import the pandas module
from pandas import read_excel  # import read_excel from pandas

# The variable file holds the file path; note the use of '/'.
# See the table in chapter 18 for the data.
file = 'd:/student.xlsx'
# Read the Excel file into a DataFrame variable.
df = read_excel(file, sheet_name=0, converters={'student number': str})
df = df[:5]  # take the first 5 records of df
print(df)  # output df
df1 = df[:3]  # store the first three records of df in df1
df2 = df[3:5]  # store the last two records of df in df2
df3 = pandas.concat([df2, df1])  # merge df2 and df1 and store the result in df3
print(df3)  # output df3

The running results are as follows:

   serial number student number          name  grade  class  Chinese  math  English  total score  ranking
0              1         070101     Wang Boyu    NaN    NaN       84    71       93          NaN      NaN
1              2         070102  Chen Guantao    NaN    NaN       89    89       89          NaN      NaN
2              3            ...        Li ...    NaN    NaN       89    72       76          NaN      NaN
3              4         070204  Jiang Haiyan    NaN    NaN       89    89       89          NaN      NaN
4              5         070205     Lin Ruoxi    NaN    NaN       91    95       83          NaN      NaN
   serial number student number          name  grade  class  Chinese  math  English  total score  ranking
3              4         070204  Jiang Haiyan    NaN    NaN       89    89       89          NaN      NaN
4              5         070205     Lin Ruoxi    NaN    NaN       91    95       83          NaN      NaN
0              1         070101     Wang Boyu    NaN    NaN       84    71       93          NaN      NaN
1              2         070102  Chen Guantao    NaN    NaN       89    89       89          NaN      NaN
2              3            ...        Li ...    NaN    NaN       89    72       76          NaN      NaN

Because df1 is placed after df2 and ignore_index is not set, each row keeps its original index.
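To see the difference that ignore_index=True makes, here is a minimal sketch on a made-up five-row DataFrame (the names are placeholders, not the students from the file).

```python
import pandas as pd

# Made-up sample data standing in for the first five records.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "Chinese": [84, 89, 89, 89, 91],
})

df1 = df[:3]   # rows with index 0, 1, 2
df2 = df[3:5]  # rows with index 3, 4

# Without ignore_index the original row labels are carried over.
kept = pd.concat([df2, df1])

# With ignore_index=True the result is renumbered from 0.
renumbered = pd.concat([df2, df1], ignore_index=True)

print(kept)
print(renumbered)
```

Renumbering is usually the safer choice when the merged pieces come from different sources, since duplicate index labels can make later label-based lookups ambiguous.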

21.7 Statistics

You can count how many times each value appears in a column in the following way.

from pandas import read_excel

file = 'd:/student.xlsx'
df = read_excel(file, sheet_name=0, converters={'student number': str})
df = df[:5]
print(df)
print(df['Chinese'].value_counts())

The output is as follows:

   serial number student number          name  grade  class  Chinese  math  English  total score  ranking
0              1         070101     Wang Boyu    NaN    NaN       84    71       93          NaN      NaN
1              2         070102  Chen Guantao    NaN    NaN       89    89       89          NaN      NaN
2              3            ...        Li ...    NaN    NaN       89    72       76          NaN      NaN
3              4         070204  Jiang Haiyan    NaN    NaN       89    89       89          NaN      NaN
4              5         070205     Lin Ruoxi    NaN    NaN       91    95       83          NaN      NaN
89    3
84    1
91    1
Name: Chinese, dtype: int64

As you can see, the value_counts() function counts the number of occurrences of each value in a column.

The parameters of the value_counts() function include:

ascending: ascending=True sorts the counts in ascending order and ascending=False sorts them in descending order (descending is the default, so this parameter can be omitted);

normalize: when normalize=True, the result shows the proportion of each value rather than the number of occurrences.

Change the statement print(df['Chinese'].value_counts()) to:

print(df['Chinese'].value_counts(ascending=True, normalize=True))

The output becomes:

91    0.2
84    0.2
89    0.6
Name: Chinese, dtype: float64
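The counts and proportions above can be reproduced from the five Chinese scores alone. This sketch builds the Series by hand rather than reading the Excel file.

```python
import pandas as pd

# The five Chinese scores from the example, entered by hand.
chinese = pd.Series([84, 89, 89, 89, 91], name="Chinese")

# Default: counts sorted from most to least frequent.
counts = chinese.value_counts()

# normalize=True divides each count by the total number of values.
shares = chinese.value_counts(normalize=True)

print(counts)
print(shares)
```

The order of values that share the same count (here 84 and 91) is not something to rely on; only the ordering by count is determined by the ascending parameter.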

21.8 Find by value

print(df['Chinese'].isin([84, 91]))

This checks, record by record, whether the value in the Chinese column matches any element of the list passed to isin(). The result is True where it matches and False otherwise.

Output:

0     True
1    False
2    False
3    False
4     True
Name: Chinese, dtype: bool
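The boolean result of isin() is typically used as a row filter. A minimal sketch, using a made-up DataFrame with placeholder names:

```python
import pandas as pd

# Made-up sample data with the same five Chinese scores.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "Chinese": [84, 89, 89, 89, 91],
})

# True where the Chinese score is one of the listed values.
mask = df["Chinese"].isin([84, 91])

matched = df[mask]    # rows whose score is in the list
others = df[~mask]    # ~ inverts the mask: rows whose score is not in the list

print(matched)
print(others)
```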

21.9 Data partitioning

Data can be divided into intervals according to a set of boundaries, with each interval represented by a label. This is done with the cut() function.

The syntax is as follows:

cut(series, bins, right=True, labels=None)

where:

series is the data to be grouped;

bins is the basis for grouping: a list whose elements are the interval boundaries. For example, [0, 72, 96, 120] gives three intervals, 0~72, 72~96 and 96~120. By default each interval excludes its left boundary and includes its right boundary;

right indicates whether the right boundary of each interval is closed;

labels gives custom labels for the groups; it can be omitted.

Let's group the Chinese scores in the student report card above and add a new column, "Chinese grade".

import pandas as pd
from pandas import read_excel  # import read_excel from pandas

file = 'd:/student.xlsx'
df = read_excel(file, sheet_name=0, converters={'student number': str})
df['grade'] = df['student number'].str.slice(0, 2)
df['class'] = df['student number'].str.slice(0, 4)
df['total score'] = df.Chinese + df.math + df.English
# The upper boundary is the highest Chinese score plus 1, so that with
# right=False the top score still falls inside the last interval.
bins = [0, 72, 96, max(df.Chinese) + 1]
lab = ['failed', 'passed', 'excellent']
grade = pd.cut(df.Chinese, bins, right=False, labels=lab)
df['Chinese grade'] = grade
print(df.head())
print("Statistical results of Chinese scores:")
print(df['Chinese grade'].value_counts())

The running results are as follows:

   serial number student number          name grade class  Chinese  math  English  total score Chinese grade
0              1         070101     Wang Boyu    07  0701       84    71       93          248        passed
1              2         070102  Chen Guantao    07  0701       89    89       89          267        passed
2              3            ...        Li ...    07  0701       89    72       76          237        passed
3              4         070204  Jiang Haiyan    07  0702       89    89       89          267        passed
4              5         070205     Lin Ruoxi    07  0702       91    95       83          269        passed
Statistical results of Chinese scores:
passed       17
excellent    10
failed        4
Name: Chinese grade, dtype: int64
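The effect of the right parameter is easiest to see on scores that sit exactly on a boundary. The sketch below uses five made-up scores and the boundaries [0, 72, 96, 120] mentioned earlier; it illustrates the parameter and is not part of the original program.

```python
import pandas as pd

# Made-up scores; 72 and 96 sit exactly on interval boundaries.
scores = pd.Series([60, 72, 95, 96, 119])
bins = [0, 72, 96, 120]
labels = ["failed", "passed", "excellent"]

# Default right=True: intervals are (0, 72], (72, 96], (96, 120].
right_closed = pd.cut(scores, bins, labels=labels)

# right=False: intervals are [0, 72), [72, 96), [96, 120).
left_closed = pd.cut(scores, bins, right=False, labels=labels)

print(pd.DataFrame({
    "score": scores,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```

With right=False a score of 72 counts as "passed" and 96 as "excellent", which matches the usual reading of a pass mark. It also explains why the program sets the last boundary to the maximum score plus 1: with right=False a score equal to the last boundary would fall outside every interval.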