Pandas Beginner Notes, All in English (Continuously Updated)

These contents are based on my personal study notes for the University of Michigan open course "Applied Data Science with Python Specialization", taught by Professor Christopher Brooks and published on Coursera. If you find any mistakes or inappropriate content, please feel free to let me know.

Wishing all of us a pleasant journey in the universe of Python programming!

A Series (a data structure in pandas) can be created from a list or a dictionary.

  • series: 2 columns of data

column 1: labels
- automatically assigned, starting from 0 (unless set explicitly)

column 2: actual data
- stored in order
- could be tuples

  •  dtype of the elements in the series:

type of actual data      dtype of the elements in the series
- strings                - object
- numbers (integers)     - int64
- strings and None       - object (None is kept as None)
- numbers and None       - float64 ("upcasting"); None becomes NaN

NaN is different from None: it is treated as a numeric (floating-point) value for efficiency reasons.
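The dtype rules above can be checked with a small sketch. The sample values are made up for illustration, and the dtype reported for strings depends on the pandas version, so treat the comments as a guide rather than a guarantee.

```python
import numpy as np
import pandas as pd

# Integers only: stored as int64.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Integers plus None: upcast to float64, and None is stored as NaN.
with_missing = pd.Series([1, 2, None])
print(with_missing.dtype)

# Strings (with or without None): reported as object in pandas 1.x/2.x;
# newer releases may report a dedicated string dtype instead.
words = pd.Series(["a", "b", None])
print(words.dtype)

# NaN is not equal to anything, including itself, so test for it
# with Series.isna() (or np.isnan) rather than with ==.
print(np.nan == np.nan)               # False
print(with_missing.isna().tolist())   # [False, False, True]
```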

Create a series   &   Set the index: 

  • create series from list (labels automatically assigned, starting from 0):

        values = [value1, value2, value3]     # avoid naming a variable "list": it hides the built-in type
        s = pd.Series(values)

  • create series from list (labels explicitly assigned):

        values = [value1, value2, value3]
        s = pd.Series(values, index = [idx1, idx2, idx3])

  • create series from dictionary (the dict's keys become the labels):

        data = {key1 : value1, key2 : value2, key3 : value3}     # avoid naming a variable "dict" for the same reason
        s = pd.Series(data, index = [key1, key2, key3])     # index is optional here; it picks which keys to keep, and in what order
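As a concrete version of the three patterns above, here is a minimal sketch. The names and labels ("Alice", "school1", and so on) are placeholders chosen for illustration.

```python
import pandas as pd

# From a list: labels default to 0, 1, 2.
s_default = pd.Series(["Alice", "Jack", "Helen"])

# From a list with explicit labels.
s_labeled = pd.Series(["Alice", "Jack", "Helen"],
                      index=["school1", "school2", "school3"])

# From a dictionary: the keys become the labels.
s_from_dict = pd.Series({"school1": "Alice", "school2": "Jack", "school3": "Helen"})

print(s_default.index.tolist())    # [0, 1, 2]
print(s_labeled.index.tolist())    # ['school1', 'school2', 'school3']
print(s_from_dict.index.tolist())  # ['school1', 'school2', 'school3']
```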

Querying a Series' objects:

  • by index position (starting from 0): .iloc[ ]
  • by index label: .loc[ ]

Examples:

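  • s.iloc[3]  or  s[3]  (s[3] is treated as a position only when the labels are not integers, and recent pandas versions deprecate that usage, so .iloc is the safer choice)
  • s.loc['label_name']  or  s['label_name']

Notes:

  1. .iloc[ ] and .loc[ ] are indexer attributes, not methods, so use [ ] instead of ( ).
  2. With a single argument they select rows, not columns. To select column(s) of a DataFrame, either pass a second argument (e.g. df.loc[:, column_name]) or "transpose" the data first.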
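A small sketch of both query styles on a Series follows. The scores and labels are invented for the example.

```python
import pandas as pd

s = pd.Series([90, 85, 70], index=["alice", "jack", "helen"])

second_by_position = s.iloc[1]   # position 1 -> 85
jack_by_label = s.loc["jack"]    # label "jack" -> 85

print(second_by_position, jack_by_label)
```

Both lookups return the same element here, because "jack" happens to sit at position 1.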

Merging Series' objects: merge Series A with a second Series B to become a new Series C

C = pd.concat([A, B])

Older tutorials use C = A.append(B), but Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0.

Note: Series A will be unchanged, as the common pattern in pandas is to return a new object by default instead of modifying the original one.
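The sketch below combines two Series with pd.concat(), which is used here instead of Series.append() because the latter is not available in pandas 2.x. The data is made up.

```python
import pandas as pd

a = pd.Series(["Alice", "Jack"], index=["school1", "school2"])
b = pd.Series(["Helen"], index=["school3"])

c = pd.concat([a, b])           # a new Series; a and b are left untouched

print(len(a), len(b), len(c))   # 2 1 3
```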


DataFrame: a data structure in pandas

Creating a DataFrame:

  • create df from (multiple) series, each series as a row of data; index (a list) as a second argument:

        df = pd.DataFrame([pd.Series(dict1), pd.Series(dict2), pd.Series(dict3)], index = [idx1, idx2, idx3])

  • create df from a list, with dictionaries embedded:

        df = pd.DataFrame([dict1, dict2, dict3], index = [idx1, idx2, idx3])
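Both constructions above can be tried with a few invented records. The field names and values below are assumptions made for the example.

```python
import pandas as pd

record1 = {"name": "Alice", "class": "Physics", "score": 85}
record2 = {"name": "Jack", "class": "Chemistry", "score": 82}
record3 = {"name": "Helen", "class": "Biology", "score": 90}
labels = ["school1", "school2", "school3"]

# Each Series becomes one row; the dictionary keys become the columns.
df_from_series = pd.DataFrame(
    [pd.Series(record1), pd.Series(record2), pd.Series(record3)],
    index=labels)

# The same table built directly from a list of dictionaries.
df_from_dicts = pd.DataFrame([record1, record2, record3], index=labels)

print(df_from_dicts)
```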

Loading data as DataFrame via reading a file:   pd.read_csv(filepath, index_col=0)

df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)

index_col=0 is an optional argument, meaning to set the first column of the .csv file as the index.
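To see the effect of index_col=0 without needing a file on disk, the sketch below reads CSV text from memory. The column names and numbers are invented stand-ins and are not taken from the real Admission_Predict.csv.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = (
    "Serial No.,GRE Score,Chance of Admit\n"
    "1,337,0.92\n"
    "2,324,0.76\n"
    "3,316,0.72\n"
)

# index_col=0 turns the first CSV column into the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.index.name)         # Serial No.
print(df.columns.tolist())   # ['GRE Score', 'Chance of Admit']
```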

Set the df index using existing columns:   df.set_index()         It returns a new DataFrame (the original df changes only if you assign the result back or pass inplace=True), and the old index is discarded.

If we want to change the index from column A to column B without losing the values of column A, we first have to copy column A (currently the index) into an ordinary column to keep it:

        df[column_A_name] = df.index

Then set column B as the index:

        df = df.set_index(column_B_name)

Then if we want to go back to a default numbered index starting from 0 at the very left side of the df:

        df = df.reset_index()

reset_index() moves the current index (column B) back into an ordinary column and replaces it with the default numerical index; it does not create a second index level. A 2-level index ("multi-level indexing" or "hierarchical indices") is created by passing a list of columns to set_index(). Pandas will search through the levels in order.
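The three steps above (keep the old index, set a new one, reset) can be followed on a tiny made-up table. Note that after reset_index() the result has a single-level default index.

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 82, 90]},
                  index=["alice", "jack", "helen"])

df["name"] = df.index         # keep the old index as an ordinary column
df = df.set_index("score")    # "score" becomes the index
df = df.reset_index()         # default index 0, 1, 2; "score" is a column again

print(df.columns.tolist())    # ['score', 'name']
print(df.index.tolist())      # [0, 1, 2]
```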

Dual index:

To use column A (e.g. state name) as the first level index and column B (e.g. city name) as the second level index:

        df = df.set_index([column_A_name, column_B_name])

To compare the values of column C (e.g. the population) between 2 cities (e.g. city5 and city6) in the same state (e.g. state1), put each state name and city name in a tuple and pass a list of those tuples to .loc:

        df.loc[ [(state1, city5),
                 (state1, city6)] ]

Note: if it is the columns that are dual indexed, just transpose first.
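A minimal sketch of the state/city lookup described above, with invented states, cities and populations:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["state1", "state1", "state2"],
    "city": ["city5", "city6", "city7"],
    "population": [1000, 2500, 400],
})

# Two columns become the two index levels.
df = df.set_index(["state", "city"])

# Each tuple is (first-level label, second-level label).
two_cities = df.loc[[("state1", "city5"), ("state1", "city6")]]
print(two_cities)
```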

Changing column names:  df.rename(), by passing in a dictionary, with old column names as keys and new column names as values.

new_df = df.rename(columns={old_column_name1 : new_column_name1,
                            old_column_name2 : new_column_name2,
                            old_column_name3 : new_column_name3})

Get all column names:  df.columns   (an attribute, so no parentheses)
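Renaming is a method of the DataFrame itself and returns a new object. The sketch below uses made-up column names to show that the original stays unchanged.

```python
import pandas as pd

df = pd.DataFrame({"GRE Score": [337, 324], "Chance of Admit": [0.92, 0.76]})

new_df = df.rename(columns={"GRE Score": "gre", "Chance of Admit": "chance"})

print(new_df.columns.tolist())   # ['gre', 'chance']
print(df.columns.tolist())       # ['GRE Score', 'Chance of Admit'] (unchanged)
```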

Selecting values from a DataFrame: 

  • selecting a specific value: df.loc[row_name, column_name]
  • selecting a row (the series): df.loc[row_name]
  • selecting a column: df[column_name]
  • a second way to select a column: first transpose, then select the rows. df.T.loc[column_name_before_trans]

 the original df before transpose:

         name
school1  alice
school2  jack
school3  helen

after transpose, good for row selection:

      school1  school2  school3
name  alice    jack     helen

df.T.loc['name'] returns:

school1  alice
school2  jack
school3  helen
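The four selection styles can be compared on the same small table used above. This is only a sketch on toy data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "jack", "helen"]},
                  index=["school1", "school2", "school3"])

one_value = df.loc["school2", "name"]   # a single cell: 'jack'
one_row = df.loc["school1"]             # a row, returned as a Series
one_column = df["name"]                 # a column, returned as a Series
same_column = df.T.loc["name"]          # the same column via transpose

print(one_value)
print(same_column.tolist())             # ['alice', 'jack', 'helen']
```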

Deleting values from a DataFrame: 

  • option 1: df.drop()       syntax: df.drop(the_label_name, inplace=True, axis=1)

inplace=True means to change the original DataFrame instead of returning a modified copy.

axis=0 (the default) means that a row is deleted, while axis=1 means that a column is deleted.

  • option 2: the "del" keyword (to drop a column); it changes the original df.

del df_name[the_column_name]
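The sketch below tries both deletion options on a made-up table, so the difference between returning a copy and changing the original is visible.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "jack", "helen"],
                   "score": [85, 82, 90],
                   "age": [20, 21, 22]},
                  index=["school1", "school2", "school3"])

without_row = df.drop("school1")        # axis=0 by default; df itself is unchanged
df.drop("age", axis=1, inplace=True)    # removes the column from df itself
del df["score"]                         # also removes a column in place

print(without_row.index.tolist())       # ['school2', 'school3']
print(df.columns.tolist())              # ['name']
```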

Adding a new column with assigned values:

df[name_of_the_added_column] = the_assigned_value

e.g.: df["class ranking"] = None

         name   class ranking
school1  alice  None
school2  jack   None
school3  helen  None

Check all the unique values in one column:   df[column_name].unique()

It'll return an array with all the unique values:

        array([40, 50])       the column has only 2 different values: 40 and 50

To select all rows with the value of 50:

        df = df[df[column_name] == 50]
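The filter above works through a boolean mask: the comparison produces one True/False per row, and indexing with it keeps the True rows. A small sketch with invented scores:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "jack", "helen"],
                   "score": [40, 50, 50]})

values = df["score"].unique()   # array([40, 50])
mask = df["score"] == 50        # boolean Series: False, True, True
only_50 = df[mask]              # rows where the mask is True

print(values)
print(only_50["name"].tolist())   # ['jack', 'helen']
```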

Merging DataFrames (horizontally): pd.merge()    common join types: 1) outer (union), 2) inner (intersection), 3) left, 4) right

Imagine there are 2 datasets: one is information about students (student_df), the other is information about staff (staff_df). Note that there are students who are also working as staff.

  • union ("a full outer join"): combine all information, students and staff

pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

  • intersect ("an inner join"): select students who are also working as staff

pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

  • left join: list all the staff info; also list their info as students if they are also students

pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

  • right join: list all the students info; also list their info as staff if they are also staff

pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

  • to join on a column instead of the index, use the "on" parameter. Note that the column shall exist in both original datasets. Here we join on the "name" column, which stays an ordinary column in the combined dataset.

pd.merge(staff_df, student_df, how='right', on='name')
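To compare the four join types side by side, here is a sketch with two tiny made-up tables joined on a shared "name" column. The people, roles and schools are invented.

```python
import pandas as pd

staff_df = pd.DataFrame({"name": ["Kelly", "Sally", "James"],
                         "role": ["Director", "Tutor", "Grader"]})
student_df = pd.DataFrame({"name": ["James", "Mike", "Sally"],
                           "school": ["Business", "Law", "Engineering"]})

outer = pd.merge(staff_df, student_df, how="outer", on="name")   # everyone
inner = pd.merge(staff_df, student_df, how="inner", on="name")   # people in both
left = pd.merge(staff_df, student_df, how="left", on="name")     # all staff
right = pd.merge(staff_df, student_df, how="right", on="name")   # all students

print(sorted(outer["name"]))   # ['James', 'Kelly', 'Mike', 'Sally']
print(sorted(inner["name"]))   # ['James', 'Sally']
```

Where one side has no matching row (Kelly is not a student, Mike is not staff), the missing columns are filled with NaN.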

Merging Dataframes (vertically):   pd.concat()

merge 3 datasets (df1, df2 and df3) vertically. df1, df2 and df3 are data of year 2021, 2022 and 2023, respectively.

        pd.concat([df1, df2, df3])

To mark which rows of data belong to which year, use the optional parameter "keys":

        pd.concat([df1, df2, df3], keys=['2021', '2022', '2023'])
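With keys, the result gets an extra outer index level, so each year's rows can be pulled back out by its key. A sketch with three invented yearly tables:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["alice", "jack"], "score": [85, 82]})
df2 = pd.DataFrame({"name": ["helen"], "score": [90]})
df3 = pd.DataFrame({"name": ["mike", "sally"], "score": [70, 75]})

stacked = pd.concat([df1, df2, df3], keys=["2021", "2022", "2023"])

print(len(stacked))          # 5
print(stacked.loc["2022"])   # only the rows that came from df2
```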
