These notes are based on my personal study of the University of Michigan open course "Applied Data Science with Python Specialization", lectured by Professor Christopher Brooks and published on Coursera. If you find any mistakes or improper content, please feel free to let me know.
Wish all of us a pleasant journey in the universe of Python programming!
A Series (a one-dimensional data structure in pandas) can be created from a list or a dictionary.
A Series has two parts:

| labels (the index) | actual data (the values) |
| --- | --- |
| automatically assigned, starting from 0, unless specified explicitly | stored in order; the values can even be tuples |

How the dtype is determined by the type of the actual data:

| type of actual data | dtype of the elements in the Series |
| --- | --- |
| strings | object |
| numbers | int64 |
| strings and None | object (the None is kept as None) |
| numbers and None | float64 ("upcasting"); None is converted to NaN, which is different from None and is treated as a numeric value for efficiency |
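A minimal sketch (toy values of my own, not from the lecture) illustrating these dtype rules:

```python
import pandas as pd
import numpy as np

print(pd.Series(["alice", "jack", "helen"]).dtype)  # object
print(pd.Series([1, 2, 3]).dtype)                   # int64
print(pd.Series(["alice", "jack", None]).dtype)     # object; the None stays None

s = pd.Series([1, 2, None])
print(s.dtype)                                      # float64; the None was upcast to NaN
print(np.nan == np.nan)                             # False: NaN is not even equal to itself
print(s.isna())                                     # use isna()/isnull() to detect NaN
```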
Create a Series & set the index:
values = [value1, value2, value3]
s = pd.Series(values)
values = [value1, value2, value3]
s = pd.Series(values, index=[idx1, idx2, idx3])
data = {key1: value1, key2: value2, key3: value3}
s = pd.Series(data, index=[key1, key2, key3])
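A runnable version of the patterns above, with made-up values:

```python
import pandas as pd

# from a list: the index is auto-assigned as 0, 1, 2
s1 = pd.Series(["tiger", "bear", "moose"])

# from a list, with an explicit index
s2 = pd.Series(["tiger", "bear", "moose"], index=["india", "america", "canada"])

# from a dictionary: the keys become the index
animals = {"india": "tiger", "america": "bear", "canada": "moose"}
s3 = pd.Series(animals)

# passing an index together with a dict keeps only the listed keys;
# keys missing from the dict get NaN
s4 = pd.Series(animals, index=["india", "america", "sweden"])
print(s4)   # sweden -> NaN
```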
Querying a Series:
Examples: s.loc[label] queries by the index label; s.iloc[position] queries by the integer position.
Notes: the indexing operator [] also works, but it has to guess whether the value you pass is a label or a position, so .loc / .iloc are safer, especially when the index itself consists of integers.
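A short sketch of both query styles (my own toy data):

```python
import pandas as pd

s = pd.Series({"india": "tiger", "america": "bear", "canada": "moose"})
print(s.iloc[2])          # query by integer position -> "moose"
print(s.loc["canada"])    # query by index label      -> "moose"

# be careful when the index itself is made of integers:
grades = pd.Series({99: "physics", 100: "chemistry", 101: "english"})
print(grades.iloc[0])     # first element -> "physics"
# grades[0] would raise a KeyError, because [] treats 0 as a label here
```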
Merging Series: merging Series A with a second Series B produces a new Series C:
C = A.append(B)
Note: Series A is left unchanged, since the common pattern in pandas is to return a new object by default instead of modifying the original one.
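Note that Series.append was deprecated and has been removed in recent pandas releases; pd.concat gives the same result. A small sketch with toy data:

```python
import pandas as pd

a = pd.Series(["alice", "jack"], index=["school1", "school2"])
b = pd.Series(["helen"], index=["school3"])

c = pd.concat([a, b])   # modern replacement for a.append(b)
print(c)                # a and b are left unchanged
```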
DataFrame: a two-dimensional, table-like data structure in pandas.
Creating a DataFrame:
df = pd.DataFrame([pd.Series(dict1), pd.Series(dict2), pd.Series(dict3)], index = [idx1, idx2, idx3])
df = pd.DataFrame([dict1, dict2, dict3], index = [idx1, idx2, idx3])
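A runnable sketch of both patterns, with made-up records:

```python
import pandas as pd

record1 = {"name": "alice", "class": "physics",   "score": 85}
record2 = {"name": "jack",  "class": "chemistry", "score": 82}
record3 = {"name": "helen", "class": "biology",   "score": 90}

# from a list of Series
df = pd.DataFrame([pd.Series(record1), pd.Series(record2), pd.Series(record3)],
                  index=["school1", "school2", "school1"])

# equivalent: from a list of dictionaries
df = pd.DataFrame([record1, record2, record3],
                  index=["school1", "school2", "school1"])
print(df)
```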
Loading data into a DataFrame by reading a file: pd.read_csv(filepath, index_col=0)
df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)
index_col=0 is an optional argument; it tells pandas to use the first column of the .csv file as the index.
Set the df index using an existing column: df.set_index(). It returns a new DataFrame, but it is a "destructive" operation in the sense that the current index is not kept as a column in the result.
So, if we want to change the index from column A to column B without losing column A (the current index), we first copy column A back into a regular column (to keep it):
df[column_A_name] = df.index
Then set column B as the index:
df = df.set_index(column_B_name)
Then, if we want to go back to a default numbered index starting from 0 at the very left of the df (reset_index() moves the current index back into a regular column):
df = df.reset_index()
A DataFrame can also carry a 2-level index ("multi-level indexing" or "hierarchical indices"), created by passing a list of columns to set_index() as in the "Dual index" section below; pandas then searches through the index levels in order, from the outer level inwards.
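A self-contained sketch of the whole workflow, using a made-up city/state df instead of the admission data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["ann arbor", "detroit", "chicago"],
                   "state": ["michigan", "michigan", "illinois"],
                   "population": [120, 670, 2700]}).set_index("city")

# keep the current index ("city", i.e. column A) by copying it into a regular column
df["city"] = df.index

# then set another column ("state", i.e. column B) as the index
df = df.set_index("state")

# reset_index() installs a default 0, 1, 2 index and moves "state" back into a column
df = df.reset_index()
print(df)
```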
Dual index (multi-level index):
To use column A (e.g. the state name) as the first-level index and column B (e.g. the city name) as the second-level index:
df = df.set_index([column_A_name, column_B_name])
To compare the values of column C (e.g. the population) between two cities (e.g. city5 and city6) in the same state (e.g. state1), put the state name and the city name together in a tuple to do the dual indexing:
df.loc[ [(state1, city5),
(state1, city6)] ]
Note: if it is the columns that carry the multi-level index, just transpose the DataFrame first.
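A self-contained sketch of dual indexing on the same made-up city/state data:

```python
import pandas as pd

df = pd.DataFrame({"state": ["michigan", "michigan", "illinois"],
                   "city": ["ann arbor", "detroit", "chicago"],
                   "population": [120, 670, 2700]})

# build the two-level index: state is the outer level, city the inner level
df = df.set_index(["state", "city"])

# query with (outer, inner) tuples to compare two cities in the same state
print(df.loc[[("michigan", "ann arbor"),
              ("michigan", "detroit")]])

# a single tuple plus a column name selects one value
print(df.loc[("michigan", "detroit"), "population"])
```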
Changing column names: df.rename(), by passing in a dictionary with old column names as keys and new column names as values.
new_df = df.rename(columns={old_column_name1 : new_column_name1,
                            old_column_name2 : new_column_name2,
                            old_column_name3 : new_column_name3})
Get all column names: df.columns (an attribute, not a method, so no parentheses needed).
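A quick sketch (made-up column names):

```python
import pandas as pd

df = pd.DataFrame({"population": [120, 670]}, index=["ann arbor", "detroit"])

new_df = df.rename(columns={"population": "pop"})
print(new_df.columns)   # Index(['pop'], dtype='object'); df itself keeps its old name

# rename can also take a function via mapper=, which is handy for
# stripping stray whitespace from column names
new_df = new_df.rename(mapper=str.strip, axis="columns")
```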
Selecting values from a DataFrame:
The original df before transposing (index: the schools; one column: name):

|         | name  |
| ------- | ----- |
| school1 | alice |
| school2 | jack  |
| school3 | helen |
After df.T (transpose), the schools become the columns, so what used to be a column can now be selected as a row:

|      | school1 | school2 | school3 |
| ---- | ------- | ------- | ------- |
| name | alice   | jack    | helen   |
Selecting that row, e.g. with df.T.loc['name'] (equivalent to df['name'] on the original df), returns a Series:

school1    alice
school2    jack
school3    helen
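A runnable sketch of the common selection options, built on the school df made up above:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "jack", "helen"]},
                  index=["school1", "school2", "school3"])

print(df.loc["school1"])          # select a row by its index label
print(df["name"])                 # select a column by its name
print(df.T.loc["name"])           # the same column, via transpose + row selection
print(df.loc["school1", "name"])  # a single cell: row label + column name
print(df.loc[:, ["name"]])        # all rows, a list of columns
```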
Deleting values from a DataFrame: df.drop(label)
inplace=True means to change the original DataFrame instead of returning a copy;
axis=0 (the default) means a row is to be dropped, while axis=1 means to drop a column.
Another way to delete a column, which takes effect immediately on the original df: del df_name[the_column_name]
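A short sketch, again on the made-up school df:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "jack", "helen"]},
                  index=["school1", "school2", "school3"])
copy_df = df.copy()

# drop a row by index label (returns a new DataFrame; copy_df itself is untouched)
print(copy_df.drop("school1"))

# drop the "name" column instead: axis=1, and inplace=True modifies copy_df directly
copy_df.drop("name", axis=1, inplace=True)

# del removes a column from a DataFrame immediately, in place
another_copy = df.copy()
del another_copy["name"]
```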
Adding a new column with assigned values:
df[name_of_the_added_column] = the_assigned_value
e.g.: df["class ranking"] = None
|         | name  | class ranking |
| ------- | ----- | ------------- |
| school1 | alice | None          |
| school2 | jack  | None          |
| school3 | helen | None          |
Check all the unique values in one column: df[column_name].unique()
It returns an array of all the unique values, e.g.:
array([40, 50]), meaning the column has only 2 distinct values: 40 and 50.
To select all rows where that column equals 50 (a boolean mask):
df = df[df[column_name] == 50]
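A small boolean-mask sketch with an assumed "credits" column:

```python
import pandas as pd

courses = pd.DataFrame({"course": ["intro", "stats", "ml"],
                        "credits": [40, 50, 50]})

print(courses["credits"].unique())   # array([40, 50])

mask = courses["credits"] == 50      # a boolean Series (True/False per row)
print(courses[mask])                 # keeps only the 50-credit rows
```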
Merging DataFrames (horizontal merge): pd.merge(). In set-theory terms, the main ways to merge are: 1) outer join (union), 2) inner join (intersection), 3) left / right join (keep everything from one side, adding matching data from the other).
Imagine there are 2 DataFrames: one holds information about students (student_df), the other holds information about staff (staff_df). Note that some students also work as staff.
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='right', on='name')
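A runnable sketch with made-up names; the how= options mirror the four calls above, and the last call joins on the 'name' column instead of the index:

```python
import pandas as pd

staff_df = pd.DataFrame([{"name": "Kelly", "role": "Director of HR"},
                         {"name": "Sally", "role": "Course liaison"},
                         {"name": "James", "role": "Grader"}]).set_index("name")
student_df = pd.DataFrame([{"name": "James", "school": "Business"},
                           {"name": "Mike",  "school": "Law"},
                           {"name": "Sally", "school": "Engineering"}]).set_index("name")

print(pd.merge(staff_df, student_df, how="outer",
               left_index=True, right_index=True))   # union: everyone from both sides
print(pd.merge(staff_df, student_df, how="inner",
               left_index=True, right_index=True))   # intersection: staff who are also students
print(pd.merge(staff_df, student_df, how="left",
               left_index=True, right_index=True))   # all staff, plus student info where it exists
print(pd.merge(staff_df.reset_index(), student_df.reset_index(),
               how="right", on="name"))              # all students, joining on the 'name' column
```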
Merging Dataframes (vertically): pd.concat()
Merge 3 DataFrames (df1, df2 and df3) vertically, where df1, df2 and df3 hold the data for years 2021, 2022 and 2023, respectively:
pd.concat([df1, df2, df3])
To mark which rows came from which year, use the optional parameter "keys" (it adds the keys as an extra, outer index level):
pd.concat([df1, df2, df3], keys=['2021', '2022', '2023'])
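A minimal sketch with made-up yearly data:

```python
import pandas as pd

df1 = pd.DataFrame({"students": [100, 120]}, index=["school1", "school2"])  # 2021
df2 = pd.DataFrame({"students": [110, 125]}, index=["school1", "school2"])  # 2022
df3 = pd.DataFrame({"students": [115, 130]}, index=["school1", "school2"])  # 2023

combined = pd.concat([df1, df2, df3], keys=["2021", "2022", "2023"])
print(combined)                  # the year keys become the outer index level
print(combined.loc["2022"])      # select everything that came from df2
```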