First, import the data:
pd.read_csv
pd.read_excel
pd.read_json
Then take a quick look at the data with:
dataframe.head()
dataframe.tail()
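A minimal sketch of `head()`/`tail()` on a small hand-built frame (the column names here are made up for illustration):

```python
import pandas as pd

# Tiny illustrative DataFrame
df = pd.DataFrame({'x': range(10), 'y': list('abcdefghij')})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows
```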
# Load library
import pandas as pd
# Create URL
url = 'https://tinyurl.com/titanic-csv'
# Load data as a dataframe
dataframe = pd.read_csv(url)
# Show first 5 rows
dataframe.head(5)
dataframe = pd.DataFrame()
# Load library
import pandas as pd
# Create DataFrame
dataframe = pd.DataFrame()
# Add columns
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]
# Show DataFrame
dataframe
Append a new row, i.e., one observation.
P.S.: pd.Series([...], index=[...]) creates one-dimensional labeled data.
# Create row
new_person = pd.Series(['Molly Mooney', 40, True], index=['Name','Age','Driver'])
# Append row
dataframe.append(new_person, ignore_index=True)
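Note: `DataFrame.append` was removed in pandas 2.0. A sketch of the same operation using `pd.concat` instead:

```python
import pandas as pd

dataframe = pd.DataFrame()
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

new_person = pd.Series(['Molly Mooney', 40, True],
                       index=['Name', 'Age', 'Driver'])

# DataFrame.append was removed in pandas 2.0;
# concatenate a one-row frame instead.
dataframe = pd.concat([dataframe, new_person.to_frame().T],
                      ignore_index=True)
```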
df.describe()
# Show dimensions
dataframe.shape
# Show statistics
dataframe.describe()
Notes:
1. To select rows, use loc with index labels ('string') and iloc with actual integer index positions.
2. To select a column, df['column'] is enough.
# Select first row
dataframe.iloc[0]
# Select three rows
dataframe.iloc[1:4]
# Select four rows
dataframe.iloc[:4]
# Set index
dataframe = dataframe.set_index(dataframe['Name'])
# Show row
dataframe.loc['Allen, Miss Elisabeth Walton']
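A minimal sketch contrasting loc (by label) with iloc (by position), on a tiny made-up frame:

```python
import pandas as pd

# Hypothetical mini-frame with name labels as the index
df = pd.DataFrame({'Age': [29, 2, 30]},
                  index=['Allen, Miss Elisabeth Walton',
                         'Allison, Miss Helen Loraine',
                         'Allison, Mr Hudson Joshua Creighton'])

by_label = df.loc['Allen, Miss Elisabeth Walton']  # select by index label
by_position = df.iloc[0]                           # select by integer position
# Both point at the same row here.
```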
!!! Just use dataframe[(condition1) & (condition2)] !!!
# Show top two rows where column 'sex' is 'female'
dataframe[dataframe['Sex'] == 'female'].head(2)
# Filter rows
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 65)]
replace can take lists as arguments:
replace(['A'], ['B'])
replaces A with B.
# Load library
import pandas as pd
# Create URL
url = 'https://tinyurl.com/titanic-csv'
# Load data
dataframe = pd.read_csv(url)
# Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)
0 Woman
1 Woman
Name: Sex, dtype: object
# Replace "female" and "male" with "Woman" and "Man"
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"]).head(5)
0 Woman
1 Woman
2 Man
3 Woman
4 Man
Name: Sex, dtype: object
# Replace values, show two rows
dataframe.replace(1, "One").head(2)
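`replace` also accepts a dict mapping old values to new ones, which does the list version in a single argument (the values here are illustrative):

```python
import pandas as pd

s = pd.Series(['female', 'male', 'female'])

# A dict maps each old value to its replacement in one call.
renamed = s.replace({'female': 'Woman', 'male': 'Man'})
```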
rename takes a dictionary as its argument!
# Load library
import pandas as pd
# Create URL
url = 'https://tinyurl.com/titanic-csv'
# Load data
dataframe = pd.read_csv(url)
# Rename column, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class'}).head(2)
# Rename columns, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)
To rename all of the columns:
# Load library
import collections
# Create dictionary
column_names = collections.defaultdict(str)
# Create keys
for name in dataframe.columns:
    column_names[name]
# Show dictionary
column_names
defaultdict(str,
{'Age': '',
'Name': '',
'PClass': '',
'Sex': '',
'SexCode': '',
'Survived': ''})
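The defaultdict above only builds the skeleton of old names; once you fill in the new names, pass the dict to `rename`. A sketch with made-up replacement names:

```python
import pandas as pd

df = pd.DataFrame({'PClass': ['1st'], 'Sex': ['female']})

# Fill the mapping (old name -> new name), then rename
# every column in one call.
column_names = {'PClass': 'Passenger Class', 'Sex': 'Gender'}
df = df.rename(columns=column_names)
```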
# Calculate statistics
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())
# Show counts
dataframe.count()
Name 1313
PClass 1313
Age 756
Sex 1313
Survived 1313
SexCode 1313
dtype: int64
# Select unique values
dataframe['Sex'].unique()
# Show counts
dataframe['Sex'].value_counts()
# Show counts
dataframe['PClass'].value_counts()
3rd 711
1st 322
2nd 279
* 1
Name: PClass, dtype: int64
# While almost all passengers belong to one of three classes as expected,
# a single passenger has the class *.
# Show number of unique values
dataframe['PClass'].nunique()
Three approaches; see the notes that follow.
# Select missing values, show two rows
dataframe[dataframe['Age'].isnull()].head(2)
# Load library
import numpy as np
# Replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)
# Load data, set missing values
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999])
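A self-contained sketch of marking values as missing on load and then filling them (the sentinel values and the mean-fill strategy here are assumptions for illustration):

```python
import io
import pandas as pd

# Fake CSV with two sentinel values standing in for missing ages
csv = "Name,Age\nAlice,29\nBob,NONE\nCarol,-999\n"

# Treat 'NONE' and '-999' as missing on load.
df = pd.read_csv(io.StringIO(csv), na_values=['NONE', '-999'])

n_missing = df['Age'].isnull().sum()       # count the NaNs
filled = df['Age'].fillna(df['Age'].mean())  # fill with the column mean
```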
==I recommend treating DataFrames as immutable objects.==
# Drop column
dataframe.drop('Age', axis=1).head(2)
# Drop columns
dataframe.drop(['Age', 'Sex'], axis=1).head(2)
# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)
The second approach is a bit unusual:
df.drop([0, 1], axis=0)
# Delete the row for Allison, Miss Helen Loraine
dataframe[dataframe['Name'] != 'Allison, Miss Helen Loraine'].head(2)
# Delete row, show first two rows of output
dataframe[dataframe.index != 0].head(2)
drop_duplicates by default only drops rows that are duplicated perfectly across all columns. To deduplicate on a single column, pass it via subset; the first occurrence is kept by default, so add keep='last' if you want to keep the last occurrence instead.
# Drop duplicates, show first two rows of output
dataframe.drop_duplicates().head(2)
# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'])
# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'], keep='last')
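The subset/keep behavior can be seen on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['female', 'female', 'male'],
                   'Name': ['A', 'B', 'C']})

# keep='first' is the default: row A survives, row B is dropped.
first_kept = df.drop_duplicates(subset=['Sex'])

# keep='last': row B survives instead of row A.
last_kept = df.drop_duplicates(subset=['Sex'], keep='last')
```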
groupby() must be followed by an aggregation, e.g. groupby('happy')['name'].count()
# Group rows by the values of the column 'Sex', calculate mean
# of each group
dataframe.groupby('Sex').mean(numeric_only=True)
# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()
# Group rows, calculate mean
dataframe.groupby(['Sex','Survived'])['Age'].mean()
# Example
# Load libraries
import numpy as np
import pandas as pd
# Create date range
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S')
# Create DataFrame
dataframe = pd.DataFrame(index=time_index)
# Create column of random values
dataframe['Sale_Amount'] = np.random.randint(1, 10, 100000)
# Group rows by week, calculate sum per week
dataframe.resample('W').sum()
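A self-contained sketch of resampling; the fixed seed and the two-week rule '2W' are additions for illustration:

```python
import numpy as np
import pandas as pd

# Reproducible fake sales data at 30-second intervals
np.random.seed(0)
time_index = pd.date_range('06/06/2017', periods=100000, freq='30s')
df = pd.DataFrame(index=time_index)
df['Sale_Amount'] = np.random.randint(1, 10, 100000)

weekly = df.resample('W').sum()      # sum per calendar week
biweekly = df.resample('2W').sum()   # sum per two-week window
```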
# Show first two names uppercased
[name.upper() for name in dataframe['Name'][0:2]]
['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']
Use apply to apply a built-in or custom function on every element in a column:
# Create function
def uppercase(x):
    return x.upper()
# Apply function, show two rows
dataframe['Name'].apply(uppercase)[0:2]
0 ALLEN, MISS ELISABETH WALTON
1 ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object
apply(function)
# Group rows, apply function to groups
dataframe.groupby('Sex').apply(lambda x: x.count())
concat([A, B], axis=?)
# Concatenate DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)
# You can use axis=1 to concatenate along the column axis:
# Concatenate DataFrames by columns
pd.concat([dataframe_a, dataframe_b], axis=1)
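The frames `dataframe_a` and `dataframe_b` above are assumed to exist; a self-contained sketch with made-up data showing both axes:

```python
import pandas as pd

dataframe_a = pd.DataFrame({'id': [1, 2], 'first': ['Alex', 'Amy']})
dataframe_b = pd.DataFrame({'id': [3, 4], 'first': ['Allen', 'Alice']})

# axis=0 stacks rows; axis=1 places the frames side by side.
stacked = pd.concat([dataframe_a, dataframe_b], axis=0)
side_by_side = pd.concat([dataframe_a, dataframe_b], axis=1)
```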
# Merge DataFrames (inner join by default)
pd.merge(dataframe_employees, dataframe_sales, on='employee_id')
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer')
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='left')
# Merge DataFrames
pd.merge(dataframe_employees,
dataframe_sales,
left_on='employee_id',
right_on='employee_id')
What are the left and right DataFrames? The simple answer is that the left DataFrame is
the first one we specified in merge and the right DataFrame is the second one. This
language comes up again in the next set of parameters we will need.
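The merge variants above can be compared on small made-up frames; note how the join type changes which employee IDs survive:

```python
import pandas as pd

dataframe_employees = pd.DataFrame({'employee_id': [1, 2, 3],
                                    'name': ['Amy', 'Allen', 'Alice']})
dataframe_sales = pd.DataFrame({'employee_id': [2, 3, 4],
                                'total_sales': [2000, 3000, 4000]})

# Inner (default): only ids present in both frames (2, 3).
inner = pd.merge(dataframe_employees, dataframe_sales, on='employee_id')

# Outer: all ids from either frame (1, 2, 3, 4), with NaN for gaps.
outer = pd.merge(dataframe_employees, dataframe_sales,
                 on='employee_id', how='outer')

# Left: every id from the left frame (1, 2, 3).
left = pd.merge(dataframe_employees, dataframe_sales,
                on='employee_id', how='left')
```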