Python Machine Learning Basics, Notes 2: the pandas DataFrame (Cookbook)

Loading data into pandas and the most common operations.

First, load the data:

pd.read_csv
pd.read_excel
pd.read_json

Then take a quick look at the data with:
dataframe.head()
dataframe.tail()

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data as a dataframe
dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)
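read_excel and read_json work the same way. A minimal sketch, assuming hypothetical local files data.xlsx and data.json:

# Load data from an Excel sheet (file name is hypothetical)
dataframe_xlsx = pd.read_excel('data.xlsx', sheet_name=0)

# Load data from a JSON file (file name is hypothetical)
dataframe_json = pd.read_json('data.json')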

Creating a DataFrame (a two-dimensional table)

dataframe = pd.DataFrame()

# Load library
import pandas as pd

# Create DataFrame
dataframe = pd.DataFrame()

# Add columns
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

# Show DataFrame
dataframe

Appending a new row (i.e. one observation)
ps: pd.Series([...], index=[...]) creates one-dimensional data

# Create row
new_person = pd.Series(['Molly Mooney', 40, True], index=['Name','Age','Driver'])

# Append row
dataframe.append(new_person, ignore_index=True)
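Note: DataFrame.append was removed in pandas 2.0. A minimal sketch of the same operation using pd.concat, reusing the new_person Series from above:

# Append row with pd.concat (works on pandas >= 2.0, where .append is gone)
dataframe = pd.concat([dataframe, new_person.to_frame().T], ignore_index=True)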

Describing the data

df.describe()

# Show dimensions
dataframe.shape

# Show statistics
dataframe.describe()

Selecting elements and slicing

Note:
1. For selection, use loc to select by label ('string') and iloc to select by integer position.

2. To select columns, just use df['column'] (see the sketch after the code below).

# Select first row
dataframe.iloc[0]

# Select three rows
dataframe.iloc[1:4]

# Select four rows
dataframe.iloc[:4]

# Set index
dataframe = dataframe.set_index(dataframe['Name'])

# Show row
dataframe.loc['Allen, Miss Elisabeth Walton']
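For point 2 above (column selection), a quick sketch on the same Titanic frame:

# Select a single column as a Series
dataframe['Age'].head(2)

# Select several columns as a DataFrame
dataframe[['Sex', 'Age']].head(2)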

Selecting rows based on conditionals

!!! Simply write dataframe[(condition1) & (condition2)] !!!

# Show top two rows where column 'sex' is 'female'

dataframe[dataframe['Sex'] == 'female'].head(2)

# Filter rows

dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 65)]

Replacing values

replace takes lists as arguments:

replace(['A'], ['B'])

replaces A with B.

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)

0 Woman
1 Woman
Name: Sex, dtype: object

# Replace "female" and "male with "Woman" and "Man"
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"]).head(5)
0 Woman
1 Woman
2 Man
3 Woman
4 Man
Name: Sex, dtype: object

# Replace values, show two rows
dataframe.replace(1, "One").head(2)

Renaming columns (rename)

rename takes a dictionary as its argument!

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Rename column, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class'}).head(2)

# Rename columns, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)

If you want to rename all of the columns:

# Load library
import collections

# Create dictionary
column_names = collections.defaultdict(str)

# Create keys
for name in dataframe.columns:
	column_names[name]
	
# Show dictionary
column_names
defaultdict(str,
            {'Age': '',
             'Name': '',
             'PClass': '',
             'Sex': '',
             'SexCode': '',
             'Survived': ''})
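The defaultdict above only collects the existing column names as keys; to actually rename everything you still have to fill in the new names and pass the mapping to rename. A minimal sketch, assuming a made-up lowercase naming convention:

# Build an old-name -> new-name mapping (lowercase is just an example convention)
column_names = {name: name.lower() for name in dataframe.columns}

# Rename all columns at once, show two rows
dataframe.rename(columns=column_names).head(2)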

Statistics

# Calculate statistics
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())

# Show counts
dataframe.count()
Name 1313
PClass 1313
Age 756
Sex 1313
Survived 1313
SexCode 1313
dtype: int64

Finding Unique Values

# Select unique values
dataframe['Sex'].unique()

# Show counts
dataframe['Sex'].value_counts()
# Show counts
dataframe['PClass'].value_counts()
3rd 711
1st 322
2nd 279
* 1
Name: PClass, dtype: int64
# While almost all passengers belong to one of three classes as expected,
# a single passenger has the class *.

# Show number of unique values
dataframe['PClass'].nunique()

Handling missing values

Three approaches; see the later notes for details.

# Select missing values, show two rows
dataframe[dataframe['Age'].isnull()].head(2)

# Load library
import numpy as np

# Replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)

# Load data, set missing values
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999])

Dropping columns (drop([...], axis=1))

==I recommend treating DataFrames as immutable objects. ==

dataframe.drop('Age', axis=1).head(2)

# Drop columns
dataframe.drop(['Age', 'Sex'], axis=1).head(2)

# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)

Dropping rows (drop([...], axis=0))

The second approach below (filtering with a boolean condition) is a bit clever.

# Drop rows by index label
dataframe.drop([0, 1], axis=0)

# Delete the row for Allison, Miss Helen Loraine
dataframe[dataframe['Name'] != 'Allison, Miss Helen Loraine'].head(2)

# Delete row, show first two rows of output
dataframe[dataframe.index != 0].head(2)

Dropping duplicate rows (drop_duplicates(subset=[...]))

By default, drop_duplicates only drops rows that are perfect duplicates across all columns. If you only care about one column, pass it via subset; the first occurrence of each value is kept by default, so pass keep='last' if you want the last occurrence instead.

# Drop duplicates, show first two rows of output
dataframe.drop_duplicates().head(2)


# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'])

# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'], keep='last')

Grouping

groupby() must be followed by an aggregation, e.g. groupby('happy')['name'].count()

# Group rows by the values of the column 'Sex', calculate mean
# of each group

dataframe.groupby('Sex').mean()

# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()

# Group rows, calculate mean
dataframe.groupby(['Sex','Survived'])['Age'].mean()
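Note: on recent pandas versions, groupby('Sex').mean() raises an error when the frame still contains non-numeric columns such as Name. Two hedged workarounds:

# Restrict the mean to numeric columns
dataframe.groupby('Sex').mean(numeric_only=True)

# Or aggregate a single numeric column only
dataframe.groupby('Sex')['Age'].mean()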

Adding a time index and grouping rows by time (resample)

# Example

# Create date range
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S')

# Create DataFrame
dataframe = pd.DataFrame(index=time_index)

# Create column of random values
dataframe['Sale_Amount'] = np.random.randint(1, 10, 100000)

# Group rows by week, calculate sum per week
dataframe.resample('W').sum()
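resample accepts other offset aliases as well; for example, on the same frame:

# Group rows by month, count rows per month ('M' = month end; newer pandas may suggest 'ME')
dataframe.resample('M').count()

# Group rows by two-week periods, calculate mean per period
dataframe.resample('2W').mean()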

Looping Over a Column

# Show first two names uppercased
[name.upper() for name in dataframe['Name'][0:2]]
['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']

Applying a function to a column (apply)

Use apply to apply a built-in or custom function on every element in a column:
# Create function
def uppercase(x):
	return x.upper()
	
# Apply function, show two rows
dataframe['Name'].apply(uppercase)[0:2]

0 ALLEN, MISS ELISABETH WALTON
1 ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object

Applying a Function to Groups

apply(function)

# Group rows, apply function to groups

dataframe.groupby('Sex').apply(lambda x: x.count())

Concatenating DataFrames (concat)

concat([A, B], axis=...)
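The snippets below assume two small DataFrames, dataframe_a and dataframe_b, which are not built anywhere in these notes. A minimal sketch with hypothetical data:

# Create two small DataFrames with the same columns (hypothetical data)
dataframe_a = pd.DataFrame({'id': ['1', '2', '3'],
                            'first': ['Alex', 'Amy', 'Allen'],
                            'last': ['Anderson', 'Ackerman', 'Ali']})

dataframe_b = pd.DataFrame({'id': ['4', '5', '6'],
                            'first': ['Billy', 'Brian', 'Bran'],
                            'last': ['Bonder', 'Black', 'Balwner']})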

# Concatenate DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)
# You can use axis=1 to concatenate along the column axis:

# Concatenate DataFrames by columns
pd.concat([dataframe_a, dataframe_b], axis=1)

Merging DataFrames (merge(A, B, on=..., how=...))
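As with concat, dataframe_employees and dataframe_sales are not defined in these notes; a minimal sketch with hypothetical data so the merge calls below can run:

# Create two DataFrames that share an 'employee_id' column (hypothetical data)
dataframe_employees = pd.DataFrame({'employee_id': ['1', '2', '3', '4'],
                                    'name': ['Amy Jones', 'Allen Keys',
                                             'Alice Bees', 'Tim Horton']})

dataframe_sales = pd.DataFrame({'employee_id': ['3', '4', '5', '6'],
                                'total_sales': [23456, 2512, 2345, 1455]})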

pd.merge(dataframe_employees, dataframe_sales, on='employee_id')
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer')
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='left')
# Merge DataFrames
pd.merge(dataframe_employees,
         dataframe_sales,
         left_on='employee_id',
         right_on='employee_id')

What is the left and right DataFrame? The simple answer is that the left DataFrame is the first one we specified in merge, and the right DataFrame is the second one. This language comes up again in the next sets of parameters we will need.
