import pandas as pd
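Read a CSV file: sep is the field delimiter, encoding is the file encoding, parse_dates lists the date columns to parse, dayfirst reads dates as day/month, and index_col names the column to use as the index.
df = pd.read_csv('filename', sep=';', encoding='utf-8', parse_dates=['Date'], dayfirst=True, index_col='Date')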
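As a minimal sketch of the call above, the following reads a small in-memory CSV instead of a file. The column names and values are made up for illustration and are not from the original data.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with day-first dates.
csv_text = "Date;Berri 1;Rachel1\n01/02/2012;35;16\n02/02/2012;83;43\n"

df = pd.read_csv(
    io.StringIO(csv_text),  # a file path would normally go here
    sep=";",
    parse_dates=["Date"],   # parse the Date column as datetimes
    dayfirst=True,          # read 01/02/2012 as 1 February, not 2 January
    index_col="Date",       # use the parsed dates as the row index
)
print(df.index[0])
```

With dayfirst=True the first row is dated in February; without it pandas would likely read the same string as 2 January.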
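To select a single column, index the DataFrame like a dictionary; here time is one of the columns. The object returned by read_csv is called a DataFrame.
df['time']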
Use plot() to plot a column against the index.
df['Berri 1'].plot()
df.plot(figsize=(15, 10))
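To read several columns at once, pass a list of column names:
df[['event', 'time']][:10]  # first ten rows of the event and time columns
value_counts() tallies a column, listing each distinct value with the number of times it occurs.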
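A small sketch of value_counts() on a made-up event column (the values are hypothetical):

```python
import pandas as pd

# Hypothetical event log; only the repetition pattern matters here.
events = pd.Series(["access", "video", "access", "problem", "access"], name="event")

# Distinct values become the index, their frequencies the values,
# sorted from most to least frequent.
counts = events.value_counts()
print(counts)
```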
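Conditional selection of rows and columns from a large CSV
There are two ways to select rows by condition. The first is to build each condition separately and combine them with &:
count_server_1 = df['source'] == 'server'
count_access_1 = df['event'] == 'access'
access_server = df[count_server_1 & count_access_1][:10][['time', 'enrollment_id']]
The first two lines above are comparisons; each produces a Series of True/False values. Specific columns can be selected at the same time (note the double brackets).
The second is to write the condition inline:
count_access = df[df['event'] == 'access']  # note that df appears twice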
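Both styles can be tried on a tiny hypothetical frame. The data below is invented, and using .loc to pick rows and columns in one step is a stylistic choice of this sketch, not something the notes above prescribe.

```python
import pandas as pd

# Hypothetical log data with the same column names as above.
df = pd.DataFrame({
    "source": ["server", "browser", "server", "server"],
    "event": ["access", "access", "video", "access"],
    "time": ["t0", "t1", "t2", "t3"],
    "enrollment_id": [1, 1, 2, 3],
})

# Style 1: name each boolean mask, then combine with &.
is_server = df["source"] == "server"
is_access = df["event"] == "access"
both = df.loc[is_server & is_access, ["time", "enrollment_id"]]

# Style 2: inline conditions; each comparison needs its own parentheses
# because & binds more tightly than ==.
direct = df[(df["source"] == "server") & (df["event"] == "access")]
print(both)
```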
Counting column values and converting types
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
Output:
MANHATTAN 917
BROOKLYN 456
BRONX 292
QUEENS 226
STATEN ISLAND 36
Unspecified 1
dtype: int64
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
Convert the type and compute the ratio: noise complaints in each borough divided by all complaints in that borough.
noise_complaint_counts / complaint_counts.astype(float)
BRONX 0.014833
BROOKLYN 0.013864
MANHATTAN 0.037755
QUEENS 0.010143
STATEN ISLAND 0.007474
Unspecified 0.000141
dtype: float64
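The division works because pandas aligns the two Series on their index labels (the borough names) before dividing. A minimal sketch with invented complaint data:

```python
import pandas as pd

# Hypothetical complaints; the real dataset is much larger.
complaints = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Heating",
                       "Noise - Street/Sidewalk", "Heating", "Heating"],
    "Borough": ["MANHATTAN", "MANHATTAN", "BROOKLYN", "BROOKLYN", "BROOKLYN"],
})

noise = complaints[complaints["Complaint Type"] == "Noise - Street/Sidewalk"]
noise_counts = noise["Borough"].value_counts()
all_counts = complaints["Borough"].value_counts()

# Labels are matched by name, not by position, so the differing
# sort order of the two Series does not matter.
ratio = noise_counts / all_counts
print(ratio)
```

In Python 3 the / operator already yields floats, so astype(float) mainly matters for Python 2 style integer division.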
Note kind='bar':
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
import matplotlib.pyplot as plt  # call plt.show() to display the figure
result['enrollment_id'].plot()
plt.show()
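Adding and assigning columns, and working with time
berri_bikes = bikes[['Berri 1']].copy()  # double brackets pass a list of column names and return a DataFrame; copy() lets a column be added later without a SettingWithCopyWarning
For time handling, the index's day and weekday attributes distinguish dates:
berri_bikes.index.day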
Out[6]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1,
2, 3, 4, 5], dtype=int32)
To add a new column, assign to it directly:
berri_bikes['weekday'] = berri_bikes.index.weekday
groupby('weekday') groups the rows and aggregate('sum') totals each group:
weekday_counts = berri_bikes.groupby('weekday').aggregate('sum')
weekday_counts
Out[9]:
Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6 99310
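The weekday grouping can be reproduced on two weeks of made-up daily counts. The numbers are hypothetical, and .sum() is used here as shorthand for aggregating with a sum.

```python
import pandas as pd

# Fourteen consecutive days starting on Monday 2012-01-02, with invented counts 0..13.
idx = pd.date_range("2012-01-02", periods=14, freq="D")
bikes = pd.DataFrame({"Berri 1": range(14)}, index=idx)

berri_bikes = bikes[["Berri 1"]].copy()             # DataFrame, not Series
berri_bikes["weekday"] = berri_bikes.index.weekday  # 0 = Monday ... 6 = Sunday

# One row per weekday, each holding the total over both weeks.
weekday_counts = berri_bikes.groupby("weekday").sum()
print(weekday_counts)
```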
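Dropping columns, grouping by hour into a new DataFrame, replacing characters in column names, and dropping columns with missing values
To replace characters in column names, treat the columns as a list:
weather_mar2012.columns = [s.replace(u'\xb0', '') for s in weather_mar2012.columns]
The code above strips the degree sign from every column name.
To drop columns that contain missing values:
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')  # axis=1 means columns; how='any' drops a column if any value is missing
To drop specific columns:
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day'], axis=1)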
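The three cleaning steps can be seen together on a small hypothetical frame; the column names "Empty" and "Partly missing" are invented to show what how='any' removes.

```python
import numpy as np
import pandas as pd

# Hypothetical weather table with a degree sign in one column name.
weather = pd.DataFrame({
    "Temp (\xb0C)": [1.0, 2.0],
    "Year": [2012, 2012],
    "Empty": [np.nan, np.nan],
    "Partly missing": [1.0, np.nan],
})

# 1. Strip the degree sign from every column name.
weather.columns = [c.replace("\xb0", "") for c in weather.columns]
# 2. Drop every column that has at least one missing value.
weather = weather.dropna(axis=1, how="any")
# 3. Drop a named column.
weather = weather.drop(["Year"], axis=1)
print(list(weather.columns))
```

Note that how='any' also removes "Partly missing"; how='all' would keep it and drop only the fully empty column.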
Group by hour and assign the result to a new DataFrame:
temp = weather_mar2012[[u'Temp (C)']].copy()
temp['Hour'] = weather_mar2012.index.hour  # hour of day taken from the index
temp.groupby('Hour').aggregate('median').plot()
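A sketch of the hourly grouping without the plot, using four invented readings taken at midnight and noon on two days:

```python
import pandas as pd

# Hypothetical temperatures at 00:00 and 12:00 on two consecutive days.
idx = pd.to_datetime([
    "2012-03-01 00:00", "2012-03-01 12:00",
    "2012-03-02 00:00", "2012-03-02 12:00",
])
temp = pd.DataFrame({"Temp (C)": [1.0, 10.0, 3.0, 14.0]}, index=idx)

temp["Hour"] = temp.index.hour          # 0 or 12 for these rows
hourly = temp.groupby("Hour").median()  # median temperature per hour of day
print(hourly)
```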
Concatenating datasets and writing a CSV file
weather_2012 = pd.concat(data_by_month)
weather_2012.to_csv('weather_2012.csv')
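A minimal round trip, assuming data_by_month is a list of DataFrames with the same columns. The two one-row frames are invented, and an in-memory buffer stands in for the file.

```python
import io
import pandas as pd

# Hypothetical monthly frames; the real ones hold a month of hourly rows each.
jan = pd.DataFrame({"Temp (C)": [-1.0]}, index=pd.to_datetime(["2012-01-01"]))
feb = pd.DataFrame({"Temp (C)": [-3.0]}, index=pd.to_datetime(["2012-02-01"]))
data_by_month = [jan, feb]

# Stack the frames vertically, keeping each frame's index.
weather_2012 = pd.concat(data_by_month)

# Write to a buffer instead of 'weather_2012.csv', then read it back.
buffer = io.StringIO()
weather_2012.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
```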
String operations
weather_description = weather_2012['Weather']  # single brackets return a Series, which has the .str accessor
is_snowing = weather_description.str.contains('Snow')  # a Series of True/False values
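A short sketch of str.contains on made-up weather descriptions:

```python
import pandas as pd

# Hypothetical weather descriptions.
weather_description = pd.Series(["Snow", "Fog", "Snow Showers", "Clear"])

# True wherever the text contains the substring "Snow".
is_snowing = weather_description.str.contains("Snow")
print(is_snowing.tolist())
```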
Resample at a fixed time interval; here by month, taking each month's median:
weather_2012['Temp (C)'].resample('M').median().plot(kind='bar')
The percentage of time it was snowing each month: because the booleans have been converted to floats of 0 and 1, the mean is exactly that proportion.
is_snowing.astype(float).resample('M').mean().plot(kind='bar')
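The "mean of 0/1 values is a proportion" idea can be checked on a few invented days. This sketch uses the month-start alias "MS" rather than a month-end alias, because the spelling of the month-end alias differs between pandas versions; that substitution only changes the labels of the result, not the values.

```python
import pandas as pd

# Hypothetical daily observations: four in January, two in February.
idx = pd.to_datetime([
    "2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
    "2012-02-01", "2012-02-02",
])
weather = pd.Series(["Snow", "Clear", "Snow", "Snow", "Clear", "Clear"], index=idx)

is_snowing = weather.str.contains("Snow")

# True/False -> 1.0/0.0, then the monthly mean is the fraction of snowy rows.
snowiness = is_snowing.astype(float).resample("MS").mean()
print(snowiness)
```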
Combine two Series into one dataset:
temperature = weather_2012['Temp (C)'].resample('M').median()
is_snowing = weather_2012['Weather'].str.contains('Snow')
snowiness = is_snowing.astype(float).resample('M').mean()
Name each Series; the names become the column labels:
temperature.name = "Temperature"
snowiness.name = "Snowiness"
stats = pd.concat([temperature, snowiness], axis=1)  # concat with axis=1 places the Series side by side as columns
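A sketch of the side-by-side concatenation with two invented monthly Series that share an index:

```python
import pandas as pd

# Hypothetical monthly values on a shared index.
idx = pd.to_datetime(["2012-01-31", "2012-02-29"])
temperature = pd.Series([-7.0, -4.0], index=idx, name="Temperature")
snowiness = pd.Series([0.25, 0.1], index=idx, name="Snowiness")

# axis=1 aligns the Series on the index and uses each name as a column label.
stats = pd.concat([temperature, snowiness], axis=1)
print(stats)
```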
unique() returns the distinct values in a column, which shows how many kinds there are.
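For example, on a made-up event column; nunique() is shown alongside as a shortcut when only the count is needed:

```python
import pandas as pd

# Hypothetical event column.
events = pd.Series(["access", "video", "access", "problem"])

kinds = events.unique()     # distinct values, in order of first appearance
n_kinds = events.nunique()  # just the number of distinct values
print(kinds, n_kinds)
```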