import numpy as np
x = np.array([2,3,4,5,11,13])
x * 2
array([ 4, 6, 8, 10, 22, 26])
NumPy offers no comparably simple interface for arrays of strings, so you are left with a more verbose for loop (here, a list comprehension):
data = ['peter','Paul','MARY','gUIDO']
[s.capitalize() for s in data]
['Peter', 'Paul', 'Mary', 'Guido']
If the data contains a missing value, however, the same expression raises an error:
data = ['peter','Paul',None,'MARY','gUIDO']
[s.capitalize() for s in data]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in ()
1 data = ['peter','Paul',None,'MARY','gUIDO']
----> 2 [s.capitalize() for s in data]
in (.0)
1 data = ['peter','Paul',None,'MARY','gUIDO']
----> 2 [s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
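The str attribute that Pandas provides on Series and Index objects containing strings supports vectorized string operations and also handles missing values correctly: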
import pandas as pd
names = pd.Series(data)
names
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object
names.str.capitalize()
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
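For comparison, the missing-value handling that the str attribute provides can be reproduced by hand with an explicit guard. The following is a minimal sketch that reuses the data list from above; the names manual and vectorized are illustrative only and do not come from the original notebook.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python: every element has to be checked for None explicitly
manual = [s.capitalize() if s is not None else None for s in data]

# Pandas: the .str accessor passes missing values through on its own
vectorized = pd.Series(data).str.capitalize()

print(manual)
print(vectorized.isna().tolist())
```

How the missing entry is displayed (None or NaN) may depend on the pandas version and the dtype of the Series, which is why the sketch checks isna() rather than comparing against None.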
monte = pd.Series(['Graham Chapman','John Cleese','Terry Gilliam',
'Eric Idle', 'Terry Jones','Michael Palin'])
Methods that mirror Python's string methods
len() | lower() | translate() | islower() |
ljust() | upper() | startswith() | isupper() |
rjust() | find() | endswith() | isnumeric() |
center() | rfind() | isalnum() | isdecimal() |
zfill() | index() | isalpha() | split() |
strip() | rindex() | isdigit() | rsplit() |
rstrip() | capitalize() | isspace() | partition() |
lstrip() | swapcase() | istitle() | rpartition() |
These methods return different kinds of values. lower(), for example, returns a Series of strings:
monte.str.lower()
0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object
Some methods return numbers:
monte.str.len()
Others return Boolean values:
monte.str.startswith('T')
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
Still others return lists or other compound values for each element:
monte.str.split()
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
monte.str.extract('([A-Za-z]+)')
 | 0 |
---|---|
0 | Graham |
1 | John |
2 | Terry |
3 | Eric |
4 | Terry |
5 | Michael |
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object
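The same two patterns can also drive filtering and extraction directly. The sketch below assumes the monte Series defined earlier; str.match() and the expand=False argument of str.extract() are standard pandas calls, while the names pattern, mask, consonant_names, and first_names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Raw string: starts with a non-vowel and ends with a non-vowel
pattern = r'^[^AEIOU].*[^aeiou]$'

# str.match() returns a Boolean mask that can filter the Series directly
mask = monte.str.match(pattern)
consonant_names = monte[mask]

# expand=False makes str.extract() return a Series for a single group
first_names = monte.str.extract(r'([A-Za-z]+)', expand=False)

print(consonant_names.tolist())
print(first_names.tolist())
```

A Boolean mask is usually more convenient than the list-per-row output of findall() when the goal is to select rows rather than inspect the matches.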
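Being able to apply regular expressions concisely across Series and DataFrame entries opens up many possibilities for analyzing and cleaning data.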
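- Other Pandas string methods
Method | Description |
---|---|
get() | Index each element (indexing starts at 0) |
slice() | Slice each element |
slice_replace() | Replace a slice of each element with a passed value |
cat() | Concatenate strings |
repeat() | Repeat values |
normalize() | Return the Unicode normal form of each string |
pad() | Add whitespace to the left, right, or both sides of each string |
wrap() | Split long strings into lines no longer than a given width |
join() | Join the strings in each element of the Series with a passed separator |
get_dummies() | Extract dummy variables from each element, split on a separator |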
# Vectorized item access and slicing
# get() and slice() give vectorized access to the elements of each string in the array
monte.str[0:3]
0 Gra
1 Joh
2 Ter
3 Eri
4 Ter
5 Mic
dtype: object
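The bracket syntax above has method equivalents. As a small sketch, assuming the same monte Series: str.slice(0, 3) matches str[0:3], and str.get(i) matches str[i], which combines with split() to pull out the last word of each entry. The names first_three and last_names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# str.slice(0, 3) is the method form of str[0:3]
first_three = monte.str.slice(0, 3)

# split() yields a list per row; str.get(-1) then picks the last item of each list
last_names = monte.str.split().str.get(-1)

print(first_three.tolist())
print(last_names.tolist())
```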
full_monte = pd.DataFrame({'name': monte,
'info': ['b|c|d','b|d','a|c','b|d','b|c','b|c|d']})
full_monte['info'].str.get_dummies('|')
 | a | b | c | d |
---|---|---|---|---|
0 | 0 | 1 | 1 | 1 |
1 | 0 | 1 | 0 | 1 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 1 | 0 | 1 |
4 | 0 | 1 | 1 | 0 |
5 | 0 | 1 | 1 | 1 |
1 Native Python dates and times: datetime and dateutil
from datetime import datetime
datetime(year=2015,month=7,day=4)
datetime.datetime(2015, 7, 4, 0, 0)
from dateutil import parser
date = parser.parse('4th of July,2015')
date
datetime.datetime(2015, 7, 4, 0, 0)
# Print the day of the week for this date
date.strftime('%A')
'Saturday'
2 Typed arrays of times: NumPy's datetime64
- datetime64 requires a specific input format when the date is created
import numpy as np
date = np.array('2015-07-04',dtype=np.datetime64)
date
array('2015-07-04', dtype='datetime64[D]')
date + np.arange(12)
array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
dtype='datetime64[D]')
np.datetime64('2015-07-04')
numpy.datetime64('2015-07-04')
np.datetime64('2015-07-04 12:00')
numpy.datetime64('2015-07-04T12:00')
np.datetime64('2015-07-04 12:59:59','ns')
numpy.datetime64('2015-07-04T12:59:59.000000000')
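The examples above show that datetime64 infers its time unit from the input unless one is passed explicitly. The following sketch makes the inferred units visible and shows that subtracting two dates yields a timedelta64; the variable names are illustrative.

```python
import numpy as np

day = np.datetime64('2015-07-04')            # unit inferred as days
minute = np.datetime64('2015-07-04 12:00')   # unit inferred as minutes

print(day.dtype, minute.dtype)

# Subtracting two day-resolution dates gives a timedelta64 in days
elapsed = np.datetime64('2015-07-10') - day

# Forcing a finer unit keeps the same instant but changes the dtype
day_ns = np.datetime64('2015-07-04', 'ns')

print(elapsed, day_ns.dtype)
```

The unit matters because a 64-bit value can cover either a long time span or a fine resolution, not both at once.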
Dates and times in Pandas: the best of both worlds
import pandas as pd
date = pd.to_datetime('4th of July,2015')
date
Timestamp('2015-07-04 00:00:00')
date.strftime('%A')
'Saturday'
date + pd.to_timedelta(np.arange(12),'D')
DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
dtype='datetime64[ns]', freq=None)
index = pd.DatetimeIndex(['2014-07-04','2014-08-04',
'2015-07-04','2015-08-04'])
data = pd.Series([0,1,2,3],index=index)
data
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64
data['2014-07-04':'2015-07-04']
2014-07-04 0
2014-08-04 1
2015-07-04 2
dtype: int64
data['2015']
2015-07-04 2
2015-08-04 3
dtype: int64
dates = pd.to_datetime([datetime(2015,7,3),'4th of July,2015',
'2015-July-6','07-07-2015','20150708'])
dates
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
'2015-07-08'],
dtype='datetime64[ns]', freq=None)
dates.to_period('D')
PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
'2015-07-08'],
dtype='period[D]', freq='D')
dates-dates[0]
TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
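Each of the three index types shown above holds a matching scalar type. A brief sketch, using a shortened list of dates chosen here for illustration, that inspects one element of each index:

```python
import pandas as pd

dates = pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-06'])
periods = dates.to_period('D')   # fixed-length spans of one day
deltas = dates - dates[0]        # durations relative to the first date

# Elements of the three indexes are Timestamp, Period, and Timedelta objects
print(type(dates[0]).__name__, type(periods[0]).__name__, type(deltas[0]).__name__)
```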
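Regular sequences: pd.date_range()
- pd.date_range() creates sequences of timestamps
- pd.period_range() creates sequences of periods
- pd.timedelta_range() creates sequences of time deltas
- Just as Python's range() and NumPy's np.arange() build a sequence from a start point, an end point, and a step size, pd.date_range() builds a regular sequence of dates from a start date, an end date, and an optional frequency code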
pd.date_range('2015-07-03','2015-7-10')
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
pd.date_range('2015-07-03',periods=8)
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
pd.date_range('2015-07-03',periods=8,freq='H')
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
'2015-07-03 02:00:00', '2015-07-03 03:00:00',
'2015-07-03 04:00:00', '2015-07-03 05:00:00',
'2015-07-03 06:00:00', '2015-07-03 07:00:00'],
dtype='datetime64[ns]', freq='H')
pd.period_range('2015-07-03',periods=8,freq='M')
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
'2016-01', '2016-02'],
dtype='period[M]', freq='M')
pd.timedelta_range(0,periods=10,freq='H')
TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
'05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
dtype='timedelta64[ns]', freq='H')
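To make the parallel with range() concrete, the sketch below passes a start, an end, and a step of two days. One difference worth noting is that the end date is included when it falls on the grid. The names every_other_day and daily are illustrative.

```python
import pandas as pd

# Start, end, and a step of two days, analogous to range(start, stop, step)
every_other_day = pd.date_range('2015-07-03', '2015-07-10', freq='2D')

# Without a frequency the step defaults to one day; the end date is included
daily = pd.date_range('2015-07-03', '2015-07-10')

print(every_other_day)
print(len(daily), daily[-1])
```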
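Pandas frequency codes
Code | Description |
---|---|
D | Calendar day |
W | Weekly |
M | Month end |
Q | Quarter end |
A | Year end |
H | Hours |
T | Minutes |
S | Seconds |
L | Milliseconds |
U | Microseconds |
N | Nanoseconds |
B | Business day |
BM | Business month end |
BQ | Business quarter end |
BA | Business year end |
BH | Business hours |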
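Frequency codes indexed at the start of the period
Code | Description |
---|---|
MS | Month start |
BMS | Business month start |
QS | Quarter start |
BQS | Business quarter start |
AS | Year start |
BAS | Business year start |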
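The difference between end-indexed and start-indexed frequencies can be seen side by side. Because several of the short codes have been renamed in recent pandas releases, this sketch uses the offset objects MonthEnd and MonthBegin from pandas.tseries.offsets instead of string codes; that substitution is a choice made here, not something the tables prescribe.

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin, MonthEnd

# Month-end frequency: each date falls on the last day of a month
month_ends = pd.date_range('2015-01-01', periods=3, freq=MonthEnd())

# Month-start frequency: each date falls on the first day of a month
month_starts = pd.date_range('2015-01-01', periods=3, freq=MonthBegin())

print(month_ends)
print(month_starts)
```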
pd.timedelta_range(0,periods=9,freq='2H30T')
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
'12:30:00', '15:00:00', '17:30:00', '20:00:00'],
dtype='timedelta64[ns]', freq='150T')
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01',periods=5,freq=BDay())
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
'2015-07-07'],
dtype='datetime64[ns]', freq='B')
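# Install the package from a shell first, not inside a Python cell:
#   conda install pandas-datareader
from pandas_datareader import data
# Note: the 'google' data source may no longer be available in recent
# pandas-datareader releases; substitute a source your version supports.
goog = data.DataReader('GOOG',start='2004', end='2016',
data_source='google')
goog.head()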
The Pandas eval() function uses string expressions to compute operations on DataFrames efficiently. For example, consider the following DataFrames:
import pandas as pd
import numpy as np
rng = np.random.RandomState(42)
nrows, ncols = 100000,100
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
for i in range(4))
# Compute the sum of the four DataFrames the usual way
%timeit df1 + df2 + df3 + df4
158 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# The same sum can be computed with pd.eval and a string expression
%timeit pd.eval('df1 + df2 + df3 + df4')
68.4 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
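Before relying on the faster version, it is worth confirming that both forms give the same result. A minimal check, using smaller frames than in the timing above so that it runs quickly; the sizes and the names plain and evaluated are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Smaller frames than in the timing above, just to check the result
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(1000, 10)) for i in range(4))

plain = df1 + df2 + df3 + df4                   # builds intermediate DataFrames
evaluated = pd.eval('df1 + df2 + df3 + df4')    # evaluates the whole expression at once

print(np.allclose(plain, evaluated))
```

The speedup typically comes from evaluating the expression without allocating a full-size temporary for every intermediate sum, and it tends to matter only for large frames.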
Operations supported by pd.eval()
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000,(100,3)))
for i in range(5))
1 Arithmetic operators
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1,result2)
True
Comparison operators, including chained expressions
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1,result2)
True
Bitwise operators
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1,result2)
True
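In Boolean expressions pd.eval() also accepts the literal words and and or in place of & and |. The sketch below checks that both spellings agree; it uses float-valued frames generated here for illustration, so the 0.5 threshold splits the data, which differs from the integer frames used above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Float frames in [0, 1), chosen here so that the 0.5 threshold is meaningful
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(100, 3)) for i in range(4))

reference = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
symbols = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
words = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')

print(symbols.equals(words))
```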
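Other operations: pd.eval() does not support function calls, conditional statements, loops, or other more complex constructs. If you need these, you can use the Numexpr library directly.
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in a similar way. The benefit of the eval() method is that columns can be referred to by name.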
df = pd.DataFrame(rng.rand(1000,3),columns=['a','b','c'])
df.head()
 | a | b | c |
---|---|---|---|
0 | 0.375506 | 0.406939 | 0.069938 |
1 | 0.069087 | 0.235615 | 0.154374 |
2 | 0.677945 | 0.433839 | 0.652324 |
3 | 0.264038 | 0.808055 | 0.347197 |
4 | 0.589161 | 0.252418 | 0.557789 |
result3 = df.eval('(a + b) / (c - 1)')
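To see what the column-name shorthand saves, the same expression can be written three ways: with ordinary indexing, with pd.eval() and attribute access, and with DataFrame.eval(). This is a sketch on a freshly generated frame; the names result1 and result2 are introduced here for the comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['a', 'b', 'c'])

# Ordinary column indexing
result1 = (df['a'] + df['b']) / (df['c'] - 1)

# Top-level pd.eval(): columns must be reached through the DataFrame name
result2 = pd.eval('(df.a + df.b) / (df.c - 1)')

# DataFrame.eval(): column names are available directly
result3 = df.eval('(a + b) / (c - 1)')

print(np.allclose(result1, result2), np.allclose(result1, result3))
```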
Adding a column with DataFrame.eval()
df.eval('d = (a + b) /c',inplace=True)
df.head()
 | a | b | c | d |
---|---|---|---|---|
0 | 0.375506 | 0.406939 | 0.069938 | 11.187620 |
1 | 0.069087 | 0.235615 | 0.154374 | 1.973796 |
2 | 0.677945 | 0.433839 | 0.652324 | 1.704344 |
3 | 0.264038 | 0.808055 | 0.347197 | 3.087857 |
4 | 0.589161 | 0.252418 | 0.557789 | 1.508776 |
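Assignment inside eval() can also overwrite an existing column, and without inplace=True the method returns a new DataFrame and leaves the original untouched. A short sketch on a small frame generated here for illustration; the column e and the name extended are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(5, 3), columns=['a', 'b', 'c'])

# Create column d, then overwrite it with a different expression
df.eval('d = (a + b) / c', inplace=True)
df.eval('d = (a - b) / c', inplace=True)

# Without inplace=True the result is a new DataFrame; df itself is unchanged
extended = df.eval('e = a * 2')

print(list(df.columns), list(extended.columns))
```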
Local variables in DataFrame.eval()
column_mean = df.mean(axis=1)  # mean across the columns, one value per row
result1 = df['a'] + column_mean
result2 = df.eval('a + @column_mean')
np.allclose(result1,result2)
True
The @ character marks a variable name rather than a column name.
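The @ prefix works for any local Python variable, not only for a Series. A minimal sketch with a scalar, where the name threshold is hypothetical and the frame is generated here for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['a', 'b', 'c'])

threshold = 0.5  # an ordinary local variable, not a column

# 'a' resolves to the column, '@threshold' to the local variable
shifted = df.eval('a + @threshold')
above = df.eval('a > @threshold')

print(np.allclose(shifted, df['a'] + threshold))
```

Keeping the two namespaces visibly separate avoids ambiguity when a local variable happens to share its name with a column.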