Data Cleaning and Preparation
Handling Missing Data
import pandas as pd
import numpy as np
string_data=pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data:
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
We call missing values NA, which stands for not available. In statistics applications, NA data may be data that does not exist or that exists but was not observed (through problems with data collection, for example).
Python's built-in None value is also treated as NA in object arrays:
string_data[0]=None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
Functions for handling missing data:
dropna: Filter axis labels based on whether values for each label have missing data, with a threshold (thresh) for how many missing values to tolerate
fillna: Fill in missing data with some value or using an interpolation method such as ffill or bfill
isnull: Return an object of boolean values indicating which values are missing (NA); it has the same shape as the source object
notnull: Negation of isnull
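A minimal sketch tying the four functions above together on one small Series (values chosen for illustration only):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

mask = s.isnull()       # boolean Series marking the NA position
kept = s[s.notnull()]   # equivalent to s.dropna()
filled = s.fillna(0)    # replace NaN with 0
```

Note that boolean indexing with notnull and calling dropna give the same result on a Series.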
from numpy import nan as NA
data=pd.Series([1,NA,3.5,NA,7])
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
By default, dropna drops any row containing a missing value:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
cleaned=data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
cleaned_how=data.dropna(how='all')
cleaned_how
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data[4]=NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(how='all',axis=1)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
Filtering out DataFrame rows in this way often comes up with time series data. Suppose you want to keep only rows containing a certain number of observations; you can do this with the thresh argument:
thresh keeps only the rows/columns that have at least n non-NaN values
df=pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1]=NA
df.iloc[:2,2]=NA
df
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
df.dropna()
          0         1         2
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
df.dropna(thresh=2)
          0         1         2
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
Filling In Missing Data with fillna()
df.fillna(0)
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
Calling fillna with a dict lets you use a different fill value for each column:
df.fillna({1:0.5,2:0})
          0         1         2
0  1.219978  0.500000  0.000000
1  0.341182  0.500000  0.000000
2  0.782306  0.500000  0.402269
3  0.033353  0.500000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_=df.copy()
_
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_.fillna(0)
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_.fillna(0,inplace=True)
_
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875      NaN -0.231234
3  0.513173      NaN -1.094123
4  1.787183      NaN       NaN
5 -0.611099      NaN       NaN
df.fillna(method='ffill')
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875 -0.24739 -0.231234
3  0.513173 -0.24739 -1.094123
4  1.787183 -0.24739 -1.094123
5 -0.611099 -0.24739 -1.094123
df.fillna(0,limit=2)
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875  0.00000 -0.231234
3  0.513173  0.00000 -1.094123
4  1.787183      NaN  0.000000
5 -0.611099      NaN  0.000000
fillna arguments:
- value: Scalar value or dict-like object to use to fill missing values
- method: Interpolation method; defaults to 'ffill'
- axis: Axis to fill on; defaults to axis=0
- inplace: Modify the calling object without producing a copy
- limit: For forward and backward filling, the maximum number of consecutive periods to fill
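The value argument also accepts a Series, which fills each column with its own value. A common cleaning step, sketched here on hypothetical data, is filling each column's NaNs with that column's mean:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})

# df.mean() is a Series indexed by column name; fillna aligns on it,
# so each column's NaN is replaced by that column's mean.
filled = df.fillna(df.mean())
```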
Data Transformation
Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate:
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1']=range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' keeps the last one:
data.drop_duplicates(['k1','k2'],keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham','nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3,5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased=data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal']=lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
data['food'].map(lambda x:meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values with replace
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999,np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999,-1000],np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999,-1000],[np.nan,0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999:np.nan,-1000:0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
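replace also works on DataFrames, where a nested dict maps column name to {old value: new value} so each column gets its own replacements. A small sketch with made-up sentinel values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, -999, 3], 'b': [-999, 5, -1000]})

# Nested dict: column -> {value to replace: replacement}
cleaned = df.replace({'a': {-999: np.nan}, 'b': {-999: np.nan, -1000: 0}})
```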
Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data.index.map(lambda x:x[:4].upper())
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index =data.index.map(lambda x:x[:4].upper())
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.upper,columns=str.upper)
      ONE  TWO  THREE  FOUR
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index={'Ohio':'INDIANA'},
columns={'three':'peekaboo'})
      one  two  peekaboo  four
OHIO    0    1         2     3
COLO    4    5         6     7
NEW     8    9        10    11
data.rename(index={'Ohio':'indiana'},inplace=True)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
Discretization and Binning
For analysis, continuous data is often discretized or otherwise separated into "bins", using pandas' cut function:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats=pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
pd.value_counts(cats)
(18, 25] 5
(25, 35] 3
(35, 60] 3
(60, 100] 1
Name: count, dtype: int64
Consistent with mathematical interval notation, a parenthesis means the side is open, while a square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
pd.cut(ages,[18,26,36,61,100],right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can set your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages,bins,labels=group_names)
['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
data=np.random.rand(20)
data
array([0.50910844, 0.01886219, 0.95908375, 0.72900936, 0.88044385,
0.94608156, 0.13493984, 0.91195245, 0.46857512, 0.38525391,
0.02991488, 0.31362695, 0.15493992, 0.74873532, 0.6170826 ,
0.84356457, 0.09466064, 0.01974264, 0.97598584, 0.43164735])
If you pass cut an integer number of bins instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values of the data:
pd.cut(data,4,precision=2)
[(0.5, 0.74], (0.018, 0.26], (0.74, 0.98], (0.5, 0.74], (0.74, 0.98], ..., (0.74, 0.98], (0.018, 0.26], (0.018, 0.26], (0.74, 0.98], (0.26, 0.5]]
Length: 20
Categories (4, interval[float64, right]): [(0.018, 0.26] < (0.26, 0.5] < (0.5, 0.74] < (0.74, 0.98]]
qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually result in each bin having the same number of data points; qcut, since it uses sample quantiles, produces bins of roughly equal size:
data=np.random.randn(1000)
cats=pd.qcut(data,4)
cats
[(-0.601, -0.0125], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (-2.885, -0.601], ..., (-2.885, -0.601], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (0.673, 3.875]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -0.601] < (-0.601, -0.0125] < (-0.0125, 0.673] < (0.673, 3.875]]
pd.value_counts(cats)
(-2.885, -0.601] 250
(-0.601, -0.0125] 250
(-0.0125, 0.673] 250
(0.673, 3.875] 250
Name: count, dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-1.22, -0.0125], ..., (-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-0.0125, 1.303]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -1.22] < (-1.22, -0.0125] < (-0.0125, 1.303] < (1.303, 3.875]]
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.
data=pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.001687    -0.036570     0.049180     0.010509
std       1.007410     0.971646     1.013227     0.982367
min      -3.127882    -2.643439    -2.949846    -2.962251
25%      -0.678444    -0.719395    -0.650478    -0.636513
50%       0.001463    -0.022189     0.061264     0.046104
75%       0.675910     0.629861     0.710325     0.642668
max       3.162745     4.108418     3.597951     4.410464
col=data[2]
col[np.abs(col)>3]
565 3.597951
Name: 2, dtype: float64
data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
16  -3.010992 -0.122886  1.194125  0.702766
111 -1.152743  4.108418 -2.097178  0.831827
219 -3.127882  1.781813  0.011281  0.587799
565  0.099141 -1.705600  3.597951  0.345174
596  3.162745 -1.597465 -0.552896 -2.756078
625 -0.042392  3.189888  0.723891 -0.670110
835 -1.125737 -0.699685 -1.730857  4.410464
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.001711    -0.037868     0.048582     0.009099
std       1.006490     0.966920     1.011305     0.977041
min      -3.000000    -2.643439    -2.949846    -2.962251
25%      -0.678444    -0.719395    -0.650478    -0.636513
50%       0.001463    -0.022189     0.061264     0.046104
75%       0.675910     0.629861     0.710325     0.642668
max       3.000000     3.000000     3.000000     3.000000
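The same capping can also be expressed directly with the clip method, which bounds every value to an interval in one call; a sketch on fresh random data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(np.random.randn(1000, 4))

# Equivalent in effect to data[np.abs(data) > 3] = np.sign(data) * 3
capped = data.clip(lower=-3, upper=3)
```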
np.sign(data).head()
     0    1    2    3
0  1.0  1.0  1.0 -1.0
1 -1.0  1.0  1.0 -1.0
2  1.0  1.0  1.0 -1.0
3  1.0  1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
Permutation and Random Sampling
Permuting (randomly reordering) the rows of a Series or DataFrame is easy using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
sampler = np.random.permutation(5)
sampler
array([4, 0, 3, 1, 2])
df.take(sampler)
    0   1   2   3
4  16  17  18  19
0   0   1   2   3
3  12  13  14  15
1   4   5   6   7
2   8   9  10  11
df.sample(n=3)
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
3  12  13  14  15
choices=pd.Series([5,7,-1,6,4])
draws=choices.sample(n=10,replace=True)
draws
0 5
2 -1
0 5
3 6
0 5
3 6
1 7
1 7
2 -1
0 5
dtype: int64
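Permutation works on columns as well as rows: take accepts axis=1 to reorder columns while leaving rows untouched. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

np.random.seed(1)
col_sampler = np.random.permutation(4)   # a new ordering of the 4 column labels
shuffled = df.take(col_sampler, axis=1)  # same rows, columns reordered
```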
Computing Indicator/Dummy Variables
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False
You may want to add a prefix to the columns of the indicator DataFrame so they can be merged with other data. get_dummies's prefix argument does this:
dummies=pd.get_dummies(df['key'],prefix='key')
df_with_dummy=df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0  False   True  False
1      1  False   True  False
2      2   True  False  False
3      3  False  False   True
4      4   True  False  False
5      5  False   True  False
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Consider the MovieLens 1M dataset:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/datasets/movielens/movies.dat', sep='::', header=None, names=mnames,encoding='ISO-8859-1')
movies[:10]
C:\Users\Dell\AppData\Local\Temp\ipykernel_26068\3411970987.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
movies = pd.read_table('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/datasets/movielens/movies.dat', sep='::', header=None, names=mnames,encoding='ISO-8859-1')
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
all_genres=[]
movies.genres.map(lambda x:all_genres.extend(x.split('|')))
all_genres
['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 "Children's",
 'Fantasy',
 'Comedy',
 'Romance',
 'Comedy',
 'Drama',
 ...]
genres=pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen=movies.genres[0]
gen.split("|")
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
for i, gen in enumerate(movies.genres):
indices = dummies.columns.get_indexer(gen.split('|'))
dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1.0
Genre_Children's 1.0
Genre_Comedy 1.0
Genre_Adventure 0.0
Genre_Fantasy 0.0
Genre_Romance 0.0
Genre_Drama 0.0
Genre_Action 0.0
Genre_Crime 0.0
Genre_Thriller 0.0
Genre_Horror 0.0
Genre_Sci-Fi 0.0
Genre_Documentary 0.0
Genre_War 0.0
Genre_Musical 0.0
Genre_Mystery 0.0
Genre_Film-Noir 0.0
Genre_Western 0.0
Name: 0, dtype: object
A tidier alternative is the vectorized str.get_dummies method on the genres column, which splits on the delimiter and builds all the indicator columns in one step:
dummies_demo = movies['genres'].str.get_dummies('|')
prefix = 'genre_'
dummies_demo = dummies_demo.add_prefix(prefix)
merged_df = pd.concat([movies, dummies_demo], axis=1)
merged_df
      movie_id                               title                        genres  genre_Action  ...  genre_Thriller  genre_War  genre_Western
0            1                    Toy Story (1995)   Animation|Children's|Comedy             0  ...               0          0              0
1            2                      Jumanji (1995)  Adventure|Children's|Fantasy             0  ...               0          0              0
2            3             Grumpier Old Men (1995)                Comedy|Romance             0  ...               0          0              0
3            4            Waiting to Exhale (1995)                  Comedy|Drama             0  ...               0          0              0
4            5  Father of the Bride Part II (1995)                        Comedy             0  ...               0          0              0
...        ...                                 ...                           ...           ...  ...             ...        ...            ...
3878      3948             Meet the Parents (2000)                        Comedy             0  ...               0          0              0
3879      3949          Requiem for a Dream (2000)                         Drama             0  ...               0          0              0
3880      3950                    Tigerland (2000)                         Drama             0  ...               0          0              0
3881      3951             Two Family House (2000)                         Drama             0  ...               0          0              0
3882      3952               Contender, The (2000)                Drama|Thriller             0  ...               1          0              0

3883 rows × 21 columns
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values=np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins= [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values,bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0       False       False       False       False        True
1       False        True       False       False       False
2        True       False       False       False       False
3       False        True       False       False       False
4       False       False        True       False       False
5       False       False        True       False       False
6       False       False       False       False        True
7       False       False       False        True       False
8       False       False       False        True       False
9       False       False       False        True       False
String Manipulation
Python has long been a popular data-processing language, in part because of its simple, easy-to-use string and text processing features.
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including newlines):
pieces=[x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
The best way to detect a substring is with Python's in keyword, though index and find can also be used:
'guido' in val
True
val.index(',')
1
val.find(':')
-1
Note the difference between find and index: index raises an exception if the string isn't found (rather than returning -1):
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[88], line 1
----> 1 val.index(':')
ValueError: substring not found
count returns the number of occurrences of a particular substring:
val.count(',')
2
replace substitutes occurrences of one pattern for another. It is also commonly used to delete patterns by passing an empty string:
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'
Python's built-in string methods:
- count: Return the number of non-overlapping occurrences of a substring in the string
- endswith, startswith: Return True if the string ends with the suffix (starts with the prefix)
- join: Use the string as a delimiter for concatenating a sequence of other strings
- index: Return the position of the first character of the first occurrence of a substring; raise ValueError if not found
- find: Return the position of the first character of the first occurrence of a substring; return -1 if not found
- rfind: Return the position of the first character of the last occurrence of a substring; return -1 if not found
- replace: Replace occurrences of one string with another
- strip, rstrip, lstrip: Trim whitespace (including newlines) on both sides, the right side, or the left side
- split: Break the string into a list of substrings using the passed delimiter
- lower, upper: Convert to lowercase/uppercase
- ljust, rjust: Left or right justify; pad the opposite side with spaces (or some other fill character) to return a string with a minimum width
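A quick demonstration of a few of the methods listed above, reusing the val string from earlier:

```python
val = 'a,b, guido'

n = val.count(',')             # non-overlapping occurrences
starts = val.startswith('a')   # prefix test
padded = 'ab'.ljust(5, '*')    # pad on the right to a minimum width of 5
stripped = '  x \n'.strip()    # surrounding whitespace (incl. newline) removed
```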
Regular Expressions
The re module's functions fall into three categories: pattern matching, substitution, and splitting.
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
You can compile a regex yourself with re.compile, forming a reusable regex object:
regex=re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
match and search are closely related to findall:
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
m = regex.search(text)
m
text[m.start():m.end()]
'[email protected]'
print(regex.match(text))
None
The sub method replaces occurrences of the pattern with a specified string:
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('[email protected]')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Regular expression methods:
- findall, finditer: Return all non-overlapping matching patterns in a string. findall returns a list of all patterns, while finditer returns them one by one through an iterator
- match: Match the pattern at the start of the string, optionally segmenting the pattern components into groups. If the pattern matches, return a match object; otherwise return None
- search: Scan the whole string for a match to the pattern, returning a match object if found. Unlike match, the match can be anywhere in the string, not just at the beginning
- split: Break the string into pieces at each occurrence of the pattern
- sub, subn: Replace all (sub) or the first n (subn) occurrences of the pattern in the string with a replacement expression. Use the symbols \1, \2, ... to refer to the match groups in the replacement string
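A small sketch of the two methods not shown above, finditer and subn, using a toy digit pattern (not the email pattern from the text):

```python
import re

regex = re.compile(r'\d+')
text = 'a1 b22 c333'

# finditer yields match objects lazily; subn also reports how many replacements.
spans = [m.span() for m in regex.finditer(text)]
replaced, n = regex.subn('#', text, count=2)   # replace only the first 2 matches
```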
Vectorized String Functions in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:
data = {'Dave': '[email protected]', 'Steve': '[email protected]','Rob': '[email protected]', 'Wes': np.nan}
data=pd.Series(data)
data
Dave [email protected]
Steve [email protected]
Rob [email protected]
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
There are two ways to do vectorized element retrieval: either use str.get or index into the str attribute:
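Both forms are sketched below on a toy Series (not the email data); note that NaN propagates through the str accessor:

```python
import numpy as np
import pandas as pd

s = pd.Series(['abc', 'de', np.nan])

first = s.str.get(0)   # element retrieval; NaN stays NaN
head2 = s.str[:2]      # slicing through the str attribute
```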
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object