The complete script is linked in the official account.
Some errors in the original text have been corrected.
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 7.12.0 -- An enhanced Interactive Python.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['font.family'] = 'sans-serif'
requests = pd.read_csv('311-service-requests.csv')
E:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
7.1 How do I know if it's messy?
We're going to look at a few columns here. I already know there are some problems with the zip code column, so let's look at that first.
To get a sense of whether a column has problems, I usually use .unique() to look at all of its values. If it's a column of numbers, I'll also plot a histogram to get a feel for the distribution (a quick sketch of that follows the output below).
requests['Incident Zip'].unique()
Out[4]:
array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0,
11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0,
11219.0, 10025.0, 10310.0, 11236.0, nan, 10033.0, 11216.0, 10016.0,
...
'10307', '11103', '10004', '10069', '10005', '10474', '11428',
'11436', '10020', '11001', '11362', '11693', '10464', '11427',
'10044', '11363', '10006', '10000', '02061', '77092-2016', '10280',
'11109', '14225', '55164-0737', '19711', '07306', '000000',
'NO CLUE', '90010', '10281', '11747', '23541', '11776', '11697',
'11788', '07604', 10112.0, 11788.0, 11563.0, 11580.0, 7087.0,
11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0,
'10803', '11716', '11722', '11549-3650', '10162', '92123', '23502',
'11518', '07020', '08807', '11577', '07114', '11003', '07201',
'11563', '61702', '10103', '29616-0759', '35209-3114', '11520',
'11735', '10129', '11005', '41042', '11590', 6901.0, 7208.0,
11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0,
10107.0], dtype=object)
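For a numeric column, the histogram check mentioned above might look something like the following. This is just a sketch, not part of the original session, and it assumes the Longitude column (which shows up in the outputs below) is a sensible numeric column to inspect:
# Sketch: get a feel for a numeric column's distribution (assumes 'Longitude' exists)
requests['Longitude'].hist(bins=50)
plt.title('Longitude distribution')
plt.show()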
7.2 Fixing the nan values and the string/float confusion
We can pass a na_values option to pd.read_csv to clean these up. We can also specify that the type of Incident Zip is a string, not a float.
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
requests['Incident Zip'].unique()
Out[5]:
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
'11417', '10011', '11225', '11218', '10003', '10029', '10466',
'11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
...
'10461', '11224', '11429', '10035', '11366', '11362', '11206',
'10460', '10304', '11360', '11411', '10455', '10475', '10069',
'10303', '10308', '10302', '11357', '10470', '11367', '11370',
'10454', '10451', '11436', '11426', '10153', '11004', '11428',
'11427', '11001', '11363', '10004', '10474', '11430', '10000',
'10307', '11239', '10119', '10006', '10048', '11697', '11692',
'11693', '10573', '00083', '11559', '10020', '77056', '11776',
'70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',
'55164-0737', '19711', '07306', '000000', '90010', '11747',
'23541', '11788', '07604', '10112', '11563', '11580', '07087',
'11042', '07093', '11501', '92123', '00000', '11575', '07109',
'11797', '10803', '11716', '11722', '11549-3650', '10162', '23502',
'11518', '07020', '08807', '11577', '07114', '11003', '07201',
'61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
'10129', '11005', '41042', '11590', '06901', '07208', '11530',
'13221', '10954', '11111', '10107'], dtype=object)
7.3 What's up with the dashes?
rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
len(requests[rows_with_dashes])
Out[6]: 5
requests[rows_with_dashes]
Out[7]:
Unique Key Created Date ... Longitude Location
29136 26550551 10/24/2013 06:16:34 PM ... NaN NaN
30939 26548831 10/24/2013 09:35:10 AM ... NaN NaN
70539 26488417 10/15/2013 03:40:33 PM ... NaN NaN
85821 26468296 10/10/2013 12:36:43 PM ... NaN NaN
89304 26461137 10/09/2013 05:23:46 PM ... NaN NaN
[5 rows x 52 columns]
I thought these were missing data, so I dropped them like this:
requests.loc[rows_with_dashes, 'Incident Zip'] = np.nan
(Using .loc here avoids the SettingWithCopyWarning that chained indexing like requests['Incident Zip'][rows_with_dashes] = np.nan would trigger.)
But 9-digit zip codes are actually normal. Let's look at all the zip codes that are longer than 5 digits, make sure they're okay, and then truncate them.
long_zip_codes = requests['Incident Zip'].str.len() > 5
requests['Incident Zip'][long_zip_codes].unique()
Out[9]: array(['000000'], dtype=object)
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0,5)
requests[requests['Incident Zip'] == '00000']
Out[11]:
Unique Key Created Date ... Longitude Location
42600 26529313 10/22/2013 02:51:06 PM ... NaN NaN
60843 26507389 10/17/2013 05:48:44 PM ... NaN NaN
[2 rows x 52 columns]
zero_zips = requests['Incident Zip'] == '00000'
requests.loc[zero_zips, 'Incident Zip'] = np.nan
Much better. Let's look at the unique zip codes again to check our work:
unique_zips = requests['Incident Zip'].unique().astype('str')
unique_zips.sort()
unique_zips
Out[13]:
array(['00083', '02061', '06901', '07020', '07087', '07093', '07109',
'07114', '07201', '07208', '07306', '07604', '08807', '10000',
'10001', '10002', '10003', '10004', '10005', '10006', '10007',
...
'11370', '11372', '11373', '11374', '11375', '11377', '11378',
'11379', '11385', '11411', '11412', '11413', '11414', '11415',
'11416', '11417', '11418', '11419', '11420', '11421', '11422',
'11423', '11426', '11427', '11428', '11429', '11430', '11432',
'11433', '11434', '11435', '11436', '11501', '11518', '11520',
'11530', '11559', '11563', '11575', '11577', '11580', '11590',
'11691', '11692', '11693', '11694', '11697', '11716', '11722',
'11735', '11747', '11776', '11788', '11797', '13221', '14225',
'19711', '23502', '23541', '41042', '61702', '70711', '77056',
'90010', '92123', 'nan'], dtype='<U5')
zips = requests['Incident Zip']
# Zips starting with '0' or '1' are in or near the New York area
is_close = zips.str.startswith('0') | zips.str.startswith('1')
# Ignore the NaNs for now by requiring notna()
is_far = ~(is_close) & zips.notna()
zips[is_far]
Out[16]:
12102 77056
13450 70711
44008 90010
47048 23541
57636 92123
71001 92123
71834 23502
80573 61702
94201 41042
Name: Incident Zip, dtype: object
requests[is_far][['Incident Zip', 'Descriptor', 'City']]
Out[17]:
Incident Zip Descriptor City
12102 77056 Debt Not Owed HOUSTON
13450 70711 Contract Dispute CLIFTON
44008 90010 Billing Dispute LOS ANGELES
47048 23541 Harassment NORFOLK
57636 92123 Harassment SAN DIEGO
71001 92123 Billing Dispute SAN DIEGO
71834 23502 Harassment NORFOLK
80573 61702 Billing Dispute BLOOMIGTON
94201 41042 Harassment FLORENCE
7.4 Putting it all together
Here's the whole zip code cleanup in one place:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv',
na_values=na_values,
dtype={'Incident Zip': str})
def fix_zip_codes(zips):
# Truncate everything to length 5
zips = zips.str.slice(0, 5)
# Set 00000 zip codes to nan
zero_zips = zips == '00000'
zips[zero_zips] = np.nan
return zips
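As a quick sanity check (not part of the original session), fix_zip_codes can be tried on a tiny hand-made Series; the ZIP+4 value is truncated to five digits and '00000' becomes NaN:
# Sketch: exercising fix_zip_codes on example values
sample = pd.Series(['11432', '77092-2016', '00000'])
fix_zip_codes(sample)
# expected: '11432', '77092', NaN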
requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
requests['Incident Zip'].unique()
Out[18]:
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
'11417', '10011', '11225', '11218', '10003', '10029', '10466',
'11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
...
'11501', '92123', '11575', '07109', '11797', '10803', '11716',
'11722', '11549', '10162', '23502', '11518', '07020', '08807',
'11577', '07114', '11003', '07201', '61702', '10103', '29616',
'35209', '11520', '11735', '10129', '11005', '41042', '11590',
'06901', '07208', '11530', '13221', '10954', '11111', '10107'],
dtype=object)
Chapter 8
8.1 Parsing Unix timestamps
We'll be working with the popularity-contest file, which records Unix timestamps for when Debian packages were last used:
popcon = pd.read_csv('popularity-contest', sep=' ', )[:-1]
popcon.columns = ['atime', 'ctime', 'package-name', 'mru-program', 'tag']
The columns are the access time, the created time, the package name, the most recently used program, and a tag.
popcon[:5]
Out[19]:
atime ... tag
0 1387295797 ... NaN
1 1387295796 ... NaN
2 1387295743 ... NaN
3 1387295743 ...
4 1387295742 ... NaN
[5 rows x 5 columns]
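It may help to see why the astype(int) calls below are needed at all. This check isn't in the original session; presumably atime and ctime come in as object (string) columns here, since the file's trailer line was read as text before being sliced off with [:-1]:
# Sketch: inspect the dtypes before any conversion
popcon.dtypes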
The magical part about parsing timestamps in pandas is that numpy datetimes are already stored as Unix timestamps, so all we need to do is tell pandas that these integers are actually datetimes; it doesn't need to do any conversion.
We first need to convert the columns to integers:
popcon['atime'] = popcon['atime'].astype(int)
popcon['ctime'] = popcon['ctime'].astype(int)
Every numpy array and pandas series has a dtype; this is usually int64, float64, or object. Some of the available time types are datetime64[s], datetime64[ms], and datetime64[us]. Similarly, there are timedelta types.
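As a small illustration (not from the original session), these dtypes can be constructed directly:
# Sketch: numpy datetime64 values at different resolutions, plus timedeltas
np.array(['2013-12-17'], dtype='datetime64[s]')
np.array(['2013-12-17'], dtype='datetime64[ms]')
pd.to_timedelta([1, 60, 3600], unit='s')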
We can use the pd.to_datetime function to convert our integer timestamps into datetimes. This is a constant-time operation: we aren't actually changing any of the data, just how pandas thinks about it.
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')
popcon['ctime'] = pd.to_datetime(popcon['ctime'], unit='s')
popcon['atime'].dtype
Out[22]: dtype('<M8[ns]')
Now we can look at atime and ctime as times:
popcon[:5]
Out[23]:
atime ... tag
0 2013-12-17 15:56:37 ... NaN
1 2013-12-17 15:56:36 ... NaN
2 2013-12-17 15:55:43 ... NaN
3 2013-12-17 15:55:43 ...
4 2013-12-17 15:55:42 ... NaN
[5 rows x 5 columns]
First, I want to get rid of everything with a timestamp of 0. Notice that we can use a string in this comparison, even though the column actually holds timestamps on the inside; that's because pandas is amazing.
popcon = popcon[popcon['atime'] > '1970-01-01']
Next, let's exclude library packages and sort by ctime to look at the ten most recent entries:
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]
nonlibraries.sort_values('ctime', ascending=False)[:10]
Out[24]:
atime ... tag
57 2013-12-17 04:55:39 ...
450 2013-12-16 20:03:20 ...
454 2013-12-16 20:03:20 ...
445 2013-12-16 20:03:20 ...
396 2013-12-16 20:08:27 ...
449 2013-12-16 20:03:20 ...
397 2013-12-16 20:08:25 ...
398 2013-12-16 20:08:23 ...
452 2013-12-16 20:03:20 ...
440 2013-12-16 20:03:20 ...
[10 rows x 5 columns]
Chapter 9, on connecting to databases, is not covered here.