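Today I start the worked examples in Chapter 14. After some thought, I decided to study them one at a time and try to master each.
Chapter 14 Data Analysis Examples
14.1 USA.gov Data from Bitly
Use the loads function from the json module to load the downloaded data file line by line.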
import json
path = 'datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
After loading the data this way, records is a list of Python dicts, one per line of the file.
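To make the line-by-line loading concrete, here is a minimal sketch that does not need the real file. The two JSON lines and the name sample_records are made up for illustration; only json.loads and the one-object-per-line layout come from the text above.

```python
import io
import json

# Hypothetical stand-in for the data file: one JSON object per line.
sample = io.StringIO(
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
    '{"tz": "", "a": "Mozilla/4.0"}\n'
)

# Each line is parsed on its own, so the result is a list of dicts.
sample_records = [json.loads(line) for line in sample]

print(type(sample_records))     # the container is a list
print(sample_records[0]['tz'])  # each element is a dict keyed by field name
```

With the real file, the same comprehension can run inside a `with open(path) as f:` block so the file is closed afterwards.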
Counting Time Zones in Pure Python
Extract the time zone field
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# The trailing `if 'tz' in rec` skips records that have no tz field, which would otherwise raise a KeyError.
time_zones[:10]
['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']
# Count the values using plain Python only.
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
# The standard library's defaultdict does the same counting more concisely.
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts
counts = get_counts(time_zones)
print(counts['America/New_York'])
1251
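As a quick sanity check, the two counting approaches can be compared on a small list. This is only a sketch: the list zones and the function names count_plain and count_defaultdict are invented for the example and mirror the two functions above.

```python
from collections import defaultdict

def count_plain(sequence):
    # Plain-dict version: test for the key before incrementing.
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def count_defaultdict(sequence):
    # defaultdict(int) supplies 0 for keys seen for the first time.
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

# Made-up sample data, including empty strings like the real tz field.
zones = ['America/New_York', '', 'America/New_York', 'Europe/Warsaw', '']

plain = count_plain(zones)
dd = count_defaultdict(zones)
print(plain == dict(dd))  # both approaches give the same counts
```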
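# To get the 10 most frequent time zones, write a small helper that sorts the counts.
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()  # ascending, so the largest counts come last
    return value_key_pairs[-n:]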
top_counts(counts)
[(33, 'America/Sao_Paulo'),
(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]
# The standard library's collections.Counter makes this even simpler.
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]
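The hand-written sort and Counter.most_common can be cross-checked on a toy input. The sketch below uses a made-up list and a hypothetical helper top_counts_sorted that sorts (key, count) pairs in descending order, which is a slight variation on top_counts above rather than the same function.

```python
from collections import Counter

# Made-up sample with no tied counts, so the ordering is unambiguous.
zones = ['America/New_York'] * 3 + [''] * 2 + ['Europe/Warsaw']
sample_counts = Counter(zones)

def top_counts_sorted(count_dict, n=10):
    # Sort (key, count) pairs by count, largest first, and keep n of them.
    return sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)[:n]

top2 = top_counts_sorted(sample_counts, n=2)
print(top2)
print(top2 == sample_counts.most_common(2))
```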
Counting Time Zones with pandas
Convert the records to a DataFrame and compute the statistics from it.
import pandas as pd
frame = pd.DataFrame(records)
frame.info()
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
a 3440 non-null object
c 2919 non-null object
nk 3440 non-null float64
tz 3440 non-null object
gr 2919 non-null object
g 3440 non-null object
h 3440 non-null object
l 3440 non-null object
al 3094 non-null object
hh 3440 non-null object
r 3440 non-null object
u 3440 non-null object
t 3440 non-null float64
hc 3440 non-null float64
cy 2919 non-null object
ll 2919 non-null object
_heartbeat_ 120 non-null float64
kw 93 non-null object
dtypes: float64(4), object(14)
memory usage: 306.0+ KB
frame['tz'][:10]
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz, dtype: object
# For a Series, value_counts() tallies how often each value occurs.
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Name: tz, dtype: int64
# Before plotting, replace missing values (NaN) and empty strings with explicit labels.
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
Name: tz, dtype: int64
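The two kinds of "no time zone" are handled separately: NaN means the record had no tz field at all, while an empty string means the field was present but blank. The following sketch shows both replacements on a made-up Series, and also a chained form using Series.replace, which should give the same result under these assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample: one NaN (field absent) and one empty string (field blank).
tz = pd.Series(['America/New_York', '', np.nan, 'America/New_York'])

# Two-step version, as in the text.
clean = tz.fillna('Missing')
clean[clean == ''] = 'Unknown'

# Chained alternative: replace('', ...) matches whole values equal to ''.
clean_chained = tz.fillna('Missing').replace('', 'Unknown')

result = clean.value_counts()
print(result)
```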
import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)
# Field 'a' holds the user-agent string; keep its first whitespace-delimited token.
results = pd.Series([x.split()[0] for x in frame.a.dropna()])
print(results[:5])
print('\n')
print(results.value_counts()[:8])
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
dtype: object
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
dtype: int64
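The same first-token extraction can also be written with pandas' vectorized string methods instead of a list comprehension. This is a sketch on made-up user-agent strings; the names agents and first_tokens are illustrative only.

```python
import pandas as pd

# Made-up user-agent strings, with one missing value.
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'GoogleMaps/RochesterNY',
    None,
])

# List-comprehension version, as in the text.
by_comprehension = [x.split()[0] for x in agents.dropna()]

# Vectorized version: split on whitespace, then take element 0 of each list.
first_tokens = agents.dropna().str.split().str[0]

print(list(first_tokens))
```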
import numpy as np
# Take an explicit copy of the filtered rows. Without .copy(), adding a column
# below triggers pandas' SettingWithCopyWarning ("A value is trying to be set
# on a copy of a slice from a DataFrame").
cframe = frame[frame.a.notnull()].copy()
cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
                        'Windows', 'Not Windows')
cframe['os'][:5]
0 Windows
1 Not Windows
2 Windows
3 Not Windows
4 Windows
Name: os, dtype: object
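To see the labeling step in isolation, here is a small sketch on made-up user-agent strings. It uses DataFrame.assign, which returns a new DataFrame, so no column is assigned in place on a filtered slice. The names toy_frame and labeled are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: field 'a' mimics the user-agent column, with one missing value.
toy_frame = pd.DataFrame({
    'a': ['Mozilla/5.0 (Windows NT 6.1)', None, 'Opera/9.80 (X11; Linux)'],
})

# Drop rows without a user agent, then add the 'os' label on a new frame.
labeled = toy_frame[toy_frame['a'].notna()].assign(
    os=lambda d: np.where(d['a'].str.contains('Windows'),
                          'Windows', 'Not Windows')
)

print(labeled['os'].tolist())
```

The original toy_frame is left unchanged, which is easy to confirm by checking that it has no 'os' column.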
# Group by time zone and OS, count the rows in each group, and pivot OS into columns.
by_tz_os = cframe.groupby(['tz', 'os'])
agg_counts = by_tz_os.size().unstack().fillna(0)
print(agg_counts[:10])
os Not Windows Windows
tz
245.0 276.0
Africa/Cairo 0.0 3.0
Africa/Casablanca 0.0 1.0
Africa/Ceuta 0.0 2.0
Africa/Johannesburg 0.0 1.0
Africa/Lusaka 0.0 1.0
America/Anchorage 4.0 1.0
America/Argentina/Buenos_Aires 1.0 0.0
America/Argentina/Cordoba 0.0 1.0
America/Argentina/Mendoza 0.0 1.0
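# Sort the time zones by total count (ascending) and keep the last 10 rows.
indexer = agg_counts.sum(1).argsort()
count_subset = agg_counts.take(indexer[-10:])
# pandas' nlargest method returns the same top 10 more directly.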
agg_counts.sum(1).nlargest(10)
tz
America/New_York 1251.0
521.0
America/Chicago 400.0
America/Los_Angeles 382.0
America/Denver 191.0
Europe/London 74.0
Asia/Tokyo 37.0
Pacific/Honolulu 36.0
Europe/Madrid 35.0
America/Sao_Paulo 33.0
dtype: float64
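The group-count-pivot step and the top-n selection can be traced on a tiny table. In this sketch the DataFrame toy and its values are invented; it compares an explicit descending sort with nlargest to show that they pick the same rows here.

```python
import pandas as pd

# Made-up records: time zone 'C' has no 'Not Windows' rows.
toy = pd.DataFrame({
    'tz': ['A', 'A', 'A', 'B', 'B', 'C'],
    'os': ['Windows', 'Windows', 'Not Windows',
           'Windows', 'Not Windows', 'Windows'],
})

# size() counts rows per (tz, os); unstack() pivots os into columns;
# fillna(0) fills combinations that never occur.
agg = toy.groupby(['tz', 'os']).size().unstack().fillna(0)

totals = agg.sum(axis=1)
top2_sorted = totals.sort_values(ascending=False).head(2)
top2_nlargest = totals.nlargest(2)

print(agg)
print(top2_nlargest)
```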
# Reshape to long format (one row per tz/os pair) so seaborn can split bars by OS.
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
sns.barplot(x='total', y='tz', hue='os', data=count_subset)
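What stack and reset_index do to the table is easier to see on a two-row example. The sketch below builds a made-up wide table with the same layout (tz as the index, os as the column level) and converts it to the long form that barplot receives; the names wide and long_form are illustrative.

```python
import pandas as pd

# Made-up wide table: rows are time zones, columns are OS labels.
wide = pd.DataFrame(
    {'Not Windows': [1.0, 0.0], 'Windows': [2.0, 3.0]},
    index=pd.Index(['A', 'B'], name='tz'),
)
wide.columns.name = 'os'

# stack() moves the column level into the index, giving a Series
# indexed by (tz, os); naming it sets the value column's label.
long_form = wide.stack()
long_form.name = 'total'

# reset_index() turns the (tz, os) index back into ordinary columns.
long_form = long_form.reset_index()

print(long_form)
```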
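Notes:
This series reproduces the content of the post linked below.
Original link: https://www.jianshu.com/p/04d180d90a3f
The author of that post links to the book and related resources. Since I have already picked up bits and pieces along the way, this series may reproduce only part of the book and may include some ideas of my own; the content is not fixed yet. Thanks to the original author, SeanCheney, for sharing.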