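Preface
This is my first blog. I started it mainly to record my progress learning Python and to keep study notes. I will type out every example from the book Python for Data Analysis (《利用python进行数据分析》) myself. There are many statements I don't yet understand, and some may behave differently from the book because of version differences. I will try to understand each part. I hope this blog will be a record of my learning, and that I can stick with it.
Since this is my first time blogging, I don't know how to make posts look as good as other people's, so I'll take it one step at a time. Let's get started.
Chapter 2: Introduction
This chapter presents some example data sets and shows what we can do with them. It does not explain every statement in detail; it is only a rough overview.
Each line of the file is in JSON (JavaScript Object Notation, a common web data format). Python has many built-in and third-party modules that can convert a JSON string into a Python dict. Here I use the json module and its loads function to load the downloaded data file line by line: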
import json
path='C:\\pytest\\ch02\\usagov_bitly_data2012-03-16-1331923249.txt'
records=[json.loads(line) for line in open(path)]
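The last expression above is called a list comprehension, a concise way to apply the same operation (such as json.loads) to a collection of strings (or other objects). Iterating over an open file handle yields a sequence of its lines.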
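To convince myself that the list comprehension really is just a compact loop, here is a small sketch. It uses two made-up JSON lines held in an in-memory io.StringIO object instead of the real data file, so the sample content and the names records_demo and records_loop are my own assumptions.

```python
import io
import json

# Two made-up JSON lines standing in for the real data file (hypothetical sample).
sample = io.StringIO(
    '{"tz": "America/New_York", "c": "US"}\n'
    '{"tz": "Europe/Warsaw", "c": "PL"}\n'
)

# Iterating over a file-like object yields one line at a time;
# json.loads turns each line into a dict.
records_demo = [json.loads(line) for line in sample]

# The comprehension above is shorthand for this explicit loop.
sample.seek(0)
records_loop = []
for line in sample:
    records_loop.append(json.loads(line))

print(records_demo == records_loop)
print(records_demo[1]["tz"])
```

Both versions build the same list of dicts; the comprehension just avoids the temporary list and the append call.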
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'al': 'en-US,en;q=0.8',
'c': 'US',
'cy': 'Danvers',
'g': 'A6qOVH',
'gr': 'MA',
'h': 'wfLQtf',
'hc': 1331822918,
'hh': '1.usa.gov',
'l': 'orofrog',
'll': [42.576698, -70.954903],
'nk': 1,
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
't': 1331923247,
'tz': 'America/New_York',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
Now a value in a record can be accessed by giving its key as a string:
records[0]['tz']
'America/New_York'
time_zones=[rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']
Next, count the time zones using only the Python standard library.
Method 1
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
Method 2: if you know the Python standard library well, the code can be written more concisely
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts
counts=get_counts(time_zones)
counts['America/New_York']
Output: 1251
counts=get_counts2(time_zones)
counts['America/New_York']
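This also outputs 1251
len(time_zones)  # number of records that have a tz field, not the number of distinct time zones
Output: 3440
To get the top 10 time zones and their counts, we can use a little dictionary trick:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]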
top_counts(counts)
[(33, 'America/Sao_Paulo'),
(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]
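The trick in top_counts is to swap each pair to (count, tz) so that a plain sort orders by count. As an alternative sketch (not the book's version), sorted with a key function does the same job without swapping. The function name top_counts_sorted and the tiny demo_counts dict are my own assumptions.

```python
def top_counts_sorted(count_dict, n=10):
    # Sort (key, count) pairs by count, largest first, and keep the first n.
    return sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical miniature version of the counts dict.
demo_counts = {"America/New_York": 3, "": 2, "Europe/London": 1}
print(top_counts_sorted(demo_counts, n=2))
# [('America/New_York', 3), ('', 2)]
```

Note that this returns the largest counts first, while top_counts above returns them in ascending order.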
The collections.Counter class in the Python standard library makes this task much simpler:
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]
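As a quick check that the three counting approaches agree, here is a minimal sketch on a made-up list of five time zone strings. The list zones and the variable names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical sample; the empty string stands for an unknown time zone.
zones = ["America/New_York", "", "America/New_York", "Europe/London", "America/New_York"]

# Plain dict, using get() with a default instead of an if/else.
manual = {}
for z in zones:
    manual[z] = manual.get(z, 0) + 1

# defaultdict(int) creates missing keys with the value 0.
with_default = defaultdict(int)
for z in zones:
    with_default[z] += 1

# Counter does the counting in one call.
with_counter = Counter(zones)

print(manual == dict(with_default) == dict(with_counter))
print(with_counter.most_common(1))
```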
DataFrame is the most important data structure in pandas; it represents data as a table. Creating a DataFrame from a list of raw records is simple:
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
frame = DataFrame(records)
frame
| | _heartbeat_ | a | al | c | cy | g | gr | h | hc | hh | kw | l | ll | nk | r | t | tz | u |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | en-US,en;q=0.8 | US | Danvers | A6qOVH | MA | wfLQtf | 1.331823e+09 | 1.usa.gov | NaN | orofrog | [42.576698, -70.954903] | 1.0 | http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/… | 1.331923e+09 | America/New_York | http://www.ncbi.nlm.nih.gov/pubmed/22415991 |
1 | NaN | GoogleMaps/RochesterNY | NaN | US | Provo | mwszkS | UT | mwszkS | 1.308262e+09 | j.mp | NaN | bitly | [40.218102, -111.613297] | 0.0 | http://www.AwareMap.com/ | 1.331923e+09 | America/Denver | http://www.monroecounty.gov/etc/911/rss.php |
2 | NaN | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT … | en-US | US | Washington | xxr3Qb | DC | xxr3Qb | 1.331920e+09 | 1.usa.gov | NaN | bitly | [38.9007, -77.043098] | 1.0 | http://t.co/03elZC4Q | 1.331923e+09 | America/New_York | http://boxer.senate.gov/en/press/releases/0316… |
3 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)… | pt-br | BR | Braz | zCaLwp | 27 | zUtuOu | 1.331923e+09 | 1.usa.gov | NaN | alelex88 | [-23.549999, -46.616699] | 0.0 | direct | 1.331923e+09 | America/Sao_Paulo | http://apod.nasa.gov/apod/ap120312.html |
4 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | en-US,en;q=0.8 | US | Shrewsbury | 9b6kNl | MA | 9b6kNl | 1.273672e+09 | bit.ly | NaN | bitly | [42.286499, -71.714699] | 0.0 | http://www.shrewsbury-ma.gov/selco/ | 1.331923e+09 | America/New_York | http://www.shrewsbury-ma.gov/egov/gallery/1341… |
5 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | en-US,en;q=0.8 | US | Shrewsbury | axNK8c | MA | axNK8c | 1.273673e+09 | bit.ly | NaN | bitly | [42.286499, -71.714699] | 0.0 | http://www.shrewsbury-ma.gov/selco/ | 1.331923e+09 | America/New_York | http://www.shrewsbury-ma.gov/egov/gallery/1341… |
6 | NaN | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1… | pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4 | PL | Luban | wcndER | 77 | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [51.116699, 15.2833] | 0.0 | http://plus.url.google.com/url?sa=z&n=13319232… | 1.331923e+09 | Europe/Warsaw | http://www.nasa.gov/mission_pages/nustar/main/… |
7 | NaN | Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/2… | bg,en-us;q=0.7,en;q=0.3 | None | NaN | wcndER | NaN | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | NaN | 0.0 | http://www.facebook.com/ | 1.331923e+09 | http://www.nasa.gov/mission_pages/nustar/main/… | |
8 | NaN | Opera/9.80 (X11; Linux zbov; U; en) Presto/2.1… | en-US, en | None | NaN | wcndER | NaN | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | NaN | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2F1… | 1.331923e+09 | http://www.nasa.gov/mission_pages/nustar/main/… | |
9 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4 | None | NaN | zCaLwp | NaN | zUtuOu | 1.331923e+09 | 1.usa.gov | NaN | alelex88 | NaN | 0.0 | http://t.co/o1Pd0WeV | 1.331923e+09 | http://apod.nasa.gov/apod/ap120312.html | |
10 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)… | en-us,en;q=0.5 | US | Seattle | vNJS4H | WA | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | [47.5951, -122.332603] | 1.0 | direct | 1.331923e+09 | America/Los_Angeles | https://www.nysdot.gov/rexdesign/design/commun… |
11 | NaN | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4… | en-us,en;q=0.5 | US | Washington | wG7OIH | DC | A0nRz4 | 1.331816e+09 | 1.usa.gov | NaN | darrellissa | [38.937599, -77.092796] | 0.0 | http://t.co/ND7SoPyo | 1.331923e+09 | America/New_York | http://oversight.house.gov/wp-content/uploads/… |
12 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)… | en-us,en;q=0.5 | US | Alexandria | vNJS4H | VA | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | [38.790901, -77.094704] | 1.0 | direct | 1.331923e+09 | America/New_York | https://www.nysdot.gov/rexdesign/design/commun… |
13 | 1.331923e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US… | en-us,en;q=0.5 | US | Marietta | 2rOUYc | GA | 2rOUYc | 1.255770e+09 | 1.usa.gov | NaN | bitly | [33.953201, -84.5177] | 1.0 | direct | 1.331923e+09 | America/New_York | http://toxtown.nlm.nih.gov/index.php |
15 | NaN | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1… | zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4 | HK | Central District | nQvgJp | 00 | rtrrth | 1.317318e+09 | j.mp | NaN | walkeryuen | [22.2833, 114.150002] | 1.0 | http://forum2.hkgolden.com/view.aspx?type=BW&m… | 1.331923e+09 | Asia/Hong_Kong | http://www.ssd.noaa.gov/PS/TROP/TCFP/data/curr… |
16 | NaN | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1… | zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4 | HK | Central District | XdUNr | 00 | qWkgbq | 1.317318e+09 | j.mp | NaN | walkeryuen | [22.2833, 114.150002] | 1.0 | http://forum2.hkgolden.com/view.aspx?type=BW&m… | 1.331923e+09 | Asia/Hong_Kong | http://www.usno.navy.mil/NOOC/nmfc-ph/RSS/jtwc… |
17 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; r… | en-us,en;q=0.5 | US | Buckfield | zH1BFf | ME | x3jOIv | 1.331840e+09 | 1.usa.gov | NaN | andyzieminski | [44.299702, -70.369797] | 0.0 | http://t.co/6Cx4ROLs | 1.331923e+09 | America/New_York | http://www.usda.gov/wps/portal/usda/usdahome?c… |
18 | NaN | GoogleMaps/RochesterNY | NaN | US | Provo | mwszkS | UT | mwszkS | 1.308262e+09 | 1.usa.gov | NaN | bitly | [40.218102, -111.613297] | 0.0 | http://www.AwareMap.com/ | 1.331923e+09 | America/Denver | http://www.monroecounty.gov/etc/911/rss.php |
19 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4 | IT | Venice | wcndER | 20 | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [45.438599, 12.3267] | 0.0 | http://www.facebook.com/ | 1.331923e+09 | Europe/Rome | http://www.nasa.gov/mission_pages/nustar/main/… |
20 | NaN | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT … | es-ES | ES | Alcal | zQ95Hi | 51 | ytZYWR | 1.331671e+09 | bitly.com | NaN | jplnews | [37.516701, -5.9833] | 0.0 | http://www.facebook.com/ | 1.331923e+09 | Africa/Ceuta | http://voyager.jpl.nasa.gov/imagesvideo/uranus… |
21 | NaN | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6… | en-us,en;q=0.5 | US | Davidsonville | wcndER | MD | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [38.939201, -76.635002] | 0.0 | http://www.facebook.com/ | 1.331923e+09 | America/New_York | http://www.nasa.gov/mission_pages/nustar/main/… |
22 | NaN | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT … | en-us | US | Hockessin | y3ZImz | DE | y3ZImz | 1.331064e+09 | 1.usa.gov | NaN | bitly | [39.785, -75.682297] | 0.0 | direct | 1.331923e+09 | America/New_York | http://portal.hud.gov/hudportal/documents/hudd… |
23 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)… | en-us | US | Lititz | wWiOiD | PA | wWiOiD | 1.330218e+09 | 1.usa.gov | NaN | bitly | [40.174999, -76.3078] | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2F1… | 1.331923e+09 | America/New_York | http://www.tricare.mil/mybenefit/ProfileFilter… |
24 | NaN | Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES… | es-es,es;q=0.8,en-us;q=0.5,en;q=0.3 | ES | Bilbao | wcndER | 59 | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [43.25, -2.9667] | 0.0 | http://www.facebook.com/ | 1.331923e+09 | Europe/Madrid | http://www.nasa.gov/mission_pages/nustar/main/… |
25 | NaN | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1… | en-GB,en;q=0.8,en-US;q=0.6,en-AU;q=0.4 | MY | Kuala Lumpur | wcndER | 14 | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [3.1667, 101.699997] | 0.0 | http://www.facebook.com/ | 1.331923e+09 | Asia/Kuala_Lumpur | http://www.nasa.gov/mission_pages/nustar/main/… |
26 | NaN | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1… | ro-RO,ro;q=0.8,en-US;q=0.6,en;q=0.4 | CY | Nicosia | wcndER | 04 | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [35.166698, 33.366699] | 0.0 | http://www.facebook.com/?ref=tn_tnmn | 1.331923e+09 | Asia/Nicosia | http://www.nasa.gov/mission_pages/nustar/main/… |
27 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)… | en-US,en;q=0.8 | BR | SPaulo | zCaLwp | 27 | zUtuOu | 1.331923e+09 | 1.usa.gov | NaN | alelex88 | [-23.5333, -46.616699] | 0.0 | direct | 1.331923e+09 | America/Sao_Paulo | http://apod.nasa.gov/apod/ap120312.html |
28 | NaN | Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X)… | en-us | None | NaN | vNJS4H | NaN | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | NaN | 0.0 | direct | 1.331923e+09 | https://www.nysdot.gov/rexdesign/design/commun… | |
29 | NaN | Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X… | en-us | None | NaN | FPX0IM | NaN | FPX0IL | 1.331923e+09 | 1.usa.gov | NaN | twittershare | NaN | 1.0 | http://t.co/5xlp0B34 | 1.331923e+09 | http://www.ed.gov/news/media-advisories/us-dep… | |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
3530 | NaN | Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1… | en-US,en;q=0.8 | US | San Francisco | xVZg4P | CA | wqUkTo | 1.331908e+09 | go.nasa.gov | NaN | nasatwitter | [37.7645, -122.429398] | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2Fg… | 1.331927e+09 | America/Los_Angeles | http://www.nasa.gov/multimedia/imagegallery/im… |
3531 | NaN | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6… | en-US | None | NaN | wcndER | NaN | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | NaN | 0.0 | direct | 1.331927e+09 | http://www.nasa.gov/mission_pages/nustar/main/… | |
3532 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)… | en-us,en;q=0.5 | US | Washington | Au3aUS | DC | A9ct6C | 1.331926e+09 | 1.usa.gov | NaN | ncsha | [38.904202, -77.031998] | 1.0 | http://www.ncsha.org/ | 1.331927e+09 | America/New_York | http://portal.hud.gov/hudportal/HUD?src=/press… |
3533 | NaN | Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) A… | en-us | US | Jacksonville | b2UtUJ | FL | ieCdgH | 1.301393e+09 | go.nasa.gov | NaN | nasatwitter | [30.279301, -81.585098] | 1.0 | direct | 1.331927e+09 | America/New_York | http://apod.nasa.gov/apod/ |
3534 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)… | en-us | US | Frisco | vNJS4H | TX | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | [33.149899, -96.855499] | 1.0 | direct | 1.331927e+09 | America/Chicago | https://www.nysdot.gov/rexdesign/design/commun… |
3535 | NaN | Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/… | en-us | US | Houston | zIgLx8 | TX | yrPaLt | 1.331903e+09 | aash.to | NaN | aashto | [29.775499, -95.415199] | 1.0 | direct | 1.331927e+09 | America/Chicago | http://ntl.bts.gov/lib/44000/44300/44374/FHWA-… |
3536 | NaN | Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; e… | en-US,en;q=0.5 | None | NaN | xIcyim | NaN | yG1TTf | 1.331728e+09 | go.nasa.gov | NaN | nasatwitter | NaN | 0.0 | http://t.co/g1VKE8zS | 1.331927e+09 | http://www.nasa.gov/mission_pages/hurricanes/a… | |
3537 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)… | es-es,es;q=0.8,en-us;q=0.5,en;q=0.3 | HN | Tegucigalpa | zCaLwp | 08 | w63FZW | 1.331547e+09 | 1.usa.gov | NaN | bufferapp | [14.1, -87.216698] | 0.0 | http://t.co/A8TJyibE | 1.331927e+09 | America/Tegucigalpa | http://apod.nasa.gov/apod/ap120312.html |
3538 | NaN | Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma… | en-us | US | Los Angeles | qMac9k | CA | qds1Ge | 1.310474e+09 | 1.usa.gov | NaN | healthypeople | [34.041599, -118.298798] | 0.0 | direct | 1.331927e+09 | America/Los_Angeles | http://healthypeople.gov/2020/connect/webinars… |
3539 | NaN | Mozilla/5.0 (compatible; Fedora Core 3) FC3 KDE | NaN | US | Bellevue | zu2M5o | WA | zDhdro | 1.331586e+09 | bit.ly | NaN | glimtwin | [47.615398, -122.210297] | 0.0 | direct | 1.331927e+09 | America/Los_Angeles | http://www.federalreserve.gov/newsevents/press… |
3540 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | en-US,en;q=0.8 | US | Payson | wcndER | UT | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | [40.014198, -111.738899] | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2F1… | 1.331927e+09 | America/Denver | http://www.nasa.gov/mission_pages/nustar/main/… |
3541 | NaN | Mozilla/5.0 (X11; U; OpenVMS AlphaServer_ES40;… | NaN | US | Bellevue | zu2M5o | WA | zDhdro | 1.331586e+09 | 1.usa.gov | NaN | glimtwin | [47.615398, -122.210297] | 0.0 | direct | 1.331927e+09 | America/Los_Angeles | http://www.federalreserve.gov/newsevents/press… |
3542 | NaN | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT … | en-us | US | Pittsburg | y3reI1 | CA | y3reI1 | 1.331926e+09 | 1.usa.gov | NaN | bitly | [38.0051, -121.838699] | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2F1… | 1.331927e+09 | America/Los_Angeles | http://www.sba.gov/community/blogs/community-b… |
3543 | 1.331927e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3544 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0.1) … | en-us,en;q=0.5 | US | Wentzville | vNJS4H | MO | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | [38.790001, -90.854897] | 1.0 | direct | 1.331927e+09 | America/Chicago | https://www.nysdot.gov/rexdesign/design/commun… |
3545 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)… | en-us,en;q=0.5 | US | Saint Charles | vNJS4H | IL | u0uD9q | 1.319564e+09 | 1.usa.gov | NaN | o_4us71ccioa | [41.9352, -88.290901] | 1.0 | direct | 1.331927e+09 | America/Chicago | https://www.nysdot.gov/rexdesign/design/commun… |
3546 | NaN | Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma… | en-us | US | Los Angeles | qMac9k | CA | qds1Ge | 1.310474e+09 | 1.usa.gov | NaN | healthypeople | [34.041599, -118.298798] | 1.0 | direct | 1.331927e+09 | America/Los_Angeles | http://healthypeople.gov/2020/connect/webinars… |
3547 | NaN | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)… | en-us | US | Silver Spring | y0jYkg | MD | y0jYkg | 1.331852e+09 | 1.usa.gov | NaN | bitly | [39.052101, -77.014999] | 1.0 | direct | 1.331927e+09 | America/New_York | http://www.epa.gov/otaq/regs/fuels/additive/e1… |
3548 | NaN | Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma… | en-us | US | Mcgehee | y5rMac | AR | xANY6O | 1.331916e+09 | 1.usa.gov | NaN | twitterfeed | [33.628399, -91.356903] | 1.0 | https://twitter.com/fdarecalls/status/18069759… | 1.331927e+09 | America/Chicago | http://www.fda.gov/Safety/Recalls/ucm296326.htm |
3549 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | sv-SE,sv;q=0.8,en-US;q=0.6,en;q=0.4 | SE | Sollefte | eH8wu | 24 | 7dtjei | 1.260316e+09 | 1.usa.gov | NaN | tweetdeckapi | [63.166698, 17.266701] | 1.0 | direct | 1.331927e+09 | Europe/Stockholm | http://www.nasa.gov/mission_pages/WISE/main/in… |
3550 | NaN | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT … | en-us | US | Conshohocken | A00b72 | PA | yGSwzn | 1.331918e+09 | 1.usa.gov | NaN | addthis | [40.0798, -75.2855] | 0.0 | http://www.linkedin.com/home?trk=hb_tab_home_top | 1.331927e+09 | America/New_York | http://www.nlm.nih.gov/medlineplus/news/fullst… |
3551 | NaN | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi… | en-US,en;q=0.8 | None | NaN | wcndER | NaN | zkpJBR | 1.331923e+09 | 1.usa.gov | NaN | bnjacobs | NaN | 0.0 | http://plus.url.google.com/url?sa=z&n=13319268… | 1.331927e+09 | http://www.nasa.gov/mission_pages/nustar/main/… | |
3552 | NaN | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US… | NaN | US | Decatur | rqgJuE | AL | xcz8vt | 1.331227e+09 | 1.usa.gov | NaN | bootsnall | [34.572701, -86.940598] | 0.0 | direct | 1.331927e+09 | America/Chicago | http://travel.state.gov/passport/passport_5535… |
3553 | NaN | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT … | en-us | US | Shrewsbury | 9b6kNl | MA | 9b6kNl | 1.273672e+09 | bit.ly | NaN | bitly | [42.286499, -71.714699] | 0.0 | http://www.shrewsbury-ma.gov/selco/ | 1.331927e+09 | America/New_York | http://www.shrewsbury-ma.gov/egov/gallery/1341… |
3554 | NaN | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT … | en-us | US | Shrewsbury | axNK8c | MA | axNK8c | 1.273673e+09 | bit.ly | NaN | bitly | [42.286499, -71.714699] | 0.0 | http://www.shrewsbury-ma.gov/selco/ | 1.331927e+09 | America/New_York | http://www.shrewsbury-ma.gov/egov/gallery/1341… |
3555 | NaN | Mozilla/4.0 (compatible; MSIE 9.0; Windows NT … | en | US | Paramus | e5SvKE | NJ | fqPSr9 | 1.301298e+09 | 1.usa.gov | NaN | tweetdeckapi | [40.9445, -74.07] | 1.0 | direct | 1.331927e+09 | America/New_York | http://www.fda.gov/AdvisoryCommittees/Committe… |
3556 | NaN | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1… | en-US,en;q=0.8 | US | Oklahoma City | jQLtP4 | OK | jQLtP4 | 1.307530e+09 | 1.usa.gov | NaN | bitly | [35.4715, -97.518997] | 0.0 | http://www.facebook.com/l.php?u=http%3A%2F%2F1… | 1.331927e+09 | America/Chicago | http://www.okc.gov/PublicNotificationSystem/Fo… |
3557 | NaN | GoogleMaps/RochesterNY | NaN | US | Provo | mwszkS | UT | mwszkS | 1.308262e+09 | j.mp | NaN | bitly | [40.218102, -111.613297] | 0.0 | http://www.AwareMap.com/ | 1.331927e+09 | America/Denver | http://www.monroecounty.gov/etc/911/rss.php |
3558 | NaN | GoogleProducer | NaN | US | Mountain View | zjtI4X | CA | zjtI4X | 1.327529e+09 | 1.usa.gov | NaN | bitly | [37.419201, -122.057404] | 0.0 | direct | 1.331927e+09 | America/Los_Angeles | http://www.ahrq.gov/qual/qitoolkit/ |
3559 | NaN | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT … | en-US | US | Mc Lean | qxKrTK | VA | qxKrTK | 1.312898e+09 | 1.usa.gov | NaN | bitly | [38.935799, -77.162102] | 0.0 | http://t.co/OEEEvwjU | 1.331927e+09 | America/New_York | http://herndon-va.gov/Content/public_safety/Pu… |
3560 rows × 18 columns
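The big table has many NaN cells because not every record has every key (the _heartbeat_ rows have almost nothing else). A small sketch with three made-up records shows how DataFrame fills in the gaps. The records in demo_records are hypothetical and only loosely modeled on the real data.

```python
import pandas as pd

# Three hypothetical records with different sets of keys.
demo_records = [
    {"tz": "America/New_York", "c": "US"},
    {"tz": "", "c": None},
    {"_heartbeat_": 1331923249},
]

# Each distinct key becomes a column; keys absent from a record become missing values.
demo_frame = pd.DataFrame(demo_records)

print(demo_frame.shape)
print(demo_frame["tz"].isna().tolist())
```

The second record has an empty string for tz, which is not the same as a missing value; only the third record, which has no tz key at all, is reported as missing.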
frame['tz'][:10]
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz, dtype: object
The Series returned by frame['tz'] has a value_counts method that gives us what we need:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Name: tz, dtype: int64
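First, fill in a substitute value for the unknown or missing time zones in the records. The fillna function replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing: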
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz=='']='Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
Name: tz, dtype: int64
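The difference between "missing" and "unknown" confused me at first, so here is a minimal sketch on a four-element Series. The sample values are made up; the two steps are the same fillna and boolean indexing used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one empty string (unknown) and one NaN (missing).
tz = pd.Series(["America/New_York", "", np.nan, "America/New_York"], name="tz")

# Step 1: fillna only touches NaN, so the empty string is left alone.
clean = tz.fillna("Missing")

# Step 2: the boolean mask (clean == "") selects the empty strings for replacement.
clean[clean == ""] = "Unknown"

print(clean.tolist())
print(clean.value_counts().to_dict())
```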
In pandas, a Series can be plotted with its plot method (tz_counts is a Series). Import matplotlib.pyplot first, and call plt.show() at the end to display the figure.
import matplotlib.pyplot as plt
tz_counts[:10].plot(kind='barh',rot=0)
plt.show()
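Python's built-in string functions and regular expressions can be used to parse information out of a string, such as the agent string in the a column: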
results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
dtype: object
results.value_counts()[:8]
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
dtype: int64
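The code above only uses split, so here is a sketch of what a vectorized version and a regular-expression version might look like using the pandas .str accessor. The agent strings are abridged from the output above, and the group names name and version are my own choice, not something from the book.

```python
import pandas as pd

# Agent strings abridged from the data shown earlier; None stands for a missing agent.
agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "GoogleMaps/RochesterNY",
    None,
    "Opera/9.80 (X11; Linux zbov; U; en)",
])

# Vectorized equivalent of the list comprehension: the first whitespace-separated token.
first_token = agents.dropna().str.split().str[0]

# Regex variant: split "name/version" into two named columns.
parts = first_token.str.extract(r"^(?P<name>[^/]+)/(?P<version>\S+)$")

print(first_token.tolist())
print(parts["name"].tolist())
```

Unlike the Series built from a list comprehension, first_token keeps the original row labels, so the row with the missing agent is simply absent from its index.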
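Some agents are missing, so first remove those rows from the data, then label each remaining row by whether its agent string contains 'Windows':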
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')
operating_system[:5]
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'],
dtype='<U11')
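np.where(condition, x, y) picks x where the condition is True and y elsewhere. A minimal sketch on three shortened, made-up agent strings:

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened agent strings.
agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1)",
    "GoogleMaps/RochesterNY",
    "Mozilla/5.0 (Macintosh)",
])

# str.contains gives one True/False per row.
is_windows = agents.str.contains("Windows")

# np.where turns the boolean values into labels, element by element.
labels = np.where(is_windows, "Windows", "Not Windows")

print(is_windows.tolist())
print(labels.tolist())
```

The result is a plain NumPy array rather than a Series, which is why the output above is printed as array([...]).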
Now the data can be grouped by time zone and the new list of operating systems:
by_tz_os = cframe.groupby(['tz',operating_system])
Count each group with size (similar to the value_counts function above), then reshape the counts with unstack:
agg_counts=by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
tz | Not Windows | Windows |
---|---|---|
'' | 245.0 | 276.0 |
Africa/Cairo | 0.0 | 3.0 |
Africa/Casablanca | 0.0 | 1.0 |
Africa/Ceuta | 0.0 | 2.0 |
Africa/Johannesburg | 0.0 | 1.0 |
Africa/Lusaka | 0.0 | 1.0 |
America/Anchorage | 4.0 | 1.0 |
America/Argentina/Buenos_Aires | 1.0 | 0.0 |
America/Argentina/Cordoba | 0.0 | 1.0 |
America/Argentina/Mendoza | 0.0 | 1.0 |
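To see what size, unstack and fillna each contribute, here is a small sketch on four made-up rows. Unlike the code above, which groups by a column name plus a separate array, this sketch keeps the operating system as an ordinary column named os; that column name and the sample data are my assumptions.

```python
import pandas as pd

# Hypothetical miniature data set.
df = pd.DataFrame({
    "tz": ["America/New_York", "America/New_York", "Europe/London", "America/New_York"],
    "os": ["Windows", "Not Windows", "Windows", "Windows"],
})

# size() counts the rows in each (tz, os) group and returns a Series with a two-level index.
sizes = df.groupby(["tz", "os"]).size()

# unstack() moves the inner level (os) into columns; combinations that never
# occurred become NaN, which fillna(0) replaces with 0.
counts = sizes.unstack().fillna(0)

print(counts)
```

Europe/London has no "Not Windows" row in the sample, so that cell would be NaN without the fillna(0) step.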
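Finally, select the most frequent time zones. To do this, build an indirect index array from the row totals in agg_counts:
# used to sort in ascending order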
indexer = agg_counts.sum(1).argsort()
indexer[:10]
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
dtype: int64
Then use take to select rows in that order and keep the last 10:
count_subset = agg_counts.take(indexer)[-10:]
count_subset
Output:
tz | Not Windows | Windows |
---|---|---|
America/Sao_Paulo | 13.0 | 20.0 |
Europe/Madrid | 16.0 | 19.0 |
Pacific/Honolulu | 0.0 | 36.0 |
Asia/Tokyo | 2.0 | 35.0 |
Europe/London | 43.0 | 31.0 |
America/Denver | 132.0 | 59.0 |
America/Los_Angeles | 130.0 | 252.0 |
America/Chicago | 115.0 | 285.0 |
'' | 245.0 | 276.0 |
America/New_York | 339.0 | 912.0 |
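The argsort and take combination took me a while to follow, so here is a sketch on a three-row table with made-up numbers. It also compares the result with sort_values, which is an alternative I tried and not the book's method. I pass indexer.to_numpy() to take to make it explicit that take expects integer positions.

```python
import pandas as pd

# Hypothetical counts for three time zones labelled a, b, c.
counts = pd.DataFrame(
    {"Not Windows": [1.0, 5.0, 0.0], "Windows": [2.0, 1.0, 1.0]},
    index=pd.Index(["a", "b", "c"], name="tz"),
)

totals = counts.sum(axis=1)   # row totals: a=3, b=6, c=1
indexer = totals.argsort()    # positions that would sort the totals in ascending order

# take() reorders the rows by position; the last two rows are the two largest totals.
via_take = counts.take(indexer.to_numpy())[-2:]

# Alternative: sort the totals by value and select the last two labels.
via_sort = counts.loc[totals.sort_values().index[-2:]]

print(indexer.tolist())
print(via_take.index.tolist())
```

The values in indexer are row positions, not counts, which explains the odd-looking numbers (24, 20, 21, ...) in the indexer output above.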
Pass stacked=True to produce a stacked bar chart:
count_subset.plot(kind='barh',stacked=True)
# this line appears in the output; I don't know why
plt.show()
The rows can also be normalized to sum to 1 and plotted again:
normed_subset = count_subset.div(count_subset.sum(1),axis=0)
normed_subset.plot(kind='barh',stacked=True)
# the same line appears again
plt.show()
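To check what the normalization does without plotting, here is a minimal sketch using two rows copied from the count_subset output above. The names subset and normed are my own.

```python
import pandas as pd

# Two rows taken from the count_subset output shown earlier.
subset = pd.DataFrame(
    {"Not Windows": [13.0, 43.0], "Windows": [20.0, 31.0]},
    index=["America/Sao_Paulo", "Europe/London"],
)

# sum(axis=1) gives one total per row; div(..., axis=0) divides each row by its own total.
normed = subset.div(subset.sum(axis=1), axis=0)

print(normed)
print(normed.sum(axis=1).round(6).tolist())
```

Without axis=0, div would try to align the row totals with the column labels instead of the row labels, which is not what we want here.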
Finally:
All the methods used here are explained in detail in later chapters of the book.