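Original article, feel free to share! http://my.oschina.net/u/2306127/blog/613875
Air pollution has been severe lately, and I also wanted to practice what I have learned about writing Orange plugins and processing data, so I set out to build a plugin that fetches and analyzes AQI data. The current prototype is shown below, and it looks rather cool, doesn't it? [Once it is more polished I will share the source code; for now I would rather not mislead anyone with it. Feel free to get in touch if you are interested.]
So far the plugin can fetch AQI data from the web for a specified region and convert it into an Orange.data.Table, a pandas.DataFrame, and a geopandas.GeoDataFrame. With GeoDataFrame.to_file(fname) the data can be written to a shp file, which can then be opened in all kinds of GIS software for further analysis and mapping. I opened the result in QGIS without any problems.
Below are the problems I ran into and how I handled them. Some issues remain unresolved; perhaps an expert out there can solve them:
1. Fetching AQI data from the web
The data source is http://aqicn.org. I use the requests library to fetch the data; it is powerful and, in particular, lets you set custom headers. Without custom headers, the site's anti-scraping measures return only stale data, so the latest readings cannot be obtained. The code is as follows: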
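#Get AQI data from web, by a region.
import json
import re
import requests

def getaqidata(left, right, bottom, top):
    # geturl() and gethead() are helper functions defined elsewhere in the plugin.
    aqi_url = geturl(left, right, bottom, top)
    aqi = requests.get(aqi_url, headers=gethead())
    raqi = aqi.text
    # The JSON list is wrapped in other text; keep only the [{...}] part.
    raqi2 = re.search(r'\[\{.*\}\]', raqi)
    cities = json.loads(raqi2.group(0))
    return cities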
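To find the exact headers, open Firefox's Developer Tools, select the Network tab, and click the relevant data request in the request list to see all of its details. Then choose "Raw headers" and copy the request headers into the gethead() function, which returns them as a dict. Then call: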
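As a rough illustration of what such a gethead() function might look like, here is a minimal sketch. The header names are standard HTTP request headers, but every value below is a placeholder assumption rather than the author's real header set; copy the actual values from your own browser session.

```python
def gethead():
    """Return the request headers as a dict.

    All values here are placeholders (assumptions for illustration);
    replace them with the raw headers copied from the browser.
    """
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:44.0) "
                      "Gecko/20100101 Firefox/44.0",
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "http://aqicn.org/",
    }


# requests accepts this dict through its `headers` keyword argument.
headers = gethead()
```

Keeping the headers in one function makes it easy to update them if the site starts returning stale data again.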
aqi = requests.get(aqi_url,headers=gethead())
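The response body is a JSON string with some extra text wrapped around it, so a regular expression is used to extract the data, which is then loaded into cities. The raw response looks like this: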
mapShowLevel2Makers([{"lat":"38.871","lon":"115.521","aqi":"112", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"City Monitoring Station, Baoding", "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA", "pol":"pm25","tz":"+0800","idx":781,"x":668}, {"lat":"38.896","lon":"115.522","aqi":"93", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Huadian II, Baoding", "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6", "pol":"pm25","tz":"+0800","idx":783,"x":670}, ... {"lat":"40.152","lon":"118.311","aqi":"48", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Qianxi EPA, Tangshan", "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop", "pol":"pm25","tz":"+0800","idx":823,"x":4640}], [7.8,0]);
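The extraction step can be tried offline on a shortened copy of the response above. This is only a sketch: the sample string is abridged (several fields dropped), extract_cities is a hypothetical helper name, and the guard for a missing match is an addition that is not in the original code.

```python
import json
import re


def extract_cities(raw_text):
    """Pull the [{...}] JSON list out of the wrapped response text."""
    # Greedy match: from the first "[{" to the last "}]" in the text.
    match = re.search(r'\[\{.*\}\]', raw_text, re.S)
    if match is None:
        # Assumption: treat an unexpected response as "no stations".
        return []
    return json.loads(match.group(0))


# Abridged sample of the response, for illustration only.
sample = (
    'mapShowLevel2Makers([{"lat":"38.871","lon":"115.521","aqi":"112",'
    '"city":"City Monitoring Station, Baoding","pol":"pm25"},'
    '{"lat":"38.896","lon":"115.522","aqi":"93",'
    '"city":"Huadian II, Baoding","pol":"pm25"}], [7.8,0]);'
)

cities = extract_cities(sample)
```

Because the trailing `[7.8,0]` does not end in `}]`, the greedy pattern stops at the end of the station list and leaves the wrapper out.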
2. Parsing the AQI data
The extracted cities content looks like this:
[{"lat":"38.871","lon":"115.521","aqi":"112", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"City Monitoring Station, Baoding", "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA", "pol":"pm25","tz":"+0800","idx":781,"x":668}, {"lat":"38.896","lon":"115.522","aqi":"93", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Huadian II, Baoding", "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6", "pol":"pm25","tz":"+0800","idx":783,"x":670}, ... {"lat":"40.152","lon":"118.311","aqi":"48", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Qianxi EPA, Tangshan", "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop", "pol":"pm25","tz":"+0800","idx":823,"x":4640}]
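cities is a standard Python list containing one dict per monitoring station, and each dict holds a number of key-value pairs.
cities can be accessed with ordinary JSON-style operations or as a regular Python list.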
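For instance, ordinary list and dict operations are enough to look up a station or find the highest reading. The snippet below is a small sketch using three abridged records from the sample above; note that the numeric fields arrive as strings and need converting before comparison.

```python
# Abridged records from the sample output (only four keys kept).
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]

# Index into the list, then look up a key in the dict.
first_city = cities[0]["city"]

# "aqi" is a string in the raw data, so convert before doing arithmetic.
aqi_values = [int(c["aqi"]) for c in cities]

# Station with the highest AQI reading.
worst = max(cities, key=lambda c: int(c["aqi"]))
```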
3. Converting to pandas.DataFrame
pandas offers a very rich set of data manipulation functions, and it can convert the cities structure above directly into a pandas.DataFrame.
import pandas as pd
df = pd.DataFrame(cities)
You can also use pandas.DataFrame.to_csv() to save the data to a CSV file, or write it directly to an Excel spreadsheet, and then... you can do a lot with it.
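A minimal sketch of the conversion and export, using abridged sample records. The numeric conversion step is an addition not mentioned in the original; it is included because lat, lon and aqi arrive as strings.

```python
import pandas as pd

# Abridged records from the sample output.
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]

# A list of dicts becomes one row per dict, one column per key.
df = pd.DataFrame(cities)

# Convert the string columns to numbers; unparseable values become NaN.
for col in ("lat", "lon", "aqi"):
    df[col] = pd.to_numeric(df[col], errors="coerce")

# With no path argument, to_csv() returns the CSV text as a string;
# pass a file name instead to write it to disk.
csv_text = df.to_csv(index=False)
```

Once the columns are numeric, the usual pandas tools such as `df["aqi"].mean()` or `df.sort_values("aqi")` apply directly.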
4. Converting to geopandas.GeoDataFrame
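The post does not show this step, so the following is only one possible sketch: it builds point geometries from the lon and lat columns with geopandas.points_from_xy. The coordinate reference system EPSG:4326 (WGS 84) is an assumption; the source does not state which CRS the coordinates use.

```python
import geopandas as gpd
import pandas as pd

# Abridged records from the sample output.
df = pd.DataFrame([
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
])

# The coordinates arrive as strings, so make them numeric first.
for col in ("lat", "lon", "aqi"):
    df[col] = pd.to_numeric(df[col])

# x is longitude, y is latitude. EPSG:4326 is an assumed CRS.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)
```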
5. Saving the AQI data as a shp file
6. Converting to Orange.data.Table