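I recently took a training course on data analysis with Python and plan to dig deeper into it. I'm still a beginner, so here is a small exercise to warm up.
Development tool: PyCharm 2016.2
GitHub repository for the full exercise: https://github.com/xinluqishi/pythonTrainingPro
Data for the project: https://www.kaggle.com/osmi/mental-health-in-tech-survey. This is a survey dataset on the mental health of tech workers, in CSV format. Kaggle is a great site whose datasets are well suited to practicing data analysis in Python, and anyone can download them. An excerpt: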
"Timestamp","Age","Gender","Country","state","self_employed","family_history","treatment","work_interfere","no_employees","remote_work","tech_company","benefits","care_options","wellness_program","seek_help","anonymity","leave","mental_health_consequence","phys_health_consequence","coworkers","supervisor","mental_health_interview","phys_health_interview","mental_vs_physical","obs_consequence","comments"
2014-08-27 11:29:31,37,"Female","United States","IL",NA,"No","Yes","Often","6-25","No","Yes","Yes","Not sure","No","Yes","Yes","Somewhat easy","No","No","Some of them","Yes","No","Maybe","Yes","No",NA
2014-08-27 11:29:37,44,"M","United States","IN",NA,"No","No","Rarely","More than 1000","No","No","Don't know","No","Don't know","Don't know","Don't know","Don't know","Maybe","No","No","No","No","No","Don't know","No",NA
2014-08-27 11:29:44,32,"Male","Canada",NA,NA,"No","No","Rarely","6-25","No","Yes","No","No","No","No","Don't know","Somewhat difficult","No","No","Yes","Yes","Yes","Yes","No","No",NA
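Requirement: compute, for each country, the average age of respondents with a mental health issue.
The requirement is simple and clear. The goal is to use Python to clean, group, and aggregate the data.
I did it in two steps:
1. Find the age values in the records that report a mental health issue, and collect the matching ages by country;
2. Divide the sum of those ages by the number of people with a mental health issue.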
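The two steps can be sketched on a few in-memory records before touching the CSV file. This is only an illustration: the function name `average_age_by_country`, the tuple layout, and the sample values are made up for the example and are not taken from the survey.

```python
def average_age_by_country(records):
    """Average the ages of flagged records, grouped by country.

    `records` is an iterable of (country, age, has_issue) tuples.
    This shape is an assumption made for the sketch.
    """
    ages_by_country = {}
    # Step 1: collect the ages of matching records per country.
    for country, age, has_issue in records:
        if has_issue:
            ages_by_country.setdefault(country, []).append(age)
    # Step 2: divide the sum of ages by the number of matching records.
    return {c: sum(ages) / len(ages) for c, ages in ages_by_country.items()}


# Made-up sample rows, not real survey answers.
sample = [
    ("Canada", 32, True),
    ("Canada", 28, True),
    ("United States", 37, True),
    ("United States", 44, False),
]
averages = average_age_by_country(sample)
print(averages)
```

Countries with no matching records never enter the dictionary here, so no division by zero can occur.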
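I also noticed dirty data: one record for Zimbabwe has an age of 999999, so I filtered it out. The final averages are rounded to two decimal places, because averaging otherwise produces results with many decimal places. These two steps belong to up-front data cleaning and final result formatting, respectively.
Since I'm just starting out, I used the simplest, most direct Python approach, such as filtering the data in a loop. Later I'll write simpler versions that use existing Python data-processing packages.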
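As a rough idea of what a shorter version might look like using only the standard library, the sketch below reads columns by name with `csv.DictReader` and averages with `statistics.mean`. The function name `mean_age_by_country` and the four sample rows are invented for the illustration; the column names come from the dataset header shown above.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample rows that reuse three real column names from the header.
SAMPLE_CSV = '''"Age","Country","mental_health_consequence"
37,"United States","No"
44,"United States","Yes"
32,"Canada","Yes"
30,"Canada","Yes"
'''


def mean_age_by_country(csvfile, flag_column="mental_health_consequence"):
    """Return {country: mean age} for rows whose flag column equals 'Yes'."""
    ages = defaultdict(list)
    for row in csv.DictReader(csvfile):
        if row[flag_column].strip() == "Yes":
            ages[row["Country"]].append(int(row["Age"]))
    # Round to two decimal places, as in the requirement.
    return {country: round(mean(values), 2) for country, values in ages.items()}


result = mean_age_by_country(io.StringIO(SAMPLE_CSV))
print(result)
```

Reading by column name avoids having to count positions in a 27-column header, at the cost of a dictionary lookup per field.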
Here is my code; the comments record my thinking:
# -*- coding: utf-8 -*-
"""
Author: kevin shi
Version: 1.0
Date: 2017/02/18
Project: Mental Health in Tech Survey (data analysis of tech workers' mental health)
"""
import csv

# Path to the dataset
data_path = './survey.csv'


def run_main():
    mental_health_set = {'Yes'}  # values that count as "has a mental health issue"
    result_dict = {}  # final results, keyed by country
    with open(data_path, 'r', newline='') as csvfile:
        # Load the data
        rows = csv.reader(csvfile)
        for i, row in enumerate(rows):
            if i == 0:
                # Skip the header row
                continue
            if i % 50 == 0:
                print('Processing row {}...'.format(i))
            age_val = row[1]  # age
            country_val = row[3]  # country
            # Column 18 is mental_health_consequence; it is used here as the mental health flag
            mental_health_val = row[18]
            # sum([1, 2, 3]) could add up a collected list; a running total is used here instead
            # Remove any spaces
            age_val = age_val.replace(' ', '')
            mental_health_val = mental_health_val.replace(' ', '')
            # Check whether the country has been seen before
            if country_val not in result_dict:
                # If not, initialize its entry
                # result_dict[country_val] = []  # earlier idea: store every matching age
                # [sum of matching ages, number of matching records, average age]
                result_dict[country_val] = [0, 0, 0]
            # Has a mental health issue; filter out implausible ages,
            # e.g. Zimbabwe with age 999999 on row 392
            if mental_health_val in mental_health_set and (len(age_val) <= 3):
                # Earlier idea: collect every matching age
                # result_dict[country_val].append(age_val)
                # Accumulate the age total and the record count
                result_dict[country_val][0] += int(age_val)
                result_dict[country_val][1] += 1
            else:
                # Noise or non-matching record: ignore
                pass
    # Write the results to a file
    with open('mental_country1.csv', 'w', newline='', encoding='utf-16') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',')
        # Header row: "country", "average age of respondents with mental health issues"
        # csvwriter.writerow(['国家', '存在心理问题的年龄列表'])  # earlier header, for the age-list output
        csvwriter.writerow(['国家', '存在心理问题的平均年龄'])
        # Write the statistics
        for k, v in list(result_dict.items()):
            # Earlier version, which wrote the age lists:
            # if len(result_dict[k]) == 0:
            #     csvwriter.writerow([k, 0])
            # else:
            #     csvwriter.writerow([k, v])
            # Countries with no matching records get 0, which also avoids dividing by zero
            if v[1] == 0:
                v[2] = 0
            else:
                v[2] = round(v[0] / v[1], 2)  # keep at most two decimal places
            csvwriter.writerow([k, v[2]])


if __name__ == '__main__':
    run_main()
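Here is the resulting output (the header row reads "country, average age of respondents with mental health issues"):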
国家,存在心理问题的平均年龄
Norway,0
Moldova,0
Finland,27.0
Australia,31.5
Romania,0
Hungary,27.0
Austria,0
Costa Rica,0
Canada,29.88
Brazil,0
India,24.0
Philippines,31.0
Slovenia,19.0
Belgium,30.0
Croatia,43.0
South Africa,61.0
Poland,0
Colombia,26.0
Ireland,35.27
Russia,28.0
Spain,30.0
Latvia,0
Uruguay,0
Netherlands,33.0
Israel,27.0
Czech Republic,0
Italy,37.0
Zimbabwe,0
Denmark,0
Greece,36.5
Singapore,39.0
France,26.0
Sweden,0
United States,33.38
Mexico,0
United Kingdom,31.57
Bulgaria,26.0
Georgia,20.0
Germany,32.0
Thailand,0
New Zealand,36.75
Nigeria,0
Switzerland,30.0
Bosnia and Herzegovina,0
China,0
Portugal,27.0
"Bahamas, The",8.0
Japan,49.0
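The "Bahamas, The" row averages 8.0, which suggests that the length check `len(age_val) <= 3` still lets some implausible ages through. One possible alternative is a numeric range check. The sketch below is an assumption-laden illustration: the helper name `parse_age` and the bounds 15 and 100 are my own choices, not values from the dataset or its documentation.

```python
def parse_age(raw, low=15, high=100):
    """Return the age as an int if it lies in [low, high], otherwise None.

    The default bounds are hypothetical; adjust them to whatever range
    makes sense for the survey population.
    """
    try:
        age = int(raw.strip())
    except ValueError:
        # Non-numeric values such as "NA" are treated as missing.
        return None
    return age if low <= age <= high else None


# A few illustrative inputs, including values that a length check would accept.
for raw in ["37", " 44 ", "8", "-29", "999999", "NA"]:
    print(repr(raw), "->", parse_age(raw))
```

Records for which the helper returns `None` would then be skipped in the accumulation loop, just as the oversized ages are skipped now.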
That's all. If you see anything that could be improved, I'd welcome your suggestions.