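Enron Dataset Analysis
The downloaded Enron dataset (the E+F dataset) is stored as a dictionary. This post records how I processed it, and it doubles as a set of notes on handling dictionary data in Python that is easy to look up and share.
The dataset comes from Udacity. The code below was run under Python 3.6.
This article uses an analysis of the Enron dataset to show how to get started when you face a new dataset.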
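The Enron scandal led to the largest corporate bankruptcy in history up to that point. In 2000, Enron was one of the largest energy companies in the United States, yet after its fraud was exposed it went bankrupt within a year.
We chose the Enron dataset for a machine learning project because the Enron email database is already available: it contains about 500,000 emails exchanged among roughly 150 former Enron employees, mostly senior executives. It is also the only large, public database of real email.
If you are interested, the documentary "Enron: The Smartest Guys in the Room" is a classic and a sobering watch.
You can also read an article about the Enron scandal.
The dataset comes from Udacity's GitHub repository: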
git clone https://github.com/udacity/ud120-projects.git
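After cloning, go to the tools/ directory and run startup.py. The script first checks your Python modules (whether packages such as numpy and scikit-learn are installed correctly), then downloads and unpacks the large dataset that will be used heavily later on. Downloading and unpacking take a while, but you do not need to wait for them to finish before starting the first part.
Dataset download URL: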
http://zoo.cs.yale.edu/classes/cs458/lectures/sklearn/ud/ud120-projects-master/enron_mail_20150507.tgz
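The downloaded dataset file was generated under Python 2.x, while this machine runs Python 3.6, and reading it directly raises an error: _pickle.UnpicklingError: the STRING opcode argument must be quoted
The cause is a difference in line endings between operating systems: Unix uses "\n", while DOS uses "\r\n".
The fix is to write a new file with Unix line endings.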
original = "final_project_dataset.pkl"
destination = "final_project_dataset_unix.pkl"

content = ''
outsize = 0
# Read the whole pickle file as raw bytes.
with open(original, 'rb') as infile:
    content = infile.read()
# Write it back line by line, ending every line with a Unix "\n".
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))
import pickle
# Load the converted file; the with-block closes the file handle afterwards.
with open("../final_project/final_project_dataset_unix.pkl", "rb") as f:
    enron_data = pickle.load(f)
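As a possible alternative to writing a second file, the line endings could also be normalized in memory right before unpickling. The sketch below only illustrates that idea: `loads_with_unix_newlines` is a hypothetical helper name, not part of the Udacity project, and the sample dictionary is made up so that the example runs on its own.

```python
import pickle


def loads_with_unix_newlines(raw):
    """Unpickle bytes after converting DOS (CRLF) line endings to Unix (LF)."""
    return pickle.loads(raw.replace(b"\r\n", b"\n"))


# Made-up sample: a text-protocol pickle whose newlines were turned into CRLF,
# imitating a file that went through a DOS-style line-ending conversion.
sample = {"METTS MARK": {"salary": 365788, "poi": False}}
dos_bytes = pickle.dumps(sample, protocol=0).replace(b"\n", b"\r\n")

restored = loads_with_unix_newlines(dos_bytes)
print(restored == sample)
```

For the real file, the bytes would come from reading final_project_dataset.pkl in 'rb' mode instead of from pickle.dumps.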
The data is stored as a dictionary with one inner dictionary per person. Picking one entry at random shows the format:
print(enron_data['METTS MARK'])
{'salary': 365788, 'to_messages': 807, 'deferral_payments': 'NaN', 'total_payments': 1061827, 'loan_advances': 'NaN', 'bonus': 600000, 'email_address': '[email protected]', 'restricted_stock_deferred': 'NaN', 'deferred_income': 'NaN', 'total_stock_value': 585062, 'expenses': 94299, 'from_poi_to_this_person': 38, 'exercised_stock_options': 'NaN', 'from_messages': 29, 'other': 1740, 'from_this_person_to_poi': 1, 'poi': False, 'long_term_incentive': 'NaN', 'shared_receipt_with_poi': 702, 'restricted_stock': 585062, 'director_fees': 'NaN'}
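Note that missing values in a record are stored as the string 'NaN', not as a float NaN or None. A minimal sketch of listing the missing fields of one record follows; `missing_fields` is a hypothetical helper, and the record is a shortened copy of the output above.

```python
def missing_fields(record):
    """Return the names of the fields whose value is the string 'NaN'."""
    return [name for name, value in record.items() if value == "NaN"]


# Shortened copy of the METTS MARK record shown above.
record = {
    "salary": 365788,
    "deferral_payments": "NaN",
    "bonus": 600000,
    "loan_advances": "NaN",
    "poi": False,
}

print(missing_fields(record))  # ['deferral_payments', 'loan_advances']
```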
# How many people are in the dataset?
print("number of people in the Enron data:", len(enron_data))
# number of people in the Enron data: 146

# How many features are recorded for each person?
numFeature = 0
for i in enron_data['METTS MARK']:
    numFeature += 1
print("number of features per person:", numFeature)
# number of features per person: 21
# How many POIs (persons of interest) are in the E+F dataset?
numPOI = 0
for i in enron_data.keys():
    if enron_data[i]['poi'] == 1:
        numPOI += 1
print("number of POIs in the E+F dataset:", numPOI)
# number of POIs in the E+F dataset: 18
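The counting loops above can also be written with built-ins: len() returns the number of keys directly, and sum() over a generator counts the entries that match a condition. A small sketch on a made-up dictionary (the names and values are placeholders, not real records):

```python
# Placeholder data with the same nested-dictionary shape as the dataset.
toy_data = {
    "PERSON A": {"salary": 100, "poi": True},
    "PERSON B": {"salary": "NaN", "poi": False},
    "PERSON C": {"salary": 300, "poi": True},
}

num_people = len(toy_data)                          # number of outer keys
num_features = len(next(iter(toy_data.values())))   # keys of one inner dict
num_poi = sum(1 for person in toy_data.values() if person["poi"])

print(num_people, num_features, num_poi)  # 3 2 2
```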
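We compiled a list of all POI names (in …/final_project/poi_names.txt) together with their email addresses (in …/final_project/poi_email_addresses.py).
How many POIs are there in total? (Use the name list, not the email addresses: many employees have more than one address, and a few of the people are not Enron employees, so we do not have their email addresses.)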
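# Read the name list; the path is assumed to match the dataset location used above.
with open("../final_project/poi_names.txt") as f:
    context = f.readlines()
countPOI = 0
for line in context:
    if line.startswith('(y)') or line.startswith('(n)'):
        countPOI += 1
print("total number of POIs:", countPOI)
# total number of POIs: 35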
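The count above relies only on each name line starting with a "(y)" or "(n)" marker. The sketch below shows the same filter on a few made-up lines so it can be tried without the file; `count_poi_names` is a hypothetical helper, and the sample lines are placeholders that assume nothing about the file beyond those markers.

```python
def count_poi_names(lines):
    """Count the lines that start with a '(y)' or '(n)' marker."""
    # str.startswith accepts a tuple of prefixes.
    return sum(1 for line in lines if line.startswith(("(y)", "(n)")))


# Placeholder lines: a header, a blank line, and two marked names.
sample_lines = [
    "some header line that is not a name\n",
    "\n",
    "(y) Lastname, Firstname\n",
    "(n) Other, Person\n",
]

print(count_poi_names(sample_lines))  # 2
```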
print(enron_data["PRENTICE JAMES"]['total_stock_value'])
# 1095040
keys() lists all the fields recorded under his name:
print (enron_data["PRENTICE JAMES"].keys())
dict_keys(['salary', 'to_messages', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'email_address', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person', 'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees'])
print (enron_data["COLWELL WESLEY"]['from_this_person_to_poi'])
# 11
print(enron_data["SKILLING JEFFREY K"]['exercised_stock_options'])
# 19250000
Enron's CEO: Jeffrey Skilling
Enron's chairman of the board: Kenneth Lay
Enron's CFO: Andrew Fastow
Of these three, who took home the most money?
print(enron_data["SKILLING JEFFREY K"]['total_payments'])
print(enron_data["LAY KENNETH L"]['total_payments'])
print(enron_data["FASTOW ANDREW S"]['total_payments'])

# Track the largest total_payments among the three and who received it.
maxPay = 0
maxpay_person = ''
s1 = enron_data["SKILLING JEFFREY K"]['total_payments']
if s1 > maxPay:
    maxPay = s1
    maxpay_person = 'SKILLING JEFFREY K'
s2 = enron_data["LAY KENNETH L"]['total_payments']
if s2 > maxPay:
    maxPay = s2
    maxpay_person = 'LAY KENNETH L'
s3 = enron_data["FASTOW ANDREW S"]['total_payments']
if s3 > maxPay:
    maxPay = s3
    maxpay_person = 'FASTOW ANDREW S'
print(maxpay_person, maxPay)
# LAY KENNETH L 103559793
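The three-way comparison above can be expressed more compactly with max() and a key function. The sketch below is one way to do it, under the assumption that entries stored as 'NaN' should simply be skipped; `highest_paid` is a hypothetical helper, and the payment figures are toy numbers, not the real values.

```python
def highest_paid(data, names, feature="total_payments"):
    """Return (name, value) for the largest numeric value of `feature`.

    Entries stored as the string 'NaN' are skipped. max() raises
    ValueError if no candidate is left.
    """
    candidates = [
        (name, data[name][feature])
        for name in names
        if data[name][feature] != "NaN"
    ]
    return max(candidates, key=lambda pair: pair[1])


# Toy numbers for illustration only.
toy_data = {
    "SKILLING JEFFREY K": {"total_payments": 300},
    "LAY KENNETH L": {"total_payments": 900},
    "FASTOW ANDREW S": {"total_payments": "NaN"},
}

print(highest_paid(toy_data, list(toy_data)))  # ('LAY KENNETH L', 900)
```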
# Count the people with a known salary and with a known email address.
num_salary = 0
num_email = 0
for i in enron_data.keys():
    if enron_data[i]['salary'] != 'NaN':
        num_salary += 1
    if enron_data[i]['email_address'] != 'NaN':
        num_email += 1
print(num_salary, num_email)
# 95 111
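Since the same "not 'NaN'" check is repeated for each feature, it could be factored into a small function. A sketch, where `count_present` is a hypothetical helper and the data is made up:

```python
def count_present(data, feature):
    """Count the people whose `feature` is not the string 'NaN'."""
    return sum(1 for person in data.values() if person[feature] != "NaN")


# Made-up records with the same shape as the dataset.
toy_data = {
    "PERSON A": {"salary": 100, "email_address": "a@example.com"},
    "PERSON B": {"salary": "NaN", "email_address": "b@example.com"},
    "PERSON C": {"salary": 300, "email_address": "NaN"},
    "PERSON D": {"salary": "NaN", "email_address": "d@example.com"},
}

print(count_present(toy_data, "salary"), count_present(toy_data, "email_address"))  # 2 3
```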
import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """

    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print ("error: key ", feature, " not present")
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)
def targetFeatureSplit( data ):
    """
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features
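To make the behavior of these two functions easier to follow, here is a simplified pure-Python sketch of what they do with their default arguments: 'NaN' becomes 0.0, rows whose features (other than a leading 'poi') are all zero are dropped, and the first column is split off as the target. `to_rows` and `split_target` are hypothetical stand-ins that return plain lists instead of numpy arrays, and the data is made up.

```python
def to_rows(data, features):
    """Simplified stand-in for featureFormat with its default arguments."""
    rows = []
    for person in data.values():
        # Replace the 'NaN' string with 0.0 and cast everything to float.
        row = [0.0 if person[f] == "NaN" else float(person[f]) for f in features]
        # A leading 'poi' column is not part of the all-zero check.
        checked = row[1:] if features[0] == "poi" else row
        if any(value != 0 for value in checked):
            rows.append(row)
    return rows


def split_target(rows):
    """Same idea as targetFeatureSplit: the first column is the target."""
    targets = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return targets, features


# Made-up records; PERSON B is dropped because all checked features are zero.
toy_data = {
    "PERSON A": {"poi": True, "salary": 100, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"poi": False, "salary": 300, "bonus": 50},
}

rows = to_rows(toy_data, ["poi", "salary", "bonus"])
targets, features = split_target(rows)
print(targets)   # [1.0, 0.0]
print(features)  # [[100.0, 0.0], [300.0, 50.0]]
```

One consequence worth keeping in mind: because 'NaN' is turned into 0.0, a missing value and a genuine zero are indistinguishable after the conversion.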
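# Count the people whose total_payments is missing ('NaN') and their share of the dataset.
num_total_pay = 0
for i in enron_data.keys():
    if enron_data[i]['total_payments'] == 'NaN':
        num_total_pay += 1
proportion = num_total_pay/float(len(enron_data))
print(num_total_pay, proportion)
# 21 0.14 (proportion rounded to two decimals)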
# The same check restricted to POIs.
numPOI = 0
num_POI_total_pay = 0
for i in enron_data.keys():
    if enron_data[i]['poi'] == 1:
        numPOI += 1
        if enron_data[i]['total_payments'] == 'NaN':
            num_POI_total_pay += 1
proportion_poi = num_POI_total_pay/float(numPOI)
print(num_POI_total_pay, numPOI, proportion_poi)
# 0 18 0.0
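The two missing-value checks above differ only in whether they look at everyone or at POIs alone, so they could share one function. A sketch under that assumption, with `nan_fraction` as a hypothetical helper and made-up data:

```python
def nan_fraction(data, feature, poi_only=False):
    """Return (missing, total, missing / total) for `feature`.

    With poi_only=True only the records flagged as POI are considered.
    Raises ZeroDivisionError if no record is selected.
    """
    people = [p for p in data.values() if p["poi"] or not poi_only]
    missing = sum(1 for p in people if p[feature] == "NaN")
    return missing, len(people), missing / len(people)


# Made-up records: two POIs, two non-POIs, one missing total_payments.
toy_data = {
    "PERSON A": {"poi": True, "total_payments": 500},
    "PERSON B": {"poi": True, "total_payments": 800},
    "PERSON C": {"poi": False, "total_payments": "NaN"},
    "PERSON D": {"poi": False, "total_payments": 200},
}

print(nan_fraction(toy_data, "total_payments"))                 # (1, 4, 0.25)
print(nan_fraction(toy_data, "total_payments", poi_only=True))  # (0, 2, 0.0)
```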