

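  • Introduction to the Enron dataset
  • Getting the dataset
  • Preprocessing the dataset
  • Loading the dataset
  • Number of people in the Enron dataset
  • How many features does each person have?
  • How many POIs are in the E+F dataset?
  • How many POIs are there in total?
  • What is the total value of the stock belonging to James Prentice?
  • How many emails do we have from Wesley Colwell to persons of interest?
  • What is the value of the stock options exercised by Jeffrey Skilling?
  • Who took home the most money?
  • How many people in this dataset have a quantified salary? What about a known email address?
  • Converting a dictionary to an array
  • How many people in the (current) E+F dataset have "NaN" for their total payments? What share of the dataset is that?
  • How many POIs in the E+F dataset have "NaN" for their total payments? What share of the POIs is that?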



The dataset comes from Udacity. The code below was run under Python 3.6.








git clone https://github.com/udacity/ud120-projects.git

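After cloning, go into the tools/ directory and run startup.py. The script first checks your Python modules (that is, whether packages such as numpy and scikit-learn are installed correctly), then downloads and unpacks a large dataset that is used heavily later on. Downloading and unpacking take a while, but you do not need to wait for them to finish before starting the first part.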
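If you want to check the required packages yourself before running the script, something like the following works. This is only a sketch of the idea, not what startup.py actually does; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The result depends on what is installed locally.
print(missing_modules(["numpy", "sklearn"]))
```

An empty list means both packages can be imported.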



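The downloaded dataset was generated under Python 2.x and this machine runs Python 3.6, so reading it directly raises: _pickle.UnpicklingError: the STRING opcode argument must be quoted

The error comes from a line-ending mismatch between operating systems: Unix ends lines with "\n", DOS with "\r\n". Rewriting the file with Unix line endings fixes it: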

original = "final_project_dataset.pkl"
destination = "final_project_dataset_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))
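To see why the conversion is safe, the round trip can be reproduced in memory. The sketch below pickles a small made-up record with the text-based protocol 0, simulates a DOS-style transfer by turning every "\n" into "\r\n", and then undoes it. The record and the helper name `to_unix_newlines` are assumptions for illustration only. Note that it uses a plain byte replacement, which is slightly stricter than the `splitlines()` approach above.

```python
import pickle

def to_unix_newlines(raw: bytes) -> bytes:
    """Replace DOS line endings (CRLF) with Unix line endings (LF)."""
    return raw.replace(b"\r\n", b"\n")

# Made-up record, pickled with the text-based protocol 0.
record = {"salary": 365788, "poi": False}
unix_bytes = pickle.dumps(record, protocol=0)

# Simulate what a DOS-style conversion does to the file contents.
dos_bytes = unix_bytes.replace(b"\n", b"\r\n")

# Undo the damage and check that the record survives unchanged.
restored = to_unix_newlines(dos_bytes)
print(pickle.loads(restored) == record)
```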


import pickle

enron_data = pickle.load(open("../final_project/final_project_dataset_unix.pkl", "rb"))


print(enron_data['METTS MARK'])
{'salary': 365788, 'to_messages': 807, 'deferral_payments': 'NaN', 'total_payments': 1061827, 'loan_advances': 'NaN', 'bonus': 600000, 'email_address': '[email protected]', 'restricted_stock_deferred': 'NaN', 'deferred_income': 'NaN', 'total_stock_value': 585062, 'expenses': 94299, 'from_poi_to_this_person': 38, 'exercised_stock_options': 'NaN', 'from_messages': 29, 'other': 1740, 'from_this_person_to_poi': 1, 'poi': False, 'long_term_incentive': 'NaN', 'shared_receipt_with_poi': 702, 'restricted_stock': 585062, 'director_fees': 'NaN'}
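Missing values in a record are stored as the string 'NaN', not as a float NaN, so they need explicit handling before any arithmetic. One possible way to normalize a record is sketched below; the helper `clean_record` and the sample record are made up for illustration.

```python
def clean_record(record):
    """Return a copy of one person's record with the string 'NaN' replaced by None."""
    return {key: (None if value == "NaN" else value) for key, value in record.items()}

# Made-up record in the same shape as the dataset entries.
sample = {"salary": 365788, "bonus": "NaN", "poi": False}
cleaned = clean_record(sample)
print(cleaned)
```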


print("number of people in the Enron dataset: ", len(enron_data))
# number of people in the Enron dataset:  146


# Each person's record is a dict, so its length is the number of features.
numFeature = len(enron_data['METTS MARK'])

print("number of features per person: ", numFeature)

# number of features per person:  21

How many POIs are in the E+F dataset?

numPOI = 0

for i in  enron_data.keys():
    if enron_data[i]['poi'] == 1:
        numPOI += 1
print("number of the E+F POI is: ",numPOI)

# number of the E+F POI is:  18
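The same count can be written more compactly with a generator expression. A minimal sketch on a made-up three-person sample (the names and the helper `count_poi` are hypothetical):

```python
def count_poi(data):
    """Count the records whose 'poi' flag is truthy."""
    return sum(1 for person in data.values() if person["poi"])

# Made-up sample in the same shape as enron_data.
sample = {
    "DOE JANE": {"poi": True},
    "ROE RICHARD": {"poi": False},
    "POE SAM": {"poi": True},
}
print(count_poi(sample))  # → 2
```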

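How many POIs are there in total?

We compiled a list of all POI names (in …/final_project/poi_names.txt) and associated email addresses (in …/final_project/poi_email_addresses.py).
How many POIs are there in total? (Use the names list, not the email addresses: many employees have more than one address, and a few of the people are not Enron employees, so we do not have their email addresses.)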

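# Read the names list; the path is assumed to match the dataset location used above.
with open("../final_project/poi_names.txt") as f:
    context = f.readlines()

countPOI = 0
for line in context:
    if line.startswith('(y)') or line.startswith('(n)'):
        countPOI += 1

print("sum of the POI is: ",countPOI)

# sum of the POI is:  35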
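If you also want the two markers counted separately, the same scan can be done with a Counter. The sketch below runs on made-up lines rather than the real file; the names and the helper `count_flags` are hypothetical.

```python
from collections import Counter

def count_flags(lines):
    """Count the lines that start with '(y)' or '(n)', separately per marker."""
    counts = Counter()
    for line in lines:
        flag = line[:3]
        if flag in ("(y)", "(n)"):
            counts[flag] += 1
    return counts

# Made-up lines: a header, a blank line, then one name per line.
sample_lines = [
    "some header line",
    "",
    "(y) Doe, Jane",
    "(n) Roe, Richard",
    "(y) Poe, Sam",
]
counts = count_flags(sample_lines)
print(counts["(y)"], counts["(n)"], sum(counts.values()))  # → 2 1 3
```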

What is the total value of the stock belonging to James Prentice?

print (enron_data["PRENTICE JAMES"]['total_stock_value'])


print (enron_data["PRENTICE JAMES"].keys())

dict_keys(['salary', 'to_messages', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'email_address', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person', 'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees'])

How many emails do we have from Wesley Colwell to persons of interest?

print (enron_data["COLWELL WESLEY"]['from_this_person_to_poi'])
# 11

What is the value of the stock options exercised by Jeffrey Skilling?

print (enron_data["SKILLING JEFFREY K"]['exercised_stock_options'])


Enron's CEO: Jeffrey Skilling
Enron's chairman of the board: Kenneth Lay
Enron's CFO: Andrew Fastow

print (enron_data["SKILLING JEFFREY K"]['total_payments'])
print (enron_data["LAY KENNETH L"]['total_payments'])
print (enron_data["FASTOW ANDREW S"]['total_payments'])

# Find which of the three has the largest total_payments.
maxPay = 0
maxpay_person = ''
for name in ["SKILLING JEFFREY K", "LAY KENNETH L", "FASTOW ANDREW S"]:
    pay = enron_data[name]['total_payments']
    if pay > maxPay:
        maxPay = pay
        maxpay_person = name

print(maxpay_person, maxPay)

# LAY KENNETH L 103559793
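One caveat when extending this comparison to other people: in Python 3, comparing the string 'NaN' with a number raises a TypeError, so missing values have to be filtered out first. A possible sketch using `max` with a key function follows; the helper `top_earner` and the sample data are made up for illustration.

```python
def top_earner(data, names, feature="total_payments"):
    """Return (name, value) for the largest known value of `feature`, or None."""
    known = [(n, data[n][feature]) for n in names if data[n][feature] != "NaN"]
    return max(known, key=lambda pair: pair[1], default=None)

# Made-up sample; one person has a missing value.
sample = {
    "DOE JANE": {"total_payments": 1200},
    "ROE RICHARD": {"total_payments": "NaN"},
    "POE SAM": {"total_payments": 3400},
}
best = top_earner(sample, ["DOE JANE", "ROE RICHARD", "POE SAM"])
print(best)  # → ('POE SAM', 3400)
```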


num_salary = 0
num_email = 0
for i in  enron_data.keys():
    if enron_data[i]['salary'] != 'NaN':
        num_salary += 1
    if enron_data[i]['email_address'] != 'NaN':
        num_email +=1

print(num_salary, num_email)
# 95 111
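The same check works for any feature, so it can be wrapped in a small helper. A sketch on made-up data (the helper `count_known`, the names and the example address are hypothetical):

```python
def count_known(data, feature):
    """Count the people whose value for `feature` is not the string 'NaN'."""
    return sum(1 for person in data.values() if person[feature] != "NaN")

# Made-up sample in the same shape as enron_data.
sample = {
    "DOE JANE": {"salary": 1000, "email_address": "NaN"},
    "ROE RICHARD": {"salary": "NaN", "email_address": "NaN"},
    "POE SAM": {"salary": 2000, "email_address": "sam.poe@example.com"},
}
print(count_known(sample, "salary"), count_known(sample, "email_address"))  # → 2 1
```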


import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """

    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                value = dictionary[key][feature]
            except KeyError:
                print ("error: key ", feature, " not present")
                return
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)

def targetFeatureSplit( data ):
    """
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features
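To get a feel for what these two helpers do without loading the real data, here is a simplified pure-Python stand-in. It is not the course implementation: it only mimics the default behaviour ("NaN" becomes 0.0, rows whose non-'poi' features are all zero are dropped, and the first column becomes the label). The names `to_rows` and `split_target` and the sample data are made up.

```python
def to_rows(data, features):
    """Build one list of floats per person, mapping 'NaN' to 0.0."""
    rows = []
    for name in sorted(data):
        row = [0.0 if data[name][f] == "NaN" else float(data[name][f]) for f in features]
        # Leave 'poi' (first column) out of the all-zero test.
        checked = row[1:] if features[0] == "poi" else row
        if any(value != 0 for value in checked):
            rows.append(row)
    return rows

def split_target(rows):
    """Split each row into its first value (label) and the remaining values."""
    return [row[0] for row in rows], [row[1:] for row in rows]

# Made-up sample; ROE RICHARD has no known values and is dropped.
sample = {
    "DOE JANE": {"poi": True, "salary": 100, "bonus": "NaN"},
    "ROE RICHARD": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "POE SAM": {"poi": False, "salary": 50, "bonus": 10},
}
rows = to_rows(sample, ["poi", "salary", "bonus"])
labels, values = split_target(rows)
print(labels)  # → [1.0, 0.0]
print(values)  # → [[100.0, 0.0], [50.0, 10.0]]
```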

How many people in the (current) E+F dataset have "NaN" for their total payments? What share of the dataset is that?

num_total_pay = 0
for i in  enron_data.keys():
    if enron_data[i]['total_payments'] == 'NaN':
        num_total_pay += 1

proportion = num_total_pay/float(len(enron_data))
print(num_total_pay, proportion)

How many POIs in the E+F dataset have "NaN" for their total payments? What share of the POIs is that?

numPOI = 0
num_POI_total_pay = 0
for i in  enron_data.keys():
    if enron_data[i]['poi'] == 1:
        numPOI += 1
        if enron_data[i]['total_payments'] == 'NaN':
            num_POI_total_pay += 1

proportion_poi = num_POI_total_pay/float(numPOI)
print(num_POI_total_pay,numPOI, proportion_poi)
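Both of the last two questions follow the same pattern, so they can share one helper with an optional POI filter. A sketch on made-up data (the helper `nan_share` and the sample names are hypothetical):

```python
def nan_share(data, feature, only_poi=False):
    """Return (missing, total, fraction) for `feature`, optionally restricted to POIs."""
    people = [p for p in data.values() if p["poi"] or not only_poi]
    missing = sum(1 for p in people if p[feature] == "NaN")
    fraction = missing / len(people) if people else 0.0
    return missing, len(people), fraction

# Made-up sample: two POIs with known payments, one non-POI with a missing value.
sample = {
    "DOE JANE": {"poi": True, "total_payments": 1200},
    "ROE RICHARD": {"poi": False, "total_payments": "NaN"},
    "POE SAM": {"poi": True, "total_payments": 3400},
    "MOE ALEX": {"poi": False, "total_payments": 500},
}
print(nan_share(sample, "total_payments"))                 # → (1, 4, 0.25)
print(nan_share(sample, "total_payments", only_poi=True))  # → (0, 2, 0.0)
```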

