Data Quality

1. Fixing Validity

Audit the validity of the field "productionStartYear" in the DBpedia data file autos.csv.
Requirements:

  • check if the field "productionStartYear" contains a year
  • check if the year is in range 1886-2014
  • convert the value of the field to be just a year (not full datetime)
  • the rest of the fields and values should stay the same
  • if the value of the field is a valid year in the range as described above,
    write that line to the output_good file
  • if the value of the field is not a valid year as described above,
    write that line to the output_bad file
  • discard rows (neither write to good nor bad) if the URI is not from dbpedia.org
  • you should use the provided way of reading and writing data (DictReader and DictWriter)
import csv

INPUT_FILE = 'autos.csv'
OUTPUT_GOOD = 'autos-valid.csv'
OUTPUT_BAD = 'FIXME-autos.csv'

def process_file(input_file, output_good, output_bad):
    good_data = []
    bad_data = []

    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        
        for row in reader:
            # discard rows whose URI is not from dbpedia.org
            if 'dbpedia.org' not in row['URI']:
                continue
            
            ps_year = row['productionStartYear'][:4]
            try:
                ps_year = int(ps_year)
                row['productionStartYear'] = ps_year
                if 1886 <= ps_year <= 2014:
                    good_data.append(row)
                else:
                    bad_data.append(row)

            except ValueError:
                # not a parseable year (e.g. 'NULL'): per the spec,
                # invalid values go to the bad file
                bad_data.append(row)

    with open(output_good, "w", newline='') as g:
        writer = csv.DictWriter(g, delimiter=",", fieldnames=header)
        writer.writeheader()
        for row in good_data:
            writer.writerow(row)
            
    with open(output_bad, "w", newline='') as b:
        writer = csv.DictWriter(b, delimiter=",", fieldnames=header)
        writer.writeheader()
        for row in bad_data:
            writer.writerow(row)

process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)

2. Problem Set

In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

2.1 Auditing Data Quality

In the first exercise we want you to audit the datatypes that can be found in some particular fields in the dataset.

The possible types of values can be:

  • NoneType if the value is a string "NULL" or an empty string ""
  • list, if the value starts with "{"
  • int, if the value can be cast to int
  • float, if the value can be cast to float, but CANNOT be cast to int.
    For example, '3.23e+07' should be considered a float because it can be cast
    as float but int('3.23e+07') will throw a ValueError
  • 'str', for all other values

The audit_file function should return a dictionary containing fieldnames and a
SET of the types that can be found in the field. e.g.
{"field1": set([type(float()), type(int()), type(str())]),
"field2": set([type(str())]),
....
}
The type() function returns a type object describing the argument given to the
function. You can also use examples of objects to create type objects, e.g.
type(1.1) for a float
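As an illustrative sketch (the `detect_type` helper below is not part of the exercise), the rules above, and in particular the int-before-float ordering, can be exercised on a few sample strings:

```python
def detect_type(value):
    """Map a raw CSV string to one of the type objects the audit collects."""
    if value == 'NULL' or value == '':
        return type(None)
    if value.startswith('{'):
        return list
    try:
        int(value)          # try int first: '42' counts as int, not float
        return int
    except ValueError:
        try:
            float(value)    # '3.23e+07' lands here: int() raised ValueError
            return float
        except ValueError:
            return str

print(detect_type('42'))         # <class 'int'>
print(detect_type('3.23e+07'))   # <class 'float'>
print(detect_type('Chicago'))    # <class 'str'>
```

Trying `int()` before `float()` matters: every int-castable string is also float-castable, so reversing the order would never report an int.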

Note that the first three rows (after the header row) in the cities.csv file
are not actual data points. The contents of these rows should not be included
when processing data types. Be sure to include functionality in your code to
skip over or detect these rows.

import codecs
import csv
import json
import pprint

CITIES = 'cities.csv'

FIELDS = ["name", "timeZone_label", "utcOffset", "homepage", "governmentType_label",
          "isPartOf_label", "areaCode", "populationTotal", "elevation",
          "maximumElevation", "minimumElevation", "populationDensity",
          "wgs84_pos#lat", "wgs84_pos#long", "areaLand", "areaMetro", "areaUrban"]

def audit_file(filename, fields):
    fieldtypes = {field:set() for field in fields}
    
    with open(filename, 'r') as f:
        reader = csv.DictReader(f)

        # skip the three pseudo-data rows that follow the header
        for i in range(3):
            next(reader)
            
        for row in reader:
            for field in fields:
                
                if row[field] == 'NULL' or row[field] == '':
                    fieldtypes[field].add(type(None))
                    continue
                if row[field].startswith("{"):
                    fieldtypes[field].add(type(list()))
                    continue
                try:
                    int(row[field])
                    fieldtypes[field].add(type(int()))
                except ValueError:
                    try:
                        float(row[field])
                        fieldtypes[field].add(type(float()))
                    except ValueError:
                        fieldtypes[field].add(type(str()))


    return fieldtypes


fieldtypes = audit_file(CITIES, FIELDS)

2.2 Auditing Data Quality

The field "areaLand" sometimes contains an array of two slightly different values. That does not really make sense, since a city should have a single area value, so we should make sure that is the case in our dataset. However, we have to decide which value to keep.
We will keep the value with more significant digits.
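One way to compare two string representations by significant digits is to count the digit characters in each (a rough sketch; the helper name and the tie-breaking rule are assumptions, not part of the exercise):

```python
def more_precise(a, b):
    """Return whichever numeric string carries more digit characters,
    a rough proxy for 'more significant digits'. Ties keep the first."""
    digits = lambda s: sum(ch.isdigit() for ch in s)
    return a if digits(a) >= digits(b) else b

print(more_precise('8.54696e+06', '8.6e+06'))  # '8.54696e+06'
```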

2.3 Fixing the Area

Fix the field 'areaLand'.
Finish the function fix_area(). It will receive a string as an input, and it
has to return a float representing the value of the area or None.
Before fixing, the field 'areaLand' has three kinds of values:
1. A float (can be kept directly)
2. NULL (kept as None)
3. {8.54696e+06|8.6e+06} (keep the value with more significant digits)

import codecs
import csv
import json
import pprint

CITIES = 'cities.csv'


def fix_area(area):
    if area == 'NULL' or area == '':
        area = None
    elif not area.startswith('{'):
        area = float(area)
    else:
        # two candidate values: keep the one with more significant digits,
        # using string length as a proxy
        area_list = area.strip('{}').split('|')
        if len(area_list[0]) > len(area_list[1]):
            area = float(area_list[0])
        else:
            area = float(area_list[1])

    return area



def process_file(filename):
    # CHANGES TO THIS FUNCTION WILL BE IGNORED WHEN YOU SUBMIT THE EXERCISE
    data = []

    with open(filename, "r") as f:
        reader = csv.DictReader(f)

        # skipping the extra metadata rows
        for i in range(3):
            next(reader)

        # processing file
        for line in reader:
            # calling your function to fix the area value
            if "areaLand" in line:
                line["areaLand"] = fix_area(line["areaLand"])
            data.append(line)

    return data

data = process_file(CITIES)

2.5 Fixing Names

The field "name" currently has three kinds of values:
1. NULL
2. A single string
3. A pipe-separated list such as {Krishnarajpet|…}
This problem processes the field "name": finish the function fix_name(). It will receive a string as input and return a list of all the names. If there is only one name, the list will have only one item. If the name is "NULL", the list should be empty.

import codecs
import csv
import pprint

CITIES = 'cities.csv'


def fix_name(name):
    if name == 'NULL' or name == '':
        name = []
    elif not name.startswith('{'):
        name = [name]
    else:
        name = name.strip('{}').split('|')

    return name


def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        # skipping the extra metadata rows
        for i in range(3):
            next(reader)
        # processing file
        for line in reader:
            # calling your function to fix the name value
            if "name" in line:
                line["name"] = fix_name(line["name"])
            data.append(line)
    return data


data = process_file(CITIES)

2.6 Cross-Field Auditing

If you look at the full city data, you will notice that several values appear to provide the same information in different formats: "point" seems to be a combination of "wgs84_pos#lat" and "wgs84_pos#long". However, we do not know whether that is actually the case, so you should check whether they are equal.
Finish the function check_loc(). It will receive 3 strings: first, the combined
value of "point", followed by the separate "wgs84_pos#" values. You have to
extract the lat and long values from the "point" argument and compare them to
the "wgs84_pos#" values, returning True or False.
Note that you do not have to fix the values, only determine if they are
consistent.

import csv
import pprint

CITIES = 'cities.csv'


def check_loc(point, lat, longi):
    # compare the lat/long embedded in "point" with the separate fields
    point_list = point.split(" ")
    return point_list[0] == lat and point_list[1] == longi


def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        # skipping the extra metadata rows
        for i in range(3):
            next(reader)
        # processing file
        for line in reader:
            # calling your function to check the location
            result = check_loc(line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"])
            if not result:
                print("{}: {} != {} {}".format(line["name"], line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"]))
            data.append(line)

    return data


def test():
    assert check_loc("33.08 75.28", "33.08", "75.28") == True
    assert check_loc("44.57833333333333 -91.21833333333333", "44.5783", "-91.2183") == False

if __name__ == "__main__":
    test()
