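Course: 2.1 Exploratory Data Analysis [Stanford Fall 2021: Practical Machine Learning, Chinese edition]
Dataset: https://c.d2l.ai/stanford-cs329p/assignments.html#assignment-1
Original code: https://c.d2l.ai/stanford-cs329p/_static/notebooks/cs329p_notebook_eda.slides.html#/
Problems I ran into when running the original code:
1. The final plot only comes out as a 4×4 figure.
2. The box plot by region comes out distorted.
My fix: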
data['Id']=data['Id'].astype(int)
data['Elementary School Score']=data['Elementary School Score'].astype(float)
data['Total spaces']=data['Total spaces'].astype(float)
data['Bathrooms']=data['Bathrooms'].astype(float)
data['Elementary School Distance']=data['Elementary School Distance'].astype(float)
data['Garage spaces']=data['Garage spaces'].astype(float)
data['Zip']=data['Zip'].astype(int)
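The casts above assume that every value in a column parses cleanly; `astype(int)` in particular raises on missing values. As a hedged alternative, the sketch below runs on made-up toy values (not the real dataset). It uses `pd.to_numeric(..., errors='coerce')` so that unparseable entries become `NaN`, then the nullable `Int64` dtype so that integer zip codes can sit next to missing ones.

```python
import pandas as pd

# Toy values (hypothetical, not taken from house_sales.ftr).
zips = pd.Series(['95046', '91011', None, 'No Data'], dtype=object)

# errors='coerce' turns anything that cannot be parsed into NaN.
as_float = pd.to_numeric(zips, errors='coerce')

# The nullable Int64 dtype stores integers together with missing values.
as_int = as_float.astype('Int64')

print(as_float.tolist())
print(as_int.dtype)
```

Whether this is preferable to a plain `astype` depends on whether you want bad values to fail loudly or to be treated as missing.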
# !pip install seaborn pandas matplotlib numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
display.set_matplotlib_formats('svg')
# Alternative to set svg for newer versions
# import matplotlib_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
data = pd.read_feather('house_sales.ftr')
data.shape
(164944, 1789)
data.head(10)
| Id | Address | Sold Price | Sold On | Summary | Type | Year built | Heating | Cooling | Parking | ... | Well Disclosure | remodeled | DOH2 | SerialX | Full Baths | Tax Legal Lot Number | Tax Legal Block Number | Tax Legal Tract Number | Building Name | Zip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2080183300 | 11205 Monterey, | $2,000,000 | 01/31/20 | 11205 Monterey, San Martin, CA 95046 is a sing... | SingleFamily | No Data | No Data | No Data | 0 spaces | ... | None | None | None | None | None | None | None | None | None | 95046 |
1 | 20926300 | 5281 Castle Rd, | $2,100,000 | 02/25/21 | Spectacular Mountain and incredible L.A. City ... | SingleFamily | 1951 | Central | Central Air, Dual | Driveway, Driveway - Brick | ... | None | None | None | None | None | None | None | None | None | 91011 |
2 | 19595300 | 3581 Butcher Dr, | $1,125,000 | 11/06/19 | Eichler Style home! with Santa Clara High! in ... | SingleFamily | 1954 | Central Forced Air - Gas | Central AC | Garage, Garage - Attached, Covered | ... | None | None | None | None | None | None | None | None | None | 95051 |
3 | 300472200 | 2021 N Milpitas Blvd, | $36,250,000 | 10/02/20 | 2021 N Milpitas Blvd, Milpitas, CA 95035 is a ... | Apartment | 1989 | Other | No Data | Mixed, Covered | ... | None | None | None | None | None | None | None | None | None | 95035 |
4 | 2074492000 | LOT 4 Tool Box Spring Rd, | $140,000 | 10/19/20 | Beautiful level lot dotted with pine trees ro... | VacantLand | No Data | No Data | No Data | 0 spaces | ... | None | None | None | None | None | None | None | None | None | 92561 |
5 | 2080638900 | 4707 La Villa Mari UNIT J, | $1,301,000 | 02/24/21 | AGENTS READ PRIVATE REMARKS BEFORE CALLING; S... | Townhouse | 1966 | Central | None | Garage - Attached | ... | None | None | None | None | None | None | None | None | None | 90292 |
6 | 19800000 | 7517 Deveron Ct, | $3,200 | 08/31/19 | This lovely rental is located in the prestigio... | Apartment | 1989 | Forced air, Gas | Central | Garage, Garage - Attached, Covered | ... | None | None | None | None | None | None | None | None | None | 95135 |
7 | 20635000 | 3025 E 8th St, | $300,000 | 11/06/19 | 3025 E 8th St, Los Angeles, CA 90023 is a sing... | SingleFamily | 1922 | Wall | Wall/Window Unit(s) | Garage, Covered | ... | None | None | None | None | None | None | None | None | None | 90023 |
8 | 20720300 | 1022 Manley Dr, | $795,000 | 01/30/21 | I'm gorgeous inside!!! Beautifully remodeled ... | SingleFamily | 1939 | Central | Central Air | Garage | ... | None | None | None | None | None | None | None | None | None | 91776 |
9 | 19522800 | 229 Del Monte Ave, | $1,750,000 | 08/31/18 | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | 94022 |
10 rows × 1789 columns
null_sum = data.isnull().sum()
data.columns[null_sum < len(data)*0.3]
Index(['Id', 'Address', 'Sold Price', 'Sold On', 'Summary', 'Type',
       'Year built', 'Heating', 'Cooling', 'Parking', 'Bedrooms', 'Bathrooms',
       'Total interior livable area', 'Total spaces', 'Garage spaces',
       'Home type', 'Region', 'Elementary School', 'Elementary School Score',
       'Elementary School Distance', 'High School', 'High School Score',
       'High School Distance', 'Heating features', 'Parking features',
       'Lot size', 'Parcel number', 'Tax assessed value', 'Annual tax amount',
       'Listed On', 'Listed Price', 'Zip'],
      dtype='object')
data.drop(columns = data.columns[null_sum > len(data) * 0.3],inplace=True)
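To make the 30% missing-value rule concrete, here is a minimal sketch on a made-up three-column frame. The names `toy`, `missing_ratio` and `kept` are illustrative only; `isnull().mean()` is used as an equivalent way of writing `isnull().sum() / len(data)`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 'a' is complete, 'b' is half missing, 'c' is empty.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': [np.nan, np.nan, np.nan, np.nan],
})

# isnull().mean() is the fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# Keep columns whose missing fraction is below the 30% threshold.
kept = toy.loc[:, missing_ratio < 0.3]

print(missing_ratio.to_dict())
print(list(kept.columns))
```

Selecting the columns to keep with a single mask avoids the edge case where a column sitting at exactly 30% is neither listed by `<` nor dropped by `>`.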
data['Id']=data['Id'].astype(int)
data['Elementary School Score']=data['Elementary School Score'].astype(float)
data['Total spaces']=data['Total spaces'].astype(float)
data['Bathrooms']=data['Bathrooms'].astype(float)
data['Elementary School Distance']=data['Elementary School Distance'].astype(float)
data['Garage spaces']=data['Garage spaces'].astype(float)
data['Zip']=data['Zip'].astype(int)
data.dtypes
Id int32
Address object
Sold Price object
Sold On object
Summary object
Type object
Year built object
Heating object
Cooling object
Parking object
Bedrooms object
Bathrooms float64
Total interior livable area object
Total spaces float64
Garage spaces float64
Home type object
Region object
Elementary School object
Elementary School Score float64
Elementary School Distance float64
High School object
High School Score object
High School Distance object
Heating features object
Parking features object
Lot size object
Parcel number object
Tax assessed value object
Annual tax amount object
Listed On object
Listed Price object
Zip int32
dtype: object
currency = ['Sold Price','Listed Price','Tax assessed value','Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]','',regex=True).replace(
        r'^\s*$',np.nan,regex=True).astype(float)
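The two chained `replace` calls are easier to follow on a handful of values. The sketch below applies the same two regular expressions to made-up price strings; `prices` and `cleaned` are illustrative names, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy price strings (hypothetical); np.nan, '' and '-' stand in for missing data.
prices = pd.Series(['$2,000,000', '$3,200', '', np.nan, '-'], dtype=object)

cleaned = (prices
           .replace(r'[$,-]', '', regex=True)      # strip '$', ',' and '-'
           .replace(r'^\s*$', np.nan, regex=True)  # blank strings -> NaN
           .astype(float))

print(cleaned.tolist())
```

The second `replace` matters because a value such as `'-'` is an empty string after the first step, and an empty string cannot be cast to `float`.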
areas=['Total interior livable area','Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','',regex=True).astype(float)
    col[acres]*=43560
    data[c]=col
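The area columns mix two units, so rows given in acres are scaled by 43,560 (the number of square feet in one acre). A minimal sketch of the same steps on made-up strings follows; `areas_raw`, `in_acres` and `SQFT_PER_ACRE` are illustrative names, and `na=False` is used here in place of the `== True` comparison.

```python
import numpy as np
import pandas as pd

SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet

# Toy area strings (hypothetical values in the dataset's text format).
areas_raw = pd.Series(['1,200 sqft', '0.5 Acres', '2 Acres', np.nan], dtype=object)

# Remember which rows are in acres before the unit text is stripped.
in_acres = areas_raw.str.contains('Acres', na=False)

# Strip the unit suffixes and thousands separators, then cast to float.
sqft = areas_raw.replace(r'\b sqft\b|\b Acres\b|\b,\b', '', regex=True).astype(float)

# Convert only the rows that were given in acres.
sqft.loc[in_acres] *= SQFT_PER_ACRE

print(sqft.tolist())
```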
data.describe()
| Id | Sold Price | Bathrooms | Total interior livable area | Total spaces | Garage spaces | Elementary School Score | Elementary School Distance | Lot size | Tax assessed value | Annual tax amount | Listed Price | Zip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.649440e+05 | 1.648590e+05 | 141791.000000 | 1.465450e+05 | 156738.000000 | 156736.000000 | 145676.000000 | 146288.000000 | 1.358450e+05 | 1.450650e+05 | 1.433500e+05 | 1.250060e+05 | 164944.000000 |
mean | 2.791434e+08 | 1.194842e+06 | 2.303087 | 3.182221e+03 | 1.706044 | 1.607614 | 5.654892 | 1.260918 | 9.525061e+05 | 8.898781e+05 | 1.123415e+04 | 1.197671e+06 | 93084.811172 |
std | 6.424318e+08 | 3.336365e+06 | 1.646634 | 4.609881e+05 | 28.802242 | 28.782370 | 2.098547 | 2.888909 | 1.357197e+08 | 3.126888e+06 | 3.859389e+04 | 2.874721e+06 | 2265.021138 |
min | 7.387732e+06 | 1.000000e+00 | 0.000000 | 1.000000e+00 | -26.000000 | -26.000000 | 1.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 85611.000000 |
25% | 1.913563e+07 | 4.350000e+05 | 2.000000 | 1.170000e+03 | 0.000000 | 0.000000 | 4.000000 | 0.300000 | 4.800000e+03 | 2.550000e+05 | 3.434250e+03 | 4.990000e+05 | 90232.000000 |
50% | 2.059865e+07 | 8.050000e+05 | 2.000000 | 1.558000e+03 | 1.000000 | 1.000000 | 6.000000 | 0.500000 | 6.603000e+03 | 5.635010e+05 | 7.372000e+03 | 8.490000e+05 | 94066.000000 |
75% | 8.923942e+07 | 1.370000e+06 | 3.000000 | 2.144000e+03 | 2.000000 | 2.000000 | 7.000000 | 1.000000 | 1.209000e+04 | 1.033832e+06 | 1.321300e+04 | 1.395000e+06 | 95053.000000 |
max | 2.147000e+09 | 8.660000e+08 | 256.000000 | 1.764164e+08 | 9999.000000 | 9999.000000 | 10.000000 | 76.400000 | 4.856770e+10 | 8.256328e+08 | 9.977342e+06 | 6.250000e+08 | 96155.000000 |
abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]
sum(abnormal)
41000
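One detail of this filter is easy to miss: comparisons against `NaN` evaluate to `False`, so rows with a missing lot size are not flagged as abnormal and stay in the data. The sketch below shows this on made-up values; `toy` and `filtered` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy lot sizes in square feet (hypothetical values).
toy = pd.DataFrame({'Lot size': [5.0, 5000.0, 20000.0, np.nan]})

# Comparisons with NaN evaluate to False, so a missing lot size is not flagged.
abnormal = (toy['Lot size'] < 10) | (toy['Lot size'] > 1e4)

filtered = toy[~abnormal]

print(int(abnormal.sum()))  # number of rows flagged
print(len(filtered))        # rows kept, including the NaN row
```

`abnormal.sum()` gives the same count as the built-in `sum(abnormal)` and is usually faster on large frames.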
ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);
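The tick labels are produced by undoing the log transform: each tick position `t` on the log10 axis corresponds to a price of `10**t`. The sketch below reproduces only the label computation with NumPy, assuming the same tick positions `3` to `8` as in the plot; no figure is drawn.

```python
import numpy as np

# Tick positions on the log10 axis, matching ax.set_xticks(range(3, 9)).
ticks = np.arange(3, 9)

# Raise 10 to each tick to recover the price, then format it compactly.
labels = ['%.0e' % a for a in 10.0 ** ticks]

print(labels)
```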
data['Type'].value_counts()[0:20]
SingleFamily 74318
Condo 18749
MultiFamily 6586
VacantLand 6199
Townhouse 5846
Unknown 5390
MobileManufactured 2588
Apartment 1416
Cooperative 161
Residential Lot 75
Single Family 69
Single Family Lot 56
Acreage 48
2 Story 39
3 Story 25
Hi-Rise (9+), Luxury 21
RESIDENTIAL 19
Duplex 19
Condominium 19
Mid-Rise (4-8) 17
Name: Type, dtype: int64
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');
# Box plot
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);
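# The line inside the box is the median
# The upper whisker is the largest value within 1.5×IQR above the upper quartile (not necessarily the maximum; outliers are hidden here by fliersize=0)
# The top edge of the box is the 75th percentile (third quartile)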
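The quantities a box plot draws can be computed by hand. The sketch below does so with NumPy on a made-up sample, assuming seaborn's default whisker rule of 1.5 times the interquartile range (`whis=1.5`); all variable names are illustrative.

```python
import numpy as np

# Toy sample (hypothetical) with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points still inside the 1.5 * IQR fences.
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

# Points beyond the whiskers are the outliers ("fliers").
outliers = values[(values < lower_whisker) | (values > upper_whisker)]

print(q1, median, q3)
print(lower_whisker, upper_whisker)
print(outliers)
```

In this sample the upper whisker stops at 8 rather than at the maximum of 100, which is drawn as a flier (or hidden when `fliersize=0`).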
d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
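The zip-code plot first restricts the data to the most frequent zip codes. A minimal sketch of that selection on a made-up frame follows, using `head(n).index` as an assumed equivalent of `[:20].keys()`; the zip codes and prices are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical zip codes and prices per square foot).
toy = pd.DataFrame({
    'Zip': [95046, 95046, 95046, 91011, 91011, 90023],
    'Price per living sqft': [500.0, 520.0, 480.0, 900.0, 950.0, 300.0],
})

# value_counts() sorts by frequency, so head(n) gives the n most common zips.
top_zips = toy['Zip'].value_counts().head(2).index

# Keep only the rows whose zip code is among the most common ones.
subset = toy[toy['Zip'].isin(top_zips)]

# Per-zip median, i.e. the centre line each box would show.
medians = subset.groupby('Zip')['Price per living sqft'].median()
print(medians)
```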
data.dtypes
Id int32
Address object
Sold Price float64
Sold On object
Summary object
Type object
Year built object
Heating object
Cooling object
Parking object
Bedrooms object
Bathrooms float64
Total interior livable area float64
Total spaces float64
Garage spaces float64
Home type object
Region object
Elementary School object
Elementary School Score float64
Elementary School Distance float64
High School object
High School Score object
High School Distance object
Heating features object
Parking features object
Lot size float64
Parcel number object
Tax assessed value float64
Annual tax amount float64
Listed On object
Listed Price float64
Zip int32
Price per living sqft float64
dtype: object
_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);
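In the `data.dtypes` output above, `High School Score` is still `object`. Depending on the pandas version, `DataFrame.corr()` may skip non-numeric columns or raise an error on them, which is one possible reason a heatmap shows fewer rows and columns than requested (an assumption, not verified on this dataset). The sketch below illustrates the effect on a made-up frame, using `select_dtypes` to make the numeric-only selection explicit.

```python
import pandas as pd

# Toy frame (hypothetical); 'score' holds numbers stored as text.
toy = pd.DataFrame({
    'price': [1.0, 2.0, 3.0, 4.0],
    'tax': [1.1, 1.9, 3.2, 3.9],
    'score': pd.Series(['5', '7', '6', '9'], dtype=object),
})

# Only numeric columns can be correlated, so the text column is left out.
before = toy.select_dtypes(include='number').corr()

# After conversion the column takes part in the correlation matrix.
toy['score'] = pd.to_numeric(toy['score'], errors='coerce')
after = toy.select_dtypes(include='number').corr()

print(before.shape, after.shape)
```

Applying the same conversion to every score column before calling `corr()` should keep all requested columns in the heatmap.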