Mu Li, 2.1 Exploratory Data Analysis [Stanford Fall 2021: Practical Machine Learning, Chinese Edition]: Code Reproduction

Preface

Course link: 2.1 Exploratory Data Analysis [Stanford Fall 2021: Practical Machine Learning, Chinese Edition]
Dataset link: https://c.d2l.ai/stanford-cs329p/assignments.html#assignment-1
Original code: https://c.d2l.ai/stanford-cs329p/_static/notebooks/cs329p_notebook_eda.slides.html#/

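Problems caused by using a different dataset

1. The final figure could only be rendered as a (4*4) plot.
2. The box plot by region came out distorted.
My workaround was to cast the affected columns to numeric types explicitly: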

data['Id']=data['Id'].astype(int)
data['Elementary School Score']=data['Elementary School Score'].astype(float)
data['Total spaces']=data['Total spaces'].astype(float)
data['Bathrooms']=data['Bathrooms'].astype(float)
data['Elementary School Distance']=data['Elementary School Distance'].astype(float)
data['Garage spaces']=data['Garage spaces'].astype(float)
data['Zip']=data['Zip'].astype(int)
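The casts above raise an error as soon as a column holds a non-numeric string, and `astype(int)` cannot represent missing values. Below is a minimal sketch of a more forgiving variant, run on a toy frame rather than the real data. The helper name `coerce_numeric` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def coerce_numeric(df, float_cols, int_cols):
    """Return a copy of df with the listed columns converted to numeric dtypes.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for c in float_cols:
        # errors='coerce' turns unparseable entries into NaN instead of raising
        out[c] = pd.to_numeric(out[c], errors='coerce').astype(float)
    for c in int_cols:
        # The nullable 'Int64' dtype keeps missing values as <NA>
        out[c] = pd.to_numeric(out[c], errors='coerce').astype('Int64')
    return out

# Toy data (made up), mimicking string-typed columns with gaps
toy = pd.DataFrame({
    'Id': ['1', '2', '3'],
    'Bathrooms': ['2.0', None, 'n/a'],
    'Zip': ['95046', '91011', None],
})
clean = coerce_numeric(toy, float_cols=['Bathrooms'], int_cols=['Id', 'Zip'])
print(clean.dtypes)
```

Whether silently coercing bad entries to missing values is acceptable depends on the column; for an identifier such as `Id` a hard failure may be preferable.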

Code

# !pip install seaborn pandas matplotlib numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
display.set_matplotlib_formats('svg')
# Alternative to set svg for newer versions
# import matplotlib_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
data = pd.read_feather('house_sales.ftr')
data.shape
(164944, 1789)
data.head(10)
Id Address Sold Price Sold On Summary Type Year built Heating Cooling Parking ... Well Disclosure remodeled DOH2 SerialX Full Baths Tax Legal Lot Number Tax Legal Block Number Tax Legal Tract Number Building Name Zip
0 2080183300 11205 Monterey, $2,000,000 01/31/20 11205 Monterey, San Martin, CA 95046 is a sing... SingleFamily No Data No Data No Data 0 spaces ... None None None None None None None None None 95046
1 20926300 5281 Castle Rd, $2,100,000 02/25/21 Spectacular Mountain and incredible L.A. City ... SingleFamily 1951 Central Central Air, Dual Driveway, Driveway - Brick ... None None None None None None None None None 91011
2 19595300 3581 Butcher Dr, $1,125,000 11/06/19 Eichler Style home! with Santa Clara High! in ... SingleFamily 1954 Central Forced Air - Gas Central AC Garage, Garage - Attached, Covered ... None None None None None None None None None 95051
3 300472200 2021 N Milpitas Blvd, $36,250,000 10/02/20 2021 N Milpitas Blvd, Milpitas, CA 95035 is a ... Apartment 1989 Other No Data Mixed, Covered ... None None None None None None None None None 95035
4 2074492000 LOT 4 Tool Box Spring Rd, $140,000 10/19/20 Beautiful level lot dotted with pine trees ro... VacantLand No Data No Data No Data 0 spaces ... None None None None None None None None None 92561
5 2080638900 4707 La Villa Mari UNIT J, $1,301,000 02/24/21 AGENTS READ PRIVATE REMARKS BEFORE CALLING; S... Townhouse 1966 Central None Garage - Attached ... None None None None None None None None None 90292
6 19800000 7517 Deveron Ct, $3,200 08/31/19 This lovely rental is located in the prestigio... Apartment 1989 Forced air, Gas Central Garage, Garage - Attached, Covered ... None None None None None None None None None 95135
7 20635000 3025 E 8th St, $300,000 11/06/19 3025 E 8th St, Los Angeles, CA 90023 is a sing... SingleFamily 1922 Wall Wall/Window Unit(s) Garage, Covered ... None None None None None None None None None 90023
8 20720300 1022 Manley Dr, $795,000 01/30/21 I'm gorgeous inside!!! Beautifully remodeled ... SingleFamily 1939 Central Central Air Garage ... None None None None None None None None None 91776
9 19522800 229 Del Monte Ave, $1,750,000 08/31/18 None None None None None None ... None None None None None None None None None 94022

10 rows × 1789 columns

null_sum = data.isnull().sum()
data.columns[null_sum < len(data)*0.3]
Index(['Id', 'Address', 'Sold Price', 'Sold On', 'Summary', 'Type',
       'Year built', 'Heating', 'Cooling', 'Parking', 'Bedrooms', 'Bathrooms',
       'Total interior livable area', 'Total spaces', 'Garage spaces',
       'Home type', 'Region', 'Elementary School', 'Elementary School Score',
       'Elementary School Distance', 'High School', 'High School Score',
       'High School Distance', 'Heating features', 'Parking features',
       'Lot size', 'Parcel number', 'Tax assessed value', 'Annual tax amount',
       'Listed On', 'Listed Price', 'Zip'],
      dtype='object')
data.drop(columns = data.columns[null_sum > len(data) * 0.3],inplace=True)
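The missing-value filter can be checked on a small frame. This is a sketch on made-up data; it uses `isnull().mean()` to get the missing fraction directly, which is equivalent to comparing `isnull().sum()` against `len(data) * 0.3`.

```python
import pandas as pd

# Toy data (made up): one column is 25% missing, another 75% missing
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Type': ['Condo', None, 'Condo', 'Townhouse'],
    'Well Disclosure': [None, None, None, 'yes'],
})

# Fraction of missing values per column
null_ratio = toy.isnull().mean()

# Keep columns with at most 30% missing values
keep = null_ratio[null_ratio <= 0.3].index
filtered = toy[keep]

print(null_ratio)
print(list(filtered.columns))
```

Note that the notebook lists columns with `<` and drops them with `>`, so a column that is exactly 30% missing is neither listed nor dropped. The sketch uses `<=` to make the boundary explicit.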
data['Id']=data['Id'].astype(int)
data['Elementary School Score']=data['Elementary School Score'].astype(float)
data['Total spaces']=data['Total spaces'].astype(float)
data['Bathrooms']=data['Bathrooms'].astype(float)
data['Elementary School Distance']=data['Elementary School Distance'].astype(float)
data['Garage spaces']=data['Garage spaces'].astype(float)
data['Zip']=data['Zip'].astype(int)
data.dtypes
Id                               int32
Address                         object
Sold Price                      object
Sold On                         object
Summary                         object
Type                            object
Year built                      object
Heating                         object
Cooling                         object
Parking                         object
Bedrooms                        object
Bathrooms                      float64
Total interior livable area     object
Total spaces                   float64
Garage spaces                  float64
Home type                       object
Region                          object
Elementary School               object
Elementary School Score        float64
Elementary School Distance     float64
High School                     object
High School Score               object
High School Distance            object
Heating features                object
Parking features                object
Lot size                        object
Parcel number                   object
Tax assessed value              object
Annual tax amount               object
Listed On                       object
Listed Price                    object
Zip                              int32
dtype: object
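# Strip '$', ',' and '-' from the currency columns, then convert them to float
currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = (data[c]
               .replace(r'[$,-]', '', regex=True)      # remove currency symbols
               .replace(r'^\s*$', np.nan, regex=True)  # blank strings become NaN
               .astype(float))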
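To see what the two `replace` calls do, here is a small sketch on a made-up Series (the sample strings are assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up price strings, including a lone dash, an empty string and a missing value
prices = pd.Series(['$2,000,000', '$3,200', '-', '', None])

cleaned = (prices
           .replace(r'[$,-]', '', regex=True)      # strip '$', ',' and '-'
           .replace(r'^\s*$', np.nan, regex=True)  # empty strings become NaN
           .astype(float))

print(cleaned.tolist())
```

Because `-` is removed as well, a negative amount such as `-5` would become `5`; that is harmless only if the column is known to hold no negative values.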
# Convert the area columns to square feet (1 acre = 43,560 sqft)
areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True  # missing values count as False
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b', '', regex=True).astype(float)
    col[acres] *= 43560
    data[c] = col
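The unit conversion can be traced on a few sample strings. A sketch with made-up values; the constant name `SQFT_PER_ACRE` is introduced here for readability and does not appear in the notebook.

```python
import pandas as pd

SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet

# Made-up lot sizes in mixed units
lot = pd.Series(['6,603 sqft', '0.5 Acres', '1 Acres', None])

# Remember which rows were given in acres before the unit text is stripped
is_acres = lot.str.contains('Acres') == True

# Drop the unit suffix and the thousands separator, then convert to float
values = lot.replace(r'\b sqft\b|\b Acres\b|\b,\b', '', regex=True).astype(float)

# Rescale only the rows that were in acres
values[is_acres] *= SQFT_PER_ACRE

print(values.tolist())
```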
data.describe()
                 Id    Sold Price      Bathrooms  Total interior livable area   Total spaces  Garage spaces  Elementary School Score  Elementary School Distance      Lot size  Tax assessed value  Annual tax amount  Listed Price            Zip
count  1.649440e+05  1.648590e+05  141791.000000                 1.465450e+05  156738.000000  156736.000000            145676.000000               146288.000000  1.358450e+05        1.450650e+05       1.433500e+05  1.250060e+05  164944.000000
mean   2.791434e+08  1.194842e+06       2.303087                 3.182221e+03       1.706044       1.607614                 5.654892                    1.260918  9.525061e+05        8.898781e+05       1.123415e+04  1.197671e+06   93084.811172
std    6.424318e+08  3.336365e+06       1.646634                 4.609881e+05      28.802242      28.782370                 2.098547                    2.888909  1.357197e+08        3.126888e+06       3.859389e+04  2.874721e+06    2265.021138
min    7.387732e+06  1.000000e+00       0.000000                 1.000000e+00     -26.000000     -26.000000                 1.000000                    0.000000  0.000000e+00        0.000000e+00       0.000000e+00  1.000000e+00   85611.000000
25%    1.913563e+07  4.350000e+05       2.000000                 1.170000e+03       0.000000       0.000000                 4.000000                    0.300000  4.800000e+03        2.550000e+05       3.434250e+03  4.990000e+05   90232.000000
50%    2.059865e+07  8.050000e+05       2.000000                 1.558000e+03       1.000000       1.000000                 6.000000                    0.500000  6.603000e+03        5.635010e+05       7.372000e+03  8.490000e+05   94066.000000
75%    8.923942e+07  1.370000e+06       3.000000                 2.144000e+03       2.000000       2.000000                 7.000000                    1.000000  1.209000e+04        1.033832e+06       1.321300e+04  1.395000e+06   95053.000000
max    2.147000e+09  8.660000e+08     256.000000                 1.764164e+08    9999.000000    9999.000000                10.000000                   76.400000  4.856770e+10        8.256328e+08       9.977342e+06  6.250000e+08   96155.000000
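# Flag rows whose lot size is below 10 or above 1e4 sqft, then drop them
abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal].copy()  # copy so that later column assignments do not warn
sum(abnormal)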
41000
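One detail of the range filter is easy to miss: comparisons against NaN are False, so rows with a missing lot size are not flagged and stay in the data. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up lot sizes: one too small, one too large, one missing
toy = pd.DataFrame({'Lot size': [5.0, 4800.0, 6603.0, 2.0e6, np.nan]})

abnormal = (toy['Lot size'] < 10) | (toy['Lot size'] > 1e4)
kept = toy[~abnormal]

# abnormal.sum() counts the True entries, like the built-in sum(abnormal)
print(int(abnormal.sum()), len(kept))
```

If rows without a lot size should also be removed, `abnormal | toy['Lot size'].isna()` would be one way to extend the mask.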
ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);


[Figure 1: histogram of log10(Sold Price)]
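The histogram is drawn on log10 of the price because sale prices span several orders of magnitude. The binning and the tick labels can be reproduced without plotting. This is a sketch using NumPy only, with a handful of prices picked for illustration and one bin per decade (seaborn chooses its own bin width, so the counts will not match the figure).

```python
import numpy as np

# A few illustrative prices spanning several orders of magnitude
prices = np.array([3_200, 140_000, 300_000, 795_000, 1_125_000,
                   2_000_000, 36_250_000], dtype=float)
log_prices = np.log10(prices)

# One bin per decade between 1e3 and 1e8, matching the tick positions 3..8
counts, edges = np.histogram(log_prices, bins=np.arange(3, 9))

# Same label formatting as in the notebook: exponent back to a price
tick_labels = ['%.0e' % a for a in 10.0 ** edges]

print(counts.tolist())
print(tick_labels)
```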

data['Type'].value_counts()[0:20]
SingleFamily            74318
Condo                   18749
MultiFamily              6586
VacantLand               6199
Townhouse                5846
Unknown                  5390
MobileManufactured       2588
Apartment                1416
Cooperative               161
Residential Lot            75
Single Family              69
Single Family Lot          56
Acreage                    48
2 Story                    39
3 Story                    25
Hi-Rise (9+), Luxury       21
RESIDENTIAL                19
Duplex                     19
Condominium                19
Mid-Rise (4-8)             17
Name: Type, dtype: int64
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');


[Figure 2: kernel density estimate of log10(Sold Price) by Type]
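A numeric companion to the density plot is the median of log10(Sold Price) per type. A sketch on made-up rows, using the same `isin` filter as the notebook:

```python
import numpy as np
import pandas as pd

# Made-up rows; VacantLand is excluded by the type filter
toy = pd.DataFrame({
    'Type': ['SingleFamily', 'SingleFamily', 'Condo', 'Condo', 'VacantLand'],
    'Sold Price': [1_000_000.0, 100_000.0, 10_000.0, 1_000_000.0, 140_000.0],
})

types = toy['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
log_price = np.log10(toy.loc[types, 'Sold Price'])

# Median log10 price for each remaining type
summary = log_price.groupby(toy.loc[types, 'Type']).median()
print(summary)
```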

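# Box plot
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);
# The line inside each box is the median
# The top edge of each box is the third quartile (75th percentile)
# The upper whisker ends at the largest value within 1.5 * IQR of the box (seaborn's default), not necessarily the maximum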


[Figure 3: box plot of Price per living sqft by Type]
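The elements of the box plot can be computed by hand. This sketch assumes the Tukey convention with whiskers at 1.5 times the interquartile range, which is seaborn's default for `whis`; the function name `box_stats` and the sample values are made up.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles and whisker ends of a Tukey-style box plot (illustrative helper)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme data points within whis * IQR of the box
    lower = v[v >= q1 - whis * iqr].min()
    upper = v[v <= q3 + whis * iqr].max()
    return {'q1': q1, 'median': median, 'q3': q3, 'lower': lower, 'upper': upper}

# Made-up prices per sqft with one extreme value
stats = box_stats([100, 200, 300, 400, 500, 5000])
print(stats)
```

Here the upper whisker stops at 500 even though the maximum is 5000; the extreme point would be drawn as an outlier, which `fliersize=0` hides in the plot.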

d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

[Figure 4: box plot of Price per living sqft for the 20 most frequent Zip codes]
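Selecting the most frequent Zip codes relies on `value_counts()` returning counts in descending order. A sketch on made-up rows with `top_n = 2` instead of 20; it takes the labels via `.index[:top_n]`, which selects the same labels as slicing the counts and calling `.keys()`.

```python
import pandas as pd

# Made-up rows: three Zip codes with 3, 2 and 1 sales
toy = pd.DataFrame({
    'Zip': [95046, 95046, 95046, 91011, 91011, 95051],
    'Price per living sqft': [500.0, 520.0, 480.0, 700.0, 650.0, 900.0],
})

top_n = 2  # the notebook uses 20
top_zips = toy['Zip'].value_counts().index[:top_n]

# Keep only rows whose Zip is among the most frequent ones
d = toy[toy['Zip'].isin(top_zips)]

print(sorted(d['Zip'].unique().tolist()))
print(d.groupby('Zip')['Price per living sqft'].median())
```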

data.dtypes
Id                               int32
Address                         object
Sold Price                     float64
Sold On                         object
Summary                         object
Type                            object
Year built                      object
Heating                         object
Cooling                         object
Parking                         object
Bedrooms                        object
Bathrooms                      float64
Total interior livable area    float64
Total spaces                   float64
Garage spaces                  float64
Home type                       object
Region                          object
Elementary School               object
Elementary School Score        float64
Elementary School Distance     float64
High School                     object
High School Score               object
High School Distance            object
Heating features                object
Parking features                object
Lot size                       float64
Parcel number                   object
Tax assessed value             float64
Annual tax amount              float64
Listed On                       object
Listed Price                   float64
Zip                              int32
Price per living sqft          float64
dtype: object
_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);


[Figure 5: correlation heat map of the selected columns]
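In the `data.dtypes` output above, 'High School Score' is still of type object. Depending on the pandas version and the values, `DataFrame.corr()` may skip such a column or raise an error, so converting it first is the safer route. A sketch on made-up numbers, assuming the scores are numeric strings:

```python
import pandas as pd

# Made-up rows; the score column is stored as strings
toy = pd.DataFrame({
    'Sold Price': [300000.0, 795000.0, 1125000.0, 2000000.0],
    'Listed Price': [310000.0, 780000.0, 1100000.0, 2100000.0],
    'High School Score': ['4', '6', '7', '9'],
})

# Convert to a numeric dtype; unparseable entries would become NaN
toy['High School Score'] = pd.to_numeric(toy['High School Score'], errors='coerce')

corr = toy.corr()
print(corr.shape)
print(corr.round(2))
```

With all columns numeric, the correlation matrix has one row and one column per selected feature, which is what the heat map needs.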

