Python数据分析与机器学习实战笔记(5) - K近邻算法

文章目录

  • K 近邻算法
    • 1. K近邻算法概述
      • 1.1读取数据
      • 1.2 K nearest Neighbor 算法
      • 1.3(欧式)距离的定义
      • 1.4 模型评估
        • 1.4.1 首先制定好训练集和测试集
        • 1.4.2 基于单变量预测价格
        • 1.4.3 Root Mean Squared Error 均方根误差
        • 1.4.4 不同的变量效果会不会不同呢
        • 1.4.5 数据标准化,归一化
        • 1.4.6 多变量距离的计算
        • 1.4.7 多变量KNN模型
    • 2. sklearn库与功能
      • 2.1 使用Sklearn来完成KNN

K 近邻算法

1. K近邻算法概述

1.1读取数据

import pandas as pd
#选择部分列
features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']

dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features]

print(dc_listings.shape)

dc_listings.head()

数据特征:

accommodates: 可以容纳的旅客
bedrooms: 卧室的数量
bathrooms: 厕所的数量
beds: 床的数量
price: 每晚的费用
minimum_nights: 客人最少租了几天
maximum_nights: 客人最多租了几天
number_of_reviews: 评论的数量

1.2 K nearest Neighbor 算法

Python数据分析与机器学习实战笔记(5) - K近邻算法_第1张图片

Python数据分析与机器学习实战笔记(5) - K近邻算法_第2张图片
Python数据分析与机器学习实战笔记(5) - K近邻算法_第3张图片
Python数据分析与机器学习实战笔记(5) - K近邻算法_第4张图片

1.3(欧式)距离的定义

Alt
其中Q1到Qn是一条数据的所有特征信息,P1到Pn是另一条数据的所有特征信息

import numpy as np
#假设我们的房子有3个房间
our_acc_value = 3

dc_listings['distance'] = np.abs(dc_listings.accommodates - our_acc_value)
dc_listings.distance.value_counts().sort_index()
#sample操作可以得到洗牌后的数据
c_listings = dc_listings.sample(frac=1,random_state=0)
dc_listings = dc_listings.sort_values(by='distance')
dc_listings.price.head()
dc_listings['price'] = dc_listings.price.str.replace("\$|,",'').astype(float)

mean_price = dc_listings.price.iloc[:5].mean()
mean_price

1.4 模型评估

Python数据分析与机器学习实战笔记(5) - K近邻算法_第5张图片

1.4.1 首先制定好训练集和测试集

[外链图片转

dc_listings.drop('distance',axis=1)

train_df = dc_listings.copy().iloc[:2792]
test_df = dc_listings.copy().iloc[2792:]

1.4.2 基于单变量预测价格

def predict_price(new_listing_value,feature_column):
    temp_df = train_df
    temp_df['distance'] = np.abs(dc_listings[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)
test_df['predicted_price'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')

1.4.3 Root Mean Squared Error 均方根误差

在这里插入图片描述

test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
rmse

1.4.4 不同的变量效果会不会不同呢

for feature in ['accommodates','bedrooms','bathrooms','number_of_reviews']:
    #test_df['predicted_price'] = test_df.accommodates.apply(predict_price,feature_column=feature)
    test_df['predicted_price'] = test_df[feature].apply(predict_price,feature_column=feature)
    test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
    mse = test_df['squared_error'].mean()
    rmse = mse ** (1/2)
    print("RMSE for the {} column: {}".format(feature,rmse))

1.4.5 数据标准化,归一化

import pandas as pd
from sklearn.preprocessing import StandardScaler
features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']

dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features]

dc_listings['price'] = dc_listings.price.str.replace("\$|,",'').astype(float)

dc_listings = dc_listings.dropna()

dc_listings[features] = StandardScaler().fit_transform(dc_listings[features])

normalized_listings = dc_listings

print(dc_listings.shape)

normalized_listings.head()

参见

  1. [机器学习] 数据特征 标准化和归一化你了解多少?

1.4.6 多变量距离的计算

![在这里插入图片描述](https://img-blog.csdnimg.cn/20200225203807785.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzMzNDE4MzA2,size_16,color_FFFFFF,t_70 #pic_center)

from scipy.spatial import distance

first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[20][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
first_fifth_distance

1.4.7 多变量KNN模型

def predict_price_multivariate(new_listing_value,feature_columns):
    temp_df = norm_train_df
    temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]])
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)

cols = ['accommodates', 'bathrooms']
norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)    
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)
print(rmse)

2. sklearn库与功能

2.1 使用Sklearn来完成KNN

from sklearn.neighbors import KNeighborsRegressor#导入包
cols = ['accommodates','bedrooms']
knn = KNeighborsRegressor()#实例化模型
knn.fit(norm_train_df[cols], norm_train_df['price'])#fit数据
two_features_predictions = knn.predict(norm_test_df[cols])# 预测
from sklearn.metrics import mean_squared_error

two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse)

加入更多的特征

knn = KNeighborsRegressor()

cols = ['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

knn.fit(norm_train_df[cols], norm_train_df['price'])
four_features_predictions = knn.predict(norm_test_df[cols])
four_features_mse = mean_squared_error(norm_test_df['price'], four_features_predictions)
four_features_rmse = four_features_mse ** (1/2)
four_features_rmse

你可能感兴趣的:(Python数据分析与机器学习实战笔记(5) - K近邻算法)