Machine Learning, Stop 3: TensorFlow in Practice

TensorFlow in Practice

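First, what is TensorFlow? It is an open-source software library released by Google... and that's about it.
We call the TensorFlow API from Python to carry out a series of machine learning operations.
In Stop 1 I covered how to install Anaconda; we'll write the code in the bundled Spyder IDE.

Spyder's interface (the layout is adjustable) resembles MATLAB's, with a file editor pane and a console pane, which makes debugging very convenient.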

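Now let's write our first program. It breaks down into these steps:
1. Load the libraries
2. Load the data
3. Define the feature and the label
4. Configure the LinearRegressor
5. Define the input (preprocessing) function
6. Train the model
7. Evaluate the model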

Loading libraries and basic settings

Load the libraries. Look up what each one does on your own...
See also the TensorFlow library documentation.

  import math
  from IPython import display
  from matplotlib import cm
  from matplotlib import gridspec
  from matplotlib import pyplot as plt
  import numpy as np
  import pandas as pd
  from sklearn import metrics
  import tensorflow as tf
  from tensorflow.python.data import Dataset

Next come the basic settings, mainly the logging level and the pandas display options:

  # Only log errors (tf.logging is part of the TensorFlow 1.x API).
  # Tutorial: http://lib.csdn.net/article/aiframework/61081
  tf.logging.set_verbosity(tf.logging.ERROR)
  # Display at most 10 rows.
  pd.options.display.max_rows = 10
  # Show floats with one decimal place.
  pd.options.display.float_format = '{:.1f}'.format

Loading the data

Load the data from the CSV file:

  # Load the California housing training data.
  california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

Randomize the row order so that no pathological ordering in the file can skew training:

  # Reorder the rows randomly.
  california_housing_dataframe = california_housing_dataframe.reindex(np.random.permutation(california_housing_dataframe.index))
  # Express median_house_value in thousands.
  california_housing_dataframe["median_house_value"] /= 1000.0
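To see why the shuffle matters, here is a minimal stdlib-only sketch. It uses a toy list of numbers rather than the housing data, and the names `values` and `first_batch_mean` are made up for illustration. It compares what the first batch looks like when the data is sorted versus shuffled.

```python
import random

# Toy "dataset" sorted by value, standing in for a file sorted by some column.
values = list(range(100))

def first_batch_mean(data, batch_size=10):
    """Mean of the first batch, as a training loop would see it."""
    batch = data[:batch_size]
    return sum(batch) / len(batch)

# With sorted data the first batch only contains the smallest values.
sorted_mean = first_batch_mean(values)

# Shuffle a copy of the row order, similar in spirit to
# reindex(np.random.permutation(index)) on a DataFrame.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
shuffled = values[:]
rng.shuffle(shuffled)
shuffled_mean = first_batch_mean(shuffled)

full_mean = sum(values) / len(values)
print("sorted first batch:", sorted_mean)
print("shuffled first batch:", shuffled_mean)
print("whole dataset:", full_mean)
```

The sorted first batch is far from the dataset mean, while a shuffled batch is usually much closer, which is the point of randomizing before training.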

Inspect the data: look at the summary statistics for each column and investigate anything that looks abnormal.

  print(california_housing_dataframe.describe())

Defining the feature and the label

We need to decide which columns serve as the feature and the label for this run. To start, we use total_rooms to predict median_house_value.

  # Define the input feature: total_rooms.
  my_feature = california_housing_dataframe[["total_rooms"]]
  # Configure a numeric feature column for total_rooms.
  feature_columns = [tf.feature_column.numeric_column("total_rooms")]
  # Define the label.
  targets = california_housing_dataframe["median_house_value"]

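Configuring the LinearRegressor

We configure a linear regression model with LinearRegressor and train it with GradientDescentOptimizer, which implements mini-batch stochastic gradient descent (SGD). The gradients are clipped to a norm of 5.0 so that a single large step cannot make training diverge.

  # Learning rate to tune; this small starting value is an assumption.
  learning_rate = 0.0000001
  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
  # Clip the gradient norm to 5.0.
  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
  linear_regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns, optimizer=my_optimizer)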
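To make the optimizer settings less of a black box, here is a rough pure-Python sketch of one gradient descent step with norm clipping for the model y = weight * x + bias. It is a conceptual illustration under my own assumptions (squared loss, clipping the combined norm of both gradients), not TensorFlow's actual implementation; `clip_by_norm` and `sgd_step` are hypothetical helper names.

```python
import math

def clip_by_norm(grads, clip_norm):
    """Scale the gradient vector down if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return list(grads)
    scale = clip_norm / norm
    return [g * scale for g in grads]

def sgd_step(weight, bias, batch, learning_rate, clip_norm=5.0):
    """One mini-batch step for y = weight * x + bias with mean squared loss."""
    n = len(batch)
    # Gradients of the mean squared error with respect to weight and bias.
    grad_w = sum(2 * (weight * x + bias - y) * x for x, y in batch) / n
    grad_b = sum(2 * (weight * x + bias - y) for x, y in batch) / n
    # Clip before applying the update.
    grad_w, grad_b = clip_by_norm([grad_w, grad_b], clip_norm)
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b

# One step from (0, 0) on a single-example batch (x=1, y=2).
w, b = sgd_step(0.0, 0.0, [(1.0, 2.0)], learning_rate=0.1)
print(w, b)
```

Because the raw gradient here has a norm larger than 5, the update is scaled back so that the step length equals learning_rate * 5.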

Input (preprocessing) function

With the model defined, we need to feed data into it. That takes an input function, which does the following:
1. Converts the data loaded from the CSV into a dict of NumPy arrays.
2. Splits the data into batches of batch_size and repeats it for the given number of epochs (num_epochs), that is, the number of passes over the data.
3. Builds an iterator, which the trainer calls to fetch the next batch of data.

  def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
      """Trains a linear regression model of one feature.

      Args:
        features: pandas DataFrame of features
        targets: pandas DataFrame of targets
        batch_size: Size of batches to be passed to the model
        shuffle: True or False. Whether to shuffle the data.
        num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
      Returns:
        Tuple of (features, labels) for next data batch
      """
      # Convert pandas data into a dict of np arrays.
      features = {key: np.array(value) for key, value in dict(features).items()}
      # Construct a dataset, and configure batching/repeating.
      ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
      ds = ds.batch(batch_size).repeat(num_epochs)
      # Shuffle the data, if specified.
      if shuffle:
          ds = ds.shuffle(buffer_size=10000)
      # Return the next batch of data.
      features, labels = ds.make_one_shot_iterator().get_next()
      return features, labels
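The batching and repeating behaviour can be imitated with a plain Python generator. This is only a sketch of the idea, not how tf.data works internally, and `batches` is a hypothetical helper. One deliberate difference: the code above batches first and shuffles afterwards, which shuffles whole batches, whereas this sketch shuffles individual examples at the start of each epoch.

```python
import itertools
import random

def batches(features, targets, batch_size=1, shuffle=True, num_epochs=None, seed=None):
    """Yield (feature_batch, label_batch) pairs, roughly mimicking my_input_fn."""
    rng = random.Random(seed)
    n = len(targets)
    # num_epochs=None means "repeat forever", as in the input function above.
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(range(n))
        if shuffle:
            rng.shuffle(order)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            feature_batch = {k: [v[i] for i in idx] for k, v in features.items()}
            label_batch = [targets[i] for i in idx]
            yield feature_batch, label_batch

# Toy data standing in for the real columns.
features = {"total_rooms": [10.0, 20.0, 30.0, 40.0, 50.0]}
targets = [1.0, 2.0, 3.0, 4.0, 5.0]

# One pass without shuffling: three batches, the last one smaller.
one_pass = list(batches(features, targets, batch_size=2, shuffle=False, num_epochs=1))
for feature_batch, label_batch in one_pass:
    print(feature_batch, label_batch)

# With num_epochs=None the generator never ends, so take only a few batches.
endless = list(itertools.islice(batches(features, targets, batch_size=2, seed=0), 7))
print(len(endless))
```

This also shows why prediction uses num_epochs=1 and shuffle=False: each example should be seen exactly once and in a known order.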

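A word on the first step, the conversion to a dict of arrays. Data loaded through pandas is a pandas DataFrame.
A DataFrame is much like a data frame in R: it has an index column, and the data is laid out in columns, one per feature.

What the input function uses instead is a dict of arrays: each key is a column name and each value is a NumPy array holding that column's values. The content is the same; only the form differs.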
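The "same content, different form" point can be shown without pandas. The sketch below uses plain lists where the real code uses NumPy arrays, and the values are illustrative; `to_column_dict` is a hypothetical helper, not part of any library.

```python
# A small table written row by row (illustrative values).
rows = [
    {"total_rooms": 5612.0, "median_house_value": 66.9},
    {"total_rooms": 7650.0, "median_house_value": 80.1},
    {"total_rooms": 720.0, "median_house_value": 85.7},
]

def to_column_dict(records):
    """Turn a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# The dict-of-columns form: one key per feature, one sequence per key.
columns = to_column_dict(rows)
print(columns["total_rooms"])

# Going back to rows shows that no information was lost.
rows_again = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
print(rows_again == rows)
```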

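Training the model

At this point things are simple; just call train():

  _ = linear_regressor.train(input_fn=lambda: my_input_fn(my_feature, targets), steps=100)

The lambda wrapper is there because train() calls input_fn itself and does not pass our data to it. The lambda binds my_feature and targets ahead of time, so what train() receives is a function it can call with no arguments.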
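If lambda is unfamiliar, this small sketch may help. `make_batch` and `call_input_fn` are hypothetical stand-ins for my_input_fn and train(); the point is only that the caller invokes the function with no arguments, so the arguments have to be bound beforehand. `functools.partial` from the standard library does the same job.

```python
from functools import partial

def make_batch(features, targets, batch_size=1):
    """Hypothetical stand-in for my_input_fn: returns the first batch."""
    return features[:batch_size], targets[:batch_size]

def call_input_fn(input_fn):
    """Hypothetical stand-in for train(): calls input_fn with no arguments."""
    return input_fn()

features = [1.0, 2.0, 3.0]
targets = [10.0, 20.0, 30.0]

# A lambda is an anonymous function; here it takes no parameters and
# calls make_batch with the arguments we chose.
via_lambda = call_input_fn(lambda: make_batch(features, targets, batch_size=2))

# partial builds an equivalent zero-argument callable.
via_partial = call_input_fn(partial(make_batch, features, targets, batch_size=2))

print(via_lambda)
print(via_partial)
```

Passing `make_batch(features, targets)` directly would not work, because that hands over the result of the call rather than a function.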

Evaluating the model

OK, the model is trained. Now we can make predictions and evaluate the model.
The data to predict on needs an input function as well:

  # Since we're making just one prediction for each example, we don't need to repeat or shuffle the data here.
  prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

Now make the predictions:

  # Call predict() on the linear_regressor to make predictions.
  predictions = linear_regressor.predict(input_fn=prediction_input_fn)

Next we assess how good the predictions are. For every example we have the actual label and the predicted value; the mean squared error (MSE) between them, and its square root (RMSE), describe the prediction quality well:

  # Format predictions as a NumPy array, so we can calculate error metrics.
  predictions = np.array([item['predictions'][0] for item in predictions])
  # Print Mean Squared Error and Root Mean Squared Error.
  mean_squared_error = metrics.mean_squared_error(predictions, targets)
  root_mean_squared_error = math.sqrt(mean_squared_error)
  print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
  print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
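For reference, the two metrics are simple enough to compute by hand. The sketch below is a plain-Python version with made-up numbers; the local `mean_squared_error` function is my own illustration and not scikit-learn's implementation.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions, in thousands like median_house_value.
targets = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 330.0]

mse = mean_squared_error(targets, predictions)
# RMSE is in the same units as the label, which makes it easier to interpret.
rmse = math.sqrt(mse)
print("MSE: %0.3f, RMSE: %0.3f" % (mse, rmse))
```

A useful habit is to compare the RMSE with the min and max of the label from describe(): an RMSE that spans a large part of that range means the model is not predicting well yet.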

If you feel like it, you can also take a sample of the data and plot the fitted line against a scatter of the actual values, which shows the error directly:

  # Take a random sample of rows so the scatter plot stays readable (n=300 is an assumed size).
  sample = california_housing_dataframe.sample(n=300)
  # Get the min and max total_rooms values.
  x_0 = sample["total_rooms"].min()
  x_1 = sample["total_rooms"].max()
  # Retrieve the final weight and bias generated during training.
  weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
  bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
  # Get the predicted median_house_values for the min and max total_rooms values.
  y_0 = weight * x_0 + bias
  y_1 = weight * x_1 + bias
  # Plot our regression line from (x_0, y_0) to (x_1, y_1).
  plt.plot([x_0, x_1], [y_0, y_1], c='r')
  # Label the graph axes.
  plt.ylabel("median_house_value")
  plt.xlabel("total_rooms")
  # Plot a scatter plot from our data sample.
  plt.scatter(sample["total_rooms"], sample["median_house_value"])
  # Display graph.
  plt.show()

I would have liked to post the plot, but Google's CSV suddenly became unreachable (orz), so I'll skip it.

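Practice

We've done quite a bit of work building and evaluating the model. Can all of it be wrapped into a single function?
Make learning_rate an adjustable parameter, and report progress while the model runs (hint: split steps into steps_per_period, then compute and print the loss in each period).

Add output at the start and end of training to make the process easier to follow.

Over.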
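As a starting point for the exercise, here is a dependency-free sketch of the period idea. It is not the TensorFlow solution: it runs plain full-batch gradient descent on toy data that follows y = 2x + 1, and `train_in_periods` and `rmse` are hypothetical names. The structure (steps split into periods, loss printed per period, messages at the start and end) is what carries over.

```python
import math

def rmse(weight, bias, data):
    """Root mean squared error of y = weight * x + bias on (x, y) pairs."""
    return math.sqrt(sum((weight * x + bias - y) ** 2 for x, y in data) / len(data))

def train_in_periods(data, learning_rate, steps, periods=10):
    """Full-batch gradient descent, reporting the RMSE after each period."""
    steps_per_period = steps // periods
    weight, bias = 0.0, 0.0
    history = []
    n = len(data)
    print("Training model...")
    for period in range(periods):
        for _ in range(steps_per_period):
            # Gradients of the mean squared error.
            grad_w = sum(2 * (weight * x + bias - y) * x for x, y in data) / n
            grad_b = sum(2 * (weight * x + bias - y) for x, y in data) / n
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
        loss = rmse(weight, bias, data)
        history.append(loss)
        print("  period %02d : %0.3f" % (period, loss))
    print("Model training finished.")
    return weight, bias, history

# Toy data generated from y = 2x + 1.
toy_data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
weight, bias, history = train_in_periods(toy_data, learning_rate=0.05, steps=200, periods=10)
print("weight=%0.3f bias=%0.3f" % (weight, bias))
```

In the TensorFlow version, the inner loop would become a call to linear_regressor.train(..., steps=steps_per_period), followed by predict() and the RMSE computation for that period.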

