Data Analysis and Modeling on the Spark and pandas Platforms: Predicting Bike-Sharing Rental Counts

Abstract


This is a modeling and prediction project based on cloud computing.

We used a historical bike-sharing usage data set for the Washington, D.C. area from Kaggle, analyzed and modeled the data on the Apache Spark cloud computing platform together with the Python data-processing stack, and finally predicted bike-sharing rental demand for the area.

Data Set Overview: The selected data set consists of a training set and a test set:

  • The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month.

  • The training set contains 12 attributes, including datetime, season, holiday and others. The test set lacks the casual, registered and count attributes.

Model evaluation and application: Three models are used in this project: Multiple Linear Regression, K-Nearest Neighbors and Random Forest. Using each model's score method, the scores on the training data are 0.3893, 0.1919 and 0.9926, respectively. The random forest scores highest, so we choose it to predict the test set, and finally save the prediction output as test_pred.csv.

Through this project, we have solved the problems proposed at the beginning, gained a new understanding of data processing and model analysis, and also found new problems and challenges in the process of project execution:

  1. The factors affecting the number of rentals are not a single variable; the features jointly determine the rental count. At the same time, many features are correlated with each other (such as temp, humidity, windspeed and atemp) and influence one another. Of course, some variables are interference terms with no direct relationship to the rental count (their correlation is very low). Therefore, exploratory data analysis is essential during the modeling process.

  2. The data preprocessing step is critical: it directly affects the overall analysis and even the model's predictions. Simply checking for missing values, deduplicating, and removing outliers often does not noticeably improve the model fit; instead, insight into the intrinsic characteristics of the data gained from visualization, combined with transformations suited to those characteristics (such as a log transform), is what makes the model's predictions more accurate.

  3. In this project we also ran into a number of problems. For example, Spark's DataFrame could not conveniently perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization. Our analysis and modeling process also has deficiencies, for example:

    • Duplicate values, missing values and the like in the original data set need to be fully handled before modeling;
    • After modeling, we found that the impact of some variables on rentals is not well captured by the average value; the cumulative value should be used for those statistics instead;

Packages imported: pyspark.sql.functions, pyspark.context (SparkContext), pyspark.sql.session (SparkSession), seaborn, matplotlib.pyplot, warnings, numpy, pandas, datetime (datetime), sklearn.ensemble (RandomForestRegressor), sklearn.neighbors (KNeighborsClassifier), sklearn.linear_model (LinearRegression), sklearn.model_selection (train_test_split)

1. Introduction


1.1 Project background

The bike-sharing system is a way of renting bicycles: registration, rental and return are all handled through a city-wide network of self-service terminals, which automatically capture rental and return data.
Through this system, people can rent a bike in one place and return it in another.

1.2 Project requirements

The data generated by the system records each ride's start time, departure point, arrival point and duration.
In this project, based on the historical usage data, we analyzed the impact of natural and human factors such as weather and time on the number of shared-bike rentals, in order to predict shared-bike rental demand in the Washington area.

1.3 Methods and techniques

In this project, we mainly used the Apache Spark cloud computing platform and the Python processing stack, working with basic Spark DataFrame and pandas DataFrame operations. We applied exploratory data analysis (data cleaning, data description, inspecting distributions, comparing relationships between variables, and summarizing the data) together with data visualization, using charts to present the EDA results; this gives a more intuitive view of the real distribution of the data and reveals hidden patterns, which in turn suggests a model suited to the data.

In the process of modeling and analysis, we used Multiple Linear Regression, Random Forest and KNN (a brief instantiation sketch follows this list):

  1. Multiple Linear Regression: when multiple independent variables affect one dependent variable, the problem can be addressed with multiple regression analysis. Multiple regression analysis is a statistical method that takes one variable as the dependent variable and one or more variables as independent variables, establishes a linear or nonlinear quantitative relationship among the variables, and estimates it from sample data.
  2. Random Forest: a random forest is a classifier composed of multiple decision trees; its basic unit is the decision tree, and its output category is the mode of the individual trees' outputs. Random forests run efficiently on large data sets and can handle input samples with high-dimensional features without dimensionality reduction, with excellent accuracy.
  3. KNN: KNN is the k-nearest-neighbors classification algorithm, meaning each sample can be represented by its k nearest neighbors: if most of the k most similar samples in feature space belong to a certain category, the sample also belongs to that category. KNN is well suited to automatic classification of class domains with large sample sizes, but easily produces errors in class domains with small samples.
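
As a preview of how these three models are used in scikit-learn, here is a minimal, self-contained sketch (the synthetic data is purely illustrative and is not the project's data):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic stand-in for the real features and target
rng = np.random.RandomState(0)
X = rng.rand(100, 3)                            # 100 samples, 3 features
y = (10 * X[:, 0] + rng.rand(100)).astype(int)  # integer, count-like target

for model in (LinearRegression(),
              KNeighborsClassifier(),    # a classifier, as used later in Section 5.2
              RandomForestRegressor()):
    model.fit(X, y)
    # score() returns R^2 for the regressors and accuracy for the classifier
    print(type(model).__name__, model.score(X, y))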

2. Problem Definition


In this project, we need to explore the question: "What factors affect the use of shared bikes?"

We need to predict the total number of shared-bike rentals from feature values such as weather in the test set.

3. Data


Before analyzing the data, we need to have a certain understanding of the data in the data set, which will help us to choose the appropriate model later.

In this module, we present the following aspects:

  1. Data source and attribute interpretation;
  2. Data type;
  3. The main structure of the dataset;
  4. Data set size (rows and columns);
  5. Descriptive statistics for the data set;

3.1 Data source

The source of the data set is https://www.kaggle.com/c/bike-sharing-demand/data. The data set consists of the training set and the test set. The training set consists of data from the first 19 days of the month, and the test set consists of data from the 20th day of the month to the end of the month. The training set contains 12 attributes, and the test set lacks the casual, count and registered attributes.

The following table shows the 12 attributes and their explanations:

Attribute    Explanation
datetime     Timestamp - year/month/day/hour
season       1: spring; 2: summer; 3: autumn; 4: winter
holiday      Is it a holiday? 0: no; 1: yes
workingday   Is it a working day? 0: no; 1: yes
weather      1: sunny; 2: cloudy; 3: light rain or light snow; 4: severe weather (heavy rain, hail or blizzard)
temp         Actual temperature (Celsius)
atemp        Feels-like temperature (Celsius)
humidity     Humidity
windspeed    Wind speed
casual       Number of rentals by unregistered users
registered   Number of rentals by registered users
count        Total number of rentals

3.2 Data set format

Import the packages and load the data:

from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')  # run Spark in local mode
sqlContext = SparkSession(sc)
spark = SparkSession.builder.appName('Final_project').getOrCreate()
# Read the CSVs with a header row, letting Spark infer the column types
train = spark.read.csv('file:///home/ljm/project/train.csv', header = True, inferSchema = True)
test = spark.read.csv('file:///home/ljm/project/test.csv', header = True, inferSchema = True)

Column data types in the training set:

train.dtypes
[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double'),
 ('casual', 'int'),
 ('registered', 'int'),
 ('count', 'int')]

Preview of the training set:

train.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00|     1|      0|         0|      1| 9.84|14.395|      81|      0.0|     3|        13|   16|
|2011-01-01 01:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     8|        32|   40|
|2011-01-01 02:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     5|        27|   32|
|2011-01-01 03:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     3|        10|   13|
|2011-01-01 04:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     0|         1|    1|
|2011-01-01 05:00:00|     1|      0|         0|      2| 9.84| 12.88|      75|   6.0032|     0|         1|    1|
|2011-01-01 06:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     2|         0|    2|
|2011-01-01 07:00:00|     1|      0|         0|      1|  8.2| 12.88|      86|      0.0|     1|         2|    3|
|2011-01-01 08:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     1|         7|    8|
|2011-01-01 09:00:00|     1|      0|         0|      1|13.12|17.425|      76|      0.0|     8|         6|   14|
|2011-01-01 10:00:00|     1|      0|         0|      1|15.58|19.695|      76|  16.9979|    12|        24|   36|
|2011-01-01 11:00:00|     1|      0|         0|      1|14.76|16.665|      81|  19.0012|    26|        30|   56|
|2011-01-01 12:00:00|     1|      0|         0|      1|17.22| 21.21|      77|  19.0012|    29|        55|   84|
|2011-01-01 13:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.9995|    47|        47|   94|
|2011-01-01 14:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.0012|    35|        71|  106|
|2011-01-01 15:00:00|     1|      0|         0|      2|18.04| 21.97|      77|  19.9995|    40|        70|  110|
|2011-01-01 16:00:00|     1|      0|         0|      2|17.22| 21.21|      82|  19.9995|    41|        52|   93|
|2011-01-01 17:00:00|     1|      0|         0|      2|18.04| 21.97|      82|  19.0012|    15|        52|   67|
|2011-01-01 18:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     9|        26|   35|
|2011-01-01 19:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     6|        31|   37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows

Column data types in the test set:

test.dtypes
[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double')]

Preview of the test set:

test.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|2011-01-20 00:00:00|     1|      0|         1|      1|10.66|11.365|      56|  26.0027|
|2011-01-20 01:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 02:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 03:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 04:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 05:00:00|     1|      0|         1|      1| 9.84|11.365|      60|  15.0013|
|2011-01-20 06:00:00|     1|      0|         1|      1| 9.02|10.605|      60|  15.0013|
|2011-01-20 07:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  15.0013|
|2011-01-20 08:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  19.0012|
|2011-01-20 09:00:00|     1|      0|         1|      2| 9.84|11.365|      52|  15.0013|
|2011-01-20 10:00:00|     1|      0|         1|      1|10.66|11.365|      48|  19.9995|
|2011-01-20 11:00:00|     1|      0|         1|      2|11.48|13.635|      45|  11.0014|
|2011-01-20 12:00:00|     1|      0|         1|      2| 12.3|16.665|      42|      0.0|
|2011-01-20 13:00:00|     1|      0|         1|      2|11.48|14.395|      45|   7.0015|
|2011-01-20 14:00:00|     1|      0|         1|      2| 12.3| 15.15|      45|   8.9981|
|2011-01-20 15:00:00|     1|      0|         1|      2|13.12| 15.91|      45|   12.998|
|2011-01-20 16:00:00|     1|      0|         1|      2| 12.3| 15.15|      49|   8.9981|
|2011-01-20 17:00:00|     1|      0|         1|      2| 12.3| 15.91|      49|   7.0015|
|2011-01-20 18:00:00|     1|      0|         1|      2|10.66| 12.88|      56|   12.998|
|2011-01-20 19:00:00|     1|      0|         1|      1|10.66|11.365|      56|  22.0028|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
only showing top 20 rows

3.3 Data set size

print("Train dataset:", "The number of rows:", train.count(), "\n", "              The number of columns:", len(train.columns))
print("Test dataset :", "The number of rows:", test.count(), "\n", "              The number of columns:", len(test.columns))
Train dataset: The number of rows: 10886 
               The number of columns: 12
Test dataset : The number of rows: 6493 
               The number of columns: 9

Descriptive statistics for the training set:

train.describe().show()
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|summary|           datetime|            season|            holiday|        workingday|           weather|              temp|            atemp|          humidity|         windspeed|           casual|        registered|             count|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|  count|              10886|             10886|              10886|             10886|             10886|             10886|            10886|             10886|             10886|            10886|             10886|             10886|
|   mean|               null|2.5066139996325556|0.02856880396839978|0.6808745177291935| 1.418427337865148|20.230859819952173|23.65508405291192| 61.88645967297446|12.799395406945093|36.02195480433584| 155.5521771082124|191.57413191254824|
| stddev|               null|1.1161743093443237|0.16659885062470944|0.4661591687997361|0.6338385858190968| 7.791589843987573| 8.47460062648494|19.245033277394704|  8.16453732683871|49.96047657264955|151.03903308192452|181.14445383028493|
|    min|2011-01-01 00:00:00|                 1|                  0|                 0|                 1|              0.82|             0.76|                 0|               0.0|                0|                 0|                 1|
|    max|2012-12-19 23:00:00|                 4|                  1|                 1|                 4|              41.0|           45.455|               100|           56.9969|              367|               886|               977|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+

Descriptive statistics for the test set:

test.describe().show()
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|summary|           datetime|            season|             holiday|         workingday|           weather|              temp|             atemp|         humidity|        windspeed|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|  count|               6493|              6493|                6493|               6493|              6493|              6493|              6493|             6493|             6493|
|   mean|               null|  2.49330047743724|0.029108270445094717| 0.6858154936085015|1.4367780686893579|20.620606807330972|24.012864623440585| 64.1252117665178|12.63115720006173|
| stddev|               null|1.0912579418644106| 0.16812296760854603|0.46422601479880476|0.6483898010717418| 8.059583026412682| 8.782741298669094|19.29339098607345|8.250151174075594|
|    min|2011-01-20 00:00:00|                 1|                   0|                  0|                 1|              0.82|               0.0|               16|              0.0|
|    max|2012-12-31 23:00:00|                 4|                   1|                  1|                 4|             40.18|              50.0|              100|          55.9986|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+

The dataset “test.csv” has no casual, registered or count columns, meaning these three attributes could all be predicted; in this project we only choose to predict count.

4. Exploratory Data Analysis


In exploratory data analysis, we did the following:

  1. Check the data: are there missing values? Are there outliers? Are there duplicate values? Do any variables need to be converted? If so, perform data cleaning, conversion, etc.;
  2. Describe the data using descriptive statistics and charts;
  3. Examine the relationships between variables.

First, import the plotting and analysis packages:
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages
import numpy as np
import pandas as pd
from datetime import datetime
%matplotlib inline

4.1 Data cleaning

Check for missing values and delete them

train.select('count').describe().show()
+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             10886|
|   mean|191.57413191254824|
| stddev|181.14445383028493|
|    min|                 1|
|    max|               977|
+-------+------------------+
train.dropna().show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00|     1|      0|         0|      1| 9.84|14.395|      81|      0.0|     3|        13|   16|
|2011-01-01 01:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     8|        32|   40|
|2011-01-01 02:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     5|        27|   32|
|2011-01-01 03:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     3|        10|   13|
|2011-01-01 04:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     0|         1|    1|
|2011-01-01 05:00:00|     1|      0|         0|      2| 9.84| 12.88|      75|   6.0032|     0|         1|    1|
|2011-01-01 06:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     2|         0|    2|
|2011-01-01 07:00:00|     1|      0|         0|      1|  8.2| 12.88|      86|      0.0|     1|         2|    3|
|2011-01-01 08:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     1|         7|    8|
|2011-01-01 09:00:00|     1|      0|         0|      1|13.12|17.425|      76|      0.0|     8|         6|   14|
|2011-01-01 10:00:00|     1|      0|         0|      1|15.58|19.695|      76|  16.9979|    12|        24|   36|
|2011-01-01 11:00:00|     1|      0|         0|      1|14.76|16.665|      81|  19.0012|    26|        30|   56|
|2011-01-01 12:00:00|     1|      0|         0|      1|17.22| 21.21|      77|  19.0012|    29|        55|   84|
|2011-01-01 13:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.9995|    47|        47|   94|
|2011-01-01 14:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.0012|    35|        71|  106|
|2011-01-01 15:00:00|     1|      0|         0|      2|18.04| 21.97|      77|  19.9995|    40|        70|  110|
|2011-01-01 16:00:00|     1|      0|         0|      2|17.22| 21.21|      82|  19.9995|    41|        52|   93|
|2011-01-01 17:00:00|     1|      0|         0|      2|18.04| 21.97|      82|  19.0012|    15|        52|   67|
|2011-01-01 18:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     9|        26|   35|
|2011-01-01 19:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     6|        31|   37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
print("There are", train.count() - train.select("count").dropna().count(),"rows were dropped at the previous step.")
There are 0 rows were dropped at the previous step.
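
We can also check for duplicate rows on the Spark side (a brief sketch; dropDuplicates() returns a new DataFrame containing only distinct rows):

print("Duplicate rows:", train.count() - train.dropDuplicates().count())  # expected to print 0 for this dataset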

To facilitate manipulation and plotting, we convert the Spark DataFrame to a pandas DataFrame:

train1 = train.toPandas()

Now check again for missing data

train1.info()

RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int32  
 2   holiday     10886 non-null  int32  
 3   workingday  10886 non-null  int32  
 4   weather     10886 non-null  int32  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int32  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int32  
 10  registered  10886 non-null  int32  
 11  count       10886 non-null  int32  
dtypes: float64(3), int32(8), object(1)
memory usage: 680.5+ KB

There are no missing values in the train dataset.
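
An equivalent one-line check on the pandas side (illustrative):

print(train1.isnull().sum().sum())  # total number of missing cells; 0 here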

4.2 Data transform

Check the probability density distribution of count in the train dataset:

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1)
fig.set_size_inches(12,5)

sn.distplot(train1['count'])
ax.set(xlabel='count',title='Distribution of count')
[Text(0.5, 0, 'count'), Text(0.5, 1.0, 'Distribution of count')]

[Figure: probability density distribution of count]

As can be seen from the probability density plot, the data fluctuate greatly (the distribution is strongly right-skewed), and modeling on this basis easily leads to overfitting.
So we transform the data to make it relatively stable.
We chose a logarithmic transformation.

yLabels=train1['count']
yLabels_log=np.log(yLabels)
sn.distplot(yLabels_log)
plt.title("The probability density distribution after logarithmic transformation")
Text(0.5, 1.0, 'The probability density distribution after logarithmic transformation')

[Figure: probability density distribution of count after the logarithmic transformation]

After the logarithmic transformation, the data distribution is more uniform and the differences in magnitude are reduced, so using such labels is more helpful for training the model.
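
As a quick sanity check, we can confirm that the transformation is invertible and actually reduces skewness (an illustrative sketch; scipy is assumed to be available):

from scipy.stats import skew
print('Skewness before:', skew(yLabels))      # right-skewed raw counts
print('Skewness after :', skew(yLabels_log))  # skewness is reduced after the log transform
# np.exp undoes np.log, so predictions made in log space can be mapped back to counts
assert np.allclose(np.exp(yLabels_log), yLabels)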

4.3 Variable types and conversion

The original train dataset:

  1. Numeric (used directly):
    • temp
    • atemp
    • humidity
    • windspeed
    • casual
    • registered
    • count
  2. Time series:
    • datetime: split into separate year, month, day, hour and weekday fields
  3. Categorical (categories encoded as numbers):
    • season: 1: spring; 2: summer; 3: autumn; 4: winter
    • holiday: is it a holiday? 0: no; 1: yes
    • workingday: is it a working day? 0: no; 1: yes
    • weather: 1: sunny; 2: cloudy; 3: light rain or light snow; 4: severe weather

Since Spark's DataFrame does not conveniently support method operations on individual columns, the data processing below is done on the pandas DataFrame.

train1['date'] = train1.datetime.apply(lambda a: a.split()[0]) 
train1['year'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
train1['month'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
train1['day'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
train1['weekday'] = train1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
train1['hour'] = train1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
train1.drop('datetime', axis = 1, inplace = True)
train1
season holiday workingday weather temp atemp humidity windspeed casual registered count date year month day weekday hour
0 1 0 0 1 9.84 14.395 81 0.0000 3 13 16 2011-01-01 2011 1 1 6 0
1 1 0 0 1 9.02 13.635 80 0.0000 8 32 40 2011-01-01 2011 1 1 6 1
2 1 0 0 1 9.02 13.635 80 0.0000 5 27 32 2011-01-01 2011 1 1 6 2
3 1 0 0 1 9.84 14.395 75 0.0000 3 10 13 2011-01-01 2011 1 1 6 3
4 1 0 0 1 9.84 14.395 75 0.0000 0 1 1 2011-01-01 2011 1 1 6 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10881 4 0 1 1 15.58 19.695 50 26.0027 7 329 336 2012-12-19 2012 12 19 3 19
10882 4 0 1 1 14.76 17.425 57 15.0013 10 231 241 2012-12-19 2012 12 19 3 20
10883 4 0 1 1 13.94 15.910 61 15.0013 4 164 168 2012-12-19 2012 12 19 3 21
10884 4 0 1 1 13.94 17.425 61 6.0032 12 117 129 2012-12-19 2012 12 19 3 22
10885 4 0 1 1 13.12 16.665 66 8.9981 4 84 88 2012-12-19 2012 12 19 3 23

10886 rows × 17 columns

(We saved the processed dataframe as train1.csv for import)
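
As an aside, an equivalent and arguably more idiomatic way to derive these fields is pandas' datetime accessor (a sketch; it re-reads the raw timestamp strings, since train1 has already dropped that column):

ts = pd.to_datetime(train.toPandas()['datetime'])    # parse the timestamp strings once
alt = pd.DataFrame({'year': ts.dt.year,
                    'month': ts.dt.month,
                    'day': ts.dt.day,
                    'weekday': ts.dt.dayofweek + 1,  # Monday=1 ... Sunday=7, matching isoweekday()
                    'hour': ts.dt.hour})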

With the data tidied, we calculate the correlation coefficients and check the correlation between each feature and count:

train2 = spark.read.csv('file:///home/ljm/project/train1.csv', header = True, inferSchema = True)
corrDF = train2.select([corr('count', 'season'), corr('count', 'holiday'), corr('count', 'workingday'), corr('count', 'weather'),
                       corr('count', 'temp'), corr('count', 'atemp'), corr('count', 'humidity'), corr('count', 'windspeed'),
                       corr('count', 'casual'), corr('count', 'registered'), corr('count', 'count'), corr('count', 'year'), 
                       corr('count', 'month'), corr('count', 'day'), corr('count', 'weekday'), corr('count', 'hour')])
corrDF.show()
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
|corr(count, season)|corr(count, holiday)|corr(count, workingday)|corr(count, weather)| corr(count, temp)| corr(count, atemp)|corr(count, humidity)|corr(count, windspeed)|corr(count, casual)|corr(count, registered)|corr(count, count)|  corr(count, year)| corr(count, month)|    corr(count, day)|corr(count, weekday)|  corr(count, hour)|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
| 0.1634390165763605|-0.00539298447777...|   0.011593866091574893|-0.12865520103850572|0.3944536449672519|0.38978443662697554| -0.31737147887659584|   0.10136947021033213| 0.6904135653286751|     0.9709481058098266|               1.0|0.26040329737852264|0.16686223209772807|0.019825777342373795|-0.00228340038070...|0.40060119414684714|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+

To show the relationships among the features more intuitively, we draw a heat map of the correlation coefficients:

corrDf = train1.corr()
mask = np.array(corrDf)
mask[np.tril_indices_from(mask)] = False   # keep the lower triangle visible
fig = plt.figure(figsize=(16, 16))
# cells where mask is truthy are hidden, so each correlation pair is drawn only once
sn.heatmap(corrDf, mask=mask, annot=True, square=True)
plt.title("Heat map of the correlation coefficient between the features")
Text(0.5, 1.0, 'Heat map of the correlation coefficient between the features')

[Figure 1: heat map of the correlation coefficients between the features]

As can be seen from the correlation coefficients, hour, temp and atemp have an obvious influence on count, and the correlation coefficients of temp and atemp with count are very close, so we can keep only temp for analysis. Year, month, season, windspeed, weather and humidity also have noticeable effects on count, while the correlations between day, workingday, weekday, holiday and count are extremely small.
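
To make this selection reproducible, the features can be ranked by the absolute value of their correlation with count (an illustrative sketch; the 0.1 cutoff is our own choice, not part of the original analysis):

corr_with_count = train1.corr()['count'].drop(['count', 'casual', 'registered'])
print(corr_with_count.abs().sort_values(ascending=False))
# Keep features whose absolute correlation with count exceeds the cutoff
candidates = corr_with_count[corr_with_count.abs() > 0.1].index.tolist()
print(candidates)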

4.4 Analyze in depth how each feature influences count, with visualizations

Next, the numerical data need to be processed. Since these data appear in both the train and test sets, we first merge the two data sets for convenience.

test1 = test.toPandas()

Perform the same split operation on the test1 dataset:

test1['date'] = test1.datetime.apply(lambda a: a.split()[0]) 
test1['year'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
test1['month'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
test1['day'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
test1['weekday'] = test1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
test1['hour'] = test1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
test1.drop('datetime', axis = 1, inplace = True)
full = train1.append(test1, ignore_index = True )
print('The merged dataset:', full.shape)
The merged dataset: (17379, 17)
full.head()
season holiday workingday weather temp atemp humidity windspeed casual registered count date year month day weekday hour
0 1 0 0 1 9.84 14.395 81 0.0 3.0 13.0 16.0 2011-01-01 2011 1 1 6 0
1 1 0 0 1 9.02 13.635 80 0.0 8.0 32.0 40.0 2011-01-01 2011 1 1 6 1
2 1 0 0 1 9.02 13.635 80 0.0 5.0 27.0 32.0 2011-01-01 2011 1 1 6 2
3 1 0 0 1 9.84 14.395 75 0.0 3.0 10.0 13.0 2011-01-01 2011 1 1 6 3
4 1 0 0 1 9.84 14.395 75 0.0 0.0 1.0 1.0 2011-01-01 2011 1 1 6 4

① Time dimension

1.1 year

sn.boxplot(full['year'], full['count'])
plt.title('The influence of year')
plt.show()

[Figure 2: boxplot of count by year]

The number of rentals in 2012 is higher than in 2011, indicating that, over time, more and more people became familiar with and accepting of shared bikes, and the number of users gradually increased.

1.2 month

sn.pointplot(full['month'], full['count'])
plt.title('The influence of month')
plt.show()

[Figure 3: count by month]

It can be seen that month has a significant impact on the number of shared-bike rentals: counts increase month by month from January to June, stay near the maximum from June to October, and decrease month by month from October to December, showing strong seasonality.

1.3 season

sn.boxplot(full['season'], full['count'])
plt.title('The influence of season')
plt.show()

[Figure 4: boxplot of count by season]

It can be seen that the relationship between the number of users and the seasons is: autumn > summer > winter > spring. Whether this is driven by temperature, humidity, wind speed, weather or other factors requires further analysis combining those features.

1.4 hour

sn.pointplot(full['hour'], full['count'], color = 'orange')
plt.title('The influence of hour')
plt.show()

[Figure 5: count by hour]

As can be seen from the figure above, there are two rental peaks around 8:00 and 17:00, clearly corresponding to rush hours. However, a week divides into working days and rest days, so we need to combine the workingday and holiday features for a deeper comparative analysis.

② weather

sn.barplot(full['weather'] , full['count'])
plt.title('The influence of weather')
plt.show()

[Figure 6: count by weather]

As can be seen from the figure, the number of rentals is largest on sunny days, followed by misty and cloudy days, with severe weather and light snow or rain in third and fourth place. Notably, quite a few people were observed riding even in category 4 (severe) weather.

③ temperature

We first observe the relationship between temp and atemp:

cols = ['temp' , 'atemp']
sn.pairplot(full[cols])
plt.show()

[Figure 7: pairplot of temp and atemp]

Plotting a pairwise correlation graph of several continuous variables lets us compare the relationship between any two of them.

It is clear from the figure that temp and atemp are roughly linearly related, though one group of points deviates significantly from the linear trend, which may be related to humidity and wind speed. The correlation between temp and count is also higher than that of atemp, so the atemp feature can be dropped in the subsequent modeling.

Let’s start with the temperature trend:

sn.pointplot(full['month'], full['temp'], color = 'salmon')
plt.title("Temperature changes with month")
plt.show()

[Figure 8: temperature changes with month]

It can be seen that the temperature is lowest in January and highest in July, showing an overall rise-then-fall trend over the year.

sn.regplot(full['temp'] , full['count'], marker="+", color = 'y')
plt.title('The influence of temp')
plt.show()

[Figure 9: regression of count on temp]

temp_rentals = full.groupby(['temp'], as_index=True).agg({ 'count':'mean'})
temp_rentals.plot(title = 'The average number of rentals initiated per hour changes with the temperature')

[Figure 10: average hourly rentals versus temperature]

It can be seen that temperature and rental count are positively correlated.

As the temperature rises, the number of rentals generally trends upward, but it begins to decline once the temperature exceeds 35 degrees, and it is lowest around 4 degrees.

④ humidity

Let’s start with the humidity trend:

sn.pointplot(full['month'], full['humidity'], color = 'plum')
plt.title("Humidity changes with month")
plt.show()

[Figure 11: humidity changes with month]

Above are the monthly changes in humidity.

Next we look at how the number of renters changes with humidity, averaging the number of rentals at each humidity level.

sn.regplot(full['humidity'] , full['count'], marker="+", color = 'orange')
plt.title('The influence of humidity')
plt.show()

[Figure 12: regression of count on humidity]

humidity_rentals = full.groupby(['humidity'], as_index=True).agg({'count':'mean'})
humidity_rentals.plot (title = 'Average number of rentals initiated  per hour in different humidity', color = 'g')

[Figure 13: average hourly rentals versus humidity]

It can be observed that humidity is negatively correlated with the number of rentals. At humidity around 20, rentals quickly peak, then slowly decline as humidity rises.

⑤ windspeed

Let’s first observe the distribution of windspeed:

sn.distplot(full['windspeed'])
# The original code assigned plt.xlabel=('humidity'), which overwrites the
# plt.xlabel function instead of labeling the axis, and mislabeled the plot;
# this plot shows windspeed, not humidity
plt.title('Distribution of windspeed')
plt.show()

[Figure 14: distribution of windspeed]

The wind speed distribution reveals a problem: there are a great many records with wind speed 0, while inspection and the statistical description show a gap with no values between roughly 1 and 6. We can infer that the data originally had missing values that were filled with 0, and these zero-wind-speed records will interfere with prediction.
Therefore, we use a random forest to fill in the missing wind speeds based on features of the same records, such as year, month, season, temperature and humidity.

from sklearn.ensemble import RandomForestRegressor as RF
full["windspeed_rfr"] = full["windspeed"]
dataWind0 = full[full["windspeed_rfr"]==0]
dataWindNot0 = full[full["windspeed_rfr"]!=0]
# Choose the model
rfModel_wind = RF(n_estimators=1000, random_state=42)
# Choose the features used to predict wind speed
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Fit the RandomForestRegressor on the rows whose wind speed is not zero
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])
# Predict wind speed for the zero rows with the trained model
wind0Values = rfModel_wind.predict(X = dataWind0[windColumns])
# Fill the predictions back into the rows whose wind speed was zero
dataWind0.loc[:,"windspeed_rfr"] = wind0Values
# Concatenate the two pieces again
full = dataWindNot0.append(dataWind0)
full.reset_index(inplace=True)
full.drop('index',inplace=True,axis=1)
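
A quick sanity check that no zero wind speeds remain after the imputation (illustrative; the random forest predictions are expected to fall within the range of the observed non-zero values):

print((full['windspeed_rfr'] == 0).sum())  # expected to print 0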

Re-observe the density distribution of windspeed:

sn.distplot(full['windspeed_rfr'])
plt.title('Distribution of windspeed')
plt.show()

[Figure 15: distribution of windspeed after imputation]

We then looked at how windspeed changes across the months:

sn.pointplot(full['month'], full['windspeed_rfr'], color = 'lightseagreen')
plt.title("Windspeed changes with month")
plt.show()

[Figure 16: windspeed changes with month]

Since very high wind speeds are rare, taking the average would produce anomalies. Therefore, when observing the relationship between wind speed and the number of rentals, we take the maximum number of rentals at each wind speed.

windspeed_rentals = full.groupby(['windspeed'], as_index=True).agg({'count':'max'})
windspeed_rentals.plot(title = 'Max number of rentals initiated per hour in different windspeed', color = 'orange')

[Figure 17: maximum hourly rentals versus windspeed]

It can be seen that the number of rentals decreases as wind speed increases, dropping noticeably once the wind speed exceeds 30, though there is a rebound around 40.

⑥ Date, holiday and working day

fig, axes = plt.subplots(2,1,figsize = (16, 10))
ax1 = plt.subplot(2,1,1)
sn.pointplot(full['hour'] , full['count'] , hue = full['weekday'] , ax = ax1)
ax1.set_title('The influence of hour(weekday)')

ax2 = plt.subplot(2,2,3)
sn.pointplot(full['hour'] , full['count'] , hue = full['workingday'] , ax = ax2)
ax2.set_title('The influence of hour(workingday)')

ax3 = plt.subplot(2,2,4)
sn.pointplot(full['hour'] , full['count'] , hue = full['holiday'] , ax = ax3)
ax3.set_title('The influence of hour(holiday)')
Text(0.5, 1.0, 'The influence of hour(holiday)')

[Figure 18: hourly rentals by weekday, workingday and holiday]

As can be seen from the charts, rentals are high during the morning and evening rush hours on working days and lower at other times, while on holidays rentals are higher at noon and in the afternoon.

5. Modeling and Evaluation


In this module, we score the predictions of the candidate models, select the best-performing one to build the analysis model for the data, and finally obtain predictions for the test set.

5.1 Select eigenvalue

Based on the previous observations, the 11 features hour, temp, humidity, year, month, season, weather, windspeed_rfr, weekday, workingday and holiday are selected as the feature set.

5.2 Accuracy comparison

Let's compare three algorithms, Linear Regression, Random Forest and K-Nearest Neighbors, to see their accuracy (note that the scores below are computed on the training split).

from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split

features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']

def accuracy(y_true, y_pred):
    # simple exact-match accuracy (defined for reference; model.score is used below)
    return np.mean(y_true == y_pred)

def run_exp_on_feature(x_train, y_train, x_test, y_test):
    models = [['Linear Regression ', LR()],
              ['K Nearest Neighbor ', KNN()],
              ['Random Forest Classifier ', RF()]]

    models_score = []
    for name, model in models:
        model.fit(x_train, y_train)
        model_pred = model.predict(x_test)
        # score() is evaluated on the training split here:
        # R^2 for the regressors, classification accuracy for KNN
        models_score.append(model.score(x_train, y_train))

        print(name)
        print('Accuracy:', model.score(x_train, y_train))
        print('---------------------------------------')

    return models_score
                
x_train,x_test,y_train,y_test = train_test_split(train1[features], train1['count'], test_size=0.2, random_state=23)
models_score = run_exp_on_feature(x_train,y_train,x_test,y_test)


name = ['Linear Regression ','K Nearest Neighbor','Random Forest Classifier']
fig, ax = plt.subplots(figsize = (12, 7))
ax.bar(name, models_score, color = 'lightsalmon')
ax.set_facecolor('white')
ax.set_title("The accuracy of each model")
for x, y in zip(name, models_score):
    ax.text(x, y,'%.3f' % y)
Linear Regression 
Accuracy: 0.389327001694181
---------------------------------------
K Nearest Neighbor 
Accuracy: 0.19189251263206247
---------------------------------------
Random Forest Classifier 
Accuracy: 0.9925706830689325
---------------------------------------

[Figure 19: training score of each model]

It can be clearly seen from the chart that the random forest has by far the highest training score, so the random forest model is selected for modeling.
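
One caveat: the scores above are computed on the training split, which strongly favors a low-bias model such as the random forest. A fairer comparison would score on the held-out split; here is a minimal sketch reusing x_test and y_test from the train_test_split above:

rf = RF(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
print('Train R^2:', rf.score(x_train, y_train))
print('Test  R^2:', rf.score(x_test, y_test))  # estimate of generalization performance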

5.3 Training model

features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']

yLabels=train1['count']
# Take a logarithmic transformation
yLabels_log = np.log(yLabels)

rfModel = RF(n_estimators=1000 , random_state = 42)
# The data after logarithmic transformation was used as y_train input model for training
rfModel.fit(train1[features], yLabels_log)
preds1 = rfModel.predict(X = train1[features])
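
Since the Kaggle competition scores submissions with RMSLE, it can also be computed on the training predictions (a sketch; because preds1 and yLabels_log are both already in log space, this uses log rather than log1p, a close approximation when counts are well above zero):

rmsle = np.sqrt(np.mean((preds1 - yLabels_log) ** 2))
print('Approximate training RMSLE:', rmsle)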

5.4 Forecast test set data

preds = rfModel.predict(X = test1[features])
# The predicted results are exponentially transformed and output
test1.loc[:,"count"] = np.exp(preds)
test1
season holiday workingday weather temp atemp humidity windspeed date year month day weekday hour count
0 1 0 1 1 10.66 11.365 56 26.0027 2011-01-20 2011 1 20 4 0 10.245482
1 1 0 1 1 10.66 13.635 56 0.0000 2011-01-20 2011 1 20 4 1 4.479768
2 1 0 1 1 10.66 13.635 56 0.0000 2011-01-20 2011 1 20 4 2 2.858717
3 1 0 1 1 10.66 12.880 56 11.0014 2011-01-20 2011 1 20 4 3 2.813818
4 1 0 1 1 10.66 12.880 56 11.0014 2011-01-20 2011 1 20 4 4 2.444608
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6488 1 0 1 2 10.66 12.880 60 11.0014 2012-12-31 2012 12 31 1 19 285.500426
6489 1 0 1 2 10.66 12.880 60 11.0014 2012-12-31 2012 12 31 1 20 194.329977
6490 1 0 1 1 10.66 12.880 60 11.0014 2012-12-31 2012 12 31 1 21 149.449278
6491 1 0 1 1 10.66 13.635 56 8.9981 2012-12-31 2012 12 31 1 22 106.499478
6492 1 0 1 1 10.66 13.635 65 8.9981 2012-12-31 2012 12 31 1 23 57.849388

6493 rows × 15 columns

Finally, we save the result as “test_pred.csv”.
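
A minimal sketch of that save step (the columns written are an assumption; the Kaggle submission format expects datetime and count, and test1's datetime column was split apart earlier, so we rebuild it from date and hour):

submission = pd.DataFrame({
    'datetime': pd.to_datetime(test1['date']) + pd.to_timedelta(test1['hour'], unit='h'),
    'count': test1['count']
})
submission.to_csv('test_pred.csv', index=False)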

6. Discussion and Conclusion


In this project, based on the Apache Spark platform, we completed the data analysis and modeling using packages such as Python's pandas.

Through the analysis of these features, we find that the factors affecting the number of rentals are not a single variable; the features jointly determine the rental count.
At the same time, many features are correlated (such as temperature, humidity, windspeed and atemp) and influence each other; in the modeling process, we only need to select one variable from each group of mutually influencing variables.
Of course, some variables are interference terms with no direct relationship to the rental count (their correlation is very low).

Through this project, we found that exploratory data analysis is essential: it directly determines the selection of the feature variables needed for modeling. We also found data visualization to be a useful tool for extracting information clearly.

In the process of data handling, we learned how complex it can be: simply deleting missing values and outliers from the data set does not solve the problem. For example, while processing the windspeed data, we found that the raw values contained many zeros, which does not match common sense; we speculated that these were values missing at collection time that were filled in as zeros. We therefore used a random forest to re-predict and fill in the records with windspeed 0, and the model's prediction accuracy ultimately improved.

We also learned that how the data are expressed affects model accuracy: in the probability density plot of count we found that its values fluctuated greatly, and since count is the variable we need to predict, this matters especially; large fluctuations in the target can lead the model to overfit.
Therefore, we applied a logarithmic transformation to count; the resulting distribution is more uniform, with smaller differences in magnitude, which is conducive to training our model.

In this project, we encountered a number of problems: for example, Spark's DataFrame could not conveniently perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization.
We also had considerable trouble learning Spark's various operations, the plotting functions, and the model parameters.
At the same time, there are many deficiencies in our analysis and modeling process:

  1. After modeling, we found that the influence of “weather” on rentals is not well captured by the average value; the cumulative value should be used for those statistics instead;

  2. The abnormal points in the influence of “windspeed” on rental volume should be removed before plotting;

  3. It is not only the raw “windspeed” data that has problems: atemp has dozens of points sharing the same value, which causes blank gaps in its scatter plot; humidity has some zero values, and when humidity is close to 100 the scatter is evenly spread; besides the zero problem in windspeed, there are other outliers, and so on.

These are points that follow-up analysis and modeling need to attend to and improve.

7. References

[1] Tova Milo, Amit Somech. Deep Reinforcement-Learning Framework for Exploratory Data Analysis. ACM, 2018.

[2] Tova Milo, Amit Somech. Automating Exploratory Data Analysis via Machine Learning: An Overview. ACM, 2020.

[3] Jonathan D. Becher, Pavel Berkhin, Edmund Freeman. Automating exploratory data analysis for efficient data mining. ACM, 2000.

[4] Fabian Gieseke, Christian Igel. Training Big Random Forests with Little Resources. ACM, 2018.

[5] Random forest algorithm and its implementation. https://blog.csdn.net/yangyin007/article/details/82385967

[6] Python seaborn drawing. https://blog.csdn.net/suzyu12345/article/details/69029106

[7] How to deal with the correlation between features before training the model. https://blog.csdn.net/weixin_42835182/article/details/84104323

[8] Model evaluation in sklearn - building evaluation functions. https://blog.csdn.net/weixin_34184561/article/details/85900792?utm_medium=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.control&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.control

[9] Multiple linear regression (implemented with sklearn). https://blog.csdn.net/weixin_39739342/article/details/93379653

[10] Basic explanation of the KNN algorithm in sklearn. https://blog.csdn.net/sinat_23338865/article/details/80291159

[11] Interpretation of each parameter of sklearn's train_test_split(). https://www.cnblogs.com/Yanjy-OnlyOne/p/11288098.html
