This is a modeling and prediction project based on cloud computing.
We used a historical bike-sharing usage data set for the Washington area from Kaggle, analyzed and modeled the data on the Apache Spark cloud computing platform together with the Python processing stack, and finally predicted bike-sharing rental demand in the Washington area.
Data Set Overview: The selected data set consists of a training set and a test set:
The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month.
The training set contains 12 attributes, including datetime, season and holiday; the test set lacks the casual, count and registered attributes.
Model evaluation and application: Three models are used in this project: Multiple Linear Regression, K Nearest Neighbors and Random Forest. Using each model's model.score function, the prediction accuracy scores on the data set are 0.3893, 0.1919 and 0.9926, respectively. The Random Forest has the highest accuracy, so we finally chose it to predict the test set and saved the prediction output as test_pred.csv.
Through this project, we have solved the problems proposed at the beginning, gained a new understanding of data processing and model analysis, and also found new problems and challenges in the process of project execution:
The factors affecting the number of bike rentals are not a single variable; these characteristics jointly determine the rental count. At the same time, many of the features are correlated with each other (such as temperature, humidity, windspeed and atemp) and influence one another. Of course, some variables are interference terms that are not directly related to the rental count (their correlation is very low). Therefore, exploratory data analysis is very necessary in the modeling process.
The data preprocessing step is critical and directly affects the overall analysis and even the model's predictions. Simply checking for missing values, de-duplicating and removing outliers often does not noticeably improve the model fit; it is better to gain insight into the intrinsic characteristics of the data through visualization and then transform the data accordingly (for example, a log transformation), so that the model's predictions become more accurate.
In this project we also encountered a number of problems. For example, Spark's DataFrame could not perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization. Our analysis and modeling process also has many deficiencies, which we discuss in the summary at the end of this report.
Packages imported: pyspark.sql.functions
pyspark.context(SparkContext)
pyspark.sql.session(SparkSession)
seaborn
matplotlib.pyplot
warnings
numpy
pandas
datetime(datetime)
sklearn.ensemble(RandomForestRegressor)
sklearn.neighbors(KNeighborsClassifier)
sklearn.linear_model(LinearRegression)
sklearn.model_selection(train_test_split)
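For reference, the package list above corresponds to the following import statements (a consolidated sketch; the aliases are the ones used in the code later in this report):
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split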
The bike-sharing system is a way of renting bicycles. Registration, renting and returning the bikes are all done through a city-wide self-service terminal network, which automatically records rental and return data.
Through this system, people can rent a bike in one place and return it in a different place.
The data generated by the system records each ride's start time, departure point, arrival point and usage duration.
In this project, based on historical usage data, we analyzed the impact of natural and human factors such as weather and time on the number of shared-bike rentals, in order to predict the demand for shared-bike rentals in the Washington area.
In this project, we mainly used the Apache Spark cloud computing platform and the Python processing platform, relying on basic Spark DataFrame operations and basic Pandas DataFrame operations. We performed exploratory data analysis: data cleaning, data description, viewing the distribution of the data, comparing relationships between variables, and data summarization. At the same time, we used data visualization, presenting the results of the exploratory data analysis as charts so that we could understand the real distribution of the data more intuitively, see the rules hidden in the data, and thereby find inspiration for a model suitable for the data.
In the process of modeling and analysis, we used Multiple Linear Regression, Random Forest and KNN:
In this project, we need to explore the question: "What factors affect the use of shared bikes?"
We need to predict the total rental number of shared bikes through characteristic values such as weather in the test set.
Before analyzing the data, we need to have a certain understanding of the data in the data set, which will help us to choose the appropriate model later.
In this module, we give an overview of the data set:
The source of the data set is https://www.kaggle.com/c/bike-sharing-demand/data. The data set consists of a training set and a test set: the training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month. The training set contains 12 attributes, while the test set lacks the casual, registered and count attributes.
The following table shows the names of the 12 attributes and their explanations:
Attributes | Explanation |
---|---|
datetime | Date and hour (YYYY-MM-DD HH:MM:SS) |
season | 1: spring; 2: summer; 3: autumn; 4: winter |
holiday | Is it a holiday? 0: no; 1: yes |
workingday | Is it a working day? 0: no; 1: yes |
weather | 1: sunny; 2: cloudy; 3: light rain or light snow; 4: bad weather (heavy rain, hail or blizzard) |
temp | Actual temperature (Celsius) |
atemp | Apparent ("feels like") temperature (Celsius) |
humidity | Relative humidity |
windspeed | Wind speed |
casual | Number of rentals by unregistered users |
registered | Number of rentals by registered users |
count | Total number of rentals |
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
# Create a local Spark context and session for this project
sc = SparkContext('local')
sqlContext = SparkSession(sc)
spark = SparkSession.builder.appName('Final_project').getOrCreate()
# Read the train and test CSV files, letting Spark infer the schema
train = spark.read.csv('file:///home/ljm/project/train.csv', header = True, inferSchema = True)
test = spark.read.csv('file:///home/ljm/project/test.csv', header = True, inferSchema = True)
train.dtypes
[('datetime', 'string'),
('season', 'int'),
('holiday', 'int'),
('workingday', 'int'),
('weather', 'int'),
('temp', 'double'),
('atemp', 'double'),
('humidity', 'int'),
('windspeed', 'double'),
('casual', 'int'),
('registered', 'int'),
('count', 'int')]
train.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00| 1| 0| 0| 1| 9.84|14.395| 81| 0.0| 3| 13| 16|
|2011-01-01 01:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 8| 32| 40|
|2011-01-01 02:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 5| 27| 32|
|2011-01-01 03:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 3| 10| 13|
|2011-01-01 04:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 0| 1| 1|
|2011-01-01 05:00:00| 1| 0| 0| 2| 9.84| 12.88| 75| 6.0032| 0| 1| 1|
|2011-01-01 06:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 2| 0| 2|
|2011-01-01 07:00:00| 1| 0| 0| 1| 8.2| 12.88| 86| 0.0| 1| 2| 3|
|2011-01-01 08:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 1| 7| 8|
|2011-01-01 09:00:00| 1| 0| 0| 1|13.12|17.425| 76| 0.0| 8| 6| 14|
|2011-01-01 10:00:00| 1| 0| 0| 1|15.58|19.695| 76| 16.9979| 12| 24| 36|
|2011-01-01 11:00:00| 1| 0| 0| 1|14.76|16.665| 81| 19.0012| 26| 30| 56|
|2011-01-01 12:00:00| 1| 0| 0| 1|17.22| 21.21| 77| 19.0012| 29| 55| 84|
|2011-01-01 13:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.9995| 47| 47| 94|
|2011-01-01 14:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.0012| 35| 71| 106|
|2011-01-01 15:00:00| 1| 0| 0| 2|18.04| 21.97| 77| 19.9995| 40| 70| 110|
|2011-01-01 16:00:00| 1| 0| 0| 2|17.22| 21.21| 82| 19.9995| 41| 52| 93|
|2011-01-01 17:00:00| 1| 0| 0| 2|18.04| 21.97| 82| 19.0012| 15| 52| 67|
|2011-01-01 18:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 9| 26| 35|
|2011-01-01 19:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 6| 31| 37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
test.dtypes
[('datetime', 'string'),
('season', 'int'),
('holiday', 'int'),
('workingday', 'int'),
('weather', 'int'),
('temp', 'double'),
('atemp', 'double'),
('humidity', 'int'),
('windspeed', 'double')]
test.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|2011-01-20 00:00:00| 1| 0| 1| 1|10.66|11.365| 56| 26.0027|
|2011-01-20 01:00:00| 1| 0| 1| 1|10.66|13.635| 56| 0.0|
|2011-01-20 02:00:00| 1| 0| 1| 1|10.66|13.635| 56| 0.0|
|2011-01-20 03:00:00| 1| 0| 1| 1|10.66| 12.88| 56| 11.0014|
|2011-01-20 04:00:00| 1| 0| 1| 1|10.66| 12.88| 56| 11.0014|
|2011-01-20 05:00:00| 1| 0| 1| 1| 9.84|11.365| 60| 15.0013|
|2011-01-20 06:00:00| 1| 0| 1| 1| 9.02|10.605| 60| 15.0013|
|2011-01-20 07:00:00| 1| 0| 1| 1| 9.02|10.605| 55| 15.0013|
|2011-01-20 08:00:00| 1| 0| 1| 1| 9.02|10.605| 55| 19.0012|
|2011-01-20 09:00:00| 1| 0| 1| 2| 9.84|11.365| 52| 15.0013|
|2011-01-20 10:00:00| 1| 0| 1| 1|10.66|11.365| 48| 19.9995|
|2011-01-20 11:00:00| 1| 0| 1| 2|11.48|13.635| 45| 11.0014|
|2011-01-20 12:00:00| 1| 0| 1| 2| 12.3|16.665| 42| 0.0|
|2011-01-20 13:00:00| 1| 0| 1| 2|11.48|14.395| 45| 7.0015|
|2011-01-20 14:00:00| 1| 0| 1| 2| 12.3| 15.15| 45| 8.9981|
|2011-01-20 15:00:00| 1| 0| 1| 2|13.12| 15.91| 45| 12.998|
|2011-01-20 16:00:00| 1| 0| 1| 2| 12.3| 15.15| 49| 8.9981|
|2011-01-20 17:00:00| 1| 0| 1| 2| 12.3| 15.91| 49| 7.0015|
|2011-01-20 18:00:00| 1| 0| 1| 2|10.66| 12.88| 56| 12.998|
|2011-01-20 19:00:00| 1| 0| 1| 1|10.66|11.365| 56| 22.0028|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
only showing top 20 rows
print("Train dataset:", "The number of rows:", train.count(), "\n", " The number of columns:", len(train.columns))
print("Test dataset :", "The number of rows:", test.count(), "\n", " The number of columns:", len(test.columns))
Train dataset: The number of rows: 10886
The number of columns: 12
Test dataset : The number of rows: 6493
The number of columns: 9
train.describe().show()
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|summary| datetime| season| holiday| workingday| weather| temp| atemp| humidity| windspeed| casual| registered| count|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
| count| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886|
| mean| null|2.5066139996325556|0.02856880396839978|0.6808745177291935| 1.418427337865148|20.230859819952173|23.65508405291192| 61.88645967297446|12.799395406945093|36.02195480433584| 155.5521771082124|191.57413191254824|
| stddev| null|1.1161743093443237|0.16659885062470944|0.4661591687997361|0.6338385858190968| 7.791589843987573| 8.47460062648494|19.245033277394704| 8.16453732683871|49.96047657264955|151.03903308192452|181.14445383028493|
| min|2011-01-01 00:00:00| 1| 0| 0| 1| 0.82| 0.76| 0| 0.0| 0| 0| 1|
| max|2012-12-19 23:00:00| 4| 1| 1| 4| 41.0| 45.455| 100| 56.9969| 367| 886| 977|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
test.describe().show()
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|summary| datetime| season| holiday| workingday| weather| temp| atemp| humidity| windspeed|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
| count| 6493| 6493| 6493| 6493| 6493| 6493| 6493| 6493| 6493|
| mean| null| 2.49330047743724|0.029108270445094717| 0.6858154936085015|1.4367780686893579|20.620606807330972|24.012864623440585| 64.1252117665178|12.63115720006173|
| stddev| null|1.0912579418644106| 0.16812296760854603|0.46422601479880476|0.6483898010717418| 8.059583026412682| 8.782741298669094|19.29339098607345|8.250151174075594|
| min|2011-01-20 00:00:00| 1| 0| 0| 1| 0.82| 0.0| 16| 0.0|
| max|2012-12-31 23:00:00| 4| 1| 1| 4| 40.18| 50.0| 100| 55.9986|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
The dataset “test.csv” does not have the casual, registered and count columns, which means any of these three attributes could be predicted; in this project we choose to predict only count.
In exploratory data analysis, we did the following:
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # Cancel the warning
import numpy as np
import pandas as pd
from datetime import datetime
%matplotlib inline
train.select('count').describe().show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 10886|
| mean|191.57413191254824|
| stddev|181.14445383028493|
| min| 1|
| max| 977|
+-------+------------------+
train.dropna().show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00| 1| 0| 0| 1| 9.84|14.395| 81| 0.0| 3| 13| 16|
|2011-01-01 01:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 8| 32| 40|
|2011-01-01 02:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 5| 27| 32|
|2011-01-01 03:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 3| 10| 13|
|2011-01-01 04:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 0| 1| 1|
|2011-01-01 05:00:00| 1| 0| 0| 2| 9.84| 12.88| 75| 6.0032| 0| 1| 1|
|2011-01-01 06:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 2| 0| 2|
|2011-01-01 07:00:00| 1| 0| 0| 1| 8.2| 12.88| 86| 0.0| 1| 2| 3|
|2011-01-01 08:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 1| 7| 8|
|2011-01-01 09:00:00| 1| 0| 0| 1|13.12|17.425| 76| 0.0| 8| 6| 14|
|2011-01-01 10:00:00| 1| 0| 0| 1|15.58|19.695| 76| 16.9979| 12| 24| 36|
|2011-01-01 11:00:00| 1| 0| 0| 1|14.76|16.665| 81| 19.0012| 26| 30| 56|
|2011-01-01 12:00:00| 1| 0| 0| 1|17.22| 21.21| 77| 19.0012| 29| 55| 84|
|2011-01-01 13:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.9995| 47| 47| 94|
|2011-01-01 14:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.0012| 35| 71| 106|
|2011-01-01 15:00:00| 1| 0| 0| 2|18.04| 21.97| 77| 19.9995| 40| 70| 110|
|2011-01-01 16:00:00| 1| 0| 0| 2|17.22| 21.21| 82| 19.9995| 41| 52| 93|
|2011-01-01 17:00:00| 1| 0| 0| 2|18.04| 21.97| 82| 19.0012| 15| 52| 67|
|2011-01-01 18:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 9| 26| 35|
|2011-01-01 19:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 6| 31| 37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
print("There are", train.count() - train.select("count").dropna().count(),"rows were dropped at the previous step.")
There are 0 rows were dropped at the previous step.
train1 = train.toPandas()
Now check again for missing data
train1.info()
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int32
2 holiday 10886 non-null int32
3 workingday 10886 non-null int32
4 weather 10886 non-null int32
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int32
8 windspeed 10886 non-null float64
9 casual 10886 non-null int32
10 registered 10886 non-null int32
11 count 10886 non-null int32
dtypes: float64(3), int32(8), object(1)
memory usage: 680.5+ KB
There are no missing values in the train dataset.
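As a cross-check, here is a minimal sketch for counting nulls per column directly on the Spark DataFrame (assuming the train DataFrame loaded above; spark_sum is just an alias to avoid shadowing Python's built-in sum):
from pyspark.sql.functions import col, sum as spark_sum
# Count how many null entries each column of the train DataFrame contains
train.select([spark_sum(col(c).isNull().cast('int')).alias(c) for c in train.columns]).show()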
Check the probability density distribution of count in the train dataset:
fig = plt.figure()
ax = fig.add_subplot(1, 2, 1)
fig.set_size_inches(12,5)
sn.distplot(train1['count'])
ax.set(xlabel='count',title='Distribution of count')
[Text(0.5, 0, 'count'), Text(0.5, 1.0, 'Distribution of count')]
[Figure: distribution of count (output_48_1.png)]
As can be seen from the probability density distribution diagram, the count values vary over a very wide range, and modeling on this basis can easily lead to overfitting.
So we transform the data to make it relatively stable.
We chose a logarithmic transformation.
yLabels=train1['count']
yLabels_log=np.log(yLabels)
sn.distplot(yLabels_log)
plt.title("The probability density distribution after logarithmic transformation")
Text(0.5, 1.0, 'The probability density distribution after logarithmic transformation')
[Figure: probability density distribution after logarithmic transformation (output_50_1.png)]
After the logarithmic transformation, the data distribution is more even and the differences in magnitude are reduced, so using this transformed label is more helpful for training the model.
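One detail to keep in mind (a small sketch with made-up example values): a model trained on the log-transformed target produces predictions on the log scale, so they must be mapped back with np.exp, as is done later in the modeling section.
import numpy as np
counts = np.array([1.0, 16.0, 191.0, 977.0])  # example rental counts (roughly the min, a low value, the mean and the max)
log_counts = np.log(counts)                    # forward transform used as the training target
recovered = np.exp(log_counts)                 # inverse transform applied to model predictions
print(log_counts, recovered)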
Next, we split the datetime column of the train dataset into date, year, month, day, weekday and hour:
train1['date'] = train1.datetime.apply(lambda a: a.split()[0])
train1['year'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
train1['month'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
train1['day'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
train1['weekday'] = train1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
train1['hour'] = train1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
train1.drop('datetime', axis = 1, inplace = True)
train1
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | year | month | day | weekday | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0000 | 3 | 13 | 16 | 2011-01-01 | 2011 | 1 | 1 | 6 | 0 |
1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0000 | 8 | 32 | 40 | 2011-01-01 | 2011 | 1 | 1 | 6 | 1 |
2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0000 | 5 | 27 | 32 | 2011-01-01 | 2011 | 1 | 1 | 6 | 2 |
3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0000 | 3 | 10 | 13 | 2011-01-01 | 2011 | 1 | 1 | 6 | 3 |
4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0000 | 0 | 1 | 1 | 2011-01-01 | 2011 | 1 | 1 | 6 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10881 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 2012-12-19 | 2012 | 12 | 19 | 3 | 19 |
10882 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 2012-12-19 | 2012 | 12 | 19 | 3 | 20 |
10883 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 2012-12-19 | 2012 | 12 | 19 | 3 | 21 |
10884 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 2012-12-19 | 2012 | 12 | 19 | 3 | 22 |
10885 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 2012-12-19 | 2012 | 12 | 19 | 3 | 23 |
10886 rows × 17 columns
(We saved the processed dataframe as train1.csv for import)
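A minimal sketch of that save step (the path is an assumption, chosen to match the one used when reading the file back below):
# Hypothetical: write the processed pandas DataFrame to disk so Spark can re-read it
train1.to_csv('/home/ljm/project/train1.csv', index=False)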
After preparing the data, the correlation coefficients are calculated, and the correlation between each feature and count is checked:
train2 = spark.read.csv('file:///home/ljm/project/train1.csv', header = True, inferSchema = True)
corrDF = train2.select([corr('count', 'season'), corr('count', 'holiday'), corr('count', 'workingday'), corr('count', 'weather'),
corr('count', 'temp'), corr('count', 'atemp'), corr('count', 'humidity'), corr('count', 'windspeed'),
corr('count', 'casual'), corr('count', 'registered'), corr('count', 'count'), corr('count', 'year'),
corr('count', 'month'), corr('count', 'day'), corr('count', 'weekday'), corr('count', 'hour')])
corrDF.show()
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
|corr(count, season)|corr(count, holiday)|corr(count, workingday)|corr(count, weather)| corr(count, temp)| corr(count, atemp)|corr(count, humidity)|corr(count, windspeed)|corr(count, casual)|corr(count, registered)|corr(count, count)| corr(count, year)| corr(count, month)| corr(count, day)|corr(count, weekday)| corr(count, hour)|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
| 0.1634390165763605|-0.00539298447777...| 0.011593866091574893|-0.12865520103850572|0.3944536449672519|0.38978443662697554| -0.31737147887659584| 0.10136947021033213| 0.6904135653286751| 0.9709481058098266| 1.0|0.26040329737852264|0.16686223209772807|0.019825777342373795|-0.00228340038070...|0.40060119414684714|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
In order to show the relationships among all the features more intuitively, we plot a heat map of the correlation coefficients:
corrDf = train1.corr()
mask = np.array(corrDf)
mask[np.tril_indices_from(mask)] = False
fig = plt.figure(figsize=(16, 16))
sn.heatmap(corrDf, mask=mask, annot=True, square=True)
plt.title("Heat map of the correlation coefficient between the features")
Text(0.5, 1.0, 'Heat map of the correlation coefficient between the features')
As can be seen from the correlation coefficients, hour, temp and atemp have an obvious influence on count, and since temp and atemp have almost the same correlation with count, we choose only temp for the analysis. Year, month, season, windspeed, weather and humidity also have noticeable effects on count, while the correlations between day, workingday, weekday, holiday and count are extremely small.
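To see this ranking numerically, a small sketch on the pandas frame built above (this mirrors the heat map; the ordering by absolute strength is our addition):
# Correlation of each numeric feature with count, ordered by absolute strength
corr_with_count = train1.corr()['count'].drop('count')
print(corr_with_count.reindex(corr_with_count.abs().sort_values(ascending=False).index))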
Next, the numerical data need to be processed. Since these features appear in both the train and test data sets, the two data sets are combined first for convenience.
test1 = test.toPandas()
Perform the same split operation on the test1 dataset:
test1['date'] = test1.datetime.apply(lambda a: a.split()[0])
test1['year'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
test1['month'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
test1['day'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
test1['weekday'] = test1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
test1['hour'] = test1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
test1.drop('datetime', axis = 1, inplace = True)
full = train1.append(test1, ignore_index = True )
print('The merged dataset:', full.shape)
The merged dataset: (17379, 17)
full.head()
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | year | month | day | weekday | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3.0 | 13.0 | 16.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 0 |
1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8.0 | 32.0 | 40.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 1 |
2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5.0 | 27.0 | 32.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 2 |
3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3.0 | 10.0 | 13.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 3 |
4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0.0 | 1.0 | 1.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 4 |
1.1 year
sn.boxplot(full['year'], full['count'])
plt.title('The influence of year')
plt.show()
1.2 month
sn.pointplot(full['month'], full['count'])
plt.title('The influence of month')
plt.show()
1.3 season
sn.boxplot(full['season'], full['count'])
plt.title('The influence of season')
plt.show()
1.4 hour
sn.pointplot(full['hour'], full['count'], color = 'orange')
plt.title('The influence of hour')
plt.show()
sn.barplot(full['weather'] , full['count'])
plt.title('The influence of weather')
plt.show()
We first observe the relationship between temp and atemp:
cols = ['temp' , 'atemp']
sn.pairplot(full[cols])
plt.show()
Making a correlation plot of several continuous variables lets us compare the relationship between any two of them.
Let’s start with the temperature trend:
sn.pointplot(full['month'], full['temp'], color = 'salmon')
plt.title("Temperature changes with month")
plt.show()
sn.regplot(full['temp'] , full['count'], marker="+", color = 'y')
plt.title('The influence of temp')
plt.show()
temp_rentals = full.groupby(['temp'], as_index=True).agg({ 'count':'mean'})
temp_rentals.plot(title = 'The average number of rentals initiated per hour changes with the temperature')
As the temperature rises, the number of bike rentals generally shows an upward trend, but it begins to decline once the temperature exceeds about 35 degrees; the lowest point occurs at around 4 degrees.
Let’s start with the humidity trend:
sn.pointplot(full['month'], full['humidity'], color = 'plum')
plt.title("Humidity changes with month")
plt.show()
Above is the monthly change in humidity.
Next, we obtain the trend of the rental count with humidity, taking the average rental count at each humidity value.
sn.regplot(full['humidity'] , full['count'], marker="+", color = 'orange')
plt.title('The influence of humidity')
plt.show()
humidity_rentals = full.groupby(['humidity'], as_index=True).agg({'count':'mean'})
humidity_rentals.plot (title = 'Average number of rentals initiated per hour in different humidity', color = 'g')
Let’s first observe the distribution of windspeed:
sn.distplot(full['windspeed'])
plt.xlabel('windspeed')   # was mislabeled 'humidity'; plt.xlabel is a function, not an attribute
plt.title('Distribution of windspeed')
plt.show()
The distribution of wind speed reveals a problem: there are a large number of records with a wind speed of 0, while the statistical description shows no values between roughly 1 and 6. We can infer that these were actually missing values that were filled with 0, and these zero-wind-speed records will interfere with the prediction.
Therefore, we use a random forest to fill in the missing wind speed values based on features such as year, month, season, temperature and humidity.
from sklearn.ensemble import RandomForestRegressor as RF
full["windspeed_rfr"] = full["windspeed"]
dataWind0 = full[full["windspeed_rfr"]==0]
dataWindNot0 = full[full["windspeed_rfr"]!=0]
# select model
rfModel_wind = RF(n_estimators=1000, random_state=42)
# Select eigenvalues
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Take the data with wind speed not equal to 0 as the training set and fit it into RandomForestRegressor
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])
# Prediction of wind speed using a trained model
wind0Values = rfModel_wind.predict(X = dataWind0[windColumns])
# Fill the predicted wind speeds into the rows where wind speed was zero
dataWind0.loc[:,"windspeed_rfr"] = wind0Values
# join two pieces of data
full = dataWindNot0.append(dataWind0)
full.reset_index(inplace=True)
full.drop('index',inplace=True,axis=1)
Re-observe the density distribution of windspeed:
sn.distplot(full['windspeed_rfr'])
plt.title('Distribution of windspeed')
plt.show()
We looked at the change in windspeed over the month:
sn.pointplot(full['month'], full['windspeed_rfr'], color = 'lightseagreen')
plt.title("Windspeed changes with month")
plt.show()
Very high wind speeds are rare, so taking the average rental count at each wind speed would be distorted by those few observations. Therefore, when examining the relationship between wind speed and the number of rentals, we take the maximum rental count at each wind speed instead.
windspeed_rentals = full.groupby(['windspeed'], as_index=True).agg({'count':'max'})
windspeed_rentals.plot(title = 'Max number of rentals initiated per hour in different windspeed', color = 'orange')
fig, axes = plt.subplots(2,1,figsize = (16, 10))
ax1 = plt.subplot(2,1,1)
sn.pointplot(full['hour'] , full['count'] , hue = full['weekday'] , ax = ax1)
ax1.set_title('The influence of hour(weekday)')
ax2 = plt.subplot(2,2,3)
sn.pointplot(full['hour'] , full['count'] , hue = full['workingday'] , ax = ax2)
ax2.set_title('The influence of hour(workingday)')
ax3 = plt.subplot(2,2,4)
sn.pointplot(full['hour'] , full['count'] , hue = full['holiday'] , ax = ax3)
ax3.set_title('The influence of hour(holiday)')
Text(0.5, 1.0, 'The influence of hour(holiday)')
In this module, we score the prediction accuracy of the candidate models, select the best one to build the analysis model for the data, and finally obtain the prediction results for the test set.
According to the previous observations, we choose 11 features as inputs: hour, temp, humidity, year, month, season, weather, windspeed_rfr, weekday, workingday and holiday.
Let's compare three algorithms, Multiple Linear Regression, Random Forest and K Nearest Neighbors, to see their accuracy.
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split
features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

def run_exp_on_feature(x_train, y_train, x_test, y_test):
    models = [['Linear Regression ', LR()],
              ['K Nearest Neighbor ', KNN()],
              ['Random Forest Classifier ', RF()]]
    models_score = []
    for name, model in models:
        model.fit(x_train, y_train)
        model_pred = model.predict(x_test)
        # Note: score() is evaluated on the training split here
        models_score.append(model.score(x_train, y_train))
        print(name)
        print('Accuracy:', model.score(x_train, y_train))
        print('---------------------------------------')
    return models_score
x_train,x_test,y_train,y_test = train_test_split(train1[features], train1['count'], test_size=0.2, random_state=23)
models_score = run_exp_on_feature(x_train,y_train,x_test,y_test)
name = ['Linear Regression ','K Nearest Neighbor','Random Forest Classifier']
fig, ax = plt.subplots(figsize = (12, 7))
ax.bar(name, models_score, color = 'lightsalmon')
ax.set_facecolor('white')
ax.set_title("The accuracy of each model")
for x, y in zip(name, models_score):
    ax.text(x, y, '%.3f' % y)
Linear Regression
Accuracy: 0.389327001694181
---------------------------------------
K Nearest Neighbor
Accuracy: 0.19189251263206247
---------------------------------------
Random Forest Classifier
Accuracy: 0.9925706830689325
---------------------------------------
It can be clearly seen from the image that the accuracy of random forest is the highest, so the random forest model is selected for modeling.
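One caveat worth noting: model.score in the loop above is computed on the training split, so the 0.9926 for the random forest is a training-set score. A minimal sketch for also scoring the held-out split, reusing the same variable names (our addition, not part of the original run):
for name, model in [['Linear Regression ', LR()],
                    ['K Nearest Neighbor ', KNN()],
                    ['Random Forest Classifier ', RF()]]:
    model.fit(x_train, y_train)
    print(name, '| train score:', model.score(x_train, y_train),
          '| test score:', model.score(x_test, y_test))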
features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']
yLabels=train1['count']
# Take a logarithmic transformation
yLabels_log = np.log(yLabels)
rfModel = RF(n_estimators=1000 , random_state = 42)
# The data after logarithmic transformation was used as y_train input model for training
rfModel.fit(train1[features], yLabels_log)
preds1 = rfModel.predict(X = train1[features])
preds = rfModel.predict(X = test1[features])
# The predicted results are exponentially transformed and output
test1.loc[:,"count"] = np.exp(preds)
test1
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | date | year | month | day | weekday | hour | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011-01-20 | 2011 | 1 | 20 | 4 | 0 | 10.245482 |
1 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011-01-20 | 2011 | 1 | 20 | 4 | 1 | 4.479768 |
2 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011-01-20 | 2011 | 1 | 20 | 4 | 2 | 2.858717 |
3 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 2011-01-20 | 2011 | 1 | 20 | 4 | 3 | 2.813818 |
4 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 2011-01-20 | 2011 | 1 | 20 | 4 | 4 | 2.444608 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6488 | 1 | 0 | 1 | 2 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 19 | 285.500426 |
6489 | 1 | 0 | 1 | 2 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 20 | 194.329977 |
6490 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 21 | 149.449278 |
6491 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 8.9981 | 2012-12-31 | 2012 | 12 | 31 | 1 | 22 | 106.499478 |
6492 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 65 | 8.9981 | 2012-12-31 | 2012 | 12 | 31 | 1 | 23 | 57.849388 |
6493 rows × 15 columns
And we save the result as “test_pred.csv”
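A minimal sketch of that export (the exact columns included in test_pred.csv are not shown in the original, so here we simply write the whole test1 frame):
# Hypothetical export of the predictions produced above
test1.to_csv('test_pred.csv', index=False)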
In this project, based on the Apache Spark platform, we completed the data analysis and modeling of the problem using Python packages such as Pandas.
Through the analysis of these characteristic data, we found that the factors affecting the number of bike rentals are not a single variable; these characteristics together determine the rental count.
At the same time, many features are correlated with each other (such as temperature, humidity, windspeed and atemp) and influence one another. In the modeling process, we only need to select one variable from each group of mutually correlated variables.
Of course, some variables are interference terms that are not directly related to the rental count (their correlation is very low).
Through this project, we found that exploratory data analysis is very necessary; it directly determines the selection of the feature variables used in modeling. We also found that data visualization is a useful tool for clearly extracting useful information.
In the process of data processing, we learned how complex it can be: the problems cannot be solved simply by deleting missing values and outliers from the data set. For example, while processing the windspeed data, we found that the original data contained many zeros, which does not conform to common sense; we therefore speculated that these zeros were missing values that had been filled with 0 when the data were collected. We used a random forest to re-predict and fill in the records with a windspeed of 0, and in the end the accuracy of the model prediction also increased.
We also learned that the way the data are expressed affects the accuracy of the model: in the probability density distribution of the count variable, we found that its values fluctuate greatly, and since count is exactly the variable we need to predict, this matters a great deal; if the target variable itself fluctuates greatly, it can cause the model to overfit.
Therefore, we applied a logarithmic transformation to the count feature; the resulting distribution is more even and the differences in magnitude are reduced, which is beneficial for training our model.
In this project, we encountered a number of problems. For example, Spark's DataFrame could not perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization.
We also had a lot of trouble learning Spark's various operations, the plotting functions, and the parameters of the models.
At the same time, there are also many deficiencies in our analysis and modeling process.
After modeling, we found that the influence of weather on rentals is not well captured by the average value; the cumulative value should be used for that statistic instead.
The outliers in the relationship between windspeed and the rental volume should be removed before plotting.
Windspeed is not the only raw attribute with problems: atemp has dozens of points with identical values and leaves blank gaps in the scatter plot; humidity has some zero values, and when humidity is close to 100 the scatter points are spread evenly; besides the zero problem in windspeed, there are also some other outliers.
These are points that future analysis and modeling need to pay attention to and improve.