This is a modeling and prediction project based on cloud computing.
We used a historical bike-sharing usage data set for the Washington area from Kaggle, analyzed and modeled the data on the Apache Spark cloud computing platform together with the Python processing stack, and finally predicted bike-sharing rental demand in the Washington area.
Data Set Overview: The selected data set consists of a training set and a test set:
The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month.
The training set contains 12 attributes, including datetime, season and holiday; the test set lacks the casual, count and registered attributes.
Model evaluation and application: Three models are used in this project: Multiple Linear Regression, K Nearest Neighbors and Random Forest. Using each model's model.score function, the prediction accuracy scores on the data set are 0.3893, 0.1919 and 0.9926, respectively. The Random Forest has the highest accuracy, so we finally chose it to predict the test set and saved the prediction output as test_pred.csv.
Through this project, we have solved the problems proposed at the beginning, gained a new understanding of data processing and model analysis, and also found new problems and challenges in the process of project execution:
The factors affecting the number of bike rentals are not a single variable; these characteristics jointly determine the rental count. At the same time, many of the features are correlated with each other (such as temperature, humidity, windspeed and atemp) and influence one another. Of course, some variables are interference terms that are not directly related to the rental count (their correlation is very low). Therefore, exploratory data analysis is very necessary in the modeling process.
The data preprocessing step is critical and directly affects the overall analysis and even the model's predictions. Simply checking for missing values, de-duplicating and removing outliers often does not noticeably improve the model fit; it is better to gain insight into the intrinsic characteristics of the data through visualization and then transform the data accordingly (for example, a log transformation), so that the model's predictions become more accurate.
In this project we also encountered a number of problems. For example, Spark's DataFrame could not perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization. Our analysis and modeling process also has many deficiencies, which we discuss in the summary at the end of this report.
Packages imported: pyspark.sql.functions
pyspark.context(SparkContext)
pyspark.sql.session(SparkSession)
seaborn
matplotlib.pyplot
warnings
numpy
pandas
datetime(datetime)
sklearn.ensemble(RandomForestRegressor)
sklearn.neighbors(KNeighborsClassifier)
sklearn.linear_model(LinearRegression)
sklearn.model_selection(train_test_split)
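For reference, the package list above corresponds to the following import statements (a consolidated sketch; the aliases are the ones used in the code later in this report):
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split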
The bike-sharing system is a way of renting bicycles. Registration, renting and returning the bikes are all done through a city-wide self-service terminal network, which automatically records rental and return data.
Through this system, people can rent a bike in one place and return it in a different place.
The data generated by the system records each ride's start time, departure point, arrival point and usage duration.
In this project, based on historical usage data, we analyzed the impact of natural and human factors such as weather and time on the number of shared-bike rentals, in order to predict the demand for shared-bike rentals in the Washington area.
In this project, we mainly used the Apache Spark cloud computing platform and the Python processing platform, relying on basic Spark DataFrame operations and basic Pandas DataFrame operations. We performed exploratory data analysis: data cleaning, data description, viewing the distribution of the data, comparing relationships between variables, and data summarization. At the same time, we used data visualization, presenting the results of the exploratory data analysis as charts so that we could understand the real distribution of the data more intuitively, see the rules hidden in the data, and thereby find inspiration for a model suitable for the data.
In the process of modeling and analysis, we used Multiple Linear Regression, Random Forest and KNN:
In this project, we need to explore the question: "What factors affect the use of shared bikes?"
We need to predict the total rental number of shared bikes through characteristic values such as weather in the test set.
Before analyzing the data, we need to have a certain understanding of the data in the data set, which will help us to choose the appropriate model later.
In this module, we give an overview of the data set:
The source of the data set is https://www.kaggle.com/c/bike-sharing-demand/data. The data set consists of a training set and a test set: the training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month. The training set contains 12 attributes, while the test set lacks the casual, registered and count attributes.
The following table shows the names of the 12 attributes and their explanations:
Attributes | Explanation |
---|---|
datetime | Date and hour (YYYY-MM-DD HH:MM:SS) |
season | 1: spring; 2: summer; 3: autumn; 4: winter |
holiday | Is it a holiday? 0: no; 1: yes |
workingday | Is it a working day? 0: no; 1: yes |
weather | 1: sunny; 2: cloudy; 3: light rain or light snow; 4: bad weather (heavy rain, hail or blizzard) |
temp | Actual temperature (Celsius) |
atemp | Apparent ("feels like") temperature (Celsius) |
humidity | Relative humidity |
windspeed | Wind speed |
casual | Number of rentals by unregistered users |
registered | Number of rentals by registered users |
count | Total number of rentals |
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
# Create a local Spark context and session for this project
sc = SparkContext('local')
sqlContext = SparkSession(sc)
spark = SparkSession.builder.appName('Final_project').getOrCreate()
# Read the train and test CSV files, letting Spark infer the schema
train = spark.read.csv('file:///home/ljm/project/train.csv', header = True, inferSchema = True)
test = spark.read.csv('file:///home/ljm/project/test.csv', header = True, inferSchema = True)
train.dtypes
[('datetime', 'string'),
('season', 'int'),
('holiday', 'int'),
('workingday', 'int'),
('weather', 'int'),
('temp', 'double'),
('atemp', 'double'),
('humidity', 'int'),
('windspeed', 'double'),
('casual', 'int'),
('registered', 'int'),
('count', 'int')]
train.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00| 1| 0| 0| 1| 9.84|14.395| 81| 0.0| 3| 13| 16|
|2011-01-01 01:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 8| 32| 40|
|2011-01-01 02:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 5| 27| 32|
|2011-01-01 03:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 3| 10| 13|
|2011-01-01 04:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 0| 1| 1|
|2011-01-01 05:00:00| 1| 0| 0| 2| 9.84| 12.88| 75| 6.0032| 0| 1| 1|
|2011-01-01 06:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 2| 0| 2|
|2011-01-01 07:00:00| 1| 0| 0| 1| 8.2| 12.88| 86| 0.0| 1| 2| 3|
|2011-01-01 08:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 1| 7| 8|
|2011-01-01 09:00:00| 1| 0| 0| 1|13.12|17.425| 76| 0.0| 8| 6| 14|
|2011-01-01 10:00:00| 1| 0| 0| 1|15.58|19.695| 76| 16.9979| 12| 24| 36|
|2011-01-01 11:00:00| 1| 0| 0| 1|14.76|16.665| 81| 19.0012| 26| 30| 56|
|2011-01-01 12:00:00| 1| 0| 0| 1|17.22| 21.21| 77| 19.0012| 29| 55| 84|
|2011-01-01 13:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.9995| 47| 47| 94|
|2011-01-01 14:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.0012| 35| 71| 106|
|2011-01-01 15:00:00| 1| 0| 0| 2|18.04| 21.97| 77| 19.9995| 40| 70| 110|
|2011-01-01 16:00:00| 1| 0| 0| 2|17.22| 21.21| 82| 19.9995| 41| 52| 93|
|2011-01-01 17:00:00| 1| 0| 0| 2|18.04| 21.97| 82| 19.0012| 15| 52| 67|
|2011-01-01 18:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 9| 26| 35|
|2011-01-01 19:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 6| 31| 37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
test.dtypes
[('datetime', 'string'),
('season', 'int'),
('holiday', 'int'),
('workingday', 'int'),
('weather', 'int'),
('temp', 'double'),
('atemp', 'double'),
('humidity', 'int'),
('windspeed', 'double')]
test.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|2011-01-20 00:00:00| 1| 0| 1| 1|10.66|11.365| 56| 26.0027|
|2011-01-20 01:00:00| 1| 0| 1| 1|10.66|13.635| 56| 0.0|
|2011-01-20 02:00:00| 1| 0| 1| 1|10.66|13.635| 56| 0.0|
|2011-01-20 03:00:00| 1| 0| 1| 1|10.66| 12.88| 56| 11.0014|
|2011-01-20 04:00:00| 1| 0| 1| 1|10.66| 12.88| 56| 11.0014|
|2011-01-20 05:00:00| 1| 0| 1| 1| 9.84|11.365| 60| 15.0013|
|2011-01-20 06:00:00| 1| 0| 1| 1| 9.02|10.605| 60| 15.0013|
|2011-01-20 07:00:00| 1| 0| 1| 1| 9.02|10.605| 55| 15.0013|
|2011-01-20 08:00:00| 1| 0| 1| 1| 9.02|10.605| 55| 19.0012|
|2011-01-20 09:00:00| 1| 0| 1| 2| 9.84|11.365| 52| 15.0013|
|2011-01-20 10:00:00| 1| 0| 1| 1|10.66|11.365| 48| 19.9995|
|2011-01-20 11:00:00| 1| 0| 1| 2|11.48|13.635| 45| 11.0014|
|2011-01-20 12:00:00| 1| 0| 1| 2| 12.3|16.665| 42| 0.0|
|2011-01-20 13:00:00| 1| 0| 1| 2|11.48|14.395| 45| 7.0015|
|2011-01-20 14:00:00| 1| 0| 1| 2| 12.3| 15.15| 45| 8.9981|
|2011-01-20 15:00:00| 1| 0| 1| 2|13.12| 15.91| 45| 12.998|
|2011-01-20 16:00:00| 1| 0| 1| 2| 12.3| 15.15| 49| 8.9981|
|2011-01-20 17:00:00| 1| 0| 1| 2| 12.3| 15.91| 49| 7.0015|
|2011-01-20 18:00:00| 1| 0| 1| 2|10.66| 12.88| 56| 12.998|
|2011-01-20 19:00:00| 1| 0| 1| 1|10.66|11.365| 56| 22.0028|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
only showing top 20 rows
print("Train dataset:", "The number of rows:", train.count(), "\n", " The number of columns:", len(train.columns))
print("Test dataset :", "The number of rows:", test.count(), "\n", " The number of columns:", len(test.columns))
Train dataset: The number of rows: 10886
The number of columns: 12
Test dataset : The number of rows: 6493
The number of columns: 9
train.describe().show()
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|summary| datetime| season| holiday| workingday| weather| temp| atemp| humidity| windspeed| casual| registered| count|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
| count| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886| 10886|
| mean| null|2.5066139996325556|0.02856880396839978|0.6808745177291935| 1.418427337865148|20.230859819952173|23.65508405291192| 61.88645967297446|12.799395406945093|36.02195480433584| 155.5521771082124|191.57413191254824|
| stddev| null|1.1161743093443237|0.16659885062470944|0.4661591687997361|0.6338385858190968| 7.791589843987573| 8.47460062648494|19.245033277394704| 8.16453732683871|49.96047657264955|151.03903308192452|181.14445383028493|
| min|2011-01-01 00:00:00| 1| 0| 0| 1| 0.82| 0.76| 0| 0.0| 0| 0| 1|
| max|2012-12-19 23:00:00| 4| 1| 1| 4| 41.0| 45.455| 100| 56.9969| 367| 886| 977|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
test.describe().show()
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|summary| datetime| season| holiday| workingday| weather| temp| atemp| humidity| windspeed|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
| count| 6493| 6493| 6493| 6493| 6493| 6493| 6493| 6493| 6493|
| mean| null| 2.49330047743724|0.029108270445094717| 0.6858154936085015|1.4367780686893579|20.620606807330972|24.012864623440585| 64.1252117665178|12.63115720006173|
| stddev| null|1.0912579418644106| 0.16812296760854603|0.46422601479880476|0.6483898010717418| 8.059583026412682| 8.782741298669094|19.29339098607345|8.250151174075594|
| min|2011-01-20 00:00:00| 1| 0| 0| 1| 0.82| 0.0| 16| 0.0|
| max|2012-12-31 23:00:00| 4| 1| 1| 4| 40.18| 50.0| 100| 55.9986|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
The dataset “test.csv” does not have the casual, registered and count columns, which means any of these three attributes could be predicted; in this project we choose to predict only count.
In exploratory data analysis, we did the following:
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # Cancel the warning
import numpy as np
import pandas as pd
from datetime import datetime
%matplotlib inline
train.select('count').describe().show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 10886|
| mean|191.57413191254824|
| stddev|181.14445383028493|
| min| 1|
| max| 977|
+-------+------------------+
train.dropna().show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
| datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00| 1| 0| 0| 1| 9.84|14.395| 81| 0.0| 3| 13| 16|
|2011-01-01 01:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 8| 32| 40|
|2011-01-01 02:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 5| 27| 32|
|2011-01-01 03:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 3| 10| 13|
|2011-01-01 04:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 0| 1| 1|
|2011-01-01 05:00:00| 1| 0| 0| 2| 9.84| 12.88| 75| 6.0032| 0| 1| 1|
|2011-01-01 06:00:00| 1| 0| 0| 1| 9.02|13.635| 80| 0.0| 2| 0| 2|
|2011-01-01 07:00:00| 1| 0| 0| 1| 8.2| 12.88| 86| 0.0| 1| 2| 3|
|2011-01-01 08:00:00| 1| 0| 0| 1| 9.84|14.395| 75| 0.0| 1| 7| 8|
|2011-01-01 09:00:00| 1| 0| 0| 1|13.12|17.425| 76| 0.0| 8| 6| 14|
|2011-01-01 10:00:00| 1| 0| 0| 1|15.58|19.695| 76| 16.9979| 12| 24| 36|
|2011-01-01 11:00:00| 1| 0| 0| 1|14.76|16.665| 81| 19.0012| 26| 30| 56|
|2011-01-01 12:00:00| 1| 0| 0| 1|17.22| 21.21| 77| 19.0012| 29| 55| 84|
|2011-01-01 13:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.9995| 47| 47| 94|
|2011-01-01 14:00:00| 1| 0| 0| 2|18.86|22.725| 72| 19.0012| 35| 71| 106|
|2011-01-01 15:00:00| 1| 0| 0| 2|18.04| 21.97| 77| 19.9995| 40| 70| 110|
|2011-01-01 16:00:00| 1| 0| 0| 2|17.22| 21.21| 82| 19.9995| 41| 52| 93|
|2011-01-01 17:00:00| 1| 0| 0| 2|18.04| 21.97| 82| 19.0012| 15| 52| 67|
|2011-01-01 18:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 9| 26| 35|
|2011-01-01 19:00:00| 1| 0| 0| 3|17.22| 21.21| 88| 16.9979| 6| 31| 37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
print("There are", train.count() - train.select("count").dropna().count(),"rows were dropped at the previous step.")
There are 0 rows were dropped at the previous step.
train1 = train.toPandas()
Now check again for missing data
train1.info()
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int32
2 holiday 10886 non-null int32
3 workingday 10886 non-null int32
4 weather 10886 non-null int32
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int32
8 windspeed 10886 non-null float64
9 casual 10886 non-null int32
10 registered 10886 non-null int32
11 count 10886 non-null int32
dtypes: float64(3), int32(8), object(1)
memory usage: 680.5+ KB
There are no missing values in the train dataset.
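As a cross-check, here is a minimal sketch for counting nulls per column directly on the Spark DataFrame (assuming the train DataFrame loaded above; spark_sum is just an alias to avoid shadowing Python's built-in sum):
from pyspark.sql.functions import col, sum as spark_sum
# Count how many null entries each column of the train DataFrame contains
train.select([spark_sum(col(c).isNull().cast('int')).alias(c) for c in train.columns]).show()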
Check the probability density distribution of count in the train dataset:
fig = plt.figure()
ax = fig.add_subplot(1, 2, 1)
fig.set_size_inches(12,5)
sn.distplot(train1['count'])
ax.set(xlabel='count',title='Distribution of count')
[Text(0.5, 0, 'count'), Text(0.5, 1.0, 'Distribution of count')]
[Figure: distribution of count (output_48_1.png)]
As can be seen from the probability density distribution diagram, the count values vary over a very wide range, and modeling on this basis can easily lead to overfitting.
So we transform the data to make it relatively stable.
We chose a logarithmic transformation.
yLabels=train1['count']
yLabels_log=np.log(yLabels)
sn.distplot(yLabels_log)
plt.title("The probability density distribution after logarithmic transformation")
Text(0.5, 1.0, 'The probability density distribution after logarithmic transformation')
[Figure: probability density distribution after logarithmic transformation (output_50_1.png)]
After the logarithmic transformation, the data distribution is more even and the differences in magnitude are reduced, so using this transformed label is more helpful for training the model.
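One detail to keep in mind (a small sketch with made-up example values): a model trained on the log-transformed target produces predictions on the log scale, so they must be mapped back with np.exp, as is done later in the modeling section.
import numpy as np
counts = np.array([1.0, 16.0, 191.0, 977.0])  # example rental counts (roughly the min, a low value, the mean and the max)
log_counts = np.log(counts)                    # forward transform used as the training target
recovered = np.exp(log_counts)                 # inverse transform applied to model predictions
print(log_counts, recovered)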
Next, we split the datetime column of the train dataset into date, year, month, day, weekday and hour:
train1['date'] = train1.datetime.apply(lambda a: a.split()[0])
train1['year'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
train1['month'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
train1['day'] = train1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
train1['weekday'] = train1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
train1['hour'] = train1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
train1.drop('datetime', axis = 1, inplace = True)
train1
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | year | month | day | weekday | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0000 | 3 | 13 | 16 | 2011-01-01 | 2011 | 1 | 1 | 6 | 0 |
1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0000 | 8 | 32 | 40 | 2011-01-01 | 2011 | 1 | 1 | 6 | 1 |
2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0000 | 5 | 27 | 32 | 2011-01-01 | 2011 | 1 | 1 | 6 | 2 |
3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0000 | 3 | 10 | 13 | 2011-01-01 | 2011 | 1 | 1 | 6 | 3 |
4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0000 | 0 | 1 | 1 | 2011-01-01 | 2011 | 1 | 1 | 6 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10881 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 2012-12-19 | 2012 | 12 | 19 | 3 | 19 |
10882 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 2012-12-19 | 2012 | 12 | 19 | 3 | 20 |
10883 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 2012-12-19 | 2012 | 12 | 19 | 3 | 21 |
10884 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 2012-12-19 | 2012 | 12 | 19 | 3 | 22 |
10885 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 2012-12-19 | 2012 | 12 | 19 | 3 | 23 |
10886 rows × 17 columns
(We saved the processed dataframe as train1.csv for import)
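A minimal sketch of that save step (the path is an assumption, chosen to match the one used when reading the file back below):
# Hypothetical: write the processed pandas DataFrame to disk so Spark can re-read it
train1.to_csv('/home/ljm/project/train1.csv', index=False)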
After preparing the data, the correlation coefficients are calculated, and the correlation between each feature and count is checked:
train2 = spark.read.csv('file:///home/ljm/project/train1.csv', header = True, inferSchema = True)
corrDF = train2.select([corr('count', 'season'), corr('count', 'holiday'), corr('count', 'workingday'), corr('count', 'weather'),
corr('count', 'temp'), corr('count', 'atemp'), corr('count', 'humidity'), corr('count', 'windspeed'),
corr('count', 'casual'), corr('count', 'registered'), corr('count', 'count'), corr('count', 'year'),
corr('count', 'month'), corr('count', 'day'), corr('count', 'weekday'), corr('count', 'hour')])
corrDF.show()
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
|corr(count, season)|corr(count, holiday)|corr(count, workingday)|corr(count, weather)| corr(count, temp)| corr(count, atemp)|corr(count, humidity)|corr(count, windspeed)|corr(count, casual)|corr(count, registered)|corr(count, count)| corr(count, year)| corr(count, month)| corr(count, day)|corr(count, weekday)| corr(count, hour)|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
| 0.1634390165763605|-0.00539298447777...| 0.011593866091574893|-0.12865520103850572|0.3944536449672519|0.38978443662697554| -0.31737147887659584| 0.10136947021033213| 0.6904135653286751| 0.9709481058098266| 1.0|0.26040329737852264|0.16686223209772807|0.019825777342373795|-0.00228340038070...|0.40060119414684714|
+-------------------+--------------------+-----------------------+--------------------+------------------+-------------------+---------------------+----------------------+-------------------+-----------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
In order to show the relationships among all the features more intuitively, we plot a heat map of the correlation coefficients:
corrDf = train1.corr()
mask = np.array(corrDf)
mask[np.tril_indices_from(mask)] = False
fig = plt.figure(figsize=(16, 16))
sn.heatmap(corrDf, mask=mask, annot=True, square=True)
plt.title("Heat map of the correlation coefficient between the features")
Text(0.5, 1.0, 'Heat map of the correlation coefficient between the features')
As can be seen from the correlation coefficients, hour, temp and atemp have an obvious influence on count, and since temp and atemp have almost the same correlation with count, we choose only temp for the analysis. Year, month, season, windspeed, weather and humidity also have noticeable effects on count, while the correlations between day, workingday, weekday, holiday and count are extremely small.
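To see this ranking numerically, a small sketch on the pandas frame built above (this mirrors the heat map; the ordering by absolute strength is our addition):
# Correlation of each numeric feature with count, ordered by absolute strength
corr_with_count = train1.corr()['count'].drop('count')
print(corr_with_count.reindex(corr_with_count.abs().sort_values(ascending=False).index))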
Next, the numerical data need to be processed. Since these features appear in both the train and test data sets, the two data sets are combined first for convenience.
test1 = test.toPandas()
Perform the same split operation on the test1 dataset:
test1['date'] = test1.datetime.apply(lambda a: a.split()[0])
test1['year'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[0]).astype('int')
test1['month'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[1]).astype('int')
test1['day'] = test1.datetime.apply(lambda a: a.split()[0].split('-')[2]).astype('int')
test1['weekday'] = test1.date.apply(lambda a: datetime.strptime(a , '%Y-%m-%d').isoweekday())
test1['hour'] = test1.datetime.apply(lambda a: a.split()[1].split(':')[0]).astype('int')
# delete datetime
test1.drop('datetime', axis = 1, inplace = True)
full = train1.append(test1, ignore_index = True )
print('The merged dataset:', full.shape)
The merged dataset: (17379, 17)
full.head()
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | year | month | day | weekday | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3.0 | 13.0 | 16.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 0 |
1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8.0 | 32.0 | 40.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 1 |
2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5.0 | 27.0 | 32.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 2 |
3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3.0 | 10.0 | 13.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 3 |
4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0.0 | 1.0 | 1.0 | 2011-01-01 | 2011 | 1 | 1 | 6 | 4 |
1.1 year
sn.boxplot(full['year'], full['count'])
plt.title('The influence of year')
plt.show()
1.2 month
sn.pointplot(full['month'], full['count'])
plt.title('The influence of month')
plt.show()
1.3 season
sn.boxplot(full['season'], full['count'])
plt.title('The influence of season')
plt.show()
1.4 hour
sn.pointplot(full['hour'], full['count'], color = 'orange')
plt.title('The influence of hour')
plt.show()
sn.barplot(full['weather'] , full['count'])
plt.title('The influence of weather')
plt.show()
We first observe the relationship between temp and atemp:
cols = ['temp' , 'atemp']
sn.pairplot(full[cols])
plt.show()
Making a correlation plot of several continuous variables lets us compare the relationship between any two of them.
Let’s start with the temperature trend:
sn.pointplot(full['month'], full['temp'], color = 'salmon')
plt.title("Temperature changes with month")
plt.show()
sn.regplot(full['temp'] , full['count'], marker="+", color = 'y')
plt.title('The influence of temp')
plt.show()
temp_rentals = full.groupby(['temp'], as_index=True).agg({ 'count':'mean'})
temp_rentals.plot(title = 'The average number of rentals initiated per hour changes with the temperature')
As the temperature rises, the number of bike rentals generally shows an upward trend, but it begins to decline once the temperature exceeds about 35 degrees; the lowest point occurs at around 4 degrees.
Let’s start with the humidity trend:
sn.pointplot(full['month'], full['humidity'], color = 'plum')
plt.title("Humidity changes with month")
plt.show()
Above is the monthly change in humidity.
Next, we obtain the trend of the rental count with humidity, taking the average rental count at each humidity value.
sn.regplot(full['humidity'] , full['count'], marker="+", color = 'orange')
plt.title('The influence of humidity')
plt.show()
humidity_rentals = full.groupby(['humidity'], as_index=True).agg({'count':'mean'})
humidity_rentals.plot (title = 'Average number of rentals initiated per hour in different humidity', color = 'g')
Let’s first observe the distribution of windspeed:
sn.distplot(full['windspeed'])
plt.xlabel('windspeed')   # was mislabeled 'humidity'; plt.xlabel is a function, not an attribute
plt.title('Distribution of windspeed')
plt.show()
The distribution of wind speed reveals a problem: there are a large number of records with a wind speed of 0, while the statistical description shows no values between roughly 1 and 6. We can infer that these were actually missing values that were filled with 0, and these zero-wind-speed records will interfere with the prediction.
Therefore, we use a random forest to fill in the missing wind speed values based on features such as year, month, season, temperature and humidity.
from sklearn.ensemble import RandomForestRegressor as RF
full["windspeed_rfr"] = full["windspeed"]
dataWind0 = full[full["windspeed_rfr"]==0]
dataWindNot0 = full[full["windspeed_rfr"]!=0]
# select model
rfModel_wind = RF(n_estimators=1000, random_state=42)
# Select eigenvalues
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Take the data with wind speed not equal to 0 as the training set and fit it into RandomForestRegressor
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])
# Prediction of wind speed using a trained model
wind0Values = rfModel_wind.predict(X = dataWind0[windColumns])
# Fill the predicted wind speeds into the rows where wind speed was zero
dataWind0.loc[:,"windspeed_rfr"] = wind0Values
# join two pieces of data
full = dataWindNot0.append(dataWind0)
full.reset_index(inplace=True)
full.drop('index',inplace=True,axis=1)
Re-observe the density distribution of windspeed:
sn.distplot(full['windspeed_rfr'])
plt.title('Distribution of windspeed')
plt.show()
We looked at the change in windspeed over the month:
sn.pointplot(full['month'], full['windspeed_rfr'], color = 'lightseagreen')
plt.title("Windspeed changes with month")
plt.show()
Very high wind speeds are rare, so taking the average rental count at each wind speed would be distorted by those few observations. Therefore, when examining the relationship between wind speed and the number of rentals, we take the maximum rental count at each wind speed instead.
windspeed_rentals = full.groupby(['windspeed'], as_index=True).agg({'count':'max'})
windspeed_rentals.plot(title = 'Max number of rentals initiated per hour in different windspeed', color = 'orange')
fig, axes = plt.subplots(2,1,figsize = (16, 10))
ax1 = plt.subplot(2,1,1)
sn.pointplot(full['hour'] , full['count'] , hue = full['weekday'] , ax = ax1)
ax1.set_title('The influence of hour(weekday)')
ax2 = plt.subplot(2,2,3)
sn.pointplot(full['hour'] , full['count'] , hue = full['workingday'] , ax = ax2)
ax2.set_title('The influence of hour(workingday)')
ax3 = plt.subplot(2,2,4)
sn.pointplot(full['hour'] , full['count'] , hue = full['holiday'] , ax = ax3)
ax3.set_title('The influence of hour(holiday)')
Text(0.5, 1.0, 'The influence of hour(holiday)')
In this module, we score the prediction accuracy of the candidate models, select the best one to build the analysis model for the data, and finally obtain the prediction results for the test set.
According to the previous observations, we choose 11 features as inputs: hour, temp, humidity, year, month, season, weather, windspeed_rfr, weekday, workingday and holiday.
Let's compare three algorithms, Multiple Linear Regression, Random Forest and K Nearest Neighbors, to see their accuracy.
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split
features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

def run_exp_on_feature(x_train, y_train, x_test, y_test):
    models = [['Linear Regression ', LR()],
              ['K Nearest Neighbor ', KNN()],
              ['Random Forest Classifier ', RF()]]
    models_score = []
    for name, model in models:
        model.fit(x_train, y_train)
        model_pred = model.predict(x_test)
        # Note: score() is evaluated on the training split here
        models_score.append(model.score(x_train, y_train))
        print(name)
        print('Accuracy:', model.score(x_train, y_train))
        print('---------------------------------------')
    return models_score
x_train,x_test,y_train,y_test = train_test_split(train1[features], train1['count'], test_size=0.2, random_state=23)
models_score = run_exp_on_feature(x_train,y_train,x_test,y_test)
name = ['Linear Regression ','K Nearest Neighbor','Random Forest Classifier']
fig, ax = plt.subplots(figsize = (12, 7))
ax.bar(name, models_score, color = 'lightsalmon')
ax.set_facecolor('white')
ax.set_title("The accuracy of each model")
for x, y in zip(name, models_score):
    ax.text(x, y, '%.3f' % y)
Linear Regression
Accuracy: 0.389327001694181
---------------------------------------
K Nearest Neighbor
Accuracy: 0.19189251263206247
---------------------------------------
Random Forest Classifier
Accuracy: 0.9925706830689325
---------------------------------------
It can be clearly seen from the image that the accuracy of random forest is the highest, so the random forest model is selected for modeling.
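One caveat worth noting: model.score in the loop above is computed on the training split, so the 0.9926 for the random forest is a training-set score. A minimal sketch for also scoring the held-out split, reusing the same variable names (our addition, not part of the original run):
for name, model in [['Linear Regression ', LR()],
                    ['K Nearest Neighbor ', KNN()],
                    ['Random Forest Classifier ', RF()]]:
    model.fit(x_train, y_train)
    print(name, '| train score:', model.score(x_train, y_train),
          '| test score:', model.score(x_test, y_test))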
features = ['hour', 'temp', 'humidity', 'year', 'month', 'season', 'weather', 'windspeed', 'weekday', 'workingday', 'holiday']
yLabels=train1['count']
# Take a logarithmic transformation
yLabels_log = np.log(yLabels)
rfModel = RF(n_estimators=1000 , random_state = 42)
# The data after logarithmic transformation was used as y_train input model for training
rfModel.fit(train1[features], yLabels_log)
preds1 = rfModel.predict(X = train1[features])
preds = rfModel.predict(X = test1[features])
# The predicted results are exponentially transformed and output
test1.loc[:,"count"] = np.exp(preds)
test1
season | holiday | workingday | weather | temp | atemp | humidity | windspeed | date | year | month | day | weekday | hour | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011-01-20 | 2011 | 1 | 20 | 4 | 0 | 10.245482 |
1 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011-01-20 | 2011 | 1 | 20 | 4 | 1 | 4.479768 |
2 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011-01-20 | 2011 | 1 | 20 | 4 | 2 | 2.858717 |
3 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 2011-01-20 | 2011 | 1 | 20 | 4 | 3 | 2.813818 |
4 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 2011-01-20 | 2011 | 1 | 20 | 4 | 4 | 2.444608 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6488 | 1 | 0 | 1 | 2 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 19 | 285.500426 |
6489 | 1 | 0 | 1 | 2 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 20 | 194.329977 |
6490 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 60 | 11.0014 | 2012-12-31 | 2012 | 12 | 31 | 1 | 21 | 149.449278 |
6491 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 8.9981 | 2012-12-31 | 2012 | 12 | 31 | 1 | 22 | 106.499478 |
6492 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 65 | 8.9981 | 2012-12-31 | 2012 | 12 | 31 | 1 | 23 | 57.849388 |
6493 rows × 15 columns
And we save the result as “test_pred.csv”
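A minimal sketch of that export (the exact columns included in test_pred.csv are not shown in the original, so here we simply write the whole test1 frame):
# Hypothetical export of the predictions produced above
test1.to_csv('test_pred.csv', index=False)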
In this project, based on the Apache Spark platform, we completed the data analysis and modeling of the problem using Python packages such as Pandas.
Through the analysis of these characteristic data, we found that the factors affecting the number of bike rentals are not a single variable; these characteristics together determine the rental count.
At the same time, many features are correlated with each other (such as temperature, humidity, windspeed and atemp) and influence one another. In the modeling process, we only need to select one variable from each group of mutually correlated variables.
Of course, some variables are interference terms that are not directly related to the rental count (their correlation is very low).
Through this project, we found that exploratory data analysis is very necessary; it directly determines the selection of the feature variables used in modeling. We also found that data visualization is a useful tool for clearly extracting useful information.
In the process of data processing, we learned how complex it can be: the problems cannot be solved simply by deleting missing values and outliers from the data set. For example, while processing the windspeed data, we found that the original data contained many zeros, which does not conform to common sense; we therefore speculated that these zeros were missing values that had been filled with 0 when the data were collected. We used a random forest to re-predict and fill in the records with a windspeed of 0, and in the end the accuracy of the model prediction also increased.
We also learned that the way the data are expressed affects the accuracy of the model: in the probability density distribution of the count variable, we found that its values fluctuate greatly, and since count is exactly the variable we need to predict, this matters a great deal; if the target variable itself fluctuates greatly, it can cause the model to overfit.
Therefore, we applied a logarithmic transformation to the count feature; the resulting distribution is more even and the differences in magnitude are reduced, which is beneficial for training our model.
In this project, we encountered a number of problems. For example, Spark's DataFrame could not perform method operations on individual columns and rows, which caused trouble in the initial data analysis and visualization.
We also had a lot of trouble learning Spark's various operations, the plotting functions, and the parameters of the models.
At the same time, there are also many deficiencies in our analysis and modeling process.
After modeling, we found that the influence of weather on rentals is not well captured by the average value; the cumulative value should be used for that statistic instead.
The outliers in the relationship between windspeed and the rental volume should be removed before plotting.
Windspeed is not the only raw attribute with problems: atemp has dozens of points with identical values and leaves blank gaps in the scatter plot; humidity has some zero values, and when humidity is close to 100 the scatter points are spread evenly; besides the zero problem in windspeed, there are also some other outliers.
These are points that future analysis and modeling need to pay attention to and improve.