Introduction

The purpose of this project is to analyse pharmaceutical sales data. Analysing sales data and predicting future sales from historical data is a very common data science task, which makes it a great way to start working with data science.

What will you learn?

In this project, you will learn how to load data sets from text files into Pandas, the most popular Python library for data manipulation and analysis, and how to find specific information in different sales data sets, such as when a specific drug was sold most often. In addition, we will predict future sales from the existing data using Linear Regression, Polynomial Regression and Support Vector Regression (SVR). We will do some data preprocessing and standardisation. To get better results, we will also learn an important and useful data science technique: ensemble learning.

You will also learn how to test your model and plot results using Matplotlib.

Let’s get started.

Problem definition

Here are the specific questions we will be answering in this exercise:

On which day of the week is the second drug (M01AE) most often sold?
Which three drugs have the highest sales in January 2015, July 2016 and September 2017?
Which drug sold most often on Mondays in 2017?
What might medicine sales be in January 2020? (Our data set only contains information about sales from January 2014 to October 2019.)

Step by step solution

Create a project folder

Create a folder for the project on your computer called “Analysing-pharmaceutical-sales-data”.

Download the data sets from this Kaggle project:

https://www.kaggle.com/milanzdravkovic/pharma-sales-data

Place these data sets in a folder called “data” in your project folder.

If you’ve never used Python or Jupyter Notebook on your computer, read my article “How to set up your computer for Data Science” to check that you have everything you need to run the analysis below.

Start a new notebook

Start Jupyter Notebook by typing a command in the Terminal/Command Prompt:

$ jupyter notebook

Click New in the top right corner and select Python 3.

This will open a new Jupyter Notebook in your browser. Rename the Untitled notebook to your project name and you are ready to start.

If you have Anaconda installed on your computer, all the libraries needed for this project are already installed.

If you are using Google Colab, open a new notebook.
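
Whichever environment you use, a quick way to confirm that the libraries are available is to import them and print their versions. The following is only a small sketch; the list of library names is our own summary of what this project imports.

```python
# Import each library used in this project and print its installed version.
import importlib

libraries = ["pandas", "numpy", "matplotlib", "sklearn"]
versions = {}
for name in libraries:
    # import_module raises ImportError if the library is not installed
    module = importlib.import_module(name)
    versions[name] = module.__version__
    print(name, versions[name])
```

If any of the imports fails, install the missing library before continuing.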

Loading libraries and setup

The first thing we usually do in a new notebook is import the libraries we will need for the project.

# Pandas - Data manipulation and analysis library
import pandas as pd
# NumPy - mathematical functions on multi-dimensional arrays and matrices
import numpy as np
# Matplotlib - plotting library to create graphs and charts
import matplotlib.pyplot as plt
# Re - regular expression module for Python
import re
# Calendar - Python functions related to the calendar
import calendar


# Manipulating dates and times for Python
from datetime import datetime


# Scikit-learn algorithms and functions
from sklearn import linear_model
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor


# Settings for Matplotlib graphs and charts
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 8


# Display Matplotlib output inline
%matplotlib inline


# Additional configuration
np.set_printoptions(precision=2)

Now we are ready to solve our first question.

On which day of the week is the second drug (M01AE) most often sold?

Once all the libraries are loaded, the next step is usually to load the data set. Because we need to find out on which day of the week the second drug is most often sold, we need the daily data. We load this data set into a Pandas Data Frame for further exploration.

# Loading our sales daily data set from csv file using Pandas.
df = pd.read_csv("data/salesdaily.csv")


# Let's look at the data.
df.head()

The head() call displays the first few rows of the data set, so we can see its structure and what is in the data.

To find out on which day of the week the second drug (M01AE) is most often sold, we need to sum up the sales for each day of the week and find the day with the highest total.

Because Pandas is very powerful, we can do most of this in one line of code.

# Grouping the second drug sales by weekday name.
df = df[['M01AE', 'Weekday Name']]
result = df.groupby(['Weekday Name'], as_index=False).sum().sort_values('M01AE', ascending=False)

The above command sums all sales values by Weekday Name and sorts them so that the day on which the drug sold most is in the first row.
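
To see what the grouping and sorting do, here is a minimal sketch on a tiny, made-up data frame. The numbers are invented for illustration and are not taken from the Kaggle data set; only the column names follow the ones used above.

```python
import pandas as pd

# Made-up daily sales, for illustration only (not values from the data set)
sample = pd.DataFrame({
    "M01AE": [2.0, 5.0, 1.0, 4.0, 3.0],
    "Weekday Name": ["Monday", "Sunday", "Monday", "Sunday", "Friday"],
})

# Sum sales per weekday, then sort so the largest total comes first
totals = (
    sample.groupby(["Weekday Name"], as_index=False)
    .sum()
    .sort_values("M01AE", ascending=False)
)
print(totals)

best_day = totals.iloc[0, 0]    # weekday with the highest total
best_value = totals.iloc[0, 1]  # the total itself
```

In this toy example the Sunday rows add up to 9.0, which is larger than the totals for Monday and Friday, so Sunday ends up in the first row.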

Now we just need to take the Weekday Name and value from the first row.

# Taking the weekday name with most sales and the volume of sales from the result
resultDay = result.iloc[0,0]
resultValue = round(result.iloc[0,1], 2)

All we need to do now is to print the results.

# Printing the result
print('The second drug, M01AE, was most often sold on ' + str(resultDay))
print('with the volume of ' + str(resultValue))

If you have completed the task correctly, the result on your screen should look like this.

The second drug, M01AE, was most often sold on Sunday
with the volume of 1384.94

We encourage you to also find out on which day of the week the other drugs are most often sold.
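
One possible way to do this for several drugs at once is sketched below. It is only an illustration on made-up numbers, not the approach used in this article: the idea is to sum every drug column per weekday and then ask Pandas, with idxmax, which weekday holds the largest total in each column.

```python
import pandas as pd

# Made-up daily sales for two drug columns, for illustration only
sample = pd.DataFrame({
    "Weekday Name": ["Monday", "Sunday", "Monday", "Sunday"],
    "M01AB": [4.0, 1.0, 3.0, 2.0],
    "M01AE": [1.0, 5.0, 2.0, 4.0],
})
drug_columns = ["M01AB", "M01AE"]

# One row per weekday, one column per drug, each cell holding the summed sales
totals = sample.groupby("Weekday Name")[drug_columns].sum()

best_days = totals.idxmax()   # weekday with the largest total, per drug
best_values = totals.max()    # the largest total itself, per drug
for drug in drug_columns:
    print(drug, best_days[drug], best_values[drug])
```

With the real data you would list all eight drug columns in drug_columns.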

Now let’s look at the second question.

Which three drugs have the highest sales in January 2015, July 2016 and September 2017?

For this task, we need to load the monthly sales data into a Pandas Data Frame and look at what it contains.

# Loading monthly sales data set from csv file using Pandas.
df = pd.read_csv("data/salesmonthly.csv")
df.head()

Because we will be repeating the calculation for different years and months, it is a good idea to define a function that takes the month and year as parameters.

The function looks like this.

def top3byMonth(month, year):
    """
    Given a month and a year (as integers),
    print the top 3 drugs sold in that month.
    """
    # zero-pad the month so that 1 becomes '01'
    month = str(month).zfill(2)
    year = str(year)
    # keep only the row whose date starts with 'YYYY-MM'
    sales = df.loc[df['datum'].str.contains('^' + year + '-' + month, regex=True)]
    # reset index so that the matching row gets the label 0
    sales = sales.reset_index()
    # filter relevant columns
    topSales = sales[['M01AB', 'M01AE', 'N02BA', 'N02BE', 'N05B', 'N05C', 'R03', 'R06']]
    # sort the columns by the values in row 0, largest first
    topSales = topSales.sort_values(by=0, ascending=False, axis=1)
    # print results
    print('Top 3 drugs by sale in '+calendar.month_name[int(month)]+' '+year)
    for field in topSales.columns.values[0:3]:
        print(' - Product: ' + str(field) + ', Volume sold: ' + str(round(topSales[field].iloc[0], 2)))
    print("\n")

Once the function is defined, we can call it with different integer values for the month and the year.

# top3 drugs by sale in January 2015
top3byMonth(1, 2015)


# top3 drugs by sale in July 2016
top3byMonth(7, 2016)


# top3 drugs by sale in September 2017
top3byMonth(9, 2017)

If you have completed the task correctly you should receive the following results.

Top 3 drugs by sale in January 2015
 - Product: N02BE, Volume sold: 1044.24
 - Product: N05B, Volume sold: 463.0
 - Product: R03, Volume sold: 177.25

Top 3 drugs by sale in July 2016
 - Product: N02BE, Volume sold: 652.36
 - Product: N05B, Volume sold: 240.0
 - Product: M01AB, Volume sold: 203.97

Top 3 drugs by sale in September 2017
 - Product: N02BE, Volume sold: 863.75
 - Product: N05B, Volume sold: 223.0
 - Product: R03, Volume sold: 139.0
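
As a point of comparison, the same question can also be approached by parsing the dates and asking Pandas for the three largest values in the matching row. The sketch below is an alternative for illustration only. It runs on made-up numbers, the helper name top3_for_month is hypothetical, and it assumes the datum column holds dates written as year-month-day.

```python
import pandas as pd

# Made-up monthly rows; only the column names follow the article
monthly = pd.DataFrame({
    "datum": ["2015-01-31", "2015-02-28"],
    "M01AB": [10.0, 40.0],
    "N02BE": [90.0, 20.0],
    "N05B": [50.0, 30.0],
    "R03": [70.0, 10.0],
})
# Assumption: the date strings can be parsed by pandas
monthly["datum"] = pd.to_datetime(monthly["datum"])


def top3_for_month(frame, month, year):
    """Return the three largest drug values for the given month as a Series."""
    is_month = (frame["datum"].dt.year == year) & (frame["datum"].dt.month == month)
    # Take the single matching row and drop the date so only drug columns remain
    row = frame.loc[is_month].drop(columns="datum").iloc[0]
    return row.nlargest(3)


top3 = top3_for_month(monthly, 1, 2015)
print(top3)
```

Returning a Series instead of printing inside the function makes the result easier to reuse, for example to collect the top three for many months into one table.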

That completes task number two. Let’s look at task number three.

Which drug sold most often on Mondays in 2017?

To answer this question we need to load a daily sales data set.

# Loading our sales daily data set from csv file using Pandas.
df = pd.read_csv("data/salesdaily.csv")


# Let's look at the data.
df.head()

This is what our Pandas Data Frame looks like.

Now we need to filter the data so that we keep only the sales made on Mondays in 2017.

# Keeping only the rows for the year 2017 and for Mondays
df = df.loc[df['datum'].str.contains('2017', flags=re.I, regex=True) & (df['Weekday Name'] == 'Monday')]

We now need to group the results by Weekday Name and sum the values.

# Grouping by weekday name and summing
df = df.groupby(['Weekday Name'], as_index=False).sum()

Now we need to sort the values horizontally to get the biggest value on the left. Horizontal sorting is similar to the more commonly used vertical sorting, but instead of reordering the rows by the values in a column, it reorders the columns by the values in a row.

# Filtering only relevant columns and sorting values of most sold drugs horizontally to achieve the most often sold drug on the left
df = df[['M01AB', 'M01AE', 'N02BA', 'N02BE', 'N05B', 'N05C', 'R03', 'R06']]
result = df.sort_values(by=0, ascending=False, axis=1)
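
To make horizontal sorting concrete, here is a minimal sketch on a made-up, single-row data frame. With axis=1 and by=0, Pandas reorders the columns according to the values in the row labelled 0.

```python
import pandas as pd

# One made-up row of totals; the row label is 0
one_row = pd.DataFrame({"M01AB": [3.0], "N02BE": [9.0], "R03": [5.0]})

# Reorder the columns so the largest value in row 0 comes first
sorted_columns = one_row.sort_values(by=0, ascending=False, axis=1)
print(list(sorted_columns.columns))
```

The column holding 9.0 moves to the left, followed by the columns holding 5.0 and 3.0.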

Now the drug that sold most often on Mondays in 2017 is in the first column from the left.

The only thing left to do is to take this value and print it.

# Displaying results
for field in result.columns.values[0:1]:
    print('The drug most often sold on Mondays in 2017 is ' + str(field))
    print('with the volume of ' + str(round(result[field].iloc[0], 2)))

The drug most often sold on Mondays in 2017 is N02BE
with the volume of 1160.56

In the above exercises, we loaded data sets into Pandas Data Frames and looked for specific information using operations such as grouping, sorting and summing. These are great exercises for practising this kind of data manipulation, and we will use these operations often in the next exercises.

What might medicine sales be in January 2020?

We will now look at regression, which is a very common data science task. The idea of regression is to predict the value of a dependent variable from the values of one or more independent variables. Using different regression methods and past data, we can try to predict future values.
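
As a minimal illustration of this idea, the sketch below fits a linear regression to four made-up points that lie exactly on the line y = 2x + 1 and then predicts the value for a new x. The numbers are invented and have nothing to do with the sales data.

```python
import numpy as np
from sklearn import linear_model

# Made-up points that lie exactly on the line y = 2x + 1
X = np.array([[1], [2], [3], [4]])    # independent variable, one column
y = np.array([3.0, 5.0, 7.0, 9.0])    # dependent variable

model = linear_model.LinearRegression()
model.fit(X, y)

# Predict the dependent variable for x = 6
prediction = model.predict([[6]])[0]
print(round(prediction, 2))
```

Because the points are perfectly linear, the prediction for x = 6 is 13, up to floating point rounding. Real sales data is noisy, so the predictions there are only estimates.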

In this exercise, we will try to predict sales volume in future months using the data recorded between 2014 and 2019.

Preprocessing

Looking at the data sets, we can see that the data is of good quality, but there are some records where the sales value is 0 for at least one group of drugs. This is usually something we need to take care of before we run any machine learning model. In this case, we have a couple of options: we can remove the rows where the recorded sales value is 0 for at least one group of drugs, or we can replace the 0 values with the mean or median value for the group. For simplicity, we will remove all records where the recorded sales value is 0, but we recommend that you repeat this exercise and replace the 0 values with the mean or median value for the group to see if you get better results.
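
The second option could look something like the sketch below. It is only an illustration on made-up numbers. It assumes that a 0 means a missing value, and it uses the median of the non-zero values as the replacement.

```python
import pandas as pd

# Made-up monthly sales for one product, with two zero records
sample = pd.DataFrame({"N02BA": [4.0, 0.0, 6.0, 8.0, 0.0]})

# Median of the non-zero values only, so the zeros do not pull it down
non_zero_median = sample.loc[sample["N02BA"] != 0, "N02BA"].median()

# Work on a copy and replace every 0 with that median
filled = sample.copy()
filled["N02BA"] = filled["N02BA"].replace(0, non_zero_median)
print(filled)
```

Using mean() instead of median() in the same place gives the mean-based variant.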

Another important feature of the data is that it is incomplete for 2019. The last recorded day is the 8th of October, which means the sales data for October is incomplete. Because the regression methods in this exercise use only monthly sales data, we exclude October 2019 from the analysis.

Models and Techniques

We will use Pandas for reading the CSV data files and for data preprocessing, and the Scikit-learn Python library for the regression models.

For data visualisation, we will use the Matplotlib Python library.

We will use the following regression models:

Linear Regression
Polynomial Regression
Support Vector Regression (SVR)

The Scikit-learn library includes implementations of all of the above models.

We already imported everything we need from Scikit-learn at the beginning of our notebook.

We will split the data, using 70% of it for training the models and 30% for testing.
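
The split itself can be sketched on a small, made-up data frame of ten rows. With test_size set to 3/10, seven rows go to the training set and three to the test set, and random_state fixes the shuffle so the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: a month sequence number and an invented sales value
sample = pd.DataFrame({
    "datumNumber": range(1, 11),
    "sales": range(10, 110, 10),
})

# Keep 30% of the rows for testing; the rest is used for training
train, test = train_test_split(sample, test_size=3/10, random_state=0)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets.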

We will use a Voting Regressor to combine different machine learning regressors and return the average of their predicted values. We do this to balance out the weaknesses of the individual regressors.

Calculations and Plots

We will display the individual results for each regression and for the Voting Regressor, and we will plot each regression on a chart to visually assess how the data is scattered and how well the regression follows the data set values.

Because this is a relatively big project, it is a good idea to organise the code using functions.

Let’s start with the function that plots our training and test data as a scatter chart.

def scatterData(X_train, y_train, X_test, y_test, title):
    plt.title('Prediction using ' + title)
    plt.xlabel('Month sequence', fontsize=20)
    plt.ylabel('Sales', fontsize=20)


    # Use Matplotlib Scatter Plot
    plt.scatter(X_train, y_train, color='blue', label='Training observation points')
    plt.scatter(X_test, y_test, color='cyan', label='Testing observation points')

Now we will create the Linear Regression, Polynomial Regression and SVR prediction functions. These functions take the training and test values as parameters, fit the regression, and display the predicted value for January 2020 along with a score for the model so we can see how effective it is. The value reported as “Accuracy” is the coefficient of determination (R²) returned by the model’s score method, calculated here on the training data; the error calculation is left commented out in the code.
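
If you would like to judge a model on data it has not seen, one option is to calculate the score and the error on the test set. The sketch below shows this on made-up data. It is our own illustration rather than part of the functions in this article, and it uses r2_score and mean_squared_error from sklearn.metrics.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up data: a straight line with some random noise added
rng = np.random.default_rng(0)
X = np.arange(1, 41).reshape(-1, 1)
y = 3.0 * X.ravel() + 10.0 + rng.normal(0, 2.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

reg = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# R^2 and mean squared error, both calculated on the held-out test set
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print('R^2 on test data:', round(test_r2, 3))
print('MSE on test data:', round(test_mse, 3))
```

A score on the training data tends to be optimistic, so the test score is usually the more honest indication of how the model will behave on future months.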

Linear Regression prediction function.

def predictLinearRegression(X_train, y_train, X_test, y_test):


    y_train = y_train.reshape(-1, 1)
    y_test = y_test.reshape(-1, 1)


    scatterData(X_train, y_train, X_test, y_test, 'Linear Regression')


    reg = linear_model.LinearRegression()
    reg.fit(X_train, y_train)
    plt.plot(X_train, reg.predict(X_train), color='red', label='Linear regressor')
    plt.legend()
    plt.show()


    # LINEAR REGRESSION - Predict/Test model
    y_predict_linear = reg.predict(X_test)


    # LINEAR REGRESSION - Predict for January 2020
    linear_predict = reg.predict([[predictFor]])
    # linear_predict = reg.predict([[predictFor]])[0]


    # LINEAR REGRESSION - Accuracy (R^2 score, calculated on the training data)
    accuracy = reg.score(X_train, y_train)


    # LINEAR REGRESSION - Error
    # error = round(np.mean((y_predict_linear-y_test)**2), 2)
    
    # Results
    print('Linear Regression: ' + str(linear_predict) + ' (Accuracy: ' + str(round(accuracy*100)) + '%)')


    return {'regressor':reg, 'values':linear_predict}

Polynomial Regression prediction function.

def predictPolynomialRegression(X_train, y_train, X_test, y_test):


    y_train = y_train.reshape(-1, 1)
    y_test = y_test.reshape(-1, 1)


    scatterData(X_train, y_train, X_test, y_test, 'Polynomial Regression')
    
    poly_reg = PolynomialFeatures(degree = 2)
    X_poly = poly_reg.fit_transform(X_train)
    poly_reg_model = linear_model.LinearRegression()
    poly_reg_model.fit(X_poly, y_train)
    plt.plot(X_train, poly_reg_model.predict(X_poly), color='green', label='Polynomial regressor')
    plt.legend()
    plt.show()


    # Polynomial Regression - Predictions on the training data
    y_predict_polynomial = poly_reg_model.predict(X_poly)


    # Polynomial Regression - Predict for January 2020
    # (transform reuses the feature expansion already fitted on X_train)
    polynomial_predict = poly_reg_model.predict(poly_reg.transform([[predictFor]]))


    # Polynomial Regression - Accuracy (R^2 score, calculated on the training data)
    # X_poly_test = poly_reg.transform(X_test)
    accuracy = poly_reg_model.score(X_poly, y_train)


    # Polynomial Regression - Error
    # error = round(np.mean((y_predict_polynomial-y_train)**2), 2)


    # Result
    print('Polynomial Regression: ' + str(polynomial_predict) + ' (Accuracy: ' + str(round(accuracy*100)) + '%)')
    return {'regressor':poly_reg_model, 'values':polynomial_predict}

Support Vector Regression (SVR) prediction function.

def predictSVR(X_train, y_train, X_test, y_test):


    y_train = y_train.reshape(-1, 1)
    y_test = y_test.reshape(-1, 1)


    scatterData(X_train, y_train, X_test, y_test, 'Support Vector Regression (SVR)')


    svr_regressor = SVR(kernel='rbf', gamma='auto')
    svr_regressor.fit(X_train, y_train.ravel())


    # plt.scatter(X_train, y_train, color='red', label='Actual observation points')
    plt.plot(X_train, svr_regressor.predict(X_train), label='SVR regressor')
    plt.legend()
    plt.show()


    # Support Vector Regression (SVR) - Predict/Test model
    y_predict_svr = svr_regressor.predict(X_test)


    # Support Vector Regression (SVR) - Predict for January 2020
    svr_predict = svr_regressor.predict([[predictFor]])


    # Support Vector Regression (SVR) - Accuracy (R^2 score, calculated on the training data)
    accuracy = svr_regressor.score(X_train, y_train)


    # Support Vector Regression (SVR) - Error
    # error = round(np.mean((y_predict_svr-y_train)**2), 2)
    
    # Result
    print('Support Vector Regression (SVR): ' + str(svr_predict) + ' (Accuracy: ' + str(round(accuracy*100)) + '%)')
    return {'regressor':svr_regressor, 'values':svr_predict}
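
SVR with an RBF kernel is generally sensitive to the scale of its input values, which is where standardisation comes in. The function above fits the SVR on the raw month numbers. As a sketch only, and on made-up data, this is how StandardScaler could be combined with SVR in a Scikit-learn pipeline so that the scaling learned from the training data is applied automatically when predicting.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up data: month sequence numbers and an invented sales trend
X = np.arange(1, 31).reshape(-1, 1).astype(float)
y = 50.0 + 2.0 * X.ravel()

# The pipeline standardises X first and then fits the SVR on the scaled values
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', gamma='auto'))
scaled_svr.fit(X, y)

# The same scaling is applied to new values before the SVR predicts
svr_prediction = scaled_svr.predict([[31.0]])
print(svr_prediction)
```

Whether this improves the result for the sales data is something to test rather than assume, for example by comparing scores on the test set.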

In the next Jupyter Notebook cell, we write the code that uses the above functions to display the predicted values and to visualise our training data, test data and the different regression models using Matplotlib.

For the calculations, we will use the product N02BA, but we encourage you to do similar calculations for the other products as well.

We need to define our product variable with the name of the product we will calculate the regression for. Next, we define the Pandas Data Frame in which we will store the results of our regression. Then we define the predictFor variable, which is the month sequence number for which we want to predict the sales value. The monthly data starts in January 2014 (sequence number 1) and the last complete month is September 2019 (sequence number 69), so January 2020 has the sequence number 73.
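
The sequence number of any month can be worked out with a little arithmetic. The helper below is a hypothetical illustration, not part of the notebook; it assumes that the first month in the data, January 2014, is counted as 1.

```python
def month_sequence_number(year, month, first_year=2014, first_month=1):
    """Sequence number of a month, counting the first month in the data as 1."""
    return (year - first_year) * 12 + (month - first_month) + 1


print(month_sequence_number(2014, 1))   # 1
print(month_sequence_number(2019, 9))   # 69
print(month_sequence_number(2020, 1))   # 73
```

The difference between the last two values shows that January 2020 is four months after September 2019.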

product = 'N02BA'


# For storing all regression results
regResults = pd.DataFrame(columns=('Linear', 'Polynomial', 'SVR', 'Voting Regressor'), index=[product])


# Display a larger graph than the default by setting the Matplotlib figure size.
rcParams['figure.figsize'] = 12, 8


# We will be using monthly data for our predictions
df = pd.read_csv("data/salesmonthly.csv")


# We will use monthly sales data from 2014 to 2019.
df = df.loc[df['datum'].str.contains('201[4-9]')]
df = df.reset_index()


# It is always a good practice to look at the data often
df

# We are adding a sequence number for each month (1, 2, 3, ...) as an independent variable
df['datumNumber'] = df.index + 1


# Removing the first and the last record from the Pandas Data Frame.
# Sales in the first and the last available month are quite low, which suggests
# those months are incomplete and would skew the results, so we drop them.
df.drop(df.head(1).index,inplace=True)
df.drop(df.tail(1).index,inplace=True)


# Cleaning up any rows with the product value = 0.
df = df[df[product] != 0]


# Let's look at the data again.
df.head()


# Which month do we predict for? January 2020. The last month kept in the data is
# September 2019, so January 2020 is 4 steps after the largest sequence number.
predictFor = int(df['datumNumber'].max()) + 4
print('Predictions for the product ' + str(product) + ' sales in January 2020')

Predictions for the product N02BA sales in January 2020

# For storing regression results.
regValues = {}


# Preparing training and testing data by using train_test_split function. 70% for training and 30% for testing.
dfSplit = df[['datumNumber', product]]


# We are going to keep 30% of the dataset in test dataset
train, test = train_test_split(dfSplit, test_size=3/10, random_state=0)


trainSorted = train.sort_values('datumNumber', ascending=True)
testSorted = test.sort_values('datumNumber', ascending=True)


X_train = trainSorted[['datumNumber']].values
y_train = trainSorted[product].values
X_test = testSorted[['datumNumber']].values
y_test = testSorted[product].values

Performing and saving results for Linear Regression.

# LINEAR REGRESSION
linearResult = predictLinearRegression(X_train, y_train, X_test, y_test)
reg = linearResult['regressor']
regValues['Linear'] = round(linearResult['values'][0][0])

Performing and saving results for Polynomial Regression.

# POLYNOMIAL REGRESSION
polynomialResult = predictPolynomialRegression(X_train, y_train, X_test, y_test)
polynomial_regressor = polynomialResult['regressor']
regValues['Polynomial'] = round(polynomialResult['values'][0][0])

Performing and saving results for Support Vector Regression (SVR).

# SUPPORT VECTOR REGRESSION (SVR)
svrResult = predictSVR(X_train, y_train, X_test, y_test)
svr_regressor = svrResult['regressor']
regValues['SVR'] = round(svrResult['values'][0])

For better results, we will use a Voting Regressor, an ensemble technique that fits several models, averages their individual predictions and returns the average as the final prediction.

vRegressor = VotingRegressor(estimators=[('reg', reg), ('polynomial_regressor', polynomial_regressor), ('svr_regressor', svr_regressor)])


vRegressorRes = vRegressor.fit(X_train, y_train.ravel())


# VotingRegressor - Predict for January 2020
vRegressor_predict = vRegressor.predict([[predictFor]])
regValues['Voting Regressor'] = round(vRegressor_predict[0])
print('Voting Regressor January 2020 predicted value: ' + str(round(vRegressor_predict[0])))
regResults.loc[product] = regValues

Voting Regressor January 2020 predicted value: 98.0
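
One detail is worth keeping in mind here. When fit is called, VotingRegressor fits fresh copies of the estimators it was given on the data passed to it, so a plain LinearRegression only sees polynomial features if the feature expansion is part of the estimator itself. A possible way to arrange this, sketched on made-up data and not taken from the notebook, is to wrap PolynomialFeatures and LinearRegression in a pipeline.

```python
import numpy as np
from sklearn import linear_model
from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Made-up data that follows an exact quadratic curve
X = np.arange(1, 31).reshape(-1, 1)
y = 0.1 * X.ravel() ** 2 + 5.0

linear = linear_model.LinearRegression()
# The pipeline expands X to polynomial features before the linear fit
polynomial = make_pipeline(PolynomialFeatures(degree=2), linear_model.LinearRegression())
svr = SVR(kernel='rbf', gamma='auto')

ensemble = VotingRegressor(
    estimators=[('linear', linear), ('polynomial', polynomial), ('svr', svr)]
)
ensemble.fit(X, y)

X_new = [[35]]
ensemble_prediction = ensemble.predict(X_new)[0]
# estimators_ holds the fitted copies, in the same order as above
individual = [fitted.predict(X_new)[0] for fitted in ensemble.estimators_]
print(individual, ensemble_prediction)
```

The ensemble prediction is the plain average of the three individual predictions, and the polynomial pipeline is the only member that can follow the curve exactly.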

Displaying all results

regResults

Summary

In this exercise, we have learned to use Pandas, the most popular data manipulation and analysis library in data science. Loading data sets with Pandas and performing a statistical analysis is one of the most important skills for a beginner data scientist.

Next, we looked at regression, which is another very important element of data science. Linear Regression, Polynomial Regression and SVR are basic models to start with when learning regression.

To consolidate your knowledge, consider completing the task again from the beginning without looking at the code examples from the book, and see what results you get.

The full Python code in a Jupyter Notebook is available on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/1-Analysing-Pharmaceutical-Sales-Data

Happy coding!

Translated from: https://medium.com/@pjarz/analysing-pharmaceutical-sales-data-in-python-6ce74da818ab

Analysing pharmaceutical sales data in Python