How to Build and Train Linear and Logistic Regression ML Models in Python

Linear regression and logistic regression are two of the most popular machine learning models today.

In the last article, you learned about the history and theory behind a linear regression machine learning algorithm.

This tutorial will teach you how to create, train, and test your first linear regression machine learning model in Python using the scikit-learn library.

Section 1: Linear Regression

The Data Set We Will Use in This Tutorial

Since we're just starting to learn about linear regression in machine learning, we will work with artificially-created datasets in this tutorial. This will allow you to focus on learning the machine learning concepts and avoid spending unnecessary time on cleaning or manipulating data.

More specifically, we will be working with a data set of housing data and attempting to predict housing prices. Before we build the model, we’ll first need to import the required libraries.

The Libraries We Will Use in This Tutorial

The first library that we need to import is pandas, which is a portmanteau of “panel data” and is the most popular Python library for working with tabular data.

It is convention to import pandas under the alias pd. You can import pandas with the following statement:

import pandas as pd

Next, we’ll need to import NumPy, which is a popular library for numerical computing. NumPy is known for its array data structure as well as useful functions such as reshape, arange, and append.

It is convention to import NumPy under the alias np. You can import numpy with the following statement:

import numpy as np

Next, we need to import matplotlib, which is Python’s most popular library for data visualization.

matplotlib’s pyplot module is typically imported under the alias plt. You can import it with the following statement:

import matplotlib.pyplot as plt

%matplotlib inline

The %matplotlib inline statement will cause all of our matplotlib visualizations to embed themselves directly in our Jupyter Notebook, which makes them easier to access and interpret.

Lastly, you will want to import seaborn, which is another Python data visualization library that makes it easier to create beautiful visualizations using matplotlib.

You can import seaborn with the following statement:

import seaborn as sns

To summarize, here are all of the imports required in this tutorial:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

In future articles, I will specify which imports are necessary but I will not explain each import in detail like I did here.

Importing the Data Set

As mentioned, we will be using a data set of housing information.

The data set has been uploaded to my website as a .csv file at the following URL:

https://nickmccullum.com/files/Housing_Data.csv

To import the data set into your Jupyter Notebook, the first thing you should do is download the file by copying and pasting this URL into your browser. Then, move the file into the same directory as your Jupyter Notebook.

Once this is done, the following Python statement will import the housing data set into your Jupyter Notebook:

raw_data = pd.read_csv('Housing_Data.csv')

This data set has a number of features, including:

The average income in the area of the house
The average number of total rooms in the area
The price that the house sold for
The address of the house

This data is randomly generated, so you will see a few nuances that might not normally make sense (such as a large number of decimal places after a number that should be an integer).

Understanding the Data Set

Now that the data set has been imported under the raw_data variable, you can use the info method to get some high-level information about the data set. Specifically, running raw_data.info() gives:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 5000 entries, 0 to 4999

Data columns (total 7 columns):

Avg. Area Income                5000 non-null float64

Avg. Area House Age             5000 non-null float64

Avg. Area Number of Rooms       5000 non-null float64

Avg. Area Number of Bedrooms    5000 non-null float64

Area Population                 5000 non-null float64

Price                           5000 non-null float64

Address                         5000 non-null object

dtypes: float64(6), object(1)

memory usage: 273.6+ KB

Another useful way that you can learn about this data set is by generating a pairplot. You can use the seaborn method pairplot for this, and pass in the entire DataFrame as a parameter. Here is the entire statement for this:

sns.pairplot(raw_data)

The output of this statement is below:

Next, let’s begin building our linear regression model.

Building a Machine Learning Linear Regression Model

The first thing we need to do is split our data into an x-array (which contains the data that we will use to make predictions) and a y-array (which contains the data that we are trying to predict).

First, we should decide which columns to include. You can generate a list of the DataFrame’s columns using raw_data.columns, which outputs:

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',

       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],

      dtype='object')

We will be using all of these variables in the x-array except for Price (since that’s the variable we’re trying to predict) and Address (since it only contains text).

Let’s create our x-array and assign it to a variable called x.

x = raw_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',

       'Avg. Area Number of Bedrooms', 'Area Population']]

Next, let’s create our y-array and assign it to a variable called y.

y = raw_data['Price']

We have successfully divided our data set into an x-array (which holds the input values of our model) and a y-array (which holds the output values of our model). We’ll learn how to split our data set further into training data and test data in the next section.

Splitting our Data Set into Training Data and Test Data

scikit-learn makes it very easy to divide our data set into training data and test data. To do this, we’ll need to import the function train_test_split from the model_selection module of scikit-learn.

Here is the full code to do this:

from sklearn.model_selection import train_test_split

In this tutorial, we will pass the train_test_split function three arguments:

Our x-array
Our y-array
The desired size of our test data

With these parameters, the train_test_split function will split our data for us! Here’s the code to do this if we want our test data to be 30% of the entire data set:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

Let’s unpack what is happening here.

The train_test_split function returns a Python list of length 4, where each item in the list is x_train, x_test, y_train, and y_test, respectively. We then use list unpacking to assign the proper values to the correct variable names.
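
The return value described above can be checked on a small toy array. The sketch below is only an illustration: the names x_demo, y_demo, and parts are hypothetical and not part of the tutorial, and random_state is an optional argument added here so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tutorial's x and y: 10 observations, 2 features.
# Row i of x_demo is [2*i, 2*i + 1] and its label in y_demo is i.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state fixes the shuffle so the same split comes back on every run.
parts = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)

# The function returns a list of four arrays, which we unpack in order.
x_train_demo, x_test_demo, y_train_demo, y_test_demo = parts

print(type(parts), len(parts))
print(x_train_demo.shape, x_test_demo.shape)
```

With 10 observations and test_size=0.3, three rows go to the test set and seven to the training set, and each row of features stays paired with its own label.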

Now that we have properly divided our data set, it is time to build and train our linear regression machine learning model.

Building and Training the Model

The first thing we need to do is import the LinearRegression estimator from scikit-learn. Here is the Python statement for this:

from sklearn.linear_model import LinearRegression

Next, we need to create an instance of the LinearRegression class. We will assign this to a variable called model. Here is the code for this:

model = LinearRegression()

We can use the model’s fit method to train it on our training data.

model.fit(x_train, y_train)

Our model has now been trained. You can examine each of the model’s coefficients using the following statement:

print(model.coef_)

This prints:

[2.16176350e+01 1.65221120e+05 1.21405377e+05 1.31871878e+03

 1.52251955e+01]

Similarly, here is how you can see the intercept of the regression equation:

print(model.intercept_)

This prints:

-2641372.6673013503

A nicer way to view the coefficients is by placing them in a DataFrame. This can be done with the following statement:

pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])

The output in this case is much easier to interpret:

Let’s take a moment to understand what these coefficients mean. Let’s look at the Area Population variable specifically, which has a coefficient of approximately 15.

What this means is that if you hold all other variables constant, then a one-unit increase in Area Population will result in a 15-unit increase in the predicted variable - in this case, Price.
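
To make this interpretation concrete, here is a minimal sketch on synthetic data rather than the housing data set. The ground-truth coefficients (3 and 15) and all variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with a known relationship (an assumption made
# purely for this illustration): target = 3 * f0 + 15 * f1 + 7
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
target = 3 * features[:, 0] + 15 * features[:, 1] + 7

demo_model = LinearRegression().fit(features, target)

# Hold the first feature constant and raise the second by exactly one unit.
base = np.array([[1.0, 2.0]])
bumped = base + np.array([[0.0, 1.0]])
change = demo_model.predict(bumped)[0] - demo_model.predict(base)[0]

# The change in the prediction equals the coefficient on the second feature.
print(demo_model.coef_, change)
```

The prediction moves by the coefficient of the feature that changed, which is the same reasoning applied to Area Population above.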

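Said differently, a large coefficient on a specific variable means that a one-unit change in that variable has a large impact on the value of the variable you’re trying to predict. Similarly, a small coefficient means a small impact per unit. Keep in mind that the size of a coefficient depends on the units of its variable, so coefficients are only directly comparable between variables measured on similar scales.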

Now that we’ve generated our first machine learning linear regression model, it’s time to use the model to make predictions from our test data set.

Making Predictions From Our Model

scikit-learn makes it very easy to make predictions from a machine learning model. You simply need to call the predict method on the model variable that we created earlier.

Since the predict method is designed to make predictions, it only accepts an x-array parameter. It will generate the y values for you!

Here is the code you’ll need to generate predictions from our model using the predict method:

predictions = model.predict(x_test)

The predictions variable holds the values that the model predicts from the features stored in x_test. Since we used the train_test_split function to store the real values in y_test, what we want to do next is compare the values of the predictions array with the values of y_test.

An easy way to do this is to plot the two arrays using a scatterplot. It’s easy to build matplotlib scatterplots using the plt.scatter function. Here’s the code for this:

plt.scatter(y_test, predictions)

Here’s the scatterplot that this code generates:

As you can see, our predicted values are very close to the actual values for the observations in the data set. A perfectly straight diagonal line in this scatterplot would indicate that our model perfectly predicted the y-array values.

Another way to visually assess the performance of our model is to plot its residuals, which are the differences between the actual y-array values and the predicted y-array values.

An easy way to do this is with the following statement:

plt.hist(y_test - predictions)

Here is the visualization that this code generates:

This is a histogram of the residuals from our machine learning model.

You may notice that the residuals from our machine learning model appear to be normally distributed. This is a very good sign!

It indicates that we have selected an appropriate model type (in this case, linear regression) to make predictions from our data set. We will learn more about how to make sure you’re using the right model later in this course.

Testing the Performance of our Model

We learned near the beginning of this course that there are three main performance metrics used for regression machine learning models:

Mean absolute error
Mean squared error
Root mean squared error

We will now see how to calculate each of these metrics for the model we’ve built in this tutorial. Before proceeding, run the following import statement within your Jupyter Notebook:

from sklearn import metrics

Mean Absolute Error (MAE)

You can calculate mean absolute error in Python with the following statement:

metrics.mean_absolute_error(y_test, predictions)

Mean Squared Error (MSE)

Similarly, you can calculate mean squared error in Python with the following statement:

metrics.mean_squared_error(y_test, predictions)

Root Mean Squared Error (RMSE)

Unlike mean absolute error and mean squared error, older versions of scikit-learn do not have a dedicated function for calculating root mean squared error (recent releases add metrics.root_mean_squared_error).

Fortunately, one is not really needed. Since root mean squared error is just the square root of mean squared error, you can use NumPy’s sqrt function to easily calculate it:

np.sqrt(metrics.mean_squared_error(y_test, predictions))
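
To see what these three metrics compute, it can help to work them out by hand on a few numbers. The arrays below are made-up values for illustration, not output from the housing model.

```python
import numpy as np

# Made-up actual and predicted values, chosen so the arithmetic is easy to follow.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = actual - predicted        # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                # square root of 1.5

print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize the single error of size 2 more heavily than MAE does.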

The Complete Code For This Tutorial

Here is the entire code for this Python linear regression machine learning tutorial. You can also view it in this GitHub repository.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

raw_data = pd.read_csv('Housing_Data.csv')

x = raw_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',

       'Avg. Area Number of Bedrooms', 'Area Population']]

y = raw_data['Price']

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(x_train, y_train)

print(model.coef_)

print(model.intercept_)

pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])

predictions = model.predict(x_test)

# plt.scatter(y_test, predictions)

plt.hist(y_test - predictions)

from sklearn import metrics

metrics.mean_absolute_error(y_test, predictions)

metrics.mean_squared_error(y_test, predictions)

np.sqrt(metrics.mean_squared_error(y_test, predictions))

Section 2: Logistic Regression

Note - if you have been coding along with this tutorial so far and built your linear regression model already, you'll want to open a new Jupyter Notebook (with no code in it) before proceeding.

The Data Set We Will Be Using in This Tutorial

The Titanic data set is a very famous data set that contains characteristics about the passengers on the Titanic. It is often used as an introductory data set for logistic regression problems.

In this tutorial, we will be using the Titanic data set combined with a Python logistic regression model to predict whether or not a passenger survived the sinking of the Titanic.

The original Titanic data set is publicly available on Kaggle.com, which is a website that hosts data sets and data science competitions.

To make things easier for you as a student in this course, we will be using a semi-cleaned version of the Titanic data set, which will save you time on data cleaning and manipulation.

The cleaned Titanic data set has actually already been made available for you. You can download the data file by clicking the links below:

Titanic data

Once this file has been downloaded, open a Jupyter Notebook in the same working directory and we can begin building our logistic regression model.

The Imports We Will Be Using in This Tutorial

As before, we will be using multiple open-source software libraries in this tutorial. Here are the imports you will need to run to follow along as I code through our Python logistic regression model:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

Next, we will need to import the Titanic data set into our Python script.

Importing the Data Set into our Python Script

We will be using pandas’ read_csv function to import our csv file into a pandas DataFrame called titanic_data.

Here is the code to do this:

titanic_data = pd.read_csv('titanic_train.csv')

Next, let’s investigate what data is actually included in the Titanic data set. There are two main methods to do this (using the titanic_data DataFrame specifically):

The titanic_data.head(5) method will return the first 5 rows of the DataFrame. You can substitute 5 with whichever number you’d like.
You can also print titanic_data.columns, which will show you the column names.

Running the second command (titanic_data.columns) generates the following output:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',

       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],

      dtype='object')

These are the names of the columns in the DataFrame. Here are brief explanations of each data point:

PassengerId: a numerical identifier for every passenger on the Titanic.
Survived: a binary identifier that indicates whether or not the passenger survived the sinking of the Titanic. This variable holds a value of 1 if they survived and 0 if they did not.
Pclass: the passenger class of the passenger in question. This can hold a value of 1, 2, or 3, for first, second, or third class.
Name: the passenger’s name.
Sex: male or female.
Age: the age (in years) of the passenger.
SibSp: the number of siblings and spouses aboard the ship.
Parch: the number of parents and children aboard the ship.
Ticket: the passenger’s ticket number.
Fare: how much the passenger paid for their ticket on the Titanic.
Cabin: the passenger’s cabin number.
Embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton).

Next up, we will learn more about our data set by using some basic exploratory data analysis techniques.

Learning About Our Data Set With Exploratory Data Analysis

The Prevalence of Each Classification Category

When using machine learning techniques to model classification problems, it is always a good idea to have a sense of the ratio between categories. For this specific problem, it’s useful to see how many survivors vs. non-survivors exist in our training data.

An easy way to visualize this is using the seaborn countplot function. In this example, you could create the appropriate seaborn plot with the following Python code:

sns.countplot(x='Survived', data=titanic_data)

This generates the following plot:

As you can see, we have many more incidences of non-survivors than we do of survivors.
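
If you prefer numbers to a chart, the same class balance can be read off with the pandas value_counts method. The sketch below uses a tiny made-up DataFrame named toy (an assumption for illustration) rather than the real titanic_data.

```python
import pandas as pd

# Tiny made-up sample standing in for titanic_data['Survived'].
toy = pd.DataFrame({'Survived': [0, 0, 1, 0, 1]})

# Raw counts per category (what countplot draws as bar heights).
counts = toy['Survived'].value_counts()

# The same information expressed as a share of all rows.
shares = toy['Survived'].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data you would call the same methods on titanic_data['Survived'].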

Survival Rates Between Genders

It is also useful to compare survival rates relative to some other data feature. For example, we can compare survival rates between the Male and Female values for Sex using the following Python code:

sns.countplot(x='Survived', hue='Sex', data=titanic_data)

This generates the following plot:

As you can see, passengers with a Sex of Male were much more likely to be non-survivors than passengers with a Sex of Female.

Survival Rates Between Passenger Classes

We can perform a similar analysis using the Pclass variable to see which passenger class was the most (and least) likely to have passengers that were survivors.

Here is the code to do this:

sns.countplot(x='Survived', hue='Pclass', data=titanic_data)

This generates the following plot:

The most noticeable observation from this plot is that passengers with a Pclass value of 3 - which indicates the third class, which was the cheapest and least luxurious - were much more likely to die when the Titanic sank.

The Age Distribution of Titanic Passengers

One other useful analysis we could perform is investigating the age distribution of Titanic passengers. A histogram is an excellent tool for this.

You can generate a histogram of the Age variable with the following code:

plt.hist(titanic_data['Age'].dropna())

Note that the dropna() method is necessary since the data set contains several null values.

Here is the histogram that this code generates:

As you can see, there is a concentration of Titanic passengers with an Age value between 20 and 40.

The Ticket Price Distribution of Titanic Passengers

The last exploratory data analysis technique that we will use is investigating the distribution of fare prices within the Titanic data set.

You can do this with the following code:

plt.hist(titanic_data['Fare'])

This generates the following plot:

As you can see, there are three distinct groups of Fare prices within the Titanic data set. This makes sense because there are also three unique values for the Pclass variable. The different Fare groups correspond to the different Pclass categories.

Since the Titanic data set is a real-world data set, it contains some missing data. We will learn how to deal with missing data in the next section.

Removing Null Data From Our Data Set

To start, let’s examine where our data set contains missing data. To do this, run the following command:

titanic_data.isnull()

This will generate a DataFrame of boolean values where the cell contains True if it is a null value and False otherwise. Here is an image of what this looks like:

A far more useful method for assessing missing data in this data set is by creating a quick visualization. To do this, we can use the seaborn visualization library. Here is a quick command that you can use to create a heatmap using the seaborn library:

sns.heatmap(titanic_data.isnull(), cbar=False)

Here is the visualization that this generates:

In this visualization, the white lines indicate missing values in the dataset. You can see that the Age and Cabin columns contain the majority of the missing data in the Titanic data set.
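
The heatmap gives a visual impression. As a complement, the same boolean DataFrame can be summed to get exact counts per column. This is a minimal sketch on a made-up DataFrame named toy (an assumption for illustration), not the real titanic_data.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern of gaps in the Age and Cabin columns.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': [None, None, 'C85', None],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

# True counts as 1, so summing the boolean mask counts missing cells per column.
missing_counts = toy.isnull().sum()

# Taking the mean instead gives the fraction of each column that is missing.
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Running the same two lines on titanic_data would show how many values are missing in each of its columns.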

The Age column in particular contains a small enough amount of missing data that we can fill it in using some form of mathematics. On the other hand, the Cabin column is missing enough data that we could probably remove it from our model entirely.

The process of filling in missing data with average data from the rest of the data set is called imputation. We will now use imputation to fill in the missing data from the Age column.

The most basic form of imputation would be to fill in the missing Age data with the average Age value across the entire data set. However, there are better methods.

We will fill in the missing Age values with the average Age value for the specific Pclass passenger class that the passenger belongs to. To understand why this is useful, consider the following boxplot:

sns.boxplot(x='Pclass', y='Age', data=titanic_data)

As you can see, the passengers with a Pclass value of 1 (the most expensive passenger class) tend to be the oldest while the passengers with a Pclass value of 3 (the cheapest) tend to be the youngest. This is very logical, so we will use the average Age value within each Pclass group to impute the missing data in our Age column.

The easiest way to perform imputation on a data set like the Titanic data set is by building a custom function. To start, we will need to determine the mean Age value for each Pclass value.

#Pclass value 1

titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()

#Pclass value 2

titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()

#Pclass value 3

titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()

Here is the final function that we will use to impute our missing Age values:

def impute_missing_age(columns):

    age = columns['Age']

    passenger_class = columns['Pclass']

    

    if pd.isnull(age):

        if(passenger_class == 1):

            return titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()

        elif(passenger_class == 2):

            return titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()

        elif(passenger_class == 3):

            return titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()

        

    else:

        return age

Now that this imputation function is complete, we need to apply it to every row in the titanic_data DataFrame. The pandas apply method is an excellent tool for this:

titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(impute_missing_age, axis = 1)
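
As an aside, the same per-class imputation can also be written without a custom function by combining groupby, transform, and fillna. This is an alternative sketch, not the approach used in this tutorial, and it runs on a small made-up DataFrame named toy (an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Made-up passengers: two classes, each with one missing Age.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# For every row, look up the mean Age of that row's Pclass group.
# Missing values are ignored when the group means are computed.
class_means = toy.groupby('Pclass')['Age'].transform('mean')

# Fill only the missing ages; existing ages are left untouched.
toy['Age'] = toy['Age'].fillna(class_means)

print(toy)
```

Here the missing first-class age becomes 45 (the mean of 40 and 50) and the missing third-class age becomes 25 (the mean of 20 and 30).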

Now that we have performed imputation on every row to deal with our missing Age data, let’s take another look at our original heatmap:

sns.heatmap(titanic_data.isnull(), cbar=False)

You will notice there is no longer any missing data in the Age column of our pandas DataFrame!

You might be wondering why we spent so much time dealing with missing data in the Age column specifically. It is because given the impact of Age on survival for most disasters and diseases, it is a variable that is likely to have high predictive value within our data set.

Now that we have an understanding of the structure of this data set and have removed its missing data, let’s begin building our logistic regression machine learning model.

Building a Logistic Regression Model

It is now time to build our logistic regression model.

Removing Columns With Too Much Missing Data

First, let’s remove the Cabin column. As we mentioned, the high prevalence of missing data in this column means that it is unwise to impute the missing data, so we will remove it entirely with the following code:

titanic_data.drop('Cabin', axis=1, inplace = True)

Next, let’s remove any remaining rows that contain missing data with the pandas dropna() method:

titanic_data.dropna(inplace = True)
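
It is worth knowing that dropna() removes rows by default and only removes columns when you pass axis=1. The sketch below shows the difference on a small made-up DataFrame named toy (an assumption for illustration).

```python
import pandas as pd

# Made-up data with a single missing Embarked value in the second row.
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', None, 'Q'],
})

# Default behaviour (axis=0): drop every ROW that contains a missing value.
rows_dropped = toy.dropna()

# With axis=1: drop every COLUMN that contains a missing value.
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape, cols_dropped.shape)
```

In the row-wise case only the incomplete passenger is lost, while the column-wise case discards the whole Embarked feature.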

Handling Categorical Data With Dummy Variables

The next task we need to handle is dealing with categorical features. Namely, we need to find a way to numerically work with observations that are not naturally numerical.

A great example of this is the Sex column, which has two values: male and female. Similarly, the Embarked column contains a single letter which indicates which city the passenger departed from.

一个很好的例子是Sex列，该列具有两个值：male和female。同样，Embarked列包含一个字母，指示该乘客出发的城市。

To solve this problem, we will create dummy variables. These assign a numerical value to each category of a non-numerical feature.

为了解决这个问题，我们将创建dummy variables 。这些将数字值分配给非数字特征的每个类别。

Fortunately, pandas has a built-in method called get_dummies() that makes it easy to create dummy variables. The get_dummies method does have one issue - it will create a new column for each value in the DataFrame column.

幸运的是， pandas有一个名为get_dummies()的内置方法，可轻松创建虚拟变量。 get_dummies方法确实存在一个问题-它会为DataFrame列中的每个值创建一个新列。

Let’s consider an example to help understand this better. If we call the get_dummies() method on the Sex column, we get the following output:

让我们考虑一个示例，以帮助您更好地理解这一点。如果我们在Sex列上调用get_dummies()方法，会得到以下输出：

pd.get_dummies(titanic_data['Sex'])

As you can see, this creates two new columns: female and male. These columns will both be perfect predictors of each other, since a value of 0 in the female column indicates a value of 1 in the male column, and vice versa.

如您所见，这将创建两个新列： female和male 。这些列都将是彼此的完美预测器，因为female列中的值为0表示male列中的值为1 ，反之亦然。

This is called multicollinearity, and it can make a model’s coefficient estimates unstable and difficult to interpret. To avoid it, we can add the argument drop_first = True to the get_dummies method like this:

这称为multicollinearity，它会使模型的系数估计变得不稳定且难以解释。为了避免这种情况，我们可以将参数drop_first = True添加到get_dummies方法中，如下所示：

pd.get_dummies(titanic_data['Sex'], drop_first = True)
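To make the effect of drop_first concrete, here is a small sketch on a made-up Sex column (toy_sex is a hypothetical name). It shows that the two dummy columns always sum to 1 in every row, which is why one of them is redundant:

```python
import pandas as pd

# Hypothetical stand-in for titanic_data['Sex']
toy_sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# One column per category: 'female' and 'male'
all_dummies = pd.get_dummies(toy_sex)

# The two columns are complements, so each row sums to exactly 1
row_sums = all_dummies.astype(int).sum(axis=1)

# drop_first=True removes the first category in sorted order ('female')
reduced_dummies = pd.get_dummies(toy_sex, drop_first=True)

print(list(all_dummies.columns))      # ['female', 'male']
print(row_sums.tolist())              # [1, 1, 1, 1]
print(list(reduced_dummies.columns))  # ['male']
```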

Now, let’s create dummy variable columns for our Sex and Embarked columns, and assign them to variables called sex_data and embarked_data.

现在，让我们为Sex和Embarked列创建虚拟变量列，并将它们分配给名为sex_data和embarked_data的变量。

sex_data = pd.get_dummies(titanic_data['Sex'], drop_first = True)

embarked_data = pd.get_dummies(titanic_data['Embarked'], drop_first = True)

There is one important thing to note about the embarked_data variable defined above. It has two columns: Q and S. Because we have already removed one other column (the C column), neither of the remaining two columns is a perfect predictor of the other, so perfect multicollinearity does not exist in the new, modified data set.

关于上面定义的embarked_data变量，有一点需要注意。它有两列：Q和S，但是由于我们已经删除了另一列(C列)，其余两列都不是对方的完美预测变量，因此在修改后的新数据集中不存在完全的multicollinearity。
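With three categories the picture is slightly different from the Sex column. The sketch below (toy_embarked is a hypothetical example) shows that a passenger who boarded at C is encoded as 0 in both remaining columns, so knowing Q alone no longer tells you S:

```python
import pandas as pd

# Hypothetical stand-in for titanic_data['Embarked']
toy_embarked = pd.Series(['C', 'Q', 'S', 'C'], name='Embarked')

# drop_first=True removes the 'C' column and keeps 'Q' and 'S'
embarked_dummies = pd.get_dummies(toy_embarked, drop_first=True).astype(int)
print(embarked_dummies)

# When Q == 0, S can be either 0 (port C) or 1 (port S)
s_values_when_q_is_zero = sorted(
    embarked_dummies.loc[embarked_dummies['Q'] == 0, 'S'].unique().tolist()
)
print(s_values_when_q_is_zero)  # [0, 1]
```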

将虚拟变量添加到`pandas` DataFrame (Adding Dummy Variables to the `pandas` DataFrame)

Next we need to add our sex_data and embarked_data columns to the DataFrame.

接下来，我们需要将sex_data和embarked_data列添加到DataFrame中。

You can concatenate these data columns into the existing pandas DataFrame with the following code:

您可以使用以下代码将这些数据列连接到现有的pandas DataFrame中：

titanic_data = pd.concat([titanic_data, sex_data, embarked_data], axis = 1)
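One detail worth knowing is that pd.concat with axis = 1 lines rows up by their index labels rather than by their position. A minimal sketch with made-up frames (left and right are hypothetical names):

```python
import pandas as pd

# Two hypothetical frames that share the index labels 0 and 2
left = pd.DataFrame({'Fare': [7.25, 8.05]}, index=[0, 2])
right = pd.DataFrame({'male': [1, 0]}, index=[0, 2])

# Rows are matched on their index labels, not on their position
combined = pd.concat([left, right], axis=1)
print(combined)
```

Because sex_data and embarked_data are derived from titanic_data after the dropna() call, they share its index, so the concatenation in this tutorial lines up as intended.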

Now if you run the command print(titanic_data.columns), your Jupyter Notebook will generate the following output:

现在，如果您运行命令print(titanic_data.columns) ，Jupyter Notebook将生成以下输出：

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',

       'Parch', 'Ticket', 'Fare', 'Embarked', 'male', 'Q', 'S'],

      dtype='object')

The existence of the male, Q, and S columns shows that our data was concatenated successfully.

male ， Q和S列的存在表明我们的数据已成功连接。

从数据集中删除不必要的列 (Removing Unnecessary Columns From The Data Set)

This means that we can now drop the original Sex and Embarked columns from the DataFrame. There are also other columns (like Name, PassengerId, and Ticket) that are not predictive of survival on the Titanic, so we will remove those as well. The following code handles this for us:

这意味着我们现在可以从DataFrame中删除原始的Sex和Embarked列。还有其他一些列(如Name ， PassengerId ， Ticket )无法预测泰坦尼克号的撞车幸存率，因此我们也将其删除。以下代码为我们处理了此问题：

titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Sex', 'Embarked'], axis = 1, inplace = True)

If you print titanic_data.columns now, your Jupyter Notebook will generate the following output:

如果您现在打印titanic_data.columns ，则Jupyter Notebook将生成以下输出：

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',

       'male', 'Q', 'S'],

      dtype='object')

The DataFrame now has the following appearance:

DataFrame现在具有以下外观：

As you can see, every field in this data set is now numeric, which makes it an excellent candidate for a logistic regression machine learning algorithm.

如您所见，该数据集中的每个字段现在都是数字，这使其成为逻辑回归机器学习算法的理想选择。

创建培训数据和测试数据 (Creating Training Data and Test Data)

Next, it’s time to split our titanic_data into training data and test data. As before, we will use built-in functionality from scikit-learn to do this.

接下来，是时候将我们的titanic_data分为训练数据和测试数据了。和以前一样，我们将使用scikit-learn内置功能来执行此操作。

First, we need to divide our data into x values (the data we will be using to make predictions) and y values (the data we are attempting to predict). The following code handles this:

首先，我们需要将数据分为x值(我们将用于进行预测的数据)和y值(我们正在尝试预测的数据)。以下代码处理此问题：

y_data = titanic_data['Survived']

x_data = titanic_data.drop('Survived', axis = 1)

Next, we need to import the train_test_split function from scikit-learn. The following code executes this import:

接下来，我们需要从scikit-learn导入train_test_split函数。以下代码执行此导入：

from sklearn.model_selection import train_test_split

Lastly, we can use the train_test_split function combined with list unpacking to generate our training data and test data:

最后，我们可以结合使用train_test_split函数和列表解包来生成我们的训练数据和测试数据：

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size = 0.3)

Note that in this case, the test data is 30% of the original data set as specified with the parameter test_size = 0.3.

请注意，在这种情况下，测试数据是参数test_size = 0.3指定的原始数据集的30％。
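Because train_test_split shuffles the rows randomly, each run produces a different split. If you want reproducible results, you can pass a random_state, and the stratify argument keeps the class balance similar in both parts. The sketch below uses synthetic data, and the array names and the seed 42 are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 40% positive labels
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo,
    test_size=0.3,      # 30% of the rows go to the test set
    random_state=42,    # a fixed seed makes the split repeatable
    stratify=y_demo,    # keep the share of positives similar in both parts
)

print(x_train.shape, x_test.shape)  # (70, 3) (30, 3)
print(y_test.mean())
```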

We have now created our training data and test data for our logistic regression model. We will train our model in the next section of this tutorial.

现在，我们为逻辑回归模型创建了训练数据和测试数据。我们将在本教程的下一部分中训练模型。

训练逻辑回归模型 (Training the Logistic Regression Model)

To train our model, we will first need to import the appropriate model from scikit-learn with the following command:

要训练我们的模型，我们首先需要使用以下命令从scikit-learn导入适当的模型：

from sklearn.linear_model import LogisticRegression

Next, we need to create our model by creating an instance of the LogisticRegression class:

接下来，我们需要通过创建LogisticRegression类的实例来创建模型：

model = LogisticRegression()

To train the model, we need to call the fit method on the LogisticRegression object we just created and pass in our x_training_data and y_training_data variables, like this:

要训练模型，我们需要在刚刚创建的LogisticRegression对象上调用fit方法，并传入x_training_data和y_training_data变量，如下所示：

model.fit(x_training_data, y_training_data)

Our model has now been trained. We will begin making predictions using this model in the next section of this tutorial.

我们的模型现已训练完毕。我们将在本教程的下一部分中开始使用此模型进行预测。
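After fitting, the model object exposes what it has learned. The sketch below trains on synthetic data (not the Titanic set) and inspects the coef_ and intercept_ attributes. It also passes max_iter=1000, which is an optional adjustment and an assumption on our part: with unscaled features the default solver can emit a ConvergenceWarning, and raising the iteration limit is one common way to address that.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends almost entirely on the first feature
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# demo_model is a hypothetical name, separate from the tutorial's model
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(x_demo, y_demo)

# One coefficient per feature, plus a single intercept
print(demo_model.coef_)
print(demo_model.intercept_)
```

A larger coefficient (in absolute value) means the feature moves the predicted log-odds more per unit change, which is only comparable across features when they are on similar scales.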

使用我们的Logistic回归模型进行预测 (Making Predictions With Our Logistic Regression Model)

Let’s make a set of predictions on our test data using the logistic regression model we just trained, which is stored in the variable model. We will store these predictions in a variable called predictions:

让我们使用刚刚训练好的逻辑回归模型(存储在变量model中)对测试数据进行一组预测。我们将这些预测存储在一个名为predictions的变量中：

predictions = model.predict(x_test_data)

Our predictions have been made. Let’s examine the accuracy of our model next.

我们已经做出了预测。接下来让我们检查模型的准确性。
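The predict method returns hard class labels. If you also want the estimated probabilities behind them, scikit-learn provides the predict_proba method. Here is a minimal sketch on synthetic data (demo_model and the arrays are hypothetical names, not part of the tutorial's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a simple linear decision rule
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] - x_demo[:, 1] > 0).astype(int)

demo_model = LogisticRegression().fit(x_demo, y_demo)

labels = demo_model.predict(x_demo[:5])
# Column 0 is P(class 0), column 1 is P(class 1)
probabilities = demo_model.predict_proba(x_demo[:5])

# For a binary model, the predicted label is 1 when P(class 1) exceeds 0.5
labels_from_probabilities = (probabilities[:, 1] > 0.5).astype(int)
print(labels)
print(labels_from_probabilities)
```

Working with probabilities directly lets you pick a threshold other than 0.5 if, for example, missing a survivor is more costly than a false alarm.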

测量Logistic回归机器学习模型的性能 (Measuring the Performance of a Logistic Regression Machine Learning Model)

scikit-learn has an excellent built-in function called classification_report that makes it easy to measure the performance of a classification machine learning model. We will use this function to measure the performance of the model that we just created.

scikit-learn具有一个出色的内置函数，称为classification_report，可轻松测量分类机器学习模型的性能。我们将使用此函数来评估我们刚刚创建的模型的性能。

First, let’s import the function:

首先，让我们导入该函数：

from sklearn.metrics import classification_report

Next, let’s use the function to calculate the performance metrics for our logistic regression machine learning model:

接下来，让我们使用该函数为我们的逻辑回归机器学习模型计算性能指标：

print(classification_report(y_test_data, predictions))

Here is the output of this command:

这是此命令的输出：

              precision    recall  f1-score   support

           0       0.83      0.87      0.85       169

           1       0.75      0.68      0.72        98

    accuracy                           0.80       267

   macro avg       0.79      0.78      0.78       267

weighted avg       0.80      0.80      0.80       267
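The printed report is convenient to read, but if you want to use the numbers in code, classification_report can return them as a dictionary through its output_dict parameter. A small sketch with made-up labels (y_true and y_pred are hypothetical):

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# output_dict=True returns nested dictionaries keyed by class label (as strings)
report = classification_report(y_true, y_pred, output_dict=True)

print(report['1']['precision'])
print(report['1']['recall'])
print(report['accuracy'])
```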

If you’re interested in seeing the raw confusion matrix and calculating the performance metrics manually, you can do this with the following code:

如果您有兴趣查看原始的混淆矩阵并手动计算性能指标，则可以使用以下代码进行操作：

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test_data, predictions))

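This generates output like the following. Because train_test_split shuffles the data randomly, the exact numbers change from run to run, which would explain why the row totals here (167 and 100) differ from the support column in the report above (169 and 98):

这将产生类似下面的输出。由于train_test_split会随机打乱数据，每次运行的具体数字都会有所不同，这也可以解释为什么此处的行合计(167和100)与上面报告中的support列(169和98)不一致：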

[[145  22]

 [ 30  70]]
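To connect the confusion matrix to the metrics in the report, the sketch below recomputes precision, recall, F1, and accuracy for class 1 by hand from the matrix printed above. scikit-learn lays the matrix out with actual labels as rows and predicted labels as columns:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 145, 22   # actual 0: predicted 0, predicted 1
fn, tp = 30, 70    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.76 0.7 0.73 0.81
```

These values are close to, but not identical with, the report shown earlier, since the support counts indicate the two outputs come from different random splits.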

本教程的完整代码 (The Full Code for This Tutorial)

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

您可以在GitHub存储库中查看本教程的完整代码。还将其粘贴在下面以供您参考：

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

#Import the data set

titanic_data = pd.read_csv('titanic_train.csv')

#Exploratory data analysis

sns.heatmap(titanic_data.isnull(), cbar=False)

sns.countplot(x='Survived', data=titanic_data)

sns.countplot(x='Survived', hue='Sex', data=titanic_data)

sns.countplot(x='Survived', hue='Pclass', data=titanic_data)

plt.hist(titanic_data['Age'].dropna())

plt.hist(titanic_data['Fare'])

sns.boxplot(x='Pclass', y='Age', data=titanic_data)

#Imputation function

def impute_missing_age(columns):

    age = columns['Age']

    passenger_class = columns['Pclass']

    

    if pd.isnull(age):

        if(passenger_class == 1):

            return titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()

        elif(passenger_class == 2):

            return titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()

        elif(passenger_class == 3):

            return titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()

        

    else:

        return age

#Impute the missing Age data

titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(impute_missing_age, axis = 1)

#Reinvestigate missing data

sns.heatmap(titanic_data.isnull(), cbar=False)

#Drop null data

titanic_data.drop('Cabin', axis=1, inplace = True)

titanic_data.dropna(inplace = True)

#Create dummy variables for Sex and Embarked columns

sex_data = pd.get_dummies(titanic_data['Sex'], drop_first = True)

embarked_data = pd.get_dummies(titanic_data['Embarked'], drop_first = True)

#Add dummy variables to the DataFrame and drop non-numeric data

titanic_data = pd.concat([titanic_data, sex_data, embarked_data], axis = 1)

titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Sex', 'Embarked'], axis = 1, inplace = True)

#Print the finalized data set

titanic_data.head()

#Split the data set into x and y data

y_data = titanic_data['Survived']

x_data = titanic_data.drop('Survived', axis = 1)

#Split the data set into training data and test data

from sklearn.model_selection import train_test_split

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size = 0.3)

#Create the model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

#Train the model and create predictions

model.fit(x_training_data, y_training_data)

predictions = model.predict(x_test_data)

#Calculate performance metrics

from sklearn.metrics import classification_report

print(classification_report(y_test_data, predictions))

#Generate a confusion matrix

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test_data, predictions))
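As an aside, the row-by-row impute_missing_age function in the listing above can also be expressed with a pandas groupby. The sketch below is an alternative under the same assumption (fill each missing Age with the mean Age of that passenger's Pclass), shown on made-up data rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Age and Pclass columns
toy_data = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean Age within each passenger class, broadcast back to every row
class_mean_age = toy_data.groupby('Pclass')['Age'].transform('mean')

# Keep the existing ages and fill only the missing ones
toy_data['Age'] = toy_data['Age'].fillna(class_mean_age)
print(toy_data)
```

This avoids recomputing the class mean for every row, which can matter on larger data sets.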

最后的想法 (Final Thoughts)

In this tutorial, you learned how to build linear regression and logistic regression machine learning models in Python.

在本教程中，您学习了如何在Python中构建线性回归和逻辑回归机器学习模型。

If you're interested in learning more about building, training, and deploying cutting-edge machine learning models, my eBook Pragmatic Machine Learning will teach you how to build 9 different machine learning models using real-world projects.

如果您想了解有关构建，训练和部署前沿机器学习模型的更多信息，我的电子书实用机器学习电子书将教您如何使用实际项目构建9种不同的机器学习模型。

You can deploy the code from the eBook to your GitHub or personal portfolio to show to prospective employers. The book launches on August 3rd – preorder it for 50% off now!

您可以将代码从电子书部署到GitHub或个人投资组合，以向潜在雇主展示。该书将于8月3日发行，现在可以50％的价格预订！

Here is a brief summary of what you learned in this article:

这是您从本文中学到的简短摘要：

How to import the libraries required to build a linear regression machine learning algorithm
如何导入构建线性回归机器学习算法所需的库
How to split a data set into training data and test data using scikit-learn
如何使用scikit-learn将数据集分为训练数据和测试数据
How to use scikit-learn to train a linear regression model and make predictions using that model
如何使用scikit-learn训练线性回归模型并使用该模型进行预测
How to calculate linear regression performance metrics using scikit-learn
如何使用scikit-learn计算线性回归性能指标
Why the Titanic data set is often used for learning machine learning classification techniques
为什么Titanic数据集经常用于学习机器学习分类技术
How to perform exploratory data analysis when working with a data set for classification machine learning problems
处理分类机器学习问题的数据集时如何执行探索性数据分析
How to handle missing data in a pandas DataFrame
如何处理Pandas DataFrame中的缺失数据
What imputation means and how you can use it to fill in missing data
imputation含义以及如何使用它来填写缺失的数据
How to create dummy variables for categorical data in machine learning data sets
如何为机器学习数据集中的分类数据创建虚拟变量
How to train a logistic regression machine learning model in Python
如何在Python中训练Logistic回归机器学习模型
How to make predictions using a logistic regression model in Python
如何在Python中使用逻辑回归模型进行预测
How to use scikit-learn’s classification_report to quickly calculate performance metrics for machine learning classification problems
如何使用scikit-learn的classification_report快速计算机器学习分类问题的性能指标

翻译自: https://www.freecodecamp.org/news/how-to-build-and-train-linear-and-logistic-regression-ml-models-in-python/
