cumei1658

kaggle 入门_Kaggle入门：房价竞争

kaggle 入门

Founded in 2010, Kaggle is a Data Science platform where users can share, collaborate, and compete. One key feature of Kaggle is “Competitions”, which offers users the ability to practice on real world data and to test their skills with, and against, an international community.

Kaggle成立于2010年，是一个数据科学平台，用户可以在其中共享，协作和竞争。 Kaggle的一项主要功能是“竞争”，它为用户提供了在现实世界数据上进行练习以及与国际社会进行对抗的能力，以测试其技能。

This guide will teach you how to approach and enter a Kaggle competition, including exploring the data, creating and engineering features, building models, and submitting predictions. We’ll use Python 3 and Jupyter Notebook.

本指南将教您如何进入和参加Kaggle竞赛，包括探索数据，创建和工程特征，构建模型以及提交预测。我们将使用Python 3和Jupyter Notebook 。

竞赛 (The Competition)

We’ll work through the House Prices: Advanced Regression Techniques competition.

我们将完成“ 房价：高级回归技术”竞赛。

We’ll follow these steps to a successful Kaggle Competition submission:

我们将按照以下步骤成功提交Kaggle竞赛：

Acquire the data
Explore the data
Engineer and transform the features and the target variable
Build a model
Make and submit predictions

采集数据
探索数据
设计并转换特征和目标变量
建立模型
做出并提交预测

步骤1：获取数据并创建环境 (Step 1: Acquire the data and create our environment)

We need to acquire the data for the competition. The descriptions of the features and some other helpful information are contained in a file with an obvious name, data_description.txt.

我们需要获取比赛数据。这些功能的描述和其他一些有用的信息包含在一个文件中，文件的名称很明显data_description.txt 。

Download the data and save it into a folder where you’ll keep everything you need for the competition.

下载数据并将其保存到文件夹中，以保存比赛所需的一切。

We will first look at the train.csv data. After we’ve trained a model, we’ll make predictions using the test.csv data.

我们将首先查看train.csv数据。训练模型后，我们将使用test.csv数据进行预测。

First, import Pandas, a fantastic library for working with data in Python. Next we’ll import Numpy.

首先，导入Pandas ，这是一个使用Python处理数据的绝佳库。接下来，我们将导入Numpy 。

import import pandas pandas as as pd
pd
import import numpy numpy as as np
np

We can use Pandas to read in csv files. The pd.read_csv() method creates a DataFrame from a csv file.

我们可以使用Pandas读取csv文件。 pd.read_csv()方法从csv文件创建一个DataFrame。

Let’s check out the size of the data.

让我们检查一下数据的大小。

print print (( "Train data shape:""Train data shape:" , , traintrain .. shapeshape )
)
print print (( "Test data shape:""Test data shape:" , , testtest .. shapeshape )
)


Train data shape: (1460, 81)
Test data shape: (1459, 80)

We see that test has only 80 columns, while train has 81. This is due to, of course, the fact that the test data do not include the final sale price information!

我们看到test只有80列，而train只有81列。这当然是由于测试数据不包含最终售价信息！

Next, we’ll look at a few rows using the DataFrame.head() method.

接下来，我们将使用DataFrame.head()方法查看几行。

		Id	ID	MSSubClass	MSSubClass	MSZoning	分区	LotFrontage	LotFrontage	LotArea	LotArea	Street	街	Alley	胡同	LotShape	LotShape	LandContour	陆地轮廓	Utilities	实用工具	LotConfig	LotConfig	LandSlope	土地坡度	Neighborhood	邻里	Condition1	条件1	Condition2	条件2	BldgType	BldgType	HouseStyle	家庭风格	OverallQual	综合素质	OverallCond	综合条件	YearBuilt	建立年份	YearRemodAdd	YearRemodAdd	RoofStyle	屋顶风格	RoofMatl	屋顶材料	Exterior1st	外观1	Exterior2nd	外观第二	MasVnrType	MasVnrType	MasVnrArea	MasVnrArea	ExterQual	扩展质量	ExterCond	ExterCond	Foundation	基础	BsmtQual	质量标准	BsmtCond	BsmtCond	BsmtExposure	BsmtExposure	BsmtFinType1	BsmtFinType1	BsmtFinSF1	BsmtFinSF1	BsmtFinType2	BsmtFinType2	BsmtUnfSF	BsmtUnfSF	TotalBsmtSF	TotalBsmtSF	Heating	加热	HeatingQC	加热QC	CentralAir	中央航空	Electrical	电的	1stFlrSF	1stFlrSF	2ndFlrSF	2ndFlrSF	GrLivArea	GrLivArea	BsmtFullBath	BsmtFullBath	BsmtHalfBath	Bsmt半浴	FullBath	全浴	HalfBath	半浴	BedroomAbvGr	BedroomAbvGr	KitchenAbvGr	KitchenAbvGr	KitchenQual	厨房质量	TotRmsAbvGrd	TotRmsAbvGrd	Functional	功能性	Fireplaces	壁炉	FireplaceQu	壁炉Qu	GarageType	车库类型	GarageYrBlt	车库	GarageFinish	车库完成	GarageCars	车库车	GarageArea	车库面积	GarageQual	车库质量	GarageCond	车库康德	PavedDrive	铺路	WoodDeckSF	木甲板SF	OpenPorchSF	OpenPorchSF	EnclosedPorch	封闭式门廊	PoolQC	池质量控制	Fence	围栏	MiscFeature	其他功能	MoSold	MoSold	YrSold	已售	SaleType	销售类型	SaleCondition	销售条件	SalePrice	销售价格
0	0	1	1个	60	60	RL	RL	65.0	65.0	8450	8450	Pave	铺平	NaN	N	Reg	注册	Lvl	等级	AllPub	AllPub	Inside	内	Gtl	Gtl	CollgCr	科勒格	Norm	规范	Norm	规范	1Fam	1名	2Story	2层	7	7	5	5	2003	2003年	2003	2003年	Gable	山墙	CompShg	补全	VinylSd	乙烯基	VinylSd	乙烯基	BrkFace	BrkFace	196.0	196.0	Gd	d	TA	TA	PConc	PConc	Gd	d	TA	TA	No	没有	GLQ	格力格	706	706	Unf	Unf	150	150	856	856	GasA	煤气	Ex	防爆	Y	ÿ	SBrkr	SBrkr	856	856	854	854	1710	1710	1	1个	0	0	2	2	1	1个	3	3	1	1个	Gd	d	8	8	Typ	典型值	0	0	NaN	N	Attchd	Attchd	2003.0	2003.0	RFn	射频	2	2	548	548	TA	TA	TA	TA	Y	ÿ	0	0	61	61	0	0	NaN	N	NaN	N	NaN	N	2	2	2008	2008年	WD	西数	Normal	正常	208500	208500
1	1个	2	2	20	20	RL	RL	80.0	80.0	9600	9600	Pave	铺平	NaN	N	Reg	注册	Lvl	等级	AllPub	AllPub	FR2	FR2	Gtl	Gtl	Veenker	Veenker	Feedr	送纸器	Norm	规范	1Fam	1名	1Story	1层	6	6	8	8	1976	1976年	1976	1976年	Gable	山墙	CompShg	补全	MetalSd	金属锡	MetalSd	金属锡	None	没有	0.0	0.0	TA	TA	TA	TA	CBlock	块	Gd	d	TA	TA	Gd	d	ALQ	ALQ	978	978	Unf	Unf	284	284	1262	1262	GasA	煤气	Ex	防爆	Y	ÿ	SBrkr	SBrkr	1262	1262	0	0	1262	1262	0	0	1	1个	2	2	0	0	3	3	1	1个	TA	TA	6	6	Typ	典型值	1	1个	TA	TA	Attchd	Attchd	1976.0	1976.0	RFn	射频	2	2	460	460	TA	TA	TA	TA	Y	ÿ	298	298	0	0	0	0	NaN	N	NaN	N	NaN	N	5	5	2007	2007年	WD	西数	Normal	正常	181500	181500
2	2	3	3	60	60	RL	RL	68.0	68.0	11250	11250	Pave	铺平	NaN	N	IR1	红外线1	Lvl	等级	AllPub	AllPub	Inside	内	Gtl	Gtl	CollgCr	科勒格	Norm	规范	Norm	规范	1Fam	1名	2Story	2层	7	7	5	5	2001	2001	2002	2002年	Gable	山墙	CompShg	补全	VinylSd	乙烯基	VinylSd	乙烯基	BrkFace	BrkFace	162.0	162.0	Gd	d	TA	TA	PConc	PConc	Gd	d	TA	TA	Mn	锰	GLQ	格力格	486	486	Unf	Unf	434	434	920	920	GasA	煤气	Ex	防爆	Y	ÿ	SBrkr	SBrkr	920	920	866	866	1786	1786年	1	1个	0	0	2	2	1	1个	3	3	1	1个	Gd	d	6	6	Typ	典型值	1	1个	TA	TA	Attchd	Attchd	2001.0	2001.0	RFn	射频	2	2	608	608	TA	TA	TA	TA	Y	ÿ	0	0	42	42	0	0	NaN	N	NaN	N	NaN	N	9	9	2008	2008年	WD	西数	Normal	正常	223500	223500
3	3	4	4	70	70	RL	RL	60.0	60.0	9550	9550	Pave	铺平	NaN	N	IR1	红外线1	Lvl	等级	AllPub	AllPub	Corner	角	Gtl	Gtl	Crawfor	克劳佛	Norm	规范	Norm	规范	1Fam	1名	2Story	2层	7	7	5	5	1915	1915年	1970	1970年	Gable	山墙	CompShg	补全	Wd Sdng	Wd Sdng	Wd Shng	Sh	None	没有	0.0	0.0	TA	TA	TA	TA	BrkTil	BrkTil	TA	TA	Gd	d	No	没有	ALQ	ALQ	216	216	Unf	Unf	540	540	756	756	GasA	煤气	Gd	d	Y	ÿ	SBrkr	SBrkr	961	961	756	756	1717	1717	1	1个	0	0	1	1个	0	0	3	3	1	1个	Gd	d	7	7	Typ	典型值	1	1个	Gd	d	Detchd	拉丝	1998.0	1998.0	Unf	Unf	3	3	642	642	TA	TA	TA	TA	Y	ÿ	0	0	35	35	272	272	NaN	N	NaN	N	NaN	N	2	2	2006	2006年	WD	西数	Abnorml	Abnorml	140000	140000
4	4	5	5	60	60	RL	RL	84.0	84.0	14260	14260	Pave	铺平	NaN	N	IR1	红外线1	Lvl	等级	AllPub	AllPub	FR2	FR2	Gtl	Gtl	NoRidge	诺里奇	Norm	规范	Norm	规范	1Fam	1名	2Story	2层	8	8	5	5	2000	2000	2000	2000	Gable	山墙	CompShg	补全	VinylSd	乙烯基	VinylSd	乙烯基	BrkFace	BrkFace	350.0	350.0	Gd	d	TA	TA	PConc	PConc	Gd	d	TA	TA	Av	平均值	GLQ	格力格	655	655	Unf	Unf	490	490	1145	1145	GasA	煤气	Ex	防爆	Y	ÿ	SBrkr	SBrkr	1145	1145	1053	1053	2198	2198	1	1个	0	0	2	2	1	1个	4	4	1	1个	Gd	d	9	9	Typ	典型值	1	1个	TA	TA	Attchd	Attchd	2000.0	2000.0	RFn	射频	3	3	836	836	TA	TA	TA	TA	Y	ÿ	192	192	84	84	0	0	NaN	N	NaN	N	NaN	N	12	12	2008	2008年	WD	西数	Normal	正常	250000	250000

We should have the data dictionary available in our folder for the competition. You can also find it here.

我们应该在比赛文件夹中找到可用的data dictionary 。您也可以在这里找到它。

Here’s a brief version of what you’ll find in the data description file:

这是在数据描述文件中可以找到的简短版本：

SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass – The building class
MSZoning – The general zoning classification
LotFrontage – Linear feet of street connected to property
LotArea – Lot size in square feet
Street – Type of road access
Alley – Type of alley access
LotShape – General shape of property
LandContour – Flatness of the property
Utilities – Type of utilities available
LotConfig – Lot configuration

SalePrice –物业的销售价格（美元）。这是您要预测的目标变量。
MSSubClass –建筑类
MSZoning –常规分区分类
LotFrontage –连接到物业的街道的线性英尺
LotArea –平方英尺面积
Street –道路通行类型
Alley –胡同通道的类型
LotShape –属性的一般形状
LandContour – LandContour平整度
Utilities –可用的实用Utilities类型
LotConfig –批次配置

And so on.

等等。

The competition challenges you to predict the final price of each home. At this point, we should start to think about what we know about housing prices, Ames, Iowa, and what we might expect to see in this dataset.

竞争使您难以预测每个房屋的最终价格。在这一点上，我们应该开始考虑对房价，艾姆斯，爱荷华州的了解，以及我们可能希望在此数据集中看到的内容。

Looking at the data, we see features we expected, like YrSold (the year the home was last sold) and SalePrice. Others we might not have anticipated, such as LandSlope (the slope of the land the home is built upon) and RoofMatl (the materials used to construct the roof). Later, we’ll have to make decisions about how we’ll approach these and other features.

查看数据，我们可以看到预期的功能，例如YrSold （房屋的最后出售年）和SalePrice 。其他我们可能未曾预料到的，例如LandSlope （房屋所建土地的坡度）和RoofMatl （用于建造屋顶的材料）。稍后，我们将不得不决定如何使用这些功能和其他功能。

We want to do some plotting during the exploration stage of our project, and we’ll need to import that functionality into our environment as well. Plotting allows us to visualize the distribution of the data, check for outliers, and see other patterns that we might miss otherwise. We’ll use Matplotlib, a popular visualization library.

我们希望在项目的探索阶段进行一些绘制，并且还需要将该功能导入到我们的环境中。绘图使我们能够可视化数据的分布，检查异常值，并查看否则可能会错过的其他模式。我们将使用流行的可视化库Matplotlib 。

import import matplotlib.pyplot matplotlib.pyplot as as plt
plt
pltplt .. stylestyle .. useuse (( stylestyle == 'ggplot''ggplot' )
)
pltplt .. rcParamsrcParams [[ 'figure.figsize''figure.figsize' ] ] = = (( 1010 , , 66 )
)

第2步：探索数据并设计功能 (Step 2: Explore the data and engineer Features)

The challenge is to predict the final sale price of the homes. This information is stored in the SalePrice column. The value we are trying to predict is often called the target variable.

挑战在于预测房屋的最终售价。此信息存储在SalePrice列中。我们试图预测的值通常称为目标变量 。

We can use Series.describe() to get more information.

我们可以使用Series.describe()获得更多信息。


count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

Series.describe() gives you more information about any series. count displays the total number of rows in the series. For numerical data, Series.describe() also gives the mean, std, min and max values as well.

Series.describe()为您提供有关任何系列的更多信息。 count显示系列中的总行数。对于数值数据， Series.describe()还提供mean ， std ， min和max 。

The average sale price of a house in our dataset is close to $180,000, with most of the values falling within the $130,000 to $215,000 range.

在我们的数据集中，房屋的平均售价接近$180,000 ，其中大部分价格在$130,000至$215,000之间。

Next, we’ll check for skewness, which is a measure of the shape of the distribution of values.

接下来，我们将检查偏斜度，这是对值分布形状的度量。

When performing regression, sometimes it makes sense to log-transform the target variable when it is skewed. One reason for this is to improve the linearity of the data. Although the justification is beyond the scope of this tutorial, more information can be found here.

在执行回归时，有时在倾斜目标变量时对数转换是有意义的。其原因之一是为了改善数据的线性度。尽管合理性超出了本教程的范围，但可以在此处找到更多信息。

Importantly, the predictions generated by the final model will also be log-transformed, so we’ll need to convert these predictions back to their original form later.

重要的是，最终模型生成的预测也将被对数转换，因此我们稍后需要将这些预测转换回其原始形式。

np.log() will transform the variable, and np.exp() will reverse the transformation.

np.log()将转换变量，而np.exp()将反转转换。

We use plt.hist() to plot a histogram of SalePrice. Notice that the distribution has a longer tail on the right. The distribution is positively skewed.

我们使用plt.hist()绘制SalePrice的直方图。请注意，该分布在右侧具有较长的尾巴。分布正偏斜。

print print (( "Skew is:""Skew is:" , , traintrain .. SalePriceSalePrice .. skewskew ())
())
pltplt .. histhist (( traintrain .. SalePriceSalePrice , , colorcolor == 'blue''blue' )
)
pltplt .. showshow ()
()


Skew is: 1.88287575977

Now we use np.log() to transform train.SalePric and calculate the skewness a second time, as well as re-plot the data. A value closer to 0 means that we have improved the skewness of the data. We can see visually that the data will more resembles a normal distribution.

现在，我们使用np.log()转换train.SalePric并train.SalePric计算偏斜度，然后重新绘制数据。接近0的值表示我们已经改善了数据的偏度。我们可以从视觉上看到数据将更类似于正态分布。


Skew is: 0.121335062205

Now that we’ve transformed the target variable, let’s consider our features. First, we’ll check out the numerical features and make some plots. The .select_dtypes() method will return a subset of columns matching the specified data types.

现在我们已经转换了目标变量，让我们考虑一下我们的功能。首先，我们将检查数字特征并绘制一些图。 .select_dtypes()方法将返回与指定数据类型匹配的列的子集。

使用数字功能 (Working with Numeric Features)

numeric_features numeric_features = = traintrain .. select_dtypesselect_dtypes (( includeinclude == [[ npnp .. numbernumber ])
])
numeric_featuresnumeric_features .. dtypes
dtypes


Id                 int64
MSSubClass         int64
LotFrontage      float64
LotArea            int64
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
MasVnrArea       float64
BsmtFinSF1         int64
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
TotRmsAbvGrd       int64
Fireplaces         int64
GarageYrBlt      float64
GarageCars         int64
GarageArea         int64
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
MiscVal            int64
MoSold             int64
YrSold             int64
SalePrice          int64
dtype: object

The DataFrame.corr() method displays the correlation (or relationship) between the columns. We’ll examine the correlations between the features and the target.

DataFrame.corr()方法显示列之间的相关性（或关系）。我们将研究特征与目标之间的相关性。


SalePrice      1.000000
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
Name: SalePrice, dtype: float64 

YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64

The first five features are the most positively correlated with SalePrice, while the next five are the most negatively correlated.

前五个功能与SalePrice 正相关性 SalePrice ，而后五个功能与负价格相关性最高。

Let’s dig deeper on OverallQual. We can use the .unique() method to get the unique values.

让我们深入了解OverallQual 。我们可以使用.unique()方法获取唯一值。

traintrain .. OverallQualOverallQual .. uniqueunique ()
()


array([ 7,  6,  8,  5,  9,  4, 10,  3,  1,  2])

The OverallQual data are integer values in the interval 1 to 10 inclusive.

OverallQual数据是OverallQual 1到10之间的整数值。

We can create a pivot table to further investigate the relationship between OverallQual and SalePrice. The Pandas docs demonstrate how to accomplish this task. We set index='OverallQual' and values='SalePrice' . We chose to look at the median here.

我们可以创建一个数据透视表来进一步调查OverallQual和SalePrice之间的关系。熊猫文档演示了如何完成此任务。我们设置index='OverallQual'和values='SalePrice' 。我们选择在这里查看中median 。

quality_pivot
quality_pivot


OverallQual
1      50150
2      60000
3      86250
4     108000
5     133000
6     160000
7     200141
8     269750
9     345000
10    432390
Name: SalePrice, dtype: int64

To help us visualize this pivot table more easily, we can create a bar plot using the Series.plot() method.

为了帮助我们更轻松地可视化此数据透视表，我们可以使用Series.plot()方法创建条形图。

Notice that the median sales price strictly increases as Overall Quality increases.

请注意，中位数销售价格随整体质量的提高而严格提高。

Next, let’s use plt.scatter() to generate some scatter plots and visualize the relationship between the Ground Living Area GrLivArea and SalePrice.

接下来，让我们使用plt.scatter()生成一些散点图，并可视化地面居住区GrLivArea和SalePrice之间的关系。

pltplt .. scatterscatter (( xx == traintrain [[ 'GrLivArea''GrLivArea' ], ], yy == targettarget )
)
pltplt .. ylabelylabel (( 'Sale Price''Sale Price' )
)
pltplt .. xlabelxlabel (( 'Above grade (ground) living area square feet''Above grade (ground) living area square feet' )
)
pltplt .. showshow ()
()

At first glance, we see that increases in living area correspond to increases in price. We will do the same for GarageArea.

乍一看，我们看到居住面积的增加与价格的增加相对应。我们将对GarageArea做同样的GarageArea 。

Notice that there are many homes with 0 for Garage Area, indicating that they don’t have a garage. We’ll transform other features later to reflect this assumption. There are a few outliers as well. Outliers can affect a regression model by pulling our estimated regression line further away from the true population regression line. So, we’ll remove those observations from our data. Removing outliers is an art and a science. There are many techniques for dealing with outliers.

请注意，对于Garage Area ，有许多房屋的0表示没有车库。稍后我们将转换其他功能以反映此假设。也有一些异常值。离群值可以通过将我们的估计回归线拉离真实总体回归线进一步影响回归模型。因此，我们将从数据中删除这些观察值。消除异常值是一门艺术和一门科学。有许多处理异常值的技术。

We will create a new dataframe with some outliers removed.

我们将创建一个删除了一些异常值的新数据框。

train train = = traintrain [[ traintrain [[ 'GarageArea''GarageArea' ] ] < < 12001200 ]
]

Let’s take another look.

让我们再看一看。

处理空值 (Handling Null Values)

Next, we’ll examine the null or missing values.

接下来，我们将检查空值或缺失值。

We will create a DataFrame to view the top null columns. Chaining together the train.isnull().sum() methods, we return a Series of the counts of the null values in each column.

我们将创建一个DataFrame来查看顶部的空列。将train.isnull().sum()方法链接在一起，我们在每一列中返回一系列的空值计数。

nulls nulls = = pdpd .. DataFrameDataFrame (( traintrain .. isnullisnull ()() .. sumsum ()() .. sort_valuessort_values (( ascendingascending == FalseFalse )[:)[: 2525 ])
])
nullsnulls .. columns columns = = [[ 'Null Count''Null Count' ]
]
nullsnulls .. indexindex .. name name = = 'Feature'
'Feature'
nulls
nulls

		Null Count	空计数
Feature	特征
PoolQC	池质量控制	1449	1449
MiscFeature	其他功能	1402	1402
Alley	胡同	1364	1364
Fence	围栏	1174	1174
FireplaceQu	壁炉Qu	689	689
LotFrontage	LotFrontage	258	258
GarageCond	车库康德	81	81
GarageType	车库类型	81	81
GarageYrBlt	车库	81	81
GarageFinish	车库完成	81	81
GarageQual	车库质量	81	81
BsmtExposure	BsmtExposure	38	38
BsmtFinType2	BsmtFinType2	38	38
BsmtFinType1	BsmtFinType1	37	37
BsmtCond	BsmtCond	37	37
BsmtQual	质量标准	37	37
MasVnrArea	MasVnrArea	8	8
MasVnrType	MasVnrType	8	8
Electrical	电的	1	1个
Utilities	实用工具	0	0
YearRemodAdd	YearRemodAdd	0	0
MSSubClass	MSSubClass	0	0
Foundation	基础	0	0
ExterCond	ExterCond	0	0
ExterQual	扩展质量	0	0

The documentation can help us understand the missing values. In the case of PoolQC, the column refers to Pool Quality. Pool quality is NaN when PoolArea is 0, or there is no pool. We can find a similar relationship between many of the Garage-related columns.

该文档可以帮助我们了解缺失的值。对于PoolQC ，此列指的是“池质量”。当PoolArea为0或没有池时，池质量为NaN 。我们可以在许多与车库相关的列之间找到相似的关系。

Let’s take a look at one of the other columns, MiscFeature. We’ll use the Series.unique() method to return a list of the unique values.

让我们看一下其他列之一MiscFeature 。我们将使用Series.unique()方法返回唯一值的列表。


Unique values are: [nan 'Shed' 'Gar2' 'Othr' 'TenC']

We can use the documentation to find out what these values indicate:

我们可以使用文档来找出这些值表示什么：

MiscFeature: Miscellaneous feature not covered in other categories

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None

MiscFeature: Miscellaneous feature not covered in other categories

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None

These values describe whether or not the house has a shed over 100 sqft, a second garage, and so on. We might want to use this information later. It’s important to gather domain knowledge in order to make the best decisions when dealing with missing data.

这些值描述房屋是否有超过100平方英尺的棚屋，第二个车库等。我们稍后可能希望使用此信息。重要的是收集领域知识，以便在处理丢失的数据时做出最佳决策。

争夺非数字特征 (Wrangling the non-numeric Features)

Let’s now consider the non-numeric features.

现在让我们考虑非数字功能。

		MSZoning	分区	Street	街	Alley	胡同	LotShape	LotShape	LandContour	陆地轮廓	Utilities	实用工具	LotConfig	LotConfig	LandSlope	土地坡度	Neighborhood	邻里	Condition1	条件1	Condition2	条件2	BldgType	BldgType	HouseStyle	家庭风格	RoofStyle	屋顶风格	RoofMatl	屋顶材料	Exterior1st	外观1	Exterior2nd	外观第二	MasVnrType	MasVnrType	ExterQual	扩展质量	ExterCond	ExterCond	Foundation	基础	BsmtQual	质量标准	BsmtCond	BsmtCond	BsmtExposure	BsmtExposure	BsmtFinType1	BsmtFinType1	BsmtFinType2	BsmtFinType2	Heating	加热	HeatingQC	加热QC	CentralAir	中央航空	Electrical	电的	KitchenQual	厨房质量	Functional	功能性	FireplaceQu	壁炉Qu	GarageType	车库类型	GarageFinish	车库完成	GarageQual	车库质量	GarageCond	车库康德	PavedDrive	铺路	PoolQC	池质量控制	Fence	围栏	MiscFeature	其他功能	SaleType	销售类型	SaleCondition	销售条件
count	计数	1455	1455	1455	1455	91	91	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1455	1447	1447	1455	1455	1455	1455	1455	1455	1418	1418	1418	1418	1417	1417	1418	1418	1417	1417	1455	1455	1455	1455	1455	1455	1454	1454	1455	1455	1455	1455	766	766	1374	1374	1374	1374	1374	1374	1374	1374	1455	1455	6	6	281	281	53	53	1455	1455	1455	1455
unique	独特	5	5	2	2	2	2	4	4	4	4	2	2	5	5	3	3	25	25	9	9	8	8	5	5	8	8	6	6	7	7	15	15	16	16	4	4	4	4	5	5	6	6	4	4	4	4	4	4	6	6	6	6	6	6	5	5	2	2	5	5	4	4	7	7	5	5	6	6	3	3	5	5	5	5	3	3	3	3	4	4	4	4	9	9	6	6
top	最佳	RL	RL	Pave	铺平	Grvl	v	Reg	注册	Lvl	等级	AllPub	AllPub	Inside	内	Gtl	Gtl	NAmes	奈姆斯	Norm	规范	Norm	规范	1Fam	1名	1Story	1层	Gable	山墙	CompShg	补全	VinylSd	乙烯基	VinylSd	乙烯基	None	没有	TA	TA	TA	TA	PConc	PConc	TA	TA	TA	TA	No	没有	Unf	Unf	Unf	Unf	GasA	煤气	Ex	防爆	Y	ÿ	SBrkr	SBrkr	TA	TA	Typ	典型值	Gd	d	Attchd	Attchd	Unf	Unf	TA	TA	TA	TA	Y	ÿ	Ex	防爆	MnPrv	锰	Shed	棚	WD	西数	Normal	正常
freq	频率	1147	1147	1450	1450	50	50	921	921	1309	1309	1454	1454	1048	1048	1378	1378	225	225	1257	1257	1441	1441	1216	1216	722	722	1139	1139	1430	1430	514	514	503	503	863	863	905	905	1278	1278	644	644	647	647	1306	1306	951	951	428	428	1251	1251	1423	1423	737	737	1360	1360	1329	1329	733	733	1355	1355	377	377	867	867	605	605	1306	1306	1321	1321	1335	1335	2	2	157	157	48	48	1266	1266	1196	1196

The count column indicates the count of non-null observations, while unique counts the number of unique values. top is the most commonly occurring value, with the frequency of the top value shown by freq.

count列指示非空观测值的计数，而unique计数则计数唯一值的数量。 top是最常见的值，其最高freq由freq 。

For many of these features, we might want to use one-hot encoding to make use of the information for modeling. One-hot encoding is a technique which will transform categorical data into numbers so the model can understand whether or not a particular observation falls into one category or another.

对于这些功能中的许多功能，我们可能希望使用一键编码来利用信息进行建模。一次热编码是一种将分类数据转换为数字的技术，以便模型可以了解特定观察是否属于一个类别或另一个类别。

转换和工程特征 (Transforming and engineering features)

When transforming features, it’s important to remember that any transformations that you’ve applied to the training data before fitting the model must be applied to the test data.

转换要素时，请记住，在拟合模型之前对训练数据进行的任何转换都必须应用于测试数据，这一点很重要。

Our model expects that the shape of the features from the train set match those from the test set. This means that any feature engineering that occurred while working on the train data should be applied again on the test set.

我们的模型期望train集中的特征形状与test集中的特征形状匹配。这意味着在处理train数据时发生的任何特征工程都应再次应用于test集。

To demonstrate how this works, consider the Street data, which indicates whether there is Gravel or Paved road access to the property.

为了演示其工作原理，请考虑“ Street数据，该数据指示是否有通往该物业的Gravel路或Paved道路。

print print (( "Original: "Original:  nn "" ) 
) 
print print (( traintrain .. StreetStreet .. value_countsvalue_counts (), (), "" nn "" )
)


Original: 

Pave    1450
Grvl       5
Name: Street, dtype: int64

In the Street column, the unique values are Pave and Grvl, which describe the type of road access to the property. In the training set, only 5 homes have gravel access. Our model needs numerical data, so we will use one-hot encoding to transform the data into a Boolean column.

在“ Street列中，唯一值是Pave和Grvl ，它们描述了通往该物业的道路类型。在训练集中，只有5座房屋可以使用砾石。我们的模型需要数值数据，因此我们将使用一键编码将数据转换为布尔列。

We create a new column called enc_street. The pd.get_dummies() method will handle this for us.

我们创建一个名为enc_street的新列。 pd.get_dummies()方法将为我们处理此问题。

As mentioned earlier, we need to do this on both the train and test data.

如前所述，我们需要在train和test数据上都执行此操作。

print print (( 'Encoded: 'Encoded:  nn '' ) 
) 
print print (( traintrain .. enc_streetenc_street .. value_countsvalue_counts ())
())


Encoded: 

1    1450
0       5
Name: enc_street, dtype: int64

The values agree. We’ve engineered our first feature! Feature Engineering is the process of making features of the data suitable for use in machine learning and modelling. When we encoded the Street feature into a column of Boolean values, we engineered a feature.

值一致。我们设计了第一个功能！特征工程是使数据的特征适合用于机器学习和建模的过程。将Street要素编码为一列布尔值时，我们设计了一个要素。

Let’s try engineering another feature. We’ll look at SaleCondition by constructing and plotting a pivot table, as we did above for OverallQual.

让我们尝试设计另一个功能。我们来看看SaleCondition通过构建和绘制透视表，象我们上面那样的OverallQual 。

Notice that Partial has a significantly higher Median Sale Price than the others. We will encode this as a new feature. We select all of the houses where SaleCondition is equal to Patrial and assign the value 1, otherwise assign 0.

请注意， Partial中位数销售价格明显高于其他产品。我们将其编码为一项新功能。我们选择SaleCondition等于Patrial所有房屋，并分配值1 ，否则分配0 。

Follow a similar method that we used for Street above.

遵循我们上面用于Street的类似方法。

def def encodeencode (( xx ): ): return return 1 1 if if x x == == 'Partial' 'Partial' else else 0
0
traintrain [[ 'enc_condition''enc_condition' ] ] = = traintrain .. SaleConditionSaleCondition .. applyapply (( encodeencode )
)
testtest [[ 'enc_condition''enc_condition' ] ] = = testtest .. SaleConditionSaleCondition .. applyapply (( encodeencode )
)

Let’s explore this new feature as a plot.

让我们以图表的形式探索这个新功能。

This looks great. You can continue to work with more features to improve the ultimate performance of your model.

这看起来很棒。您可以继续使用更多功能来改善模型的最终性能。

Before we prepare the data for modeling, we need to deal with the missing data. We’ll the missing values with an average value and then assign the results to data. This is a method of interpolation. The DataFrame.interpolate() method makes this simple.

在准备用于建模的数据之前，我们需要处理丢失的数据。我们将平均值作为缺失值，然后将结果分配给data 。这是一种插值方法。 DataFrame.interpolate()方法使这一过程变得简单。

This is a quick and simple method of dealing with missing values, and might not lead to the best performance of the model on new data. Handling missing values is an important part of the modeling process, where creativity and insight can make a big difference. This is another area where you can extend on this tutorial.

这是一种处理缺失值的快速简便的方法，并且可能不会导致模型在新数据上的最佳性能。处理缺失值是建模过程的重要组成部分，在此过程中，创造力和洞察力可以发挥很大作用。这是您可以扩展本教程的另一个领域。

data data = = traintrain .. select_dtypesselect_dtypes (( includeinclude == [[ npnp .. numbernumber ])]) .. interpolateinterpolate ()() .. dropnadropna () 
()

Check if the all of the columns have 0 null values.

检查所有列是否都有0个空值。

步骤3：建立线性模型 (Step 3 : Build a linear model)

Let’s perform the final steps to prepare our data for modeling. We’ll separate the features and the target variable for modeling. We will assign the features to X and the target variable to y. We use np.log() as explained above to transform the y variable for the model. data.drop([features], axis=1) tells pandas which columns we want to exclude. We won’t include SalePrice for obvious reasons, and Id is just an index with no relationship to SalePrice.

让我们执行最后的步骤，以准备用于建模的数据。我们将分离特征和目标变量以进行建模。我们将特征分配给X ，目标变量分配给y 。如上所述，我们使用np.log()转换模型的y变量。 data.drop([features], axis=1)告诉熊猫我们要排除的列。出于显而易见的原因，我们将不包含SalePrice ，并且Id只是与SalePrice的索引。

y y = = npnp .. loglog (( traintrain .. SalePriceSalePrice )
)
X X = = datadata .. dropdrop ([([ 'SalePrice''SalePrice' , , 'Id''Id' ], ], axisaxis == 11 )
)

Let’s partition the data and start modeling. We will use the train_test_split() function from scikit-learn to create a training set and a hold-out set. Partitioning the data in this way allows us to evaluate how our model might perform on data that it has never seen before. If we train the model on all of the test data, it will be difficult to tell if overfitting has taken place.

让我们对数据进行分区并开始建模。我们将使用scikit-learn中的train_test_split()函数来创建训练集和保持集。通过这种方式对数据进行分区，使我们能够评估模型如何处理从未见过的数据。如果我们在所有测试数据上训练模型，将很难判断是否发生了过度拟合。

train_test_split() returns four objects:

train_test_split()返回四个对象：

X_train is the subset of our features used for training.
X_test is the subset which will be our ‘hold-out’ set – what we’ll use to test the model.
y_train is the target variable SalePrice which corresponds to X_train.
y_test is the target variable SalePrice which corresponds to X_test.

X_train是我们用于训练的功能的子集。
X_test是将成为我们的“保留”集的子集-我们将使用它来测试模型。
y_train是目标变量SalePrice ，它对应于X_train 。
y_test是目标变量SalePrice ，它对应于X_test 。

The first parameter value X denotes the set of predictor data, and y is the target variable. Next, we set random_state=42. This provides for reproducible results, since sci-kit learn’s train_test_split will randomly partition the data. The test_size parameter tells the function what proportion of the data should be in the test partition. In this example, about 33% of the data is devoted to the hold-out set.

第一参数值X表示预测变量数据集，而y是目标变量。接下来，我们设置random_state=42 。由于sci-kit Learn的train_test_split将随机划分数据，因此可提供可重复的结果。 test_size参数告诉函数test分区中应有多少比例的数据。在此示例中，大约33％的数据专用于保留集。

开始建模 (Begin modelling)

We will first create a Linear Regression model. First, we instantiate the model.

我们将首先创建一个线性回归模型。首先，我们实例化模型。

from from sklearn sklearn import import linear_model
linear_model
lr lr = = linear_modellinear_model .. LinearRegressionLinearRegression ()
()

Next, we need to fit the model. First instantiate the model and next fit the model. Model fitting is a procedure that varies for different types of models. Put simply, we are estimating the relationship between our predictors and the target variable so we can make accurate predictions on new data.

接下来，我们需要拟合模型。首先实例化模型，然后拟合模型。模型拟合是针对不同类型的模型而变化的过程。简而言之，我们正在估计预测变量与目标变量之间的关系，以便可以对新数据进行准确的预测。

We fit the model using X_train and y_train, and we’ll score with X_test and y_test. The lr.fit() method will fit the linear regression on the features and target variable that we pass.

我们适合使用模型X_train和y_train ，我们会用进球X_test和y_test 。 lr.fit()方法将使线性回归适合我们传递的特征和目标变量。

评估效果并可视化结果 (Evaluate the performance and visualize results)

Now, we want to evaluate the performance of the model. Each competition might evaluate the submissions differently. In this competition, Kaggle will evaluate our submission using root-mean-squared-error (RMSE). We’ll also look at The r-squared value. The r-squared value is a measure of how close the data are to the fitted regression line. It takes a value between 0 and 1, 1 meaning that all of the variance in the target is explained by the data. In general, a higher r-squared value means a better fit.

现在，我们要评估模型的性能。每次比赛对提交的内容可能会有不同的评价。在这项比赛中，Kaggle将使用均方根误差（RMSE）评估我们的提交。我们还将查看r平方值。 r平方值是数据与拟合回归线的接近程度的度量。它取0到1之间的值，即1表示目标中的所有方差都由数据解释。通常，较高的r平方值表示较好的拟合度。

The model.score() method returns the r-squared value by default.

默认情况下， model.score()方法返回r平方值。

print print (( "R^2 is: "R^2 is:  nn "" , , modelmodel .. scorescore (( X_testX_test , , y_testy_test ))
))


R^2 is: 
 0.888247770926

This means that our features explain approximately 89% of the variance in our target variable. Follow the link above to learn more.

这意味着我们的功能可以解释目标变量中大约89％的方差。请点击上面的链接以了解更多信息。

Next, we’ll consider rmse. To do so, use the model we have built to make predictions on the test data set.

接下来，我们将考虑rmse 。为此，请使用我们建立的模型对测试数据集进行预测。

The model.predict() method will return a list of predictions given a set of predictors. Use model.predict() after fitting the model.

在给定一组预测变量的情况下， model.predict()方法将返回一系列预测。拟合模型后使用model.predict() 。

The mean_squared_error function takes two arrays and calculates the rmse.

mean_squared_error函数采用两个数组并计算rmse 。

from from sklearn.metrics sklearn.metrics import import mean_squared_error
mean_squared_error
print print (( 'RMSE is: 'RMSE is:  nn '' , , mean_squared_errormean_squared_error (( y_testy_test , , predictionspredictions ))
))


RMSE is: 
 0.0178417945196

Interpreting this value is somewhat more intuitive that the r-squared value. The RMSE measures the distance between our predicted values and actual values.

解释该值比r平方值更直观。 RMSE衡量我们的预测值和实际值之间的距离。

We can view this relationship graphically with a scatter plot.

我们可以使用散点图以图形方式查看此关系。

If our predicted values were identical to the actual values, this graph would be the straight line y=x because each predicted value x would be equal to each actual value y.

如果我们的预测值与实际值相同，则此图将为直线y=x因为每个预测值x都将等于每个实际值y 。

尝试改善模型 (Try to improve the model)

We’ll next try using Ridge Regularization to decrease the influence of less important features. Ridge Regularization is a process which shrinks the regression coefficients of less important features.

接下来，我们将尝试使用Ridge正则化来减少次要特征的影响。 Ridge正则化是缩小次要特征的回归系数的过程。

We’ll once again instantiate the model. The Ridge Regularization model takes a parameter, alpha , which controls the strength of the regularization.

我们将再次实例化该模型。 Ridge正则化模型采用参数alpha ，该参数控制正则化的强度。

We’ll experiment by looping through a few different values of alpha, and see how this changes our results.

我们将通过循环遍历几个不同的alpha值进行实验，并观察其如何改变结果。

for for i i in in range range (( -- 22 , , 33 ):
    ):
    alpha alpha = = 1010 **** i
    i
    rm rm = = linear_modellinear_model .. RidgeRidge (( alphaalpha == alphaalpha )
    )
    ridge_model ridge_model = = rmrm .. fitfit (( X_trainX_train , , y_trainy_train )
    )
    preds_ridge preds_ridge = = ridge_modelridge_model .. predictpredict (( X_testX_test )
    
    )
    
    pltplt .. scatterscatter (( preds_ridgepreds_ridge , , actual_valuesactual_values , , alphaalpha =.=. 7575 , , colorcolor == 'b''b' )
    )
    pltplt .. xlabelxlabel (( 'Predicted Price''Predicted Price' )
    )
    pltplt .. ylabelylabel (( 'Actual Price''Actual Price' )
    )
    pltplt .. titletitle (( 'Ridge Regularization with alpha = 'Ridge Regularization with alpha =  {}{} '' .. formatformat (( alphaalpha ))
    ))
    overlay overlay = = 'R^2 is: 'R^2 is:  {}{} nn RMSE is: RMSE is:  {}{} '' .. formatformat (
                    (
                    ridge_modelridge_model .. scorescore (( X_testX_test , , y_testy_test ),
                    ),
                    mean_squared_errormean_squared_error (( y_testy_test , , preds_ridgepreds_ridge ))
    ))
    pltplt .. annotateannotate (( ss == overlayoverlay ,, xyxy == (( 12.112.1 ,, 10.610.6 ),), sizesize == 'x-large''x-large' )
    )
    pltplt .. showshow ()
()

These models perform almost identically to the first model. In our case, adjusting the alpha did not substantially improve our model. As you add more features, regularization can be helpful. Repeat this step after you’ve added more features.

这些模型的性能几乎与第一个模型相同。在我们的案例中，调整Alpha并不能大大改善我们的模型。随着您添加更多功能，正则化可能会有所帮助。添加更多功能后，请重复此步骤。

步骤4：提交 (Step 4: Make a submission)

We’ll need to create a csv that contains the predicted SalePrice for each observation in the test.csv dataset.

我们需要创建一个csv ，其中包含test.csv数据集中每个观察值的预测SalePrice 。

We’ll log in to our Kaggle account and go to the submission page to make a submission. We will use the DataFrame.to_csv() to create a csv to submit. The first column must the contain the ID from the test data.

我们将登录到我们的Kaggle帐户，然后转到提交页面进行提交。我们将使用DataFrame.to_csv()创建要提交的csv。第一列必须包含测试数据中的ID。

Now, select the features from the test data for the model as we did above.

现在，像上面一样，从模型的测试数据中选择特征。

feats feats = = testtest .. select_dtypesselect_dtypes (
        (
        includeinclude == [[ npnp .. numbernumber ])]) .. dropdrop ([([ 'Id''Id' ], ], axisaxis == 11 )) .. interpolateinterpolate ()
()

Next, we generate our predictions.

接下来，我们生成我们的预测。

Now we’ll transform the predictions to the correct form. Remember that to reverse log() we do exp(). So we will apply np.exp() to our predictions becasuse we have taken the logarithm previously.

现在，我们将预测转换为正确的形式。请记住，要反转log()我们要做exp() 。因此我们将np.exp()应用于我们的预测，因为我们之前已经采用了对数。

final_predictions final_predictions = = npnp .. expexp (( predictionspredictions )
)

Look at the difference.

看区别。


Original predictions are: 
 [ 11.76725362  11.71929504  12.07656074  12.20632678  12.11217655] 

Final predictions are: 
 [ 128959.49172586  122920.74024358  175704.82598102  200050.83263756
  182075.46986405]

Lets assign these predictions and check that everything looks good.

让我们分配这些预测并检查一切看起来是否良好。

submissionsubmission [[ 'SalePrice''SalePrice' ] ] = = final_predictions
final_predictions
submissionsubmission .. headhead ()
()

		Id	ID	SalePrice	销售价格
0	0	1461	1461	128959.491726	128959.491726
1	1个	1462	1462	122920.740244	122920.740244
2	2	1463	1463	175704.825981	175704.825981
3	3	1464	1464	200050.832638	200050.832638
4	4	1465	1465	182075.469864	182075.469864

One we’re confident that we’ve got the data arranged in the proper format, we can export to a .csv file as Kaggle expects. We pass index=False because Pandas otherwise would create a new index for us.

我们有信心以正确的格式安排数据，我们可以按照Kaggle的期望将其导出到.csv file 。我们传递index=False因为熊猫会为我们创建一个新索引。

提交结果 (Submit our results)

We’ve created a file called submission1.csv in our working directory that conforms to the correct format. Go to the submission page to make a submission.

我们已经创建了一个名为submission1.csv在我们的工作目录符合正确的格式。转到提交页面进行提交。

Our Submission!

我们的意见书！

We placed 1602 out of about 2400 competitors. Almost middle of the pack, not bad! Notice that our score here is .15097, which is better than the score we observed on the test data. That’s a good result, but will not always be the case.

我们在大约2400个竞争对手中排名1602。几乎是背包的中间，还不错！请注意，这里的分数是.15097 ，比我们在测试数据上观察到的分数要好。这是一个很好的结果，但并非总是如此。

下一步 (Next steps)

You can extend this tutorial and improve your results by:

您可以通过以下方法扩展本教程并改善结果：

Working with and transforming other features in the training set
Experimenting with different modeling techniques, such as Random Forest Regressors or Gradient Boosting
Using ensembling models

使用并转换训练集中的其他功能
实验不同的建模技术，例如随机森林回归器或梯度提升
使用集成模型

We created a set of categorical features called categoricals that were not all included in the final model. Go back and try to include these features. There are other methods that might help with categorical data, notably the pd.get_dummies() method. After working on these features, repeat the transformations for the test data and make another submission.

我们创建了一套名为类别特征categoricals了未全部包含在最终的模型。返回并尝试包含这些功能。还有其他可能有助于分类数据的方法，尤其是pd.get_dummies()方法。使用完这些功能后，重复测试数据的转换并再次提交。

Working on models and participating in Kaggle competitions can be an iterative process – it’s important to experiment with new ideas, learn about the data, and test newer models and techniques.

制作模型和参加Kaggle比赛可能是一个反复的过程-尝试新想法，了解数据并测试更新的模型和技术非常重要。

翻译自: https://www.pybloggers.com/2017/05/getting-started-with-kaggle-house-prices-competition/

kaggle 入门

你可能感兴趣的:(可视化,大数据,python,机器学习,人工智能)

机器学习与深度学习间关系与区别 ℒℴѵℯ心·动ꦿ໊ོ꫞ 人工智能学习深度学习 python
一、机器学习概述定义机器学习（MachineLearning,ML）是一种通过数据驱动的方法，利用统计学和计算算法来训练模型，使计算机能够从数据中学习并自动进行预测或决策。机器学习通过分析大量数据样本，识别其中的模式和规律，从而对新的数据进行判断。其核心在于通过训练过程，让模型不断优化和提升其预测准确性。主要类型1.监督学习（SupervisedLearning）监督学习是指在训练数据集中包含输入
理解Gunicorn：Python WSGI服务器的基石范范0825 ipython linux 运维
理解Gunicorn：PythonWSGI服务器的基石介绍Gunicorn，全称GreenUnicorn，是一个为PythonWSGI（WebServerGatewayInterface）应用设计的高效、轻量级HTTP服务器。作为PythonWeb应用部署的常用工具，Gunicorn以其高性能和易用性著称。本文将介绍Gunicorn的基本概念、安装和配置，帮助初学者快速上手。1.什么是Gunico
Python数据分析与可视化实战指南 William数据分析 python python 数据
在数据驱动的时代，Python因其简洁的语法、强大的库生态系统以及活跃的社区，成为了数据分析与可视化的首选语言。本文将通过一个详细的案例，带领大家学习如何使用Python进行数据分析，并通过可视化来直观呈现分析结果。一、环境准备1.1安装必要库在开始数据分析和可视化之前，我们需要安装一些常用的库。主要包括pandas、numpy、matplotlib和seaborn等。这些库分别用于数据处理、数学
python os.environ 江湖偌大 python 深度学习
os.environ['TF_CPP_MIN_LOG_LEVEL']='0'#默认值，输出所有信息os.environ['TF_CPP_MIN_LOG_LEVEL']='1'#屏蔽通知信息（INFO）os.environ['TF_CPP_MIN_LOG_LEVEL']='2'#屏蔽通知信息和警告信息（INFO\WARNING）os.environ['TF_CPP_MIN_LOG_LEVEL']='
Python中os.environ基本介绍及使用方法鹤冲天Pro #Python python 服务器开发语言
文章目录python中os.environos.environ简介os.environ进行环境变量的增删改查python中os.environ的使用详解1.简介2.key字段详解2.1常见key字段3.os.environ.get()用法4.环境变量的增删改查和判断是否存在4.1新增环境变量4.2更新环境变量4.3获取环境变量4.4删除环境变量4.5判断环境变量是否存在python中os.envi
Pyecharts数据可视化大屏：打造沉浸式数据分析体验我的运维人生信息可视化数据分析数据挖掘运维开发技术共享
Pyecharts数据可视化大屏：打造沉浸式数据分析体验在当今这个数据驱动的时代，如何将海量数据以直观、生动的方式展现出来，成为了数据分析师和企业决策者关注的焦点。Pyecharts，作为一款基于Python的开源数据可视化库，凭借其丰富的图表类型、灵活的配置选项以及高度的定制化能力，成为了构建数据可视化大屏的理想选择。本文将深入探讨如何利用Pyecharts打造数据可视化大屏，并通过实际代码案例
Python教程：一文了解使用Python处理XPath 旦莫 Python进阶 python 开发语言
目录1.环境准备1.1安装lxml1.2验证安装2.XPath基础2.1什么是XPath？2.2XPath语法2.3示例XML文档3.使用lxml解析XML3.1解析XML文档3.2查看解析结果4.XPath查询4.1基本路径查询4.2使用属性查询4.3查询多个节点5.XPath的高级用法5.1使用逻辑运算符5.2使用函数6.实战案例6.1从网页抓取数据6.1.1安装Requests库6.1.2代
Google earth studio 简介陟彼高冈yu 旅游
GoogleEarthStudio是一个基于Web的动画工具，专为创作使用GoogleEarth数据的动画和视频而设计。它利用了GoogleEarth强大的三维地图和卫星影像数据库，使用户能够轻松地创建逼真的地球动画、航拍视频和动态地图可视化。网址为https://www.google.com/earth/studio/。GoogleEarthStudio是一个基于Web的动画工具，专为创作使用G
python os.environ_python os.environ 读取和设置环境变量 weixin_39605414 python os.environ
>>>importos>>>os.environ.keys()['LC_NUMERIC','GOPATH','GOROOT','GOBIN','LESSOPEN','SSH_CLIENT','LOGNAME','USER','HOME','LC_PAPER','PATH','DISPLAY','LANG','TERM','SHELL','J2REDIR','LC_MONETARY','QT_QPA
探索OpenAI和LangChain的适配器集成：轻松切换模型提供商 nseejrukjhad langchain easyui 前端 python
#探索OpenAI和LangChain的适配器集成：轻松切换模型提供商##引言在人工智能和自然语言处理的世界中，OpenAI的模型提供了强大的能力。然而，随着技术的发展，许多人开始探索其他模型以满足特定需求。LangChain作为一个强大的工具，集成了多种模型提供商，通过提供适配器，简化了不同模型之间的转换。本篇文章将介绍如何使用LangChain的适配器与OpenAI集成，以便轻松切换模型提供商
使用Faiss进行高效相似度搜索 llzwxh888 faiss python
在现代AI应用中，快速和高效的相似度搜索是至关重要的。Faiss（FacebookAISimilaritySearch）是一个专门用于快速相似度搜索和聚类的库，特别适用于高维向量。本文将介绍如何使用Faiss来进行相似度搜索，并结合Python代码演示其基本用法。什么是Faiss？Faiss是一个由FacebookAIResearch团队开发的开源库，主要用于高维向量的相似性搜索和聚类。Faiss
python是什么意思中文-在python中%是什么意思编程大乐趣
Python中%有两种：1、数值运算：%代表取模，返回除法的余数。如：>>>7%212、%操作符（字符串格式化，stringformatting），说明如下：%[(name)][flags][width].[precision]typecode(name)为命名flags可以有+，-，''或0。+表示右对齐。-表示左对齐。''为一个空格，表示在正数的左侧填充一个空格，从而与负数对齐。0表示使用0填
深入理解 MultiQueryRetriever：提升向量数据库检索效果的强大工具 nseejrukjhad 数据库 python
深入理解MultiQueryRetriever：提升向量数据库检索效果的强大工具引言在人工智能和自然语言处理领域，高效准确的信息检索一直是一个关键挑战。传统的基于距离的向量数据库检索方法虽然广泛应用，但仍存在一些局限性。本文将介绍一种创新的解决方案：MultiQueryRetriever，它通过自动生成多个查询视角来增强检索效果，提高结果的相关性和多样性。MultiQueryRetriever的工
Day1笔记-Python简介&标识符和关键字&输入输出 ~在杰难逃~ Python python 开发语言大数据数据分析数据挖掘
大家好，从今天开始呢，杰哥开展一个新的专栏，当然，数据分析部分也会不定时更新的，这个新的专栏主要是讲解一些Python的基础语法和知识，帮助0基础的小伙伴入门和学习Python，感兴趣的小伙伴可以开始认真学习啦！一、Python简介【了解】1.计算机工作原理编程语言就是用来定义计算机程序的形式语言。我们通过编程语言来编写程序代码，再通过语言处理程序执行向计算机发送指令，让计算机完成对应的工作，编程
python八股文面试题分享及解析(1) Shawn________ python
#1.'''a=1b=2不用中间变量交换a和b'''#1.a=1b=2a,b=b,aprint(a)print(b)结果：21#2.ll=[]foriinrange(3):ll.append({'num':i})print(11)结果:#[{'num':0},{'num':1},{'num':2}]#3.kk=[]a={'num':0}foriinrange(3):#0,12#可变类型，不仅仅改变
人工智能时代，程序员如何保持核心竞争力？ jmoych 人工智能
随着AIGC（如chatgpt、midjourney、claude等）大语言模型接二连三的涌现，AI辅助编程工具日益普及，程序员的工作方式正在发生深刻变革。有人担心AI可能取代部分编程工作，也有人认为AI是提高效率的得力助手。面对这一趋势,程序员应该如何应对?是专注于某个领域深耕细作，还是广泛学习以适应快速变化的技术环境?又或者，我们是否应该将重点转向AI无法轻易替代的软技能？让我们一起探讨程序员
每日算法&面试题，大厂特训二十八天——第二十天（树）肥学 ⚡算法题⚡面试题每日精进 java 算法数据结构
目录标题导读算法特训二十八天面试题点击直接资料领取导读肥友们为了更好的去帮助新同学适应算法和面试题，最近我们开始进行专项突击一步一步来。上一期我们完成了动态规划二十一天现在我们进行下一项对各类算法进行二十八天的一个小总结。还在等什么快来一起肥学进行二十八天挑战吧！！特别介绍小白练手专栏，适合刚入手的新人欢迎订阅编程小白进阶python有趣练手项目里面包括了像《机器人尬聊》《恶搞程序》这样的有趣文章
Python快速入门 —— 第三节：类与对象孤华暗香 Python快速入门 python 开发语言
第三节：类与对象目标：了解面向对象编程的基础概念，并学会如何定义类和创建对象。内容：类与对象：定义类：class关键字。类的构造函数：__init__()。类的属性和方法。对象的创建与使用。示例：classStudent:def__init__(self,name,age,major):self.name&#
pyecharts——绘制柱形图折线图 2224070247 信息可视化 python java 数据可视化
一、pyecharts概述自2013年6月百度EFE(ExcellentFrontEnd）数据可视化团队研发的ECharts1.0发布到GitHub网站以来，ECharts一直备受业界权威的关注并获得广泛好评，成为目前成熟且流行的数据可视化图表工具，被应用到诸多数据可视化的开发领域。Python作为数据分析领域最受欢迎的语言，也加入ECharts的使用行列，并研发出方便Python开发者使用的数据
Python 实现图片裁剪（附代码） | Python工具剑客阿良_ALiang
前言本文提供将图片按照自定义尺寸进行裁剪的工具方法，一如既往的实用主义。环境依赖ffmpeg环境安装，可以参考我的另一篇文章：windowsffmpeg安装部署_阿良的博客-CSDN博客本文主要使用到的不是ffmpeg，而是ffprobe也在上面这篇文章中的zip包中。ffmpy安装：pipinstallffmpy-ihttps://pypi.douban.com/simple代码不废话了，上代码
【华为OD技术面试真题 - 技术面】- python八股文真题题库（4) 算法大师华为od 面试 python
华为OD面试真题精选专栏：华为OD面试真题精选目录:2024华为OD面试手撕代码真题目录以及八股文真题目录文章目录华为OD面试真题精选**1.Python中的`with`**用途和功能自动资源管理示例：文件操作上下文管理协议示例代码工作流程解析优点2.\_\_new\_\_和**\_\_init\_\_**区别__new____init__区别总结3.**切片（Slicing）操作**基本切片语法
python os 环境变量 CV矿工 python 开发语言 numpy
环境变量：环境变量是程序和操作系统之间的通信方式。有些字符不宜明文写进代码里，比如数据库密码，个人账户密码，如果写进自己本机的环境变量里，程序用的时候通过os.environ.get（）取出来就行了。os.environ是一个环境变量的字典。环境变量的相关操作importos"""设置/修改环境变量：os.environ[‘环境变量名称’]=‘环境变量值’#其中key和value均为string类
高级 ECharts 技巧：自定义图表主题与样式 SnowMan1993 echarts 信息可视化数据分析
ECharts是一个强大的数据可视化库，提供了多种内置主题和样式，但你也可以根据项目的设计需求，自定义图表的主题与样式。本文将介绍如何使用ECharts自定义图表主题，以提升数据可视化的吸引力和一致性。1.什么是ECharts主题？ECharts的主题是指定义图表样式的配置项，包括颜色、字体、线条样式等。通过预设主题，你可以快速更改图表的整体风格，而自定义主题则允许你在此基础上进行个性化设置。2.
Python爬虫解析工具之xpath使用详解 eqa11 python 爬虫开发语言
文章目录Python爬虫解析工具之xpath使用详解一、引言二、环境准备1、插件安装2、依赖库安装三、xpath语法详解1、路径表达式2、通配符3、谓语4、常用函数四、xpath在Python代码中的使用1、文档树的创建2、使用xpath表达式3、获取元素内容和属性五、总结Python爬虫解析工具之xpath使用详解一、引言在Python爬虫开发中，数据提取是一个至关重要的环节。xpath作为一门
【华为OD技术面试真题 - 技术面】- python八股文真题题库（1）算法大师华为od 面试 python
华为OD面试真题精选专栏：华为OD面试真题精选目录:2024华为OD面试手撕代码真题目录以及八股文真题目录文章目录华为OD面试真题精选1.数据预处理流程数据预处理的主要步骤工具和库2.介绍线性回归、逻辑回归模型线性回归（LinearRegression）模型形式：关键点：逻辑回归（LogisticRegression）模型形式：关键点：参数估计与评估：3.python浅拷贝及深拷贝浅拷贝（Shal
数字里的世界17期：2021年全球10大顶级数据中心，中国移动榜首张三叨
你知道吗？2016年，全球的数据中心共计用电4160亿千瓦时，比整个英国的发电量还多40％！前言每天，我们都会创造超过250万TB的数据。并且随着物联网（IOT）的不断普及，这一数据将持续增长。如此庞大的数据被存储在被称为“数据中心”的专用设施中。虽然最早的数据中心建于20世纪40年代，但直到1997-2000年的互联网泡沫期间才逐渐成为主流。当前人类的技术，比如人工智能和机器学习，已经将我们推向
nosql数据库技术与应用知识点皆过客，揽星河 NoSQL nosql 数据库大数据数据分析数据结构非关系型数据库
Nosql知识回顾大数据处理流程数据采集(flume、爬虫、传感器)数据存储(本门课程NoSQL所处的阶段)Hdfs、MongoDB、HBase等数据清洗(入仓)Hive等数据处理、分析(Spark、Flink等)数据可视化数据挖掘、机器学习应用(Python、SparkMLlib等)大数据时代存储的挑战(三高)高并发(同一时间很多人访问)高扩展(要求随时根据需求扩展存储)高效率(要求读写速度快)
《Python数据分析实战终极指南》 xjt921122 python 数据分析开发语言
对于分析师来说，大家在学习Python数据分析的路上，多多少少都遇到过很多大坑**，有关于技能和思维的**：Excel已经没办法处理现有的数据量了，应该学Python吗？找了一大堆Python和Pandas的资料来学习，为什么自己动手就懵了？跟着比赛类公开数据分析案例练了很久，为什么当自己面对数据需求还是只会数据处理而没有分析思路？学了对比、细分、聚类分析，也会用PEST、波特五力这类分析法，为啥
Python中深拷贝与浅拷贝的区别 yuxiaoyu.
转自：http://blog.csdn.net/u014745194/article/details/70271868定义：在Python中对象的赋值其实就是对象的引用。当创建一个对象，把它赋值给另一个变量的时候，python并没有拷贝这个对象，只是拷贝了这个对象的引用而已。浅拷贝：拷贝了最外围的对象本身，内部的元素都只是拷贝了一个引用而已。也就是，把对象复制一遍，但是该对象中引用的其他对象我不复
Python开发常用的三方模块如下：换个网名有点难 python 开发语言
Python是一门功能强大的编程语言，拥有丰富的第三方库，这些库为开发者提供了极大的便利。以下是100个常用的Python库，涵盖了多个领域：1、NumPy，用于科学计算的基础库。2、Pandas，提供数据结构和数据分析工具。3、Matplotlib，一个绘图库。4、Scikit-learn，机器学习库。5、SciPy，用于数学、科学和工程的库。6、TensorFlow，由Google开发的开源机
apache 安装linux windows 墙头上一根草 apache inux windows
linux安装Apache 有两种方式一种是手动安装通过二进制的文件进行安装，另外一种就是通过yum 安装，此中安装方式，需要物理机联网。以下分别介绍两种的安装方式通过二进制文件安装Apache需要的软件有apr,apr-util,pcre 1，安装 apr 下载地址：htt
fill_parent、wrap_content和match_parent的区别 Cb123456 match_parent fill_parent
fill_parent、wrap_content和match_parent的区别: 1）fill_parent 设置一个构件的布局为fill_parent将强制性地使构件扩展，以填充布局单元内尽可能多的空间。这跟Windows控件的dockstyle属性大体一致。设置一个顶部布局或控件为fill_parent将强制性让它布满整个屏幕。 2） wrap_conte
网页自适应设计天子之骄 html css 响应式设计页面自适应
网页自适应设计网页对浏览器窗口的自适应支持变得越来越重要了。自适应响应设计更是异常火爆。再加上移动端的崛起，更是如日中天。以前为了适应不同屏幕分布率和浏览器窗口的扩大和缩小，需要设计几套css样式，用js脚本判断窗口大小，选择加载。结构臃肿，加载负担较大。现笔者经过一定时间的学习，有所心得，故分享于此，加强交流，共同进步。同时希望对大家有所
[sql server] 分组取最大最小常用sql 一炮送你回车库 SQL Server
--分组取最大最小常用sql--测试环境if OBJECT_ID('tb') is not null drop table tb;gocreate table tb( col1 int, col2 int, Fcount int)insert into tbselect 11,20,1 union allselect 11,22,1 union allselect 1
ImageIO写图片输出到硬盘 3213213333332132 java image
package awt; import java.awt.Color; import java.awt.Font; import java.awt.Graphics; import java.awt.image.BufferedImage; import java.io.File; import java.io.IOException; import javax.imagei
自己的String动态数组宝剑锋梅花香 java 动态数组数组
数组还是好说，学过一两门编程语言的就知道，需要注意的是数组声明时需要把大小给它定下来，比如声明一个字符串类型的数组：String str[]=new String[10]; 但是问题就来了，每次都是大小确定的数组，我需要数组大小不固定随时变化怎么办呢？动态数组就这样应运而生，龙哥给我们讲的是自己用代码写动态数组，并非用的ArrayList 看看字符
pinyin4j工具类 darkranger .net
pinyin4j工具类Java工具类 2010-04-24 00:47:00 阅读69 评论0 字号：大中小引入pinyin4j-2.5.0.jar包: pinyin4j是一个功能强悍的汉语拼音工具包，主要是从汉语获取各种格式和需求的拼音，功能强悍，下面看看如何使用pinyin4j。本人以前用AscII编码提取工具，效果不理想，现在用pinyin4j简单实现了一个。功能还不是很完美，
StarUML学习笔记----基本概念 aijuans UML建模
介绍StarUML的基本概念，这些都是有效运用StarUML?所需要的。包括对模型、视图、图、项目、单元、方法、框架、模型块及其差异以及UML轮廓。模型、视与图（Model, View and Diagram） &
Activiti最终总结 avords Activiti id 工作流
1、流程定义ID：ProcessDefinitionId，当定义一个流程就会产生。 2、流程实例ID：ProcessInstanceId，当开始一个具体的流程时就会产生，也就是不同的流程实例ID可能有相同的流程定义ID。 3、TaskId，每一个userTask都会有一个Id这个是存在于流程实例上的。 4、TaskDefinitionKey和（ActivityImpl activityId
从省市区多重级联想到的，react和jquery的差别 bee1314 jquery UI react
在我们的前端项目里经常会用到级联的select，比如省市区这样。通常这种级联大多是动态的。比如先加载了省，点击省加载市，点击市加载区。然后数据通常ajax返回。如果没有数据则说明到了叶子节点。针对这种场景，如果我们使用jquery来实现，要考虑很多的问题，数据部分，以及大量的dom操作。比如这个页面上显示了某个区，这时候我切换省，要把市重新初始化数据，然后区域的部分要从页面
Eclipse快捷键大全 bijian1013 java eclipse 快捷键
Ctrl+1 快速修复(最经典的快捷键,就不用多说了)Ctrl+D: 删除当前行 Ctrl+Alt+↓ 复制当前行到下一行(复制增加)Ctrl+Alt+↑ 复制当前行到上一行(复制增加)Alt+↓ 当前行和下面一行交互位置(特别实用,可以省去先剪切,再粘贴了)Alt+↑ 当前行和上面一行交互位置(同上)Alt+← 前一个编辑的页面Alt+→ 下一个编辑的页面(当然是针对上面那条来说了)Alt+En
js 笔记函数征客丶 JavaScript
一、函数的使用 1.1、定义函数变量 var vName = funcation(params){ } 1.2、函数的调用函数变量的调用： vName(params); 函数定义时自发调用：(function(params){})(params); 1.3、函数中变量赋值 var a = 'a'; var ff
【Scala四】分析Spark源代码总结的Scala语法二 bit1129 scala
1. Some操作在下面的代码中，使用了Some操作：if (self.partitioner == Some(partitioner))，那么Some(partitioner)表示什么含义？首先partitioner是方法combineByKey传入的变量， Some的文档说明： /** Class `Some[A]` represents existin
java 匿名内部类 BlueSkator java匿名内部类
组合优先于继承 Java的匿名类，就是提供了一个快捷方便的手段，令继承关系可以方便地变成组合关系继承只有一个时候才能用，当你要求子类的实例可以替代父类实例的位置时才可以用继承。在Java中内部类主要分为成员内部类、局部内部类、匿名内部类、静态内部类。内部类不是很好理解，但说白了其实也就是一个类中还包含着另外一个类如同一个人是由大脑、肢体、器官等身体结果组成，而内部类相
盗版win装在MAC有害发热，苹果的东西不值得买，win应该不用 ljy325 游戏 apple windows XP OS
Mac mini 型号: MC270CH-A RMB:5,688 Apple 对windows的产品支持不好,有以下问题: 1.装完了xp,发现机身很热虽然没有运行任何程序！貌似显卡跑游戏发热一样，按照那样的发热量,那部机子损耗很大,使用寿命受到严重的影响! 2.反观安装了Mac os的展示机，发热量很小，运行了1天温度也没有那么高 &nbs
读《研磨设计模式》-代码笔记-生成器模式-Builder bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ /** * 生成器模式的意图在于将一个复杂的构建与其表示相分离，使得同样的构建过程可以创建不同的表示（GoF） * 个人理解： * 构建一个复杂的对象，对于创建者（Builder）来说，一是要有数据来源(rawData)，二是要返回构
JIRA与SVN插件安装 chenyu19891124 SVN jira
JIRA安装好后提交代码并要显示在JIRA上，这得需要用SVN的插件才能看见开发人员提交的代码。 1.下载svn与jira插件安装包，解压后在安装包(atlassian-jira-subversion-plugin-0.10.1) 2.解压出来的包里下的lib文件夹下的jar拷贝到(C:\Program Files\Atlassian\JIRA 4.3.4\atlassian-jira\WEB
常用数学思想方法 comsci 工作
对于搞工程和技术的朋友来讲，在工作中常常遇到一些实际问题，而采用常规的思维方式无法很好的解决这些问题，那么这个时候我们就需要用数学语言和数学工具，而使用数学工具的前提却是用数学思想的方法来描述问题。。下面转帖几种常用的数学思想方法，仅供学习和参考函数思想　　把某一数学问题用函数表示出来，并且利用函数探究这个问题的一般规律。这是最基本、最常用的数学方法
pl/sql集合类型 daizj oracle 集合 type pl/sql
--集合类型 /* 单行单列的数据，使用标量变量单行多列数据，使用记录单列多行数据，使用集合（。。。） *集合：类似于数组也就是。pl/sql集合类型包括索引表（pl/sql table）、嵌套表（Nested Table）、变长数组（VARRAY）等 */ /* --集合方法 &n
[Ofbiz]ofbiz初用 dinguangx 电商 ofbiz
从github下载最新的ofbiz（截止2015-7-13），从源码进行ofbiz的试用 1. 加载测试库 ofbiz内置derby，通过下面的命令初始化测试库 ./ant load-demo (与load-seed有一些区别) 2. 启动内置tomcat ./ant start 或 ./startofbiz.sh 或 java -jar ofbiz.jar &
结构体中最后一个元素是长度为0的数组 dcj3sjt126com c gcc
在Linux源代码中，有很多的结构体最后都定义了一个元素个数为0个的数组，如/usr/include/linux/if_pppox.h中有这样一个结构体： struct pppoe_tag { __u16 tag_type; __u16 tag_len; &n
Linux cp 实现强行覆盖 dcj3sjt126com linux
发现在Fedora 10 /ubutun 里面用cp -fr src dest，即使加了-f也是不能强行覆盖的，这时怎么回事的呢？一两个文件还好说，就输几个yes吧，但是要是n多文件怎么办，那还不输死人呢？下面提供三种解决办法。方法一我们输入alias命令，看看系统给cp起了一个什么别名。 [root@localhost ~]# aliasalias cp=’cp -i’a
Memcached(一)、HelloWorld frank1234 memcached
一、简介高性能的架构离不开缓存，分布式缓存中的佼佼者当属memcached，它通过客户端将不同的key hash到不同的memcached服务器中，而获取的时候也到相同的服务器中获取，由于不需要做集群同步，也就省去了集群间同步的开销和延迟，所以它相对于ehcache等缓存来说能更好的支持分布式应用，具有更强的横向伸缩能力。二、客户端选择一个memcached客户端，我这里用的是memc
Search in Rotated Sorted Array II hcx2013 search
Follow up for "Search in Rotated Sorted Array":What if duplicates are allowed? Would this affect the run-time complexity? How and why? Write a function to determine if a given ta
Spring4新特性——更好的Java泛型操作API jinnianshilongnian spring4 generic type
Spring4新特性——泛型限定式依赖注入 Spring4新特性——核心容器的其他改进 Spring4新特性——Web开发的增强 Spring4新特性——集成Bean Validation 1.1(JSR-349)到SpringMVC Spring4新特性——Groovy Bean定义DSL Spring4新特性——更好的Java泛型操作API Spring4新
CentOS安装JDK liuxingguome centos
1、行卸载原来的： [root@localhost opt]# rpm -qa | grep java tzdata-java-2014g-1.el6.noarch java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el6_5.x86_64 java-1.6.0-openjdk-1.6.0.0-11.1.13.4.el6.x86_64 [root@localhost
二分搜索专题2-在有序二维数组中搜索一个元素 OpenMind 二维数组算法二分搜索
1,设二维数组p的每行每列都按照下标递增的顺序递增。用数学语言描述如下：p满足 (1),对任意的x1，x2，y，如果x1<x2,则p(x1,y)<p(x2,y); (2),对任意的x，y1,y2, 如果y1<y2,则p(x,y1)<p(x,y2); 2,问题：给定满足1的数组p和一个整数k，求是否存在x0,y0使得p(x0,y0)=k? 3,算法分析： (
java 随机数 Math与Random SaraWon java Math Random
今天需要在程序中产生随机数，知道有两种方法可以使用，但是使用Math和Random的区别还不是特别清楚，看到一篇文章是关于的，觉得写的还挺不错的，原文地址是 http://www.oschina.net/question/157182_45274?sort=default&p=1#answers 产生1到10之间的随机数的两种实现方式： //Math Math.roun
oracle创建表空间 tugn oracle
create temporary tablespace TXSJ_TEMP tempfile 'E:\Oracle\oradata\TXSJ_TEMP.dbf' size 32m autoextend on next 32m maxsize 2048m extent m
使用Java8实现自己的个性化搜索引擎 yangshangchuan java superword 搜索引擎 java8 全文检索
需要对249本软件著作实现句子级别全文检索，这些著作均为PDF文件，不使用现有的框架如lucene，自己实现的方法如下： 1、从PDF文件中提取文本，这里的重点是如何最大可能地还原文本。提取之后的文本，一个句子一行保存为文本文件。 2、将所有文本文件合并为一个单一的文本文件，这样，每一个句子就有一个唯一行号。 3、对每一行文本进行分词，建立倒排表，倒排表的格式为：词=包含该词的总行数N=行号