More raw data files and Jupyter notebooks:
GitHub: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 21 (2)
Exercise
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country’s GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.
Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.
Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy’s .reshape() method. Don’t worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn, so it is useful to practice.
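For instance, here is a minimal sketch of what .reshape(-1, 1) does to a 1-D array (the -1 tells NumPy to infer the length of that axis):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # shape (3,)
col = a.reshape(-1, 1)          # shape (3, 1): one column, row count inferred
print(col.shape)                # (3, 1)
```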
Instruction

- Import numpy and pandas as their standard aliases.
- Read the file 'gapminder.csv' into a DataFrame df using the read_csv() function.
- Create array X for the 'fertility' feature and array y for the 'life' target variable.
- Reshape the arrays by using the .reshape() method and passing in -1 and 1.

# Modified/Added by Jinny
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv'
from urllib.request import urlretrieve
urlretrieve(fn, 'gapminder.csv')
('gapminder.csv', )
# Import numpy and pandas
import numpy as np
import pandas as pd
# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values
# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))
# Reshape X and y
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)
# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))
Dimensions of y before reshaping: (139,)
Dimensions of X before reshaping: (139,)
Dimensions of y after reshaping: (139, 1)
Dimensions of X after reshaping: (139, 1)
Exercise
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), and .head().
In case you are curious, the heatmap was generated using Seaborn’s heatmap function and the following line of code, where df.corr() computes the pairwise correlation between columns:
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way to building regression models!
Instruction

- The DataFrame df has 139 samples (or rows) and 9 columns.
- life and fertility are negatively correlated.
- The mean of life is 69.602878.
- fertility is of type int64.
- GDP and life are positively correlated.

# Modified/Added by Jinny
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlretrieve
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv'
urlretrieve(fn, 'gapminder.csv')
df = pd.read_csv('gapminder.csv')
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
df.info()
RangeIndex: 139 entries, 0 to 138
Data columns (total 9 columns):
population 139 non-null float64
fertility 139 non-null float64
HIV 139 non-null float64
CO2 139 non-null float64
BMI_male 139 non-null float64
GDP 139 non-null float64
BMI_female 139 non-null float64
life 139 non-null float64
child_mortality 139 non-null float64
dtypes: float64(9)
memory usage: 9.9 KB
df.describe()
|       | population   | fertility  | HIV        | CO2        | BMI_male   | GDP           | BMI_female | life       | child_mortality |
|-------|--------------|------------|------------|------------|------------|---------------|------------|------------|-----------------|
| count | 1.390000e+02 | 139.000000 | 139.000000 | 139.000000 | 139.000000 | 139.000000    | 139.000000 | 139.000000 | 139.000000      |
| mean  | 3.549977e+07 | 3.005108   | 1.915612   | 4.459874   | 24.623054  | 16638.784173  | 126.701914 | 69.602878  | 45.097122       |
| std   | 1.095121e+08 | 1.615354   | 4.408974   | 6.268349   | 2.209368   | 19207.299083  | 4.471997   | 9.122189   | 45.724667       |
| min   | 2.773150e+05 | 1.280000   | 0.060000   | 0.008618   | 20.397420  | 588.000000    | 117.375500 | 45.200000  | 2.700000        |
| 25%   | 3.752776e+06 | 1.810000   | 0.100000   | 0.496190   | 22.448135  | 2899.000000   | 123.232200 | 62.200000  | 8.100000        |
| 50%   | 9.705130e+06 | 2.410000   | 0.400000   | 2.223796   | 25.156990  | 9938.000000   | 126.519600 | 72.000000  | 24.000000       |
| 75%   | 2.791973e+07 | 4.095000   | 1.300000   | 6.589156   | 26.497575  | 23278.500000  | 130.275900 | 76.850000  | 74.200000       |
| max   | 1.197070e+09 | 7.590000   | 25.900000  | 48.702062  | 28.456980  | 126076.000000 | 135.492000 | 82.600000  | 192.000000      |
df.head()
|   | population | fertility | HIV | CO2       | BMI_male | GDP     | BMI_female | life | child_mortality |
|---|------------|-----------|-----|-----------|----------|---------|------------|------|-----------------|
| 0 | 34811059.0 | 2.73      | 0.1 | 3.328945  | 24.59620 | 12314.0 | 129.9049   | 75.3 | 29.5            |
| 1 | 19842251.0 | 6.43      | 2.0 | 1.474353  | 22.25083 | 7103.0  | 130.1247   | 58.3 | 192.0           |
| 2 | 40381860.0 | 2.24      | 0.5 | 4.785170  | 27.50170 | 14646.0 | 118.8915   | 75.5 | 15.4            |
| 3 | 2975029.0  | 1.40      | 0.1 | 1.804106  | 25.35542 | 7383.0  | 132.8108   | 72.5 | 20.0            |
| 4 | 21370348.0 | 1.96      | 0.1 | 18.016313 | 27.56373 | 41312.0 | 117.3755   | 81.5 | 5.2             |
Exercise
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the R² score using scikit-learn’s .score() method.
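For a regressor, .score() returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal hand-rolled check (assuming reg, X_fertility, and y as defined in the solution below):

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot, which is what reg.score(X_fertility, y) returns
y_hat = reg.predict(X_fertility)
ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)              # should match reg.score(X_fertility, y)
```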
Instruction

- Import LinearRegression from sklearn.linear_model.
- Create a LinearRegression regressor called reg.
- Set up the prediction space to range from the minimum to the maximum of X_fertility. This has been done for you.
- Fit the regressor to the data (X_fertility and y) and compute its predictions using the .predict() method and the prediction_space array.
- Compute and print the R² score using the .score() method.

# modified/added by Jinny
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv')
y = df['life'].values
X = df.drop('life', axis=1)
# Reshape into 2-D column vectors for scikit-learn
y = y.reshape(-1, 1)
X_fertility = X['fertility'].values.reshape(-1, 1)
_ = plt.scatter(X['fertility'], y, color='blue')
_ = plt.ylabel('Life Expectancy')
_ = plt.xlabel('Fertility')
# -----------------------
# Import LinearRegression
from sklearn.linear_model import LinearRegression
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
# Print R^2
print(reg.score(X_fertility, y))
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
0.6192442167740035
Exercise
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R² score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.
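RMSE is simply the square root of the mean squared residual; a minimal sketch (assuming y_test and y_pred as produced in the solution below):

```python
import numpy as np

# RMSE = sqrt(mean((y_true - y_pred)^2)); equivalent to
# np.sqrt(mean_squared_error(y_test, y_pred))
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(rmse)
```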
Instruction

- Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
- Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
- Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.
- Compute and print the R² score using the .score() method on the test set.
- Compute and print the RMSE. To do this, first compute the Mean Squared Error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().

# Import necessary modules
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# Create the regressor: reg_all
reg_all = LinearRegression()
# Fit the regressor to the training data
reg_all.fit(X_train, y_train)
# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
R^2: 0.8380468731430133
Root Mean Squared Error: 3.2476010800369477
Exercise
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained but also tested on all of the available data.

In this exercise, you will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn’s cross_val_score() function uses R² as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.
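Under the hood, 5-fold CV partitions the rows into 5 folds and holds each one out in turn; here is a minimal sketch of the equivalent manual loop (assuming reg, X, and y as in the solution below):

```python
from sklearn.model_selection import KFold

# Each of the 5 folds serves as the test set exactly once;
# the model is refit on the remaining 4 folds at every iteration.
kf = KFold(n_splits=5)
scores = []
for train_idx, test_idx in kf.split(X):
    reg.fit(X[train_idx], y[train_idx])
    scores.append(reg.score(X[test_idx], y[test_idx]))
print(scores)  # one R^2 per fold, as cross_val_score(reg, X, y, cv=5) returns
```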
Instruction

- Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Create a linear regression regressor called reg.
- Use the cross_val_score() function to perform 5-fold cross-validation on X and y.
- Compute and print the average cross-validation score. You can use NumPy’s mean() function to compute the average.

# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)
# Print the 5-fold cross-validation scores
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
[0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]
Average 5-Fold CV Score: 0.859962772279345
Exercise
Cross-validation is essential, but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use %timeit to see how long each 3-fold CV takes compared to 10-fold CV by executing the following with cv=3 and cv=10:
%timeit cross_val_score(reg, X, y, cv = ____)
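For example (a sketch; timings vary by machine, and the ____ placeholder is filled in per the text above):

```python
# In IPython: 10-fold CV refits the model 10 times vs. 3, so expect it
# to take roughly three times as long on this small dataset.
%timeit cross_val_score(reg, X, y, cv=3)
%timeit cross_val_score(reg, X, y, cv=10)
```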
pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.
Instruction

- Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Create a linear regression regressor called reg.
- Perform 3-fold CV and then 10-fold CV, and compare the resulting mean scores.

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))
# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))
0.8718712782622262
0.8436128620131267
Exercise
In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.
Instruction

- Import Lasso from sklearn.linear_model.
- Instantiate a Lasso regressor with an alpha of 0.4 and specify normalize=True.
- Fit the regressor to the data and compute the coefficients using the coef_ attribute.

# modified/added by Jinny
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv')
y = df['life'].values
X = df.drop('life', axis=1).values
df_columns = df.drop('life', axis=1).columns
# -----------------------------------
# Import Lasso
from sklearn.linear_model import Lasso
# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)
# Fit the regressor to the data
lasso.fit(X, y)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()
[-0. -0. -0. 0. 0. 0.
-0. -0.07087587]
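To read the coefficients off by feature name rather than by position, one could zip them with the column labels (using lasso_coef and df_columns from the code above):

```python
# Pair each coefficient with its column name; only the feature lasso
# kept shows a nonzero value (child_mortality in the output above).
for name, coef in zip(df_columns, lasso_coef):
    print(f"{name}: {coef}")
```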
Exercise
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.
Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.
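In symbols (a standard formulation; the β are the coefficients and α the regularization strength), lasso adds the L1 norm of the coefficients to the sum of squared residuals, whereas ridge, described next, uses the squared L2 norm:

$$\text{lasso:}\;\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \alpha \sum_{j=1}^{p}\lvert\beta_j\rvert \qquad \text{ridge:}\;\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \alpha \sum_{j=1}^{p}\beta_j^2$$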
If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be computing the L2 norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated R² scores for each, using this function that we have defined for you, which plots the R² score as well as standard error for each alpha:
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
Don’t worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R² score varies with different alphas, and to understand the importance of selecting the right value for alpha. You’ll learn how to tune alpha in the next chapter.
Instruction

- Instantiate a Ridge regressor and specify normalize=True.
- Inside the for loop:
  - Specify the alpha value for the regressor to use.
  - Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
  - Append the average and the standard deviation of the computed cross-validated scores. NumPy has been pre-imported for you as np.
- Use the display_plot() function to visualize the scores and standard deviations.

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
# Import necessary modules
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)
# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha

    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)

    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))

    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))
# Display the plot
display_plot(ridge_scores, ridge_scores_std)
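As a follow-up (not part of the exercise), one could read the best-scoring alpha straight from the collected scores:

```python
# Index of the highest mean CV score across the alpha grid
best_idx = int(np.argmax(ridge_scores))
print(alpha_space[best_idx], ridge_scores[best_idx])
```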