JinnyR

Datacamp 笔记&代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines

更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (4)

Exercise

Exploring categorical features

The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

Instruction

Import pandas as pd.
Read the CSV file 'gapminder.csv' into a DataFrame called df.
Use pandas to create a boxplot showing the variation of life expectancy ('life') by region ('Region'). To do so, pass the column names in to df.boxplot() (in that order).

import matplotlib.pyplot as plt
plt.style.use('ggplot')
from urllib.request import urlretrieve
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')

('gapminder.csv', )

# Import pandas
import pandas as pd

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

Exercise

Creating dummy variables

As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region' feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

Instruction

Use the pandas get_dummies() function to create dummy variables from the df DataFrame. Store the result as df_region.
Print the columns of df_region. This has been done for you.
Use the get_dummies() function again, this time specifying drop_first=True to drop the unneeded dummy variable (in this case, 'Region_America').
Hit 'Submit Answer to print the new columns of df_region and take note of how one column was dropped!

# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)

Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP',
       'BMI_female', 'life', 'child_mortality', 'Region_America',
       'Region_East Asia & Pacific', 'Region_Europe & Central Asia',
       'Region_Middle East & North Africa', 'Region_South Asia',
       'Region_Sub-Saharan Africa'],
      dtype='object')
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP',
       'BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific',
       'Region_Europe & Central Asia', 'Region_Middle East & North Africa',
       'Region_South Asia', 'Region_Sub-Saharan Africa'],
      dtype='object')

Exercise

Regression with categorical features

Having created the dummy variables from the 'Region'feature, you can build regression models as you did before. Here, you’ll use ridge regression to perform 5-fold cross-validation.

The feature array X and target variable array y have been pre-loaded.

Instruction

Import Ridge from sklearn.linear_model and cross_val_score from sklearn.model_selection.
Instantiate a ridge regressor called ridge with alpha=0.5 and normalize=True.
Perform 5-fold cross-validation on X and y using the cross_val_score() function.
Print the cross-validated scores.

# modified/added by Jinny
import pandas as pd
import numpy as np

from urllib.request import urlretrieve

fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')

df = pd.read_csv('gapminder.csv')

df_region = pd.get_dummies(df)

df_region = df_region.drop('Region_America', axis=1)

X = df_region.drop('life', axis=1).values

y = df_region['life'].values

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)

[0.86808336 0.80623545 0.84004203 0.7754344  0.87503712]

Exercise

Dropping missing data

The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it’s time for you to take care of these yourself!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head()method. You will see that there are certain data points labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0 - real-world data can be very messy! If you’re lucky, the missing values will already be encoded as NaN. We use NaNbecause it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn’s Imputation transformer Imputer().

In this exercise, your job is to convert the '?'s to NaNs, and then drop the rows that contain them from the DataFrame.

Instruction

Explore the DataFrame df in the IPython Shell. Notice how the missing value is represented.
Convert all '?' data points to np.nan.
Count the total number of NaNs using the .isnull() and .sum() methods. This has been done for you.
Drop the rows with missing values from df using .dropna().
Hit ‘Submit Answer’ to see how many rows were lost by dropping the missing values.

# modified/added by Jinny
import numpy as np
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/house-votes-84.csv',
 header=None, names = ['infants', 'water', 'budget', 'physician', 'salvador', 'religious',
 'satellite', 'aid', 'missile', 'immigration', 'synfuels', 'education',
 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa'])

df = df.reset_index()
df.rename(columns = {'index': 'party'}, inplace = True)

df[df == 'y'] = 1
df[df == 'n'] = 0

# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

party                  0
infants               12
water                 48
budget                11
physician             11
salvador              15
religious             11
satellite             14
aid                   15
missile               22
immigration            7
synfuels              21
education             31
superfund             25
crime                 17
duty_free_exports     28
eaa_rsa              104
dtype: int64
Shape of Original DataFrame: (435, 17)
Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)

Exercise

Imputing missing data in a ML Pipeline I

As you’ve come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

You’ll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You’ve seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same .fit() and .predict() methods as before.

Instruction

Import Imputer from sklearn.preprocessing and SVC from sklearn.svm. SVC stands for Support Vector Classification, which is a type of SVM.
Setup the Imputation transformer to impute missing data (represented as 'NaN') with the 'most_frequent'value in the column (axis=0).
Instantiate a SVC classifier. Store the result in clf.
Create the steps of the pipeline by creating a list of tuples:
- The first tuple should consist of the imputation step, using imp.
- The second should consist of the classifier.

# Import the Imputer module
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC

# Setup the Imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

Exercise

Imputing missing data in a ML Pipeline II

Having setup the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman’s party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the .fit() and .predict() methods on pipelines just as you did with your classifiers and regressors!

Practice this for yourself now and generate a classification report of your predictions. The steps of the pipeline have been set up for you, and the feature array X and target variable array y have been pre-loaded. Additionally, train_test_split and classification_report have been imported from sklearn.model_selection and sklearn.metrics respectively.

Instruction

Import the following modules:
- Imputer from sklearn.preprocessing and Pipeline from sklearn.pipeline.
- SVC from sklearn.svm.
Create the pipeline using Pipeline() and steps.
Create training and test sets. Use 30% of the data for testing and a random state of 42.
Fit the pipeline to the training set and predict the labels of the test set.
Compute the classification report.

# modified/added by Jinny
import pandas as pd
import numpy as np
#
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/votes-ch1.csv')
# Create arrays for the features and the response variable. As a reminder, the response variable is 'party'
y = df['party']
X = df.drop('party', axis=1)

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Import necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    democrat       0.99      0.96      0.98        85
  republican       0.94      0.98      0.96        46

   micro avg       0.97      0.97      0.97       131
   macro avg       0.96      0.97      0.97       131
weighted avg       0.97      0.97      0.97       131

Exercise

Centering and scaling your data

In the video, Hugo demonstrated how significantly the performance of a model can improve if the features are scaled. Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.

You will now explore scaling for yourself on a new dataset - White Wine Quality! Hugo used the Red Wine Quality dataset in the video. We have used the 'quality' feature of the wine to create a binary target variable: If 'quality'is less than 5, the target variable is 1, and otherwise, it is 0.

The DataFrame has been pre-loaded as df, along with the feature and target variable arrays X and y. Explore it in the IPython Shell. Notice how some features seem to have different units of measurement. 'density', for instance, takes values between 0.98 and 1.04, while 'total sulfur dioxide' ranges from 9 to 440. As a result, it may be worth scaling the features here. Your job in this exercise is to scale the features and compute the mean and standard deviation of the unscaled features compared to the scaled features.

Instruction

Import scale from sklearn.preprocessing.
Scale the features X using scale().
Print the mean and standard deviation of the unscaled features X, and then the scaled features X_scaled. Use the numpy functions np.mean() and np.std() to compute the mean and standard deviations.
Compute the classification report.

# modified/added by Jinny
import pandas as pd
import numpy as np
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_3238/datasets/white-wine.csv')
X = df.drop('quality' , 1).values # drop target variable
y1 = df['quality'].values
y = y1 <= 5

# Import scale
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Unscaled Features: 18.432687072460002
Standard Deviation of Unscaled Features: 41.54494764094571
Mean of Scaled Features: 2.7314972981668206e-15
Standard Deviation of Scaled Features: 0.9999999999999999

Exercise

Centering and scaling in a pipeline

With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.

The feature array and target variable array have been pre-loaded as X and y. Additionally, KNeighborsClassifier and train_test_split have been imported from sklearn.neighbors and sklearn.model_selection, respectively.

Instruction

Import the following modules:
- StandardScaler from sklearn.preprocessing.
- Pipeline from sklearn.pipeline.
Complete the steps of the pipeline with StandardScaler() for 'scaler' and KNeighborsClassifier() for 'knn'.
Create the pipeline using Pipeline() and steps.
Create training and test sets, with 30% used for testing. Use a random state of 42.
Fit the pipeline to the training set.
Compute the accuracy scores of the scaled and unscaled models by using the .score() method inside the provided print() functions.

# modified/added by Jinny
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_3238/datasets/white-wine.csv')
X = df.drop('quality' , 1).values # drop target variable
y1 = df['quality'].values
y = y1 <= 5

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

Accuracy with Scaling: 0.7700680272108843
Accuracy without Scaling: 0.6979591836734694

Exercise

Bringing it all together I: Pipeline for classification

It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.

You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are $C$ and gammagamma. $C$ controls the regularization strength. It is analogous to the $C$ you tuned for logistic regression in Chapter 3, while gammagamma controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.

The following modules have been pre-loaded: Pipeline, svm, train_test_split, GridSearchCV, classification_report, accuracy_score. The feature and target variable arrays X and y have also been pre-loaded.

Instruction

Setup the pipeline with the following steps:
- Scaling, called 'scaler' with StandardScaler().
- Classification, called 'SVM' with SVC().
Specify the hyperparameter space using the following notation: 'step_name__parameter_name'. Here, the step_name is SVM, and the parameter_names are C and gamma.
Create training and test sets, with 20% of the data used for the test set. Use a random state of 21.
Instantiate GridSearchCV with the pipeline and hyperparameter space and fit it to the training set. Use 3-fold cross-validation (This is the default, so you don’t have to specify it).
Predict the labels of the test set and compute the metrics. The metrics have been computed for you.

# modified/added by Jinny
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_3238/datasets/white-wine.csv')
X = df.drop('quality' , 1).values # drop target variable
y1 = df['quality'].values
y = y1 <= 5
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.7795918367346939
              precision    recall  f1-score   support

       False       0.83      0.85      0.84       662
        True       0.67      0.63      0.65       318

   micro avg       0.78      0.78      0.78       980
   macro avg       0.75      0.74      0.74       980
weighted avg       0.78      0.78      0.78       980

Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.1}

Exercise

Bringing it all together II: Pipeline for regression

For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!

Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio of your ElasticNet using GridSearchCV.

All the necessary modules have been imported, and the feature and target variable arrays have been pre-loaded as X and y.

Instruction

Set up a pipeline with the following steps:
- 'imputation', which uses the Imputer()transformer and the 'mean' strategy to impute missing data ('NaN') using the mean of the column.
- 'scaler', which scales the features using StandardScaler().
- 'elasticnet', which instantiates an ElasticNet()regressor.
Specify the hyperparameter space for the l1l1 ratio using the following notation: 'step_name__parameter_name'. Here, the step_name is elasticnet, and the parameter_name is l1_ratio.
Create training and test sets, with 40% of the data used for the test set. Use a random state of 42.
Instantiate GridSearchCV with the pipeline and hyperparameter space. Use 3-fold cross-validation (This is the default, so you don’t have to specify it).
Fit the GridSearchCV object to the training set.
Compute R2R2 and the best parameters. This has been done for you, so hit ‘Submit Answer’ to see the results!

# modified/added by Jinny
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv')

y = df['life'].values
X = df.drop('life', axis=1).values

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

numpy.ndarray

# Setup the pipeline steps: steps
# modified by Jinny: Imputer -> SimpleImputer
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))

Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 1.0}
Tuned ElasticNet R squared: 0.8862016570888217

你可能感兴趣的:(datacamp)

python练习：Case Study - Sunlight in Austin 鲸鱼酱375
内容来自datacamp课程：pandasfoundation数据以及代码在github数据：数据1weather_data_austin_2010：2010年的Austin天气情况head为了后续更好使用，把date作为indexdf.Date=pd.to_datetime(df.Date)df.index=df.Datedf=df.drop(['Date'],axis=1)df.head()h
R学习记录 - Cleaning Data 侯悍超
做过数据分析的都知道，可能只有25%左右的时间是花在分析数据上边的，剩下的时间都在清洗数据（cleaningdata）。所以清理数据是数据分析中超级费时间，但是超级重要的事情。以前我清理数据基本上就重点看3个事情：重复值极端值缺失值不过这只是自己总结的东西，今天算是比较系统地在datacamp上学了清理数据的过程，感觉以后清理数据更有信心了。通过这次学习我理解的主要过程稍微升级了一点：观察数据：看
Markdown printQiao
数据帮（DataCamp）专注于引进国外优质数据分析课程。Datacamp.jpgimportpandas
SQL基础操作 dataengineer
SQL，全称StructuredQueryLanguage，是一种关系数据库查询和编程语言。本文是DataCamp《IntrotoSQLforDataScience》课程的学习笔记。1.选择1列#从表格films中选择名为title的1列SELECTtitleFROMfilms;2.选择多列#从表格films中选择名为title、release_year和country的3列SELECTtitle
DataCamp在线学习平台源代码杀手人工智能跨学科编程学习
DataCamp（https://www.datacamp.com/blog）是一个在线学习平台，专注于数据科学和分析领域的教育。该平台提供丰富的课程，涵盖了从数据处理到机器学习和深度学习的各个方面。以下是DataCamp的主要特点：互动学习：DataCamp采用互动式学习方法，通过实时的编程环境，学习者可以直接在浏览器中执行代码，并获得即时反馈。多语言支持：平台主要支持Python和R语言，为学
Python学习资源精华：数据分析篇(数据处理) 可口可乐没有乐 python 学习数据分析
重点推荐：数据科学牛站DataScienceTutorialshttps://www.datacamp.com/tutorial数据分析：四剑客NumPy：数值计算Pandas：数据处理Seaborn：数据可视化Matplotlib：数据可视化，高度定制你想要今天重点分享：NumPy和PandasNumPy简介NumPy（NumericalPython）是Python的一种开源的数值计算扩展。这种
DataCamp课程：Joining Data in SQL Daisy Lee DataCamp课程 sql
JoiningDatainSQL1.IntroductiontojoinsInnerjoinBeginbyselectingallcolumnsfromthecitiestable.SELECT*FROMcities;namecountry_codecity_proper_popmetroarea_popurbanarea_popAbidjanCIV4765000null4765000AbuDha
SQL-DataCamp-Joining Data in SQL radar_sun DataCamp-SQL sql datacamp
1.IntroductiontoJoins1.1IntroductiontoINNERJOIN(video)1.2INNERJOINPostgreSQLwasmentionedintheslidesbutyou’llfindthatthesejoinsandthematerialhereappliestodifferentformsofSQLaswell.Recallfromthevideothe
在我强烈要求下，DataCamp送了我两节课！侯悍超
今天终于学完了16节dataanalyst的课，本来想着可以高高兴兴地把证书分享到linkedin，然后就可以等着猎头来帮我升值加薪了！~结果前几天发生了想不到的事情：我的学习进度从80%多掉到了70%多！刚看到的时候，我是不相信的。是我打开的姿势不对么？于是我退出登陆重进了一下，还是70%多？怎么回事？冷静下来仔细一看，课程的完成数目突然少了2节！原来是DataCamp怕我们跟不上“日新月异的技
DataCamp在线学习python 曼达随笔
一直没有系统学习过python，虽然用这个写过简单的程序，但每次都是要用到的时候祭出google大法，所以基础非常的不扎实，python语言的精髓也没有领会到。很多时候看到别人的程序，特别是复杂一些的程序，还是会一脸懵逼，似懂非懂。所以一直想要系统的学习一下。最开始用的是《Learnpythonthehardway》这本书。这本书就是以练习为主，有很多程序实例，从简单到复杂，学习的过程就是跟着敲程
Datacamp 笔记&代码 Supervised Learning with scikit-learn 第一章 Classification JinnyR datacamp datacamp sklearn data science python machine learning
更多原始数据文档和JupyterNotebookGithub:https://github.com/JinnyR/Datacamp_DataScienceTrack_PythonDatacamptrack:DataScientistwithPython-Course21(1)Exercisek-NearestNeighbors:FitHavingexploredtheCongressionalvo
难以取舍的Python和R，到底学哪个？丨程序之道丨
对于想从事数据行业的人和数据工作者来说，是学习R还是python，哪个工具更实用一直被大家争论。MartijnTheuwissen，DataCamp的教育专家详细比较了这两个工具。ython和R是统计学中两种最流行的的编程语言，R的功能性主要是统计学家在开发时考虑的（R具有强大的可视化功能），而Python因为易于理解的语法被大家所接受。在这篇文章中，我们将重点介绍R和Python以及它们在数据科
R_DATACAMP Working with Dates and Times in R 一条很闲的咸鱼
DatesandTimesinRR中的日期与时间as.Date("2014-04-10")转化为日期格式library(anytime)anytime包sep_10_2009<-c("September102009","2009-09-10","10Sep2009","09-10-2009")anytime(sep_10_2009)统一转化为2009-09-10UTC的格式ggplot(relea
送你一个在线机器学习网站，真香！生信宝典决策树算法机器学习深度学习人工智能
https://campus.datacamp.com机器学习系列教程从随机森林开始，一步步理解决策树、随机森林、ROC/AUC、数据集、交叉验证的概念和实践。文字能说清的用文字、图片能展示的用、描述不清的用公式、公式还不清楚的写个简单代码，一步步理清各个环节和概念。再到成熟代码应用、模型调参、模型比较、模型评估，学习整个机器学习需要用到的知识和技能。机器学习算法-随机森林之决策树初探（1）机器学
R_DATACAMP7 Exploratory Data Analysis 一条很闲的咸鱼
ExploringCategoricalData探索分类数据droplevels()舍弃相关不常用的levelstheme(axis.text.x=element_text(angle=90))有关主题全面了解一下ggplot中，x=as.factor()与x=中的区别在哪里，转化为factor格式后有什么不同？xlim(c(90,550))ggplot中限制图的大小xlim(c())ggtitl
第二章数据可视化——6.21 小憨豆
在完成数据的清理和重构，为了使数据能够更加易于理解，需要将数据进行可视化处理，这里主要用到的是Python数据可视化库Matplotlib。导入Matplotlib库的时候，有的时候需要加上%matplotlibinline，这句话的功能是：可以内嵌绘图，并且可以省略掉plt.show()这一步。最主要的matplotlib库的操作说明可以参考下图（作者是Datacamp，下载于Python程序员
R_DATACAMP10 Cluster Analysis in R分类分析一条很闲的咸鱼
Calculatingdistancebetweenobservations计算两点间距离lims(x=c(-30,30),y=c(-20,20))应用于ggplot中，可以设置图标坐标轴的范围dist(two_players)dist(data.frame)会计算出数据结构中各个点相互之间的举例scale(data.frame)后再dist，可以消除因为同组数之间相差太大引起的影响，比如一个是千
python练习：67 Years of Lego Sets and their Features 鲸鱼酱375
前提：project来自datacamphttps://www.datacamp.com/projects/101.介绍EveryonelovesLego(unlessyoueversteppedonone).Didyouknowbythewaythat"Lego"wasderivedfromtheDanishphraseleggodt,whichmeans"playwell"?Unlessyou
R_DATACAMP8 Exploratory Data Analysis in R: Case Study 一条很闲的咸鱼
countries%filter(country%in%countries)筛选出对应的城市，这么看更明了关于ggplot2中，facet_wrap（）分类后各个图的y一致的问题，如果想要各个图的y不一致，可以使用facet_wrap(~country,scales="free_y")lm(y~x,数据集)回归预测library(broom)tidy(US_fit)即tidy使用lm回归后的数据集
datacamp Cheat Sheet iOSDevLog
1.PythonForDataScienceCheatSheetImportingData.png2.RForDataScienceCheatSheetTidyverseforBeginners.png3.PythonForDataScienceCheatSheetBokeh.png4.PythonForDataScienceCheatSheetJupyterNotebook.png5.Pytho
Seaborn入门指南 Mr_喵
一、官网官网绝对是最权威的，并且存在大量通俗易懂的例子，非常适合学习seaborn官网二、来自DataCamp的教程不同于以往从图表的特性和类别来进行讲解的教程，本教程更注重绘图细节PythonSeabornTutorialForBeginners
DataCamp-Introduction of R 无机牛奶
thetermfactorreferstostatisticaldatatypeusedtohead()enablesyoutoshowthefirstobservationsofadataframe.Similarly,thefunctiontail()printsoutthelastobservationsinyourdataset.Thefunctionstr()showsyouthestr
【译】MLXTEND之StackingCVRegressor wong小尧
www.DataCamp.com中有很多数据科学家的cheatsheet，可以放在手边参考，大部分情况就够用了，以下是个人整理的一些详细的例子。spark中通常使用rdd，但是这样代码可读性差，目前rdd的很多方法已经不再更新了。dataframe大部分使用SparkSQL操作，速度会比rdd的方法更快，dataset是dataframe的子集，大部分api是互通的，目前主流是在使用SparkSQ
时间序列笔记-三指数模型新云旧雨
笔记说明在datacamp网站上学习“TimeSerieswithR”track“ForecastingUsingR”课程做的对应笔记。学识有限，错误难免，还请不吝赐教。学习的课程为“ForecastingUsingR”，主要用forecast包。课程参考教材Forecasting:PrinciplesandPractice课程中数据可在fpp2包得到本次笔记也参考了其他人的文章：指数平滑方法简介
python的论文_python论文_python 论文_python毕业论文 - 云+社区 - 腾讯云 weixin_39963080 python的论文
广告关闭腾讯云11.11云上盛惠，精选热门产品助力上云，云服务器首年88元起，买的越多返的越多，最高返5000元！rank1.python马尔可夫链初学者教程文章地址：https:www.datacamp.comcommunitytutorialsmarkov-chains-python-tutorial?rank2.jakevanderplas-vega，vega-lite和altair探索性数
解读 | 数据分析领域七大热门职业 CDA·数据分析师 sql sqlserver 数据库
CDA数据分析师出品来源：datacamp编译：Mika根据《韦氏词典》，数据指的是用作推理、讨论或计算基础的事实信息。基于这个定义，我们可以进一步得出：数据可以理解为是收集到的任何信息，可以使用、进一步处理和分析以获得见解。而且通常与计算机联系在一起，因为数据通常是在计算机中生成和存储的，然而数据存在的时间比我们想象的要长得多。数据的历史人类存储和分析数据的最早例子可以追溯到公元前18000年，
再见英文版，Python 速查表中文版来了 Python数据挖掘 python python 数据分析数据挖掘 SQL 大数据
近几年以来，Python的应用场景越来越多，几乎可以应用于自然科学、工程技术、金融、通信和商业等各种领域。究其原因在于Python的简单易学、功能强大。想系统地学点东西，发现很多不错的技术文档都是英文资料，发现英文竟然成为了学习的拦路虎。非常幸运的是，DataCamp推出的Python数据科学速查表，已经翻译成中文啦！高清资料已打包。喜欢点赞支持、欢迎收藏学习。Python基础系列推出的内容包括：
终于盼到了，Python 数据科学速查表中文版来了 Python数据开发学习笔记 python
近几年以来，Python的应用场景越来越多，几乎可以应用于自然科学、工程技术、金融、通信和商业等各种领域。究其原因在于Python的简单易学、功能强大。想系统地学点东西，发现很多不错的技术文档都是英文资料，发现英文竟然成为了学习的拦路虎。非常幸运的是，DataCamp推出的Python数据科学速查表，已经翻译成中文啦！高清资料已打包。喜欢点赞支持、欢迎收藏学习。领取方式：资料已打包，获取方法有两种
超级干货：手把手教你学习R语言（附资源链接）数据分析v
作者：NSS；翻译：杨金鸿；校对：韩海畴，林亦霖；本文约3000字，建议阅读7分钟。本文为带大家了解R语言以及分段式的步骤教程！人们学习R语言时普遍存在缺乏系统学习方法的问题。学习者不知道从哪开始，如何进行，选择什么学习资源。虽然网络上有许多不错的免费学习资源，然而它们多过了头，反而会让人挑花了眼。为了构建R语言学习方法，我们在Vidhya和DataCamp中选一组综合资源，帮您从头学习R语言。这
整理了245道Python面试真题！、烟雨楼 phtyon 语言面试 python 面试开发语言 java 压力测试
小编搜罗了网上的各种面试题，现在做成了PDF版本的《Python面试大全》，更加方便阅读。面试大全中涵盖了Python基础、Python高级部分、Python语言特性、操作系统、数据库、网络、数据结构、编程题等。不管你是现在准备找Python工作的，还是将来准备从事Python相关工作的，这份面试宝典都不可错过！书籍部分内容截图如下：另外，推荐一下DataCamp推出的Python数据科学速查表（
Js函数返回值 _wy_ js return
一、返回控制与函数结果，语法为：return 表达式;作用: 结束函数执行，返回调用函数，而且把表达式的值作为函数的结果二、返回控制语法为：return;作用: 结束函数执行，返回调用函数，而且把undefined作为函数的结果在大多数情况下,为事件处理函数返回false,可以防止默认的事件行为.例如,默认情况下点击一个<a>元素,页面会跳转到该元素href属性
MySQL 的 char 与 varchar bylijinnan mysql
今天发现，create table 时，MySQL 4.1有时会把 char 自动转换成 varchar 测试举例： CREATE TABLE `varcharLessThan4` ( `lastName` varchar(3) ) ; mysql> desc varcharLessThan4; +----------+---------+------+-
Quartz——TriggerListener和JobListener eksliang TriggerListener JobListener quartz
转载请出自出处：http://eksliang.iteye.com/blog/2208624 一.概述 listener是一个监听器对象，用于监听scheduler中发生的事件，然后执行相应的操作；你可能已经猜到了，TriggerListeners接受与trigger相关的事件，JobListeners接受与jobs相关的事件。二.JobListener监听器 j
oracle层次查询 18289753290 oracle；层次查询；树查询
.oracle层次查询(connect by) oracle的emp表中包含了一列mgr指出谁是雇员的经理，由于经理也是雇员，所以经理的信息也存储在emp表中。这样emp表就是一个自引用表，表中的mgr列是一个自引用列，它指向emp表中的empno列，mgr表示一个员工的管理者， select empno,mgr,ename,sal from e
通过反射把map中的属性赋值到实体类bean对象中酷的飞上天空 javaee 泛型类型转换
使用过struts2后感觉最方便的就是这个框架能自动把表单的参数赋值到action里面的对象中但现在主要使用Spring框架的MVC，虽然也有@ModelAttribute可以使用但是明显感觉不方便。好吧，那就自己再造一个轮子吧。原理都知道，就是利用反射进行字段的赋值，下面贴代码主要类如下： import java.lang.reflect.Field; imp
SAP HANA数据存储：传统硬盘的瓶颈问题蓝儿唯美 HANA
SAPHANA平台有各种各样的应用场景，这也意味着客户的实施方法有许多种选择，关键是如何挑选最适合他们需求的实施方案。在《Implementing SAP HANA》这本书中，介绍了SAP平台在现实场景中的运作原理，并给出了实施建议和成功案例供参考。本系列文章节选自《Implementing SAP HANA》，介绍了行存储和列存储的各自特点，以及SAP HANA的数据存储方式如何提升空间压
Java Socket 多线程实现文件传输随便小屋 java socket
高级操作系统作业，让用Socket实现文件传输，有些代码也是在网上找的，写的不好，如果大家能用就用上。客户端类： package edu.logic.client; import java.io.BufferedInputStream; import java.io.Buffered
java初学者路径 aijuans java
学习Java有没有什么捷径?要想学好Java，首先要知道Java的大致分类。自从Sun推出Java以来，就力图使之无所不包，所以Java发展到现在，按应用来分主要分为三大块：J2SE,J2ME和J2EE,这也就是Sun ONE(Open Net Environment)体系。J2SE就是Java2的标准版，主要用于桌面应用软件的编程；J2ME主要应用于嵌入是系统开发，如手机和PDA的编程；J2EE
APP推广 aoyouzi APP 推广
一，免费篇 1，APP推荐类网站自主推荐最美应用、酷安网、DEMO8、木蚂蚁发现频道等,如果产品独特新颖，还能获取最美应用的评测推荐。PS：推荐简单。只要产品有趣好玩，用户会自主分享传播。例如足迹APP在最美应用推荐一次，几天用户暴增将服务器击垮。 2，各大应用商店首发合作老实盯着排期，多给应用市场官方负责人献殷勤。 3，论坛贴吧推广百度知道，百度贴吧，猫扑论坛，天涯社区，豆瓣（
JSP转发与重定向百合不是茶 jsp servlet Java Web jsp转发
在servlet和jsp中我们经常需要请求,这时就需要用到转发和重定向; 转发包括;forward和include 例子;forwrad转发; 将请求装法给reg.html页面关键代码; req.getRequestDispatcher("reg.html
web.xml之jsp-config bijian1013 java web.xml servlet jsp-config
1.作用：主要用于设定JSP页面的相关配置。 2.常见定义： <jsp-config> <taglib> <taglib-uri>URI(定义TLD文件的URI,JSP页面的tablib命令可以经由此URI获取到TLD文件)</tablib-uri> <taglib-location> TLD文件所在的位置
JSF2.2 ViewScoped Using CDI sunjing CDI JSF 2.2 ViewScoped
JSF 2.0 introduced annotation @ViewScoped; A bean annotated with this scope maintained its state as long as the user stays on the same view(reloads or navigation - no intervening views). One problem w
【分布式数据一致性二】Zookeeper数据读写一致性 bit1129 zookeeper
很多文档说Zookeeper是强一致性保证，事实不然。关于一致性模型请参考http://bit1129.iteye.com/blog/2155336 Zookeeper的数据同步协议 Zookeeper采用称为Quorum Based Protocol的数据同步协议。假如Zookeeper集群有N台Zookeeper服务器(N通常取奇数，3台能够满足数据可靠性同时
Java开发笔记白糖_ java开发
1、Map<key,value>的remove方法只能识别相同类型的key值 Map<Integer,String> map = new HashMap<Integer,String>(); map.put(1,"a"); map.put(2,"b"); map.put(3,"c"
图片黑色阴影 bozch 图片
.event{ padding:0; width:460px; min-width: 460px; border:0px solid #e4e4e4; height: 350px; min-heig
编程之美-饮料供货-动态规划 bylijinnan 动态规划
import java.util.Arrays; import java.util.Random; public class BeverageSupply { /** * 编程之美饮料供货 * 设Opt（V’，i）表示从i到n-1种饮料中，总容量为V’的方案中，满意度之和的最大值。 * 那么递归式就应该是：Opt（V’，i）=max{ k * Hi+Op
ajax大参数（大数据）提交性能分析 chenbowen00 Web Ajax 框架浏览器 prototype
近期在项目中发现如下一个问题项目中有个提交现场事件的功能，该功能主要是在web客户端保存现场数据（主要有截屏，终端日志等信息）然后提交到服务器上方便我们分析定位问题。客户在使用该功能的过程中反应点击提交后反应很慢，大概要等10到20秒的时间浏览器才能操作，期间页面不响应事件。根据客户描述分析了下的代码流程，很简单，主要通过OCX控件截屏，在将前端的日志等文件使用OCX控件打包，在将之转换为
[宇宙与天文]在太空采矿,在太空建造 comsci
我们在太空进行工业活动...但是不太可能把太空工业产品又运回到地面上进行加工,而一般是在哪里开采,就在哪里加工,太空的微重力环境,可能会使我们的工业产品的制造尺度非常巨大.... 地球上制造的最大工业机器是超级油轮和航空母舰,再大些就会遇到困难了,但是在空间船坞中,制造的最大工业机器,可能就没
ORACLE中CONSTRAINT的四对属性 daizj oracle CONSTRAINT
ORACLE中CONSTRAINT的四对属性 summary:在data migrate时,某些表的约束总是困扰着我们,让我们的migratet举步维艰,如何利用约束本身的属性来处理这些问题呢?本文详细介绍了约束的四对属性: Deferrable/not deferrable, Deferred/immediate, enalbe/disable, validate/novalidate,以及如
Gradle入门教程 dengkane gradle
一、寻找gradle的历程一开始的时候，我们只有一个工程，所有要用到的jar包都放到工程目录下面，时间长了，工程越来越大，使用到的jar包也越来越多，难以理解jar之间的依赖关系。再后来我们把旧的工程拆分到不同的工程里，靠ide来管理工程之间的依赖关系，各工程下的jar包依赖是杂乱的。一段时间后，我们发现用ide来管理项程很不方便，比如不方便脱离ide自动构建，于是我们写自己的ant脚本。再后
C语言简单循环示例 dcj3sjt126com c
# include <stdio.h> int main(void) { int i; int count = 0; int sum = 0; float avg; for (i=1; i<=100; i++) { if (i%2==0) { count++; sum += i; } } avg
presentModalViewController 的动画效果 dcj3sjt126com controller
系统自带(四种效果)： presentModalViewController模态的动画效果设置： [cpp] view plain copy UIViewController *detailViewController = [[UIViewController al
java 二分查找 shuizhaosi888 二分查找 java二分查找
需求：在排好顺序的一串数字中，找到数字T 一般解法：从左到右扫描数据，其运行花费线性时间O(N)。然而这个算法并没有用到该表已经排序的事实。 /** * * @param array * 顺序数组 * @param t * 要查找对象 * @return */ public stati
Spring Security（07）——缓存UserDetails 234390216 ehcache 缓存 Spring Security
Spring Security提供了一个实现了可以缓存UserDetails的UserDetailsService实现类，CachingUserDetailsService。该类的构造接收一个用于真正加载UserDetails的UserDetailsService实现类。当需要加载UserDetails时，其首先会从缓存中获取，如果缓存中没
Dozer 深层次复制 jayluns VO maven po
最近在做项目上遇到了一些小问题，因为架构在做设计的时候web前段展示用到了vo层，而在后台进行与数据库层操作的时候用到的是Po层。这样在业务层返回vo到控制层，每一次都需要从po-->转化到vo层，用到BeanUtils.copyProperties(source, target)只能复制简单的属性，因为实体类都配置了hibernate那些关联关系，所以它满足不了现在的需求，但后发现还有个很
CSS规范整理（摘自懒人图库） a409435341 html UI css 浏览器
刚没事闲着在网上瞎逛，找了一篇CSS规范整理，粗略看了一下后还蛮有一定的道理，并自问是否有这样的规范，这也是初入前端开发的人一个很好的规范吧。一、文件规范 1、文件均归档至约定的目录中。具体要求通过豆瓣的CSS规范进行讲解：所有的CSS分为两大类：通用类和业务类。通用的CSS文件，放在如下目录中：基本样式库 /css/core
C++动态链接库创建与使用你不认识的休道人 C++dll
一、创建动态链接库 1.新建工程test中选择”MFC [dll]”dll类型选择第二项"Regular DLL With MFC shared linked"，完成 2.在test.h中添加 extern “C” 返回类型 _declspec(dllexport)函数名(参数列表); 3.在test.cpp中最后写 extern “C” 返回类型 _decls
Android代码混淆之ProGuard rensanning ProGuard
Android应用的Java代码，通过反编译apk文件（dex2jar、apktool）很容易得到源代码，所以在release版本的apk中一定要混淆一下一些关键的Java源码。 ProGuard是一个开源的Java代码混淆器（obfuscation）。ADT r8开始它被默认集成到了Android SDK中。官网： http://proguard.sourceforge.net/
程序员在编程中遇到的奇葩弱智问题 tomcat_oracle jquery 编程 ide
　　现在收集一下：　　排名不分先后，按照发言顺序来的。 1、Jquery插件一个通用函数一直报错，尤其是很明显是存在的函数，很有可能就是你没有引入jquery。。。或者版本不对 2、调试半天没变化：不在同一个文件中调试。这个很可怕，我们很多时候会备份好几个项目，改完发现改错了。有个群友说的好：在汤匙
解决maven-dependency-plugin (goals "copy-dependencies","unpack") is not supported xp9802 dependency
解决办法：在plugins之前添加如下pluginManagement，二者前后顺序如下： [html] view plain copy <build> <pluginManagement