More raw data documentation and Jupyter Notebooks
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (3)
Exercise
In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it’s time to turn to one of scikit-learn’s most useful objects: the Pipeline.
For the next few exercises, you’ll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.
The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.
In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.
Instruction
- Import Pipeline from sklearn.pipeline.
- Create training and test sets using the numeric data only, by selecting sample_df[['numeric']] in train_test_split().
- Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
import numpy as np
import pandas as pd
rng = np.random.RandomState(123)
SIZE = 1000
sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}
sample_df = pd.DataFrame(sample_data)
sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan
foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1
val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)
sample_df['label'] = np.where(val > np.median(val), 'a', 'b')
print(sample_df.head())
     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a
# Import Pipeline
from sklearn.pipeline import Pipeline
# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit the pipeline to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
Accuracy on sample data - numeric, no nans: 0.62
Exercise
What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you’ll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data.
By default, the imputer transformer replaces NaNs with the mean value of the column. That’s a good enough imputation strategy for the sample data, so you won’t need to pass anything extra to the imputer.
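As a quick illustration of that default strategy (a minimal sketch, assuming an older scikit-learn where Imputer still lives in sklearn.preprocessing):

# Minimal mean-imputation demo (an aside, not the exercise solution)
import numpy as np
from sklearn.preprocessing import Imputer
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
print(Imputer().fit_transform(X))  # the NaN becomes (1.0 + 5.0) / 2 = 3.0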
After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.
The sample_df is in the workspace, in case you’d like to take another look. Make sure to select both numeric columns; in the previous exercise we couldn’t use with_missing because we had no preprocessing step!
Instruction
- Import Imputer from sklearn.preprocessing.
- Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
- Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
- Use the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.
# Import the Imputer object
from sklearn.preprocessing import Imputer
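# (Environment note, an assumption about your setup: in scikit-learn >= 0.22
# this import fails because Imputer was removed; the modern replacement is
# `from sklearn.impute import SimpleImputer`.)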
# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('imp', Imputer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit the pipeline to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)
Accuracy on sample data - all numeric, incl nans: 0.636
Exercise
Here, you’ll perform a similar preprocessing pipeline step, only this time you’ll use the text column from the sample data.
To preprocess the text, you’ll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (name, transform) tuple to the steps list in your pipeline.
Make sure you select only the text column for splitting your training and test sets.
As usual, your sample_df is ready and waiting in the workspace.
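If the bag-of-words idea feels rusty, here is a minimal sketch (an aside, not the exercise solution) using the same tokens that appear in sample_df:

# Toy bag-of-words: CountVectorizer learns a vocabulary, then counts tokens
from sklearn.feature_extraction.text import CountVectorizer
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(['foo', 'bar', 'foo bar'])
print(toy_vec.get_feature_names())  # ['bar', 'foo']
print(toy_counts.toarray())         # rows are documents, columns are token counts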
Instruction
- Import CountVectorizer from sklearn.feature_extraction.text.
- Create training and test sets by selecting the correct subset of sample_df: 'text'.
- Add the CountVectorizer step (with the name 'vec') to the correct position in the pipeline.
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Split out only the text data
X_train, X_test, y_train, y_test = train_test_split(sample_df['text'],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('vec', CountVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)
Accuracy on sample data - just text data: 0.808
Exercise
Now that you can separate text and numeric data in your pipeline, you’re ready to perform separate steps on each by nesting pipelines and using FeatureUnion().
These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. Here, for example, you don’t want to impute your text data, and you don’t want to create a bag-of-words with your numeric data. Instead, you want to deal with these separately and then join the results together using FeatureUnion().
In the end, you’ll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using FeatureUnion().
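Before diving in, here is a minimal sketch of what FeatureUnion() does (the identity and double transformers are purely illustrative): it applies each transformer to the same input and stacks the outputs side by side, column-wise.

# Toy FeatureUnion: two transformers, outputs concatenated column-wise
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
identity = FunctionTransformer(lambda x: x, validate=False)
double = FunctionTransformer(lambda x: x * 2, validate=False)
union = FeatureUnion([('identity', identity), ('double', double)])
print(union.fit_transform(np.array([[1.0], [2.0]])))  # [[1. 2.], [2. 4.]]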
Instruction
- In process_and_join_features:
  - Add the steps ('selector', get_numeric_data) and ('imputer', Imputer()) to the 'numeric_features' preprocessing step.
  - Add the equivalent steps to the text_features preprocessing step. That is, use get_text_data and a CountVectorizer step with the name 'vectorizer'.
- Add the transform step process_and_join_features, with the name 'union', to the main pipeline, pl.
import numpy as np
import pandas as pd
rng = np.random.RandomState(123)
SIZE = 1000
sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}
sample_df = pd.DataFrame(sample_data)
sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan
foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1
val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)
sample_df['label'] = np.where(val > np.median(val), 'a', 'b')
## Import the pipeline elements from previous exercise
# Import splitting and pipeline objects from sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Import elements for simple pipeline from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Import the Imputer object
from sklearn.preprocessing import Imputer
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Import functional utilities
from sklearn.preprocessing import FunctionTransformer
# Simple selector transforms to be used in FeatureUnion
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)
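# Quick check (an aside, not part of the exercise): each selector is itself
# a transformer, so .fit_transform() hands back just the selected columns
print(get_text_data.fit_transform(sample_df).head())
print(get_numeric_data.fit_transform(sample_df).head())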
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion
# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)
## Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
    transformer_list=[
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer())
        ]))
    ]
)
# Instantiate nested pipeline: pl
pl = Pipeline([
    ('union', process_and_join_features),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit pl to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)
Accuracy on sample data - all data: 0.928
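If you’re curious what the union actually produced, one way to peek (a hedged aside, assuming pl has been fit as above) is to run the fitted 'union' step on its own:

# Inspect the joined feature matrix: rows x (2 numeric columns + vocabulary)
union_output = pl.named_steps['union'].transform(X_train)
print(union_output.shape)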
Exercise
In this exercise you’re going to use FunctionTransformer on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.
Recall from Chapter 2 that you used a custom function, combine_text_columns, to select and properly format text data for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!
Concerning the numeric data, you can use NUMERIC_COLUMNS, preloaded as usual, to help design a subset-selecting lambda function.
You’re all finished with sample data. The original df is back in the workspace, ready to use.
Instruction
- Complete the call to multilabel_train_test_split() by selecting df[NON_LABELS].
- Compute get_text_data by using FunctionTransformer() and passing in combine_text_columns. Be sure to also specify validate=False.
- Use FunctionTransformer() to compute get_numeric_data. In the lambda function, select out the NUMERIC_COLUMNS of x. Like you did when computing get_text_data, also specify validate=False.
# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer
#### DEFINE SAMPLING UTILITIES ####
# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')
    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')
    if size <= 1:
        size = np.floor(y.shape[0] * size)
    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count
    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))
    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])
    sample_idxs = np.array([], dtype=choices.dtype)
    # First, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])
    sample_idxs = np.unique(sample_idxs)
    # Now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])
    # Get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)
    return np.concatenate([sample_idxs, remaining_sampled])

# Now define multilabel_train_test_split to be used below
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])
    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)
    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask
    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])
####
# Load data
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Labels
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type',
          'Pre_K',
          'Operating_Status']
NUMERIC_COLUMNS = ['FTE', 'Total']
# Convert object to category for LABELS
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'))
# Define combine_text_columns() for use in sklearn.preprocessing.FunctionTransformer
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    # Replace nans with blanks
    text_data.fillna("", inplace=True)
    # Join all of the text items in a row (axis=1) with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
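# Illustrative check (an aside, assuming df is loaded as above):
# combine_text_columns should return one space-joined string per row
print(combine_text_columns(df).head())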
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])
# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]
# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
Exercise
You’re about to take everything you’ve learned so far and implement it in a Pipeline that works with the real, DrivenData budget line item data you’ve been exploring.
Surprise! The structure of the pipeline is exactly the same as earlier in this chapter:
- the preprocessing step uses FeatureUnion to join the results of nested pipelines that each rely on FunctionTransformer to select multiple datatypes
You can then call familiar methods like .fit() and .score() on the Pipeline object pl.
Instruction
- Complete the 'numeric_features' transform with the following steps:
  - get_numeric_data, with the name 'selector'.
  - Imputer(), with the name 'imputer'.
- Complete the 'text_features' transform with the following steps:
  - get_text_data, with the name 'selector'.
  - CountVectorizer(), with the name 'vectorizer'.
# Complete the pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.20384615384615384
Exercise
Now you’re cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.
Until now, you’ve been using the model step ('clf', OneVsRestClassifier(LogisticRegression())) in your pipeline.
But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you’ll see in this exercise.
In particular, you’ll swap out the logistic-regression model and replace it with a random forest classifier, which uses the statistics of an ensemble of decision trees to generate predictions.
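As an aside (a hedged sketch, not the exercise solution): because pipeline steps are named, you could also swap the model on an existing pipeline with .set_params() instead of rebuilding the whole steps list:

# Swap the final estimator by step name (assumes pl from the last exercise)
from sklearn.ensemble import RandomForestClassifier
pl.set_params(clf=RandomForestClassifier())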
Instruction
- Import RandomForestClassifier from sklearn.ensemble.
- Add a RandomForestClassifier() step named 'clf' to the pipeline.
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Edit model step in pipeline
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier())
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.28076923076923077
Exercise
You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!
Can you make it better? Try changing the parameter n_estimators of RandomForestClassifier(), whose default value is 10, to 15.
Instruction
- Import RandomForestClassifier from sklearn.ensemble.
- Add a RandomForestClassifier() step with n_estimators=15, named 'clf', to the pipeline.
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Add model step to pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier(n_estimators=15))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.3230769230769231
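If you want to keep pushing, here is a small hedged sketch (not part of the course exercises): pipelines expose nested parameters via the step__param syntax, so you can sweep n_estimators in a loop, or hand clf__n_estimators to GridSearchCV.

# Sketch: compare a few ensemble sizes via the nested clf__n_estimators param
for n in [10, 15, 50, 100]:
    pl.set_params(clf__n_estimators=n)
    pl.fit(X_train, y_train)
    print(n, pl.score(X_test, y_test))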