More raw data documentation and Jupyter Notebooks
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (3)
Exercise
In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it’s time to turn to one of scikit-learn’s most useful objects: the Pipeline.
For the next few exercises, you’ll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.
The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.
In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.
Instruction
- Import Pipeline from sklearn.pipeline.
- Create training and test sets using the numeric data only, by selecting sample_df[['numeric']] in train_test_split().
- Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
import numpy as np
import pandas as pd
rng = np.random.RandomState(123)
SIZE = 1000
sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}
sample_df = pd.DataFrame(sample_data)
sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan
foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1
val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)
sample_df['label'] = np.where(val > np.median(val), 'a', 'b')
print(sample_df.head())
     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a
# Import Pipeline
from sklearn.pipeline import Pipeline
# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit the pipeline to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
Accuracy on sample data - numeric, no nans: 0.62
Exercise
What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you’ll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data.
By default, the imputer transformer replaces NaNs with the mean value of the column. That’s a good enough imputation strategy for the sample data, so you won’t need to pass anything extra to the imputer.
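As a quick illustration of that default strategy (a minimal sketch, assuming an older scikit-learn where Imputer still lives in sklearn.preprocessing):

# Minimal mean-imputation demo (an aside, not the exercise solution)
import numpy as np
from sklearn.preprocessing import Imputer
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
print(Imputer().fit_transform(X))  # the NaN becomes (1.0 + 5.0) / 2 = 3.0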
After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.
The sample_df is in the workspace, in case you’d like to take another look. Make sure to select both numeric columns; in the previous exercise we couldn’t use with_missing because we had no preprocessing step!
Instruction
- Import Imputer from sklearn.preprocessing.
- Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
- Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
- Use the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.
# Import the Imputer object
from sklearn.preprocessing import Imputer
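# (Environment note, an assumption about your setup: in scikit-learn >= 0.22
# this import fails because Imputer was removed; the modern replacement is
# `from sklearn.impute import SimpleImputer`.)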
# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('imp', Imputer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit the pipeline to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)
Accuracy on sample data - all numeric, incl nans: 0.636
Exercise
Here, you’ll perform a similar preprocessing pipeline step, only this time you’ll use the text column from the sample data.
To preprocess the text, you’ll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (name, transform) tuple to the steps list in your pipeline.
Make sure you select only the text column for splitting your training and test sets.
As usual, your sample_df is ready and waiting in the workspace.
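If the bag-of-words idea feels rusty, here is a minimal sketch (an aside, not the exercise solution) using the same tokens that appear in sample_df:

# Toy bag-of-words: CountVectorizer learns a vocabulary, then counts tokens
from sklearn.feature_extraction.text import CountVectorizer
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(['foo', 'bar', 'foo bar'])
print(toy_vec.get_feature_names())  # ['bar', 'foo']
print(toy_counts.toarray())         # rows are documents, columns are token counts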
Instruction
- Import CountVectorizer from sklearn.feature_extraction.text.
- Create training and test sets by selecting the correct subset of sample_df: 'text'.
- Add the CountVectorizer step (with the name 'vec') to the correct position in the pipeline.
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Split out only the text data
X_train, X_test, y_train, y_test = train_test_split(sample_df['text'],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)
# Instantiate Pipeline object: pl
pl = Pipeline([
    ('vec', CountVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)
Accuracy on sample data - just text data: 0.808
Exercise
Now that you can separate text and numeric data in your pipeline, you’re ready to perform separate steps on each by nesting pipelines and using FeatureUnion().
These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. Here, for example, you don’t want to impute your text data, and you don’t want to create a bag-of-words with your numeric data. Instead, you want to deal with these separately and then join the results together using FeatureUnion().
In the end, you’ll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using FeatureUnion().
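Before diving in, here is a minimal sketch of what FeatureUnion() does (the identity and double transformers are purely illustrative): it applies each transformer to the same input and stacks the outputs side by side, column-wise.

# Toy FeatureUnion: two transformers, outputs concatenated column-wise
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
identity = FunctionTransformer(lambda x: x, validate=False)
double = FunctionTransformer(lambda x: x * 2, validate=False)
union = FeatureUnion([('identity', identity), ('double', double)])
print(union.fit_transform(np.array([[1.0], [2.0]])))  # [[1. 2.], [2. 4.]]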
Instruction
- In process_and_join_features:
  - Add the steps ('selector', get_numeric_data) and ('imputer', Imputer()) to the 'numeric_features' preprocessing step.
  - Add the equivalent steps to the text_features preprocessing step. That is, use get_text_data and a CountVectorizer step with the name 'vectorizer'.
- Add the transform step process_and_join_features, with the name 'union', to the main pipeline, pl.
import numpy as np
import pandas as pd
rng = np.random.RandomState(123)
SIZE = 1000
sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}
sample_df = pd.DataFrame(sample_data)
sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan
foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1
val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)
sample_df['label'] = np.where(val > np.median(val), 'a', 'b')
## Import the pipeline elements from previous exercise
# Import splitting and pipeline objects from sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Import elements for simple pipeline from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Import the Imputer object
from sklearn.preprocessing import Imputer
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Import functional utilities
from sklearn.preprocessing import FunctionTransformer
# Simple selector transforms to be used in FeatureUnion
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)
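# Quick check (an aside, not part of the exercise): each selector is itself
# a transformer, so .fit_transform() hands back just the selected columns
print(get_text_data.fit_transform(sample_df).head())
print(get_numeric_data.fit_transform(sample_df).head())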
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion
# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)
## Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
    transformer_list=[
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer())
        ]))
    ]
)
# Instantiate nested pipeline: pl
pl = Pipeline([
    ('union', process_and_join_features),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])
# Fit pl to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)
Accuracy on sample data - all data: 0.928
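If you’re curious what the union actually produced, one way to peek (a hedged aside, assuming pl has been fit as above) is to run the fitted 'union' step on its own:

# Inspect the joined feature matrix: rows x (2 numeric columns + vocabulary)
union_output = pl.named_steps['union'].transform(X_train)
print(union_output.shape)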
Exercise
In this exercise you’re going to use FunctionTransformer on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.
Recall from Chapter 2 that you used a custom function, combine_text_columns, to select and properly format text data for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!
Concerning the numeric data, you can use NUMERIC_COLUMNS, preloaded as usual, to help design a subset-selecting lambda function.
You’re all finished with sample data. The original df is back in the workspace, ready to use.
Instruction
- Complete the call to multilabel_train_test_split() by selecting df[NON_LABELS].
- Compute get_text_data by using FunctionTransformer() and passing in combine_text_columns. Be sure to also specify validate=False.
- Use FunctionTransformer() to compute get_numeric_data. In the lambda function, select out the NUMERIC_COLUMNS of x. Like you did when computing get_text_data, also specify validate=False.
# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer
#### DEFINE SAMPLING UTILITIES ####
# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')
    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')
    if size <= 1:
        size = np.floor(y.shape[0] * size)
    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count
    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))
    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])
    sample_idxs = np.array([], dtype=choices.dtype)
    # First, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])
    sample_idxs = np.unique(sample_idxs)
    # Now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])
    # Get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)
    return np.concatenate([sample_idxs, remaining_sampled])

# Now define multilabel_train_test_split to be used below
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])
    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)
    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask
    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])
####
# Load data
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Labels
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type',
          'Pre_K',
          'Operating_Status']
NUMERIC_COLUMNS = ['FTE', 'Total']
# Convert object to category for LABELS
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'))
# Define combine_text_columns() for use in sklearn.preprocessing.FunctionTransformer
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    # Replace nans with blanks
    text_data.fillna("", inplace=True)
    # Join all of the text items in a row (axis=1) with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
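# Illustrative check (an aside, assuming df is loaded as above):
# combine_text_columns should return one space-joined string per row
print(combine_text_columns(df).head())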
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])
# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]
# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
Exercise
You’re about to take everything you’ve learned so far and implement it in a Pipeline that works with the real, DrivenData budget line item data you’ve been exploring.
Surprise! The structure of the pipeline is exactly the same as earlier in this chapter:
- the preprocessing step uses FeatureUnion to join the results of nested pipelines that each rely on FunctionTransformer to select multiple datatypes
You can then call familiar methods like .fit() and .score() on the Pipeline object pl.
Instruction
- Complete the 'numeric_features' transform with the following steps:
  - get_numeric_data, with the name 'selector'.
  - Imputer(), with the name 'imputer'.
- Complete the 'text_features' transform with the following steps:
  - get_text_data, with the name 'selector'.
  - CountVectorizer(), with the name 'vectorizer'.
# Complete the pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.20384615384615384
Exercise
Now you’re cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.
Until now, you’ve been using the model step ('clf', OneVsRestClassifier(LogisticRegression())) in your pipeline.
But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you’ll see in this exercise.
In particular, you’ll swap out the logistic-regression model and replace it with a random forest classifier, which uses the statistics of an ensemble of decision trees to generate predictions.
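As an aside (a hedged sketch, not the exercise solution): because pipeline steps are named, you could also swap the model on an existing pipeline with .set_params() instead of rebuilding the whole steps list:

# Swap the final estimator by step name (assumes pl from the last exercise)
from sklearn.ensemble import RandomForestClassifier
pl.set_params(clf=RandomForestClassifier())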
Instruction
- Import RandomForestClassifier from sklearn.ensemble.
- Add a RandomForestClassifier() step named 'clf' to the pipeline.
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Edit model step in pipeline
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier())
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.28076923076923077
Exercise
You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!
Can you make it better? Try changing the parameter n_estimators of RandomForestClassifier(), whose default value is 10, to 15.
Instruction
- Import RandomForestClassifier from sklearn.ensemble.
- Add a RandomForestClassifier() step with n_estimators=15, named 'clf', to the pipeline.
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Add model step to pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier(n_estimators=15))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset: 0.3230769230769231
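If you want to keep pushing, here is a small hedged sketch (not part of the course exercises): pipelines expose nested parameters via the step__param syntax, so you can sweep n_estimators in a loop, or hand clf__n_estimators to GridSearchCV.

# Sketch: compare a few ensemble sizes via the nested clf__n_estimators param
for n in [10, 15, 50, 100]:
    pl.set_params(clf__n_estimators=n)
    pl.fit(X_train, y_train)
    print(n, pl.score(X_test, y_test))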