In data science, feature engineering and modeling are among the most tedious steps we need to perform, largely because they are so repetitive. Luckily, creating a pipeline can help us solve this problem.
What is a pipeline?
In simple terms, a pipeline is a tool in the sklearn package that lets us compose multiple estimators. For example, when we want to analyse a dataset, we may need to choose the features (columns) that we want, standardize them, and apply a model. All of these steps can be encapsulated in a Pipeline.
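As a minimal sketch of the idea, the snippet below chains a scaler and a classifier on a tiny synthetic array. The data and the names `X_demo`, `y_demo`, and `demo_pipe` are made up for illustration and are not part of the food example used in the rest of this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 2 numeric features, 2 classes
X_demo = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
                   [8.0, 900.0], [9.0, 950.0], [10.0, 870.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) pair; every step except the last
# must be a transformer, and the last one is usually the model
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits and applies each transformer in order, then fits the final model
demo_pipe.fit(X_demo, y_demo)

# predict() pushes the data through the fitted transformers before predicting
predictions = demo_pipe.predict(X_demo)
print(predictions)
```

The point is that scaling and modeling are now a single object with a single fit and a single predict, so the preprocessing can never be forgotten or applied in the wrong order.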
Suppose we have a dataset like the one below, and we want to predict which food a person wants to eat based on their gender, age, height, weight and an unknown feature "empty_col".
# Importing library
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import random
import warnings
warnings.filterwarnings("ignore")
gender = ["male","female"]
food = ["Viet", "Thai", "Chinese", "Western"]
# Seed NumPy's global generator and the random module so reruns give the same data
np.random.seed(42)
random.seed(42)
data1 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(None,20),
                      "food" : np.random.choice(food,20)})
data1.head()
|   | gender | age | height | weight | empty_col | food |
|---|---|---|---|---|---|---|
| 0 | male | 58 | 167 | 213.127056 | None | Western |
| 1 | male | 55 | 151 | 161.579091 | None | Chinese |
| 2 | female | 40 | 164 | 227.105995 | None | Chinese |
| 3 | female | 54 | 155 | 223.072513 | None | Western |
| 4 | female | 33 | 158 | 248.173858 | None | Viet |
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    0 non-null object
food         20 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 1.0+ KB
X_train = data1.drop("food", axis = 1)
y_train = data1["food"]
Let's say that, for some reason, the person analyzing this dataset only wants to use the features age, height, weight, and empty_col to predict the food.
X_train.drop("gender", axis = 1)
|   | age | height | weight | empty_col |
|---|---|---|---|---|
| 0 | 58 | 167 | 213.127056 | None |
| 1 | 55 | 151 | 161.579091 | None |
| 2 | 40 | 164 | 227.105995 | None |
| 3 | 54 | 155 | 223.072513 | None |
| 4 | 33 | 158 | 248.173858 | None |
| 5 | 56 | 150 | 236.798867 | None |
| 6 | 32 | 153 | 170.120662 | None |
| 7 | 42 | 163 | 196.733247 | None |
| 8 | 36 | 159 | 231.915150 | None |
| 9 | 22 | 162 | 190.901033 | None |
| 10 | 45 | 154 | 192.580001 | None |
| 11 | 21 | 157 | 215.189645 | None |
| 12 | 43 | 165 | 225.999820 | None |
| 13 | 24 | 152 | 160.628936 | None |
| 14 | 51 | 166 | 195.947550 | None |
| 15 | 48 | 161 | 177.813970 | None |
| 16 | 34 | 169 | 231.051503 | None |
| 17 | 49 | 160 | 235.308763 | None |
| 18 | 25 | 168 | 196.450121 | None |
| 19 | 37 | 156 | 231.366207 | None |
The method shown above is one way to do it. However, if we need to repeat it for many datasets, it quickly becomes extremely tedious.
In addition, as our datasets get more complicated, repeating these manual steps makes our code hard to read. We are going to create a class that automates the process and can be put in a pipeline.
Custom Transformers
scikit-learn already ships with many transformers, such as StandardScaler and LabelBinarizer. However, during feature engineering we usually need to write our own. Here is the general template for a custom transformer:
class MyTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # transform code
        return X
- TransformerMixin gives us access to the extremely popular and useful .fit_transform method.
- BaseEstimator gives us access to .get_params and .set_params, which will be very useful when we tune a pipeline with a grid search (GridSearchCV) in the future.
- __init__ is there so that we can initialize the transformer by just typing MyTransformer().
- fit returns self. Even though we are not going to transform y, we still need to include it as a parameter and default it to None. This is because a pipeline may end with a model whose fit requires a y, and the pipeline passes y to the fit of every step.
- transform: this is where all the magic happens, and our transformed X is produced here.
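To make the roles of the two base classes concrete, here is a small sketch built on the template above. `MultiplyBy`, its `factor` parameter, and the `toy` DataFrame are hypothetical names invented for this illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MultiplyBy(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that multiplies every value by a constant."""

    def __init__(self, factor=1):
        # Store each constructor argument under the same attribute name,
        # which is how get_params and set_params find it
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn from the data in this example
        return self

    def transform(self, X):
        return X * self.factor


toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
doubler = MultiplyBy(factor=2)

# fit_transform comes from TransformerMixin: it calls fit, then transform
doubled = doubler.fit_transform(toy)

# get_params and set_params come from BaseEstimator
params = doubler.get_params()
doubler.set_params(factor=3)
tripled = doubler.transform(toy)
```

Note that we never wrote fit_transform, get_params, or set_params ourselves; inheriting from the two base classes is enough.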
We are going to create our own transformer to choose the columns that we want.
# The code is based on https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            # Report exactly which requested columns are missing
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
cs = ColumnSelector(columns=["age","height","weight","empty_col"])
cs.fit_transform(X_train).head()
|   | age | height | weight | empty_col |
|---|---|---|---|---|
| 0 | 58 | 167 | 213.127056 | None |
| 1 | 55 | 151 | 161.579091 | None |
| 2 | 40 | 164 | 227.105995 | None |
| 3 | 54 | 155 | 223.072513 | None |
| 4 | 33 | 158 | 248.173858 | None |
Let's create another transformer that removes columns in which more than a given fraction of the values (here 50%) are missing.
class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        return X.drop(too_many_na_col, axis = 1)
na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()
|   | gender | age | height | weight |
|---|---|---|---|---|
| 0 | male | 58 | 167 | 213.127056 |
| 1 | male | 55 | 151 | 161.579091 |
| 2 | female | 40 | 164 | 227.105995 |
| 3 | female | 54 | 155 | 223.072513 |
| 4 | female | 33 | 158 | 248.173858 |
As we can see, the empty_col column has been removed since more than 50% of its values are missing. However, this version decides which columns to drop inside transform, based on whatever data it is given, so it cannot guarantee that the test data gets the same transformation as the training data. Let's take a look at an example.
data2 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(1,20),
                      "food" : np.random.choice(food,20)})
X_test = data2.drop("food", axis = 1)
y_test = data2["food"]
X_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    20 non-null int32
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 800.0+ bytes
This dataset clearly has no column with more than 50% missing values. However, we need our training and testing datasets to have the same columns. If we use NotManyNaColumns as written above, we will not get that.
na_col = NotManyNaColumns(0.5)
na_col.transform(X_test).head()
|   | gender | age | height | weight | empty_col |
|---|---|---|---|---|---|
| 0 | female | 25 | 166 | 236.519943 | 1 |
| 1 | female | 45 | 150 | 173.783799 | 1 |
| 2 | male | 35 | 167 | 198.093367 | 1 |
| 3 | male | 47 | 163 | 240.642942 | 1 |
| 4 | female | 42 | 151 | 173.177008 | 1 |
Let's rewrite the transformer to get what we want.
class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Decide which columns to drop once, using the data passed to fit
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        self.too_many_na_col = too_many_na_col
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # Drop the columns remembered from fit, whatever X contains
        return X.drop(self.too_many_na_col, axis = 1)
What we did here is make sure that whatever we did to the training set is also performed on the testing set: fit records which columns to drop, and transform simply drops them. We are going to fit and transform on the training dataset, but only transform on the testing set.
na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()
|   | gender | age | height | weight |
|---|---|---|---|---|
| 0 | male | 58 | 167 | 213.127056 |
| 1 | male | 55 | 151 | 161.579091 |
| 2 | female | 40 | 164 | 227.105995 |
| 3 | female | 54 | 155 | 223.072513 |
| 4 | female | 33 | 158 | 248.173858 |
na_col.transform(X_test).head()
|   | gender | age | height | weight |
|---|---|---|---|---|
| 0 | female | 25 | 166 | 236.519943 |
| 1 | female | 45 | 150 | 173.783799 |
| 2 | male | 35 | 167 | 198.093367 |
| 3 | male | 47 | 163 | 240.642942 |
| 4 | female | 42 | 151 | 173.177008 |
As we can see, our training data and testing data now have the same columns, even though the empty_col column in the testing data does not have more than 50% missing values.
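scikit-learn's own transformers follow the same fit-then-transform pattern and, by convention, store anything learned during fit in attributes whose names end with an underscore. Below is a hedged sketch of how our transformer could adopt that convention. The class name `DropSparseColumns`, the attribute `columns_to_drop_`, and the toy DataFrames are assumptions made for this illustration; `check_is_fitted` is scikit-learn's helper that raises an error when transform is called before fit.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class DropSparseColumns(TransformerMixin, BaseEstimator):
    """Hypothetical variant of NotManyNaColumns with a fitted-state check."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Learn which columns exceed the missing-value threshold on the training data
        na_fraction = X.isnull().mean()
        self.columns_to_drop_ = list(na_fraction[na_fraction > self.threshold].index)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not been called yet
        check_is_fitted(self, "columns_to_drop_")
        return X.drop(columns=self.columns_to_drop_)


# Toy data mirroring the situation above: empty in training, filled in testing
train = pd.DataFrame({"age": [30, 40, 50], "empty_col": [None, None, None]})
test = pd.DataFrame({"age": [25, 35, 45], "empty_col": [1, 1, 1]})

dropper = DropSparseColumns(threshold=0.5)

# Calling transform before fit is caught explicitly
try:
    dropper.transform(test)
    raised = False
except NotFittedError:
    raised = True

train_out = dropper.fit(train).transform(train)
test_out = dropper.transform(test)
```

The behaviour is the same as before; the only gain is a clearer error message if someone forgets to fit the transformer first.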
Pipeline
Finally, we are going to learn about Pipeline! Normally, we would analyse our data step by step, like this:
# Select Column
cs = ColumnSelector(columns=["age","height","weight","empty_col"])
X_train_transformed = cs.fit_transform(X_train)
# Drop columns with too many missing values
na_col = NotManyNaColumns(0.5)
X_train_transformed = na_col.fit_transform(X_train_transformed)
# Standard Scaler
sc = StandardScaler()
X_train_transformed = sc.fit_transform(X_train_transformed)
# Apply Logistic Regression to predict the food
lg = LogisticRegression(solver = "lbfgs")
lg.fit(X_train_transformed, y_train)
lg.predict(X_train_transformed)
array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
dtype=object)
# Check the score
lg.score(X_train_transformed, y_train)
0.45
As shown above, there are many steps that we need to perform by hand. We can make things shorter by using make_pipeline.
pipe = make_pipeline(ColumnSelector(columns=["age","height","weight","empty_col"]),
                     NotManyNaColumns(0.5),
                     StandardScaler(),
                     LogisticRegression(solver = "lbfgs"))
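Note: order is important. The steps run in the order they are listed, each one receiving the output of the previous step, and the model must come last.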
pipe.fit(X_train,y_train)
pipe.predict(X_train)
array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
dtype=object)
pipe.score(X_train, y_train)
0.45
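Two things follow naturally from here: a fitted pipeline can be applied to new data with a single call, and its steps can be tuned together. The sketch below illustrates both on synthetic data from `make_classification` rather than on the food dataset; the grid values and variable names are arbitrary assumptions. It relies on `make_pipeline` naming each step after its lowercased class name, which is what makes the `logisticregression__C` key work in `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric classification data would do)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tuned_pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))

# make_pipeline names the steps automatically from the class names
step_names = list(tuned_pipe.named_steps)

# Parameters of a step are addressed as <step name>__<parameter name>
search = GridSearchCV(
    tuned_pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X_tr, y_tr)

# score() reuses the scaler fitted on the training data;
# the test set is only transformed, never fitted on
test_accuracy = search.score(X_te, y_te)
best_C = search.best_params_["logisticregression__C"]
```

Because the scaler lives inside the pipeline, it is refitted on each cross-validation training fold, so the held-out fold never influences the preprocessing.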