Custom Transformer & Pipeline

In data science, two of the most tedious steps we need to perform are feature engineering and modeling, because both are very repetitive. Luckily, creating a pipeline can help us solve this problem.

What is a pipeline?

In simple terms, a pipeline is a tool in the sklearn package that lets us compose multiple estimators into one. For example, when we want to analyse a dataset, we may need to choose the features (columns) that we want, standardize them, and apply a model. All of these steps can be encapsulated in a Pipeline.
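As a minimal sketch of the idea, the snippet below chains a scaler and a classifier into a single estimator. The toy data and the names X_toy, y_toy and toy_pipe are made up for illustration and are not part of the dataset used in the rest of this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration): two numeric features, binary label.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(40, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Chain a scaler and a classifier into one estimator.
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit() calls fit_transform on the scaler, then fit on the classifier.
toy_pipe.fit(X_toy, y_toy)

# predict() calls transform on the scaler, then predict on the classifier.
toy_pred = toy_pipe.predict(X_toy)
print(toy_pred[:5])
```

The pipeline behaves like any other estimator: one fit call and one predict call replace the separate calls on each step.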

Assume we have a dataset like the one below, and we want to predict which food a person wants to eat based on their gender, age, height, weight, and an unknown feature "empty_col".

# Importing library
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import random
import warnings
warnings.filterwarnings("ignore")
gender = ["male","female"]
food = ["Viet", "Thai", "Chinese", "Western"]
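# Seed both generators used below so the data is reproducible.
# np.random.RandomState(42) on its own only creates a generator that is never used.
np.random.seed(42)
random.seed(42)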
data1 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(None,20),
                      "food" : np.random.choice(food,20)})
data1.head()
   gender  age  height      weight  empty_col     food
0    male   58     167  213.127056       None  Western
1    male   55     151  161.579091       None  Chinese
2  female   40     164  227.105995       None  Chinese
3  female   54     155  223.072513       None  Western
4  female   33     158  248.173858       None     Viet
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    0 non-null object
food         20 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 1.0+ KB
X_train = data1.drop("food", axis = 1)
y_train = data1["food"]

Let's say that, for some reason, the person analyzing this dataset only wants to use the features age, height, weight, and empty_col to predict the food.

X_train.drop("gender", axis = 1)
    age  height      weight  empty_col
0    58     167  213.127056       None
1    55     151  161.579091       None
2    40     164  227.105995       None
3    54     155  223.072513       None
4    33     158  248.173858       None
5    56     150  236.798867       None
6    32     153  170.120662       None
7    42     163  196.733247       None
8    36     159  231.915150       None
9    22     162  190.901033       None
10   45     154  192.580001       None
11   21     157  215.189645       None
12   43     165  225.999820       None
13   24     152  160.628936       None
14   51     166  195.947550       None
15   48     161  177.813970       None
16   34     169  231.051503       None
17   49     160  235.308763       None
18   25     168  196.450121       None
19   37     156  231.366207       None

The method shown above is one way to do it. However, if we need to do this for many datasets, it becomes extremely tedious.

In addition, as our datasets get more complicated, repeating these steps by hand makes our code hard to read. We are going to create a class that automates the process and can be put in a pipeline.

Custom Transformers

scikit-learn already ships with many ready-made transformers, such as StandardScaler and LabelBinarizer. However, during feature engineering we usually need to write our own transformers. Here is the general template for a custom transformer:

class MyTransformer(TransformerMixin, BaseEstimator):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # transform code
        return X
  • TransformerMixin gives us access to the extremely popular and useful .fit_transform method.
  • BaseEstimator gives us access to .get_params and .set_params, which will be very beneficial when we use grid search later.
  • __init__ is there so that we can initialize the transformer by just typing MyTransformer().
  • fit returns self. Even though we are not going to transform y, we still need to include it as a parameter and default it to None. This is because a pipeline that ends with a model passes y to the fit method of every step, not only to the model.
  • transform: this is where all the magic happens, and our transformed X is produced here.
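The points above can be sketched with a tiny transformer. AddConstant is a hypothetical class invented for this illustration; it only shows which methods come from the two base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that adds a constant to every value."""

    def __init__(self, constant=1):
        # Store each constructor argument under the same attribute name
        # so that get_params / set_params can find it.
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn from the data in this example.
        return self

    def transform(self, X):
        return X + self.constant


demo = pd.DataFrame({"a": [1, 2, 3]})
adder = AddConstant(constant=10)

# fit_transform is inherited from TransformerMixin.
out = adder.fit_transform(demo)

# get_params and set_params are inherited from BaseEstimator.
params = adder.get_params()
adder.set_params(constant=5)

print(out)
print(params)
```

Only __init__, fit and transform were written by hand; the other three methods come for free from the base classes.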

We are going to create our own transformer to choose the columns that we want.

# This code is based on https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include "
                           "the columns: %s" % cols_error)
cs = ColumnSelector(columns=["age","height","weight","empty_col"])
cs.fit_transform(X_train).head()
   age  height      weight  empty_col
0   58     167  213.127056       None
1   55     151  161.579091       None
2   40     164  227.105995       None
3   54     155  223.072513       None
4   33     158  248.173858       None

Let's create another transformer that removes the columns in which more than 50% of the values are missing.

class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        return X.drop(too_many_na_col, axis = 1)
na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()
   gender  age  height      weight
0    male   58     167  213.127056
1    male   55     151  161.579091
2  female   40     164  227.105995
3  female   54     155  223.072513
4  female   33     158  248.173858

As we can see, the empty_col column has been removed since more than 50% of its values are missing. However, this transformer decides which columns to drop every time transform is called, so it cannot guarantee that the test data gets the same transformation as the training data. Let's take a look at an example.

data2 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(1,20),
                      "food" : np.random.choice(food,20)})
X_test = data2.drop("food", axis = 1)
y_test = data2["food"]
X_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    20 non-null int32
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 800.0+ bytes

No column in this dataset has more than 50% missing values. However, we need our training and testing datasets to have the same columns. If we use NotManyNaColumns as written above, they will not.

na_col = NotManyNaColumns(0.5)
na_col.transform(X_test).head()
   gender  age  height      weight  empty_col
0  female   25     166  236.519943          1
1  female   45     150  173.783799          1
2    male   35     167  198.093367          1
3    male   47     163  240.642942          1
4  female   42     151  173.177008          1

Let's rewrite the transformer to get what we want.

class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        self.too_many_na_col = too_many_na_col
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.drop(self.too_many_na_col, axis = 1)

What we did here is make sure that whatever we did to the training set is also done to the testing set: fit records which columns to drop, and transform drops exactly those columns. We are going to fit and transform on the training dataset, but only transform on the testing set.

na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()
   gender  age  height      weight
0    male   58     167  213.127056
1    male   55     151  161.579091
2  female   40     164  227.105995
3  female   54     155  223.072513
4  female   33     158  248.173858
na_col.transform(X_test).head()
   gender  age  height      weight
0  female   25     166  236.519943
1  female   45     150  173.783799
2    male   35     167  198.093367
3    male   47     163  240.642942
4  female   42     151  173.177008

As we can see, our training data and testing data have the same columns, even though the empty_col column in the testing data does not have more than 50% missing values.
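A possible variant of the same idea, sketched here as an assumption rather than as the code used in this post, follows two scikit-learn conventions: attributes learned in fit end with an underscore, and transform calls check_is_fitted so that using the transformer before fitting raises a clear error. The class name DropSparseColumns, the attribute cols_to_drop_ and the small DataFrames are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class DropSparseColumns(TransformerMixin, BaseEstimator):
    """Hypothetical variant: drop columns that are mostly missing in the training data."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Fraction of missing values per column, learned from the training data only.
        na_fraction = X.isnull().mean()
        self.cols_to_drop_ = na_fraction[na_fraction > self.threshold].index.tolist()
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self, "cols_to_drop_")
        return X.drop(columns=self.cols_to_drop_)


train = pd.DataFrame({"age": [20, 30, 40, 50],
                      "empty_col": [None, None, None, 1.0]})
test = pd.DataFrame({"age": [25, 35], "empty_col": [1.0, 2.0]})

dropper = DropSparseColumns(threshold=0.5)

# Transforming before fitting fails loudly instead of silently doing nothing.
try:
    dropper.transform(test)
    raised = False
except NotFittedError:
    raised = True

train_out = dropper.fit_transform(train)   # learns to drop empty_col
test_out = dropper.transform(test)         # drops the same column in the test data
print(raised, list(train_out.columns), list(test_out.columns))
```

The behaviour matches the rewritten NotManyNaColumns: the decision is made once on the training data and reused on the test data.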

Pipeline

Finally, we are going to learn about Pipeline! Without a pipeline, we would normally analyse our data like this:

# Select Column
cs = ColumnSelector(columns=["age","height","weight","empty_col"])
X_train_transformed = cs.fit_transform(X_train)

# Select good columns
na_col = NotManyNaColumns(0.5)
X_train_transformed = na_col.fit_transform(X_train_transformed)

# Standard Scaler
sc = StandardScaler()
X_train_transformed = sc.fit_transform(X_train_transformed)

# Apply Logistic Regression to predict the food
lg = LogisticRegression(solver = "lbfgs")
lg.fit(X_train_transformed, y_train)
lg.predict(X_train_transformed)
array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
       'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
       'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
      dtype=object)
# Check the score
lg.score(X_train_transformed, y_train)
0.45

As shown above, there are many steps to carry out. We can make things shorter by using make_pipeline.

pipe = make_pipeline(ColumnSelector(columns=["age","height","weight","empty_col"]),
                     NotManyNaColumns(0.5),
                     StandardScaler(),
                     LogisticRegression(solver = "lbfgs"))

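Note: order is important. Each step receives the output of the previous step, so the DataFrame-based transformers must come before StandardScaler, which returns a NumPy array by default.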
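To see why order matters, here is a minimal sketch. RequiresDataFrame is a hypothetical stand-in for a pandas-only transformer such as the ones written in this post, and the small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class RequiresDataFrame(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a transformer that only accepts DataFrames."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("expected a pandas DataFrame, got %s" % type(X).__name__)
        return X


df = pd.DataFrame({"age": [20.0, 30.0, 40.0],
                   "height": [150.0, 160.0, 170.0]})

# DataFrame step first, scaler second: works.
good = make_pipeline(RequiresDataFrame(), StandardScaler())
good_out = good.fit_transform(df)

# Scaler first: with default settings it hands a NumPy array to the next step,
# which then rejects it.
bad = make_pipeline(StandardScaler(), RequiresDataFrame())
try:
    bad.fit_transform(df)
    failed = False
except TypeError:
    failed = True

print(type(good_out).__name__, failed)
```

The same two steps succeed in one order and fail in the other, because the second step sees whatever the first step returns.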

pipe.fit(X_train,y_train)
pipe.predict(X_train)
array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
       'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
       'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
      dtype=object)
pipe.score(X_train, y_train)
0.45
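A pipeline also makes parameter tuning easier, which is where get_params from BaseEstimator pays off. make_pipeline names each step after its lowercased class name, and a parameter is addressed as step name, two underscores, parameter name. The sketch below uses toy data and built-in steps only (all names here are assumptions for illustration); in the pipeline above, the same pattern would address the threshold as notmanynacolumns__threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration): three numeric features, binary label.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

tune_pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))

# Step names are generated from the class names.
step_names = [name for name, _ in tune_pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Search over the regularization strength of the last step.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(tune_pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Because the whole pipeline is refit inside each cross-validation fold, the preprocessing steps only ever learn from the training part of that fold.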