In data science, some of the most tedious steps we need to perform are feature engineering and modeling. These steps are very repetitive. Luckily, creating a pipeline can help us solve this problem.

What is a pipeline?

Put simply, a Pipeline is a tool in the sklearn package that lets us chain multiple estimators together. For example, when we analyse a dataset, we may need to choose the features (columns) that we want, standardize them, and then apply a model. All of these steps can be encapsulated in a Pipeline.
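As a minimal sketch of the idea, two steps can be chained with make_pipeline. The toy arrays and the names X_toy, y_toy and toy_pipe are made up for illustration and are not part of the dataset used in the rest of this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features and a binary label.
X_toy = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 160.0],
                  [8.0, 40.0], [9.0, 20.0], [10.0, 10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Scaling and the classifier are wrapped in a single object.
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# One call fits the scaler, transforms X_toy, then fits the classifier.
toy_pipe.fit(X_toy, y_toy)

# One call scales the input with the fitted scaler, then predicts.
toy_preds = toy_pipe.predict(X_toy)
print(toy_preds)
```

The rest of this post builds the same kind of object, but with custom steps.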

Suppose we have a dataset like the one below, and we want to predict which food a person wants to eat based on their gender, age, height, weight, and an unknown feature, "empty_col".

# Importing library
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import random
import warnings
warnings.filterwarnings("ignore")

gender = ["male","female"]
food = ["Viet", "Thai", "Chinese", "Western"]

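# Seed both generators so the random draws are repeatable.
# np.random.RandomState(42) on its own only creates a generator object and discards it.
np.random.seed(42)
random.seed(42)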
data1 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(None,20),
                      "food" : np.random.choice(food,20)})

data1.head()

	gender	age	height	weight	empty_col	food
0	male	58	167	213.127056	None	Western
1	male	55	151	161.579091	None	Chinese
2	female	40	164	227.105995	None	Chinese
3	female	54	155	223.072513	None	Western
4	female	33	158	248.173858	None	Viet

data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    0 non-null object
food         20 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 1.0+ KB

X_train = data1.drop("food", axis = 1)
y_train = data1["food"]

Let's say that, for some reason, the person analyzing this dataset only wants to use the features age, height, weight, and empty_col to predict the food.

X_train.drop("gender", axis = 1)

	age	height	weight	empty_col
0	58	167	213.127056	None
1	55	151	161.579091	None
2	40	164	227.105995	None
3	54	155	223.072513	None
4	33	158	248.173858	None
5	56	150	236.798867	None
6	32	153	170.120662	None
7	42	163	196.733247	None
8	36	159	231.915150	None
9	22	162	190.901033	None
10	45	154	192.580001	None
11	21	157	215.189645	None
12	43	165	225.999820	None
13	24	152	160.628936	None
14	51	166	195.947550	None
15	48	161	177.813970	None
16	34	169	231.051503	None
17	49	160	235.308763	None
18	25	168	196.450121	None
19	37	156	231.366207	None

The method shown above is one way to do it. However, if we need to do this for many datasets, it becomes extremely tedious.

In addition, as our datasets get more complicated, repeating these steps by hand makes our code hard to read. We are going to create a class that automates the process and can be put in a pipeline.

Custom Transformers

scikit-learn ships with many ready-made transformers, such as StandardScaler and LabelBinarizer. However, during feature engineering we usually need to write our own. Here is a general template for a custom transformer:

class MyTransformer(TransformerMixin, BaseEstimator):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # transform code
        return X

TransformerMixin gives us access to the very useful .fit_transform method.
BaseEstimator will be useful when we use GridSearchCV later, as it gives us access to .get_params and .set_params.
__init__ is there so that we can initialize the transformer by just typing MyTransformer().
fit returns self. Even though we are not going to transform y, we still need to include it as a parameter and default it to None. This is because a pipeline passes y to the fit method of every step, including the final model, which does need it.
transform: this is where all the magic happens; it returns our transformed X.
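To make the role of the two mixins concrete, here is a small sketch. AddConstant is a hypothetical transformer invented for this illustration. It defines only __init__, fit and transform, yet fit_transform, get_params and set_params are all available.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that adds a fixed value to every cell."""

    def __init__(self, value=1):
        # Store constructor arguments under the same name, unchanged,
        # so that get_params / set_params can find them.
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn for this transformer.
        return self

    def transform(self, X):
        return X + self.value


demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
adder = AddConstant(value=10)

out = adder.fit_transform(demo)   # provided by TransformerMixin
params = adder.get_params()       # provided by BaseEstimator
adder.set_params(value=0)         # also provided by BaseEstimator

print(out)
print(params)
```

Storing each constructor argument as an attribute of the same name is what allows tools such as GridSearchCV to read and change it.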

We are going to create our own transformer to choose the columns that we want.

# The code based on https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include "
                           "the columns: %s" % cols_error)

cs = ColumnSelector(columns=["age","height","weight","empty_col"])
cs.fit_transform(X_train).head()

	age	height	weight	empty_col
0	58	167	213.127056	None
1	55	151	161.579091	None
2	40	164	227.105995	None
3	54	155	223.072513	None
4	33	158	248.173858	None

Let's create another transformer, one that removes columns with more than 50% missing values.

class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        return X.drop(too_many_na_col, axis = 1)

na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()

	gender	age	height	weight
0	male	58	167	213.127056
1	male	55	151	161.579091
2	female	40	164	227.105995
3	female	54	155	223.072513
4	female	33	158	248.173858

As we can see, the empty_col column has been removed since more than 50% of its values are missing. However, this version decides which columns to drop inside transform, so it looks only at whatever data it is given. That means we cannot apply the transformation learned from the training data to the test data. Let's take a look at an example.

data2 = pd.DataFrame({"gender" : np.random.choice(gender,20),
                      "age" : random.sample(range(20,60),20),
                      "height" : random.sample(range(150,170),20),
                      "weight" : np.random.uniform(150,250,20),
                      "empty_col" : np.repeat(1,20),
                      "food" : np.random.choice(food,20)})

X_test = data2.drop("food", axis = 1)
y_test = data2["food"]

X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
gender       20 non-null object
age          20 non-null int64
height       20 non-null int64
weight       20 non-null float64
empty_col    20 non-null int32
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 800.0+ bytes

This dataset clearly has no column with more than 50% missing values. However, we need our training and testing datasets to have the same columns. If we use NotManyNaColumns as written above, we will not get that.

na_col = NotManyNaColumns(0.5)
na_col.transform(X_test).head()

	gender	age	height	weight	empty_col
0	female	25	166	236.519943	1
1	female	45	150	173.783799	1
2	male	35	167	198.093367	1
3	male	47	163	240.642942	1
4	female	42	151	173.177008	1

Let's rewrite the class to get what we want:

class NotManyNaColumns(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        na_col = X.isnull().sum()
        too_many_na_col = na_col[na_col > self.threshold * X.shape[0]].index
        self.too_many_na_col = too_many_na_col
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.drop(self.too_many_na_col, axis = 1)

What we did here is make sure that whatever we did to the training set will also be done to the testing set: the columns to drop are now learned in fit and merely applied in transform. We are going to fit and transform on the training dataset, but only transform on the testing set.

na_col = NotManyNaColumns(0.5)
na_col.fit_transform(X_train).head()

	gender	age	height	weight
0	male	58	167	213.127056
1	male	55	151	161.579091
2	female	40	164	227.105995
3	female	54	155	223.072513
4	female	33	158	248.173858

na_col.transform(X_test).head()

	gender	age	height	weight
0	female	25	166	236.519943
1	female	45	150	173.783799
2	male	35	167	198.093367
3	male	47	163	240.642942
4	female	42	151	173.177008

As we can see, our training data and testing data have the same columns, even though the empty_col column in the testing data does not have more than 50% missing values.
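The same fit-then-transform idea can also be written following two scikit-learn conventions: state learned in fit gets a trailing underscore, and transform calls check_is_fitted so that using an unfitted transformer fails with a clear error. The sketch below is one possible variant, not a replacement for the class above. DropSparseColumns and columns_to_drop_ are hypothetical names, and the two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class DropSparseColumns(TransformerMixin, BaseEstimator):
    """Hypothetical variant of NotManyNaColumns."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Fraction of missing values per column, measured on the training data.
        na_fraction = X.isnull().mean()
        # The trailing underscore marks an attribute learned during fit.
        self.columns_to_drop_ = na_fraction[na_fraction > self.threshold].index.tolist()
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self, "columns_to_drop_")
        # errors="ignore" tolerates a frame that lacks one of the dropped columns.
        return X.drop(columns=self.columns_to_drop_, errors="ignore")


# Made-up data: column "b" is 75% missing in train but complete in test.
train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, np.nan, 1.0]})
test = pd.DataFrame({"a": [5, 6], "b": [7.0, 8.0]})

dropper = DropSparseColumns(threshold=0.5)
train_out = dropper.fit_transform(train)
test_out = dropper.transform(test)

# Calling transform on a fresh, unfitted instance is rejected.
try:
    DropSparseColumns().transform(test)
    raised = False
except NotFittedError:
    raised = True

print(list(train_out.columns), list(test_out.columns), raised)
```

Column "b" is dropped from both frames because the decision was made once, on the training data.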

Pipeline

Finally, we are going to learn about Pipeline! Without one, we would analyse our data step by step like this:

# Select Column
cs = ColumnSelector(columns=["age","height","weight","empty_col"])
X_train_transformed = cs.fit_transform(X_train)

# Select good columns
na_col = NotManyNaColumns(0.5)
X_train_transformed = na_col.fit_transform(X_train_transformed)

# Standard Scaler
sc = StandardScaler()
X_train_transformed = sc.fit_transform(X_train_transformed)

# Apply Logistic Regression to predict the food
lg = LogisticRegression(solver = "lbfgs")
lg.fit(X_train_transformed, y_train)
lg.predict(X_train_transformed)

array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
       'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
       'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
      dtype=object)

# Check the score
lg.score(X_train_transformed, y_train)

0.45

As shown above, there are many steps to carry out by hand. We can shorten all of this by using make_pipeline.

pipe = make_pipeline(ColumnSelector(columns=["age","height","weight","empty_col"]),
                     NotManyNaColumns(0.5),
                     StandardScaler(),
                     LogisticRegression(solver = "lbfgs"))

Note: order is important. The steps run in the sequence they are passed in, and the model must come last.
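One way to check the order a pipeline has recorded is to look at its steps attribute. The sketch below uses only built-in steps so that it runs on its own; order_pipe is a name made up for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

order_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# steps is a list of (name, estimator) pairs in execution order.
# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in order_pipe.steps]
print(step_names)

# The last step is the model; everything before it is a transformer.
last_step = order_pipe.steps[-1][1]
print(type(last_step).__name__)
```

The same inspection works on the pipe object defined above, where the custom transformers appear under their own lowercased class names.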

pipe.fit(X_train,y_train)

pipe.predict(X_train)

array(['Western', 'Thai', 'Western', 'Thai', 'Viet', 'Thai', 'Viet',
       'Chinese', 'Viet', 'Viet', 'Viet', 'Viet', 'Western', 'Viet',
       'Chinese', 'Chinese', 'Western', 'Thai', 'Chinese', 'Viet'],
      dtype=object)

pipe.score(X_train, y_train)

0.45
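Because the custom transformers inherit from BaseEstimator, their parameters can be tuned together with the model's through GridSearchCV. The following is a rough sketch under stated assumptions rather than part of the original analysis. DropNaColumns is a hypothetical stand-in for NotManyNaColumns, repeated so that the block runs on its own, and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class DropNaColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for NotManyNaColumns."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        na_fraction = X.isnull().mean()
        self.drop_ = na_fraction[na_fraction > self.threshold].index.tolist()
        return self

    def transform(self, X):
        return X.drop(columns=self.drop_)


# Made-up data: two informative numeric columns and one entirely empty column.
rng = np.random.RandomState(0)
n_rows = 60
X_demo = pd.DataFrame({"x1": rng.normal(size=n_rows),
                       "x2": rng.normal(size=n_rows),
                       "empty_col": np.full(n_rows, np.nan)})
y_demo = (X_demo["x1"] + X_demo["x2"] > 0).astype(int)

demo_pipe = make_pipeline(DropNaColumns(), StandardScaler(), LogisticRegression())

# Keys follow the pattern <lowercased class name>__<parameter name>.
# Both thresholds drop empty_col here; they only illustrate the mechanism.
param_grid = {"dropnacolumns__threshold": [0.5, 0.8],
              "logisticregression__C": [0.1, 1.0, 10.0]}

# Each candidate refits the whole pipeline inside every cross-validation fold.
search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

In a real search the candidate values would be chosen so that they actually change which columns survive.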
