What's the difference between fit and fit_transform in scikit-learn models?



I'm a newbie to data science, and I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data?

What does it mean, fitting a model on training data and transforming to test data? Does it mean, for example, converting categorical variables into numbers in training and transforming the new feature set onto test data?


Posted 2016-06-21T10:05:08.587

Reputation: 2 427

@sds The Answer of above gives the link to this question. – Kaushal28 – 2019-05-02T13:20:27.307

4We apply fit on the training dataset and use the transform method on both - the training dataset and the test dataset – Prakash Kumar – 2019-06-14T11:35:59.017

fit_transform() is equivalent to applying fit() and then transform(). Sometimes the former is faster than the latter. – Dr Nisha Arora – 2020-09-10T00:36:20.657



To standardize the data (make it have zero mean and unit standard deviation), you subtract the mean and then divide the result by the standard deviation:

$$x' = \frac{x-\mu}{\sigma}$$

You do that on the training set of data. But then you have to apply the same transformation to your testing set (e.g. in cross-validation), or to newly obtained examples before prediction. And you have to use the exact same two parameter values $\mu$ and $\sigma$ that you used for centering the training set.

Hence, every scikit-learn transformer's fit() just calculates the parameters (e.g. $\mu$ and $\sigma$ in the case of StandardScaler) and saves them as internal state of the object. Afterwards, you can call its transform() method to apply the transformation to any particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set $x$, while also returning the transformed $x'$. Internally, the transformer object just calls first fit() and then transform() on the same data.
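To make this concrete, here is a minimal sketch with StandardScaler (the toy X_train values are my own, chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

# fit() computes mu and sigma and stores them on the scaler object
scaler = StandardScaler()
scaler.fit(X_train)
print(scaler.mean_)   # the learned mu: [2.5]
print(scaler.scale_)  # the learned sigma

# transform() applies (x - mu) / sigma using the stored parameters
X_scaled = scaler.transform(X_train)

# fit_transform() is fit() followed by transform() on the same data
X_scaled_2 = StandardScaler().fit_transform(X_train)
assert np.allclose(X_scaled, X_scaled_2)
```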



Reputation: 2 982

2Thanks a lot for your answer. Just one thing: by parameters of the model, doesn't that mean, for example, the slope and intercept of a regression? When you fit, let's say, a linear regression, which parameters are fitted by the fit method? Normalization parameters, or model parameters like slope and intercept? – Kaggle – 2016-06-23T07:29:07.823


I mean parameters internal to the transforms ($\mu$ and $\sigma$ in case of StandardScaler). Whatever transform's get_params() method returns. See this chapter on imputation, for example: http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values

– K3---rnc – 2016-06-23T14:40:19.633

3My previous comment is actually wrong. In case of linear regression, the fitted parameters are the coef_ (i.e. slope and intercept), not the ones returned by get_params() (which, instead, returns the set of model constructor arguments with their associated values). – K3---rnc – 2017-01-25T17:23:44.900

2Great answer! I came across your post while searching on this topic, but I need to clarify. Does that mean that if suppose we want to transform each set of subsequent examples, we should never call fit_transform() as it would not allow us to access the internal objects state, to transform subsequent examples with the same parameters that were obtained using fit() on the initial dataset? This arises for example when, you have a test dataset and want to transform the test set to pass it to your trained classifier. – AKKA – 2018-06-01T13:56:05.407

3After you call t.fit_transform(train_data), t is fitted, so you can safely use t.transform(test_data). – K3---rnc – 2018-06-01T17:58:08.897

You just explained very decently in a twenty second read what my prof couldn't do decently in five minutes – Wouter Vandenputte – 2020-06-01T13:05:54.190


The following explanation is based on fit_transform of Imputer class, but the idea is the same for fit_transform of other scikit_learn classes like MinMaxScaler.

transform replaces the missing values with a number. By default this number is the mean of the corresponding column, computed from the data you fit on. Consider the following example:

imp = Imputer()
# calculating the means
imp.fit([[1,      3],
         [np.nan, 2],
         [8,      5.5]])

Now the imputer has learned to use a mean of (1+8)/2 = 4.5 for the first column and a mean of (2+3+5.5)/3 = 3.5 for the second column when it gets applied to two-column data:

X = [[np.nan, 11], 
     [4,      np.nan], 
     [8,      2],
     [np.nan, 1]]

then calling imp.transform(X) gives

[[4.5, 11], 
 [4, 3.5],
 [8, 2],
 [4.5, 1]]

So with fit the imputer calculates the means of the columns from some data, and with transform it applies those means to some data (which is just replacing the missing values with those means). If both of these datasets are the same (i.e. the data used for calculating the means and the data the means are applied to), you can use fit_transform, which is basically a fit followed by a transform.
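As a runnable sketch of the example above: note that the old Imputer class was removed in recent scikit-learn versions, so this uses its replacement, SimpleImputer, with the same toy data.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # modern replacement for Imputer

train = [[1.0,    3.0],
         [np.nan, 2.0],
         [8.0,    5.5]]
test = [[np.nan, 11.0],
        [4.0,    np.nan]]

imp = SimpleImputer(strategy="mean")
imp.fit(train)                 # learns the column means: [4.5, 3.5]
filled = imp.transform(test)   # fills NaNs with the *training* means

# fit_transform(train) is just fit(train) followed by transform(train)
filled_train = SimpleImputer(strategy="mean").fit_transform(train)
```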

Now your questions:

Why we might need to transform data?

"For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical" (source)

What does it mean fitting model on training data and transforming to test data?

The fit of an imputer has nothing to do with fit used in model fitting. So using imputer's fit on training data just calculates means of each column of training data. Using transform on test data then replaces missing values of test data with means that were calculated from training data.



Reputation: 381

Your explanation made everything clear to me, thank you :)! – Axel Kennedal – 2020-09-03T15:04:43.603


These methods are used for dataset transformations in scikit-learn:

Let us take an example for Scaling values in a dataset:

Here the fit method, when applied to the training dataset, learns the model parameters (for example, mean and standard deviation). We then need to apply the transform method on the training dataset to get the transformed (scaled) training dataset. We could also perform both of these steps in one go by applying fit_transform on the training dataset.

Then why do we need 2 separate methods - fit and transform ?

In practice we need to have separate training and testing datasets, and that is where having separate fit and transform methods helps. We apply fit on the training dataset and use the transform method on both - the training dataset and the test dataset. Thus the training as well as the test dataset are transformed (scaled) using the model parameters that were learnt by applying the fit method to the training dataset.

Example Code:

scaler = preprocessing.StandardScaler().fit(X_train)
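Extending that line into a full (hedged) sketch - X_train and X_test here are toy placeholders standing in for your own train/test split:

```python
import numpy as np
from sklearn import preprocessing

# Placeholder data standing in for a real train/test split
X_train = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
X_test = np.array([[1.5, 25.0]])

scaler = preprocessing.StandardScaler().fit(X_train)  # learn mean/std on train only
X_train_scaled = scaler.transform(X_train)            # same result as fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)              # reuse the training statistics
```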

Prasad Nageshkar


Reputation: 311

Thanks for clarifying! Do you also know how 'fit_transform' is used to ensure that training and test set share the same mean and std? – Ben – 2020-02-27T07:02:16.233

The training and test set may not have the same mean and std - if they do, then it would be a coincidence. – Prasad Nageshkar – 2020-03-02T07:23:48.380

2Remember that mean and std obtained from the training set are used for scaling all training dataset values. This scaling preprocessing is required for training a few ML models. Finally, note that we should not compute a separate mean and std on the test set to scale the test set values but we have to use the ones obtained using fit on the training set. We have to ensure identical operation on test set. – Prasad Nageshkar – 2020-03-02T07:52:04.543

Yes, this is what I mean. I thought it behaves this way: when I use "fit_transform(training)", it will behave as expected. But when I use "fit_transform(test)", it will calculate a new mean and std and not use those of the training set? – Ben – 2020-03-02T08:55:16.183


This isn't a technical answer but, hopefully, it is helpful to build up our intuition:

Firstly, all estimators are trained (or "fit") on some training data. That part is fairly straightforward.

Secondly, all of the scikit-learn estimators can be used in a pipeline, and the idea with a pipeline is that data flows through it. Once fit at a particular stage in the pipeline, data is passed on to the next stage, but obviously the data needs to be changed (transformed) in some way; otherwise, you wouldn't need that stage in the pipeline at all. So, transform is a way of changing the data to meet the needs of the next stage in the pipeline.

If you're not using a pipeline, I still think it's helpful to think about these machine learning tools in this way because, even the simplest classifier is still performing a classification function. It takes as input some data and produces an output. This is a pipeline too; just a very simple one.

In summary, fit performs the training, transform changes the data in the pipeline in order to pass it on to the next stage in the pipeline, and fit_transform does both the fitting and the transforming in one possibly optimized step.
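A minimal sketch of this pipeline view (the toy data is my own, chosen so the classes are separable):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: a single feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Each intermediate stage is fit, then transforms the data for the
# next stage; only the final estimator stops at fit.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)
pred = pipe.predict([[2.5]])
```

Calling pipe.predict later runs transform (not fit) on every intermediate stage, which is exactly the fit-on-train / transform-on-new-data discipline described above.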

Eric McLachlan


Reputation: 173

"" We apply fit on the training dataset and use the transform method on both - the training dataset and the test dataset"" :) Nice – Prakash Kumar – 2019-06-14T10:12:48.077

2I think you meant to comment below. I'll forward it on to Prasad Nageshkar. (Well... I would have if I had the reputation.) – Eric McLachlan – 2019-06-14T11:07:19.540


"fit" computes the mean and std to be used for later scaling. (jsut a computation), nothing is given to you.

"transform" uses a previously computed mean and std to autoscale the data (subtract mean from all values and then divide it by std).

"fit_transform" does both at the same time. So you can do it with 1 line of code instead of 2.

Now let's look at it in practice:

For X training set, we do "fit_transform" because we need to compute mean and std, and then use it to autoscale the data. For X test set, well, we already have the mean and std, so we only do the "transform" part.
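To see numerically why the test set only gets "transform" (toy values are my own), compare reusing the training statistics against wrongly recomputing them on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])
X_test = np.array([[4.0], [6.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # compute mean/std, then scale

right = scaler.transform(X_test)                 # reuses the training mean/std
wrong = StandardScaler().fit_transform(X_test)   # recomputes them from the test set
```

Here "right" maps 4.0 to about +1.22 (it is above the training mean of 2), while "wrong" maps the same 4.0 to -1.0, as if it were below average - a leak of test-set statistics.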

It's super simple. You are doing great. Keep up your good work my friend :-)

Salman Tabatabai


Reputation: 61


In layman's terms, fit_transform means to do some calculation and then do transformation (say calculating the means of columns from some data and then replacing the missing values). So for training set, you need to both calculate and do transformation.

But for the testing set, the transformation reuses what was learned from the training set, so there is no need to calculate again; it just performs the transformation.

Ashish Anand


Reputation: 141


By applying transformations you are trying to make your variables comparable, or to make them behave normally. For example, if you have two variables $V_1$ and $V_2$ that both measure distance, but $V_1$ is in centimeters and $V_2$ is in kilometers, then in order to compare them you have to convert them to the same units. Transforming is just like that: it brings variables to a similar behavior, for example making them behave like a normal distribution.

Coming to the other question: you first build the model on the training set (the model learns the patterns or behavior of your data from the training set), and when you run the same model on the test set, it tries to identify similar patterns; once it identifies them, it draws its conclusions and gives results according to the training data.
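As a small sketch of making two differently scaled distance variables comparable (toy centimeter/kilometer values are my own):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two distance columns on very different scales
# (think centimeters vs kilometers)
X = np.array([[150.0, 1.0],
              [300.0, 3.0],
              [450.0, 5.0]])

# Rescaling each column to [0, 1] makes the variables directly comparable
X_scaled = MinMaxScaler().fit_transform(X)
```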



Reputation: 21


Consider a task that requires us to normalize the data. For example, we may use min-max normalization or z-score normalization. Each has inherent parameters learned from the data: the minimum and maximum values in min-max normalization, and the mean and standard deviation in z-score normalization. The fit() function calculates the values of these parameters.

Effect of fit()

The transform function applies the values of the parameters on the actual data and gives the normalized value.

Effect of transform()

The fit_transform() function performs both in the same step.

Effect of fit_transform()

Note that the same values are obtained whether we perform the operation in two steps or in a single step.
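That equivalence can be checked directly with min-max normalization (the toy X is my own):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

# Two steps: fit() learns data_min_ / data_max_, transform() applies them
scaler = MinMaxScaler()
scaler.fit(X)
two_step = scaler.transform(X)

# One step: fit_transform() does both
one_step = MinMaxScaler().fit_transform(X)
```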

Lovelyn David


Reputation: 11


You don't want your model to learn anything from the test dataset. You just want to apply what was learned from your training dataset. So we apply only the transform operation on the test dataset, and the fit_transform operation on the training dataset.

Sohil Grandhi


Reputation: 1