The following explanation is based on the fit_transform method of the Imputer class, but the idea is the same for fit_transform of other scikit-learn classes.
transform replaces the missing values with a number. By default, this number is the mean of the corresponding column of some data that you choose.
Consider the following example:
imp = Imputer()
# calculating the means
Now the imputer has learned to use the mean (1+8)/2 = 4.5 for the first column and the mean (2+3+5.5)/3 = 3.5 for the second column when it is applied to two-column data:
X = [[np.nan, 11],
So with fit the imputer calculates the column means from some data, and with transform it applies those means to some data (which just means replacing missing values with the means). If both data sets are the same (i.e. the data used to calculate the means and the data the means are applied to), you can use fit_transform, which is basically a fit followed by a transform.
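The fit / transform / fit_transform split described above can be sketched as follows. This is only an illustration: it uses SimpleImputer from sklearn.impute, which replaces the old Imputer class in recent scikit-learn releases, and both arrays are assumed data chosen so that the column means come out as 4.5 and 3.5.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Assumed data: observed values 1 and 8 in the first column,
# and 3, 2 and 5.5 in the second, so the means are 4.5 and 3.5.
data = np.array([[1, 3], [np.nan, 2], [8, 5.5]])

imp = SimpleImputer(strategy="mean")  # "mean" is also the default strategy
imp.fit(data)                         # learns one mean per column
print(imp.statistics_)                # [4.5 3.5]

# Hypothetical second data set with missing entries in both columns.
X = np.array([[np.nan, 11], [4, np.nan], [8, 2], [np.nan, 1]])
X_filled = imp.transform(X)           # NaNs become 4.5 or 3.5
print(X_filled)

# On a single data set, fit_transform is fit followed by transform.
same = SimpleImputer().fit_transform(data)
print(same)
```

Note that transform never recomputes the means: the values 11, 4, 2 and 1 in X have no influence on what the missing entries are replaced with.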
Now your questions:
Why might we need to transform data?
"For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical" (source)
What does it mean to fit a model on training data and transform the test data?
The fit of an imputer has nothing to do with the fit used in model fitting.
So using the imputer's fit on training data just calculates the mean of each column of the training data. Using transform on test data then replaces the missing values of the test data with the means that were calculated from the training data.
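A minimal sketch of that train/test pattern, again using SimpleImputer as a stand-in for Imputer. The arrays X_train and X_test are hypothetical and exist only to show that the test set is filled with the training means:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets, made up for illustration.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, 5.0], [100.0, np.nan]])

imp = SimpleImputer(strategy="mean")

# Means are computed from the training data only (2 and 20 here).
X_train_filled = imp.fit_transform(X_train)

# The test data is filled with those same training means; no refit happens.
X_test_filled = imp.transform(X_test)

print(imp.statistics_)
print(X_test_filled)
```

The value 100 in the test set does not shift the imputed value for the first column, because the means were fixed when fit was called on the training data.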