133

176

Am I looking for a better behaved distribution for the independent variable in question, or to reduce the effect of outliers, or something else?

134

I always hesitate to jump into a thread with as many excellent responses as this, but it strikes me that few of the answers provide any reason to prefer the logarithm to some other transformation that "squashes" the data, such as a root or reciprocal.

Before getting to that, let's **recapitulate** the wisdom in the existing answers in a more general way. *Some* non-linear re-expression of the dependent variable is indicated when any of the following apply:

The residuals have a skewed distribution. The purpose of a transformation is to obtain residuals that are approximately symmetrically distributed (about zero, of course).

The spread of the residuals changes systematically with the values of the dependent variable ("heteroscedasticity"). The purpose of the transformation is to remove that systematic change in spread, achieving approximate "homoscedasticity."

To linearize a relationship.

When scientific theory indicates. For example, chemistry often suggests expressing concentrations as logarithms (giving activities or even the well-known pH).

When a more nebulous statistical theory suggests the residuals reflect "random errors" that do not accumulate additively.

To simplify a model. For example, sometimes a logarithm can simplify the number and complexity of "interaction" terms.

(These indications can conflict with one another; in such cases, judgment is needed.)

So, **when is a logarithm specifically indicated** instead of some other transformation?

The residuals have a "strongly" positively skewed distribution. In his book on EDA, John Tukey provides quantitative ways to estimate the transformation (within the family of Box-Cox, or power, transformations) based on rank statistics of the residuals. It really comes down to the fact that if taking the log symmetrizes the residuals, it was probably the right form of re-expression; otherwise, some other re-expression is needed.

When the SD of the residuals is directly proportional to the fitted values (and not to some power of the fitted values).

When the relationship is close to exponential.

When residuals are believed to reflect multiplicatively accumulating errors.

When you really want a model in which marginal changes in the explanatory variables are interpreted in terms of multiplicative (percentage) changes in the dependent variable.
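To make the "SD directly proportional to the fitted values" indication concrete, here is a minimal simulated sketch. The data, the coefficients, and all variable names are invented for illustration; the errors are assumed to be multiplicative (lognormal). On the raw scale the within-group SD grows with the group mean, while on the log scale it is roughly constant.

```r
# Simulated data with multiplicative errors (all values are hypothetical).
set.seed(1)
x <- rep(1:5, each = 200)                       # five groups, 200 replicates each
y <- exp(1 + 0.5 * x + rnorm(length(x), sd = 0.3))

mean_raw <- tapply(y, x, mean)                  # group means on the raw scale
sd_raw   <- tapply(y, x, sd)                    # group SDs on the raw scale
sd_log   <- tapply(log(y), x, sd)               # group SDs after taking logs

# Ratio of largest to smallest group SD on each scale.
spread_ratio_raw <- max(sd_raw) / min(sd_raw)
spread_ratio_log <- max(sd_log) / min(sd_log)

# SD / mean is roughly constant on the raw scale, which is the signature
# that points to a log rather than some other power transformation.
print(round(cbind(mean_raw, sd_raw, cv = sd_raw / mean_raw, sd_log), 3))
print(c(raw = spread_ratio_raw, log = spread_ratio_log))
```

If the SD were instead proportional to, say, the square root of the mean, a root rather than a log would be the natural candidate.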

Finally, some **non-reasons to use a re-expression**:

Making outliers not look like outliers. An outlier is a datum that does not fit some parsimonious, relatively simple description of the data. Changing one's description in order to make outliers look better is usually an incorrect reversal of priorities: first obtain a scientifically valid, statistically good description of the data and then explore any outliers. Don't let the occasional outlier determine how to describe the rest of the data!

Because the software automatically did it. (Enough said!)

Because all the data are positive. (Positivity often implies positive skewness, but it does not have to. Furthermore, other transformations can work better. For example, a root often works best with counted data.)

To make "bad" data (perhaps of low quality) appear well behaved.

To be able to plot the data. (If a transformation is needed to be able to plot the data, it's probably needed for one or more good reasons already mentioned. If the only reason for the transformation truly is for plotting, go ahead and do it--but *only* to plot the data. Leave the data untransformed for analysis.)
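On the parenthetical remark above that a root often works best with counted data, here is a small simulated sketch, assuming the counts are Poisson (an assumption of this example, not something the answer states). The Poisson means and sample sizes are arbitrary. The SD of the raw counts grows with the mean, whereas the SD of the square roots stays close to 0.5.

```r
# Hypothetical Poisson counts at three different mean levels.
set.seed(42)
lambda <- c(4, 16, 64)
counts <- lapply(lambda, function(l) rpois(2000, l))

sd_raw  <- sapply(counts, sd)                        # grows like sqrt(lambda)
sd_root <- sapply(counts, function(k) sd(sqrt(k)))   # roughly constant

print(round(rbind(lambda, sd_raw, sd_root), 2))
```

Note that a plain logarithm would also fail here for a practical reason: some of the simulated counts at the lowest mean are zero.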

1"When residuals are believed to reflect multiplicatively accumulating errors." I'm having trouble interpreting this phrase. Is it possible to flesh this out a bit with another sentence or two? What is the accumulation you're referring to? – Hatshepsut – 2016-04-19T19:17:36.123

@user1690130 for ratios and densities, these should generally be fitted as a poisson-family distribution for counts with an offset for the exposure. E.g. number of people is the count, and the offset is area of the region. See this question for a good explanation - https://stats.stackexchange.com/questions/11182/when-to-use-an-offset-in-a-poisson-regression – Michael Barton – 2017-06-30T14:39:12.707

1What about variables like population density in a region or the child-teacher ratio for each school district or the number of homicides per 1000 in the population? I have seen professors take the log of these variables. It is not clear to me why. For example, isn't the homicide rate already a percentage? The log would be the percentage change of the rate? Why would the log of child-teacher ratio be preferred? Should the log transformation be taken for every continuous variable when there is no underlying theory about a true functional form? – user1690130 – 2012-10-26T19:52:45.207

1@J G Small ratios tend to have skewed distributions; logarithms and roots are likely to make them more symmetrical. I do not understand your questions related to percentages: perhaps you are conflating different uses of percentages (one to express something as a proportion of a whole and another to express a relative change)? I don't believe I wrote anything advocating that logarithms always be applied--far from it! So I don't understand the basis for your last question. – whuber – 2012-10-26T21:00:06.683

67

I always tell students there are three reasons to transform a variable by taking the natural logarithm. The reason for logging the variable will determine whether you want to log the independent variable(s), dependent or both. To be clear throughout I'm talking about taking the natural logarithm.

Firstly, to improve model fit, as other posters have noted. For instance, if your residuals aren't normally distributed, then taking the logarithm of a skewed variable may improve the fit by altering the scale and making the variable more "normally" distributed. For instance, earnings are truncated at zero and often exhibit positive skew. If the variable has negative skew, you could first invert (reflect) the variable before taking the logarithm. I'm thinking here particularly of Likert scales that are entered as continuous variables. While this usually applies to the dependent variable, you occasionally have problems with the residuals (e.g. heteroscedasticity) caused by an independent variable, which can sometimes be corrected by taking the logarithm of that variable. For example, when running a model that explained lecturer evaluations on a set of lecturer and class covariates, the variable "class size" (i.e. the number of students in the lecture) had outliers which induced heteroscedasticity, because the variance in the lecturer evaluations was smaller in larger cohorts than in smaller cohorts. Logging the class-size variable would help, although in this example either calculating Robust Standard Errors or using Weighted Least Squares may make interpretation easier.

The second reason for logging one or more variables in the model is for interpretation. I call this the convenience reason. If you log both your dependent (Y) and independent (X) variable(s), your regression coefficients ($\beta$) will be elasticities and interpretation would go as follows: a 1% increase in X would lead to a *ceteris paribus* $\beta$% increase in Y (on average). Logging only one side of the regression "equation" would lead to alternative interpretations as outlined below:

Y and X -- a one unit increase in X would lead to a $\beta$ increase/decrease in Y

Log Y and Log X -- a 1% increase in X would lead to a $\beta$% increase/decrease in Y

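Log Y and X -- a one unit increase in X would lead to approximately a $\beta*100$ % increase/decrease in Y (accurate when $\beta$ is small; the exact figure is $100(e^\beta - 1)$ %)

Y and Log X -- a 1% increase in X would lead to approximately a $\beta/100$ increase/decrease in Y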
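As a quick check on the "Log Y and X" row of the list above, the following sketch compares the shorthand $100\beta$ with the exact percentage change $100(e^\beta - 1)$, the figure the comments below point to. The $\beta$ values are arbitrary illustrations, not estimates from any data set.

```r
# Arbitrary coefficient values from a hypothetical log(Y) ~ X regression.
beta <- c(0.01, 0.05, 0.10, 0.25, 0.50)

approx_pct <- 100 * beta               # shorthand reading of the coefficient
exact_pct  <- 100 * (exp(beta) - 1)    # exact percent change in Y per unit of X

print(data.frame(beta, approx_pct, exact_pct = round(exact_pct, 2)))
```

The two agree closely for small coefficients and drift apart as $\beta$ grows, so it is safer to report the exact figure whenever $\beta$ is not small.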

And finally there could be a theoretical reason for doing so. For example some models that we would like to estimate are multiplicative and therefore nonlinear. Taking logarithms allows these models to be estimated by linear regression. Good examples of this include the Cobb-Douglas production function in economics and the Mincer Equation in education. The Cobb-Douglas production function explains how inputs are converted into outputs:

$$Y = A L^\alpha K^\beta $$

where

$Y$ is the total production or output of some entity e.g. firm, farm, etc.

$A$ is the total factor productivity (the change in output not caused by the inputs e.g. by technology change or weather)

$L$ is the labour input

$K$ is the capital input

$\alpha$ & $\beta$ are output elasticities.

Taking logarithms of this makes the function easy to estimate using OLS linear regression as such:

$$\log(Y) = \log(A) + \alpha\log(L) + \beta\log(K)$$
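A minimal sketch of that estimation on simulated data follows. The "true" values of $A$, $\alpha$ and $\beta$, the input ranges, and the multiplicative error term are all assumptions made up for this illustration.

```r
# Simulate a Cobb-Douglas process with multiplicative noise (hypothetical values).
set.seed(123)
n <- 200
L <- runif(n, 1, 100)                  # labour input
K <- runif(n, 1, 100)                  # capital input
A_true <- 2; alpha_true <- 0.7; beta_true <- 0.3
Y <- A_true * L^alpha_true * K^beta_true * exp(rnorm(n, sd = 0.05))

# OLS on the logged equation: log(Y) = log(A) + alpha*log(L) + beta*log(K)
fit <- lm(log(Y) ~ log(L) + log(K))
est <- coef(fit)

# The intercept estimates log(A), so exponentiate it to get back to A.
print(round(c(A = exp(est[[1]]), alpha = est[[2]], beta = est[[3]]), 3))
```

The slope estimates can be read directly as output elasticities, which ties this theoretical reason back to the interpretation reason above.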

5"Log Y and X - a one unit increase in X would lead to a β∗100 % increase/decrease in Y": I think this applies only when β is small so that exp(β) ≈ 1 + β – Ida – 2014-02-26T07:43:56.033

1nice and clear thanks! One question, how do you interpret intercepts in the Log Y and X case? and generally I'm troubled about how to report log transformed regressions... – Bakaburg – 2014-04-08T19:41:38.180

2I'm a sucker for answers that contain examples from Economics ["You had me at '**Cobb-Douglas Production Function**'"].... One thing, though: You should change the intercept term in the second equation to **log(A)** to make it consistent with the first equation. – Steve S – 2014-07-11T22:55:25.680

@Ida indeed. For the interested reader, my post here describes why, for logged "y", the analyst should interpret $100 \times (e^\beta-1)$ as the percent change. – AdamO – 2017-12-29T22:42:12.433

20

For more on whuber's excellent point about reasons to prefer the logarithm to some other transformations such as a root or reciprocal, but focussing on the unique *interpretability* of the regression coefficients resulting from log-transformation compared to other transformations, see:

Oliver N. Keene. The log transformation is special. *Statistics in Medicine* 1995; 14(8):811-819. DOI:10.1002/sim.4780140810. (PDF of dubious legality available at http://rds.epi-ucsf.org/ticr/syllabus/courses/25/2009/04/21/Lecture/readings/log.pdf).

If you log the *independent* variable *x* to base *b*, you can interpret the regression coefficient (and CI) as the change in the dependent variable *y* per *b*-fold increase in *x*. (Logs to base 2 are therefore often useful as they correspond to the change in *y* per doubling in *x*, or logs to base 10 if *x* varies over many orders of magnitude, which is rarer). Other transformations, such as square root, have no such simple interpretation.
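The "change in *y* per doubling of *x*" reading can be verified with a tiny noiseless example. The numbers are invented, and the data are constructed so that the true slope on the $\log_2$ scale is exactly 2.

```r
# Noiseless toy data: y rises by exactly 2 every time x doubles.
x <- c(1, 2, 4, 8, 16, 32)
y <- 3 + 2 * log2(x)

fit   <- lm(y ~ log2(x))
slope <- coef(fit)[["log2(x)"]]        # change in y per doubling of x

# Predicted change in y when x goes from 4 to 8 (one doubling).
pred <- predict(fit, newdata = data.frame(x = c(4, 8)))
print(c(slope = slope, change_per_doubling = unname(diff(pred))))

# Changing the base only rescales the coefficient: per 10-fold increase in x
# the change in y is slope * log2(10).
print(slope * log2(10))
```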

If you log the *dependent* variable *y* (not the original question but one which several of the previous answers have addressed), then I find Tim Cole's idea of 'sympercents' attractive for presenting the results (I even used them in a paper once), though they don't seem to have caught on all that widely:

Tim J Cole. Sympercents: symmetric percentage differences on the 100 log(e) scale simplify the presentation of log transformed data. *Statistics in Medicine* 2000; 19(22):3109-3125. DOI:10.1002/1097-0258(20001130)19:22<3109::AID-SIM558>3.0.CO;2-F [I'm so glad *Stat Med* stopped using SICIs as DOIs...]
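Going by the title of Cole's paper, a sympercent difference is 100 times the difference in natural logs. The sketch below, with two arbitrary values, shows the symmetry that ordinary percentage differences lack.

```r
# Two arbitrary measurements to compare.
a <- 50
b <- 60

pct_up   <- 100 * (b - a) / a              # +20: b relative to a
pct_down <- 100 * (a - b) / b              # about -16.7: a relative to b

symp_up   <- 100 * (log(b) - log(a))       # sympercent difference, a -> b
symp_down <- 100 * (log(a) - log(b))       # same size, opposite sign

print(round(c(pct_up = pct_up, pct_down = pct_down,
              symp_up = symp_up, symp_down = symp_down), 2))
```

The ordinary percentages depend on which value is used as the baseline; the log-scale difference does not.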

1Thanks for the reference and very good points. The question of interest is whether this issue applies to all transformations, not just logs. To us statistics/probability is useful insomuch as it allows effective performance prediction, or effective criteria/guidance. Over the years we've used power transformations (logs by another name), polynomial transformations, and others (even piecewise transformations) to try to reduce the residuals, tighten the confidence intervals and generally improve predictive capability from a given set of data. Are we now saying this is incorrect? – AsymLabs – 2013-10-21T09:53:02.850

1@AsymLabs, how separate are Breiman's Two cultures (roughly predictors and modellers)? Cf. Two cultures -- contentious. – denis – 2014-01-20T11:08:17.930

13

One typically takes the log of an input variable to scale it and change the distribution (e.g. to make it normally distributed). It cannot be done blindly however; you need to be careful when making any scaling to ensure that the results are still interpretable.

This is discussed in most introductory statistics texts. You can also read Andrew Gelman's paper on "Scaling regression inputs by dividing by two standard deviations" for a discussion on this. He also has a very nice discussion on this at the beginning of "Data Analysis Using Regression and Multilevel/Hierarchical Models".

Taking the log is not an appropriate method for dealing with bad data/outliers.

11

You tend to take logs of the data when there is a problem with the residuals. For example, if you plot the residuals against a particular covariate and observe an increasing/decreasing pattern (a funnel shape), then a transformation may be appropriate. Non-random residuals usually indicate that your model assumptions are wrong, i.e. non-normal data.

Some data types automatically lend themselves to logarithmic transformations. For example, I usually take logs when dealing with concentrations or age.

Although transformations aren't primarily used to deal with outliers, they do help since taking logs squashes your data.

1But still, using log changes the model -- for linear regression it is y~a*x+b, for linear regression on log it is y~y0*exp(x/x0). – mbq – 2010-07-20T14:50:05.523

@whuber "Non-random residuals usually indicate that your model assumptions are wrong, i.e. non-normal data." Elsewhere on this site I was given to understand that OLS imposes no distributional assumptions on the underlying data, but imposes such conditions only on the residuals when you are doing normal-theory inference. So am I misunderstanding the above, or is it poorly worded? – landroni – 2014-03-26T14:41:07.943

1@landroni It's briefly worded. I wouldn't say it's poor, except it's likely "e.g." was intended instead of "i.e." I understand the use of "random" here in the sense of "independent and identically distributed," which indeed is the most general assumption assumed by OLS. In *some* settings people additionally assume this common underlying distribution is normal, but that is not strictly necessary in practice or in theory: all that is necessary is that the sampling distributions of relevant statistics be close to normal. – whuber – 2014-03-26T14:52:29.613

1I agree - taking logs changes your model. But if you have to transform your data, that implies that your model wasn't suitable in the first place. – csgillespie – 2010-07-20T16:00:09.427

2@cgillespie: Concentrations, yes; but age? That is strange. – whuber – 2010-10-12T19:02:52.670

@whuber: I suppose it's very data dependent, but the data sets I used, you would see a big difference between a 10 and 18 yr old, but a small difference between a 20 and 28 yr old. Even for young children the difference between a 0-1 yr old isn't the same as the difference between a 1-2. – csgillespie – 2010-10-12T20:00:38.490

Yes it will be data dependent: your ability to conduct an insightful and effective analysis is the ultimate arbiter of this issue, not my preconceptions. I was just trying to envision situations where age as an *independent* variable would merit such a strong transformation. Some strange things will happen with newborns, too ;-). – whuber – 2010-10-12T21:40:22.480

9

I would like to respond to user1690130's question that was left as a comment to the first answer on Oct 26 '12 and reads as follows: *"What about variables like population density in a region or the child-teacher ratio for each school district or the number of homicides per 1000 in the population? I have seen professors take the log of these variables. It is not clear to me why. For example, isn't the homicide rate already a percentage? The log would be the percentage change of the rate? Why would the log of child-teacher ratio be preferred?"*

I was looking to answer a similar problem and wanted to share what my old stats coursebook (*Jeffrey Wooldridge. 2006. Introductory Econometrics - A Modern Approach, 4th Edition. Chapter 6 Multiple Regression Analysis: Further Issues. 191*) says about it. Wooldridge advises:

Variables that appear in a proportion or percent form, such as the unemployment rate, the participation rate in a pension plan, the percentage of students passing a standardized exam, and the arrest rate on reported crimes - can appear in either the original or logarithmic form, although there is a tendency to use them in level forms. This is because any regression coefficients involving the original variable - whether it is the dependent or the independent variable - will have a percentage point change interpretation. If we use, say, log(*unem*) in a regression, where *unem* is the percentage of unemployed individuals, we must be very careful to distinguish between a percentage point change and a percentage change. Remember, if *unem* goes from 8 to 9, this is an increase of one percentage point, but a 12.5% increase from the initial unemployment level. Using the log means that we are looking at the percentage change in the unemployment rate: log(9) - log(8) = 0.118 or 11.8%, which is the logarithmic approximation to the actual 12.5% increase.

Based on this and piggybacking on whuber's earlier comment to user1690130's question, I would avoid using the logarithm of a density or percentage rate variable, to keep interpretation simple, unless the log form offers a major compensating benefit such as reducing the skewness of the density or rate variable.

8

Transformation of an independent variable $X$ is one occasion where one can just be empirical without distorting inference as long as one is honest about the number of degrees of freedom in play. One way is to use regression splines for continuous $X$ not already known to act linearly. To me it's not a question of log vs. original scale; it's a question of which transformation of $X$ fits the data. Normality of residuals is not a criterion here.

When $X$ is extremely skewed, cubing $X$ as is needed in cubic spline functions results in extreme values that can sometimes cause numerical problems. I solve this by fitting the cubic spline function on $\sqrt[3]{X}$. The R `rms` package considers the innermost variable as the predictor, so plotting predicted values will have $X$ on the $x$-axis. Example:

```
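library(rms)   # provides datadist(), ols(), rcs(), Predict()
# 'mydata' is assumed to be a data frame with columns y and X
dd <- datadist(mydata); options(datadist='dd')
cr <- function(x) x ^ (1/3)   # cube root; returns NaN for negative x
f <- ols(y ~ rcs(cr(X), 5), data=mydata)   # restricted cubic spline, 5 knots
ggplot(Predict(f)) # plot spline of cr(X) against X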
```

This fits a restricted cubic spline in $\sqrt[3]{X}$ with 5 knots at default quantile locations. The $X$ fit has 4 d.f. (one linear term, 3 nonlinear terms). Confidence bands and tests of association respect these 4 d.f., fully recognizing "transformation uncertainty".

(+1) If there is any ambiguity about the functional form of $E[Y|X] = f(X)$, provided there are sufficient data, the analyst should use smoothing procedures like splines or local regression instead of "eyeballing the best fit". For inference, log and linear trends often agree about direction and magnitude of associations. The main benefit of a log transform is interpretation. – AdamO – 2017-12-29T22:45:20.200

3

Shane's point that taking the log is not an appropriate way to deal with bad data is well taken, as is Colin's regarding the importance of normal residuals. In practice I find that usually you can get normal residuals if the input and output variables are also relatively normal. This means eyeballing the distributions of the transformed and untransformed datasets and assuring oneself that they have become more normal, and/or conducting tests of normality (e.g. Shapiro-Wilk or Kolmogorov-Smirnov tests) and determining whether the outcome is more normal. Interpretability and tradition are also important. For example, in cognitive psychology log transforms of reaction time are often used; however, to me at least, the interpretation of a log RT is unclear. Furthermore, one should be cautious using log transformed values, as the shift in scale can change a main effect into an interaction and vice versa.
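The last caution, that a change of scale can turn a main effect into an interaction and vice versa, can be seen in a made-up 2 x 2 table of cell means. The factor names and values below are hypothetical: each factor doubles the response, so the effects are purely multiplicative.

```r
# Hypothetical cell means for a 2 x 2 design (filled column by column).
cell_means <- matrix(c(10, 20, 20, 40), nrow = 2,
                     dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))

# Interaction contrast: (a2,b2) - (a2,b1) - (a1,b2) + (a1,b1).
interaction_contrast <- function(m) m[2, 2] - m[2, 1] - m[1, 2] + m[1, 1]

raw    <- interaction_contrast(cell_means)        # 40 - 20 - 20 + 10 = 10
logged <- interaction_contrast(log(cell_means))   # zero up to rounding error

print(c(raw = raw, logged = logged))
```

On the raw scale the table shows an interaction; on the log scale the same means are exactly additive, so which description is "right" depends on the scale that makes scientific sense.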

@whuber: Agreed. That is why I specified "become more normal". The aim should be to eyeball the test statistic for changes rather than an accept/reject decision based on the p-value of the test. – russellpierce – 2013-03-03T16:02:31.320

2Answers will be reordered based on votes, so please try not to refer to other answers. – Vebjorn Ljosa – 2010-07-20T15:54:34.873

4A test of normality is usually too severe. Often it suffices to obtain symmetrically distributed residuals. (In practice, residuals tend to have strongly peaked distributions, partly as an artifact of estimation I suspect, and therefore will test out as "significantly" non-normal no matter how one re-expresses the data.) – whuber – 2010-10-12T19:02:17.600

Why just the log? Shouldn't this question apply to any data transformation technique that can be used to minimize the residuals associated with mx+b? – AsymLabs – 2013-10-19T19:19:24.720

1Are you asking about how to reduce the effect of outliers or when to use the log of some variable? – Benjamin Bannier – 2010-07-20T13:14:06.983

19I think that the OP is saying "I've heard of people using the log on input variables: why do they do that?" – Shane – 2010-07-20T13:24:23.870

1@AsymLabs - The log might be special in regression, as it is the only function that converts a product into a summation. – probabilityislogic – 2014-03-30T08:51:55.933

7A warning to readers: The question asks about transforming IVs, but some of the answers appear to be talking about reasons to transform DVs. Don't be misled into thinking those are all also reasons to transform IVs -- some can be, others certainly aren't. In particular, the distribution of the IV is not generally of relevance (indeed, the marginal distribution of the DV isn't either). – Glen_b – 2016-03-06T22:10:14.177