So, I was working on a regression problem and my \(y\) values, in theory, would fall in the range \([0,1]\). In reality, though, most of them were crowded between \(0.9\) and \(1.0\). I thought that I could apply some transformation and distribute them more evenly, without thinking about it too much.

Therefore, I applied the arcsine square root transformation, again without checking when this transformation actually makes sense.

\[z = \arcsin(\sqrt{y})\]
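
To make this concrete, here is a minimal sketch (not my original code, and the data are made up) of what the transform does to values bunched near \(1\):

```python
# Sketch: applying the arcsine square root transform to synthetic y values
# crowded between 0.9 and 1.0 (hypothetical data, not the real data set).
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0.9, 1.0, size=1000)   # values bunched near 1

z = np.arcsin(np.sqrt(y))              # z = arcsin(sqrt(y))

print(y.min(), y.max())                # roughly 0.90 .. 1.00
print(z.min(), z.max())                # roughly 1.25 .. 1.57 (= pi/2)
```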

After running my code on the transformed data set, I noticed that not only did the model not perform better, the results were actually much worse.

The reason behind this failure is, I suspect, that my data were noisy and the transformation inflated the error. You can check this Wikipedia article on propagation of uncertainty. A common formula for error propagation is the following: assuming that \(z = f(x, y, \ldots)\), the error in \(z\) is given by:

\[s_z = \sqrt{ \left(\frac{\partial f}{\partial x}\right)^2 s_x^2 + \left(\frac{\partial f}{\partial y} \right)^2 s_y^2 + \cdots}\]

where \(s_z\) is the standard deviation of \(z\) (the value of \(f\)), \(s_x\) is the standard deviation of \(x\), \(s_y\) is the standard deviation of \(y\), and so forth.
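
As a quick sanity check of this formula, here is a small sketch on a toy function \(f(x, y) = x y\) with made-up means and standard deviations, comparing the formula to a Monte Carlo estimate:

```python
# Sanity check of the propagation formula on f(x, y) = x * y.
# The means and standard deviations below are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x0, sx = 2.0, 0.05
y0, sy = 3.0, 0.04

# Formula: s_z^2 = (df/dx)^2 sx^2 + (df/dy)^2 sy^2, with df/dx = y, df/dy = x
s_z_formula = np.sqrt((y0 * sx) ** 2 + (x0 * sy) ** 2)

# Monte Carlo: sample x and y around their means and look at the spread of z
x = rng.normal(x0, sx, size=200_000)
y = rng.normal(y0, sy, size=200_000)
s_z_mc = np.std(x * y)

print(s_z_formula, s_z_mc)   # the two numbers should agree closely
```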

So, in my case, with \(z\) a function of \(y\) alone, it was simply this:

\[s_z = \frac{\mathrm{d} z}{\mathrm{d} y} s_y \Rightarrow s_z = \frac{1}{2\sqrt{y (1-y)}} s_y\]

And since my \(y\)’s were very close to \(1\), naturally \(s_z\) exploded.
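
You can see the blow-up numerically. In this sketch the noise level \(s_y = 0.01\) is just a made-up number:

```python
# How the amplification factor dz/dy = 1 / (2 * sqrt(y * (1 - y))) grows as
# y approaches 1: a fixed s_y gets multiplied by an ever larger factor.
import numpy as np

s_y = 0.01                                  # hypothetical noise level in y
for y in [0.5, 0.9, 0.99, 0.999, 0.9999]:
    factor = 1.0 / (2.0 * np.sqrt(y * (1.0 - y)))
    print(f"y = {y:<7}  dz/dy = {factor:8.2f}  s_z = {factor * s_y:.4f}")
```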

Moral: don’t try random stuff; make educated guesses.

Exercise for the reader: What data transformation would actually reduce my error? Can you think of a function \(z = f(y)\) such that, when you calculate \(\frac{\mathrm{d} z}{\mathrm{d} y}\), \(s_z\) is smaller than \(s_y\) for \(y\) values close to \(1\)?