Why should we scale/standardize values of variables and how to reverse this transformation?

Question

Asked 5 years, 7 months ago

Viewed 572 times

12

While working with multivariable prediction algorithms I came across R's scale function, whose purpose is to scale/standardize the values of the variables.

I have no difficulty using the scale function; my question is purely conceptual.

Why should I scale the values of my variables? What is the goal? Does it make a difference, for example, in the accuracy of my algorithm’s prediction model? And how can I reverse the transformation?

1

See this Cross Validated post. But disregard one of the reasons given there, numerical stability: nowadays that is no longer quite true, since computers are much better. This was said on R-help some time ago.

– Rui Barradas

2020/01/06 at 20:23

Thanks for the references @Rui Barradas.

– Izak Mandrak

2020/01/07 at 19:20

1 answer

by AlexCiuffa • **2,402** points · Answer 1 · 2020-01-07T00:12:02+00:00

Should I scale my inputs? The answer is: it depends.

In practice, scaling your data will not make the result worse, so when in doubt, scale.

Cases where scaling is necessary

If the model relies on distances between points, as clustering algorithms (k-means) do, or on variances, as dimensionality reduction (PCA) does, then you need to scale/normalize its inputs. Consider the example:

Starting from the data:

    Ano  Preco
0  2000   2000
1  2010   3000
2  1970   2500

The Euclidean distance matrix is:

       0       1       2   
0 [[   0.   1000.05  500.9 ]
1  [1000.05    0.    501.6 ]
2  [ 500.9   501.6     0.  ]]

Note that the difference in Preco (price) dictates what the distance will be, because its values are spread over a much wider range than those of Ano (year). But when we normalize both columns to [0, 1], the result changes dramatically:

   Ano_norm  Preco_norm
0      0.75         0.0
1      1.00         1.0
2      0.00         0.5

The new Euclidean distance matrix is:

      0    1    2 
0 [[0.   1.03 0.9 ]
1  [1.03 0.   1.12]
2  [0.9  1.12 0.  ]]
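The two matrices above can be reproduced with a few lines of code. This is only a minimal sketch using the Python standard library (math.dist needs Python 3.8 or later); the helper names distance_matrix and min_max_columns are my own, and it assumes no column is constant, otherwise the min-max step would divide by zero.

```python
import math

# Rows from the example above: (Ano, Preco)
rows = [(2000, 2000), (2010, 3000), (1970, 2500)]


def distance_matrix(points):
    """Pairwise Euclidean distances, rounded to 2 decimals."""
    return [[round(math.dist(p, q), 2) for q in points] for p in points]


def min_max_columns(points):
    """Rescale every column to [0, 1] using that column's own min and max.

    Assumes each column has at least two distinct values.
    """
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


raw_distances = distance_matrix(rows)
normalized_rows = min_max_columns(rows)
normalized_distances = distance_matrix(normalized_rows)

for line in raw_distances:
    print(line)
for line in normalized_distances:
    print(line)
```

Before normalizing, the closest pair is rows 0 and 2, almost entirely because of Preco. After normalizing, Ano contributes as much as Preco and rows 1 and 2 become the farthest apart.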

Another example, concerning PCA, is this one.

For algorithms such as neural networks (see this reference), which rely on gradient descent and activation functions, scaling the inputs:
- Lets features that are only positive have both a negative and a positive part, which makes training easier.
- Prevents computations from returning values such as NaN (Not a Number) during training.
- Keeps the weight updates balanced. If inputs are on different scales, the weights connected to them are updated at different rates (some faster than others), which impairs learning.
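To make the last point concrete, here is a rough sketch of my own, not taken from the reference: a single linear neuron with squared-error loss, for which the gradient with respect to each weight is the prediction error times the corresponding input. The input values and the target are made up for illustration.

```python
def gradient(weights, inputs, target):
    """Gradient of 0.5 * (prediction - target)**2 for one linear neuron."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = prediction - target
    # d(loss)/d(w_j) = error * x_j, so a larger input means a larger update.
    return [error * x for x in inputs]


weights = [0.0, 0.0]
target = 1.0  # hypothetical target value

unscaled_inputs = [0.75, 2000.0]  # hypothetical: one feature in [0, 1], one raw
scaled_inputs = [0.75, 0.5]       # hypothetical: both features in [0, 1]

grad_unscaled = gradient(weights, unscaled_inputs, target)
grad_scaled = gradient(weights, scaled_inputs, target)

print(grad_unscaled)
print(grad_scaled)
```

With the unscaled inputs the second weight receives an update thousands of times larger than the first, so no single learning rate suits both. With the scaled inputs the two gradients have the same order of magnitude.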

Normalizing the outputs is also important, because of the activation function of the last layer.

In that case, to get back to the original output scale, simply store the values used to normalize and apply the inverse operation. For example:

To normalize:

X_norm = (X - X_min)/(X_max - X_min)

To return to the original scale:

X = X_norm * (X_max - X_min) + X_min
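The same idea works for standardization, which is what R's scale does by default: it subtracts the column mean and divides by the column standard deviation, and it keeps both values as the attributes scaled:center and scaled:scale of the result, so they can be used to undo the transformation. Below is a minimal sketch of both round trips in plain Python. The function names are my own and the data is the Preco column from the example; treat it as an illustration, not as a port of scale.

```python
import statistics


def fit_min_max(values):
    """Return the parameters that must be stored to undo the scaling."""
    return min(values), max(values)


def min_max(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


def undo_min_max(scaled, lo, hi):
    return [s * (hi - lo) + lo for s in scaled]


def fit_standardize(values):
    # Sample standard deviation (n - 1 denominator).
    return statistics.mean(values), statistics.stdev(values)


def standardize(values, center, spread):
    return [(v - center) / spread for v in values]


def undo_standardize(scaled, center, spread):
    return [s * spread + center for s in scaled]


preco = [2000, 3000, 2500]

# Min-max: store lo and hi, then invert with them.
lo, hi = fit_min_max(preco)
preco_norm = min_max(preco, lo, hi)
preco_back = undo_min_max(preco_norm, lo, hi)

# Standardization: store center and spread, then invert with them.
center, spread = fit_standardize(preco)
preco_std = standardize(preco, center, spread)
preco_std_back = undo_standardize(preco_std, center, spread)
```

The important design point is that the parameters are computed once (ideally on the training data only) and reused both for new data and for the inverse step.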

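Cases where scaling is not necessary

Split-based algorithms such as decision trees and random forests. Each split compares a single variable against a threshold, so rescaling a variable moves the threshold but not the resulting partition of the data.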
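A small sketch of why tree-style splits do not care about the scale: a one-variable, one-split classifier (a "stump") picks the cut with the fewest misclassified rows, and that choice depends only on the ordering of the values. The labels below are invented for illustration and the helper names are my own. This is not how any particular library implements its trees.

```python
def minority_count(indices, labels):
    """Rows misclassified if the group predicts its majority label (0 or 1)."""
    ones = sum(labels[i] for i in indices)
    return min(ones, len(indices) - ones)


def best_split_partition(feature, labels):
    """Return the set of row indices sent to the left by the best single cut."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        # A threshold cannot separate two equal values.
        if feature[order[cut - 1]] == feature[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        errors = minority_count(left, labels) + minority_count(right, labels)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left


preco = [2000, 3000, 2500, 2200, 2900]
labels = [0, 1, 1, 0, 1]  # hypothetical class labels

lo, hi = min(preco), max(preco)
preco_norm = [(v - lo) / (hi - lo) for v in preco]

raw_partition = best_split_partition(preco, labels)
scaled_partition = best_split_partition(preco_norm, labels)

print(raw_partition == scaled_partition)
```

The same rows end up on each side of the split whether Preco is raw or normalized. Only the numeric value of the threshold differs.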

Other cases

For some algorithms, such as linear regression, scaling is not mandatory and does not improve accuracy. Whether or not you scale the inputs only changes the coefficients found. However, when the inputs have different magnitudes (as with Ano and Preco in the example above), the coefficients can only be compared with one another if the inputs are scaled. That is, if you want interpretability, scale the inputs.
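As a rough check of the first half of that claim, here is a sketch with a single predictor fitted by ordinary least squares. Using Ano as the input and Preco as the response is purely my assumption for illustration (in the example above both are inputs), and fit_line is my own helper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


ano = [2000, 2010, 1970]
preco = [2000, 3000, 2500]  # treated as the response, for illustration only

lo, hi = min(ano), max(ano)
ano_norm = [(v - lo) / (hi - lo) for v in ano]

intercept_raw, slope_raw = fit_line(ano, preco)
intercept_norm, slope_norm = fit_line(ano_norm, preco)

pred_raw = [intercept_raw + slope_raw * v for v in ano]
pred_norm = [intercept_norm + slope_norm * v for v in ano_norm]

print(slope_raw, slope_norm)
print(pred_raw)
print(pred_norm)
```

The slope on the normalized input is the raw slope multiplied by (hi - lo), while the fitted values are the same up to floating-point error. So scaling changes how the coefficients read, not how well the line fits.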