Why should we scale/standardize values of variables and how to reverse this transformation?

When working with multivariable prediction algorithms I came across R's scale function, whose purpose is to scale/standardize the values of the variables.

I have no difficulty using the scale function; my question is specifically conceptual.

Why should I scale the values of my variables? What is the goal? Does it make a difference, for example, in the accuracy of my algorithm’s prediction model? And how can I reverse the transformation?

  • See this post on Cross Validated. But disregard one of the reasons given there, numerical stability: nowadays that is no longer quite true, computers are much better. This was said on R-Help some time ago.

  • Thanks for the references @Rui Barradas.

1 answer

Should I scale my inputs? The answer is: it depends.

The truth is that scaling your data will not worsen the result, so when in doubt, scale.

Cases where you should scale

  1. If the model is based on the distance between points, such as clustering algorithms (k-means) or dimensionality reduction (PCA), then it is necessary to scale/normalize the inputs. Take this example:

Starting from the data:

    Ano  Preco
0  2000   2000
1  2010   3000
2  1970   2500

The Euclidean distance matrix is:

       0       1       2   
0 [[   0.   1000.05  500.9 ]
1  [1000.05    0.    501.6 ]
2  [ 500.9   501.6     0.  ]]

We observe that preco dictates the distance, because its absolute values are much larger than those of ano. But when we normalize to [0, 1], the result changes dramatically (a code sketch reproducing this example appears after this item):

   Ano_norm  Preco_norm
0      0.75         0.0
1      1.00         1.0
2      0.00         0.5

The new Euclidean distance matrix is:

      0    1    2 
0 [[0.   1.03 0.9 ]
1  [1.03 0.   1.12]
2  [0.9  1.12 0.  ]]

Another example, referring to the PCA, is this one.
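A minimal sketch of the calculation above, assuming Python with pandas and SciPy (the original post shows only the data and the resulting matrices, not code):

    import pandas as pd
    from scipy.spatial import distance_matrix

    df = pd.DataFrame({"Ano": [2000, 2010, 1970], "Preco": [2000, 3000, 2500]})

    # On the raw data, Preco dominates the Euclidean distances
    print(distance_matrix(df.values, df.values))

    # Min-max normalization of each column to [0, 1]
    df_norm = (df - df.min()) / (df.max() - df.min())
    print(distance_matrix(df_norm.values, df_norm.values))

The two printed matrices match the ones shown above, up to rounding.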

  2. For algorithms like neural networks (see this reference), which use gradient descent and activation functions, scaling the inputs:
    • Gives features that are only positive both a negative and a positive part, which makes training easier.
    • Prevents computations from returning values such as NaN (Not a Number) during training.
    • Avoids inputs on different scales causing the weights connected to them to be updated at different rates (some faster than others), which impairs learning.

Normalizing the outputs is also important, because of the activation function of the last layer.

In this case, to go back to the original output scale, simply store the values used to normalize and apply the inverse calculation. For example:

To normalize:

X_norm = (X - X_min)/(X_max - X_min)

To return to the original scale:

X = X_norm * (X_max - X_min) + X_min
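A minimal sketch of that round trip, written in Python/NumPy purely for illustration (the same idea works in any language). For R's scale(), which by default standardizes to zero mean and unit standard deviation, the stored mean and standard deviation play the role of X_min and X_max, and they are kept in the result's "scaled:center" and "scaled:scale" attributes:

    import numpy as np

    X = np.array([2000.0, 3000.0, 2500.0])

    # Min-max normalization: keep X_min and X_max to undo it later
    X_min, X_max = X.min(), X.max()
    X_norm = (X - X_min) / (X_max - X_min)
    X_back = X_norm * (X_max - X_min) + X_min   # recovers the original X

    # Standardization (what scale() does by default): keep mean and sd
    mu, sigma = X.mean(), X.std()
    Z = (X - mu) / sigma
    X_back2 = Z * sigma + mu                    # also recovers the original X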

Cases where scaling is not necessary

  1. Algorithms based on splits (cuts), such as Decision Tree and Random Forest (a quick check appears below).
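A quick check of that claim, assuming Python with scikit-learn and made-up data: a decision tree fitted on raw inputs and one fitted on min-max scaled inputs give the same predictions, because the splits depend only on the ordering of each feature, not on its scale.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[2000, 2000], [2010, 3000], [1970, 2500]], dtype=float)
    y = np.array([1.0, 2.0, 3.0])

    # Min-max scale each column to [0, 1]
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    tree_raw = DecisionTreeRegressor(random_state=0).fit(X, y)
    tree_norm = DecisionTreeRegressor(random_state=0).fit(X_norm, y)

    print(tree_raw.predict(X))        # same predictions...
    print(tree_norm.predict(X_norm))  # ...as here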

Other cases

For some algorithms, such as linear regression, scaling is not mandatory and does not improve accuracy. Scaling the inputs (or not) only changes the coefficients found. However, when the inputs have different magnitudes (as with ano and preco in the example above), the coefficients can only be compared with each other if the inputs are scaled. In other words, if you want interpretability, scale the inputs.
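A minimal sketch of this point, again assuming Python with scikit-learn and made-up data: the predictions are identical with or without scaling, but only the coefficients of the scaled model are directly comparable to each other.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[2000, 2000], [2010, 3000], [1970, 2500]], dtype=float)
    y = np.array([10.0, 20.0, 15.0])

    # Min-max scale each column to [0, 1]
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    raw = LinearRegression().fit(X, y)
    norm = LinearRegression().fit(X_norm, y)

    print(raw.predict(X))        # same predictions...
    print(norm.predict(X_norm))  # ...as here
    print(raw.coef_)   # coefficients on different scales, hard to compare
    print(norm.coef_)  # coefficients on the same scale, comparable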

  • Thanks for the reply @Alexciuffa, it cleared up many doubts. If I may, I would like to ask one more thing: you mentioned that for linear regression scaling does not improve accuracy, but would the same hold if I use ARIMA?

  • Since ARIMA only uses the series' own past values as input, all inputs are already on the same scale, so there is no need to scale them. Moreover, the output is on the same scale as the inputs.

  • I get it. Even if I also include multiple variables in the ARIMA model, is there still no need? I ran some tests and saw no difference, though I do not know how much it changes interpretability.

  • ARIMA is a linear regression whose inputs are one (or more) lagged values of the series. So scaling the inputs will not improve the result.

  • Got it. Thanks a lot for the clarification.
