Should I scale my inputs? The answer is: it depends.
The truth is that scaling your data will not worsen the result, so when in doubt, scale.
Cases where scaling is necessary
- If the model is based on the distance between points, such as clustering algorithms (k-means) or dimensionality reduction (PCA), then it is necessary to scale/normalize its inputs. Take this example:
Starting from the data:
   Ano   Preco
0  2000  2000
1  2010  3000
2  1970  2500
The Euclidean distance matrix is:
         0        1        2
0 [[   0.    1000.05   500.9 ]
1  [1000.05     0.     501.6 ]
2  [ 500.9    501.6      0.  ]]
We observe that Preco dictates what the distance will be, because its absolute values are much larger than those of Ano. But when we normalize both columns to [0, 1], the result changes dramatically:
   Ano_norm  Preco_norm
0  0.75      0.0
1  1.00      1.0
2  0.00      0.5
The new Euclidean distance matrix is:
      0     1     2
0 [[0.    1.03  0.9 ]
1  [1.03  0.    1.12]
2  [0.9   1.12  0.  ]]
Another example, this time for PCA, is this one.
- For algorithms like neural networks (see this reference), which use gradient descent and activation functions, scaling the inputs:
  - Lets features that are only positive take both negative and positive values (when the scaling centers them), which makes training easier.
  - Prevents calculations from returning values such as Not a Number (NaN) during training.
  - Avoids uneven updates: if inputs are at different scales, the weights connected to them are updated at different rates (some faster than others), which impairs learning.
Normalizing the outputs is also important, because of the activation function of the last layer.
In this case, to go back to the original output scale, simply store the values used to normalize and apply the inverse calculation. E.g.:
To normalize:
X_norm = (X - X_min)/(X_max - X_min)
To return to the original scale:
X = X_norm * (X_max - X_min) + X_min
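The distance example and the two formulas above can be reproduced with a minimal sketch in plain Python. The helper names (`distance_matrix`, `min_max_fit`, `normalize`, `denormalize`) are just illustrative, not from any library, and the sketch assumes no column is constant (otherwise X_max - X_min would be zero):

```python
import math

# Data from the example above: one (Ano, Preco) pair per row.
rows = [(2000, 2000), (2010, 3000), (1970, 2500)]


def distance_matrix(points):
    """Pairwise Euclidean distances, rounded to 2 decimals for display."""
    return [[round(math.dist(p, q), 2) for q in points] for p in points]


def min_max_fit(points):
    """Store the per-column minimum and maximum, needed to normalize and to invert."""
    columns = list(zip(*points))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(points, mins, maxs):
    """X_norm = (X - X_min) / (X_max - X_min), column by column."""
    return [
        tuple((x - lo) / (hi - lo) for x, lo, hi in zip(p, mins, maxs))
        for p in points
    ]


def denormalize(points, mins, maxs):
    """X = X_norm * (X_max - X_min) + X_min, column by column."""
    return [
        tuple(x * (hi - lo) + lo for x, lo, hi in zip(p, mins, maxs))
        for p in points
    ]


mins, maxs = min_max_fit(rows)
rows_norm = normalize(rows, mins, maxs)
rows_back = denormalize(rows_norm, mins, maxs)

print(distance_matrix(rows))       # distances dominated by Preco
print(distance_matrix(rows_norm))  # both columns now contribute
print(rows_back)                   # original values, up to float rounding
```

The important design point is that `mins` and `maxs` are computed once and kept: the same stored values must be used both to normalize new data and to undo the normalization of the model's outputs.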
Cases where scaling is not necessary
- Split-based (threshold-based) algorithms such as Decision Tree and Random Forest.
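One way to see why: a tree partitions the data with thresholds on a single feature, and such a partition depends only on the ordering of the values, which min-max scaling preserves. A rough sketch with a one-split "stump" follows; `best_split` is a hypothetical helper written for this illustration (not a library function), the labels are made up, and the feature values are assumed to be distinct:

```python
def best_split(xs, ys):
    """Return the indices on the left side of the threshold split that
    misclassifies the fewest points (majority label predicted on each side)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        errors = 0
        for side in (left, right):
            labels = [ys[i] for i in side]
            # Points not belonging to the majority label count as errors.
            errors += len(labels) - max(labels.count(v) for v in set(labels))
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left


# Hypothetical data: a price-like feature and a made-up binary label.
prices = [2000, 3000, 2500, 1200, 4100]
labels = [0, 1, 1, 0, 1]

lo, hi = min(prices), max(prices)
prices_scaled = [(p - lo) / (hi - lo) for p in prices]

# The same rows end up on the left of the split, scaled or not.
print(best_split(prices, labels) == best_split(prices_scaled, labels))
```

Only the numeric value of the threshold changes after scaling; the rows that fall on each side of it do not.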
Other cases
For some algorithms such as linear regression, scaling is not mandatory and does not improve accuracy. Scaling the inputs or not will only change the coefficients found. However, when the inputs have different magnitudes (as with Ano and Preco in the example above), the coefficients found can only be compared with each other if the inputs are scaled. That is, if you want interpretability, scale the inputs.
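A small sketch of the first point, that scaling changes the coefficients but not the fit. It uses a one-variable least-squares line, with `fit_line` as an illustrative helper; regressing Preco on Ano is an assumption made only to reuse the numbers from the example above:

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


ano = [2000, 2010, 1970]
preco = [2000, 3000, 2500]

# Min-max normalize the input only.
lo, hi = min(ano), max(ano)
ano_norm = [(x - lo) / (hi - lo) for x in ano]

b0_raw, b1_raw = fit_line(ano, preco)
b0_norm, b1_norm = fit_line(ano_norm, preco)

pred_raw = [b0_raw + b1_raw * x for x in ano]
pred_norm = [b0_norm + b1_norm * x for x in ano_norm]

print(b1_raw, b1_norm)  # different coefficients...
print(all(math.isclose(a, b) for a, b in zip(pred_raw, pred_norm)))  # ...same predictions
```

The slope on the normalized input is the raw slope multiplied by (X_max - X_min), which is why coefficients of inputs with different ranges are not directly comparable until the inputs share a scale.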
See this post on Cross Validated. But disregard one of the reasons given there, numerical stability: nowadays that is not quite true, since computers are much better. This was said on R-Help some time ago.
– Rui Barradas
Thanks for the references @Rui Barradas.
– Izak Mandrak