About the Layers
Your neural network needs, at the very least, the input and output layers, since the input layer is responsible for receiving the problem values (the "stimuli") and the output layer is responsible for producing the answer(s) (the regression or classification value).
The intermediate (hidden) layers are intended to capture/represent the various nuances that the training data may have. In theory, the more hidden layers you have, the better adjusted your network can become. This is especially useful when the problem is not linearly separable, since there are many small nuances that differentiate a given sample among the classes to which it may belong.
In fact, if the problem is linearly separable you wouldn't need any hidden layer, as the differences can already be learned from the neurons of the input and output layers alone - just as you yourself could do with one or a few simple IFs in your code (read the beginning of my answer to another question here on SOPT to understand this reasoning a little better).
If on the one hand more hidden layers make the network more specific to the nuances of the data in non-linearly separable problems, on the other hand they make it easier for the adjustment to become too tight on the training data, even with few training iterations (this is the so-called overfitting, where the network works perfectly on the training data but fails on real-world data). Additionally, more layers naturally require more processing, so the solution becomes more computationally costly.
In practice, there is a consensus that only 1 hidden layer is ideal for most problems:
One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.
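As an illustration of that single-hidden-layer architecture, here is a minimal forward pass in plain Python. The weights below are arbitrary values chosen for the sketch, not trained ones:

```python
import math

def logistic(net):
    # Logistic (sigmoid) activation: maps net to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_hidden, w_output):
    # One forward pass through a network with a single hidden layer.
    # w_hidden holds one weight vector per hidden neuron;
    # w_output holds one weight vector per output neuron.
    hidden = [logistic(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return [logistic(sum(wi * hi for wi, hi in zip(w, hidden))) for w in w_output]

# 2 inputs -> 2 hidden neurons -> 1 output (illustrative weights)
w_hidden = [[ 20.0,  20.0],   # hidden neuron 1
            [-20.0, -20.0]]   # hidden neuron 2
w_output = [[ 10.0,  10.0]]   # single output neuron

out = forward([1.0, 0.0], w_hidden, w_output)
```

Each layer is just a weighted sum followed by the activation function; training would consist of adjusting `w_hidden` and `w_output`.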
About Neurons
In the input layer you must have one neuron for each characteristic of the feature vector used for the problem (think "one for each column of the table"). But it also depends on how the feature is represented. In the case of continuous values (such as a temperature measured as a floating-point number of degrees, for example), a single neuron is used. But if the values are discrete, one neuron is used for each possible value (0, 1 and 2, for example, to represent low, medium and high temperatures). Sometimes an extra neuron is added to the input layer to serve as bias (and "shift" the activation functions in a controlled way - more details in this other question on SOEN).
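For instance, a discrete feature with three possible values (low, medium, high temperature) occupies three input neurons, while a continuous one occupies a single neuron. A generic sketch of that encoding, not tied to any particular library:

```python
def one_hot(value, n_values):
    # One input neuron per possible discrete value: the neuron for
    # `value` receives 1.0 and all the others receive 0.0.
    return [1.0 if i == value else 0.0 for i in range(n_values)]

# A sample with one continuous feature (temperature in degrees) and one
# discrete feature (0 = low, 1 = medium, 2 = high):
continuous = [23.7]
discrete = one_hot(1, 3)                      # [0.0, 1.0, 0.0]
input_vector = continuous + discrete + [1.0]  # trailing 1.0 is the bias input
# The input layer therefore needs len(input_vector) = 5 neurons.
```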
In the output layer the number of neurons depends on the use of the network. If the intention is to perform regression (estimate a value from the input data), there is only one neuron. There is also only one neuron if the use is binary classification (the network answers whether or not the input characteristics correspond to a single class). But if a multi-class classifier is built, there is one output neuron per class (each neuron answers 1 if the data corresponds to the class it represents, and the others answer 0 in this case).
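A multi-class output can then be decoded by picking the neuron with the highest activation. A hypothetical three-class example:

```python
def decode(outputs, classes):
    # One output neuron per class: the predicted class is the one whose
    # neuron produced the highest activation.
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return classes[best]

classes = ["low", "medium", "high"]
prediction = decode([0.1, 0.8, 0.2], classes)  # the second neuron wins
```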
The number of neurons in the intermediate (hidden) layer is another widely discussed issue. But there is a simple empirical rule that says you should use the average of the number of neurons in the input and output layers, because that makes the network robust enough to solve problems (since it only needs to represent nuances of the input data relative to the desired outputs).
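Applying that empirical rule to, say, a network with 5 input neurons and 3 output neurons (the rule is a heuristic, not a guarantee, and the rounding choice below is arbitrary):

```python
def suggested_hidden_size(n_inputs, n_outputs):
    # Empirical rule: use roughly the average of the input and output
    # layer sizes (rounded here; other roundings are equally defensible).
    return round((n_inputs + n_outputs) / 2)

n_hidden = suggested_hidden_size(5, 3)  # (5 + 3) / 2 = 4
```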
About the Activation Functions
The activation function literally describes when a neuron should or should not be activated for a given activation value (called net, resulting from the weighted sum of the inputs to the neuron). Consider a hypothetical example with only one neuron and some input data. If you use a step or linear function, it will be activated in a way very similar to an IF in your code, and so will serve only for linearly separable problems. Other, non-linear functions, such as the Logistic Function and the Hyperbolic Tangent, are more useful for non-linearly separable problems precisely because of the more "subtle" way in which the decision is made about values further from the decision boundary.
The decision on which function to use will depend essentially on your problem. The Logistic Function is commonly used because it is very simple and does the job for most problems. I don't have much practical experience in this respect, but I do know that for neural networks trained with backpropagation it is essential that the functions used are differentiable, because the algorithm uses gradient descent to adjust the weights of the neurons. So, in those cases, this is another important criterion.
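The functions mentioned above, and the derivative of the logistic that gradient descent relies on, can be sketched as:

```python
import math

def step(net):
    # Step function: a hard IF-like decision; not differentiable at 0.
    return 1.0 if net >= 0 else 0.0

def logistic(net):
    # Logistic (sigmoid) function: smooth, outputs in (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):
    # Hyperbolic tangent: smooth, outputs in (-1, 1).
    return math.tanh(net)

def logistic_derivative(net):
    # Differentiability is what backpropagation needs: the logistic has
    # the convenient derivative f(net) * (1 - f(net)).
    f = logistic(net)
    return f * (1.0 - f)
```

Far from the decision boundary the step and logistic functions agree; near it, the smooth functions produce intermediate values instead of an abrupt 0/1 jump.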
P.S.: In the case of character recognition, I suggest this reading and this practical example in Python (which uses the fantastic machine learning library Scikit-Learn).
Well answered! +
– JJoao
Thank you. Your answer was also a good aid/supplement to the question. :)
– Luiz Vieira
PS. The examples (I had not noticed them) are also interesting.
– JJoao
@Luizvieira Are nonlinearly separable problems solved by drawing curved lines or multiple straight lines?
– DanOver
@Danover In practice the layers map the data so that in some (higher) dimension the problem becomes linearly separable. So, yes, you can understand it as using multiple "straight lines" to approximate a "curved line". :)
– Luiz Vieira
@Luizvieira I asked this question because I usually see two representations for XOR-type problems, one of them (https://goo.gl/xavwps) and the other (https://goo.gl/6TB4ib).
– DanOver
Got it, @Danover. Note that the first image indicates that you would need two different linear classifiers to solve XOR (which would only help with this particular problem - in others, you might need tens or hundreds of linear classifiers, making it impractical!). The second one shows what actually needs to be done. See the image I posted just after the XOR chart in this other post: https://answall.com/a/193712/73 It will help you understand what I meant by mapping to another dimension, where a single linear classifier is enough. :)
– Luiz Vieira
@Luizvieira Very good post, I even gave it an "up". Luiz, help me with another question: when we say that a network is learning to play a game by competing against itself, what does that mean? Are we using an ANN algorithm based on a genetic algorithm (something like TWEANN networks) and making each individual in the population compete against another?
– DanOver
@Danover It means that it is being trained by "trial and error": the actions it performs in the game (either alone or against several opponents) are scored in terms of how good their result is, and this data is reused as input in another training step. Whether or not a GA is used depends on the implementation and on the goal of the training.
– Luiz Vieira
Let's continue this discussion in chat.
– Luiz Vieira