How does the Conditional Random Fields algorithm work?


I’ve been researching the "Conditional Random Fields" algorithm for a long time, but I can’t find any in-depth explanation of how it works and I can’t understand it. I gathered that it is a modification of the Markov chain (am I right?), but I can’t understand how it works well enough to implement it.

What would be the best didactic explanation of the "Conditional Random Fields" algorithm?

1 answer

Suppose you like to keep track of your favorite company’s daily stock price. One day you notice a strange relationship: on days when it rains, the stock price is more likely to rise. Intrigued, you look up the weather for the last 200 days, compare it with the daily share price, and put together this table:

                 S = 0 (fell)   S = 1 (rose)
C = 0 (no rain)       40%            40%
C = 1 (rain)           8%            12%

On 40% of the days it did not rain and the stock price fell; on another 40% it did not rain and the price went up; on 8% it rained and the price went down; and on 12% it rained and the price went up. With this table you can calculate the probability that the price rises given that it rained that day:

P( S=1|C=1 ) = 12%/(8%+12%) = 60%

If you know for sure that it will rain on a given day, you can predict that the price will rise and be right 60% of the time. But you also realize that it is possible to calculate the probability of rain given that the stock price rose on a given day:

P( C=1|S=1 ) = 12%/(40%+12%) = 23.08%
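To make the arithmetic concrete, here is a minimal sketch in plain Python (the variable names are my own) that computes both conditional probabilities directly from the 2x2 joint table above:

```python
# Joint pmf from the table above: joint[c][s] = P(C=c, S=s)
# C = rain (0/1), S = stock price went up (0/1)
joint = {0: {0: 0.40, 1: 0.40},
         1: {0: 0.08, 1: 0.12}}

# P(S=1 | C=1) = P(C=1, S=1) / P(C=1)
p_up_given_rain = joint[1][1] / (joint[1][0] + joint[1][1])

# P(C=1 | S=1) = P(C=1, S=1) / P(S=1)
p_rain_given_up = joint[1][1] / (joint[0][1] + joint[1][1])

print(p_up_given_rain)   # 0.6
print(p_rain_given_up)   # ~0.2308
```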

Okay, that didn’t help much! We still can’t predict the stock price very well. But if you got this much out of 2 binary random variables, imagine what you could do with 300. If it were possible to build this same kind of table with 300 binary random variables (that correlate well with the stock price), we could very likely make a prediction with more than a 90% chance of success.

I wish it were possible =). This table is known as the joint pmf (joint probability mass function) of the random variables S and C. Each percentage in this matrix has to be stored as a float, and using it with 300 binary random variables would require 2^300 floats. That’s far more than the memory of all the computers in the world combined!
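As a quick back-of-the-envelope check of that claim (assuming, hypothetically, one 4-byte float per table entry):

```python
# A full joint pmf over n binary variables has 2**n entries.
def joint_pmf_bytes(n_vars, bytes_per_float=4):
    return (2 ** n_vars) * bytes_per_float

print(joint_pmf_bytes(2))    # 16 bytes: the 2x2 rain/stock table above
print(joint_pmf_bytes(300))  # ~8.1e90 bytes, far beyond any real memory
```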

Still, this matrix would contain all the information about the correlations between the 300 random variables, and with it you could predict any one of the variables given enough of the others. Statistically, if your only information is the values of these variables, the joint pmf is the best you can have: to make a prediction you just add and divide the right entries, and by the standard statistical theorems no prediction can do better. Knowing this, many methods try to compress this matrix so that it takes less data to represent. One way to compress functions is factorization, for example:

f(x,y,z) = g(x,y)*h(x,z)*i(y,z)
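A rough sketch of what this buys, using hypothetical sizes and arbitrary factor values: the three two-argument factors take far less memory than the full three-argument table, yet any entry of f can still be reconstructed on demand by multiplying them.

```python
import numpy as np

n = 1000  # each of x, y, z takes 1000 possible values

# The three factors: 3 * n*n floats instead of n**3 for the full table.
g = np.random.rand(n, n)   # g(x, y)
h = np.random.rand(n, n)   # h(x, z)
i_ = np.random.rand(n, n)  # i(y, z)

def f(x, y, z):
    # Recover any entry of the (never materialized) full table.
    return g[x, y] * h[x, z] * i_[y, z]

print(n ** 3)       # 1,000,000,000 entries for the full table
print(3 * n * n)    # 3,000,000 entries for the factored form
print(f(3, 7, 42))  # one reconstructed entry
```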

In this case, if the function f were represented by a matrix of 1000x1000x1000 floats, this factorization represents the same matrix with only 3x1000x1000 floats, and depending on the functions g, h and i chosen, it is possible to approximate every element of the f matrix with high precision using only these factors. That is basically what the Conditional Random Fields method does. First, it treats the input and output variables as fixed roles: unlike with the joint pmf, you cannot use it to infer the input variables from the output variables, but this assumption alone greatly reduces the amount of information the method needs. Once it is made, we only need p(y|x) and not the whole joint pmf p(x,y), although the latter can be computed from the former with Bayes’ rule. The method also uses the following factorization to compress the amount of information in the function p(y|x):

p(y|x) = (1/Z(x)) * Φ_1(x, y) * Φ_2(x, y) * ... * Φ_n(x, y)

where x is a vector containing all the input random variables and y is a vector containing all the output random variables. Z is a normalization function that makes the probabilities sum to 1; it depends only on the input variables because of the simplification made above. The Phis are the factor functions obtained from the factorization. Now that we have the form of the factorization, what is still missing is a way to estimate these factor functions from training data and a way to run inference with them. There is a very well-studied theory that can help with that: graph theory. The factorization above can be represented by a bipartite graph with two kinds of nodes: round nodes representing the random variables and square nodes representing the factor functions. With this representation, several powerful graph-theoretic methods and algorithms can be used to help with training and inference. The following figure shows the factor graph of a CRF:

[Figure: factor graph of a CRF, with round nodes for the random variables and square nodes for the factor functions]
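To make the formula above concrete, here is a tiny hand-written sketch (not a trained CRF; the factor table is arbitrary and the pairwise structure is my own choice) that computes p(y|x) for a chain of binary outputs by multiplying pairwise factors and normalizing with Z(x), which is obtained by summing over every possible output vector:

```python
import itertools
import numpy as np

def phi(y_i, y_j, x):
    # One factor per pair of neighboring outputs. In a real CRF these
    # scores would be learned functions of x; here they are a fixed,
    # arbitrary non-negative table just to illustrate the mechanics.
    table = np.array([[2.0, 1.0],
                      [1.0, 3.0]])
    return table[y_i, y_j]

def unnormalized(y, x):
    # Product of all factors Phi for one candidate output vector y.
    return np.prod([phi(y[k], y[k + 1], x) for k in range(len(y) - 1)])

def p_y_given_x(y, x):
    # Z(x): sum of the unnormalized scores over all 2**len(y) possible
    # output vectors, so that the conditional probabilities sum to 1.
    Z = sum(unnormalized(y2, x)
            for y2 in itertools.product([0, 1], repeat=len(y)))
    return unnormalized(y, x) / Z

print(p_y_given_x((1, 1, 0), x=None))
```

Brute-force summation for Z(x) only works for toy sizes; this is exactly where the graph representation pays off, since algorithms on the factor graph compute the same quantities without enumerating every output vector.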

Okay, let’s try to find the closest link between Markov chains and CRFs. A Markov chain is a stochastic process that has the Markov property. There is a probabilistic graphical model called the Hidden Markov Model whose nodes have the Markov property, i.e. you only need the states of a node’s neighbors to infer its state, not the states of all the nodes in the network. The Hidden Markov Model is the generative counterpart of the linear-chain Conditional Random Field, which is a discriminative model. So the Markov chain relates to the Hidden Markov Model, which forms a generative-discriminative pair with the linear-chain CRF, which in turn is a particular case of the general CRF. There is also a relationship between Markov networks and CRFs, but in short, the relationship between Markov chains and CRFs is not very direct.

In addition, it is important to clarify something else. Conditional Random Fields is not an algorithm; it is a model. Neural networks, for example, are not an algorithm either: they are models, and there are various algorithms for training a neural network. The problems we want to solve are real, so we need a model to bring them into the theoretical realm where we have tools to solve them. That is what modeling is; a mathematical model, for example, describes some phenomenon with an equation that represents its reality, and solving that equation is a separate question. Accordingly, there are different algorithms for training a CRF.

  • The answer is still a little incomplete; one of these days I will try to improve the text and complete it. Note that a pmf applies only to discrete cases; for continuous random variables a pdf is used instead, but that is harder to explain.

  • Basically, the purpose of a CRF, then, is to decrease the memory cost of the pmf that you mentioned?

  • Not only the memory cost, but also the amount of data needed for training, the running time of the training, and the cost of making an inference. This is what probabilistic graphical models do (CRF is one of them). Ideally you choose the best model for your type of problem.

  • Sérgio, could you recommend some material to go deeper into models in AI? And especially one to go deeper into CRFs?

  • I like Machine Learning: A Probabilistic Perspective (Kevin P. Murphy). I also think it is important to have a strong statistical foundation; learning stochastic processes would be great for that (although the relationship is not very direct), but I do not know a really good book for it yet. An Introduction to Stochastic Modeling (Taylor, Karlin) is an option. A lot of it does not apply to machine learning, so it might be worth selecting the most important chapters.

  • 2

    I think to learn Conditional Random Fields the ideal would be to see a lot of machine learning before. There are many techniques that for me are simpler and that solve many problems well, for example: logistic regression, KNN, Learning Vector Quantization, Naive Bayes, Bayes Networks, PCA and many others...

  • I get it, thank you very much. I am going to look for the material you mentioned.

  • Congratulations @Sérgiomucciaccia, very good answer! :)

