How to implement a linear regression algorithm?

Asked

Viewed 8,546 times

12

I need to implement a linear regression algorithm. Preferably that gives equal or near results to the function TENDENCY (or TREND) excel.

I’ve found a lot of material that tries to explain the whole concept of linear regression, but that’s not what I want. I would like a pseudo code that is directly translatable for an imperative programming language.

For those who didn’t use the function TENDENCY it works like this:

Data two vectors representing x and y, in an ordered pair (x,y), for known values, for example, [ (1,10), (2,20), (3,30), (4,40) ], and a value of x for which you want to find out the corresponding value of y, the function TENDENCY fits the known values in a function of the form y=mx+b and gives you the value of y. In this case if I pass the value 5 as the last argument, the function returns me 50.

Someone might give a pseudo code from an algorithm that does this?

  • 1

    Dude, I did it in college over 10 years ago, but I can’t remember (nor find references). I’m also looking for something like this. In this blog post there is a very simple implementation in Python - but according to the author it is more "didactic" than "efficient" (after all, there are zillions of ways to make linear regression, each with its pros and cons).

7 answers

9

To Reply by @Marcos Banik can be translated for example to Python with minimal effort:

>>> x = [1,2,3,4]
>>> y = [10,20,30,40]
>>> m = sum(a*b for (a,b) in zip(x,y)) - sum(x) * sum(y) / len(x)
>>> m /= sum(a**2 for a in x) - (sum(x)**2)/len(x)
>>> b = sum(y)/len(y) - m * sum(x)/len(x)
>>> m*5 + b
50

However not all imperative language has the same level of expressiveness, so I will present a more elaborate version (in Javascript):

var x = [1,2,3,4];
var y = [10,20,30,40];

function produto(x, y) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * y[i]);
    return ret;
}

function quadrados(x) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * x[i]);
    return ret;
}

function somatorio(x) {
    var ret = 0;
    for ( var i = 0 ; i < x.length ; i++ )
        ret += x[i];
    return ret;
}

function media(x) { return somatorio(x) / x.length; }

var m = somatorio(produto(x,y)) - somatorio(x) * somatorio(y) / x.length;
m /= somatorio(quadrados(x)) - somatorio(x)*somatorio(x) / x.length;

var b = media(y) - m * media(x);

console.log(m*5 + b);

Example in jsFiddle.

8


I set up a small example in Java based on some examples I found on the Web.

I used the data you provided...

public class Main {

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {10, 20, 30, 40};
        System.out.println(trend(y, x, 5));
    }

    public static double trend(double[] known_y, double[] known_x, double new_x)
    {
        double[] values = LeastSquaresFitLinear(known_y, known_x);
        return (values[0] * new_x) + values[1];
    }

    public static double[] LeastSquaresFitLinear(double[] known_y, double[] known_x)
    {
        double M, B;
        if (known_y.length != known_x.length)
        {
            return new double[]{0,0};
        }

        int numPoints = known_y.length;

        double x1, y1, xy, x2, J;

        x1 = y1 = xy = x2 = 0.0;
        for (int i = 0; i < numPoints; i++)
        {
            x1 = x1 + known_x[i];
            y1 = y1 + known_y[i];
            xy = xy + known_x[i] * known_y[i];
            x2 = x2 + known_x[i] * known_x[i];
        }

        M = B = 0;
        J = ((double)numPoints * x2) - (x1 * x1);

        if (J != 0.0)
        {
            M = (((double)numPoints * xy) - (x1 * y1)) / J;
            B = ((y1 * x2) - (x1 * xy)) / J;
        }
        return new double[]{M,B};
    }

}
  • 1

    Perfect! The solution in Java helps since it is almost a lingua franca and the algorithm works, not only for the values presented but for other values, very close, but identica, the TENDENCIA/TREND function of Excel.

5

I don’t know what the algorithm is that function TENDENCIA uses, but if it is only an input variable x and an exit y, the answer to m and b by the least squares method is: (code in R)

m <- (sum(x*y) - sum(x) * sum(y) / n) / ( sum(x^2) - ( sum(x) )^2 ) / n )
b <- mean(y) - m * mean(x)

n is the number of vector elements x and y, sum is the function that returns the sum of the vectors, x * y is the vector with the product of the elements of the same index. x^2 is the vector where all its elements are squared

you can find this solution in wikipedia

  • What is j? And this x*y means what in R? (scalar product - guess not - cross product, etc. In J, x*y would be a vector with the products of the individual elements of x and y, but I don’t think it’s the same in R)

  • j was an editing error (should be y). x*y means the same thing in J and R from what I understand.

  • 1

    Yes.Actually in R it is better to use the function lm to calculate regressions, since it returns the confidence interval, residual and other important parameters in the regression quality analysis.

    1. I ask explicitly "in an imperative language" because I know that answers would appear in functional language since problems like this are in their domain. Still thanks for the information. 2. Are the order of precedence of operations equal to the mathematical? Will the substitution be the last operation performed? I could not reproduce your example in Excel assuming this precedence.
  • Yes. You’re right. I’m not from the programming area, hence my confusion. I don’t know the difference between imperative and functional language. I regret the noise. And yes the order of the operation is the same as the mathematics. The reason you couldn’t reproduce the result was because I forgot to put the denominator m. I think you’re right now.

  • 1

    @Raulalmeida For my part, I found the answer easily "translatable" to an imperative language (however, my preferred language is Python - which encompasses several functional features in its design - in addition to having previous experience in J)But I understand that not everyone is like that. Postei a separate answer where expressed in a purely imperative format.

Show 1 more comment

4

Using the same algorithm from the @Marcos Banik response, I wrote a small algorithm in C based on the least squares method:

void lms(double *x, double *y, int n, double *m, double *b)
{
    int i;
    double sumYX = 0.;
    double sumX = 0.;
    double sumY = 0.;
    double sumX2 = 0.;
    double sum2X = 0.;
    for(i = 0; i < n; i++) {
        sumYX += x[i] * y[i];
        sumX += x[i];
        sumY += y[i];
        sumX2 += x[i] * x[i];
    }
    sum2X = sumX * sumX;

    *m = (sumYX - (sumX * sumY) / (double)n) / (sumX2 - sum2X / (double)n);
    *b = sumY / (double)n - *m * sumX / (double)n;
}

The algorithm is exactly the same used in his response. The loop for initial is done to calculate all the sums and later your results are used to find m and b. I believe this is easily translatable to your preferred language. :)

With m and b you have the equation of the line. To calculate which value of y, just make a function:

double trend(double m, double b, double x)
{
    return m*x + b;
}

To call the functions:

int main(void) 
{
    double x[] = {1., 2., 3., 4.};
    double y[] = {10., 20., 30., 40.};
    double m, b;
    double nx = 5.;
    double ny;

    lms(x, y, 4, &m, &b);
    ny = trend(m, b, nx);

    printf("m: %lf \nb: %lf \nnx: %lf \nny: %lf\n", m, b, nx, ny);

    return 0;
}

2

You need y=mx+b and that is enough the data x. y will serve only to discover other variables, but not to define the regression equation, correlation coefficient, average, median and data fashion.

General scope: "which year (x) in which absolute population growth will be zero?"

First we need to find out the growth rate (%) and whether it is positive or negative, which defines TREND or the slope of the graph generated by ER.

On the basis of existing data (anos/x) we define the correlation coefficient or accuracy of the responses for new y.

Halfway we define fashion, median, average etc to sustain the solution of the problem that allows to work any matrix of x:y depending on TREND, the signal of bx or mx + or -.

y = a + bx + c ou y=mx+b

The complete code that generates the regression equation and correlation coefficient is in James Holmes' The State of Art sources from chapter 8.

2

For those who might be interested here is the implementation I was looking for.

It is written in the "language" Clipper (compiler x/Harbour) and works equal to TENDENCY for various values I tested.

Function Main()

   LOCAL x := { 1,2,3,4 }
   LOCAL y := { 10,20,30,40 }

   ? Trend(y, x, 5)

   Return NIL

Function Trend( known_y, known_x, new_x )

   LOCAL v := LSFL(known_y, known_x)

   Return (v[1] * new_x) + v[2]

Function LSFL(known_y, known_x)

   LOCAL M,B
   LOCAL n
   LOCAL x1, y1, xy, x2, J
   LOCAL i

   IF Len( known_y ) != Len( known_x )
      Return { 0, 0 }
   ENDIF

   n := Len( known_y )

   x1 := 0; y1 := 0; xy := 0; x2 := 0

   For i := 1 To n
      x1 := x1 + known_x[i]
      y1 := y1 + known_y[i]
      xy := xy + known_x[i] * known_y[i]
      x2 := x2 + known_x[i] * known_x[i]
   Next

   M := 0
   B := 0

   J := ( n * x2 ) - ( x1 * x1 )

   IF !Empty( J )
      M := (( n * xy ) - ( x1 * y1 )) / J
      B := ( (y1 * x2) - ( x1 * xy )) / J
   ENDIF

   Return { M, B }

1

It follows an algorithm I made to calculate the MMQ based on a video lesson I watched on Youtube:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>


void main(){

    float x[] = {-1,0,1,2};
    float y[] = {-1,1,3,5};
    float sx;
    float xy;
    float totalx = 0;
    float totaly = 0;
    float totalsx = 0;
    float totalxy = 0;
    int n = sizeof(x)/sizeof(float);

    int cont;

    for(cont = 0;cont<n;cont++){
        totalx = totalx + x[cont];
        totaly = totaly + y[cont];
        totalsx = totalsx + pow(x[cont],2);
        totalxy = totalxy + (x[cont]*y[cont]);
    }

    // Passo1
    float vbb = n;
    float vba = totalx;
    float somvba = totaly;
    float b2p1 = totalx;
    float a2p1 = totalsx;
    float soma2p1 = totalxy;

    //Passo 2
    float b1p2 = vbb / vbb;
    float a1p2 = vba / vbb;
    float soma1p2 = somvba / vbb;
    float b2p2 = b2p1-(b1p2*totalx);
    float a2p2 = a2p1-(a1p2*totalx);
    float soma2p2 = soma2p1-(soma1p2*totalx);

    // Passo 3
    float b1p3 = b1p2;
    float a1p3 = a1p2;
    float soma1p3 = soma1p2;
    float a2p3 = a2p2 / a2p2;
    float soma2p3 = soma2p2 / a2p2;

    float afinal = soma2p3 / a2p3;
    float bfinal = soma1p3 - a1p3 * afinal;

    printf("\nValor final de A= %.5f e valor de B= %.5f\n\n", afinal, bfinal);
    system("pause >> log");
}
  • 1

    It seems an interesting code, could edit the question and explain how it works?

Browser other questions tagged

You are not signed in. Login or sign up in order to post.