Iterate over two Lists optimally - Algorithm Optimization O(n²) C#

I need to iterate over two lists and compare the values contained in both. But the way I'm doing it takes a lot of time, because the complexity of my algorithm is O(n²). More details below.

  • I have a list named Connections and another named Primary.
  • I need to iterate over both, as in the code below.

    for (int i = 0; i < graph.Connections.Count; i++)
    {
        foreach (var primare in primaries)
        {
            if (graph.Connections[i].Initial.Coordinate.CoordX == primare.LongitudeOne
                && graph.Connections[i].Initial.Coordinate.CoordY == primare.LatidudeOne
                && graph.Connections[i].End.Coordinate.CoordX == primare.LongitudeTwo
                && graph.Connections[i].End.Coordinate.CoordY == primare.LatitudeTwo)
            {
                graph.Connections[i].PrimaryConnectionData.Source = graph.Connections[i].Initial.Id;

                // Code here...
            }
        }
    }
    
  • There is a big problem with this approach: my list is very large, and this iteration takes a lot of time.

I tried a different approach using a dictionary, as in the example below.

Note: My pPointOne.Id is a long

foreach (var primary in primaries)
{
    var coordOne = new CoordinateGraph() { CoordX = primary.LongitudeOne, CoordY = primary.LatidudeOne };
    var coordTwo = new CoordinateGraph() { CoordX = primary.LongitudeTwo, CoordY = primary.LatitudeTwo };

    var pPointOne = new PrimaryPoint();
    var pPointTwo = new PrimaryPoint();

    int position;
    if (graph.PointsMap.TryGetValue(coordOne, out pPointOne) && graph.PointsMap.TryGetValue(coordTwo, out pPointTwo))
    {
        position = Convert.ToInt32(pPointOne.Id);
        graph.Connections[position].PrimaryConnectionData.Source = graph.Connections[position].Initial.Id; 

        // Code here...
    }
}

But with this second approach I get a runtime error, shown below.

Result Message:

System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index

  • My question is: how can I do this process more efficiently?

  • Can it be done using a lambda?

  • What is this primaries? Where does it come from?

  • It's a list loaded from my database; it contains two pairs of coordinates, (Long1, Lat1) and (Long2, Lat2).

  • Can you show that code? You do a foreach over one item and, inside it, another foreach over primaries.

  • Yes, I need to iterate over the two lists to check whether: primare.Long1 == Graph.connection.long1 && primare.Lat1 == Graph.connection.Lat1 && primare.Long2 == Graph.connection.long2 && primare.Lat2 == Graph.connection.Lat2

  • Then post the code showing where the primaries come from.

  • I just call a Findallprimary() that returns all the primaries from my database; they are nothing more than coordinates. Which part specifically do you want to see?

  • Basically, the error is that your primaries has more entries than Graph.Connections.Count. Just check for that first and the error will go away.


1 answer

Given that graph.Connections has size n and primaries has size m, there is an approach that runs in O(n log(n) + m log(m) + (n + m)) time.

This approach is divided into two steps:

  1. data preparation (O(n log(n) + m log(m)))
  2. go through the data (O(n + m))

How does this magic happen? Well, let’s go to the general idea and then go into the details.

If you have an unordered set of cardinality n, detecting repeated elements without sorting it requires O(n^2 - n) operations. If the set is indexed, we can halve that number. That is only temporary relief, though, lasting only while the data does not grow by an order of magnitude... Imagine the set now holds twice as many elements (i.e., 2n); the number of comparison operations becomes ((2n)^2 - 2n)/2 = (4n^2 - 2n)/2 = 2n^2 - n. Growing the data that much was enough to wipe out the whole optimization.
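
To make that growth concrete, here is a small worked example with illustrative sizes (the 1,000 and 2,000 below are made-up numbers, not from the question):

    n = 1,000:  (n^2 - n) / 2 = (1,000,000 - 1,000) / 2 =   499,500 comparisons
    n = 2,000:  (n^2 - n) / 2 = (4,000,000 - 2,000) / 2 = 1,999,000 comparisons

Doubling the input roughly quadruples the work, which is the hallmark of quadratic growth.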

Now, what if the set were orderable? Well, if it is orderable, the cost of sorting it is O(n log n) (when it is not already sorted). Now, given a sorted, indexed set, how many operations are needed to detect any repetition? Only O(n). Let me explain.

When the set is indexed, I can access its members by index. For example, C[i] picks the i-th element of the set C, while C[i + 1] picks the element right after the i-th. Taking an element of a set by index is equivalent to defining a function f: N -> E that maps a natural number (N being the set of natural numbers) to an element of an arbitrary set E. If the elements of E are orderable and C is sorted in increasing order, then f will be a monotonically increasing function. What does that mean? It basically guarantees that f(i) <= f(i + 1) for any i within the function's domain. By induction, we can also show that f(i) <= f(i + k) for any k >= 0.

One consequence of the function being monotonically increasing is that if f(i) < f(i + 1), then f(i) < f(i + k) for every k >= 1. So, if two elements at consecutive indices (i and i + 1) are not identical, no element at an index j <= i ever needs to be compared with an element at an index k >= i + 1. In other words, the only comparisons needed are between consecutive elements.
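
As a minimal sketch of this idea in C# (the list of ints and every name below are just illustrative, not from the question):

using System;
using System.Collections.Generic;

class DuplicateScan
{
    static void Main()
    {
        // A set that is already sorted in increasing order.
        var values = new List<int> { 1, 3, 3, 7, 9, 9, 12 };

        // Because the list is sorted, equal values can only sit next to each other,
        // so one O(n) pass over consecutive pairs finds every repetition.
        for (int i = 1; i < values.Count; i++)
        {
            if (values[i - 1] == values[i])
            {
                Console.WriteLine($"Repeated value: {values[i]}");
            }
        }
    }
}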

But we’re talking about two sets, not just one. How does that apply?

The principle is the same. Imagine we are working with two sets, A and B. Imagine that both are sorted and hold the same type of data. Now choose i, a valid index into A, and j, a valid index into B. With these sets and indices we get the elements A[i] and B[j]. There are three possible outcomes when comparing A[i] and B[j]:

  • A[i] == B[j]: in this case we have found elements with the same value, which is what we wanted;
  • A[i] > B[j]: since we are dealing with sorted sets, we know that for k < j we have B[k] <= B[j], so there is no point in comparing A[i] with any element of B at an index lower than j;
  • A[i] < B[j]: since we are dealing with sorted sets, we know that for k < i we have A[k] <= A[i], so there is no point in comparing B[j] with any element of A at an index lower than i.

This provides the basis for a search algorithm. Here too, in a sense, only consecutive elements matter, just as in the search for repeated elements within a single sorted set. Below is a small pseudo-code sketch of a function that detects equal elements in the sets A and B. The arguments of this function are:

  • A, an ordered set of elements of the type E
  • B, an ordered set of elements of the type E
  • cmp : E,E -> S, a function which, given two elements of E, determines the relationship between them: it returns - if the first element is less than the second, + if the first element is greater than the second, and 0 if the elements are identical (S is the sign set {-, 0, +})

    input:
      A, sorted set of elements of type E
      B, sorted set of elements of type E
      cmp, sign function that compares two elements of E

    begin
      i <- 0 # index for iterating over A
      j <- 0 # index for iterating over B

      while i < A.size && j < B.size:
        s = cmp(A[i], B[j])
        if s == '0':
          # elements are equal, do something
          i <- i + 1
          j <- j + 1
        else if s == '-':
          # A[i] < B[j], so the next comparison will be between A[i + 1] and B[j]
          i <- i + 1
        else # trivial case where s == '+':
          # A[i] > B[j], so the next comparison will be between A[i] and B[j + 1]
          j <- j + 1
      # once i or j runs past the size of A or B, respectively, there are no more comparisons left to make
    end
    

All in all, the search is guaranteed to take O(n + m) operations. Linear time, which makes your problem tractable.

Just as a curiosity: last week there was a question about the performance of mergesort compared to insertionsort, since mergesort was taking approximately 2x the insertionsort time. After tuning the sort to make better use of the allocated memory, the running time became ridiculously better (something between 3,000 and 7,000 times smaller). Reducing the complexity of the problem at this level brings big changes.

So far everything has been discussed in the abstract. How can we make it a little more concrete?

The first step is to define the set E and the properties of its elements so that an ordering can be defined. If E consists of only a single comparable key, simply sort by that key. For example, E might be the set of entries in a dictionary: an entry consists of the headword and its meaning, the key being the headword, and this key admits lexicographic/alphabetical ordering. For instance, we know that análise comes lexicographically before complexidade. Thus it is possible to sort a dictionary using lexicographic order.
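
A minimal sketch of such a single-key sort in C# (the Entry type and the sample words are invented for illustration):

using System;
using System.Linq;

class Entry
{
    public string Word { get; set; }     // the headword: the single ordering key
    public string Meaning { get; set; }  // payload; plays no role in the ordering
}

class SingleKeySort
{
    static void Main()
    {
        var entries = new[]
        {
            new Entry { Word = "complexidade", Meaning = "..." },
            new Entry { Word = "análise", Meaning = "..." }
        };

        // Sorting by the single key gives the lexicographic order of the dictionary.
        foreach (var e in entries.OrderBy(x => x.Word))
            Console.WriteLine(e.Word); // análise, then complexidade
    }
}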

What about multiple keys? Well, we initially choose an arbitrary key. It will be, shall we say, the dominant key. If we sort only by the dominant key, we get a partially ordered list. In that case, we choose another key to dominate over the remaining ones; this will be the second dominant key. We repeat this step until every key has been used, defining a dominance hierarchy among the keys. The most interesting thing here is that, for the search algorithm above, it does not matter which dominance hierarchy is chosen, as long as it is used consistently. (The choice of hierarchy may, however, affect the amount of work done while sorting the data.)

In our case, the set we need to compare is made of pairs of coordinates. Elements of this set contain four values: lat1, long1, lat2, long2. Since the key dominance hierarchy does not matter to the search algorithm, I will sort first by lat1, using lat2 as the first tie-breaker, followed by long1 and then long2. Roughly, the comparison function would look something like this:

int cmp_coords(Connection c, Primary p) {
  double delta = c.Initial.Coordinate.CoordY - p.LatitudeOne;

  if (delta != 0) {
    // case where the dominant key decides
    return delta < 0 ? -1 : +1;
  }
  // case where the dominant key tied

  // first tie-breaker: Lat2
  delta = c.End.Coordinate.CoordY - p.LatitudeTwo;

  if (delta != 0) {
    // tie broken!
    return delta < 0 ? -1 : +1;
  }

  // second tie-breaker: Long1
  delta = c.Initial.Coordinate.CoordX - p.LongitudeOne;

  if (delta != 0) {
    // tie broken!
    return delta < 0 ? -1 : +1;
  }

  // last tie-breaker / last key: Long2
  delta = c.End.Coordinate.CoordX - p.LongitudeTwo;

  if (delta != 0) {
    // tie broken!
    return delta < 0 ? -1 : +1;
  }
  // ok, everything really is tied!
  return 0;
}

So, if we have a variable graphConnectionsOrdered built from graph.Connections by sorting it with a comparison function analogous to the one above, and likewise primariesOrdered built from primaries, we can do the following:

int i = 0; // index for iterating over graphConnectionsOrdered
int j = 0; // index for iterating over primariesOrdered

while (i < graphConnectionsOrdered.Length && j < primariesOrdered.Length) {
  int s = cmp_coords(graphConnectionsOrdered[i], primariesOrdered[j]);
  if (s == 0) {
    // elements are equal, do something?
    i++;
    j++;
  } else if (s < 0) {
    // element in graphConnectionsOrdered is smaller than the one in primariesOrdered; advance the graphConnectionsOrdered index
    i++;
  } else {
    // element in graphConnectionsOrdered is larger than the one in primariesOrdered; advance the primariesOrdered index
    j++;
  }
}

This runs in guaranteed O(n + m) time. The time required to sort each set is O(n log(n)) and O(m log(m)), respectively. Therefore, the total time is O(n log(n) + m log(m) + (n + m)).
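
The answer does not show how graphConnectionsOrdered and primariesOrdered are built. One possible sketch, assuming the property names used in the question and answer, is to chain OrderBy/ThenBy with the same key dominance hierarchy that cmp_coords uses:

using System.Linq;

// Assumption: same hierarchy as cmp_coords (Lat1, then Lat2, then Long1, then Long2).
var graphConnectionsOrdered = graph.Connections
    .OrderBy(c => c.Initial.Coordinate.CoordY)  // Lat1 (dominant key)
    .ThenBy(c => c.End.Coordinate.CoordY)       // Lat2
    .ThenBy(c => c.Initial.Coordinate.CoordX)   // Long1
    .ThenBy(c => c.End.Coordinate.CoordX)       // Long2
    .ToArray();

var primariesOrdered = primaries
    .OrderBy(p => p.LatitudeOne)
    .ThenBy(p => p.LatitudeTwo)
    .ThenBy(p => p.LongitudeOne)
    .ThenBy(p => p.LongitudeTwo)
    .ToArray();

Sorting copies like this leaves graph.Connections and primaries untouched; arrays are used here only so that the .Length checks in the loop above work as written.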

Right, but you've been talking all along about working with a single, homogeneous type, yet the comparison uses two distinct types. Why? And how is that possible?

Well, right, I deliberately skipped that part. Although both Connection and Primary represent a pair of coordinates, they are not of the same type. On the other hand, one is trivially converted into the other: there is a bijection between Connection and Primary. Because of this bijection, if we respect the way it is defined (and we do respect it above), we are allowed to treat the types as matching. Since they are equivalent, I used some implicit conversions between the two, saving the creation of temporary variables that would exist only to perform this homogenization.
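
Those conversions are not shown in the answer; purely as a hypothetical illustration, the bijection could be expressed with implicit conversion operators on a small intermediate type (CoordPair and every member name below are invented, not part of the original code):

// Hypothetical common representation of "a pair of coordinates".
struct CoordPair
{
    public double Lat1, Long1, Lat2, Long2;

    // Assumed shape of Connection, following the question's property names.
    public static implicit operator CoordPair(Connection c) => new CoordPair
    {
        Lat1 = c.Initial.Coordinate.CoordY,
        Long1 = c.Initial.Coordinate.CoordX,
        Lat2 = c.End.Coordinate.CoordY,
        Long2 = c.End.Coordinate.CoordX
    };

    // Assumed shape of Primary, following the answer's property names.
    public static implicit operator CoordPair(Primary p) => new CoordPair
    {
        Lat1 = p.LatitudeOne,
        Long1 = p.LongitudeOne,
        Lat2 = p.LatitudeTwo,
        Long2 = p.LongitudeTwo
    };
}

With something like this, a comparison function could take two CoordPair parameters and accept either a Connection or a Primary directly, with the compiler inserting the conversion.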
