Is there a better way to build a graph?

Asked

Viewed 553 times

4

I have a list of articles with their respective tags:

artigo1 ['a','b','c']
artigo2 ['a','d','f']
artigo3 ['z','d','f']
...

I need to create a graph that will relate all articles by tags. In the end I will have to have this structure (Can’t be another):

artigo1: ['artigo2']
artigo2: ['artigo1','artigo3']
artigo3: ['artigo2']
...

Initially I made two for:

for i in range(0,len(vertices)):
    for j in range(i,len(vertices)):
        for tag in vertices[i].tags:
            if(tag in vertices[j].tags):
                vertices[i].neighbor.append(vertices[j].uid)
                vertices[j].neighbor.append(vertices[i].uid)

It works but my file has 5,500,000 articles and will spend an eternity because the complexity is O((n 2)/2). There is a better way to perform this computation?

  • There is a way to improve yes. I made a graph constructor in Java using keys and maps. I’ll send the idea here, but not final code. The asymptotic speed will depend basically on the structure to store and to rescue the data

  • I don’t know if it helps you, but have you checked the igraph? Link: http://igraph.org/python/

1 answer

3


As I had commented, I made a map-based Java graph montage scheme, as long as each node had a key to identify it. Your case is slightly different, because each node here has several keys that identify it. You can also see as a bipartite graph your specific question. I will focus on the bipartite graph.

By the way, while writing this answer, I realized that your problem is something basic for database systems to handle. I realized that databases treat their problem as a relationship between two tables, many for many. And the edges between the two entities is represented in the link table. See more here. Imagine if for every connection between 2 many-many entities it was of quadratic speed?

You have two types of data. The first type of data is artigo. The second type of data is tag. An article has several tags, and a tag can be in several articles. But here there is no link tag -- tag or artigo -- artigo.

It is still possible to exist these links and continue with a bipartite graph in terms. But for this we need to consider that there are connections of distinct types and that the graph is bipartite in the class of links tags belonging to articles.

An example of a bipartite graph:

grafo bipartido

You already have here the navigation between the articles and the tags, in this direction. But to make the navigation complete you need to navigate the tags to the articles. How to do this? How about with a multimope? In short, a multimope is a data structure that, for a single key, returns a collection of elements.

I’ve worked out how to implement a multi-map in this answer, with an analysis of what this data structure is in this other answer.

So we set up the multimope mm_tags_artigos in accordance with this answer and we can access all the articles related to a particular tag by making a map access. Your example would bring exactly:

mm_tags_artigos['a'] ==> [ artigo1 , artigo2 ]
mm_tags_artigos['b'] ==> [ artigo1 ]
mm_tags_artigos['c'] ==> [ artigo1 ]
mm_tags_artigos['d'] ==> [ artigo2 , artigo3 ]
mm_tags_artigos['f'] ==> [ artigo2 , artigo3 ]
mm_tags_artigos['z'] ==> [ artigo3 ]

So, how to know what are the articles related to one another? Well, we take all the edges at two positions of distance in the bipartite graph. Since the graph is bipartite, everything at an even distance of edges of an article is an article as well. By taking all the articles two jumps away from my target article, we will have the following set:

Descrição matemática de artigos a dois saltos de distância

Here, S_artigo is the set of all articles that relate directly to the article tags artigo. Of course, here the article itself would also be returned, so simply remove the article you want to know the related of this set.

Basically, after having the multimap assembled, it would suffice from the following line that then you would get all the articles related to the article x (thank you to @Andersoncarloswoss for showing me the itertools.chain):

[artigo_vizinho for artigo_vizinho, g in groupby(sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str)) if artigo_vizinho != x]

This will return a list of all neighboring articles, accessed from the tags of x. Explaining every step, from the inside out:

mm_tag_artigo[tag]

I’m getting a list of all the articles that have the tag tag. mm_tag_artigo is the multimap that, given a tag, returns the list of articles associated with it.

[mm_tag_artigo[tag] for tag in x.neighbors]

For each tag that x contains, I return a list with your neighbors. If x were the artigo2 of the example, the elements would be more or less these:

[
  # para a tag 'a'
  [artigo1, artigo2],

  # para a tag 'd'
  [artigo2, artigo3],

  # para a tag 'f'
  [artigo2, artigo3]
]

If it were artigo1:

[
  # para a tag 'a'
  [artigo1, artigo2],

  # para a tag 'b'
  [artigo1],

  # para a tag 'c'
  [artigo1]
]
*[mm_tag_artigo[tag] for tag in x.neighbors]

Using the operator explode before the list, so that each element of the list is passed as a separate argument from the calling function.

chain(*[mm_tag_artigo[tag] for tag in x.neighbors])

Here I am joining the obtained lists in a single iterable. To x = artigo2, we would get something in this format:

[ artigo1, artigo2, artigo2, artigo3, artigo2, artigo3 ]

Already to x = artigo1:

[ artigo1, artigo2, artigo1, artigo1 ]

Documentation of chain: https://docs.python.org/3/library/itertools.html#itertools.chain

sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str)

Now I am ordering the obtained result. Maybe objects of the type artigo are not directly manageable, so I’m using your string representation as the sorting key (key = str). I’ve done tests for when there’s no method override __str__ of the article class, so it prints a distinct string for each object.

Documentation of sorted: https://docs.python.org/3/library/functions.html#sorted

Of course, if the method __str__ of an article is superscripted to allow two distinct objects with the same representation, there danced the next step.

The intention to sort here is only to be able to group objects only, which is done in the next step:

groupby(sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str))

I am grouping the elements so that I can get only one of each, without needing repetitions. Note that this function works in linear time, as if it were the command uniq Unix, grouping only consecutive elements.

Documentation of groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby

artigo_vizinho for artigo_vizinho, g in groupby(sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str))

The function groupby returns a list of tuples, they being:

  • the grouper element
  • the group formed by that element

We’re not interested in the group, so I’m discarding it, just staying with the grouper, which in this case is an article called artigo_vizinho.

artigo_vizinho for artigo_vizinho, g in groupby(sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str)) if artigo_vizinho != x

Just conditioning pick up the neighboring articles, without repeating the article x.

[artigo_vizinho for artigo_vizinho, g in groupby(sorted(chain(*[mm_tag_artigo[tag] for tag in x.neighbors]), key = str)) if artigo_vizinho != x]

Well, turning this result obtained into list. There is not much why, mania my own.

Just as a detail, don’t forget that sorted is built-in, is already available for Python, while chain and groupby need to be imported, they are in the module itertools. To use the way I used, import like this: from itertools import chain, groupby.

See working on ideone.

  • I understood the whole idea of your answer, I was thinking something similar, but I even stopped creating the multimope. The only thing I didn’t understand was the '*' and the fact of the need to sort(Maybe it would be pro chain?).

  • 1

    @Viniciusmorais the order actually is on account of groupby. More details in the documentation. About the *, try the following experiments: chain([1,2,3], [4,5,6]) (simulating with the asterisk); chain([ [1,2,3], [4,5,6] ]) (simulating without the asterisk)

  • 1

    Another thing, what would be the new complexity?

  • @Viniciusmorais puts, I forgot it! Barely there... depending on the efficiency of the search and the dictionary insertion, I think it is o(n + e log e + e), where e is the average number of edges per vertex. There’s an even shorter asymptotic time alternative, using an article-only graph, not a bipartite graph, but the alternative is an extra abstraction step and I couldn’t type yesterday, I was tired

  • Thank you very much.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.