How to calculate the median of a large number of values?

I have a list of approximately 1.5 million values and I need to find the median of this set. The values are stored in a file as strings, between single quotes (e.g. '155').

How do I calculate the median of this many values?

Note: I cannot use built-in functions such as min, max, etc.

  • Does 1.5mi mean 1,500 or 1.5 million?

  • There are approximately 1.5 million lines.

  • Daniel, I took the liberty of editing your question to make it a little more direct. Could you confirm that I didn't misinterpret it and change what you were asking?

  • 1

    @Andersoncarloswoss you didn't misinterpret it, and it reads much better now. Thank you!

3 answers

1

Well, the question is a little vague, and I don't know the format of the file that holds the data, so I can only give you an idea of what to do.

1) Read the file and put the values in a list (if the file is a .csv, this will be much easier).

2) Convert them to integers using int(), after removing the single quotes around each value.

3) Use the sorted() function to sort the list.

4) Get the total size of the list using the len() function.

5) Compute the remainder of dividing that size by 2 using the % operator.

6) If the size is even, the median is the average of the two central elements; if it is odd, the median is the central element of the list.
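
The steps above could be sketched roughly as follows. This is only an illustration, assuming Python 3 and one quoted value per line; the names parse_value and median_sorted_approach are made up for this example, and the sketch relies on sorted() and len(), which the question says are not allowed.

```python
def parse_value(line):
    # Assumption: each line holds one value wrapped in single quotes, e.g. '155'
    return int(line.strip().strip("'"))


def median_sorted_approach(values):
    # Steps 3 to 6: sort, measure the size, check parity, pick the middle
    ordered = sorted(values)
    size = len(ordered)
    middle = size // 2
    if size % 2 == 0:
        # Even size: average of the two central elements
        return (ordered[middle - 1] + ordered[middle]) / 2
    # Odd size: the single central element
    return ordered[middle]


# Stand-in for the lines that would be read from the file
lines = ["'7'\n", "'1'\n", "'4'\n", "'5'\n"]
values = [parse_value(line) for line in lines]
print(median_sorted_approach(values))  # 4.5
```

With a real file, the list of lines would come from iterating over the open file object instead of the hard-coded sample.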

  • 1

    And do it with 1.5 million numbers? Seems unfeasible to me.

  • 1

    It gets really heavy, I agree, but in the absence of an alternative... There are ways to process this file separately, like divide it into X parts, sort, count the occurrence of cases and then draw the conclusion... But then it would be another process...

1


In this type of problem we have to be very careful with running time and memory. Working with lists can be a problem. For now I suggest this solution.

1) Read the file line by line and, while reading, insert each number into a list that is kept sorted

For this, we will use a function that inserts a number at its sorted position in a list.

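def adiciona_na_ordem(lista, tamanho_lista, numero):
    # Find the first position whose element is greater than numero
    # (lista must not be empty, otherwise i is never defined)
    for i in range(0, tamanho_lista):
        if numero < lista[i]:
            break
    else:
        # No break happened: numero is the largest so far and goes at the end
        i += 1
    # Build a new list with numero placed at position i
    return lista[:i] + [numero] + lista[i:]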

>>> lista = [0,1,2,3]
>>> print (adiciona_na_ordem(lista, len(lista), 10))
[0, 1, 2, 3, 10]
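
Since the list is always sorted, the position can also be found with a binary search instead of a linear scan. The following is a sketch of that variation, not part of the original answer; the names posicao_ordenada and adiciona_na_ordem_binaria are hypothetical. The slicing still copies the list on every insertion, so only the number of comparisons goes down.

```python
def posicao_ordenada(lista, tamanho_lista, numero):
    # Binary search for the first index whose element is greater than numero
    inicio, fim = 0, tamanho_lista
    while inicio < fim:
        meio = (inicio + fim) // 2
        if numero < lista[meio]:
            fim = meio          # the answer is at meio or to its left
        else:
            inicio = meio + 1   # the answer is strictly to the right
    return inicio


def adiciona_na_ordem_binaria(lista, tamanho_lista, numero):
    # Same result as adiciona_na_ordem, but it also accepts an empty list
    i = posicao_ordenada(lista, tamanho_lista, numero)
    return lista[:i] + [numero] + lista[i:]


lista = [0, 1, 2, 3]
print(adiciona_na_ordem_binaria(lista, len(lista), 10))  # [0, 1, 2, 3, 10]
```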

And read the file line by line:

entrada = open('dados.txt', 'r')

lista_ordenada = [int(entrada.readline())]  # Start with the first value so the list is never empty
num_lidos = 1  # Running count, so len(lista_ordenada) is not needed
for linha in entrada:
    numero = int(linha)
    lista_ordenada = adiciona_na_ordem(lista_ordenada, num_lidos, numero)
    num_lidos += 1

entrada.close()

An example would be:

For the input:
7
7
1
4
4
5
6
7

>>> print (lista_ordenada)
[1, 4, 4, 5, 6, 7, 7, 7]

2) Take the middle element(s)

if num_lidos % 2 == 1:
    mediana = lista_ordenada[num_lidos//2]
else:
    mediana = (lista_ordenada[num_lidos//2 -1]+lista_ordenada[num_lidos//2]) / 2
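
As a quick check, the same logic can be wrapped in a function and run on the sorted example list from step 1. This is a small sketch assuming Python 3, where / is true division; the function name mediana_de_lista_ordenada is invented for the example.

```python
def mediana_de_lista_ordenada(lista_ordenada, num_lidos):
    meio = num_lidos // 2
    if num_lidos % 2 == 1:
        # Odd count: the single central element
        return lista_ordenada[meio]
    # Even count: average of the two central elements
    return (lista_ordenada[meio - 1] + lista_ordenada[meio]) / 2


# Eight values, so the median is the average of the 4th and 5th (5 and 6)
print(mediana_de_lista_ordenada([1, 4, 4, 5, 6, 7, 7, 7], 8))  # 5.5
```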

1

I would like to suggest another approach. If the numbers are integers, or have only a few decimal places, many of them are likely to repeat, so we can count the occurrences of each number with a dictionary. That way we do not use as much memory.

1) We read the file line by line and build a dictionary of occurrences, while maintaining a list of the keys in sorted order (we will use it later)

Function that puts a number in a list at the ordered position:

def adiciona_na_ordem(lista, tamanho_lista, numero):
    for i in range(0, tamanho_lista):
        if numero < lista[i]:
            break
    else:
        i+=1
    return lista[:i]+[numero]+lista[i:]

We read line by line:

entrada = open('dados.txt', 'r')

primeiro_valor = int(entrada.readline())
ocorrencias = {primeiro_valor:1}
lista_chaves_ordenadas = [primeiro_valor]
num_linhas = 1

for linha in entrada:
    numero = int(linha)
    if numero in ocorrencias:  # The number is already in the dictionary
        ocorrencias[numero] += 1
    else:  # Not there yet, so add it to ocorrencias
        ocorrencias[numero] = 1
        # OR ocorrencias.update({numero: 1})
        lista_chaves_ordenadas = adiciona_na_ordem(lista_chaves_ordenadas, len(lista_chaves_ordenadas), numero)

    num_linhas += 1

entrada.close()

Example of what the variables would look like:

For the input:
7
7
1
4
4
5
6
7

>>> print (ocorrencias)
{7: 3, 1: 1, 4: 2, 5: 1, 6: 1}
>>> print (lista_chaves_ordenadas)
[1, 4, 5, 6, 7]

2) Now we calculate the median by walking through the keys in sorted order until the running count of occurrences reaches half of the total

if num_linhas % 2 == 1:  # Look for the central element
    num_elementos = 0
    for key in lista_chaves_ordenadas:
        num_elementos += ocorrencias[key]
        if num_elementos >= num_linhas/2:
            print(key)
            break

else:  # Average of the two central elements
    num_elementos = 0
    mediana = None
    for key in lista_chaves_ordenadas:
        num_elementos += ocorrencias[key]
        # The next 2 ifs handle central values under different keys, e.g. 5 and 6
        if num_elementos == num_linhas/2:
            mediana = key
        if num_elementos > num_linhas/2 and mediana is not None:
            mediana += key
            print(mediana/2)
            break
        # The next if handles central values under the same key, e.g. 6 and 6
        if num_elementos > num_linhas/2 and mediana is None:
            mediana = key
            print(key)
            break
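
The same idea can be written as a single function that returns the median instead of printing it. This is a sketch under the assumption of Python 3 division, not the exact code above: the name mediana_por_contagem is made up, and it tracks the positions of the two central elements directly, so the result is always a float.

```python
def mediana_por_contagem(ocorrencias, chaves_ordenadas, num_linhas):
    # 1-based positions of the central elements; they coincide for odd totals
    pos_baixa = (num_linhas + 1) // 2
    pos_alta = num_linhas // 2 + 1
    acumulado = 0
    valor_baixo = None
    for chave in chaves_ordenadas:
        acumulado += ocorrencias[chave]
        # First key whose running count reaches the lower central position
        if valor_baixo is None and acumulado >= pos_baixa:
            valor_baixo = chave
        # Once the upper central position is reached, the median is known
        if acumulado >= pos_alta:
            return (valor_baixo + chave) / 2
    raise ValueError("no values to compute a median from")


ocorrencias = {7: 3, 1: 1, 4: 2, 5: 1, 6: 1}
lista_chaves_ordenadas = [1, 4, 5, 6, 7]
print(mediana_por_contagem(ocorrencias, lista_chaves_ordenadas, 8))  # 5.5
```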

Edit: If you use Python 2.7, a dictionary with small integer keys may appear sorted when you iterate over it with for key, value in ocorrencias.iteritems():, but that order is not guaranteed, so it is safer to keep iterating with for key in lista_chaves_ordenadas:. Also be careful with the divisions in Python 2: use /2.0 so the result is not truncated to an integer.
