Combination of values for sample selection

Asked

Viewed 503 times

3

Guys, I’m having difficulties to finish a job and would like help. Anyway, below is the excerpt from my list.csv:

n,NOME,COR,VALOR
1,Lapis ,Verde,"11,1"
2,Caneta ,Vermelho,"12,25"
3,Lapiseira ,Azul,"15,45"
...

and the code I wrote in Python:

from itertools import combinations
import csv

def get_column_of_csv(filename, column):
    with open(filename) as stream:
        catalogo = csv.DictReader(stream)
        for row in catalogo:
            yield row[column]

for name in get_column_of_csv('lista.csv', 'VALOR'):
    for i in (2, len(name)):
            for comb in combinations(name, i):
                if sum(int(comb)) <= 25:
                    print (comb,'<= 25,00')

    print(name)

My intention is to make the code combine values so that their sum does not exceed certain established value and at the end show me the names of the items whose pairs did not exceed this value. An example of output would be: "Pencil + Pencil" (Since the sum of these does not exceed the established value of 25,00).

I saw this post: LINK based on my code, but I was unsuccessful.

ERROR:

Traceback (most recent call last):
  File "kits.py", line 13, in <module>
    if sum(int(comb)) <= 25:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

If anyone can help me... I’d be grateful.

EXTRA: If in the list of items I had the name of the item together with the manufacturer (ex: BIC Pen, Faber-Castell Pen,...) and I wanted to assemble only combination of items of a certain brand, or exclude a certain brand from the filter, as could be done?

3 answers

2

The error is basically because you are trying to convert a tuple into a whole. Basically doing something like:

int((2, 5, 7))

And the class int don’t know what to do with a tuple.

I found some points of code a little confusing, mainly the line for i in (2, len(name)); did not understand why you will generate combinations with 2 items and with len(name) items. How you are accessing the column 'VALOR', the value of name will be, for example, "11,1", and so len(name) would be 4. After that you generate the combinations in name with i items, but this would generate combinations between the characters of "11,1" that is not what you intend to do.

Your code will be much simpler if you separate the responsibilities by functions. The first step is to get the file data, then we do the function that reads the file returning a generator to the lines already converting the value of VALOR for float:

def read_csv(path):
    with open(path) as stream:
        reader = csv.DictReader(stream)
        for row in reader:
            row['VALOR'] = float(row['VALOR'].replace(',', '.'))
            yield row

From the result we want to create the combinations 2 to 2, so we use the function itertools.combinations:

data = read_csv('data.csv')
combinations = itertools.combinations(data, 2)

Of all combinations only those that have the column sum VALOR less than or equal to 25. For this we create a function that will receive the sequence of combinations, the name of the column we want to add and the desired limit:

def sum_of_column_le(combinations, column, limit):
    for combination in combinations:
        sum_of_column = sum(product.get(column, 0) for product in combination)
        if sum_of_column <= limit:
            yield combination

So, to get the combinations that have the sum of VALOR less than or equal to 25 do:

result = sum_of_column_le(combinations, 'VALOR', 25.00)

And so we can iterate on result to display the combinations that interest us:

for combination in result:
    print([product['NOME'] for product in combination])

For the question data:

n,NOME,COR,VALOR
1,Lapis ,Verde,"11,1"
2,Caneta ,Vermelho,"12,25"
3,Lapiseira ,Azul,"15,45"

The exit would be:

['Lapis ', 'Caneta ']

For it is the last combination of products that has the sum of values less than or equal to 25.

See working in Repl.it | Github GIST

  • Aderson, thank you for your reply. It worked cool. I just got a question, is it possible to improve the outgoing message? And about the 'EXTRA' that I had commented... in case I wanted it to iterate only the items that had red color (for example), you could guide me on how to do?

  • @Marcelojoão Yes, having the information you can generate the output you want. About EXTRA, it is possible to filter as soon as you read the file in read_csv; there just do the validation you want and only do the yield items that pass validation.

  • Got it. Just to clarify... the output in the case, it comes out of this format: ['Pencil', 'Pen']; ?

  • Yes, you can do whatever you want. Who defines the output is the print in for combination in result; just change it according to your needs;

  • I get it, thank you.

  • Anderson, I came up with just one more question... the output is with repetition, for example: Pencil + Pen and Pen + Pencil. Is this easy to filter? So there’ll be no repetitions...

  • @Marcelojoão You must have done something different. The form of the answer does not generate this duplicate combination; the combinations it generates is (Pencil, Pen), (Pencil, Pencil) and (Pencil, Pen).

  • True, I got confused. Thank you.

Show 3 more comments

1


TL;DR

A solution with pandas, I used the data of your example but added the brands to demonstrate the filtering, eliminating the lines with the brand "bar":

import io
import pandas as pd
from itertools import combinations

# Simulando o csv
csv = '''
nome,cor,marca,valor
Lapis, Verde, bic,"11,1"
Caneta, Vermelho, cristal,"12,25"
Lapiseira, Azul, foo,"15,45"
Lapis, Verde, foo,"12,00"
borracha, Branca, bar,"10,00"
borracha, Verde, bar,"11,00"
'''

# Lendo o csv 
df = pd.read_csv(io.StringIO(csv))

# Filtrando o df para que nao contenha a marca 'bar'
df = df[~df.marca.str.contains('bar')]

# Criando coluna para uma tupla de descricao e valor (adapte ao seu contexto)
df['descricao'] = df.nome + df.cor + df.marca
df['item'] = list(zip(df.descricao, df.valor))

# Apresentando o dataframe resultante
print(df)

Exit:

        nome        cor     marca  valor                descricao  \
0      Lapis      Verde       bic   11.1          Lapis Verde bic   
1     Caneta   Vermelho   cristal  12.25  Caneta Vermelho cristal   
2  Lapiseira       Azul       foo  15.45       Lapiseira Azul foo   
3      Lapis      Verde       foo  12.00          Lapis Verde foo   

                               item  
0           (Lapis Verde bic, 11.1)  
1  (Caneta Vermelho cristal, 12.25)  
2       (Lapiseira Azul foo, 15.45)  
3          (Lapis Verde foo, 12.00)  

The next step is to combine the items into a list:

# Fazendo as combinacoes
combs = list(combinations(df.item, 2))

# Apresentando as combinacoes
print(combs)

Exit:

[(('Lapis Verde bic', '11.1'), ('Caneta Vermelho cristal', '12.25')),
 (('Lapis Verde bic', '11.1'), ('Lapiseira Azul foo', '15.45')),
 (('Lapis Verde bic', '11.1'), ('Lapis Verde foo', '12.00')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapiseira Azul foo', '15.45')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapis Verde foo', '12.00')),
 (('Lapiseira Azul foo', '15.45'), ('Lapis Verde foo', '12.00'))]

Finally we go through the list of combinations, selecting the ones that meet the condition, in the sum of the values <= 25

# Selecionando as combinacoes validas
valid_combs = []
max_value = 25
for comb in combs:
    if float(comb[0][1]) + float(comb[1][1]) <= max_value:
        valid_combs.append(comb)

# Apresentando o resultado
print(valid_combs)        

Output (Tuples with combinations <= 25):

[(('Lapis Verde bic', '11.1'), ('Caneta Vermelho cristal', '12.25')),
 (('Lapis Verde bic', '11.1'), ('Lapis Verde foo', '12.00')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapis Verde foo', '12.00'))]

Putting the final result into one pandas.DataFrame:

final_list = []
for valid in valid_combs:
    final_list.append((valid[0][0],valid[0][1], valid[1][0],valid[1][1], 
                       float(valid[0][1])+float(valid[1][1])))

df_final = pd.DataFrame(final_list, columns=['Item 1', 'Valor', 
                                              'Item 2', 'Valor', 'Total'])

# Apresentando o resultado final
print(df_final)

Exit:

                    Item 1  Valor                   Item 2  Valor  Total
0          Lapis Verde bic   11.1  Caneta Vermelho cristal  12.25  23.35
1          Lapis Verde bic   11.1          Lapis Verde foo  12.00  23.10
2  Caneta Vermelho cristal  12.25          Lapis Verde foo  12.00  24.25

Filtering the DataFrame final:

Let’s filter the final result by removing the lines in which the word "Pen"

# Filtrando o df, removendo as combinacoes q tenham "Caneta"
df_filtered = df_final[~df_final['Item 1'].str.contains('Caneta') & 
                       ~df_final['Item 2'].str.contains('Caneta')]

# Apresentando o dataframe filtrado
print('','DataFrame filtrado: ', df_filtered,sep='\n')

Exit

DataFrame filtrado: 
            Item 1 Valor           Item 2  Valor  Total
1  Lapis Verde bic  11.1  Lapis Verde foo  12.00   23.1

See working on repl.it.

  • Hello Sidon, thank you so much for the reply. Pandas is very cool, but unfortunately it is on a level far beyond my knowledge... rs. However, I put the code and it worked well. I just didn’t like the output very much, because the list is very large, so it ends up getting very cramped and full of special characters. Is this possible to improve? Similar to what was done in Dataframe...

  • Just put the valid combinations in a dataframe, I’m editing the answer to put the final result in a df,

  • Updated tb on repl it.

  • Detail: The pandas only seems complex, but after you get used to it you will see that it is a "hand on the wheel" for finance, data-sciencie, tabulation, etc. A Swiss Army knife in big data. :-)

  • Thank you Sidon.

  • Sidon, a question has arisen... My record . csv is external, so for me to open it I believe the command is: df = pd.read_csv('LISTA.csv') Right?! If in this file the header has a two-word title (e.g., LOWEST VALUE), how would I have to define line 29 of your code? And one more question, let’s assume that I want to delete from the combination some items of the heading NAME (e.g.: Pencil), as I could do this?

  • The file opening is correct, the other questions I did not understand, you would like to discard the first line of the file, is that it? Line 29? what’s in it. Delete from the combination or the final df? The "chat" is getting specific. :-)

  • Actually my intention is to learn how the data entry works. On line 29 you set df['descricao'] = df.nome + df.cor + df.marca since the header items are: nome,cor,marca,valor. If instead of 'value' I had two strings, for example 'lower value', how would I define it? df.menor valor (...only one example). And the other question, I believe it is more complex, but it is to know how to exclude values from the combination... for example, I want a combination only of LAPIS items, thus ignoring the others (PENCIL AND PEN)... this is simple to be done?

  • Okay, I edited the answer and added the topic Filtering the final dataframe: I tried to answer everything in this topic, see that now I am referencing the columns differently, because their name are composed instead of df.label I’m making df["label"] as in df_final['Item 1']. I also make a filter removing the lines that have the plavra "Pen", The filter could be inverted tb, ie only consider the lines that had the word Caneta, for this it would be enough to remove the denial signal ~ and replace & for |

  • Got it, thanks. By the way, line 22 would be df["MENOR VALOR"] = df["MENOR VALOR"].str.replace(',','.') thus?!

  • That, that, that, that! :-)

  • I get it, thank you.

  • Sidon, I promise this is the last question. For me to filter more than one word I have to do so (line 55): df_filtered = df_final[~df_final['Item 1'].str.contains('CANETA', 'LAPIS')? I tried, but it didn’t work. And I also took advantage and made a test for a very large list and noticed that the result has presented values: '...', because? I tried to export to . csv the result, but it did not work, keeps appearing the term continuation. Is it possible to be corrected? How?!

  • df_caneta_lapis = df_final[df_final['Item 1'].str.contains('Caneta') | df_final['Item 2'].str.contains('Lapis')] To avoid the dots on print(): pd.options.display.max_rows = 999 and pd.options.display.max_columns = 999. Take a look at the repl.it, I updated there.

  • Change the values of pd.options.display for your need.

  • You did it right. Thank you very much, really. In fact, just a detail that I noticed after the modification, imagine that I was going to make the impression of the result... the 'List 1' with the values was on one sheet and the 'List 2' is in another. Making it a little difficult to visualize... you know what might have brought this about?

  • Without seeing it is difficult to imagine what it is, but usually the pandas tries to adjust the visualization when the number of columns does not fit in the printing area, there is no way it will be difficult to visualize even, but you can tabulate. Remember that you complained about the visualization of combinations (tuples) with the print() pure? Now I made a presentation with tabulate in repl.it, take a look there.

  • Thus: https://paste.ubuntu.com/p/Zt4hD2zXY7/

  • So... this indicates that the visualization space that you’re trying to make the presentation doesn’t hold all the columns, so, intelligently, the pandas breaks into 2. Do a test, try it like this print('','DataFrame filtrado: ', df_filtered.to_string(), sep='\n') and you will see that the print will now be truncated, because I converted it to string before. You can test the replit. Leave the window where it presents the results very narrowly and see the difference of the print(df.to_string()), for only print(df), actually the pandas are helping you, without it all gets messy.

  • If the space for the presentation is not sufficient you will have to reduce the information in the columns, like present only the item code instead of the description or only part of the description, truncating in a given position, for example.

  • 1

    Sidon, using the df_filtered.to_string() gave right. Thank you very much. I will start studying Pandas, motivated me.

Show 16 more comments

1

I changed your code a little bit and it worked right here.

What I did:

  • I took the column NAME and played all the items on a list;
  • I took the column WORTH and played all the items on a list;
  • I made a replace in the columnar WORTH and changed all the ',' for '.' to convert to float.
  • I discovered the index where the sum of products is below 25.

With that, the code went like this:

from itertools import combinations
import csv

lista_valor = []
lista_produto = []
combinacoes_possiveis = []
resultado = []

def get_column_of_csv(filename, column):
    with open(filename) as stream:
        catalogo = csv.DictReader(stream)
        for row in catalogo:
            yield row[column]

for produto in get_column_of_csv('lista.csv', 'NOME'):

    lista_produto.append(produto)


for name in get_column_of_csv('lista.csv', 'VALOR'):
    name = str(name)                # TRANSFORMA EM STRING PARA PODER USAR REPLACE
    name = name.replace(',', '.')   # TROCA , POR . PARA PODER CONVERTER EM FLOAT
    name = float(name)              # CONVERTE EM FLOAT

    lista_valor.append(name)        # ADICIONA O VALOR A UMA LISTA


# --------BUSCA COMBINAÇÃO DE 2 VALORES DA LISTA-----------------
# --------------QUE NÃO ULTRAPASSAM 25---------------------------

for t in combinations(lista_valor,2):
    if sum(t) <= 25:
        idx = [lista_valor.index(a) for a in t]    # PEGA O INDEX NO QUAL A COMBINAÇÃO
                                                   # NÃO ULTRAPASSA 25

        combinacoes_possiveis.append(idx)



print(combinacoes_possiveis)     # PARES DE ÍNDICES POSSÍVEIS

for i in range(len(combinacoes_possiveis)):  ### PREVINE RESULTADOS REPETIDOS

  if str(lista_produto[combinacoes_possiveis[i][0]]) + 'e ' + 
  str(lista_produto[combinacoes_possiveis[i][1]]) in resultado:
    print('Elemento repetido')

  else:

    resultado.append(str(lista_produto[combinacoes_possiveis[i][0]]) + 'e ' +
    str(lista_produto[combinacoes_possiveis[i][1]]))


print(resultado)     # ['Lapis e Caneta ', 'Lapis e Lapiseira ', 'Lapis e Boracha ', 'Lapis e Caneta Colorida ', 'Lapis e Corretivo ', 'Caneta e Lapiseira ', 'Caneta e Boracha ', 'Caneta e Caneta Colorida ', 'Caneta e Corretivo ', 'Lapiseira e Boracha ', 'Lapiseira e Lapiseira ', 'Lapiseira e Caneta Colorida ', 'Lapiseira e Corretivo ', 'Boracha e Lapiseira ', 'Boracha e Caneta Colorida ', 'Boracha e Corretivo ', 'Caneta Colorida e Corretivo ']

Good luck with your studies!

EDIT:

I changed all the CSV values so there were 21 possible combinations.

  • 1

    William, thank you for the answer... great idea you had. However, he did what I want, but not necessarily what I need. He just showed me a combination and not all possible (no repetition, of course). Why?

  • So, Marcelo, if you print the idx inside the for will see that it returns all possible combinations of pairs (which was defined two lines above) by the index of the combination. In the example I did, it returns [0, 1] because it is the elements of the list that the combination is possible. If you change the values and all combinations are possible, the idx will return: [0, 1]&#xA;[0, 2]&#xA;[1, 2] which are the possible combinations. I hope you have managed to understand!

  • Hello William, thank you for your attention. I’m unfortunately starting my studies so... I didn’t quite understand what I should edit so that he would create combinations with all the elements on the list. If you could show me I’d be very grateful.

  • Marcelo, I edited the code and tried to make it a little easier to understand. First, I changed the product values to 5.00 to make all combinations possible. Then I created a new list (possible combinations) to add the index of the elements to it. On line 41 I added the index of the combinations that are possible to be made to the new created list. Thus, I hold all positions of the CSV in which the condition is true. In the prints at the end of the code, you can get a better idea now of how to use these indexes. I hope you’ve made your understanding easier this time!

  • I believe I do. The detail is that the list has about 350 items, so there are many combinations, if I have to add one for each... I don’t think it would solve my problem (...if I understood the code correctly).

  • Marcelo, I edited the code again to be global. I did the test with 21 possible combinations and it worked. I created a new list to store the final results, and added to it the names of the products where the combination is possible. You can find the change at the end of the code (line 50).

  • Thank you William.

Show 2 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.