Combination of values for sample selection

Question

Combination of values for sample selection

Asked 5 years, 2 months ago

Viewed 503 times

3

Guys, I’m having difficulties to finish a job and would like help. Anyway, below is the excerpt from my list.csv:

n,NOME,COR,VALOR
1,Lapis ,Verde,"11,1"
2,Caneta ,Vermelho,"12,25"
3,Lapiseira ,Azul,"15,45"
...

and the code I wrote in Python:

from itertools import combinations
import csv

def get_column_of_csv(filename, column):
    with open(filename) as stream:
        catalogo = csv.DictReader(stream)
        for row in catalogo:
            yield row[column]

for name in get_column_of_csv('lista.csv', 'VALOR'):
    for i in (2, len(name)):
            for comb in combinations(name, i):
                if sum(int(comb)) <= 25:
                    print (comb,'<= 25,00')

    print(name)

My intention is to make the code combine values so that their sum does not exceed certain established value and at the end show me the names of the items whose pairs did not exceed this value. An example of output would be: "Pencil + Pencil" (Since the sum of these does not exceed the established value of 25,00).

I saw this post: LINK based on my code, but I was unsuccessful.

ERROR:

Traceback (most recent call last):
  File "kits.py", line 13, in <module>
    if sum(int(comb)) <= 25:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

If anyone can help me... I’d be grateful.

EXTRA: If in the list of items I had the name of the item together with the manufacturer (ex: BIC Pen, Faber-Castell Pen,...) and I wanted to assemble only combination of items of a certain brand, or exclude a certain brand from the filter, as could be done?

3 answers

1

TL;DR

A solution with pandas, I used the data of your example but added the brands to demonstrate the filtering, eliminating the lines with the brand "bar":

import io
import pandas as pd
from itertools import combinations

# Simulando o csv
csv = '''
nome,cor,marca,valor
Lapis, Verde, bic,"11,1"
Caneta, Vermelho, cristal,"12,25"
Lapiseira, Azul, foo,"15,45"
Lapis, Verde, foo,"12,00"
borracha, Branca, bar,"10,00"
borracha, Verde, bar,"11,00"
'''

# Lendo o csv 
df = pd.read_csv(io.StringIO(csv))

# Filtrando o df para que nao contenha a marca 'bar'
df = df[~df.marca.str.contains('bar')]

# Criando coluna para uma tupla de descricao e valor (adapte ao seu contexto)
df['descricao'] = df.nome + df.cor + df.marca
df['item'] = list(zip(df.descricao, df.valor))

# Apresentando o dataframe resultante
print(df)

Exit:

        nome        cor     marca  valor                descricao  \
0      Lapis      Verde       bic   11.1          Lapis Verde bic   
1     Caneta   Vermelho   cristal  12.25  Caneta Vermelho cristal   
2  Lapiseira       Azul       foo  15.45       Lapiseira Azul foo   
3      Lapis      Verde       foo  12.00          Lapis Verde foo   

                               item  
0           (Lapis Verde bic, 11.1)  
1  (Caneta Vermelho cristal, 12.25)  
2       (Lapiseira Azul foo, 15.45)  
3          (Lapis Verde foo, 12.00)

The next step is to combine the items into a list:

# Fazendo as combinacoes
combs = list(combinations(df.item, 2))

# Apresentando as combinacoes
print(combs)

Exit:

[(('Lapis Verde bic', '11.1'), ('Caneta Vermelho cristal', '12.25')),
 (('Lapis Verde bic', '11.1'), ('Lapiseira Azul foo', '15.45')),
 (('Lapis Verde bic', '11.1'), ('Lapis Verde foo', '12.00')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapiseira Azul foo', '15.45')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapis Verde foo', '12.00')),
 (('Lapiseira Azul foo', '15.45'), ('Lapis Verde foo', '12.00'))]

Finally we go through the list of combinations, selecting the ones that meet the condition, in the sum of the values <= 25

# Selecionando as combinacoes validas
valid_combs = []
max_value = 25
for comb in combs:
    if float(comb[0][1]) + float(comb[1][1]) <= max_value:
        valid_combs.append(comb)

# Apresentando o resultado
print(valid_combs)

Output (Tuples with combinations <= 25):

[(('Lapis Verde bic', '11.1'), ('Caneta Vermelho cristal', '12.25')),
 (('Lapis Verde bic', '11.1'), ('Lapis Verde foo', '12.00')),
 (('Caneta Vermelho cristal', '12.25'), ('Lapis Verde foo', '12.00'))]

Putting the final result into one pandas.DataFrame:

final_list = []
for valid in valid_combs:
    final_list.append((valid[0][0],valid[0][1], valid[1][0],valid[1][1], 
                       float(valid[0][1])+float(valid[1][1])))

df_final = pd.DataFrame(final_list, columns=['Item 1', 'Valor', 
                                              'Item 2', 'Valor', 'Total'])

# Apresentando o resultado final
print(df_final)

Exit:

                    Item 1  Valor                   Item 2  Valor  Total
0          Lapis Verde bic   11.1  Caneta Vermelho cristal  12.25  23.35
1          Lapis Verde bic   11.1          Lapis Verde foo  12.00  23.10
2  Caneta Vermelho cristal  12.25          Lapis Verde foo  12.00  24.25

Filtering the DataFrame final:

Let’s filter the final result by removing the lines in which the word "Pen"

# Filtrando o df, removendo as combinacoes q tenham "Caneta"
df_filtered = df_final[~df_final['Item 1'].str.contains('Caneta') & 
                       ~df_final['Item 2'].str.contains('Caneta')]

# Apresentando o dataframe filtrado
print('','DataFrame filtrado: ', df_filtered,sep='\n')

Exit

DataFrame filtrado: 
            Item 1 Valor           Item 2  Valor  Total
1  Lapis Verde bic  11.1  Lapis Verde foo  12.00   23.1

See working on repl.it.

Hello Sidon, thank you so much for the reply. Pandas is very cool, but unfortunately it is on a level far beyond my knowledge... rs. However, I put the code and it worked well. I just didn’t like the output very much, because the list is very large, so it ends up getting very cramped and full of special characters. Is this possible to improve? Similar to what was done in Dataframe...

– Marcelo João

2019/04/05 at 17:06
Just put the valid combinations in a dataframe, I’m editing the answer to put the final result in a df,

– Sidon

2019/04/05 at 19:10
Updated tb on repl it.

– Sidon

2019/04/05 at 19:14
Detail: The pandas only seems complex, but after you get used to it you will see that it is a "hand on the wheel" for finance, data-sciencie, tabulation, etc. A Swiss Army knife in big data. :-)

– Sidon

2019/04/05 at 19:21
Thank you Sidon.

– Marcelo João

2019/04/06 at 15:14
Sidon, a question has arisen... My record . csv is external, so for me to open it I believe the command is: df = pd.read_csv('LISTA.csv') Right?! If in this file the header has a two-word title (e.g., LOWEST VALUE), how would I have to define line 29 of your code? And one more question, let’s assume that I want to delete from the combination some items of the heading NAME (e.g.: Pencil), as I could do this?

– Marcelo João

2019/04/07 at 12:15
The file opening is correct, the other questions I did not understand, you would like to discard the first line of the file, is that it? Line 29? what’s in it. Delete from the combination or the final df? The "chat" is getting specific. :-)

– Sidon

2019/04/07 at 14:49
Actually my intention is to learn how the data entry works. On line 29 you set df['descricao'] = df.nome + df.cor + df.marca since the header items are: nome,cor,marca,valor. If instead of 'value' I had two strings, for example 'lower value', how would I define it? df.menor valor (...only one example). And the other question, I believe it is more complex, but it is to know how to exclude values from the combination... for example, I want a combination only of LAPIS items, thus ignoring the others (PENCIL AND PEN)... this is simple to be done?

– Marcelo João

2019/04/07 at 16:59
Okay, I edited the answer and added the topic Filtering the final dataframe: I tried to answer everything in this topic, see that now I am referencing the columns differently, because their name are composed instead of df.label I’m making df["label"] as in df_final['Item 1']. I also make a filter removing the lines that have the plavra "Pen", The filter could be inverted tb, ie only consider the lines that had the word Caneta, for this it would be enough to remove the denial signal ~ and replace & for |

– Sidon

2019/04/08 at 12:39
Got it, thanks. By the way, line 22 would be df["MENOR VALOR"] = df["MENOR VALOR"].str.replace(',','.') thus?!

– Marcelo João

2019/04/08 at 17:56
That, that, that, that! :-)

– Sidon

2019/04/08 at 17:56
I get it, thank you.

– Marcelo João

2019/04/08 at 18:05
Sidon, I promise this is the last question. For me to filter more than one word I have to do so (line 55): df_filtered = df_final[~df_final['Item 1'].str.contains('CANETA', 'LAPIS')? I tried, but it didn’t work. And I also took advantage and made a test for a very large list and noticed that the result has presented values: '...', because? I tried to export to . csv the result, but it did not work, keeps appearing the term continuation. Is it possible to be corrected? How?!

– Marcelo João

2019/04/08 at 18:43
df_caneta_lapis = df_final[df_final['Item 1'].str.contains('Caneta') | df_final['Item 2'].str.contains('Lapis')] To avoid the dots on print(): pd.options.display.max_rows = 999 and pd.options.display.max_columns = 999. Take a look at the repl.it, I updated there.

– Sidon

2019/04/08 at 19:18
Change the values of pd.options.display for your need.

– Sidon

2019/04/08 at 19:19
You did it right. Thank you very much, really. In fact, just a detail that I noticed after the modification, imagine that I was going to make the impression of the result... the 'List 1' with the values was on one sheet and the 'List 2' is in another. Making it a little difficult to visualize... you know what might have brought this about?

– Marcelo João

2019/04/08 at 21:21
Without seeing it is difficult to imagine what it is, but usually the pandas tries to adjust the visualization when the number of columns does not fit in the printing area, there is no way it will be difficult to visualize even, but you can tabulate. Remember that you complained about the visualization of combinations (tuples) with the print() pure? Now I made a presentation with tabulate in repl.it, take a look there.

– Sidon

2019/04/08 at 21:39
Thus: https://paste.ubuntu.com/p/Zt4hD2zXY7/

– Marcelo João

2019/04/09 at 00:14
So... this indicates that the visualization space that you’re trying to make the presentation doesn’t hold all the columns, so, intelligently, the pandas breaks into 2. Do a test, try it like this print('','DataFrame filtrado: ', df_filtered.to_string(), sep='\n') and you will see that the print will now be truncated, because I converted it to string before. You can test the replit. Leave the window where it presents the results very narrowly and see the difference of the print(df.to_string()), for only print(df), actually the pandas are helping you, without it all gets messy.

– Sidon

2019/04/09 at 00:31
If the space for the presentation is not sufficient you will have to reduce the information in the columns, like present only the item code instead of the description or only part of the description, truncating in a given position, for example.

– Sidon

2019/04/09 at 00:35
1

Sidon, using the df_filtered.to_string() gave right. Thank you very much. I will start studying Pandas, motivated me.

– Marcelo João

2019/04/09 at 11:07

Show 16 more comments

Browser other questions tagged python

You are not signed in. Login or sign up in order to post.

by Woss • **73,416** points · Answer 1 · 2019-04-05T13:30:21+00:00

The error is basically because you are trying to convert a tuple into a whole. Basically doing something like:

int((2, 5, 7))

And the class int don’t know what to do with a tuple.

I found some points of code a little confusing, mainly the line for i in (2, len(name)); did not understand why you will generate combinations with 2 items and with len(name) items. How you are accessing the column 'VALOR', the value of name will be, for example, "11,1", and so len(name) would be 4. After that you generate the combinations in name with i items, but this would generate combinations between the characters of "11,1" that is not what you intend to do.

Your code will be much simpler if you separate the responsibilities by functions. The first step is to get the file data, then we do the function that reads the file returning a generator to the lines already converting the value of VALOR for float:

def read_csv(path):
    with open(path) as stream:
        reader = csv.DictReader(stream)
        for row in reader:
            row['VALOR'] = float(row['VALOR'].replace(',', '.'))
            yield row

From the result we want to create the combinations 2 to 2, so we use the function itertools.combinations:

data = read_csv('data.csv')
combinations = itertools.combinations(data, 2)

Of all combinations only those that have the column sum VALOR less than or equal to 25. For this we create a function that will receive the sequence of combinations, the name of the column we want to add and the desired limit:

def sum_of_column_le(combinations, column, limit):
    for combination in combinations:
        sum_of_column = sum(product.get(column, 0) for product in combination)
        if sum_of_column <= limit:
            yield combination

So, to get the combinations that have the sum of VALOR less than or equal to 25 do:

result = sum_of_column_le(combinations, 'VALOR', 25.00)

And so we can iterate on result to display the combinations that interest us:

for combination in result:
    print([product['NOME'] for product in combination])

For the question data:

n,NOME,COR,VALOR
1,Lapis ,Verde,"11,1"
2,Caneta ,Vermelho,"12,25"
3,Lapiseira ,Azul,"15,45"

The exit would be:

['Lapis ', 'Caneta ']

For it is the last combination of products that has the sum of values less than or equal to 25.

^{See working in Repl.it | Github GIST}

by Guilherme Carvalho Lithg • **646** points · Answer 2 · 2019-04-05T12:55:02+00:00

I changed your code a little bit and it worked right here.

What I did:

I took the column NAME and played all the items on a list;
I took the column WORTH and played all the items on a list;
I made a replace in the columnar WORTH and changed all the ',' for '.' to convert to float.
I discovered the index where the sum of products is below 25.

With that, the code went like this:

from itertools import combinations
import csv

lista_valor = []
lista_produto = []
combinacoes_possiveis = []
resultado = []

def get_column_of_csv(filename, column):
    with open(filename) as stream:
        catalogo = csv.DictReader(stream)
        for row in catalogo:
            yield row[column]

for produto in get_column_of_csv('lista.csv', 'NOME'):

    lista_produto.append(produto)


for name in get_column_of_csv('lista.csv', 'VALOR'):
    name = str(name)                # TRANSFORMA EM STRING PARA PODER USAR REPLACE
    name = name.replace(',', '.')   # TROCA , POR . PARA PODER CONVERTER EM FLOAT
    name = float(name)              # CONVERTE EM FLOAT

    lista_valor.append(name)        # ADICIONA O VALOR A UMA LISTA


# --------BUSCA COMBINAÇÃO DE 2 VALORES DA LISTA-----------------
# --------------QUE NÃO ULTRAPASSAM 25---------------------------

for t in combinations(lista_valor,2):
    if sum(t) <= 25:
        idx = [lista_valor.index(a) for a in t]    # PEGA O INDEX NO QUAL A COMBINAÇÃO
                                                   # NÃO ULTRAPASSA 25

        combinacoes_possiveis.append(idx)



print(combinacoes_possiveis)     # PARES DE ÍNDICES POSSÍVEIS

for i in range(len(combinacoes_possiveis)):  ### PREVINE RESULTADOS REPETIDOS

  if str(lista_produto[combinacoes_possiveis[i][0]]) + 'e ' + 
  str(lista_produto[combinacoes_possiveis[i][1]]) in resultado:
    print('Elemento repetido')

  else:

    resultado.append(str(lista_produto[combinacoes_possiveis[i][0]]) + 'e ' +
    str(lista_produto[combinacoes_possiveis[i][1]]))


print(resultado)     # ['Lapis e Caneta ', 'Lapis e Lapiseira ', 'Lapis e Boracha ', 'Lapis e Caneta Colorida ', 'Lapis e Corretivo ', 'Caneta e Lapiseira ', 'Caneta e Boracha ', 'Caneta e Caneta Colorida ', 'Caneta e Corretivo ', 'Lapiseira e Boracha ', 'Lapiseira e Lapiseira ', 'Lapiseira e Caneta Colorida ', 'Lapiseira e Corretivo ', 'Boracha e Lapiseira ', 'Boracha e Caneta Colorida ', 'Boracha e Corretivo ', 'Caneta Colorida e Corretivo ']

Good luck with your studies!

EDIT:

I changed all the CSV values so there were 21 possible combinations.