Join lines from a text file

Given a text file with the layout below:

VASCO;JOGADOR_1
VASCO;JOGADOR_2
VASCO;JOGADOR_3
PALMEIRAS;JOGADOR_4
PALMEIRAS;JOGADOR_5
PALMEIRAS;JOGADOR_6
PALMEIRAS;JOGADOR_7

How can I write the logic (preferably in bash, Python or Java) to produce the result below?

VASCO;JOGADOR_1,JOGADOR_2,JOGADOR_3
PALMEIRAS;JOGADOR_4,JOGADOR_5,JOGADOR_6,JOGADOR_7

I don’t know in advance how many teams, or how many players per team, the text file will contain.

Each line of the generated file should contain the team name, a semicolon, and the list of that team’s players separated by commas (with no comma after the last player).

So far I have managed to write some Java code that reads the file and starts separating team from player. I am reading up on lists, with the idea of keeping a list of teams where each team has its own list of players, adding the elements at runtime. Is that a good approach?

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class Principal {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        String time;
        String jogador;

        String nomeArquivo = "input.txt";
        FileReader arquivoEntrada = new FileReader(nomeArquivo);
        BufferedReader bufferArquivoEntrada = new BufferedReader(arquivoEntrada);
        String linha = bufferArquivoEntrada.readLine();
        while (linha != null) {
            System.out.println(linha);
            // each line has the form TEAM;PLAYER
            String[] linhaSplit = linha.split(";");
            time = linhaSplit[0];
            jogador = linhaSplit[1];
            System.out.println("Time........: " + time);
            System.out.println("Funcionario.: " + jogador);
            linha = bufferArquivoEntrada.readLine();
        }
        bufferArquivoEntrada.close();
    }
}
  • Have you tried anything? What difficulty did you run into? To me it looks like a college assignment that you don’t even know how to start, lol

  • Thank you for your attention; I understand your questions. It’s a personal project, not college work, but you’re right. I have done a few things in bash, but my biggest difficulty right now is structuring the logic, even for this specific case

  • I could answer with one possible approach, but the question would be too broad as it stands... there are several ways to get the same result, and without narrowing down which language you want to use it gets broader still, lol

  • I recommend at least narrowing the scope of the question; stating "I intend to use this language" would already improve it a bit

  • If you can also post what you tried and what difficulty you ran into (structuring the logic), the question will be better targeted at those interested in answering it

  • Marcelo, thank you. I’m on the bus right now, lol; I’ll add what I have tried as soon as possible. My preference is Java, bash (Linux) or Python, since I have already played around a little with these languages

  • I’m thinking of starting the logic as follows: read the file line by line and split each line into two variables (splitting on the semicolon). I wanted the content on the left to be the name of an array or list, with that name also added to a control array, and the content on the right added to the array named by the left side. On the next line, check whether the content on the left already exists in the control array: if it does, just add the new element, otherwise create a new array... At the end, iterate over each array, adding the commas that separate the elements

  • Edit your question and post it there :)

2 answers

1


You can create an algorithm like this:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

public class App {

    public static void main(final String[] args) throws IOException {
        final File file = new File("input.txt");
        final FileInputStream stream = new FileInputStream(file);
        final String content = new Scanner(stream, "UTF-8").useDelimiter("\\A").next();
        final Map<String, Set<String>> map = new LinkedHashMap<>();

        // split on \n or \r\n so Windows line endings do not leak into player names
        for (final String line : content.split("\\r?\\n")) {
            final String[] array = line.split(";");
            final String team = array[0];
            final String player = array[1];

            if (map.containsKey(team)) {
                map.get(team).add(player);
            } else {
                final Set<String> players = new LinkedHashSet<>();
                players.add(player);
                map.put(team, players);
            }
        }

        final List<String> lines = new ArrayList<>(map.size());

        for (final Map.Entry<String, Set<String>> entry : map.entrySet()) {
            final StringBuilder builder = new StringBuilder(entry.getKey() + ";");
            final Iterator<String> iterator = entry.getValue().iterator();

            while (iterator.hasNext()) {
                builder.append(iterator.next()).append(iterator.hasNext() ? "," : "");
            }

            lines.add(builder.toString());
        }

        final Path outputFile = Paths.get("output.txt");
        Files.write(outputFile, lines, Charset.forName("UTF-8"));

        System.out.println("Input: \n" + content);
        System.out.println("Output: \n" + lines);
    }
}
  • Pedro, thank you very much! I tested your code with a small data set and the result seems to be 100% as expected. I will study the concepts you used to learn more. Something in my head said "use something with key/value", but I couldn’t work out the implementation. The trick there was the use of LinkedHashMap, right? I will study it. I am marking your answer as the correct one! Again, thank you very much!

  • Yes, LinkedHashMap and LinkedHashSet both maintain insertion order.

1

We can assume bash 4 or later, right? In that case we have associative arrays (maps). And since we’re working with text, which is bash’s basic variable type, we’re on home turf.

Let’s call our script cria_times.sh, OK? Like every good shell script, it should be as flexible as possible, so it reads from standard input. To feed it the file input.txt, just run: ./cria_times.sh < input.txt.

The basic idea is a loop built around these steps:

  1. the line has the format <TIME> ; <JOGADOR>
  2. I read the line and separate the components <TIME> and <JOGADOR>
  3. add <JOGADOR> to the associative array entry identified by <TIME>

To make a simple reading loop, I like to always start like this:

initialize

while IFS= read -r LINHA; do
    do_something
done

Now we need to put the 3 steps listed above inside do_something and set things up in initialize. As said earlier, I need an association between each team and its players; so I already know that I need the following in initialize:

declare -A times

The declare builtin creates variables in bash. The options it receives set the attributes of the variable. For example, to create a true integer variable (yes, sometimes we need integers in shell scripts), we could write declare -i meu_inteiro.

Besides creating variables, declare can also show the contents of a variable, with declare -p VAR. This is very useful when you are testing the script on the command line. For example, what does declare -p times print right after times has been created?
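As a quick illustration of the two declare options mentioned above, here is a small sketch to try at the prompt. The key VASCO and its value are just sample data taken from the question. The exact text printed by declare -p differs a little between bash versions, so no output is shown for it.

```bash
#!/bin/bash
# Sketch: declare -i gives the variable the integer attribute, so the
# right-hand side of an assignment is evaluated as arithmetic.
declare -i meu_inteiro
meu_inteiro=2+3
echo "${meu_inteiro}"    # prints 5

# declare -A creates an associative array; declare -p dumps its
# attributes and current contents (format varies by bash version).
declare -A times
times[VASCO]="JOGADOR_1"
declare -p times
```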

Now we have an associative array. To get the value stored under the key Palmeiras, we use the following variable expansion:

time=${times[Palmeiras]}

Note that the array key is case sensitive.

So we can access the string time associated with the key Palmeiras. We want the players to be separated by commas. So, when modifying the time variable, I need to place a comma between the previous players and the new one. We can do this with variable expansion as well:

time=${time:+${time},}${jogador}

The expansion ${var:+STRING} becomes STRING if and only if var is set and non-empty. STRING can be any valid string, and may even contain a variable expansion, which is why I asked for the value of time followed by a comma. If var is unset or empty, ${var:+STRING} expands to an empty string.
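To see the comma logic in isolation, here is a minimal sketch that appends two players in a row. The player names are sample values from the question, and time starts out empty on purpose.

```bash
# Sketch: ${var:+STRING} expands to STRING only when var is set and non-empty.
time=""
jogador="JOGADOR_4"
time=${time:+${time},}${jogador}   # time was empty: no comma is added
echo "${time}"                     # prints JOGADOR_4

jogador="JOGADOR_5"
time=${time:+${time},}${jogador}   # time is filled: old value, comma, new player
echo "${time}"                     # prints JOGADOR_4,JOGADOR_5
```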

We can’t forget to send the information back to the associative array, so after updating time, we need to update times:

times[Palmeiras]="${time}"

Okay, we already know how to read a line (step 1) and, once we know who the player is and which team it is, how to add the player to the team (step 3). We also initialized our values properly in the initialize block. So only step 2 is missing.

To find the <TIME> of a line, we can use variable expansion on LINHA. We know that <TIME> is everything before the first semicolon. So we can ask the expansion to remove everything from the first semicolon onwards and keep only the start of the variable:

nome_time=${LINHA%%;*}

The expansion ${var%%ENDPATTERN} is a greedy expansion that removes from var the longest ending that matches ENDPATTERN. Because the expansion is greedy, with the ENDPATTERN used above, ;*, I can give LINHA the value PALMEIRAS;Fulano ponto-e-virgula;Sicranoso and the expansion will return only PALMEIRAS.
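A small sketch comparing the greedy form with its non-greedy sibling (a single %) on that same sample value may help. The variable names curto and longo are only illustrative.

```bash
# Sketch: % removes the shortest matching suffix, %% the longest.
LINHA="PALMEIRAS;Fulano ponto-e-virgula;Sicranoso"
curto=${LINHA%;*}     # shortest suffix matching ;* is ";Sicranoso"
longo=${LINHA%%;*}    # longest suffix matching ;* starts at the first semicolon
echo "${curto}"       # prints PALMEIRAS;Fulano ponto-e-virgula
echo "${longo}"       # prints PALMEIRAS
```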

To identify the player, I use a non-greedy variable expansion. The goal is to delete the beginning of the line up to and including the first semicolon:

jogador=${LINHA#*;}

The expansion ${var#BEGINPATTERN} is a non-greedy expansion that removes from var the shortest prefix that matches BEGINPATTERN. Because the expansion is not greedy, with the BEGINPATTERN used above, *;, I can give LINHA the value PALMEIRAS;Fulano ponto-e-virgula;Sicranoso and the expansion will return only Fulano ponto-e-virgula;Sicranoso.
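The same kind of comparison works for prefixes. In this sketch the doubled form (##) is shown only for contrast, and the names curto and longo are again just illustrative.

```bash
# Sketch: # removes the shortest matching prefix, ## the longest.
LINHA="PALMEIRAS;Fulano ponto-e-virgula;Sicranoso"
curto=${LINHA#*;}     # shortest prefix matching *; is "PALMEIRAS;"
longo=${LINHA##*;}    # longest prefix matching *; runs up to the last semicolon
echo "${curto}"       # prints Fulano ponto-e-virgula;Sicranoso
echo "${longo}"       # prints Sicranoso
```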

Done: with these two expansions, step 2 is complete. So, putting all the pieces together, we have:

# initialize block
declare -A times

# step 1: read the line (-r keeps backslashes, IFS= keeps surrounding spaces)
while IFS= read -r LINHA; do
    # step 2: identify the components of the line
    nome_time=${LINHA%%;*}
    jogador=${LINHA#*;}

    # step 3: update the team's contents
    time=${times[${nome_time}]}
    time=${time:+${time},}${jogador}
    times[${nome_time}]="${time}"
done

Okay, we’ve read our teams from standard input; now all that’s left is to write them out =)

Since this is a shell script, let’s write to standard output. If you need to send that output to a file, use output redirection: ./cria_times.sh > output.txt. You can combine output redirection with input redirection, no problem.

To loop over the keys of the associative array times, use the following variable expansion:

for time in "${!times[@]}"; do
    do_something
done

Since the output format is <TIME> ; <JOGADORES>, the do_something here is just the print:

for time in "${!times[@]}"; do
    echo "${time};${times[${time}]}"
done

Putting reading and printing together into one complete script, we have:

#!/bin/bash

# initialize block
declare -A times

# step 1: read the line (-r keeps backslashes, IFS= keeps surrounding spaces)
while IFS= read -r LINHA; do
    # step 2: identify the components of the line
    nome_time=${LINHA%%;*}
    jogador=${LINHA#*;}

    # step 3: update the team's contents
    time=${times[${nome_time}]}
    time=${time:+${time},}${jogador}
    times[${nome_time}]="${time}"
done

# print one line per team: TEAM;PLAYER,PLAYER,...
for time in "${!times[@]}"; do
    echo "${time};${times[${time}]}"
done
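One caveat worth keeping in mind: bash does not guarantee any particular order when iterating over the keys of an associative array, so the teams may come out in a different order from the input. If first-appearance order matters, a possible variation is sketched below. It wraps the same logic in a function (agrupa_times, a hypothetical name) and keeps an extra indexed array (ordem, also hypothetical) that records each team the first time it is seen. It also skips blank lines and accepts a last line without a trailing newline, which are assumptions about the input rather than requirements stated in the question.

```bash
#!/bin/bash
# Sketch: same grouping logic, wrapped in a function that reads standard
# input and prints teams in the order they first appear.
agrupa_times() {
    local -A times          # team name -> comma-separated players
    local -a ordem=()       # hypothetical helper: team names as first seen
    local LINHA nome_time jogador time

    while IFS= read -r LINHA || [ -n "${LINHA}" ]; do
        [ -z "${LINHA}" ] && continue    # skip blank lines
        nome_time=${LINHA%%;*}
        jogador=${LINHA#*;}

        # ${array[key]+x} expands to x only when that key is already set
        if [ -z "${times[${nome_time}]+x}" ]; then
            ordem+=("${nome_time}")
        fi

        time=${times[${nome_time}]}
        times[${nome_time}]="${time:+${time},}${jogador}"
    done

    for nome_time in "${ordem[@]}"; do
        echo "${nome_time};${times[${nome_time}]}"
    done
}

# Example run with part of the sample data from the question
agrupa_times <<'EOF'
VASCO;JOGADOR_1
VASCO;JOGADOR_2
PALMEIRAS;JOGADOR_4
PALMEIRAS;JOGADOR_5
EOF
```

Inside a script that should still read from a file, the function can be called as agrupa_times < input.txt.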

UPDATE

I forgot to include my references =]

The source from which I learned the most about bash and shell scripting in general was the Swiss Army Knife by Aurelio Verde.

Another source I consult every now and then is the official GNU documentation, but I only started reading it directly after getting used to the shell’s quirks, having read the Swiss Army Knife enough.

  • Jefferson, good evening! Your answer is practically a class, thank you very much! I followed your steps as well and arrived at the expected result. Thank you for the time devoted to this answer and for the references! I will keep studying bash; it has already saved me from some problems and from daily jobs that would otherwise be manual. Hug!

  • @hmm bash is very expressive, not to mention that it can turn into a character soup. Code alone, I think, wouldn’t be rewarding enough

  • Man, I took a closer look at the Swiss Army Knife, awesome! I’ll even buy the book "Shell Script Professional - Aurélio Marinho Jargas". Thanks again! Big hug!

  • @hmm great book, I recommend it =]

  • The output in the text file is strange, can you take a look and help me? http://bit.ly/2q4U7mQ I wanted one team with its respective players per line

  • After normalizing the line breaks in the input file (to \n only), the problem was solved. Thank you!

  • How did you handle the line endings? I think it’s fair to put that in the answer, to make it more complete

  • I started handling each line as it is read, converting it to the Linux standard when it is in the Windows standard, with the command "LINE=sed 's/\r$//'"
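For reference, one way to do this kind of per-line normalization without calling sed is a suffix-removal expansion, in the same spirit as the expansions used in the answer. This is only a sketch of the idea, not necessarily what was actually run; it relies on bash’s $'\r' quoting for a carriage return, and the sample line is made up from the question’s data.

```bash
#!/bin/bash
# Sketch: strip one trailing carriage return (Windows line ending) from a line.
LINHA=$'PALMEIRAS;JOGADOR_4\r'    # simulated line read from a CRLF file
LINHA=${LINHA%$'\r'}              # removes the final \r if present, else no change
nome_time=${LINHA%%;*}
jogador=${LINHA#*;}
echo "${nome_time}|${jogador}"    # prints PALMEIRAS|JOGADOR_4
```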
