Read txt line and include ";"

Question

I have a txt file whose lines have the following data:

0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes

As it stands, I cannot import the data into Excel or MySQL, because the fields do not have the same number of characters from one line to the next.

Using Delphi, Lazarus, or Java, how do I read each line and insert the character ";" at the right spaces so that it looks as follows:

0 ;02 ;020 ;0201 ;020110 ;Z ;DEMONSTRAR COMPETÊNCIAS PESSOAIS ;1 ;Primar pela correção de atitudes

Each item corresponds to a table field.

The problem is that not every space should get a ";". Otherwise you could use a regular expression to find the spaces and Java's replace method to swap them for ";".

– R.Santos

2017/01/27 at 17:07
Do all lines in this file have the same format?

– Wilker

2017/01/27 at 17:07
Yes, they have the same format; only the length of the sentence changes.

– Henqsan

2017/01/27 at 17:14
Will the field with the value Z always be a single character on every line?

– Anthony Accioly

2017/01/27 at 17:16
Do you know Kedit? It is easy to do this with it.

– Reginaldo Rigo

2017/01/27 at 17:18
Anthony, yes, always a single character. The number after the sentence can be up to two characters long.

– Henqsan

2017/01/27 at 17:18
Reginaldo, I don't know Kedit. What is it?

– Henqsan

2017/01/27 at 17:19
A powerful text editor with many features.

– Reginaldo Rigo

2017/01/27 at 17:20
How many lines does this file have?

– Reginaldo Rigo

2017/01/27 at 17:21
Reginaldo, over 161,000 lines.

– Henqsan

2017/01/27 at 17:24
No problem. It's amazing. I've worked with files of over 2 million lines in it.

– Reginaldo Rigo

2017/01/27 at 17:26
Hi Henqsan, while I believe my answer should solve the problem for this particular question, my recommendation is that you always include a Minimal, Complete, and Verifiable Example in your questions. Some initial code with all its dependencies (e.g., a sample file), however simple, helps those who are trying to answer and considerably increases your chances of getting a proper answer.

– Anthony Accioly

2017/01/27 at 22:43
@Henqsan Here we do not write "solved" in the question. If an answer really helped you, mark it as accepted. If you reached the solution on your own, post it as an answer. That way the content stays organized and is easier to find for other people with similar problems.

– user28595

2017/01/30 at 20:39
Thanks for the tip, Diego.

– Henqsan

2017/01/30 at 20:41

3 answers

I wrote a Delphi routine specifically for this file.

Uses System.Character;


procedure TForm1.Button1Click(Sender: TObject);
Var
   str :  string;
   linhacsv : string;
   oldFile, NewFile : TextFile;
   n : Integer;
begin
  AssignFile( newFile, 'c:\pasta\CBO2002 - PerfilOcupacional.csv');
  Rewrite( newFile );

  AssignFile( oldFile, 'c:\pasta\CBO2002 - PerfilOcupacional.txt');
  Reset( oldFile );

  readln( oldFile, str ); // skip the header
  readln( oldFile, str ); // and the line after it

  while not Eof( oldFile ) do
  begin
    linhacsv := '';
    readln( oldFile, str );
    for n := 1 to Length( str ) do
    begin
      // A space becomes ';' when a digit sits on either side of it, or when
      // it is at column 23 (the gap between the letter and the sentence).
      // The bounds checks avoid reading before or past the string.
      if ( str[n] = ' ' ) and
         ( ( ( n > 1 ) and IsNumber( str[n-1] ) ) or
           ( ( n < Length( str ) ) and IsNumber( str[n+1] ) ) or
           ( n = 23 ) ) then
        linhacsv := linhacsv + ';'
      else
        linhacsv := linhacsv + str[n];
    end;
    writeln( newFile, linhacsv );
  end;
  CloseFile( newFile );
  CloseFile( oldFile );

end;
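
For readers without Delphi at hand, the rule this routine applies can be sketched in Python. This is only an illustration and not part of the original answer: the function name `spaces_to_semicolons` and the `fixed_col` parameter are made up here, and `str.isdigit` stands in for Delphi's `IsNumber`.

```python
def spaces_to_semicolons(line, fixed_col=23):
    """Replace a space with ';' when it touches a digit or sits at fixed_col (1-based)."""
    out = []
    for i, ch in enumerate(line):
        if ch == ' ':
            prev_is_digit = i > 0 and line[i - 1].isdigit()
            next_is_digit = i + 1 < len(line) and line[i + 1].isdigit()
            if prev_is_digit or next_is_digit or i + 1 == fixed_col:
                out.append(';')
                continue
        out.append(ch)
    return ''.join(out)


# Sample line taken from the question.
sample = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 "
          "Primar pela correção de atitudes")
print(spaces_to_semicolons(sample))
```

Note that the rule treats any digit next to a space as a field boundary, so a sentence that itself contains a number would be split at that point.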

Dear Reginaldo Rigo, thank you for your answer! I really appreciate your help. Your routine worked perfectly for the need I described. Thanks also to everyone else who helped solve this problem. Without your collaboration, the spread of knowledge and information would not be as far-reaching as it is today.

– Henqsan

2017/01/30 at 16:40

by Pagotti • **3,042** points · Answer 1 · 2017-01-27T18:15:15+00:00

Depends on the Data Pattern

You can use a regular expression based on the sample data you posted, but it is hard to know whether it will work for every line, because the data does not follow a single pattern.

Text files like this are normally either delimited by a character or laid out with a fixed number of characters per column. Your file follows neither convention.

Here is a Java example using an expression that works for your sample line:

String REGEX = "\\s([\\dZ]+)\\s";
String INPUT = "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes";
String REPLACE = " ;$1 ;";

Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT); 
INPUT = m.replaceAll(REPLACE);

System.out.println(INPUT);

You need these Java imports:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Output:

0 ;02 ;020 ;0201 ;020110 ;Z ;DEMONSTRAR COMPETÊNCIAS PESSOAIS ;1 ;Primar pela correção de atitudes

You can use RegExr to test the expression against more sample lines and adapt it to your needs. Notepad++ also supports find and replace with regular expressions, if you only need to do this once for the file.
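
As a rough way to check more lines at once, the same expression can be run over a list of samples and each result checked for the expected eight separators. This is a sketch in Python, not the Java from above; `convert` is a helper name introduced here, and the second sample line is invented to show what happens when the sixth field is a letter other than Z.

```python
import re

# Same expression as in the Java example above.
PATTERN = re.compile(r"\s([\dZ]+)\s")


def convert(line):
    """Wrap every whitespace-delimited run of digits or 'Z' in ' ;' separators."""
    return PATTERN.sub(r" ;\1 ;", line)


samples = [
    # Line from the question.
    "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes",
    # Made-up line: the sixth field is 'A' instead of 'Z'.
    "0 02 020 0201 020110 A ATENDER CLIENTES 12 Cumprir horários",
]

for line in samples:
    converted = convert(line)
    # Nine fields need eight separators; anything else deserves a closer look.
    status = "ok" if converted.count(";") == 8 else "CHECK"
    print(status, converted)
```

The invented line is flagged because the character class only accepts digits and the letter Z, so the separators around `A` are never inserted.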

A converter in Python

Anthony has posted an answer that parses the file in Java, and I believe it is the best answer to the problem. Since I had already downloaded the file and suggested in a comment that you split each line into two parts, I decided to leave a Python example that does what I suggested.

import re

line_count = 1

with open('C:\\temp\\CBO2002 - PerfilOcupacional.csv', 'w') as w:
    with open('C:\\temp\\CBO2002 - PerfilOcupacional.txt') as r:
        for line in r:
            if (line_count == 1):
                # parse the header
                header = re.sub(r"([\w_]+)\s*", r"\1;", line)
                w.write(header + '\n')

            elif (line_count > 2):
                # line 2 is discarded; each data line is split into
                # two groups that each follow a well-defined pattern
                side_a = line[0:22]
                side_b = line[23:]

                # parse each group
                parse_side_a = re.sub(r"(\d)\s(\w)", r"\1;\2", side_a)
                parse_side_b = re.sub(r"([^\d]+)\s(\d+)\s(.+)", r"\1;\2;\3", side_b)

                # join the two groups (the CRLF is already in group B)
                line_out = parse_side_a + ';' + parse_side_b
                w.write(line_out)

            line_count += 1
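
To see what the two-group split produces without touching any files, the same slicing and substitutions can be applied in memory to the sample line from the question. `convert_line` is a helper name introduced for this sketch, and it assumes, as the converter above does, that the first six fields always occupy the first 22 characters.

```python
import re


def convert_line(line):
    """Split a data line at column 22 and insert ';' between the fields of each half."""
    side_a = line[0:22]   # the six fixed-width fields, up to the single letter
    side_b = line[23:]    # sentence, number, description (and the line break)
    parsed_a = re.sub(r"(\d)\s(\w)", r"\1;\2", side_a)
    parsed_b = re.sub(r"([^\d]+)\s(\d+)\s(.+)", r"\1;\2;\3", side_b)
    return parsed_a + ';' + parsed_b


sample = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 "
          "Primar pela correção de atitudes\n")
print(convert_line(sample), end='')
```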

by Anthony Accioly • **20,516** points · Answer 2 · 2017-01-27T22:26:46+00:00

Building on the regular-expression replace idea suggested in Pagotti's answer, here is an example that processes the complete file, line by line, using a regular expression specific to this format. Java 8 or later is required to compile it:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    public static void main(String[] args) {
        final Pattern patternLinha =
                Pattern.compile("^(\\d) (\\d{2}) (\\d{3}) (\\d{4}) (\\d{6}) ([A-Z]) (.+?) (\\d{1,2}) (.+)$");

        final Path entrada = Paths.get(args[0]);
        final Path saida = Paths.get(args[1]);
        final Charset cs = Charset.forName(args[2]);
        final String quebraDeLinha = args[3].replace("\\r", "\r").replace("\\n", "\n");

        // try-with-resources closes both the writer and the line stream.
        try (BufferedWriter bw = Files.newBufferedWriter(saida, cs);
             java.util.stream.Stream<String> linhas = Files.lines(entrada, cs)) {
            linhas.map(linha -> {
                final Matcher matcher = patternLinha.matcher(linha);
                if (matcher.matches()) {
                    return matcher.replaceFirst("$1 ;$2 ;$3 ;$4 ;$5 ;$6 ;$7 ;$8 ;$9");
                } else {
                    throw new RuntimeException("Invalid format for line: " + linha);
                }

            }).forEach(linhaTransformada -> {
                try {
                    bw.write(linhaTransformada);
                    bw.write(quebraDeLinha);
                } catch (IOException e) {
                    System.err.println("Error writing line to output file: " + saida.toAbsolutePath());
                    e.printStackTrace();
                }
            });
        } catch (IOException e) {
            System.err.println("Error reading from input file: " + entrada.toAbsolutePath());
            e.printStackTrace();
        }
    }
}

Example of use:

java Parser arquivoEntrada.txt arquivoSaida.txt ISO-8859-1 "\r\n"

Since the question includes neither code nor a sample file, I cannot be sure the answer will work for all the data. That would require knowing the formal structure of the content, as well as particulars of the file such as its charset and line-break type. That said, I did my best to make everything easy to parameterize: fine adjustments are possible by changing the Pattern and the command-line arguments.
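
One way to try out adjustments to the pattern before running the whole file is to apply it to individual lines and look at the captured fields. The following is a sketch in Python rather than Java; `split_fields` is a name introduced here, and it returns `None` for a line that does not fit instead of raising an exception, which is a deliberate difference from the program above.

```python
import re

# Same structure as patternLinha: nine space-separated fields.
LINE_PATTERN = re.compile(
    r"(\d) (\d{2}) (\d{3}) (\d{4}) (\d{6}) ([A-Z]) (.+?) (\d{1,2}) (.+)"
)


def split_fields(line):
    """Return the nine fields of a line, or None when it does not fit the pattern."""
    match = LINE_PATTERN.fullmatch(line.rstrip("\r\n"))
    return list(match.groups()) if match else None


sample = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 "
          "Primar pela correção de atitudes")
print(split_fields(sample))
print(split_fields("a line that is not in the expected format"))
```

Because the seventh group is lazy, a sentence that itself contains a one- or two-digit number surrounded by spaces would be cut at that number, so lines like that are worth checking by hand.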