Read lines from a txt file and insert ";"

2

I have a txt file whose lines have the following data:

0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes

As it stands, I cannot import the data into Excel or MySQL, because the text fields vary in length from one line of the file to the next.

Using Delphi, Lazarus or Java, how do I read each line and insert the character ";" at the right spaces so that it looks like this:

0 ;02 ;020 ;0201 ;020110 ;Z ;DEMONSTRAR COMPETÊNCIAS PESSOAIS ;1 ;Primar pela correção de atitudes

Each item corresponds to a table field.

  • The problem is that you would not add ; at every space. Otherwise you could use a regular expression to identify the spaces and Java's replace method to swap them for ;

  • Do all lines in this file have the same format?

  • Yes, they have the same format; only the sentence changes in length.

  • Will the field with the value Z always be a single character on every line?

  • Do you know Kedit? This is easy to solve with it.

  • Anthony, yes, always a single character. The number after the sentence can be up to two characters long.

  • Reginaldo, I don’t know Kedit. What is it?

  • A powerful text editor with many features.

  • How many lines does this file have?

  • Reginaldo, over 161,000 lines.

  • No problem. It is amazing. I’ve worked with files of over 2 million lines.

  • Hi Henqsan, while I believe my answer should solve the problem for this particular question, my recommendation is that you always include a Minimal, Complete, and Verifiable Example in your questions. Initial code with all its dependencies (e.g., a sample file), however simple, helps those who are trying to answer and considerably increases your chances of getting a proper answer.

  • @Henqsan Here we do not write "solved" on the question. If an answer really helped you, mark it as accepted. If you reached the solution on your own, post the solution as an answer. That way the content stays organized and is easier for other people with similar problems to find in the future.

  • Thanks for the tip Diego.


3 answers

6

Depends on the Data Pattern

You can use a regular expression based on the sample data you posted, but it is hard to know whether it will work for every line, because the data does not follow a clear pattern.

Files are normally either delimited by a character or laid out with a fixed number of characters per column. Your file follows neither convention.
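
To make the distinction concrete, here is a minimal Python sketch of the two conventional layouts. The sample records, the `WIDTHS` list and the helper names are made up for illustration; they are not taken from the CBO file.

```python
# Hypothetical records illustrating the two conventional layouts.
DELIMITED = "020110;Z;DEMONSTRAR COMPETÊNCIAS PESSOAIS"
FIXED_WIDTH = "020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS"

# Assumed column widths for the fixed-width record (code, letter);
# whatever is left after them is treated as the last field.
WIDTHS = [7, 2]


def parse_delimited(record, sep=";"):
    """Split on the separator character; field lengths may vary freely."""
    return record.split(sep)


def parse_fixed_width(record, widths):
    """Cut the record at fixed offsets and strip the padding spaces."""
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width].strip())
        start += width
    fields.append(record[start:].strip())
    return fields


print(parse_delimited(DELIMITED))
print(parse_fixed_width(FIXED_WIDTH, WIDTHS))
```

Both calls yield the same three fields. The line in the question cannot be handled by either helper alone: it has no separator character, and its text fields vary in length.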

Here is a Java example using an expression that works for your sample line:

String REGEX = "\\s([\\dZ]+)\\s";
String INPUT = "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes";
String REPLACE = " ;$1 ;";

Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT); 
INPUT = m.replaceAll(REPLACE);

System.out.println(INPUT);

You need these Java imports:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Result:

0 ;02 ;020 ;0201 ;020110 ;Z ;DEMONSTRAR COMPETÊNCIAS PESSOAIS ;1 ;Primar pela correção de atitudes

You can use RegExr to test more sample lines and adapt the expression to your needs. Notepad++ also supports find and replace with regular expressions, if you only need to do this operation once for the file.
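
Before running the replacement over the whole file, it may help to check how many fields each transformed line ends up with. The sketch below ports the expression above to Python and flags lines that do not yield nine fields. `EXPECTED_FIELDS`, the helper names and the second sample line are assumptions made for illustration.

```python
import re

# Same expression as the Java example: a run of digits or the letter Z
# surrounded by whitespace.
FIELD = re.compile(r"\s([\dZ]+)\s")
EXPECTED_FIELDS = 9  # assumption: every record has nine fields


def insert_separators(line):
    """Rewrite each match as ' ;match ;', like the Java replaceAll call."""
    return FIELD.sub(r" ;\1 ;", line)


def suspicious_lines(lines):
    """Return (line_number, field_count) for lines that look wrong."""
    report = []
    for number, line in enumerate(lines, start=1):
        count = insert_separators(line.rstrip("\r\n")).count(";") + 1
        if count != EXPECTED_FIELDS:
            report.append((number, count))
    return report


sample = [
    "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes",
    # Hypothetical line whose single-letter field is not Z.
    "0 02 020 0201 020110 A DEMONSTRAR COMPETÊNCIAS PESSOAIS 2 Agir com ética",
]
print(suspicious_lines(sample))
```

Any line number reported here is a line for which the expression needs adapting, for example by widening the character class.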


A converter in Python

Anthony has posted an answer that parses the file in Java, and I believe it is the best answer to the problem. Since I had downloaded the file and suggested in a comment that you split each line into two parts, I decided to leave a Python example that does what I suggested.

import re

line_count = 1

with open('C:\\temp\\CBO2002 - PerfilOcupacional.csv', 'w') as w:
    with open('C:\\temp\\CBO2002 - PerfilOcupacional.txt') as r:
        for line in r:
            if line_count == 1:
                # parse the header: put ';' after each column name
                header = re.sub(r"(\w+)\s*", r"\1;", line)
                w.write(header + '\n')

            elif line_count > 2:
                # line 2 is discarded; every other line is split into
                # two groups that each have a well-defined pattern
                side_a = line[0:22]
                side_b = line[23:]

                # parse each group
                parse_side_a = re.sub(r"(\d)\s(\w)", r"\1;\2", side_a)
                parse_side_b = re.sub(r"([^\d]+)\s(\d+)\s(.+)", r"\1;\2;\3", side_b)

                # join the two groups (the CRLF is already in group B)
                line_out = parse_side_a + ';' + parse_side_b
                w.write(line_out)

            line_count += 1
  • Hello Pagotti, thank you for the help, which was very valuable. Your tip worked perfectly for the line I posted. However, I tested it with other lines and the rule does not work, because on each line both the "Z" character field and the words that follow change value. So a different expression would be needed for each one.

  • This is a text file published on the Ministry of Labor website with the Brazilian Classification of Occupations (CBO), more precisely the file that contains the Occupational Profile. The file can be downloaded from: http://www.mtecbo.gov.br/cbosite/pages/downloads.jsf;jsessionid=2X82mKTtfL1xnjqnhntN1sQc.slave23:mte-cbo boundless. I need to create a database with the necessary tables and populate them with this data, which is proving difficult.

  • 1

    I looked at the file, and frankly whoever published that information does not deserve a job in IT. Every file in the ZIP except the occupational profile can be imported with the column rule, because they are all "code - description". For this one I recommend cutting each line in two: the first part, up to column 23, has a pattern; the rest contains two texts and a number, which you can then parse with a regular expression.

  • Exactly, Pagotti, and the worst part is that I contacted them to ask for the file in CSV format and so far have received no answer.

2

Building on the idea of replacing with regular expressions suggested in Pagotti's answer, here is an example that processes the complete file, line by line, using a specific regular expression. Java 8 is required to compile it:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class Parser {
    public static void main(String[] args) {
        // One capture group per field: five numeric codes, one letter,
        // a description, a 1-2 digit number and the final text.
        final Pattern patternLinha =
                Pattern.compile("^(\\d) (\\d{2}) (\\d{3}) (\\d{4}) (\\d{6}) ([A-Z]) (.+?) (\\d{1,2}) (.+)$");

        final Path entrada = Paths.get(args[0]);
        final Path saida = Paths.get(args[1]);
        final Charset cs = Charset.forName(args[2]);
        // Turn the literal text \r and \n from the command line into real line breaks.
        final String quebraDeLinha = args[3].replace("\\r", "\r").replace("\\n", "\n");

        // try-with-resources closes both the writer and the line stream.
        try (BufferedWriter bw = Files.newBufferedWriter(saida, cs);
             Stream<String> linhas = Files.lines(entrada, cs)) {
            linhas.map(linha -> {
                final Matcher matcher = patternLinha.matcher(linha);
                if (matcher.matches()) {
                    return matcher.replaceFirst("$1 ;$2 ;$3 ;$4 ;$5 ;$6 ;$7 ;$8 ;$9");
                } else {
                    throw new RuntimeException("Invalid format for line: " + linha);
                }
            }).forEach(linhaTransformada -> {
                try {
                    bw.write(linhaTransformada);
                    bw.write(quebraDeLinha);
                } catch (IOException e) {
                    System.err.println("Error writing line to output file: " + saida.toAbsolutePath());
                    e.printStackTrace();
                }
            });
        } catch (IOException e) {
            System.err.println("Error reading from input file: " + entrada.toAbsolutePath());
            e.printStackTrace();
        }
    }
}

Example of use:

java Parser arquivoEntrada.txt arquivoSaida.txt ISO-8859-1 "\r\n"

Since the question contains neither code nor a sample file, I cannot be sure the answer will work for all the data. For that, it would be necessary to know the formal structure of the content, as well as particulars of the file such as charset and line-break type. That said, I did my best to make everything easy to parameterize: by changing the Pattern and the command-line arguments you can make fine adjustments.
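
As a rough illustration of that kind of adjustment, the Python sketch below uses the same full-line pattern but collects the lines that do not match instead of aborting on the first one, so the pattern can be refined against the rejects. The `convert` helper, its return shape and the second sample line are assumptions for illustration, not part of the program above.

```python
import re

# Same structure as the Java pattern: five numeric codes, one uppercase
# letter, a description, a 1-2 digit number and the final text.
LINE = re.compile(
    r"(\d) (\d{2}) (\d{3}) (\d{4}) (\d{6}) ([A-Z]) (.+?) (\d{1,2}) (.+)"
)


def convert(lines):
    """Return (converted, rejected); rejected holds (line_number, text)."""
    converted, rejected = [], []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        match = LINE.fullmatch(text)
        if match:
            # Join the nine captured fields with the ' ;' separator.
            converted.append(" ;".join(match.groups()))
        else:
            rejected.append((number, text))
    return converted, rejected


sample = [
    "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes",
    "this line does not follow the layout",  # hypothetical bad line
]
converted, rejected = convert(sample)
print(converted)
print(rejected)
```

Inspecting `rejected` after a run over the real file shows which lines, if any, need the pattern to be loosened.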

2


I wrote a Delphi routine specifically for this file.

Uses System.Character;


procedure TForm1.Button1Click(Sender: TObject);
var
  str: string;
  linhacsv: string;
  oldFile, newFile: TextFile;
  n: Integer;
  prevIsNum, nextIsNum: Boolean;
begin
  AssignFile(newFile, 'c:\pasta\CBO2002 - PerfilOcupacional.csv');
  Rewrite(newFile);

  AssignFile(oldFile, 'c:\pasta\CBO2002 - PerfilOcupacional.txt');
  Reset(oldFile);

  ReadLn(oldFile, str); // skip the header
  ReadLn(oldFile, str); // and the line after it

  while not Eof(oldFile) do
  begin
    linhacsv := '';
    ReadLn(oldFile, str);
    for n := 1 to Length(str) do
    begin
      if str[n] = ' ' then
      begin
        // Guard the neighbours so a leading or trailing space cannot index out of range.
        prevIsNum := (n > 1) and IsNumber(str[n - 1]);
        nextIsNum := (n < Length(str)) and IsNumber(str[n + 1]);
        // A space separates fields when it touches a digit, or when it is the
        // space at column 23 (between the single letter and the description).
        if prevIsNum or nextIsNum or (n = 23) then
          linhacsv := linhacsv + ';'
        else
          linhacsv := linhacsv + str[n];
      end
      else
        linhacsv := linhacsv + str[n];
    end;
    WriteLn(newFile, linhacsv);
  end;
  CloseFile(newFile);
  CloseFile(oldFile);
end;
  • Dear Reginaldo Rigo, thank you for your answer! I really appreciate your help. Your routine worked perfectly for what I needed. Thanks also to all the other professionals who helped solve this problem. Without your collaboration, the spread of knowledge and information would not be what it is today.
