Depends on the Data Pattern
You can use a regular expression based on the sample data you posted, but it is hard to know whether it will work for every line, because the data does not follow a clear pattern.
Files are usually either delimited by a character or laid out with a fixed number of characters per column. Your file follows neither convention.
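For contrast, here is a minimal Python sketch of how those two conventional layouts are usually read. The sample records, the column widths, and the function names are hypothetical and only illustrate the idea; they are not taken from your file.

```python
def parse_delimited(line, delimiter=";"):
    """Split a record whose columns are separated by a single character."""
    return line.rstrip("\n").split(delimiter)


def parse_fixed_width(line, widths):
    """Slice a record whose columns each occupy a fixed number of characters."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Hypothetical records, one per layout.
print(parse_delimited("020110;Z;1"))
print(parse_fixed_width("020110Z 1", [6, 1, 2]))
```

Neither function fits a line whose columns are separated by spaces while one column also contains spaces, which is why a regular expression is needed here.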
I made an example in Java using an expression that works for your example line:
String REGEX = "\\s([\\dZ]+)\\s";
String INPUT = "0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS 1 Primar pela correção de atitudes";
String REPLACE = " ;$1 ;";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
INPUT = m.replaceAll(REPLACE);
System.out.println(INPUT);
You need these Java imports:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Output:
0 ;02 ;020 ;0201 ;020110 ;Z ;DEMONSTRAR COMPETÊNCIAS PESSOAIS ;1 ;Primar pela correção de atitudes
You can use RegExr to test more sample lines and adapt the expression to your needs. Notepad++ also supports find and replace with regular expressions, if you only need to do this operation once for the file.
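To check the expression against more lines before converting the whole file, a quick Python sketch along these lines may help. It ports the Java pattern above as-is; the function name `count_fields` and the second sample line are hypothetical. Any line that yields a field count other than 9 shows where the expression breaks, for instance when the description itself contains a standalone number.

```python
import re

# Same pattern and replacement as the Java example, in Python syntax.
PATTERN = re.compile(r"\s([\dZ]+)\s")
REPLACEMENT = r" ;\1 ;"


def count_fields(line):
    """Apply the replacement and count the resulting ';'-separated fields."""
    converted = PATTERN.sub(REPLACEMENT, line)
    return len(converted.split(";"))


# The example line used in the Java code.
original = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS "
            "1 Primar pela correção de atitudes")
# Hypothetical line whose description contains a standalone number.
tricky = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS "
          "1 Atender 2 clientes por dia")

for sample in (original, tricky):
    print(count_fields(sample), "fields")
```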
A converter in Python
Anthony has posted an answer that parses the file in Java, and I believe it is the best answer to the problem. Since I had already downloaded the file and suggested in a comment that you split the data into two parts, I decided to leave a Python example of that approach.
import re
line_count = 1
with open('C:\\temp\\CBO2002 - PerfilOcupacional.csv', 'w') as w:
    with open('C:\\temp\\CBO2002 - PerfilOcupacional.txt') as r:
        for line in r:
            if line_count == 1:
                # parse the header
                header = re.sub(r"([\w_]+)\s*", r"\1;", line)
                w.write(header + '\n')
            elif line_count > 2:
                # discard line 2 and
                # split each line into two groups that have a well-defined pattern
                side_a = line[0:22]
                side_b = line[23:]
                # parse each group
                parse_side_a = re.sub(r"(\d)\s([\d|\w])", r"\1;\2", side_a)
                parse_side_b = re.sub(r"([^\d]+)\s(\d+)\s(.+)", r"\1;\2;\3", side_b)
                # join the two groups (the line ending is already in group B)
                line_out = parse_side_a + ';' + parse_side_b
                w.write(line_out)
            line_count += 1
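The per-line logic can be tried in isolation before running it over the whole file. The sketch below wraps the same two substitutions in a function (the name `convert_line` is hypothetical) and applies it to the example line from the Java code, assuming the first 22 characters hold the code columns as they do in that line.

```python
import re


def convert_line(line):
    """Convert one data line using the same slices and substitutions as the script."""
    side_a = line[0:22]   # code columns, e.g. "0 02 020 0201 020110 Z"
    side_b = line[23:]    # sentence, number, and description
    parse_side_a = re.sub(r"(\d)\s([\d|\w])", r"\1;\2", side_a)
    parse_side_b = re.sub(r"([^\d]+)\s(\d+)\s(.+)", r"\1;\2;\3", side_b)
    return parse_side_a + ";" + parse_side_b


sample = ("0 02 020 0201 020110 Z DEMONSTRAR COMPETÊNCIAS PESSOAIS "
          "1 Primar pela correção de atitudes\n")
print(convert_line(sample), end="")
```

If the code columns in some lines are wider or narrower than 22 characters, the fixed slices would cut in the wrong place, so it is worth checking a few lines from different parts of the file.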
The problem is that you would not add ";" at every space; otherwise you could use a regular expression to identify the spaces and Java's replace method to replace them with ";".
– R.Santos
Do all lines in this file have the same format?
– Wilker
Yes, they have the same format; only the length of the sentence changes.
– Henqsan
Will the field with the value Z always be a single character on every line?
– Anthony Accioly
Do you know Kedit? It is easy to do this with it.
– Reginaldo Rigo
Anthony, yes, always a single character. The number after the sentence can be up to two characters long.
– Henqsan
Reginaldo, I don't know Kedit. What is it?
– Henqsan
A powerful text editor with many features.
– Reginaldo Rigo
How many lines does this file have?
– Reginaldo Rigo
Reginaldo, over 161,000 lines.
– Henqsan
No problem. It is amazing. I have worked with files of over 2 million lines.
– Reginaldo Rigo
Hi Henqsan, while I believe my answer should solve the problem for this particular question, my recommendation is that you always include a Minimal, Complete, and Verifiable Example in your questions. Initial code with all its dependencies (e.g., an example file), however simple it may be, helps those who are trying to answer and considerably increases your chances of getting a proper response.
– Anthony Accioly
@Henqsan Here we do not write "solved" on the question. If an answer really helped you, mark it as accepted. If you reached the solution on your own, post it as an answer. That way the content is better organized and easier for other people with similar problems to find in the future.
– user28595
Thanks for the tip, Diego.
– Henqsan