What is the best way to parse a very large JSON file in Java?

I have a very large JSON file: it is 5 GB in size and has 652,339 lines. I was thinking of using the Gson library in Java.

I would like to know the best way to parse this file, since I could not even work out the JSON structure. Here is an example of one line of the file:

{"control": {"lang": {"lang": "pt", "d": 1395183935882, "v": 5}, "last": "UPDATE", "read": {"d": 1395183767992, "v": 3}, "update": {"d": 1395308552817, "v": 2}, "rule": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061], "d": 1395305209944, "v": 3}, "entities": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061]}, "terms": {"terms": [], "d": 1395249318552, "v": 3}, "coletas": [{"terms": [], "id": 97}]}, "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQA10tlbPQBXIp4p&w=154&h=154&url=http%3A%2F%2Fimages.immedia.com.br%2F%2F9%2F9146_2_L.JPG", "story": "Georgevan Araujo compartilhou um link.", "updated_time": "2013-12-30T23:59:59", "from": {"name": "Georgevan Araujo", "id": "100000278536009"}, "description": "Segundo o ex-ministro da Fazenda, a prova de que o governo n\u00e3o tem nada de socialista \u00e9 que ele destruiu as suas duas principais empresas: a Petrobras e a Eletrobr\u00e1s", "caption": "www.infomoney.com.br", "privacy": {"value": ""}, "name": "\"O que o governo fez com a Petrobras foi uma trag\u00e9dia\", diz Delfim Netto", "application": {"namespace": "fbipad_", "name": "Facebook for iPad", "id": "173847642670370"}, "link": "http://www.infomoney.com.br/onde-investir/acoes/noticia/3086396/que-governo-fez-com-petrobras-foi-uma-tragedia-diz-delfim", "story_tags": {"0": [{"length": 16, "type": "user", "id": "100000278536009", "name": "Georgevan Araujo", "offset": 0}]}, "created_time": "2013-12-30T23:59:59", "_id": "100000278536009_719669731385638", "type": "link", "id": "100000278536009_719669731385638", "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif"}

I was thinking about:

  • Split this file into several smaller ones and scan them one by one
  • Create a database and load all the information into it for the application to use
  • Write a Java application that strips away the JSON structure and reads the file as it goes

I think the alternatives above are not the best.

  • If each line is a complete JSON document, then each line should be only about 8 KB (≈ 5 GB / 652,339 lines), which is quite tractable. Why not split it into smaller files? Or write a loop that reads line by line (using a BufferedReader or something similar) and uses Gson to parse each line; see the sketch after these comments. If, on the other hand, it is a single 5 GB JSON document, the problem is bigger but not intractable (let me know if that is the case and I will try to write up an answer).

  • It's just a 5 GB file; I was thinking of doing it that way with the BufferedReader.

  • I found a post showing how to do it using the Jackson library (the link in the article requires a password, but I believe this project on GitHub is the same library). It is somewhat similar to Java's SAX API (for XML), but it seems a little more convenient, because it lets you read certain sub-elements as a whole if you like, while reading the bulk of the file as a stream.
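
For reference, a minimal sketch of the line-by-line approach suggested in the comments above, assuming each line of the file is a standalone JSON object; the file name is illustrative, and the "id" field is taken from the example line in the question:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLineReader {
    public static void main(String[] args) throws IOException {
        // Illustrative path; replace with the real file location.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("news.json"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is parsed on its own, so memory use stays
                // proportional to one record, not to the whole 5 GB file.
                // JsonParser.parseString requires Gson 2.8.6+; older versions
                // use new JsonParser().parse(line) instead.
                JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                System.out.println(obj.get("id").getAsString());
            }
        }
    }
}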

2 answers

The database is probably the best solution because:

  • It is made to work with huge amounts of data;

  • 5 GB is too much data to keep in memory, especially in Java;

  • If the data needs to be reused, the whole parsing process will have to be run again, which will undoubtedly take time.

I don't know of a specific tool capable of manipulating that much data. But as long as you don't have more than 2³¹ records at a single level of your object tree (the maximum size of a Java array), have enough memory on your machine, and configure Java with a really large heap limit (8 GB+, e.g. -Xmx8g), I see no problem.

One detail worth noting that can make this a lot easier: if your file consists only of lines like the one described, and nothing else (perhaps separated by commas), and each line is a complete JSON document, then you can process it line by line, parsing each line as its own JSON document and sending it to the database. That eliminates the memory problem mentioned above, with the advantage of being reasonably simple to do; a sketch of this idea follows.
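
A minimal sketch of that idea, under some assumptions: each line is a standalone JSON object, the connection URL is illustrative, and the posts table with id and story columns is hypothetical:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonToDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/news", "user", "pass");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO posts (id, story) VALUES (?, ?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get("news.json"))) {

            String line;
            int pending = 0;
            while ((line = reader.readLine()) != null) {
                JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                stmt.setString(1, obj.get("id").getAsString());
                stmt.setString(2, obj.has("story") ? obj.get("story").getAsString() : null);
                stmt.addBatch();
                // Flush in batches so neither the JVM nor the JDBC driver
                // ever has to hold anything close to 5 GB at once.
                if (++pending == 1000) {
                    stmt.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch();
            }
        }
    }
}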


Since what I really needed from this JSON were just a few fields, what I did was read element by element, one element per iteration, as needed. For this I used the Jackson JSON API. My code is below, extracting only the fields title, url, text and entities from the aforementioned JSON:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BrutoNewsJsonParser {

    JsonFactory factory;
    JsonParser jp;
    JsonToken current;

    public BrutoNewsJsonParser() {
        factory = new JsonFactory();
        jp = null;

        String path = "/home/nicolas/Documentos/X9dadosIC/Bruto/news_jul_dez_2013.json";

        try {
            // createJsonParser() was renamed to createParser() in Jackson 2.x.
            jp = factory.createParser(new File(path));
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    // Reads one top-level JSON object from the stream and returns it as a News.
    // News and EntidadesReader are helper classes from my project.
    public News ler() {
        EntidadesReader er = new EntidadesReader();
        String title = null, url = null, text = null;
        LinkedList<String> entidades = new LinkedList<>();
        boolean controleEntidades = true;

        // Depth counter: incremented on START_OBJECT, decremented on END_OBJECT,
        // so the loop ends exactly when the current top-level object is closed.
        int contador = 0;

        try {
            current = jp.nextToken();
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }

        if (current == JsonToken.START_OBJECT) {
            contador++;
        }

        while (contador != 0) {
            try {
                // getCurrentName() returns the name of the field the parser is
                // positioned at (also while sitting on that field's value token).
                String namefield = jp.getCurrentName();
                if ("title".equals(namefield)) {
                    title = jp.getText();
                } else if ("url".equals(namefield)) {
                    url = jp.getText();
                } else if ("text".equals(namefield)) {
                    text = jp.getText();
                } else if ("entities".equals(namefield) && controleEntidades) {
                    // Only the first "entities" array is consumed; the flag
                    // skips the later occurrences nested in the document.
                    if (current == JsonToken.START_ARRAY) {
                        controleEntidades = false;
                        current = jp.nextToken();
                        while (current != JsonToken.END_ARRAY) {
                            entidades.add(er.traduzir(Integer.parseInt(jp.getText())));
                            current = jp.nextToken();
                        }
                    }
                }

                current = jp.nextToken();
                if (current == JsonToken.END_OBJECT) {
                    contador--;
                } else if (current == JsonToken.START_OBJECT) {
                    contador++;
                }
            } catch (IOException e) {
                System.err.println(current.asString());
                e.printStackTrace();
            }
        }
        try {
            jp.nextToken();
        } catch (JsonParseException j) {
            // Ignored: the record read so far is still returned.
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
        return new News(title, url, text, entidades);
    }
}

With this, every call to the ler method yields one more element.
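
For illustration, a hypothetical caller; I am assuming that News exposes a getTitle() getter and that a null title signals that the stream is exhausted (at end of input nextToken() returns null, so ler() returns a News with all fields null):

public class Main {
    public static void main(String[] args) {
        BrutoNewsJsonParser parser = new BrutoNewsJsonParser();
        // Assumption: ler() returns a News with null fields once the
        // stream has no more top-level objects to read.
        News noticia = parser.ler();
        while (noticia.getTitle() != null) {
            System.out.println(noticia);
            noticia = parser.ler();
        }
    }
}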
