What is the best way to parse a very large JSON file in Java?

I have a very large JSON file: it is 5 GB in size and has 652,339 lines. I was thinking of using the Gson library in Java.

I would like to know the best way to parse this file, since I could not even work out the JSON structure. Here is an example of one line of the file:

{"control": {"lang": {"lang": "pt", "d": 1395183935882, "v": 5}, "last": "UPDATE", "read": {"d": 1395183767992, "v": 3}, "update": {"d": 1395308552817, "v": 2}, "rule": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061], "d": 1395305209944, "v": 3}, "entities": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061]}, "terms": {"terms": [], "d": 1395249318552, "v": 3}, "coletas": [{"terms": [], "id": 97}]}, "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQA10tlbPQBXIp4p&w=154&h=154&url=http%3A%2F%2Fimages.immedia.com.br%2F%2F9%2F9146_2_L.JPG", "story": "Georgevan Araujo compartilhou um link.", "updated_time": "2013-12-30T23:59:59", "from": {"name": "Georgevan Araujo", "id": "100000278536009"}, "description": "Segundo o ex-ministro da Fazenda, a prova de que o governo n\u00e3o tem nada de socialista \u00e9 que ele destruiu as suas duas principais empresas: a Petrobras e a Eletrobr\u00e1s", "caption": "www.infomoney.com.br", "privacy": {"value": ""}, "name": "\"O que o governo fez com a Petrobras foi uma trag\u00e9dia\", diz Delfim Netto", "application": {"namespace": "fbipad_", "name": "Facebook for iPad", "id": "173847642670370"}, "link": "http://www.infomoney.com.br/onde-investir/acoes/noticia/3086396/que-governo-fez-com-petrobras-foi-uma-tragedia-diz-delfim", "story_tags": {"0": [{"length": 16, "type": "user", "id": "100000278536009", "name": "Georgevan Araujo", "offset": 0}]}, "created_time": "2013-12-30T23:59:59", "_id": "100000278536009_719669731385638", "type": "link", "id": "100000278536009_719669731385638", "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif"}

I was thinking about:

  • Split this file into several smaller ones and scan them one by one
  • Create a database and load all the information into it for the application to use
  • Write a Java application that strips away the JSON structure and reads the file as it goes

I think the alternatives above are not the best.

  • If each line is a complete JSON document, then each line should be only about 8 KB (≈ 5 GB / 652,339 lines), which is quite tractable. Why not split it into smaller files? Or write a loop that reads line by line (using a BufferedReader or something similar) and uses Gson to parse each line; see the sketch after these comments. If, on the other hand, it is a single 5 GB JSON document, the problem is bigger but not intractable (let me know if that is the case and I will try to write up an answer).

  • It's just a 5 GB file; I was thinking of doing it that way with the BufferedReader.

  • I found a post showing how to do it using the Jackson library (the link in the article requires a password, but I believe this project on GitHub is the same library). It is somewhat similar to Java's SAX API (for XML), but it seems a little more convenient, because it lets you read certain sub-elements as a whole if you like, while reading the bulk of the file as a stream.
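
For reference, a minimal sketch of the line-by-line approach suggested in the comments above, assuming each line of the file is a standalone JSON object; the file name is illustrative, and the "id" field is taken from the example line in the question:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLineReader {
    public static void main(String[] args) throws IOException {
        // Illustrative path; replace with the real file location.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("news.json"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is parsed on its own, so memory use stays
                // proportional to one record, not to the whole 5 GB file.
                // JsonParser.parseString requires Gson 2.8.6+; older versions
                // use new JsonParser().parse(line) instead.
                JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                System.out.println(obj.get("id").getAsString());
            }
        }
    }
}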

2 answers

The database is probably the best solution because:

  • It is made to work with huge amounts of data;

  • 5 GB is too much data to keep in memory, especially in Java;

  • If the data needs to be reused, the whole parsing process will have to be run again, which will undoubtedly take time.

I don't know of a specific tool capable of manipulating that much data. But as long as you don't have more than 2³¹ records at a single level of your object tree (the maximum size of a Java array), have enough memory on your machine, and configure Java with a really large heap limit (8 GB+, e.g. -Xmx8g), I see no problem.

One detail worth noting that can make this a lot easier: if your file consists only of lines like the one described, and nothing else (perhaps separated by commas), and each line is a complete JSON document, then you can process it line by line, parsing each line as its own JSON document and sending it to the database. That eliminates the memory problem mentioned above, with the advantage of being reasonably simple to do; a sketch of this idea follows.
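
A minimal sketch of that idea, under some assumptions: each line is a standalone JSON object, the connection URL is illustrative, and the posts table with id and story columns is hypothetical:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonToDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/news", "user", "pass");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO posts (id, story) VALUES (?, ?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get("news.json"))) {

            String line;
            int pending = 0;
            while ((line = reader.readLine()) != null) {
                JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                stmt.setString(1, obj.get("id").getAsString());
                stmt.setString(2, obj.has("story") ? obj.get("story").getAsString() : null);
                stmt.addBatch();
                // Flush in batches so neither the JVM nor the JDBC driver
                // ever has to hold anything close to 5 GB at once.
                if (++pending == 1000) {
                    stmt.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch();
            }
        }
    }
}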


Since what I really needed from this JSON were just a few fields, what I did was read element by element, one element per iteration, as needed. For this I used the Jackson JSON API. My code is below, extracting only the fields title, url, text and entities from the aforementioned JSON:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BrutoNewsJsonParser {

    JsonFactory factory;
    JsonParser jp;
    JsonToken current;

    public BrutoNewsJsonParser() {
        factory = new JsonFactory();
        jp = null;

        String path = "/home/nicolas/Documentos/X9dadosIC/Bruto/news_jul_dez_2013.json";

        try {
            // createJsonParser() was renamed to createParser() in Jackson 2.x.
            jp = factory.createParser(new File(path));
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    // Reads one top-level JSON object from the stream and returns it as a News.
    // News and EntidadesReader are helper classes from my project.
    public News ler() {
        EntidadesReader er = new EntidadesReader();
        String title = null, url = null, text = null;
        LinkedList<String> entidades = new LinkedList<>();
        boolean controleEntidades = true;

        // Depth counter: incremented on START_OBJECT, decremented on END_OBJECT,
        // so the loop ends exactly when the current top-level object is closed.
        int contador = 0;

        try {
            current = jp.nextToken();
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }

        if (current == JsonToken.START_OBJECT) {
            contador++;
        }

        while (contador != 0) {
            try {
                // getCurrentName() returns the name of the field the parser is
                // positioned at (also while sitting on that field's value token).
                String namefield = jp.getCurrentName();
                if ("title".equals(namefield)) {
                    title = jp.getText();
                } else if ("url".equals(namefield)) {
                    url = jp.getText();
                } else if ("text".equals(namefield)) {
                    text = jp.getText();
                } else if ("entities".equals(namefield) && controleEntidades) {
                    // Only the first "entities" array is consumed; the flag
                    // skips the later occurrences nested in the document.
                    if (current == JsonToken.START_ARRAY) {
                        controleEntidades = false;
                        current = jp.nextToken();
                        while (current != JsonToken.END_ARRAY) {
                            entidades.add(er.traduzir(Integer.parseInt(jp.getText())));
                            current = jp.nextToken();
                        }
                    }
                }

                current = jp.nextToken();
                if (current == JsonToken.END_OBJECT) {
                    contador--;
                } else if (current == JsonToken.START_OBJECT) {
                    contador++;
                }
            } catch (IOException e) {
                System.err.println(current.asString());
                e.printStackTrace();
            }
        }
        try {
            jp.nextToken();
        } catch (JsonParseException j) {
            // Ignored: the record read so far is still returned.
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
        return new News(title, url, text, entidades);
    }
}

With this, every call to the ler method yields one more element.
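
For illustration, a hypothetical caller; I am assuming that News exposes a getTitle() getter and that a null title signals that the stream is exhausted (at end of input nextToken() returns null, so ler() returns a News with all fields null):

public class Main {
    public static void main(String[] args) {
        BrutoNewsJsonParser parser = new BrutoNewsJsonParser();
        // Assumption: ler() returns a News with null fields once the
        // stream has no more top-level objects to read.
        News noticia = parser.ler();
        while (noticia.getTitle() != null) {
            System.out.println(noticia);
            noticia = parser.ler();
        }
    }
}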
