I have a very large JSON file: it is 5 GB and has 652339 lines. I was thinking of using the Gson library in Java.
I would like to know the best way to parse the file, since I could not even get the JSON structure right. Example of one line of the file:
{"control": {"lang": {"lang": "pt", "d": 1395183935882, "v": 5}, "last": "UPDATE", "read": {"d": 1395183767992, "v": 3}, "update": {"d": 1395308552817, "v": 2}, "rule": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061], "d": 1395305209944, "v": 3}, "entities": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061]}, "terms": {"terms": [], "d": 1395249318552, "v": 3}, "coletas": [{"terms": [], "id": 97}]}, "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQA10tlbPQBXIp4p&w=154&h=154&url=http%3A%2F%2Fimages.immedia.com.br%2F%2F9%2F9146_2_L.JPG", "story": "Georgevan Araujo compartilhou um link.", "updated_time": "2013-12-30T23:59:59", "from": {"name": "Georgevan Araujo", "id": "100000278536009"}, "description": "Segundo o ex-ministro da Fazenda, a prova de que o governo n\u00e3o tem nada de socialista \u00e9 que ele destruiu as suas duas principais empresas: a Petrobras e a Eletrobr\u00e1s", "caption": "www.infomoney.com.br", "privacy": {"value": ""}, "name": "\"O que o governo fez com a Petrobras foi uma trag\u00e9dia\", diz Delfim Netto", "application": {"namespace": "fbipad_", "name": "Facebook for iPad", "id": "173847642670370"}, "link": "http://www.infomoney.com.br/onde-investir/acoes/noticia/3086396/que-governo-fez-com-petrobras-foi-uma-tragedia-diz-delfim", "story_tags": {"0": [{"length": 16, "type": "user", "id": "100000278536009", "name": "Georgevan Araujo", "offset": 0}]}, "created_time": "2013-12-30T23:59:59", "_id": "100000278536009_719669731385638", "type": "link", "id": "100000278536009_719669731385638", "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif"}
I was thinking about:
- Splitting the file into several smaller ones and scanning them one by one
- Creating a database and loading all the information into it for use in the application
- Trying to strip the JSON structure with a Java application and reading the file as it runs
I don't think any of the alternatives above is ideal.
If each line is a complete JSON document, then each line is only about 8 KB (≈ 5 GB / 652339 lines), which is perfectly manageable. Why not split it into smaller files? Or write a loop that reads line by line (using a BufferedReader or something similar) and uses Gson to parse each line. If, on the other hand, it is a single 5 GB JSON document, the problem is harder but not intractable (let me know if that is the case and I'll try to write up an answer). – mgibsonbr
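A minimal sketch of the line-by-line loop suggested above, assuming every non-blank line is one complete JSON document and the file is UTF-8 encoded. The class name `JsonLines` and the method `forEachRecord` are hypothetical names chosen for illustration. The sketch uses only the standard library so it runs as-is; the Gson call is shown as a comment at the point where each line would be parsed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical helper: streams a file of one-JSON-document-per-line records.
final class JsonLines {

    /**
     * Reads the file one line at a time, so only the current line is held
     * in memory, and passes each non-blank line to the handler.
     * Returns the number of lines handed to the handler.
     * Assumption: the file is UTF-8 encoded.
     */
    static long forEachRecord(Path file, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JsonLines <file>");
            return;
        }
        long n = forEachRecord(Paths.get(args[0]), line -> {
            // With Gson on the classpath, each line could be parsed here, e.g.:
            // JsonObject post = new Gson().fromJson(line, JsonObject.class);
        });
        System.out.println(n + " records processed");
    }
}
```

Because each record is handled and then discarded, memory use stays close to the size of the longest line (about 8 KB here) rather than the size of the whole file.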
It's just one 5 GB file; I was thinking of doing it that way, with a BufferedReader.
– Nicolas Bontempo
I found a post showing how to do this with the Jackson library (the link in the article requires a password, but I think this project on GitHub is the same library). It is somewhat similar to Java's SAX API (for XML), but it seems a little more convenient, because it lets you read certain sub-elements as a whole, if you like, while still reading the bulk of the file as a stream.
– mgibsonbr