Import and manipulate json in Python

I am trying to import a .json file with the following structure:

short_description:She left her husband. He killed their children. Just another day in America.
headline:There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV
date:2018-05-26
link:https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89
authors:Melissa Jeltsen
category:CRIME

But apparently the JSON is not formatted properly (the file is here), so I could not import it using pandas like this:

df = pd.read_json('../input/news-category-dataset/News_Category_Dataset.json', lines=True)

I managed to read it this way:

import json

data = []
with open("News_Category_Dataset.json", "r") as f:  # closes the file when done
    for line in f:
        data.append(json.loads(line))  # each line is one JSON object

But from what I understand, this way it is read like any file and the JSON structure is lost (is that right?). So I would like to understand whether the structure is really wrong, whether I have to read it with pandas anyway, and/or whether reading it as a plain file still lets me manipulate the data easily.

EDIT: a larger chunk of the file

{"short_description": "She left her husband. He killed their children. Just another day in America.", "headline": "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89", "authors": "Melissa Jeltsen", "category": "CRIME"}
{"short_description": "Of course it has a song.", "headline": "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201", "authors": "Andy McDonald", "category": "ENTERTAINMENT"}
{"short_description": "The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.", "headline": "Hugh Grant Marries For The First Time At Age 57", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump.", "headline": "Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/jim-carrey-adam-schiff-democrats_us_5b0950e8e4b0fdb2aa53e675", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "The \"Dietland\" actress said using the bags is a \"really cathartic, therapeutic moment.\"", "headline": "Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/julianna-margulies-trump-poop-bag_us_5b093ec2e4b0fdb2aa53df70", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "\"It is not right to equate horrific incidents of sexual assault with misplaced compliments or humor,\" he said in a statement.", "headline": "Morgan Freeman 'Devastated' That Sexual Harassment Claims Could Undermine Legacy", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/morgan-freeman-devastated-sexual-misconduct_us_5b096319e4b0802d69cba298", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "It's catchy, all right.", "headline": "Donald Trump Is Lovin' New McDonald's Jingle In 'Tonight Show' Bit", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/donald-trump-mcondalds-tonight-show_us_5b093561e4b0fdb2aa53daba", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "There's a great mini-series joining this week.", "headline": "What To Watch On Amazon Prime That\u2019s New This Week", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/amazon-prime-what-to-watch_us_5b044625e4b0c0b8b23ec14f", "authors": "Todd Van Luling", "category": "ENTERTAINMENT"}
{"short_description": "Myer's kids may be pushing for a new \"Powers\" film more than anyone.", "headline": "Mike Myers Reveals He'd 'Like To' Do A Fourth Austin Powers Film", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/mike-myers-reveals-he-wants-to-do-a-fourth-austin-powers-film_us_5b096198e4b0802d69cb9f15", "authors": "Andy McDonald", "category": "ENTERTAINMENT"}
{"short_description": "You're getting a recent Academy Award-winning movie.", "headline": "What To Watch On Hulu That\u2019s New This Week", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/hulu-what-to-watch_us_5b0445bae4b0c0b8b23ec046", "authors": "Todd Van Luling", "category": "ENTERTAINMENT"}
{"short_description": "The pop star also wore a \"Santa Fe Strong\" shirt at his show in Houston.", "headline": "Justin Timberlake Visits Texas School Shooting Victims", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/justin-timberlake-visits-texas-school-shooting-victims_us_5b098161e4b0fdb2aa54167e", "authors": "Sebastian Murdock", "category": "ENTERTAINMENT"}
{"short_description": "The two met to pave the way for a summit between North Korean and the U.S.", "headline": "South Korean President Meets North Korea's Kim Jong Un To Talk Trump Summit", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/south-korean-president-meets-north-koreas-kim-jong-un_us_5b094ebae4b0fdb2aa53e504", "authors": "", "category": "WORLD NEWS"}
{"short_description": "The revolution is coming to rural New Brunswick.", "headline": "With Its Way Of Life At Risk, This Remote Oyster-Growing Region Called In Robots", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/remote-oyster-growing-region-called-in-robots_us_5b083658e4b0fdb2aa53415d", "authors": "Karen Pinchin", "category": "IMPACT"}
{"short_description": "Last month a Health and Human Services official revealed the government was unable to locate nearly 1,500 children who had been released from its custody.", "headline": "Trump's Crackdown On Immigrant Parents Puts More Kids In An Already Strained System", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/immigrant-children-separated-from-parents_us_5b087b90e4b0802d69cb4070", "authors": "Elise Foley and Roque Planas", "category": "POLITICS"}
{"short_description": "The wiretaps feature conversations between Alexander Torshin and Alexander Romanov, a convicted Russian money launderer.", "headline": "'Trump's Son Should Be Concerned': FBI Obtained Wiretaps Of Putin Ally Who Met With Trump Jr.", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/fbi-wiretaps-putin-ally-trump-jr_us_5b08bf56e4b0568a880b7859", "authors": "Michael Isikoff, Yahoo News", "category": "POLITICS"}

1 answer


Your file uses a variant of JSON called JSON Lines. By convention, the file extension is .jsonl.

It's a very simple format: the syntax is the same as JSON, but instead of a single JSON document spanning the whole file, it stores one JSON object per line. You can read it in several ways: with pandas or, as in your example, by reading each line of the file separately and decoding it with the standard json module. There are also libraries dedicated to reading this format.
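As a minimal sketch of the difference described above, the snippet below uses two made-up records (not taken from the dataset) held in a string. It assumes only the standard json module, and shows that parsing the whole text as a single JSON document fails while parsing it line by line works.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = (
    '{"headline": "First story", "category": "CRIME"}\n'
    '{"headline": "Second story", "category": "ENTERTAINMENT"}\n'
)

# Parsing the whole text as one JSON document fails, because a second
# object follows the first one.
try:
    json.loads(sample)
    whole_file_is_valid_json = True
except json.JSONDecodeError:
    whole_file_is_valid_json = False

# Parsing line by line works: each non-empty line is a complete JSON value.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_file_is_valid_json)  # False
print(len(records))              # 2
```

This is probably why the file looks "not formatted properly" to tools that expect a single JSON document.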

I couldn’t import using Pandas

I downloaded the complete file (I had to register on the site) and then read it with pandas as usual, using lines=True, the pandas parameter that enables reading JSON Lines:

>>> df = pd.read_json('News_Category_Dataset.json', lines=True)
>>> df.describe()
       authors  category        ...                                                      link short_description
count   124989    124989        ...                                                    124989            124989
unique   19250        31        ...                                                    124964            103905
top             POLITICS        ...         https://www.huffingtonpost.comhttps://www.publ...                  
freq     14151     32739        ...                                                         2             19590
first      NaN       NaN        ...                                                       NaN               NaN
last       NaN       NaN        ...                                                       NaN               NaN

It worked without problems here, as you can see above. If you cannot read the file with pandas, I suggest editing the question to add the full error message, including the traceback, because something else must be wrong.

But from what I understand, this way it is like any file and the JSON structure is lost (is that right?)

This question is confusing. A JSON file is also "any file"; after all, every file is "any file". The structure is not lost, because the data are still read in a structured way, so much so that you can separate, for example, the category from the description as usual.
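To illustrate that the structure survives a line-by-line read, here is a small sketch with three invented records that reuse the field names from the question. Each decoded line is an ordinary Python dict, so fields such as category can be accessed and aggregated directly.

```python
import json
from collections import Counter

# Hypothetical lines, shaped like the records in the question.
lines = [
    '{"headline": "A", "category": "CRIME"}',
    '{"headline": "B", "category": "ENTERTAINMENT"}',
    '{"headline": "C", "category": "ENTERTAINMENT"}',
]

# Each line decodes to a dict, so the keys are still available.
records = [json.loads(line) for line in lines]

# Count how many records fall in each category.
category_counts = Counter(record["category"] for record in records)

# Pick out one field for the records of a single category.
entertainment_headlines = [
    record["headline"]
    for record in records
    if record["category"] == "ENTERTAINMENT"
]

print(category_counts["ENTERTAINMENT"])  # 2
print(entertainment_headlines)           # ['B', 'C']
```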

The only difference is that, instead of using the ready-made function that comes with pandas to interpret the format, you are doing part of the interpretation yourself. Most of the time, using a ready-made implementation from a well-known library is the better solution, but for a specific use case it may be better to read the file manually. Everything depends on what you want to do with the structure afterwards, that is, how you will handle this data.
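One case where reading manually might be preferable is when you only need part of the data and do not want to load everything at once. The sketch below is one possible approach, not the only one: iter_records is a hypothetical helper name, and the temporary file with two invented records only stands in for the real dataset.

```python
import json
import os
import tempfile


def iter_records(path, category=None):
    """Yield one dict per non-empty line, optionally keeping one category."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if category is None or record.get("category") == category:
                yield record


# Write a tiny stand-in file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"headline": "A", "category": "CRIME"}\n')
    tmp.write("\n")
    tmp.write('{"headline": "B", "category": "POLITICS"}\n')
    demo_path = tmp.name

# Only the matching records are kept; lines are decoded one at a time.
politics = list(iter_records(demo_path, category="POLITICS"))
os.remove(demo_path)

print(politics)  # [{'headline': 'B', 'category': 'POLITICS'}]
```

Because the helper is a generator, it holds only one line in memory at a time, which can matter for large files.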
