[
Multiple JSON Documents in files
At times a text file contains multiple JSON documents, typically one valid JSON document per line. Let us understand the process of reading a file with multiple JSON documents, one per line.
- If you use `pandas`, it is straightforward. However, we will talk about using `pandas` later.
- We cannot pass the file directly to `json.load`, because it expects the file to contain a single JSON document. Here are the steps to use the `json` module.
  - Create a file object by passing the path to `open`.
  - Use `read` to read the content of the file into a string.
  - Once the string is created, use `splitlines` to convert it into a list of strings. Each element is a string that contains one JSON document.
  - Iterate through the elements and convert each JSON string to a dict using `json.loads`.
In [1]:
import json
In [2]:
json.load?
Signature:
json.load(
    fp,
    *,
    cls=None,
    object_hook=None,
    parse_float=None,
    parse_int=None,
    parse_constant=None,
    object_pairs_hook=None,
    **kw,
)
Docstring:
Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
a JSON document) to a Python object.

``object_hook`` is an optional function that will be called with the
result of any object literal decode (a ``dict``). The return value of
``object_hook`` will be used instead of the ``dict``. This feature
can be used to implement custom decoders (e.g. JSON-RPC class hinting).

``object_pairs_hook`` is an optional function that will be called with the
result of any object literal decoded with an ordered list of pairs. The
return value of ``object_pairs_hook`` will be used instead of the ``dict``.
This feature can be used to implement custom decoders. If ``object_hook``
is also defined, the ``object_pairs_hook`` takes priority.

To use a custom ``JSONDecoder`` subclass, specify it with the ``cls``
kwarg; otherwise ``JSONDecoder`` is used.
File:      /usr/local/lib/python3.8/json/__init__.py
Type:      function
In [3]:
json.load(open('customers.json'))
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 json.load(open('customers.json'))

File /usr/local/lib/python3.8/json/__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    274 def load(fp, *, cls=None, object_hook=None, parse_float=None,
    275         parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    276     """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    277     a JSON document) to a Python object.
    278
   (...)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,
    296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File /usr/local/lib/python3.8/json/__init__.py:357, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    352         del kw['encoding']
    354     if (cls is None and object_hook is None and
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

File /usr/local/lib/python3.8/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
    338         end = _w(s, end).end()
    339         if end != len(s):
--> 340             raise JSONDecodeError("Extra data", s, end)
    341         return obj

JSONDecodeError: Extra data: line 2 column 1 (char 124)
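The `Extra data` error above arises because `json.load` expects the file to hold exactly one JSON document: the decoder parses the first object and then finds more text after it. Here is a minimal sketch that reproduces the failure on a small in-memory sample. The two records and the variable names are made up for illustration and are not taken from `customers.json`.

```python
import io
import json

# Hypothetical two-line sample standing in for a file with one JSON document per line.
sample = '{"id": 1, "first_name": "A"}\n{"id": 2, "first_name": "B"}'

try:
    # io.StringIO provides a file-like object, so json.load treats it like an opened file.
    json.load(io.StringIO(sample))
except json.JSONDecodeError as err:
    # The decoder parsed the first document and then ran into the second one.
    error_message = err.msg
    error_line = err.lineno
    print(f"{error_message} (line {error_line}, column {err.colno})")
```

The `lineno` and `colno` attributes point at the start of the second document, which is why the file has to be split into lines before parsing.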
- Create a file object by passing the path to `open`.
In [4]:
type(open('customers.json'))
Out[4]:
_io.TextIOWrapper
- Use `read` to read the content of the file into a string.
In [5]:
type(open('customers.json').read())
Out[5]:
str
In [6]:
open('customers.json').read()
Out[6]:
'{"id":1,"first_name":"Frasco","last_name":"Necolds","email":"fnecolds0@vk.com","gender":"Male","ip_address":"243.67.63.34"}\n{"id":2,"first_name":"Dulce","last_name":"Santos","email":"dsantos1@mashable.com","gender":"Female","ip_address":"60.30.246.227"}\n{"id":3,"first_name":"Prissie","last_name":"Tebbett","email":"ptebbett2@infoseek.co.jp","gender":"Genderfluid","ip_address":"22.21.162.56"}\n{"id":4,"first_name":"Schuyler","last_name":"Coppledike","email":"scoppledike3@gnu.org","gender":"Agender","ip_address":"120.35.186.161"}\n{"id":5,"first_name":"Leopold","last_name":"Jarred","email":"ljarred4@wp.com","gender":"Agender","ip_address":"30.119.34.4"}\n{"id":6,"first_name":"Joanna","last_name":"Teager","email":"jteager5@apache.org","gender":"Bigender","ip_address":"245.221.176.34"}\n{"id":7,"first_name":"Lion","last_name":"Beere","email":"lbeere6@bloomberg.com","gender":"Polygender","ip_address":"105.54.139.46"}\n{"id":8,"first_name":"Marabel","last_name":"Wornum","email":"mwornum7@posterous.com","gender":"Polygender","ip_address":"247.229.14.25"}\n{"id":9,"first_name":"Helenka","last_name":"Mullender","email":"hmullender8@cloudflare.com","gender":"Non-binary","ip_address":"133.216.118.88"}\n{"id":10,"first_name":"Christine","last_name":"Swane","email":"cswane9@shop-pro.jp","gender":"Polygender","ip_address":"86.16.210.164"}'
- Once the string is created, we can use `splitlines` to convert it into a list of strings. Each element is a string that contains one JSON document.
In [7]:
customers_str_list = open('customers.json').read().splitlines()
In [8]:
type(customers_str_list)
Out[8]:
list
- Each element in the list is of type string.
In [9]:
type(customers_str_list[0])
Out[9]:
str
In [10]:
len(customers_str_list)
Out[10]:
10
In [11]:
customers_str_list[0]
Out[11]:
'{"id":1,"first_name":"Frasco","last_name":"Necolds","email":"fnecolds0@vk.com","gender":"Male","ip_address":"243.67.63.34"}'
In [12]:
json.loads(customers_str_list[0])
Out[12]:
{'id': 1, 'first_name': 'Frasco', 'last_name': 'Necolds', 'email': 'fnecolds0@vk.com', 'gender': 'Male', 'ip_address': '243.67.63.34'}
In [13]:
customers_str_list[:3]
Out[13]:
['{"id":1,"first_name":"Frasco","last_name":"Necolds","email":"fnecolds0@vk.com","gender":"Male","ip_address":"243.67.63.34"}', '{"id":2,"first_name":"Dulce","last_name":"Santos","email":"dsantos1@mashable.com","gender":"Female","ip_address":"60.30.246.227"}', '{"id":3,"first_name":"Prissie","last_name":"Tebbett","email":"ptebbett2@infoseek.co.jp","gender":"Genderfluid","ip_address":"22.21.162.56"}']
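`splitlines` does not add an empty element for a trailing newline, but a blank line in the middle of the file still produces an empty string, and `json.loads('')` raises `JSONDecodeError`. Here is a defensive sketch, assuming blank lines should simply be skipped. The sample text and variable names are hypothetical.

```python
# Hypothetical content: two records, a blank line between them, and a trailing newline.
raw_text = '{"id": 1}\n\n{"id": 2}\n'

lines = raw_text.splitlines()
# splitlines keeps the blank middle line as '' but adds no element for the final newline.
print(lines)

# Drop lines that are empty or contain only whitespace before parsing.
non_blank_lines = [line for line in lines if line.strip()]
print(non_blank_lines)
```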
- Now we can iterate through the elements and convert each JSON string to a dict using `json.loads`.
In [14]:
customers_dict_list = [json.loads(customer) for customer in customers_str_list]
In [15]:
type(customers_dict_list)
Out[15]:
list
In [16]:
type(customers_dict_list[0])
Out[16]:
dict
In [17]:
customers_dict_list[0]
Out[17]:
{'id': 1, 'first_name': 'Frasco', 'last_name': 'Necolds', 'email': 'fnecolds0@vk.com', 'gender': 'Male', 'ip_address': '243.67.63.34'}
- Here is the logic to convert the list of strings to a list of dicts using the `map` function.
In [18]:
customers_dict_list = list(map(json.loads, customers_str_list))
In [19]:
type(customers_dict_list)
Out[19]:
list
In [20]:
type(customers_dict_list[0])
Out[20]:
dict
In [21]:
customers_dict_list[0]
Out[21]:
{'id': 1, 'first_name': 'Frasco', 'last_name': 'Necolds', 'email': 'fnecolds0@vk.com', 'gender': 'Male', 'ip_address': '243.67.63.34'}
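Putting the steps together, here is one possible helper that also closes the file when it is done. It iterates over the file object line by line instead of calling `read` and `splitlines`, which avoids holding the whole file in a single string. The function name `read_json_lines` and the temporary sample file are assumptions made for illustration.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Return a list with one parsed object per non-blank line of the file at `path`."""
    records = []
    # The with block closes the file even if a line fails to parse.
    with open(path, encoding="utf-8") as fp:
        for line in fp:  # a file object yields one line at a time
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Write a small hypothetical sample so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": 1, "first_name": "Frasco"}\n{"id": 2, "first_name": "Dulce"}\n')
    sample_path = tmp.name

customers = read_json_lines(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(customers)
```

With the file from this section, the equivalent call would be `read_json_lines('customers.json')`.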
]