Write to JSON files using pandas
Let us understand how to write to JSON files using the pandas module.
- We can use `to_json` to write a DataFrame to a JSON file using pandas.
In [1]:
import pandas as pd
In [2]:
courses = [{'course_name': 'Programming using Python',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-09-30'},
{'course_name': 'Data Engineering using Python',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-07-15'}]
In [3]:
df = pd.DataFrame(courses)
In [4]:
df
Out[4]:
| | course_name | course_author | course_status | course_published_dt |
|---|---|---|---|---|
| 0 | Programming using Python | Bob Dillon | published | 2020-09-30 |
| 1 | Data Engineering using Python | Bob Dillon | published | 2020-07-15 |
In [5]:
df.to_json?
Signature: df.to_json( path_or_buf: 'FilePathOrBuffer | None' = None, orient: 'str | None' = None, date_format: 'str | None' = None, double_precision: 'int' = 10, force_ascii: 'bool_t' = True, date_unit: 'str' = 'ms', default_handler: 'Callable[[Any], JSONSerializable] | None' = None, lines: 'bool_t' = False, compression: 'CompressionOptions' = 'infer', index: 'bool_t' = True, indent: 'int | None' = None, storage_options: 'StorageOptions' = None, ) -> 'str | None' Docstring: Convert the object to a JSON string. Note NaN's and None will be converted to null and datetime objects will be converted to UNIX timestamps. Parameters ---------- path_or_buf : str or file handle, optional File path or object. If not specified, the result is returned as a string. orient : str Indication of expected JSON string format. * Series: - default is 'index' - allowed values are: {'split', 'records', 'index', 'table'}. * DataFrame: - default is 'columns' - allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}. * The format of the JSON string: - 'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]} - 'records' : list like [{column -> value}, ... , {column -> value}] - 'index' : dict like {index -> {column -> value}} - 'columns' : dict like {column -> {index -> value}} - 'values' : just the values array - 'table' : dict like {'schema': {schema}, 'data': {data}} Describing the data, where data component is like ``orient='records'``. date_format : {None, 'epoch', 'iso'} Type of date conversion. 'epoch' = epoch milliseconds, 'iso' = ISO8601. The default depends on the `orient`. For ``orient='table'``, the default is 'iso'. For all other orients, the default is 'epoch'. double_precision : int, default 10 The number of decimal places to use when encoding floating point values. force_ascii : bool, default True Force encoded string to be ASCII. 
date_unit : str, default 'ms' (milliseconds) The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us', 'ns' for second, millisecond, microsecond, and nanosecond respectively. default_handler : callable, default None Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object. lines : bool, default False If 'orient' is 'records' write out line-delimited json format. Will throw ValueError if incorrect 'orient' since others are not list-like. compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None} A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename. index : bool, default True Whether to include the index values in the JSON string. Not including the index (``index=False``) is only supported when orient is 'split' or 'table'. indent : int, optional Length of whitespace used to indent each record. .. versionadded:: 1.0.0 storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec``. Please see ``fsspec`` and ``urllib`` for more details. .. versionadded:: 1.2.0 Returns ------- None or str If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None. See Also -------- read_json : Convert a JSON string to pandas object. Notes ----- The behavior of ``indent=0`` varies from the stdlib, which does not indent the output but does insert newlines. Currently, ``indent=0`` and the default ``indent=None`` are equivalent in pandas, though this may change in a future release. 
``orient='table'`` contains a 'pandas_version' field under 'schema'. This stores the version of `pandas` used in the latest revision of the schema. Examples -------- >>> import json >>> df = pd.DataFrame( ... [["a", "b"], ["c", "d"]], ... index=["row 1", "row 2"], ... columns=["col 1", "col 2"], ... ) >>> result = df.to_json(orient="split") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP { "columns": [ "col 1", "col 2" ], "index": [ "row 1", "row 2" ], "data": [ [ "a", "b" ], [ "c", "d" ] ] } Encoding/decoding a Dataframe using ``'records'`` formatted JSON. Note that index labels are not preserved with this encoding. >>> result = df.to_json(orient="records") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP [ { "col 1": "a", "col 2": "b" }, { "col 1": "c", "col 2": "d" } ] Encoding/decoding a Dataframe using ``'index'`` formatted JSON: >>> result = df.to_json(orient="index") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP { "row 1": { "col 1": "a", "col 2": "b" }, "row 2": { "col 1": "c", "col 2": "d" } } Encoding/decoding a Dataframe using ``'columns'`` formatted JSON: >>> result = df.to_json(orient="columns") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP { "col 1": { "row 1": "a", "row 2": "c" }, "col 2": { "row 1": "b", "row 2": "d" } } Encoding/decoding a Dataframe using ``'values'`` formatted JSON: >>> result = df.to_json(orient="values") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP [ [ "a", "b" ], [ "c", "d" ] ] Encoding with Table Schema: >>> result = df.to_json(orient="table") >>> parsed = json.loads(result) >>> json.dumps(parsed, indent=4) # doctest: +SKIP { "schema": { "fields": [ { "name": "index", "type": "string" }, { "name": "col 1", "type": "string" }, { "name": "col 2", "type": "string" } ], "primaryKey": [ "index" ], "pandas_version": "0.20.0" }, "data": [ { "index": "row 1", 
"col 1": "a", "col 2": "b" }, { "index": "row 2", "col 1": "c", "col 2": "d" } ] } File: ~/.local/lib/python3.8/site-packages/pandas/core/generic.py Type: method
- Dump JSON documents into a file with one JSON document per line.
In [6]:
courses = [{'course_name': 'Programming using Python',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-09-30'},
{'course_name': 'Data Engineering using Python',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-07-15'},
{'course_name': 'Data Engineering using Scala',
'course_author': 'Elvis Presley',
'course_status': 'draft',
'course_published_dt': None},
{'course_name': 'Programming using Scala',
'course_author': 'Elvis Presley',
'course_status': 'published',
'course_published_dt': '2020-05-12'},
{'course_name': 'Programming using Java',
'course_author': 'Mike Jack',
'course_status': 'inactive',
'course_published_dt': '2020-08-10'},
{'course_name': 'Web Applications - Python Flask',
'course_author': 'Bob Dillon',
'course_status': 'inactive',
'course_published_dt': '2020-07-20'},
{'course_name': 'Web Applications - Java Spring',
'course_author': 'Mike Jack',
'course_status': 'draft',
'course_published_dt': None},
{'course_name': 'Pipeline Orchestration - Python',
'course_author': 'Bob Dillon',
'course_status': 'draft',
'course_published_dt': None},
{'course_name': 'Streaming Pipelines - Python',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-10-05'},
{'course_name': 'Web Applications - Scala Play',
'course_author': 'Elvis Presley',
'course_status': 'inactive',
'course_published_dt': '2020-09-30'},
{'course_name': 'Web Applications - Python Django',
'course_author': 'Bob Dillon',
'course_status': 'published',
'course_published_dt': '2020-06-23'},
{'course_name': 'Server Automation - Ansible',
'course_author': 'Uncle Sam',
'course_status': 'published',
'course_published_dt': '2020-07-05'}]
In [7]:
!rm -f data/courses/courses.json
In [8]:
import pandas as pd
In [9]:
courses_df = pd.DataFrame(courses)
In [10]:
courses_df
Out[10]:
| | course_name | course_author | course_status | course_published_dt |
|---|---|---|---|---|
| 0 | Programming using Python | Bob Dillon | published | 2020-09-30 |
| 1 | Data Engineering using Python | Bob Dillon | published | 2020-07-15 |
| 2 | Data Engineering using Scala | Elvis Presley | draft | None |
| 3 | Programming using Scala | Elvis Presley | published | 2020-05-12 |
| 4 | Programming using Java | Mike Jack | inactive | 2020-08-10 |
| 5 | Web Applications - Python Flask | Bob Dillon | inactive | 2020-07-20 |
| 6 | Web Applications - Java Spring | Mike Jack | draft | None |
| 7 | Pipeline Orchestration - Python | Bob Dillon | draft | None |
| 8 | Streaming Pipelines - Python | Bob Dillon | published | 2020-10-05 |
| 9 | Web Applications - Scala Play | Elvis Presley | inactive | 2020-09-30 |
| 10 | Web Applications - Python Django | Bob Dillon | published | 2020-06-23 |
| 11 | Server Automation - Ansible | Uncle Sam | published | 2020-07-05 |
In [11]:
courses_df.to_json('data/courses/courses.json', orient='records', lines=True)
In [12]:
!ls -ltr data/courses/courses.json
-rw-r--r-- 1 itversity itversity 1633 Mar 30 12:47 data/courses/courses.json
In [13]:
!cat data/courses/courses.json
{"course_name":"Programming using Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-09-30"} {"course_name":"Data Engineering using Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-07-15"} {"course_name":"Data Engineering using Scala","course_author":"Elvis Presley","course_status":"draft","course_published_dt":null} {"course_name":"Programming using Scala","course_author":"Elvis Presley","course_status":"published","course_published_dt":"2020-05-12"} {"course_name":"Programming using Java","course_author":"Mike Jack","course_status":"inactive","course_published_dt":"2020-08-10"} {"course_name":"Web Applications - Python Flask","course_author":"Bob Dillon","course_status":"inactive","course_published_dt":"2020-07-20"} {"course_name":"Web Applications - Java Spring","course_author":"Mike Jack","course_status":"draft","course_published_dt":null} {"course_name":"Pipeline Orchestration - Python","course_author":"Bob Dillon","course_status":"draft","course_published_dt":null} {"course_name":"Streaming Pipelines - Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-10-05"} {"course_name":"Web Applications - Scala Play","course_author":"Elvis Presley","course_status":"inactive","course_published_dt":"2020-09-30"} {"course_name":"Web Applications - Python Django","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-06-23"} {"course_name":"Server Automation - Ansible","course_author":"Uncle Sam","course_status":"published","course_published_dt":"2020-07-05"}
- We can also write a Pandas DataFrame to a JSON file with `orient='split'`.
In [14]:
courses_df
Out[14]:
| | course_name | course_author | course_status | course_published_dt |
|---|---|---|---|---|
| 0 | Programming using Python | Bob Dillon | published | 2020-09-30 |
| 1 | Data Engineering using Python | Bob Dillon | published | 2020-07-15 |
| 2 | Data Engineering using Scala | Elvis Presley | draft | None |
| 3 | Programming using Scala | Elvis Presley | published | 2020-05-12 |
| 4 | Programming using Java | Mike Jack | inactive | 2020-08-10 |
| 5 | Web Applications - Python Flask | Bob Dillon | inactive | 2020-07-20 |
| 6 | Web Applications - Java Spring | Mike Jack | draft | None |
| 7 | Pipeline Orchestration - Python | Bob Dillon | draft | None |
| 8 | Streaming Pipelines - Python | Bob Dillon | published | 2020-10-05 |
| 9 | Web Applications - Scala Play | Elvis Presley | inactive | 2020-09-30 |
| 10 | Web Applications - Python Django | Bob Dillon | published | 2020-06-23 |
| 11 | Server Automation - Ansible | Uncle Sam | published | 2020-07-05 |
In [15]:
courses_df.to_json('data/courses/courses.json', orient='split')
In [16]:
!ls -ltr data/courses/courses.json
-rw-r--r-- 1 itversity itversity 943 Mar 30 12:47 data/courses/courses.json
In [17]:
!cat data/courses/courses.json
{"columns":["course_name","course_author","course_status","course_published_dt"],"index":[0,1,2,3,4,5,6,7,8,9,10,11],"data":[["Programming using Python","Bob Dillon","published","2020-09-30"],["Data Engineering using Python","Bob Dillon","published","2020-07-15"],["Data Engineering using Scala","Elvis Presley","draft",null],["Programming using Scala","Elvis Presley","published","2020-05-12"],["Programming using Java","Mike Jack","inactive","2020-08-10"],["Web Applications - Python Flask","Bob Dillon","inactive","2020-07-20"],["Web Applications - Java Spring","Mike Jack","draft",null],["Pipeline Orchestration - Python","Bob Dillon","draft",null],["Streaming Pipelines - Python","Bob Dillon","published","2020-10-05"],["Web Applications - Scala Play","Elvis Presley","inactive","2020-09-30"],["Web Applications - Python Django","Bob Dillon","published","2020-06-23"],["Server Automation - Ansible","Uncle Sam","published","2020-07-05"]]}
- We can also write a Pandas DataFrame to a JSON file with `orient='records'`. Without `lines=True`, it generates a single valid JSON array rather than one document per line.
In [18]:
courses_df.to_json('data/courses/courses.json', orient='records')
In [19]:
!ls -ltr data/courses/courses.json
-rw-r--r-- 1 itversity itversity 1634 Mar 30 12:47 data/courses/courses.json
In [20]:
!cat data/courses/courses.json
[{"course_name":"Programming using Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-09-30"},{"course_name":"Data Engineering using Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-07-15"},{"course_name":"Data Engineering using Scala","course_author":"Elvis Presley","course_status":"draft","course_published_dt":null},{"course_name":"Programming using Scala","course_author":"Elvis Presley","course_status":"published","course_published_dt":"2020-05-12"},{"course_name":"Programming using Java","course_author":"Mike Jack","course_status":"inactive","course_published_dt":"2020-08-10"},{"course_name":"Web Applications - Python Flask","course_author":"Bob Dillon","course_status":"inactive","course_published_dt":"2020-07-20"},{"course_name":"Web Applications - Java Spring","course_author":"Mike Jack","course_status":"draft","course_published_dt":null},{"course_name":"Pipeline Orchestration - Python","course_author":"Bob Dillon","course_status":"draft","course_published_dt":null},{"course_name":"Streaming Pipelines - Python","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-10-05"},{"course_name":"Web Applications - Scala Play","course_author":"Elvis Presley","course_status":"inactive","course_published_dt":"2020-09-30"},{"course_name":"Web Applications - Python Django","course_author":"Bob Dillon","course_status":"published","course_published_dt":"2020-06-23"},{"course_name":"Server Automation - Ansible","course_author":"Uncle Sam","course_status":"published","course_published_dt":"2020-07-05"}]