Writing Data Frames to Files
Pandas also provides simple APIs to write Data Frames back to files.
- Let us write the revenue per order, along with the order_id, to a file.
Here are the steps to follow before writing a Data Frame to a file.
- Make sure you have the Data Frame that is supposed to be written to the file.
- Ensure that you have write permissions on the folder under which the files are supposed to be written.
- Use the appropriate keyword arguments to write the Data Frame to the file as per the requirements.
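The steps above can be sketched end to end on a toy frame, using a temporary directory instead of the notebook's paths (the frame and folder names here are illustrative, not from the dataset):

```python
import os
import tempfile

import pandas as pd

# A toy frame standing in for revenue per order
df = pd.DataFrame({'order_id': [1, 2, 4], 'revenue': [299.98, 579.98, 699.85]})

# Ensure the target folder exists and is writable
out_dir = tempfile.mkdtemp()

# Keyword arguments control the output: index=False drops row labels,
# float_format fixes floats to 2 decimals
path = os.path.join(out_dir, 'revenue_per_order.csv')
df.to_csv(path, index=False, float_format='%.2f')

print(open(path).read())
```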
In [1]:
%run 06_csv_to_pandas_data_frame.ipynb
In [2]:
order_items.to_csv?
Signature:
order_items.to_csv(
    path_or_buf: 'FilePathOrBuffer[AnyStr] | None' = None,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None' = None,
    mode: 'str' = 'w',
    encoding: 'str | None' = None,
    compression: 'CompressionOptions' = 'infer',
    quoting: 'int | None' = None,
    quotechar: 'str' = '"',
    line_terminator: 'str | None' = None,
    chunksize: 'int | None' = None,
    date_format: 'str | None' = None,
    doublequote: 'bool_t' = True,
    escapechar: 'str | None' = None,
    decimal: 'str' = '.',
    errors: 'str' = 'strict',
    storage_options: 'StorageOptions' = None,
) -> 'str | None'
Docstring:
Write object to a comma-separated values (csv) file.

Parameters
----------
path_or_buf : str or file handle, default None
    File path or object, if None is provided the result is returned as a string.
    If a non-binary file object is passed, it should be opened with `newline=''`,
    disabling universal newlines. If a binary file object is passed, `mode` might
    need to contain a `'b'`.

    .. versionchanged:: 1.2.0
       Support for binary file objects was introduced.
sep : str, default ','
    String of length 1. Field delimiter for the output file.
na_rep : str, default ''
    Missing data representation.
float_format : str, default None
    Format string for floating point numbers.
columns : sequence, optional
    Columns to write.
header : bool or list of str, default True
    Write out the column names. If a list of strings is given it is
    assumed to be aliases for the column names.
index : bool, default True
    Write row names (index).
index_label : str or sequence, or False, default None
    Column label for index column(s) if desired. If None is given, and
    `header` and `index` are True, then the index names are used. A
    sequence should be given if the object uses MultiIndex. If False do
    not print fields for index names. Use index_label=False for easier
    importing in R.
mode : str
    Python write mode, default 'w'.
encoding : str, optional
    A string representing the encoding to use in the output file,
    defaults to 'utf-8'. `encoding` is not supported if `path_or_buf`
    is a non-binary file object.
compression : str or dict, default 'infer'
    If str, represents compression mode. If dict, value at 'method' is
    the compression mode. Compression mode may be any of the following
    possible values: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If
    compression mode is 'infer' and `path_or_buf` is path-like, then
    detect compression mode from the following extensions: '.gz',
    '.bz2', '.zip' or '.xz' (otherwise no compression). If dict given
    and mode is one of {'zip', 'gzip', 'bz2'}, or inferred as one of the
    above, other entries passed as additional compression options.

    .. versionchanged:: 1.0.0
       May now be a dict with key 'method' as compression mode and other
       entries as additional compression options if compression mode is 'zip'.
    .. versionchanged:: 1.1.0
       Passing compression options as keys in dict is supported for
       compression modes 'gzip' and 'bz2' as well as 'zip'.
    .. versionchanged:: 1.2.0
       Compression is supported for binary file objects.
    .. versionchanged:: 1.2.0
       Previous versions forwarded dict entries for 'gzip' to `gzip.open`
       instead of `gzip.GzipFile` which prevented setting `mtime`.
quoting : optional constant from csv module
    Defaults to csv.QUOTE_MINIMAL. If you have set a `float_format`
    then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
    will treat them as non-numeric.
quotechar : str, default '"'
    String of length 1. Character used to quote fields.
line_terminator : str, optional
    The newline character or character sequence to use in the output
    file. Defaults to `os.linesep`, which depends on the OS in which
    this method is called ('\n' for linux, '\r\n' for Windows, i.e.).
chunksize : int or None
    Rows to write at a time.
date_format : str, default None
    Format string for datetime objects.
doublequote : bool, default True
    Control quoting of `quotechar` inside a field.
escapechar : str, default None
    String of length 1. Character used to escape `sep` and `quotechar`
    when appropriate.
decimal : str, default '.'
    Character recognized as decimal separator. E.g. use ',' for
    European data.
errors : str, default 'strict'
    Specifies how encoding and decoding errors are to be handled.
    See the errors argument for :func:`open` for a full list of options.

    .. versionadded:: 1.1.0
storage_options : dict, optional
    Extra options that make sense for a particular storage connection,
    e.g. host, port, username, password, etc. For HTTP(S) URLs the
    key-value pairs are forwarded to ``urllib`` as header options. For
    other URLs (e.g. starting with "s3://", and "gcs://") the key-value
    pairs are forwarded to ``fsspec``. Please see ``fsspec`` and
    ``urllib`` for more details.

    .. versionadded:: 1.2.0

Returns
-------
None or str
    If path_or_buf is None, returns the resulting csv format as a
    string. Otherwise returns None.

See Also
--------
read_csv : Load a CSV file into a DataFrame.
to_excel : Write DataFrame to an Excel file.

Examples
--------
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  # doctest: +SKIP
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  # doctest: +SKIP

File:      ~/.local/lib/python3.8/site-packages/pandas/core/generic.py
Type:      method
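A quick illustration of a few of these keyword arguments on a toy frame (the frame here is made up for the demo; passing no path makes `to_csv` return the CSV as a string, as the docstring notes):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [10.5, None]})

# index=False drops the row labels, na_rep fills in missing values,
# sep swaps the delimiter; with no path the CSV comes back as a string
csv_text = df.to_csv(index=False, na_rep='N/A', sep='|')
print(csv_text)
```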
In [3]:
import getpass
username = getpass.getuser()
In [4]:
username
Out[4]:
'itversity'
In [5]:
base_dir = f"/home/{username}/data/retail_db"
base_dir
Out[5]:
'/home/itversity/data/retail_db'
In [6]:
output_dir = f'{base_dir}/revenue_per_order'
output_dir
Out[6]:
'/home/itversity/data/retail_db/revenue_per_order'
In [7]:
%%sh
rm -rf /home/`whoami`/data/retail_db/revenue_per_order
In [8]:
import subprocess
subprocess.call(['rm', '-rf', output_dir])
Out[8]:
0
In [9]:
import subprocess
subprocess.call(['mkdir', '-p', output_dir])
Out[9]:
0
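Instead of shelling out with `subprocess`, the same cleanup and folder creation can be done with the standard library; a sketch using a stand-in path for `output_dir`:

```python
import shutil
from pathlib import Path

# Stand-in for output_dir from the notebook
target = Path('/tmp/demo_revenue_per_order')

# Equivalent of rm -rf: remove the tree, ignoring a missing folder
shutil.rmtree(target, ignore_errors=True)

# Equivalent of mkdir -p: create parents, don't fail if it exists
target.mkdir(parents=True, exist_ok=True)

print(target.exists())
```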
In [10]:
import subprocess
subprocess.check_output(['ls', '-ltr', output_dir])
Out[10]:
b'total 0\n'
In [11]:
%%sh
ls -ltr /data/retail_db
total 20128
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 categories
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 customers
-rw-rw-r-- 1 itversity itversity     1748 Mar  8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar  8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 departments
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar  8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 orders
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 products
In [12]:
orders
Out[12]:
| order_id | order_date | order_customer_id | order_status |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
… | … | … | … | … |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
In [13]:
order_items
Out[13]:
| order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [14]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count'])
Out[14]:
| sum | min | max | count |
---|---|---|---|---|
order_item_order_id | | | | |
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
… | … | … | … | … |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
In [15]:
order_items
Out[15]:
| order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [16]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
reset_index()
Out[16]:
| order_item_order_id | sum | min | max | count |
---|---|---|---|---|---|
0 | 1 | 299.98 | 299.98 | 299.98 | 1 |
1 | 2 | 579.98 | 129.99 | 250.00 | 3 |
2 | 4 | 699.85 | 49.98 | 299.95 | 4 |
3 | 5 | 1129.86 | 99.96 | 299.98 | 5 |
4 | 7 | 579.92 | 79.95 | 299.98 | 3 |
… | … | … | … | … | … |
57426 | 68879 | 1259.97 | 129.99 | 999.99 | 3 |
57427 | 68880 | 999.77 | 149.94 | 250.00 | 5 |
57428 | 68881 | 129.99 | 129.99 | 129.99 | 1 |
57429 | 68882 | 109.99 | 50.00 | 59.99 | 2 |
57430 | 68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 5 columns
In [17]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
to_json(f'{output_dir}/revenue_per_order.json', orient='table')
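With `orient='table'`, the JSON file embeds a Table Schema alongside the data, so it can be round-tripped with `read_json`. A small sketch on a toy frame (the frame is made up for the demo):

```python
import io

import pandas as pd

df = pd.DataFrame({'revenue': [299.98, 579.98]},
                  index=pd.Index([1, 2], name='order_id'))

# orient='table' writes a JSON Table Schema plus the data rows
json_text = df.to_json(orient='table')

# read_json with the same orient restores columns, index and dtypes
restored = pd.read_json(io.StringIO(json_text), orient='table')
print(restored)
```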
In [18]:
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
total 4884
-rw-r--r-- 1 itversity itversity 4999377 Mar 30 11:45 revenue_per_order.json
In [ ]:
%%sh
head -10 /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.json
In [ ]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'})
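The agg-then-rename pattern above can also be written in one step with named aggregation (available in pandas 0.25 and later); a sketch on a toy frame with the same column names:

```python
import pandas as pd

items = pd.DataFrame({
    'order_item_order_id': [1, 2, 2, 2],
    'order_item_subtotal': [299.98, 129.99, 250.00, 199.99],
})

# Keyword arguments to agg name the output columns directly,
# avoiding the separate rename(columns=...) call
summary = items.groupby('order_item_order_id')['order_item_subtotal'].agg(
    revenue='sum',
    min_subtotal='min',
    max_subtotal='max',
    item_count='count',
)
print(summary)
```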
In [ ]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
round(2). \
to_csv(output_dir + '/revenue_per_order.csv')
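One way to sanity-check the written CSV is to read it back with `read_csv`; a sketch with a toy frame and a temporary file (names here are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'order_id': [1, 2],
                   'revenue': [299.98, 579.98]}).set_index('order_id')

path = os.path.join(tempfile.mkdtemp(), 'revenue_per_order.csv')
df.round(2).to_csv(path)  # the index (order_id) is written as the first column

# Read it back, restoring order_id as the index
check = pd.read_csv(path, index_col='order_id')
print(check)
```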
In [ ]:
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
In [ ]:
%%sh
head /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.csv