Writing Data Frames to Files
Pandas also provides simple APIs to write Data Frames back to files.
- Let us write the revenue per order, along with the order_id, to a file.
Here are the steps to follow before writing a Data Frame to a file.
- Make sure you have the Data Frame that is supposed to be written to the file.
- Ensure that you have write permissions on the folder under which the files are supposed to be written.
- Use the appropriate keyword arguments to write the Data Frame to the file as per the requirements.
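The steps above can be sketched end to end on a toy frame, using a temporary directory instead of the notebook's paths (the frame and folder names here are illustrative, not from the dataset):

```python
import os
import tempfile

import pandas as pd

# A toy frame standing in for revenue per order
df = pd.DataFrame({'order_id': [1, 2, 4], 'revenue': [299.98, 579.98, 699.85]})

# Ensure the target folder exists and is writable
out_dir = tempfile.mkdtemp()

# Keyword arguments control the output: index=False drops row labels,
# float_format fixes floats to 2 decimals
path = os.path.join(out_dir, 'revenue_per_order.csv')
df.to_csv(path, index=False, float_format='%.2f')

print(open(path).read())
```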
In [1]:
%run 06_csv_to_pandas_data_frame.ipynb
In [2]:
order_items.to_csv?
Signature:
order_items.to_csv(
    path_or_buf: 'FilePathOrBuffer[AnyStr] | None' = None,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None' = None,
    mode: 'str' = 'w',
    encoding: 'str | None' = None,
    compression: 'CompressionOptions' = 'infer',
    quoting: 'int | None' = None,
    quotechar: 'str' = '"',
    line_terminator: 'str | None' = None,
    chunksize: 'int | None' = None,
    date_format: 'str | None' = None,
    doublequote: 'bool_t' = True,
    escapechar: 'str | None' = None,
    decimal: 'str' = '.',
    errors: 'str' = 'strict',
    storage_options: 'StorageOptions' = None,
) -> 'str | None'
Docstring:
Write object to a comma-separated values (csv) file.

Parameters
----------
path_or_buf : str or file handle, default None
    File path or object, if None is provided the result is returned as a string.
    If a non-binary file object is passed, it should be opened with `newline=''`,
    disabling universal newlines. If a binary file object is passed, `mode` might
    need to contain a `'b'`.

    .. versionchanged:: 1.2.0
       Support for binary file objects was introduced.
sep : str, default ','
    String of length 1. Field delimiter for the output file.
na_rep : str, default ''
    Missing data representation.
float_format : str, default None
    Format string for floating point numbers.
columns : sequence, optional
    Columns to write.
header : bool or list of str, default True
    Write out the column names. If a list of strings is given it is
    assumed to be aliases for the column names.
index : bool, default True
    Write row names (index).
index_label : str or sequence, or False, default None
    Column label for index column(s) if desired. If None is given, and
    `header` and `index` are True, then the index names are used. A
    sequence should be given if the object uses MultiIndex. If False do
    not print fields for index names. Use index_label=False for easier
    importing in R.
mode : str
    Python write mode, default 'w'.
encoding : str, optional
    A string representing the encoding to use in the output file,
    defaults to 'utf-8'. `encoding` is not supported if `path_or_buf`
    is a non-binary file object.
compression : str or dict, default 'infer'
    If str, represents compression mode. If dict, value at 'method' is
    the compression mode. Compression mode may be any of the following
    possible values: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If
    compression mode is 'infer' and `path_or_buf` is path-like, then
    detect compression mode from the following extensions: '.gz',
    '.bz2', '.zip' or '.xz' (otherwise no compression). If dict given
    and mode is one of {'zip', 'gzip', 'bz2'}, or inferred as one of the
    above, other entries passed as additional compression options.

    .. versionchanged:: 1.0.0
       May now be a dict with key 'method' as compression mode and other
       entries as additional compression options if compression mode is 'zip'.
    .. versionchanged:: 1.1.0
       Passing compression options as keys in dict is supported for
       compression modes 'gzip' and 'bz2' as well as 'zip'.
    .. versionchanged:: 1.2.0
       Compression is supported for binary file objects.
    .. versionchanged:: 1.2.0
       Previous versions forwarded dict entries for 'gzip' to `gzip.open`
       instead of `gzip.GzipFile` which prevented setting `mtime`.
quoting : optional constant from csv module
    Defaults to csv.QUOTE_MINIMAL. If you have set a `float_format`
    then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
    will treat them as non-numeric.
quotechar : str, default '"'
    String of length 1. Character used to quote fields.
line_terminator : str, optional
    The newline character or character sequence to use in the output
    file. Defaults to `os.linesep`, which depends on the OS in which
    this method is called ('\n' for linux, '\r\n' for Windows, i.e.).
chunksize : int or None
    Rows to write at a time.
date_format : str, default None
    Format string for datetime objects.
doublequote : bool, default True
    Control quoting of `quotechar` inside a field.
escapechar : str, default None
    String of length 1. Character used to escape `sep` and `quotechar`
    when appropriate.
decimal : str, default '.'
    Character recognized as decimal separator. E.g. use ',' for
    European data.
errors : str, default 'strict'
    Specifies how encoding and decoding errors are to be handled.
    See the errors argument for :func:`open` for a full list of options.

    .. versionadded:: 1.1.0
storage_options : dict, optional
    Extra options that make sense for a particular storage connection,
    e.g. host, port, username, password, etc. For HTTP(S) URLs the
    key-value pairs are forwarded to ``urllib`` as header options. For
    other URLs (e.g. starting with "s3://", and "gcs://") the key-value
    pairs are forwarded to ``fsspec``. Please see ``fsspec`` and
    ``urllib`` for more details.

    .. versionadded:: 1.2.0

Returns
-------
None or str
    If path_or_buf is None, returns the resulting csv format as a
    string. Otherwise returns None.

See Also
--------
read_csv : Load a CSV file into a DataFrame.
to_excel : Write DataFrame to an Excel file.

Examples
--------
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  # doctest: +SKIP
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  # doctest: +SKIP

File:      ~/.local/lib/python3.8/site-packages/pandas/core/generic.py
Type:      method
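A quick illustration of a few of these keyword arguments on a toy frame (the frame here is made up for the demo; passing no path makes `to_csv` return the CSV as a string, as the docstring notes):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [10.5, None]})

# index=False drops the row labels, na_rep fills in missing values,
# sep swaps the delimiter; with no path the CSV comes back as a string
csv_text = df.to_csv(index=False, na_rep='N/A', sep='|')
print(csv_text)
```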
In [3]:
import getpass
username = getpass.getuser()
In [4]:
username
Out[4]:
'itversity'
In [5]:
base_dir = f"/home/{username}/data/retail_db"
base_dir
Out[5]:
'/home/itversity/data/retail_db'
In [6]:
output_dir = f'{base_dir}/revenue_per_order'
output_dir
Out[6]:
'/home/itversity/data/retail_db/revenue_per_order'
In [7]:
%%sh
rm -rf /home/`whoami`/data/retail_db/revenue_per_order
In [8]:
import subprocess
subprocess.call(['rm', '-rf', output_dir])
Out[8]:
0
In [9]:
import subprocess
subprocess.call(['mkdir', '-p', output_dir])
Out[9]:
0
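Instead of shelling out with `subprocess`, the same cleanup and folder creation can be done with the standard library; a sketch using a stand-in path for `output_dir`:

```python
import shutil
from pathlib import Path

# Stand-in for output_dir from the notebook
target = Path('/tmp/demo_revenue_per_order')

# Equivalent of rm -rf: remove the tree, ignoring a missing folder
shutil.rmtree(target, ignore_errors=True)

# Equivalent of mkdir -p: create parents, don't fail if it exists
target.mkdir(parents=True, exist_ok=True)

print(target.exists())
```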
In [10]:
import subprocess
subprocess.check_output(['ls', '-ltr', output_dir])
Out[10]:
b'total 0\n'
In [11]:
%%sh
ls -ltr /data/retail_db
total 20128
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 categories
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 customers
-rw-rw-r-- 1 itversity itversity     1748 Mar  8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar  8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 departments
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar  8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 orders
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 products
In [12]:
orders
Out[12]:
| order_id | order_date | order_customer_id | order_status |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
… | … | … | … | … |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
In [13]:
order_items
Out[13]:
| order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [14]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count'])
Out[14]:
| sum | min | max | count |
---|---|---|---|---|
order_item_order_id | | | | |
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
… | … | … | … | … |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
In [15]:
order_items
Out[15]:
| order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [16]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
reset_index()
Out[16]:
| order_item_order_id | sum | min | max | count |
---|---|---|---|---|---|
0 | 1 | 299.98 | 299.98 | 299.98 | 1 |
1 | 2 | 579.98 | 129.99 | 250.00 | 3 |
2 | 4 | 699.85 | 49.98 | 299.95 | 4 |
3 | 5 | 1129.86 | 99.96 | 299.98 | 5 |
4 | 7 | 579.92 | 79.95 | 299.98 | 3 |
… | … | … | … | … | … |
57426 | 68879 | 1259.97 | 129.99 | 999.99 | 3 |
57427 | 68880 | 999.77 | 149.94 | 250.00 | 5 |
57428 | 68881 | 129.99 | 129.99 | 129.99 | 1 |
57429 | 68882 | 109.99 | 50.00 | 59.99 | 2 |
57430 | 68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 5 columns
In [17]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
to_json(f'{output_dir}/revenue_per_order.json', orient='table')
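With `orient='table'`, the JSON file embeds a Table Schema alongside the data, so it can be round-tripped with `read_json`. A small sketch on a toy frame (the frame is made up for the demo):

```python
import io

import pandas as pd

df = pd.DataFrame({'revenue': [299.98, 579.98]},
                  index=pd.Index([1, 2], name='order_id'))

# orient='table' writes a JSON Table Schema plus the data rows
json_text = df.to_json(orient='table')

# read_json with the same orient restores columns, index and dtypes
restored = pd.read_json(io.StringIO(json_text), orient='table')
print(restored)
```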
In [18]:
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
total 4884
-rw-r--r-- 1 itversity itversity 4999377 Mar 30 11:45 revenue_per_order.json
In [ ]:
%%sh
head -10 /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.json
In [ ]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'})
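The agg-then-rename pattern above can also be written in one step with named aggregation (available in pandas 0.25 and later); a sketch on a toy frame with the same column names:

```python
import pandas as pd

items = pd.DataFrame({
    'order_item_order_id': [1, 2, 2, 2],
    'order_item_subtotal': [299.98, 129.99, 250.00, 199.99],
})

# Keyword arguments to agg name the output columns directly,
# avoiding the separate rename(columns=...) call
summary = items.groupby('order_item_order_id')['order_item_subtotal'].agg(
    revenue='sum',
    min_subtotal='min',
    max_subtotal='max',
    item_count='count',
)
print(summary)
```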
In [ ]:
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
round(2). \
to_csv(output_dir + '/revenue_per_order.csv')
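One way to sanity-check the written CSV is to read it back with `read_csv`; a sketch with a toy frame and a temporary file (names here are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'order_id': [1, 2],
                   'revenue': [299.98, 579.98]}).set_index('order_id')

path = os.path.join(tempfile.mkdtemp(), 'revenue_per_order.csv')
df.round(2).to_csv(path)  # the index (order_id) is written as the first column

# Read it back, restoring order_id as the index
check = pd.read_csv(path, index_col='order_id')
print(check)
```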
In [ ]:
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
In [ ]:
%%sh
head /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.csv