[
Pandas Data Structures - Overview
Let us understand the details of Pandas data structures.
- Pandas is not part of the Python standard library, so we need to install it using pip: `pip install pandas`.
- It has 2 main data structures: `Series` and `DataFrame`.
  - `Series` is a one-dimensional array, while `DataFrame` is a two-dimensional array.
  - `Series` contains an index for each row and only one attribute or column.
  - `DataFrame` contains an index for each row and multiple columns.
  - Each attribute (column) in the `DataFrame` is nothing but a `Series`.
- We can perform all standard transformations using Pandas APIs.
- There are also SQL-based wrappers on top of Pandas that let us write queries.

Here are the steps to get started with Pandas Data Structures:
- Make sure the Pandas library is installed using `pip`.
- Import the Pandas library: `import pandas as pd`.
- We need to have a collection or data in a file to create Pandas Data Structures.
- Use the appropriate APIs on the data to create Pandas Data Structures:
  - `Series` for a one-dimensional array.
  - `DataFrame` for a two-dimensional array.
Note: Typically we use `Series` for a list of regular objects or a dict, and `DataFrame` for a list of tuples or a list of dicts. Here we use a list for `Series` and a list of tuples for `DataFrame`.
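The note also mentions dicts and lists of dicts as typical inputs. Here is a minimal sketch of those two cases. The names `sals_d`, `sals_from_dict`, `emp_dicts` and `emp_df`, as well as the sample values, are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data: employee id -> salary
sals_d = {1: 1500.0, 2: 2000.0, 3: 2200.0}

# With a dict, the keys become the index and the values become the data
sals_from_dict = pd.Series(sals_d, name='sal')
print(sals_from_dict)

# Hypothetical list of dicts: each dict is one row, the keys become column names
emp_dicts = [
    {'id': 1, 'sal': 1500.0},
    {'id': 2, 'sal': 2000.0},
    {'id': 3, 'sal': 2200.0},
]
emp_df = pd.DataFrame(emp_dicts)
print(emp_df)
```

With a list of dicts we do not need to pass `columns`, as the column names are picked up from the dict keys.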
In [1]:
!pip install pandas
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in /home/itversity/.local/lib/python3.8/site-packages (1.3.4)
Requirement already satisfied: pytz>=2017.3 in /home/itversity/.local/lib/python3.8/site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.17.3 in /home/itversity/.local/lib/python3.8/site-packages (from pandas) (1.22.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/itversity/.local/lib/python3.8/site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in /home/itversity/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
In [2]:
import pandas as pd
In [3]:
sals_l = [1500.0, 2000.0, 2200.00]
In [4]:
pd.Series?
Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool' = False,
    fastpath: 'bool' = False,
)
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, Iterable, dict, or scalar value
    Contains data stored in Series. If data is a dict, argument order is
    maintained.
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex (0, 1, 2, ..., n) if not provided. If data is dict-like
    and index is None, then the keys in the data are used as the index. If the
    index is not None, the resulting Series is reindexed with the index values.
dtype : str, numpy.dtype, or ExtensionDtype, optional
    Data type for the output Series. If not specified, this will be
    inferred from `data`.
    See the :ref:`user guide <basics.dtypes>` for more usages.
name : str, optional
    The name to give to the Series.
copy : bool, default False
    Copy input data. Only affects Series or 1d ndarray input. See examples.

Examples
--------
Constructing Series from a dictionary with an Index specified

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a   1
b   2
c   3
dtype: int64

The keys of the dictionary match with the Index values, hence the Index
values have no effect.

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64

Note that the Index is first build with the keys from the dictionary.
After this the Series is reindexed with the given Index values, hence we
get all NaN as a result.

Constructing Series from a list with `copy=False`.

>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a `copy` of
the original data even though `copy=False`, so
the data is unchanged.

Constructing Series from a 1d ndarray with `copy=False`.

>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999,   2])
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a `view` on
the original data, so
the data is changed as well.
File:           ~/.local/lib/python3.8/site-packages/pandas/core/series.py
Type:           type
Subclasses:     SubclassedSeries
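The docstring states that operations between Series align values on their index labels and that the result index is the sorted union of the two indexes. Here is a small sketch of that behavior. The names `base`, `bonus` and `total` and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: base salaries and bonuses keyed by employee id
base = pd.Series({1: 1500.0, 2: 2000.0, 3: 2200.0})
bonus = pd.Series({2: 100.0, 3: 150.0, 4: 50.0})

# Addition aligns on index labels, not on position.
# Labels present in only one Series (1 and 4 here) produce NaN.
total = base + bonus
print(total)
# 1       NaN
# 2    2100.0
# 3    2350.0
# 4       NaN
# dtype: float64
```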
In [5]:
sals_s = pd.Series(sals_l, name='sal')
In [6]:
sals_s
Out[6]:
0    1500.0
1    2000.0
2    2200.0
Name: sal, dtype: float64
In [7]:
sals_s[:2]
Out[7]:
0    1500.0
1    2000.0
Name: sal, dtype: float64
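The slice above works by position. Since a Series supports both integer-based and label-based indexing, here is a short sketch that makes the difference explicit using `iloc` and `loc`. The string labels `'a'`, `'b'`, `'c'` and the name `sals_labeled` are assumptions added for illustration.

```python
import pandas as pd

sals_l = [1500.0, 2000.0, 2200.00]

# Hypothetical string labels, so that positions and labels cannot be confused
sals_labeled = pd.Series(sals_l, index=['a', 'b', 'c'], name='sal')

# Position-based slicing: the end position is excluded
first_two = sals_labeled.iloc[:2]

# Label-based slicing: the end label is included
a_to_b = sals_labeled.loc['a':'b']

# Single value by label
sal_c = sals_labeled.loc['c']

print(first_two)
print(a_to_b)
print(sal_c)
```

Both slices return the rows labeled `a` and `b`, but for different reasons: `iloc[:2]` stops before position 2, while `loc['a':'b']` includes the end label.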
In [8]:
sals_ld = [(1, 1500.0), (2, 2000.0), (3, 2200.00)]
In [9]:
pd.DataFrame?
Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``. For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3
File:           ~/.local/lib/python3.8/site-packages/pandas/core/frame.py
Type:           type
Subclasses:     SubclassedDataFrame
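The "See Also" section lists `DataFrame.from_records` as a constructor for tuples. As a minimal sketch, the same list of tuples can be passed to it as well. The name `sals_rec_df` is an assumption used only for this illustration.

```python
import pandas as pd

sals_ld = [(1, 1500.0), (2, 2000.0), (3, 2200.00)]

# Each tuple becomes a row; column names are supplied explicitly
sals_rec_df = pd.DataFrame.from_records(sals_ld, columns=['id', 'sal'])
print(sals_rec_df)
print(sals_rec_df.dtypes)
```

For a plain list of tuples like this one, calling the `pd.DataFrame` constructor directly with `columns` is usually enough.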
In [10]:
sals_df = pd.DataFrame(sals_ld, columns=['id', 'sal'])
In [11]:
sals_df
Out[11]:
| | id | sal |
|---|---|---|
| 0 | 1 | 1500.0 |
| 1 | 2 | 2000.0 |
| 2 | 3 | 2200.0 |
In [12]:
sals_df['id']
Out[12]:
0    1
1    2
2    3
Name: id, dtype: int64
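The output above carries a `Name` and a `dtype`, which is how a `Series` is displayed. Let us confirm that a column of a `DataFrame` is a `Series`, and also sketch a few of the standard transformations mentioned in the overview. The 10% `bonus` column and the names `id_col`, `high_sals`, `with_bonus` and `total_sal` are hypothetical and added only for illustration.

```python
import pandas as pd

sals_ld = [(1, 1500.0), (2, 2000.0), (3, 2200.00)]
sals_df = pd.DataFrame(sals_ld, columns=['id', 'sal'])

# Selecting a single column returns a Series
id_col = sals_df['id']
print(type(id_col))

# Filter rows using a boolean condition on a column
high_sals = sals_df[sals_df['sal'] >= 2000.0]
print(high_sals)

# Derive a new column (hypothetical 10% bonus); assign returns a new
# DataFrame and leaves sals_df unchanged
with_bonus = sals_df.assign(bonus=sals_df['sal'] * 0.1)
print(with_bonus)

# Aggregate a column
total_sal = sals_df['sal'].sum()
print(total_sal)
```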
]