Getting unique elements¶
Let us perform few tasks to understand how to extract unique elements. We can use either of these approaches.
- We can create a list of elements first and then convert into a set.
- We can also build set directly while extracting the information.
In [1]:
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 /data/retail_db/orders/part-00000
In [2]:
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE 68875,2014-07-04 00:00:00.0,10637,ON_HOLD 68876,2014-07-06 00:00:00.0,4124,COMPLETE 68877,2014-07-07 00:00:00.0,9692,ON_HOLD 68878,2014-07-08 00:00:00.0,6753,COMPLETE 68879,2014-07-09 00:00:00.0,778,COMPLETE 68880,2014-07-13 00:00:00.0,1117,COMPLETE 68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT 68882,2014-07-22 00:00:00.0,10000,ON_HOLD 68883,2014-07-23 00:00:00.0,5533,COMPLETE
In [3]:
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
In [4]:
type(orders_file)
Out[4]:
_io.TextIOWrapper
In [5]:
orders_raw = orders_file.read()
In [6]:
type(orders_raw)
Out[6]:
str
In [7]:
orders_raw.splitlines?
Signature: orders_raw.splitlines(keepends=False) Docstring: Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. Type: builtin_function_or_method
In [8]:
orders = orders_raw.splitlines()
In [9]:
type(orders)
Out[9]:
list
In [10]:
orders[:10]
Out[10]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [11]:
type(orders[0])
Out[11]:
str
In [12]:
len(orders)
Out[12]:
68883
In [13]:
%%sh
wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
Task 1¶
Get all the unique dates from orders data.
In [14]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
In [15]:
order.split(',')[1]
Out[15]:
'2013-07-25 00:00:00.0'
In [16]:
order_dates = set()
In [17]:
order_dates.add('2013-07-25 00:00:00.0')
In [18]:
order_dates
Out[18]:
{'2013-07-25 00:00:00.0'}
In [19]:
order_dates.add('2013-07-26 00:00:00.0')
In [20]:
order_dates
Out[20]:
{'2013-07-25 00:00:00.0', '2013-07-26 00:00:00.0'}
In [21]:
order_dates.add('2013-07-25 00:00:00.0')
In [22]:
order_dates
Out[22]:
{'2013-07-25 00:00:00.0', '2013-07-26 00:00:00.0'}
In [23]:
order_dates = set()
for order in orders:
order_dates.add(order.split(',')[1])
In [24]:
list(order_dates)[:10]
Out[24]:
['2013-09-14 00:00:00.0', '2013-10-05 00:00:00.0', '2014-04-10 00:00:00.0', '2014-05-10 00:00:00.0', '2014-02-19 00:00:00.0', '2013-12-28 00:00:00.0', '2014-03-23 00:00:00.0', '2014-04-06 00:00:00.0', '2013-09-26 00:00:00.0', '2013-11-18 00:00:00.0']
In [25]:
len(order_dates)
Out[25]:
364
In [26]:
order_dates = {order.split(',')[1] for order in orders}
In [27]:
list(order_dates)[:10]
Out[27]:
['2013-09-14 00:00:00.0', '2013-10-05 00:00:00.0', '2014-04-10 00:00:00.0', '2014-05-10 00:00:00.0', '2014-02-19 00:00:00.0', '2013-12-28 00:00:00.0', '2014-03-23 00:00:00.0', '2014-04-06 00:00:00.0', '2013-09-26 00:00:00.0', '2013-11-18 00:00:00.0']
In [28]:
len(order_dates)
Out[28]:
364
Task 2¶
Get all the unique weekend dates from orders data.
In [29]:
order_date = '2014-01-25 00:00:00.0'
In [30]:
import datetime as dt
In [31]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f')
Out[31]:
datetime.datetime(2014, 1, 25, 0, 0)
In [32]:
d = dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f')
In [33]:
d.weekday?
Docstring: Return the day of the week represented by the date. Monday == 0 ... Sunday == 6 Type: builtin_function_or_method
In [34]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday() # Returns 0 to 6 (for Monday to Sunday)
Out[34]:
5
In [35]:
import calendar
In [36]:
list(calendar.day_name)
Out[36]:
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
In [37]:
calendar.day_name[5]
Out[37]:
'Saturday'
In [38]:
calendar.day_name[dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday()]
Out[38]:
'Saturday'
In [39]:
calendar.day_abbr[dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday()]
Out[39]:
'Sat'
In [40]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday() in (5, 6)
Out[40]:
True
In [41]:
import datetime as dt
def is_weekend(order_date):
return dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday() in (5, 6)
In [42]:
is_weekend('2014-01-25 00:00:00.0')
Out[42]:
True
In [43]:
is_weekend('2014-01-22 00:00:00.0')
Out[43]:
False
In [44]:
weekend_dates = set()
for order in orders:
order_date = order.split(',')[1]
if is_weekend(order_date):
weekend_dates.add(order_date)
In [45]:
list(weekend_dates)[:10]
Out[45]:
['2013-09-14 00:00:00.0', '2013-10-05 00:00:00.0', '2014-02-22 00:00:00.0', '2014-06-28 00:00:00.0', '2014-05-10 00:00:00.0', '2013-12-28 00:00:00.0', '2014-03-23 00:00:00.0', '2013-08-04 00:00:00.0', '2014-04-06 00:00:00.0', '2013-10-26 00:00:00.0']
In [46]:
len(weekend_dates)
Out[46]:
103