Read CSV into list of strings
Let us read CSV data into a list of strings. First, let us go through the steps involved.
- Create a file object using `open` in read mode.
- Invoke the `read` function to read the contents of the file into memory.
- CSV files typically contain multiple lines, where each line has multiple attribute values. The values of the different attributes in each line are typically separated by a delimiter.
- Lines (also known as records) are typically separated by the new line character. We can use `splitlines` on the output of the `read` function to create a list of strings.
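The steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own code: the sample data, the temporary file, and the names `sample`, `path`, `csv_file`, and `records` are assumptions made so that the example runs anywhere. It also uses a `with` block, which closes the file automatically.

```python
import os
import tempfile

# Hypothetical sample data standing in for a real CSV file
sample = "1,2,Football\n2,2,Soccer\n3,2,Baseball & Softball\n"

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Steps 1 and 2: open the file in read mode and read all of it into memory.
# The with statement closes the file when the block ends.
with open(path, "r") as csv_file:
    data = csv_file.read()

# Final step: turn the single string into a list with one string per line
records = data.splitlines()
print(records)

# Clean up the temporary file
os.remove(path)
```

With a real file, only the `open`, `read`, and `splitlines` calls are needed; the temporary file exists purely to keep the sketch runnable.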
Note: Even though CSV stands for comma-separated values, we often find the data separated by delimiters other than the comma. The tab character (\t) and the pipe character (|) are also commonly used. Such files are sometimes known as TSV and PSV.
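To illustrate the point about delimiters, here is a small sketch with made-up tab-delimited and pipe-delimited data (the strings and variable names are assumptions, not files from this course). Creating the list of strings works the same way in both cases, because records are still separated by new line characters; only the later step of splitting a record into fields depends on the delimiter.

```python
# Hypothetical tab-delimited and pipe-delimited versions of the same data
tsv_data = "1\t2\tFootball\n2\t2\tSoccer\n"
psv_data = "1|2|Football\n2|2|Soccer\n"

# Records are separated by new lines, whatever the field delimiter is
tsv_lines = tsv_data.splitlines()
psv_lines = psv_data.splitlines()

# Fields within a record are separated by the file's own delimiter
print(tsv_lines[0].split("\t"))  # ['1', '2', 'Football']
print(psv_lines[0].split("|"))   # ['1', '2', 'Football']
```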
In [1]:
!ls -ltr /data/retail_db/categories/
total 4
-rw-rw-r-- 1 itversity itversity 1029 Mar 8 02:04 part-00000
In [2]:
!cat /data/retail_db/categories/part-00000 # Viewing file contents using Linux commands
1,2,Football
2,2,Soccer
3,2,Baseball & Softball
4,2,Basketball
5,2,Lacrosse
6,2,Tennis & Racquet
7,2,Hockey
8,2,More Sports
9,3,Cardio Equipment
10,3,Strength Training
11,3,Fitness Accessories
12,3,Boxing & MMA
13,3,Electronics
14,3,Yoga & Pilates
15,3,Training by Sport
16,3,As Seen on TV!
17,4,Cleats
18,4,Men's Footwear
19,4,Women's Footwear
20,4,Kids' Footwear
21,4,Featured Shops
22,4,Accessories
23,5,Men's Apparel
24,5,Women's Apparel
25,5,Boys' Apparel
26,5,Girls' Apparel
27,5,Accessories
28,5,Top Brands
29,5,Shop By Sport
30,6,Men's Golf Clubs
31,6,Women's Golf Clubs
32,6,Golf Apparel
33,6,Golf Shoes
34,6,Golf Bags & Carts
35,6,Golf Gloves
36,6,Golf Balls
37,6,Electronics
38,6,Kids' Golf Clubs
39,6,Team Shop
40,6,Accessories
41,6,Trade-In
42,7,Bike & Skate Shop
43,7,Camping & Hiking
44,7,Hunting & Shooting
45,7,Fishing
46,7,Indoor/Outdoor Games
47,7,Boating
48,7,Water Sports
49,8,MLB
50,8,NFL
51,8,NHL
52,8,NBA
53,8,NCAA
54,8,MLS
55,8,International Soccer
56,8,World Cup Shop
57,8,MLB Players
58,8,NFL Players
In [3]:
# Create file object (open defaults to read mode, 'r')
file = open('/data/retail_db/categories/part-00000')
In [4]:
# Read the entire file contents into memory, then close the file
data = file.read()
file.close()
In [5]:
type(data)
Out[5]:
str
In [6]:
data # One big string. You can find new line characters in between
Out[6]:
"1,2,Football\n2,2,Soccer\n3,2,Baseball & Softball\n4,2,Basketball\n5,2,Lacrosse\n6,2,Tennis & Racquet\n7,2,Hockey\n8,2,More Sports\n9,3,Cardio Equipment\n10,3,Strength Training\n11,3,Fitness Accessories\n12,3,Boxing & MMA\n13,3,Electronics\n14,3,Yoga & Pilates\n15,3,Training by Sport\n16,3,As Seen on TV!\n17,4,Cleats\n18,4,Men's Footwear\n19,4,Women's Footwear\n20,4,Kids' Footwear\n21,4,Featured Shops\n22,4,Accessories\n23,5,Men's Apparel\n24,5,Women's Apparel\n25,5,Boys' Apparel\n26,5,Girls' Apparel\n27,5,Accessories\n28,5,Top Brands\n29,5,Shop By Sport\n30,6,Men's Golf Clubs\n31,6,Women's Golf Clubs\n32,6,Golf Apparel\n33,6,Golf Shoes\n34,6,Golf Bags & Carts\n35,6,Golf Gloves\n36,6,Golf Balls\n37,6,Electronics\n38,6,Kids' Golf Clubs\n39,6,Team Shop\n40,6,Accessories\n41,6,Trade-In\n42,7,Bike & Skate Shop\n43,7,Camping & Hiking\n44,7,Hunting & Shooting\n45,7,Fishing\n46,7,Indoor/Outdoor Games\n47,7,Boating\n48,7,Water Sports\n49,8,MLB\n50,8,NFL\n51,8,NHL\n52,8,NBA\n53,8,NCAA\n54,8,MLS\n55,8,International Soccer\n56,8,World Cup Shop\n57,8,MLB Players\n58,8,NFL Players\n"
In [7]:
# Create list of strings (each line in the file becomes an element in the list)
categories = data.split('\n')
In [8]:
type(categories)
Out[8]:
list
In [9]:
categories
# Data in the file /data/retail_db/categories/part-00000 is now in a list
# Each element in the list is a line in the file, followed by one empty string
# at the end because the data ends with a new line character
Out[9]:
['1,2,Football', '2,2,Soccer', '3,2,Baseball & Softball', '4,2,Basketball', '5,2,Lacrosse', '6,2,Tennis & Racquet', '7,2,Hockey', '8,2,More Sports', '9,3,Cardio Equipment', '10,3,Strength Training', '11,3,Fitness Accessories', '12,3,Boxing & MMA', '13,3,Electronics', '14,3,Yoga & Pilates', '15,3,Training by Sport', '16,3,As Seen on TV!', '17,4,Cleats', "18,4,Men's Footwear", "19,4,Women's Footwear", "20,4,Kids' Footwear", '21,4,Featured Shops', '22,4,Accessories', "23,5,Men's Apparel", "24,5,Women's Apparel", "25,5,Boys' Apparel", "26,5,Girls' Apparel", '27,5,Accessories', '28,5,Top Brands', '29,5,Shop By Sport', "30,6,Men's Golf Clubs", "31,6,Women's Golf Clubs", '32,6,Golf Apparel', '33,6,Golf Shoes', '34,6,Golf Bags & Carts', '35,6,Golf Gloves', '36,6,Golf Balls', '37,6,Electronics', "38,6,Kids' Golf Clubs", '39,6,Team Shop', '40,6,Accessories', '41,6,Trade-In', '42,7,Bike & Skate Shop', '43,7,Camping & Hiking', '44,7,Hunting & Shooting', '45,7,Fishing', '46,7,Indoor/Outdoor Games', '47,7,Boating', '48,7,Water Sports', '49,8,MLB', '50,8,NFL', '51,8,NHL', '52,8,NBA', '53,8,NCAA', '54,8,MLS', '55,8,International Soccer', '56,8,World Cup Shop', '57,8,MLB Players', '58,8,NFL Players', '']
In [10]:
!wc -l /data/retail_db/categories/part-00000
58 /data/retail_db/categories/part-00000
In [11]:
len(categories)
Out[11]:
59
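The list has 59 elements even though `wc -l` reports 58 lines. `wc -l` counts new line characters, while `split` returns one more element than the number of separators it finds, so data that ends with a new line character produces a trailing empty string. The sketch below reproduces this on a hypothetical two-line sample (the names `sample`, `pieces`, and `records` are illustrative only) and shows one way to drop the empty element.

```python
# Hypothetical two-line sample that ends with a new line character
sample = "1,2,Football\n2,2,Soccer\n"

# wc -l counts new line characters, which is what count reports here
newline_count = sample.count("\n")

# split returns one more element than the number of separators found
pieces = sample.split("\n")

# Dropping empty strings leaves only the actual records
records = [line for line in pieces if line]

print(newline_count, len(pieces), len(records))  # 2 3 2
```

Note that this filter would also drop any blank lines in the middle of the data, which may or may not be what you want.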
In [12]:
# splitlines also creates the list of strings, without the trailing empty string
categories = data.splitlines()
In [13]:
categories
Out[13]:
['1,2,Football', '2,2,Soccer', '3,2,Baseball & Softball', '4,2,Basketball', '5,2,Lacrosse', '6,2,Tennis & Racquet', '7,2,Hockey', '8,2,More Sports', '9,3,Cardio Equipment', '10,3,Strength Training', '11,3,Fitness Accessories', '12,3,Boxing & MMA', '13,3,Electronics', '14,3,Yoga & Pilates', '15,3,Training by Sport', '16,3,As Seen on TV!', '17,4,Cleats', "18,4,Men's Footwear", "19,4,Women's Footwear", "20,4,Kids' Footwear", '21,4,Featured Shops', '22,4,Accessories', "23,5,Men's Apparel", "24,5,Women's Apparel", "25,5,Boys' Apparel", "26,5,Girls' Apparel", '27,5,Accessories', '28,5,Top Brands', '29,5,Shop By Sport', "30,6,Men's Golf Clubs", "31,6,Women's Golf Clubs", '32,6,Golf Apparel', '33,6,Golf Shoes', '34,6,Golf Bags & Carts', '35,6,Golf Gloves', '36,6,Golf Balls', '37,6,Electronics', "38,6,Kids' Golf Clubs", '39,6,Team Shop', '40,6,Accessories', '41,6,Trade-In', '42,7,Bike & Skate Shop', '43,7,Camping & Hiking', '44,7,Hunting & Shooting', '45,7,Fishing', '46,7,Indoor/Outdoor Games', '47,7,Boating', '48,7,Water Sports', '49,8,MLB', '50,8,NFL', '51,8,NHL', '52,8,NBA', '53,8,NCAA', '54,8,MLS', '55,8,International Soccer', '56,8,World Cup Shop', '57,8,MLB Players', '58,8,NFL Players']
In [14]:
!wc -l /data/retail_db/categories/part-00000
58 /data/retail_db/categories/part-00000
In [15]:
len(categories)
Out[15]:
58
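Besides avoiding the trailing empty string, `splitlines` also recognizes other line boundaries such as `\r\n`. The sketch below uses a made-up string with Windows-style line endings to show the difference. This is an assumption for illustration: text read through `open` in its default text mode normally has `\r\n` translated to `\n` already, so the difference matters mainly for strings that come from other sources.

```python
# Hypothetical data with Windows-style line endings
windows_data = "1,2,Football\r\n2,2,Soccer\r\n"

# split on '\n' leaves the carriage returns and a trailing empty string
by_split = windows_data.split("\n")

# splitlines treats '\r\n' as a single line boundary
by_splitlines = windows_data.splitlines()

print(by_split)       # ['1,2,Football\r', '2,2,Soccer\r', '']
print(by_splitlines)  # ['1,2,Football', '2,2,Soccer']
```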