TIP: (5/28/2020)
- I found it tricky to load a TSV file into pandas because of mismatched quote characters in the data.
- This helped (csv.QUOTE_NONE needs import csv):
- df = pd.read_csv(mrpc_train, sep='\t', quoting=csv.QUOTE_NONE)
- https://stackoverflow.com/questions/28284912/pandas-is-it-possible-to-read-csv-with-no-quotechar
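The effect of the quoting setting can be sketched with the standard csv module, whose QUOTE_NONE constant is the one passed to pandas above. The two-row TSV text below is a made-up example, not the MRPC data, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical two-row TSV; the first row contains an unbalanced double quote.
TSV_TEXT = '1\t"unbalanced quote\tx\n2\tplain\ty\n'

# Default quoting: the stray quote opens a quoted field, so the parser
# keeps reading past tabs and newlines while looking for the closing quote.
default_rows = list(csv.reader(io.StringIO(TSV_TEXT), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text.
literal_rows = list(
    csv.reader(io.StringIO(TSV_TEXT), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(default_rows)
print(literal_rows)
```

With QUOTE_NONE each physical line stays one row and the quote is kept as part of the field, which is usually what you want for TSV data that was never quoted in the first place.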
Let's say we have a file coffee.csv:
$ cat -n ../_resources/coffee.csv
1 "Coffee","Water","Milk","Icecream"
2 "Espresso","No","No","No"
3 "Long Black","Yes","No","No"
4 "Flat White","No","Yes","No"
5 "Cappuccino","No","Yes,Frothy","No"
6 "Affogato","No","No","Yes"
7
8
9 abcd
$ wc -l ../_resources/coffee.csv
8 ../_resources/coffee.csv
Note: cat -n shows 9 lines, but wc -l reports 8. This is because the last line of this file does not end in a newline, and wc -l only counts newline characters.
Note: the file has two empty lines (not truly empty: each contains just the newline character '\n'). Also, one of the fields contains a comma. So we need to be careful.
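The difference between the two counts can be reproduced in Python. This is a small sketch; COFFEE_TEXT is just the content of coffee.csv shown above, typed in as a string so the example does not depend on the file being present.

```python
# Same content as coffee.csv above; the last line has no trailing newline.
COFFEE_TEXT = (
    '"Coffee","Water","Milk","Icecream"\n'
    '"Espresso","No","No","No"\n'
    '"Long Black","Yes","No","No"\n'
    '"Flat White","No","Yes","No"\n'
    '"Cappuccino","No","Yes,Frothy","No"\n'
    '"Affogato","No","No","Yes"\n'
    '\n'
    '\n'
    'abcd'
)

newline_count = COFFEE_TEXT.count('\n')      # what wc -l counts
line_count = len(COFFEE_TEXT.splitlines())   # what cat -n numbers

print(newline_count, line_count)
```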
So let's explore ways to read this file:
[1] f.read() :
f.read() reads the entire contents of the file and returns them as a single string.
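A minimal sketch of f.read(). To keep it self-contained, io.StringIO is used as a stand-in for an open file handle, and SAMPLE holds only the last four lines of coffee.csv.

```python
import io

# Stand-in for an open file: the last four lines of coffee.csv.
SAMPLE = '"Affogato","No","No","Yes"\n\n\nabcd'
f = io.StringIO(SAMPLE)

contents = f.read()   # the whole file as one string
rest = f.read()       # the handle is now at end of file, so this is ''

print(repr(contents))
print(repr(rest))
```

Since everything lands in memory at once, this is mostly convenient for small files.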
[2] f.readline() :
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn't end in a newline. This makes the return value unambiguous: if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
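The end-of-file rule described above can be sketched as a loop that stops on the empty string. As an assumption for the sake of a self-contained example, io.StringIO stands in for an open file and SAMPLE holds the last four lines of coffee.csv.

```python
import io

# Stand-in for an open file: the last four lines of coffee.csv.
SAMPLE = '"Affogato","No","No","Yes"\n\n\nabcd'
f = io.StringIO(SAMPLE)

lines = []
while True:
    line = f.readline()
    if line == '':        # empty string: end of file
        break
    lines.append(line)    # blank lines arrive as '\n', not ''

print(lines)
```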
[3] f.readlines()
f.readlines() reads the entire text file and returns it as a list of lines, each keeping its trailing newline if it has one.
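A short sketch of f.readlines(), again with io.StringIO standing in for an open file and SAMPLE holding the last four lines of coffee.csv (both are assumptions made to keep the example self-contained).

```python
import io

# Stand-in for an open file: the last four lines of coffee.csv.
SAMPLE = '"Affogato","No","No","Yes"\n\n\nabcd'
f = io.StringIO(SAMPLE)

lines = f.readlines()                              # list of lines
stripped = [line.rstrip('\n') for line in lines]   # drop trailing newlines

print(lines)
print(stripped)
```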
[4] Looping over the file object.
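A file object can be iterated directly, yielding one line at a time without loading the whole file into memory. The sketch below skips the blank lines while keeping track of line numbers; io.StringIO and SAMPLE (the last four lines of coffee.csv) are stand-ins for a real open file.

```python
import io

# Stand-in for an open file: the last four lines of coffee.csv.
SAMPLE = '"Affogato","No","No","Yes"\n\n\nabcd'
f = io.StringIO(SAMPLE)

non_blank = []
for number, line in enumerate(f, start=1):
    text = line.rstrip('\n')
    if text:                          # skip lines that were only '\n'
        non_blank.append((number, text))

print(non_blank)
```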
[5] Using the csv library.
This is especially important since CSV files can have commas inside quoted fields (e.g. "Yes,Frothy" above).
In this case we first create a csv reader from the file handle, then iterate over the csv reader. Each iteration yields one row as a list of strings.
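# CSV code (Python 3)
import csv

filename = '../_resources/coffee.csv'

# newline='' lets the csv module handle line endings itself
with open(filename, newline='') as f:
    # read the header and the first data row as plain text
    header = f.readline().strip()
    print(header)
    line1 = next(f).strip()
    print("line1 : {0}".format(line1))
    # create the csv reader; it continues from the current file position
    csvreader = csv.reader(f)
    line2 = next(csvreader)
    print("line2 : {0}".format(line2))
    # iterate over the csvreader
    for line in csvreader:
        print(line)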
Output:
"Coffee","Water","Milk","Icecream"
line1 : "Espresso","No","No","No"
line2 : ['Long Black', 'Yes', 'No', 'No']
['Flat White', 'No', 'Yes', 'No']
['Cappuccino', 'No', 'Yes,Frothy', 'No']
['Affogato', 'No', 'No', 'Yes']
[]
[]
['abcd']
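To see why the csv reader matters for the "Yes,Frothy" row, here is a small sketch comparing it with a naive split on commas. It also filters out the empty lists that the reader returns for blank lines. SAMPLE is a shortened stand-in for coffee.csv and io.StringIO replaces the open file.

```python
import csv
import io

# Shortened stand-in for coffee.csv: one tricky row, two blank lines, 'abcd'.
SAMPLE = '"Cappuccino","No","Yes,Frothy","No"\n\n\nabcd'

# Naive approach: splits inside the quoted field and keeps the quotes.
naive = SAMPLE.splitlines()[0].split(',')

# csv.reader respects the quotes; blank lines come back as [] and are dropped.
rows = [row for row in csv.reader(io.StringIO(SAMPLE)) if row]

print(naive)
print(rows)
```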
[6] Pandas.
(Note how pandas drops the empty lines and fills the missing columns of the abcd row with NaN.)
import pandas as pd
df = pd.read_csv(filename)
df
Out[114]:
       Coffee  Water        Milk  Icecream
0    Espresso     No          No        No
1  Long Black    Yes          No        No
2  Flat White     No         Yes        No
3  Cappuccino     No  Yes,Frothy        No
4    Affogato     No          No       Yes
5        abcd    NaN         NaN       NaN
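The dropping of empty lines is controlled by the skip_blank_lines parameter of pd.read_csv, which defaults to True. The sketch below assumes pandas is installed and uses a shortened, made-up stand-in for coffee.csv (two columns only) read from io.StringIO instead of a file.

```python
import io

import pandas as pd

# Shortened stand-in for coffee.csv: header, one row, two blank lines, 'abcd'.
SAMPLE = '"Coffee","Water"\n"Espresso","No"\n\n\nabcd'

# Default behaviour: blank lines are skipped.
df_default = pd.read_csv(io.StringIO(SAMPLE))

# Keep blank lines: each one becomes a row of NaN values.
df_keep = pd.read_csv(io.StringIO(SAMPLE), skip_blank_lines=False)

print(df_default)
print(df_keep)
```

If the all-NaN rows are kept this way, df_keep.dropna(how='all') removes them again.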
Code: