Dealing with Header Rows

Datasets containing header rows sometimes require  pre-processing,  including scenarios like  deleting header lines, inserting header lines, listing down header line as numbered rows to visualize etc.

Listing down some of scenarios I have encountered.

 

[1]  Sed (d),  to delete single lines and write to new file.

Deleting the header row using Linux commands. Create a new *_preprocessed file. Simplest option is I don’t want to mess with the original file.

e.g.  $ sed 1d Trial.txt > Trial_preprocessed.txt

 

[2 ]  Sed (d), to delete lines in-place.

We can also use sed in-place without a temp file. This won’t echo anything, it will just modify the file in-place. e.g.  $ sed -i -e 1,3d yourfile.
$ cat Trial.txt
Date LanguageTag App AcceptanceRate
8/10/2015 en-US Excel 0.96
8/11/2015 en-US Excel 1.00
8/12/2015 en-US Excel 0.94
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -e 1,4d Trial.txt   –> This echoes the file contents without the top 4 lines.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ cat Trial.txt   –> But the file is still intact.
Date LanguageTag App AcceptanceRate
8/10/2015 en-US Excel 0.96
8/11/2015 en-US Excel 1.00
8/12/2015 en-US Excel 0.94
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -i -e 1,4d Trial.txt –> This time this is in-place. There is no echo.

$ cat Trial.txt –>Verify.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

 

[3]  Sed (i),  to insert line within the file. Can be used to insert header lines if needed.

$ cat Trial.txt  –> Note file has no header.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -i ‘1i date\tlang\tapp\tscore’ Trial.txt

$ cat Trial.txt
date lang app score  –> Note how we have inserted a header now!
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

 

[4]  Sed (p), to print out header lines.

$ sed -n ‘1p’ Trial.txt | tr “\t” ‘\n’ | cat -n  –> Note how I have used a combination of sed (p), tr and cat  commands to print header columns as a numbered list.  Can also use  head command if getting only top rows.
1 date
2 lang
3 app
4 score

>> Of course this can be done if i use “head -n 2” instead of sed

[5] Pandas.

  • If the file contains a header row, by default pandas handles header rows. However, as best practice, it is recommended we use header=0 when you know row 0 is the header row

 

  • If the file does not contain a header row then we have to specify the header=None  parameter.  In these cases, its also good to specify the names parameter.  

e.g.

$ cat Trial.txt –>Note file noes not contain header row.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

df = pd.read_csv(r’E:\datasets\12Dec\Trial.txt’, header=None, sep = ‘\t’, names=[‘Date’, ‘LangCountry’, ‘Type’, ‘Score’])

print df
Out[321]:
Date LangCountry Type Score
0 8/10/2015 en-US Word 0.68
1 8/11/2015 en-US Word 0.88
2 8/12/2015 en-US Word 0.93

[6]  For dealing with concatenating files with header and stuff, look for the tips available in the Tail and Cat  posts

  • Super Useful

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s