Dealing with Header Rows

Datasets containing header rows sometimes require  pre-processing,  including scenarios like  deleting header lines, inserting header lines, listing down header line as numbered rows to visualize etc.

Listing down some of scenarios I have encountered.

 

[1]  Sed (d),  to delete single lines and write to new file.

Deleting the header row using Linux commands. Create a new *_preprocessed file. Simplest option is I don’t want to mess with the original file.

e.g.  $ sed 1d Trial.txt > Trial_preprocessed.txt

 

[2 ]  Sed (d), to delete lines in-place.

We can also use sed in-place without a temp file. This won’t echo anything, it will just modify the file in-place. e.g.  $ sed -i -e 1,3d yourfile.
$ cat Trial.txt
Date LanguageTag App AcceptanceRate
8/10/2015 en-US Excel 0.96
8/11/2015 en-US Excel 1.00
8/12/2015 en-US Excel 0.94
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -e 1,4d Trial.txt   –> This echoes the file contents without the top 4 lines.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ cat Trial.txt   –> But the file is still intact.
Date LanguageTag App AcceptanceRate
8/10/2015 en-US Excel 0.96
8/11/2015 en-US Excel 1.00
8/12/2015 en-US Excel 0.94
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -i -e 1,4d Trial.txt –> This time this is in-place. There is no echo.

$ cat Trial.txt –>Verify.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

 

[3]  Sed (i),  to insert line within the file. Can be used to insert header lines if needed.

$ cat Trial.txt  –> Note file has no header.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

$ sed -i ‘1i date\tlang\tapp\tscore’ Trial.txt

$ cat Trial.txt
date lang app score  –> Note how we have inserted a header now!
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

 

[4]  Sed (p), to print out header lines.

$ sed -n ‘1p’ Trial.txt | tr “\t” ‘\n’ | cat -n  –> Note how I have used a combination of sed (p), tr and cat  commands to print header columns as a numbered list.  Can also use  head command if getting only top rows.
1 date
2 lang
3 app
4 score

>> Of course this can be done if i use “head -n 2” instead of sed

[5] Pandas.

  • If the file contains a header row, by default pandas handles header rows. However, as best practice, it is recommended we use header=0 when you know row 0 is the header row

 

  • If the file does not contain a header row then we have to specify the header=None  parameter.  In these cases, its also good to specify the names parameter.  

e.g.

$ cat Trial.txt –>Note file noes not contain header row.
8/10/2015 en-US Word 0.68
8/11/2015 en-US Word 0.88
8/12/2015 en-US Word 0.93

df = pd.read_csv(r’E:\datasets\12Dec\Trial.txt’, header=None, sep = ‘\t’, names=[‘Date’, ‘LangCountry’, ‘Type’, ‘Score’])

print df
Out[321]:
Date LangCountry Type Score
0 8/10/2015 en-US Word 0.68
1 8/11/2015 en-US Word 0.88
2 8/12/2015 en-US Word 0.93

[6]  For dealing with concatenating files with header and stuff, look for the tips available in the Tail and Cat  posts

  • Super Useful

 

Unstacking in Python

This post is a follow up from a previous post. The Python solution is presented here.

Tip: Pandas MultiIndex.

Question #1: Dataset.

Code:

# -*- coding: utf-8 -*-
“””
Created on Wed Dec 02 12:11:14 2015

@author: agoswami
“””

import pandas as pd

df = pd.read_csv(r’E:\datasets\12Dec\ReshapingData1_Q1.txt’, sep = ‘\t’)

byidyear = df.groupby([‘ID’, ‘Year’])
reshaped = byidyear[‘Value’].mean().unstack()

reshaped.to_csv(r’E:\datasets\12Dec\ReshapedData1_Q1_Python.txt’, sep=’\t’)

 

Question #2:  Dataset.

Code:

# -*- coding: utf-8 -*-
“””
Created on Wed Dec 02 12:35:47 2015

@author: agoswami
“””

import pandas as pd

df = pd.read_csv(r’E:\datasets\12Dec\ReshapingData1_Q2.txt’, sep = ‘\t’)

bydatelanguageapp = df.groupby([‘Date’, ‘LanguageTag’, ‘App’])
df_groupedmean = bydatelanguageapp[‘AcceptanceRate’].mean()

#Unstack the ‘App’ column
reshaped = df_groupedmean.unstack()

reshaped.to_csv(r’E:\datasets\12Dec\ReshapedData1_Q2_Python.txt’, sep=’\t’)

Fun problem : Data Transpose

Common patterns for data reshaping / transformation.

Tip(s):

[1] Python Solution.

[2] SQL Solution.

Question #1: Transform Table A into Table B 

Table A

ID Year Value
2521 2013 5
2521 2014 6
2521 2015 5
2523 2014 15
2523 2015 12
2525 2014 2
2525 2015 3
2527 2015 3

Table B

ID 2013 2014 2015
2521 5 6 5
2523 NULL 15 12
2525 NULL 2 3
2527 NULL NULL 3

 

 

Question #2: Transform Table A into Table B 

Table A

Date

LanguageTag

App

AcceptanceRate

8/10/2015

en-US

Excel

0.96

8/11/2015

en-US

Excel

1.00

8/12/2015

en-US

Excel

0.94

8/10/2015

en-US

Word

0.68

8/11/2015

en-US

Word

0.88

8/12/2015

en-US

Word

0.93

Table B

Date

LanguageTag

Excel

Word

8/10/2015

en-US

0.96

0.68

8/11/2015

en-US

1.00

0.88

8/12/2015

en-US

0.94

0.93