Cat Command (Linux)

Useful References:

[1]  https://en.wikipedia.org/wiki/Cat_(Unix)

 

Tips:  Using the cat command in Linux.

[1] With no FILE, or when FILE is -, read standard input.

 

[2] Command line options I have found useful:

  • n (GNU only: –number), number all output lines

 

[3] To concatenate 2 text files and write them to a new file, you can use the following command:

$ cat file1.txt file2.txt > newcombinedfile.txt

With option -n, cat can also number lines as follows:

$ cat -n file1.txt file2.txt > newnumberedfile.txt

[4] To concatenate 2 files, while removing header from 2nd file this following cat command came to my rescue :

$ cat userCorpus_CMU_Featurized.csv <(tail -n +2 userCorpus_OKC_Featurized.csv) > output2.csv

  • Pro tip : For some weird reason, it would complain if there was a space between the < and (.  Not sure why.

Fun problem : Doing transpose & join in Python

An interesting problem I came across recently.

Tip(s):

[1] : http://chrisalbon.com/python/pandas_join_merge_dataframe.html

Input

User Level Features (100M rows)

UserId

FeatureName

FeatureValue

a

X

1

a

Y

2

a

Z

3

b

Y

4

b

Z

5

Doc Level Features (100M rows)

DocId

FeatureName

FeatureValue

1

A

1

1

B

0

1

C

1

2

A

0

2

B

1

Labeled Data (10M rows)

UserId

DocId

Label

a

1

1

a

2

0

a

3

1

b

1

0

b

2

1

Expected Output:

User Level Features

Document Level Features

UserId

DocId

Label

X

Y

Z

A

B

C

a

1

1

1

2

3

1

0

1

a

2

0

1

2

3

0

1

a

3

1

1

2

3

b

1

0

4

5

1

0

1

b

2

1

4

5

0

1

 

Python Solution:

import pandas as pd

df_userfeatures = pd.read_csv(r’E:\datasets\12Dec\UserFeatures.txt’, sep=’\t’)
df_docfeatures = pd.read_csv(r’E:\datasets\12Dec\DocFeatures.txt’, sep=’\t’)
df_labeleddata = pd.read_csv(r’E:\datasets\12Dec\LabeledData.txt’, sep=’\t’)

byuseridfeatures = df_userfeatures.groupby([‘UserId’, ‘FeatureName’])
bydocidfeatures = df_docfeatures.groupby([‘DocId’, ‘FeatureName’])

df_byuserid_transpose = byuseridfeatures[‘FeatureValue’].mean().unstack().reset_index()
df_bydocid_transpose = bydocidfeatures[‘FeatureValue’].mean().unstack().reset_index()

df_joinintermediate = pd.merge(df_labeleddata, df_byuserid_transpose, left_on=’UserId’, right_on=’UserId’, how=’left’)
df_joined = pd.merge(df_joinintermediate, df_bydocid_transpose, left_on=’DocId’, right_on=’DocId’, how=’left’)

print df_joined
df_joined.to_csv(r’E:\datasets\12Dec\TransposeJoin_Python.txt’, index=False, sep=’\t’)

Output:

transposeJoin.JPG