An interesting problem I came across recently.
Tip(s):
[1] : http://chrisalbon.com/python/pandas_join_merge_dataframe.html
Input
User Level Features (100M rows)
UserId |
FeatureName |
FeatureValue |
a |
X |
1 |
a |
Y |
2 |
a |
Z |
3 |
b |
Y |
4 |
b |
Z |
5 |
Doc Level Features (100M rows)
DocId |
FeatureName |
FeatureValue |
1 |
A |
1 |
1 |
B |
0 |
1 |
C |
1 |
2 |
A |
0 |
2 |
B |
1 |
Labeled Data (10M rows)
UserId |
DocId |
Label |
a |
1 |
1 |
a |
2 |
0 |
a |
3 |
1 |
b |
1 |
0 |
b |
2 |
1 |
Expected Output:
User Level Features |
Document Level Features |
|||||||
UserId |
DocId |
Label |
X |
Y |
Z |
A |
B |
C |
a |
1 |
1 |
1 |
2 |
3 |
1 |
0 |
1 |
a |
2 |
0 |
1 |
2 |
3 |
0 |
1 |
|
a |
3 |
1 |
1 |
2 |
3 |
|||
b |
1 |
0 |
4 |
5 |
1 |
0 |
1 |
|
b |
2 |
1 |
4 |
5 |
0 |
1 |
Python Solution:
import pandas as pd
df_userfeatures = pd.read_csv(r’E:\datasets\12Dec\UserFeatures.txt’, sep=’\t’)
df_docfeatures = pd.read_csv(r’E:\datasets\12Dec\DocFeatures.txt’, sep=’\t’)
df_labeleddata = pd.read_csv(r’E:\datasets\12Dec\LabeledData.txt’, sep=’\t’)
byuseridfeatures = df_userfeatures.groupby([‘UserId’, ‘FeatureName’])
bydocidfeatures = df_docfeatures.groupby([‘DocId’, ‘FeatureName’])
df_byuserid_transpose = byuseridfeatures[‘FeatureValue’].mean().unstack().reset_index()
df_bydocid_transpose = bydocidfeatures[‘FeatureValue’].mean().unstack().reset_index()
df_joinintermediate = pd.merge(df_labeleddata, df_byuserid_transpose, left_on=’UserId’, right_on=’UserId’, how=’left’)
df_joined = pd.merge(df_joinintermediate, df_bydocid_transpose, left_on=’DocId’, right_on=’DocId’, how=’left’)
print df_joined
df_joined.to_csv(r’E:\datasets\12Dec\TransposeJoin_Python.txt’, index=False, sep=’\t’)
Output: