Applying Operations Over Pandas Dataframes

There are three keywords to consider when thinking about applying operations over pandas dataframes (see the references below):

  • map: element-wise on a Series
  • apply: applied along rows or columns of a DataFrame (or element-wise on a Series)
  • applymap: element-wise on every cell of a DataFrame
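
A minimal sketch of the difference between the three (the dataframe and functions here are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

    # map: element-wise over a single Series (column)
    print(df['a'].map(lambda x: x * 2))      # 2, 4, 6

    # apply: over whole rows or columns of a DataFrame
    print(df.apply(sum, axis=0))             # column sums: a=6, b=60

    # applymap: element-wise over every cell of a DataFrame
    print(df.applymap(lambda x: x + 1))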

 

References:

  1. http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
  2. http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
  3. http://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map

 

Using Groupby in Pandas

Tips:

  1. upon doing a groupby, we get either a SeriesGroupBy object or a DataFrameGroupBy object.
    • “This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df[‘key1’]. The idea is that this object has all of the information needed to then apply some operation to each of the groups.” – Python for Data Analysis
  2. using aggregate functions on the grouped object.
    • some common aggregations are provided by default as instance methods on the GroupBy object
      • .sum()
      • .mean()
      • .size()
        • size has slightly different output than the others: it returns a Series of group sizes (counting all rows, including NaNs), whereas count() counts only the non-null values in each column
        • there are some examples which show using count(), but I had trouble using count()
    • applying multiple functions / applying different functions to different columns
      • look up the corresponding section in reference [1]; see also the sketch in the Code section below
  3. column selection in groupby.
    • In [37]: grouped = df.groupby(['A'])
      
      In [38]: grouped_C = grouped['C']
      
      In [39]: grouped_D = grouped['D']
      

      This is mainly syntactic sugar for the alternative and much more verbose:

      In [40]: df['C'].groupby(df['A'])
      Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x129fce310>
  4. as_index=False
    • do note that using as_index=False still returns a GroupBy object; it only changes the aggregated output, where the group keys become regular columns instead of the index (see the Code section below)
  5. reset_index
    • there are some oddities when using groupby (reference [3]). In those cases, using reset_index() is useful for flattening the result back into a regular dataframe
  6. using unstack()
    • the typical use of unstack() is to remove the effects of hierarchical indexing, by pivoting a level of the index into columns
    • see reference [2] for a nice example
  7. iterate operations over groups

    # Group the dataframe by regiment, and for each regiment,
    for name, group in df.groupby('regiment'):
        # print the name of the regiment
        print(name)
        # print the data of that regiment
        print(group)

  8. applying multiple functions at once
    • see the section in reference [1] on applying multiple functions, e.g. passing a list of functions to .agg() (a sketch is in the Code section below)

References:

  1. http://pandas.pydata.org/pandas-docs/stable/groupby.html
  2. http://chrisalbon.com/python/pandas_apply_operations_to_groups.html
  3. http://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
  4. http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

 

Code:
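
A minimal sketch pulling together the groupby tips above (the dataframe, columns, and values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'B': ['x', 'x', 'y', 'y'],
                       'C': [1, 2, 3, 4],
                       'D': [10, 20, 30, 40]})

    grouped = df.groupby('A')                  # DataFrameGroupBy object
    print(grouped['C'].sum())                  # aggregate a single column
    print(grouped.size())                      # group sizes, as a Series

    # applying multiple functions at once, via a list passed to .agg()
    print(grouped['C'].agg(['sum', 'mean']))

    # as_index=False keeps the group keys as regular columns in the output
    print(df.groupby('A', as_index=False)['C'].sum())

    # reset_index() flattens a grouped result back into a regular dataframe
    print(grouped['C'].agg(['sum', 'mean']).reset_index())

    # unstack() pivots a level of a hierarchical index into columns
    print(df.groupby(['A', 'B'])['C'].sum().unstack())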

Classification: Multi-Class, Label-Dependent

It's interesting to note the different flavors of multi-class classification.

  1. Basic multiclass classification.
    • Here we have a fixed number of labels (K) and want to drop inputs into one of those K buckets.
  2. Basic multiclass classification with weighted examples.
    • An extension of basic multiclass, where some examples carry more weight than others.
  3. Cost-sensitive multiclass.
    • Here, instead of just having one correct label (and all others equally incorrect), there can be a different cost for each of the K labels.
  4. Label-dependent features.
    • This is for the case where we know that we can put in additional features that depend on the label.
    • This is the flavor used in the 'action-dependent features' mode of VW. (A sketch of the first three flavors follows this list.)
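
A tiny numpy sketch of the first three flavors (the scores, weights, and costs are made up); the key point is that cost-sensitive prediction is an argmin over per-label costs rather than an argmax over scores:

    import numpy as np

    K = 3                                   # number of labels
    scores = np.array([0.2, 0.5, 0.3])      # model scores for one example

    # 1. basic multiclass: pick the highest-scoring label
    pred = scores.argmax()                  # label 1

    # 2. weighted examples: the weight scales this example's loss in training
    weight, true_label = 2.0, 2
    loss = weight * (pred != true_label)    # weighted 0/1 loss

    # 3. cost-sensitive multiclass: each of the K labels has its own cost,
    #    so the best prediction minimizes cost instead of maximizing score
    costs = np.array([0.4, 0.0, 1.5])
    pred_cs = costs.argmin()                # label 1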

 

Classification: Cost-Sensitive

In regular classification the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification, where the costs caused by different kinds of errors are not assumed to be equal, and the objective is to minimize the expected cost (see reference [1]).

Cost-sensitive classification broadly falls into two categories:

  1. Class-dependent costs
  2. Example-dependent misclassification costs

[Figure: cost-dependent classification]
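
A small numpy sketch of the distinction (the cost values are made up): class-dependent costs come from one fixed K x K matrix indexed by (true class, predicted class), while example-dependent costs attach a separate cost vector to each individual example:

    import numpy as np

    # class-dependent: a single cost matrix, cost[true, predicted]
    cost_matrix = np.array([[0.0, 1.0],    # true class 0: a false positive costs 1
                            [5.0, 0.0]])   # true class 1: a false negative costs 5
    print(cost_matrix[1, 0])               # cost of predicting 0 when truth is 1

    # example-dependent: every example carries its own cost vector
    example_costs = np.array([[0.0, 1.0],
                              [8.0, 0.0],  # misclassifying this example is costlier
                              [2.0, 0.0]])
    predictions = np.array([1, 0, 1])
    print(example_costs[np.arange(3), predictions])  # realized per-example costs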

 

References:

  1. https://mlr-org.github.io/mlr-tutorial/release/html/cost_sensitive_classif/index.html

 

Bandit Algorithms

I have been trying to understand contextual bandit (CB) algorithms. I am using VW, where CB is implemented natively.

Here are some insights about bandit algorithms.

Tips:

  • In VW, contextual bandit learning algorithms consist of two broad classes.
    • the first class consists of settings where the maximum number of actions is known ahead of time, and the semantics of these actions stay fixed across examples.
    • a more advanced setting allows potentially changing semantics per example. In this latter setting, the actions are specified via features, with different features associated with each action; this is referred to as the ADF setting, for action-dependent features. A sketch of both settings is shown below.
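
A sketch of what the two settings look like in VW, to the best of my understanding of the VW documentation (file names and feature names are made up). In the first setting the label format is action:cost:probability; in the ADF setting each action gets its own feature line, with a blank line separating examples:

    # fixed action set: e.g. 4 actions, known ahead of time
    #   1:2:0.4 | user_age=25 time_of_day=morning
    vw --cb 4 -d train_cb.dat -f cb.model

    # ADF setting: actions described by their own features
    #   shared | user_age=25 time_of_day=morning
    #   0:1.0:0.5 | article_sports
    #   | article_politics
    vw --cb_adf -d train_cb_adf.dat -f cb_adf.model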

 

Data Wrangling Using Pandas

As part of a data wrangling exercise, this is what I had to do recently:

  1. Crack open a 2.7 GB file. The file has rows and columns.
  2. Filter this file to extract the rows satisfying some conditions.
    • The conditions were imposed on a couple of columns, requiring specific values.
  3. Write out the result to a new file.

Tips / Insights:

  • Approach 1: read the file in line by line, applying the filters as you go.
    • A sketch of this is shown in the Code section below.
  • Approach 2: with pandas it's two lines of code.
    • Go pandas!!

Code:
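
A minimal sketch of both approaches in Python (the file name, delimiter, column names, and filter values are all hypothetical):

    import pandas as pd

    # approach 1: stream the file line by line, filtering as we go
    with open('big_file.csv') as fin, open('filtered.csv', 'w') as fout:
        for line in fin:
            cols = line.rstrip('\n').split(',')
            # keep rows where two specific columns have specific values
            if cols[2] == 'some_value' and cols[5] == 'other_value':
                fout.write(line)

    # approach 2: with pandas it is essentially two lines
    # (hypothetical column names)
    df = pd.read_csv('big_file.csv')
    df[(df['col2'] == 'some_value') & (df['col5'] == 'other_value')].to_csv('filtered.csv', index=False)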

 

Vowpal Wabbit: Example Commands

Vowpal Wabbit commands can be pretty cryptic.

As I play around with it, I am listing example commands below. Hopefully the flags will become clearer as I play with this tool more.

rcv1 dataset:

  • [0.0666918] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic --sgd -l 1.5 -f rcv1_raw_sgd_model -b 22 --binary
  • [0.0579419] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic -f rcv1_raw_model -b 22 -l 1 --binary
  • [0.0462865] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic -f rcv1_raw_n2skip4_model -b 22 --ngram 2 --skips 4 --binary -l 1
  • [0.0455684] ../vw.exe -d rcv1.train.raw.txt.gz -c -f rcv1_raw_sqloss_n2skip4_model -b 22 --ngram 2 --skips 4 --binary -l 0.25

 

titanic dataset:

  • vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24
  • vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt