Applying Operations Over Pandas Dataframes

There are three methods to consider when applying operations over pandas dataframes; a sketch of the differences follows the list.

  • map
  • apply
  • applymap
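
A minimal sketch of the difference between the three (the dataframe here is made up):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

    df['a'].map(lambda x: x * 2)       # map: element-wise over a single Series
    df.apply(sum, axis=0)              # apply: along an axis of the DataFrame
    df.applymap(lambda x: x * 2)       # applymap: element-wise over the whole DataFrame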

 

References:

  1. http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
  2. http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
  3. http://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map

 

Using Groupby in Pandas

Tips:

  1. Upon doing a groupby, we get either a SeriesGroupBy object or a DataFrameGroupBy object.
    • “This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df[‘key1’]. The idea is that this object has all of the information needed to then apply some operation to each of the groups.” – Python for Data Analysis
  2. Using aggregate functions on the grouped object.
    • Some common aggregations are provided by default as instance methods on the GroupBy object:
      • .sum()
      • .mean()
      • .size()
        • size() has a slightly different output than the others: it returns the per-group row counts (including NaNs) as a Series, whereas count() counts the non-null entries per column. Some examples show count(), but I had trouble using count().
    • Applying multiple functions / applying different functions to different columns.
      • Look up the corresponding section in reference [1]; a sketch of the basic aggregations follows this item.
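
    A minimal sketch of these aggregations (the dataframe and column names are made up):

    import pandas as pd

    df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1, 2, 3]})
    grouped = df.groupby('key1')

    grouped.sum()    # per-group sums of data1
    grouped.mean()   # per-group means
    grouped.size()   # rows per group, returned as a Series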
  3. Column selection in groupby.
    • In [37]: grouped = df.groupby(['A'])
      
      In [38]: grouped_C = grouped['C']
      
      In [39]: grouped_D = grouped['D']
      

      This is mainly syntactic sugar for the alternative and much more verbose:

      In [40]: df['C'].groupby(df['A'])
      Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x129fce310>
  4. as_index=False
    • Do note that using as_index=False still returns a groupby object; the flag only changes whether the group keys end up in the index of the aggregated result.
  5. reset_index
    • There are some oddities when using groupby (reference [3]); in those cases, reset_index is useful, as in the sketch below.
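
    For instance (on a made-up dataframe), an aggregation leaves the group keys in the index, and reset_index turns them back into ordinary columns:

    import pandas as pd

    df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1, 2, 3]})
    result = df.groupby('key1').sum()   # 'key1' ends up in the index
    flat = result.reset_index()         # 'key1' becomes an ordinary column again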
  6. Using unstack()
    • The typical use of unstack is to remove the effects of hierarchical indexing.
    • See reference [2] for a nice example, and the sketch below.
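
    A sketch of the typical pattern (toy data): grouping by two keys produces a hierarchical index, and unstack pivots the inner level into columns:

    import pandas as pd

    df = pd.DataFrame({'k1': ['a', 'a', 'b', 'b'],
                       'k2': ['x', 'y', 'x', 'y'],
                       'val': [1, 2, 3, 4]})
    s = df.groupby(['k1', 'k2'])['val'].sum()   # Series with a hierarchical index
    s.unstack()                                 # inner level ('k2') becomes columns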
  7. Iterating operations over groups.

    # Group the dataframe by regiment, and for each regiment,
    for name, group in df.groupby('regiment'):
        # print the name of the regiment
        print(name)
        # print the data of that regiment
        print(group)

  8. Applying multiple functions at once.
    • Look up the section in reference [1] on applying multiple functions; a sketch follows.
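
    A sketch of both variants (toy data again):

    import pandas as pd

    df = pd.DataFrame({'k1': ['a', 'a', 'b'],
                       'k2': ['x', 'y', 'x'],
                       'val': [1, 2, 3]})
    df.groupby('k1')['val'].agg(['sum', 'mean'])          # several functions on one column
    df.groupby('k1').agg({'val': 'sum', 'k2': 'count'})   # different functions per column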

References:

  1. http://pandas.pydata.org/pandas-docs/stable/groupby.html
  2. http://chrisalbon.com/python/pandas_apply_operations_to_groups.html
  3. http://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
  4. http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

 


Classification: Multi-Class, Label-Dependent

It's interesting to note down the different flavors of multi-class classification.

  1. Basic multiclass classification.
    • Here we have a fixed number of labels (K) and want to drop inputs into one of those K buckets.
  2. Basic multiclass classification with weighted examples.
    • Extension of the basic multi-class, where some examples have more weight than others
  3. Cost-sensitive multiclass.
    • Here, instead of just having one correct label (and all others incorrect), you can have different costs for each of the K different labels.
  4. Label-dependent features.
    • This is for the case where we know we can add features that depend on the label.
    • This is the flavor used in the ‘action-dependent features’ mode of VW; a rough input-format example follows.
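
For a concrete feel, here is roughly what the cost-sensitive and label-dependent inputs look like in VW's data format, as I understand it from the VW wiki (feature names are made up). In the plain cost-sensitive setting, each of the K labels appears with its cost:

    1:0.0 2:1.0 3:2.0 | height:1.5 weight:2.0

In the label-dependent (ADF) setting, each action instead gets its own line of features, and a blank line ends the example:

    1:0.0 | features_of_action_1
    2:1.0 | features_of_action_2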

 


 

Classification: Cost-Sensitive

In regular classification, the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification, where the costs caused by different kinds of errors are not assumed to be equal, and the objective is to minimize the expected costs.

Cost-sensitive classification broadly falls into two categories:

  1. Class-dependent costs
  2. Example-dependent misclassification costs

[Figure: cost-dependent classification]
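
A minimal sketch of the class-dependent case, assuming class-membership probabilities are already available from some model (the cost matrix here is made up):

    import numpy as np

    # cost[i][j] = cost of predicting class j when the true class is i;
    # the diagonal is zero, but off-diagonal errors are not equally severe
    cost = np.array([[0.0,  1.0, 5.0],
                     [1.0,  0.0, 1.0],
                     [10.0, 1.0, 0.0]])

    probs = np.array([0.2, 0.5, 0.3])     # P(true class = i | x)

    expected_costs = probs.dot(cost)      # expected cost of each candidate prediction
    prediction = expected_costs.argmin()  # choose the class minimizing expected cost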

 

References:

  1. https://mlr-org.github.io/mlr-tutorial/release/html/cost_sensitive_classif/index.html

 

Bandit Algorithms

I have been trying to understand contextual bandit (CB) algorithms. I am using VW, where CB is implemented natively.

Here are some insights about the bandit algorithms.

Tips:

  • In VW, contextual bandit learning algorithms consist of two broad classes.
    • The first class covers settings where the maximum number of actions is known ahead of time and the semantics of these actions stay fixed across examples.
    • A more advanced setting allows potentially changing semantics per example. In this latter setting, the actions are specified via features, with different features associated with each action. This is referred to as the ADF setting, for action-dependent features. Example commands for both settings are below.
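
Example commands for the two settings (file names are made up; the flags are from the VW wiki):

    # fixed set of 4 actions, known ahead of time;
    # each input line looks like  action:cost:probability | features
    vw -d cb_train.dat --cb 4

    # action-dependent features (ADF): each action is described by its own features
    vw -d cb_adf_train.dat --cb_adf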


 

Data Wrangling Using Pandas

As part of a data wrangling exercise, this is what I had to do recently:

  1. Crack open a 2.7 GB file of rows and columns.
  2. Filter this file to extract the rows satisfying some conditions.
    • The conditions were imposed on a couple of columns having specific values.
  3. Write out the result to a new file.

Tips / Insights:

  • Approach 1 : The file can be read in line by line, with the filters applied as it streams.
    • The Python version is sketched in the code below.
  • Approach 2 : With pandas, it's a two-line job.
    • Go pandas!!

Code:
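
A minimal sketch of both approaches (file names, column positions/names, and filter values are stand-ins):

    import pandas as pd

    # Approach 1: stream the file line by line, applying the filters as we go
    with open('big_input.csv') as fin, open('filtered.csv', 'w') as fout:
        for line in fin:
            fields = line.rstrip('\n').split(',')
            if fields[2] == 'some_value' and fields[5] == 'other_value':
                fout.write(line)

    # Approach 2: the two-line pandas version
    df = pd.read_csv('big_input.csv')
    df[(df['col_a'] == 'some_value') & (df['col_b'] == 'other_value')].to_csv('filtered.csv', index=False)

(For a file this large, read_csv also accepts a chunksize argument in case the whole frame does not fit in memory.)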

 

Vowpal Wabbit: Example Commands

Vowpal Wabbit Commands can be pretty cryptic.

As I play around with it, I am listing down example commands. Hopefully the flags will become clear as I play with this tool more.

rcv1 dataset:

  • [0.0666918] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic --sgd -l 1.5 -f rcv1_raw_sgd_model -b 22 --binary
  • [0.0579419] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic -f rcv1_raw_model -b 22 -l 1 --binary
  • [0.0462865] ../vw.exe -d rcv1.train.raw.txt.gz -c --loss_function logistic -f rcv1_raw_n2skip4_model -b 22 --ngram 2 --skips 4 --binary -l 1
  • [0.0455684] ../vw.exe -d rcv1.train.raw.txt.gz -c -f rcv1_raw_sqloss_n2skip4_model -b 22 --ngram 2 --skips 4 --binary -l 0.25

 

titanic dataset:

  • vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24
  • vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt

 

Recommendation Systems: Approaches

There are several approaches to building a recommendation system. I have been intrigued by how we can connect the different approaches and understand the pros and cons of each.

Here is a high-level overview of the approaches being used for solving the recommendation problem:

[1] High Level Approaches:

  • Content Based
    • Based on using weights across content features
  • Collaborative Methods
    • Based on the approach of “users who liked this also liked”

[2] Collaborative Filtering:

  • Memory Based (e.g. K-nearest neighbors)
  • Model Based (e.g. Matrix Factorization)

[Figure: collaborative filtering]

[3] Collaborative Filtering: Memory Based: K-nearest Neighbors

  • Key Intuition: “take a local popularity vote among ‘similar’ users”.
  • Need to quantify similarity and predict unseen ratings.
  • Can take two forms (the item-item flavor is sketched below):
    • Item-item collaborative filtering (also called item-based, or item-to-item).
    • User-user collaborative filtering.
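
A minimal sketch of the item-item flavor, using cosine similarity over a made-up ratings matrix:

    import numpy as np

    # rows = users, columns = items; 0 means "unrated"
    R = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 4.0],
                  [1.0, 1.0, 5.0]])

    # cosine similarity between item columns
    norms = np.linalg.norm(R, axis=0)
    sim = R.T.dot(R) / np.outer(norms, norms)

    # predict user 0's rating of item 2 as a similarity-weighted vote
    # over the items that user 0 has rated (items 0 and 1)
    weights = sim[2, :2]
    pred = weights.dot(R[0, :2]) / weights.sum()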

[4] Collaborative Filtering: Model Based: Matrix Factorization

  • Key Intuition: model item attributes as belonging to a set of unobserved topics, and user preferences as weights across these topics.

[Figure: matrix factorization]

  • Model quality of fit with squared loss.
  • There are two common ways to optimize the loss (an SGD sketch follows):
    • Alternating Least Squares.
    • Stochastic Gradient Descent.
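
A minimal sketch of the SGD variant on a made-up ratings matrix (k unobserved topics, squared loss, regularization omitted for brevity):

    import numpy as np

    R = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 4.0],
                  [1.0, 1.0, 5.0]])   # 0 means "unobserved"
    n_users, n_items, k = R.shape[0], R.shape[1], 2

    rng = np.random.RandomState(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user preferences over topics
    V = rng.normal(scale=0.1, size=(n_items, k))   # item loadings on topics

    lr = 0.05
    for _ in range(200):
        for u in range(n_users):
            for i in range(n_items):
                if R[u, i] > 0:                     # fit observed entries only
                    err = R[u, i] - U[u].dot(V[i])  # squared-loss residual
                    u_row = U[u].copy()
                    U[u] += lr * err * V[i]         # gradient steps on both factors
                    V[i] += lr * err * u_row

    pred = U.dot(V.T)   # predicted ratings, including the unobserved cells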

 

References:

  1. https://en.wikipedia.org/wiki/Item-item_collaborative_filtering
  2. http://dl.acm.org/citation.cfm?id=372071

 

Online Learning Reductions Using Vowpal Wabbit

I have been using Vowpal Wabbit lately, especially in the context of building systems for online learning.

Vowpal Wabbit supports several online learning reductions out of the box. I am listing down a few below:

  1. Importance Weighted Classification.

[Figure: importance-weighted classification]

  2. Multi Class.
    • Look up the --oaa and --ect options in VW (example commands below).
  3. Cost-Sensitive Multiclass.

[Figure: cost-sensitive multiclass]

    • Look up the --csoaa and --wap options in VW (example commands below).
  4. Structured Prediction.
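
Example commands for the multiclass and cost-sensitive reductions (file names are made up; the flags are from the VW wiki):

    # one-against-all with 3 classes; labels look like "2 | features"
    vw -d multiclass.dat --oaa 3

    # error-correcting tournament, same input format
    vw -d multiclass.dat --ect 3

    # cost-sensitive one-against-all; labels look like "1:0.0 2:1.0 3:2.0 | features"
    vw -d costs.dat --csoaa 3

    # weighted all-pairs, same cost-sensitive input format
    vw -d costs.dat --wap 3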

 

References:

  1. https://github.com/JohnLangford/vowpal_wabbit/wiki
  2. http://www.slideshare.net/pauldix/terascale-learning
  3. http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
  4. http://fastml.com/large-scale-l1-feature-selection-with-vowpal-wabbit/
  5. http://www.zinkov.com/posts/2013-08-13-vowpal-tutorial/

 

Multi-Dimensional (Axial) Data Handling in Python

Recently I was playing around with multi-dimensional data structures in Python.

Some interesting observations:

  1. Multi-dimensional lists (lists of lists) and multi-dimensional numpy arrays are handled fundamentally differently.
  2. Slicing of multi-dimensional numpy arrays needs care with respect to views vs. copies: basic slices are views into the original data, whereas Python list slices are shallow copies.
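
A small demonstration of both points:

    import numpy as np

    # 1. A "2-D list" is just a list of references to row lists:
    nested = [[1, 2], [3, 4]]
    rows = nested[:]      # shallow copy: the inner row lists are shared
    rows[0][0] = 99
    print(nested[0][0])   # 99 -- the original changed too

    # 2. Basic slicing of a numpy array returns a view, not a copy:
    a = np.arange(6).reshape(2, 3)
    b = a[:, :2]          # a view into a's data
    b[0, 0] = 99
    print(a[0, 0])        # 99 -- mutating the slice mutated the array
    c = a[:, :2].copy()   # use .copy() for an independent array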

 

Some references below for further examination:

References:

  1. http://ilan.schnell-web.net/prog/slicing/
  2. https://docs.python.org/2/library/copy.html
  3. http://stackoverflow.com/questions/509211/explain-pythons-slice-notation
  4. http://cs231n.github.io/python-numpy-tutorial/
  5. http://www.physics.nyu.edu/pine/pymanual/html/chap3/chap3_arrays.html
  6. http://www.astro.ufl.edu/~warner/prog/python.html