REST Calls in Python. JSON. Pandas.

I recently had to make REST calls in Python for sending data to Azure EventHub.

In this particular case I could not use the Python SDK to talk to EventHub. As I wrote the code to make the raw REST calls, I came across several gems, which I am listing below.

Tips:

  • Use the Python ‘requests’ library.
    • I have yet to figure out how to make async calls: can I use this library for async as well, or would I have to use something else?
  • Sending JSON is the way to go.
    • Don’t even try sending anything else.
  • Pandas has great functionality for converting Series/DataFrames to JSON.
    • The ‘to_json’ function is flexible, including options such as orient=‘records’.
  • Python has an awesome library called ‘json’ for dealing with JSON data.
    • To deserialize a JSON string, use json.loads().
    • To serialize a dict to JSON, use json.dumps().
    • Note: to preserve key order when deserializing on older Python versions, use ‘collections.OrderedDict’. On Python 3.7+ a plain dict already preserves insertion order.
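
The tips above can be sketched roughly as follows. This is only an illustration, not the code I used against EventHub: the helper names `frame_to_json_records` and `post_json`, the column names, and the URL are all made up, and the authorization headers that EventHub needs are left out. The `session` argument is assumed to be a `requests.Session` (anything with the same `.post` signature would do).

```python
import json

import pandas as pd


def frame_to_json_records(df):
    """Serialize a DataFrame as a JSON array with one object per row."""
    return df.to_json(orient="records")


def post_json(session, url, body, timeout=10):
    """POST an already-serialized JSON string.

    `session` is assumed to be a requests.Session; `url` is a placeholder
    supplied by the caller. Authentication headers are deliberately omitted.
    """
    headers = {"Content-Type": "application/json"}
    return session.post(url, data=body, headers=headers, timeout=timeout)


# Hypothetical data, loosely modelled on the records further down.
df = pd.DataFrame({"reward": [30, 20], "actionname": ["x", "y"]})
body = frame_to_json_records(df)

# Round-trip through the json module to confirm the payload shape.
records = json.loads(body)
```

With the real library this would be called as `post_json(requests.Session(), url, body)`, where `url` is your own endpoint.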

Check this out:


import collections
import json

myj = '[{"reward":30,"actionname":"x","age":60,"gender":"M","weight":150,"Scored Labels":30.9928596354},{"reward":20,"actionname":"y","age":60,"gender":"M","weight":150,"Scored Labels":19.0217225957}]'

myj_l = json.loads(myj, object_pairs_hook=collections.OrderedDict)

myj_l
Out[177]:
[OrderedDict([(u'reward', 30), (u'actionname', u'x'), (u'age', 60), (u'gender', u'M'), (u'weight', 150), (u'Scored Labels', 30.9928596354)]),
 OrderedDict([(u'reward', 20), (u'actionname', u'y'), (u'age', 60), (u'gender', u'M'), (u'weight', 150), (u'Scored Labels', 19.0217225957)])]

for item in myj_l:
    print(json.dumps(item))

{"reward": 30, "actionname": "x", "age": 60, "gender": "M", "weight": 150, "Scored Labels": 30.9928596354}
{"reward": 20, "actionname": "y", "age": 60, "gender": "M", "weight": 150, "Scored Labels": 19.0217225957}

Debugging Standard Deviation

In one of my previous posts, I had noted my thoughts around statistical measures like standard deviation and confidence intervals.

The fun part is of course when one has to debug these measures.

To that end I developed some insights by visualizing the data and plotting different kinds of charts using matplotlib.

  • The code below also serves as a reference for one of my pet peeves: plotting data from a pandas DataFrame.
  • Use the code below as a reference going forward.

[Figure: capture2]

Also, sometimes you have to debug plots that make no sense at all, like the one below:

  • The first plot didn’t make sense to me initially. But once I started debugging, it made total sense.
  • The second plot below is what I get when I sort the data.

[Figure: capture1]
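
As a rough sketch of the sorted-versus-unsorted idea (not the original code for these plots), the snippet below plots one DataFrame column in row order and then sorted, with the mean and a one-standard-deviation band overlaid. The column name `value` and the synthetic data are assumptions made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data standing in for the real measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=50.0, scale=5.0, size=200)})

mean = df["value"].mean()
std = df["value"].std()  # pandas uses the sample standard deviation (ddof=1)

fig, (ax_raw, ax_sorted) = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: values in row order, which can look like noise.
ax_raw.plot(df["value"].to_numpy())
ax_raw.set_title("row order")

# Plot 2: the same values sorted, which exposes the spread around the mean.
sorted_values = np.sort(df["value"].to_numpy())
ax_sorted.plot(sorted_values)
ax_sorted.axhline(mean, linestyle="--")
ax_sorted.axhspan(mean - std, mean + std, alpha=0.2)
ax_sorted.set_title("sorted")
```

Passing `df["value"].to_numpy()` explicitly sidesteps surprises from the DataFrame index being used as the x axis.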

Code:

 

Frequency Counting in Python.

One of the most frequent operations when doing data analysis is looking at frequency counts.

I wanted to list the various ways of doing this:

  • using Python collections: Counter and defaultdict
  • using numpy
    • with numpy.unique, with the return_counts argument
    • with bincount, nonzero, zip / vstack
  • using pandas
  • using scipy
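
A minimal sketch of most of these approaches on a small, made-up list is shown below. The scipy route is left out because I am not certain which helper is still available in current releases. For pandas I am assuming `Series.value_counts` as the natural choice.

```python
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

data = [3, 1, 2, 3, 3, 1]  # made-up sample

# collections.Counter: counts hashable items in one call.
counter_counts = dict(Counter(data))

# collections.defaultdict: the manual version of the same idea.
dd = defaultdict(int)
for x in data:
    dd[x] += 1
defaultdict_counts = dict(dd)

# numpy.unique with return_counts=True: sorted unique values plus counts.
values, counts = np.unique(data, return_counts=True)
unique_counts = dict(zip(values.tolist(), counts.tolist()))

# numpy.bincount + nonzero + zip: only valid for non-negative integers.
binned = np.bincount(data)
present = np.nonzero(binned)[0]
bincount_counts = dict(zip(present.tolist(), binned[present].tolist()))

# pandas: value_counts on a Series.
pandas_counts = pd.Series(data).value_counts().to_dict()
```

All five dictionaries hold the same counts; which one to pick mostly depends on whether the data already lives in a list, an array, or a DataFrame.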

 

Shuffling and Splitting Operations

Shuffling is a useful operation in several scenarios, and different languages and platforms offer interesting features built around it.

Examples:

  • Shuffle the contents of a C# List
  • Select random lines from a file (using the ‘shuf’ command in Linux)
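
For comparison, here is a rough Python equivalent of the two examples, plus a simple shuffle-then-split. The function names and the 25% test fraction are made up for illustration; only the standard library `random` module is used.

```python
import random


def shuffled(items, seed=None):
    """Return a shuffled copy of `items`, leaving the input untouched."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


def sample_lines(lines, k, seed=None):
    """Pick up to k random lines without replacement (similar in spirit to `shuf -n k`)."""
    rng = random.Random(seed)
    lines = list(lines)
    return rng.sample(lines, min(k, len(lines)))


def shuffle_split(items, test_fraction=0.25, seed=None):
    """Shuffle, then cut into (train, test) parts."""
    out = shuffled(items, seed)
    n_test = int(round(len(out) * test_fraction))
    return out[n_test:], out[:n_test]


train, test = shuffle_split(range(8), test_fraction=0.25, seed=42)
```

Passing a seed makes the shuffle reproducible, which helps when debugging a split.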

Code:

DecisionController

Comparing ML algos: Multi-Armed Bandit, Contextual Bandit, Logistic Regression, Online Learning

We have a system running Multi-Armed Bandit.

So when it came to selecting the next ML algorithm to try out, we had a few choices:

  1. Multi-Armed Bandit (we had this running)
    • This entails ranking the items based on their respective conversion rates up to that point in time.
  2. Contextual Bandit
    • We use Vowpal Wabbit for this.
    • Internally, Vowpal Wabbit treats contextual bandit in 2 distinct ways:
      • Without Action Dependent Features (non-ADF)
      • With Action Dependent Features (ADF)
    • Interestingly, there is a difference between non-ADF and ADF modes.
      • In non-ADF mode, VW creates multiple models (i.e. one model for each class).
      • In ADF mode, VW creates a single model.
  3. Logistic Regression
    • This entails reducing the problem to a binary classification problem.
    • We then use the model to score the items, and finally rank the items based on the model score.
  4. Online ML
    • Again we treat this as a binary classification problem, except this time we update the model in an online fashion.
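
To make options 1, 3 and 4 a little more concrete, here is a toy sketch in plain Python. It is not the code behind the comparison: the `stats` layout, the class name `OnlineLogistic`, the learning rate and the feature vectors are all assumptions, and the ranking function is purely greedy (a real bandit would also explore).

```python
import math


def rank_by_conversion_rate(stats):
    """Rank items by observed conversion rate, best first.

    `stats` maps item -> (conversions, impressions); a hypothetical layout.
    """
    def rate(item):
        conversions, impressions = stats[item]
        return float(conversions) / impressions if impressions else 0.0

    return sorted(stats, key=rate, reverse=True)


class OnlineLogistic:
    """Binary logistic regression updated one example at a time (plain SGD)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        """Return the modelled probability that the label is 1."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One gradient step on the log loss for a single (x, y) pair."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

    def rank(self, items):
        """Rank items (name -> feature vector) by model score, best first."""
        return sorted(items, key=lambda name: self.predict(items[name]), reverse=True)


ranking = rank_by_conversion_rate({"x": (30, 100), "y": (20, 50), "z": (0, 0)})
```

The batch logistic regression option differs from the online one only in when the weights are fitted; the score-then-rank step is the same.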

 

Interestingly, on the dataset I was using I didn’t see much of a difference in algorithmic performance across the 4 different algorithms above.

[Figure: algo_compare]

 

Code:

_trials_compare3

 

Applying operations over pandas dataframes.

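There are three methods to consider when thinking of applying operations over pandas dataframes.

  • map: works element-wise on a Series
  • apply: works on whole rows or columns of a DataFrame (or on the values of a Series)
  • applymap: works element-wise on a DataFrame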
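
A small sketch of the three, on a made-up DataFrame. One assumption to flag: newer pandas releases expose the element-wise DataFrame operation as `DataFrame.map`, with `applymap` deprecated, so the snippet picks whichever is available.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # made-up data

# Series.map: element-wise over a single column.
doubled_a = df["a"].map(lambda v: v * 2)

# DataFrame.apply: the function receives a whole column (default, axis=0)...
col_ranges = df.apply(lambda col: col.max() - col.min())

# ...or a whole row (axis=1).
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise over every cell: DataFrame.map on newer pandas,
# DataFrame.applymap on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
as_text = elementwise(lambda v: "%d!" % v)
```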

 

References:

  1. http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
  2. http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
  3. http://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map

 

Using Groupby in Pandas

Tips:

  1. Doing a groupby returns either a SeriesGroupBy object or a DataFrameGroupBy object.
    • “This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups.” – Python for Data Analysis
  2. Using aggregate functions on the grouped object.
    • Some common aggregations are provided by default as instance methods on the GroupBy object:
      • .sum()
      • .mean()
      • .size()
        • size() has a slightly different output from the others: it returns the number of rows in each group, rather than one result per column.
        • Some examples use count() instead, but I had trouble using count(). Note that count() counts only the non-null values in each column, so it can differ from size().
    • Applying multiple functions, or different functions to different columns:
      • See the relevant section in reference [1].
  3. Column selection in groupby.
    • In [37]: grouped = df.groupby(['A'])
      
      In [38]: grouped_C = grouped['C']
      
      In [39]: grouped_D = grouped['D']
      

      This is mainly syntactic sugar for the alternative and much more verbose:

      In [40]: df['C'].groupby(df['A'])
      Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x129fce310>
  4. as_index=False
    • Note that using as_index=False still returns a GroupBy object.
  5. reset_index
    • There are some oddities when using groupby (see reference [3]). In those cases, reset_index is useful.
  6. Using unstack()
    • The typical use of unstack() is to undo the effects of hierarchical indexing, by pivoting an index level into columns.
    • See reference [2] for a nice example.
  7. Iterating over groups

    # Group the dataframe by regiment, and for each regiment,
    for name, group in df.groupby('regiment'):
        # print the name of the regiment
        print(name)
        # print the data of that regiment
        print(group)

  8. Applying multiple functions at once
    • See the section in reference [1] on applying multiple functions.
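
The tips above can be pulled together in one small sketch. The DataFrame and its column names (A, B, C, D) are made up for illustration, and the missing value in C is there only to show how size() and count() differ.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "y"],
    "B": ["p", "q", "p", "p", "q"],
    "C": [1.0, 2.0, 3.0, None, 5.0],
    "D": [10, 20, 30, 40, 50],
})

# Tip 1: nothing is computed yet; this is a DataFrameGroupBy object.
grouped = df.groupby("A")

# Tip 2: size() counts rows per group, count() counts non-null values.
sizes = grouped.size()
counts = grouped["C"].count()

# Tips 2 and 8: several functions at once, or different functions per column.
multi = grouped["D"].agg(["sum", "mean"])
per_column = grouped.agg({"C": "mean", "D": "sum"})

# Tip 3: selecting a column gives a SeriesGroupBy.
d_sum = grouped["D"].sum()

# Tips 4 and 5: two ways to get the group key back as an ordinary column.
flat_as_index = df.groupby("A", as_index=False)["D"].sum()
flat_reset = d_sum.reset_index()

# Tip 6: unstack() pivots the inner index level (B) into columns.
wide = df.groupby(["A", "B"])["D"].sum().unstack()

# Tip 7: iterating yields (group name, sub-DataFrame) pairs.
rows_per_group = {name: len(group) for name, group in grouped}
```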

References:

  1. http://pandas.pydata.org/pandas-docs/stable/groupby.html
  2. http://chrisalbon.com/python/pandas_apply_operations_to_groups.html
  3. http://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
  4. http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

 

Code: