REST Calls in Python. JSON. Pandas.

I recently had to make REST calls in Python for sending data to Azure EventHub.

In this particular case I could not use the Python SDK to talk to EventHub. As I wrote the code to make the raw REST calls, I came across several gems, which I am listing below.

Tips:

  • Use the Python ‘requests’ library.
    • I have yet to figure out how to make async calls. Can I use this library for async as well, or would I have to use something else?
  • Sending JSON is the way to go.
    • Don’t even try sending anything else.
  • Pandas has great functionality to convert Series/DataFrames to JSON.
    • The ‘to_json’ function has awesome functionality, including orient=’records’ etc. (see the sketch under Code below).
  • Python has an awesome library called ‘json’ to deal with JSON data.
    • To deserialize, use json.loads().
    • In particular, to convert a dict to JSON use json.dumps().
    • Note: if you want to preserve key order, use ‘collections.OrderedDict’, as in the example below.

Check this out:


import collections
import json

myj = '[{"reward":30,"actionname":"x","age":60,"gender":"M","weight":150,"Scored Labels":30.9928596354},{"reward":20,"actionname":"y","age":60,"gender":"M","weight":150,"Scored Labels":19.0217225957}]'

# Deserialize, preserving key order with OrderedDict.
myj_l = json.loads(myj, object_pairs_hook=collections.OrderedDict)

# myj_l is now:
# [OrderedDict([('reward', 30), ('actionname', 'x'), ('age', 60), ('gender', 'M'), ('weight', 150), ('Scored Labels', 30.9928596354)]),
#  OrderedDict([('reward', 20), ('actionname', 'y'), ('age', 60), ('gender', 'M'), ('weight', 150), ('Scored Labels', 19.0217225957)])]

# Serialize each record back to a JSON string.
for item in myj_l:
    print(json.dumps(item))

# {"reward": 30, "actionname": "x", "age": 60, "gender": "M", "weight": 150, "Scored Labels": 30.9928596354}
# {"reward": 20, "actionname": "y", "age": 60, "gender": "M", "weight": 150, "Scored Labels": 19.0217225957}

Code:
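A minimal sketch of the pattern above: serialize a DataFrame with to_json(orient=’records’) and POST it with requests. The endpoint URL, the auth header value, and the dataframe columns here are hypothetical placeholders, not the actual production values.

import pandas as pd
import requests

df = pd.DataFrame({'reward': [30, 20], 'actionname': ['x', 'y']})

# orient='records' produces a JSON array of row objects.
payload = df.to_json(orient='records')

# Hypothetical Event Hubs endpoint and SAS token -- substitute your own.
url = 'https://mynamespace.servicebus.windows.net/myeventhub/messages'
headers = {
    'Authorization': 'SharedAccessSignature sr=...',  # placeholder token
    'Content-Type': 'application/json',
}

# Note: requests is synchronous; each call blocks until the response arrives.
resp = requests.post(url, data=payload, headers=headers)
resp.raise_for_status()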

Debugging Standard Deviation

In one of my previous posts, I had noted my thoughts around statistical measures like standard deviation and confidence intervals.

The fun part is of course when one has to debug these measures.

To that end I developed some insights by visualizing the data and plotting different kinds of charts using matplotlib.

  • The code below also acts as a reference for one of the pet peeves I have when trying to plot data from a pandas dataframe.
  • Use the code below as a reference going forward.

[Figure: charts plotted from the dataframe]

Also, sometimes you have to debug plots when they make no sense at all. Like this one below:

  • The first plot didn’t make sense to me initially, but once I started debugging it made total sense.
  • Check the second plot below, which is what I get when I ‘sort’ the data.

[Figure: the same data plotted unsorted (first plot) and sorted (second plot)]

Code:
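A minimal sketch of the sort-before-plot gotcha, on made-up data: plotting a column in its stored row order looks like noise, while sorting the values first reveals the shape of the distribution.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: a numeric column in arbitrary row order.
df = pd.DataFrame({'val': np.random.normal(loc=50, scale=10, size=200)})

fig, (ax1, ax2) = plt.subplots(2, 1)

# Plotting in raw row order: looks like meaningless noise.
ax1.plot(df['val'].values)
ax1.set_title('unsorted')

# Sorting first makes the distribution visible. Pass .values so
# matplotlib does not use the now-shuffled index as the x-axis.
ax2.plot(df['val'].sort_values().values)
ax2.set_title('sorted')

plt.tight_layout()
plt.show()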

 

Frequency Counting in Python.

One of the most frequent operations when doing data analysis is looking at frequency counts.

I wanted to list the various ways of doing this task (a sketch of each follows under Code):

  • using python collections: Counter and defaultdict
  • using numpy
    • with numpy.unique, with return_counts argument
    • with bincount, nonzero, zip / vstack
  • using pandas
  • using scipy

 

Code:
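A minimal sketch of the collections, numpy, and pandas approaches on a toy list (the scipy route is omitted here):

from collections import Counter, defaultdict

import numpy as np
import pandas as pd

data = [1, 2, 2, 3, 3, 3]

# 1a. collections.Counter
print(Counter(data))  # Counter({3: 3, 2: 2, 1: 1})

# 1b. collections.defaultdict
counts = defaultdict(int)
for x in data:
    counts[x] += 1
print(dict(counts))  # {1: 1, 2: 2, 3: 3}

# 2a. numpy.unique with the return_counts argument
values, freqs = np.unique(data, return_counts=True)
print(dict(zip(values, freqs)))  # {1: 1, 2: 2, 3: 3}

# 2b. numpy.bincount + nonzero + vstack (small non-negative ints only)
b = np.bincount(data)
nz = np.nonzero(b)[0]
print(np.vstack((nz, b[nz])).T)  # rows of (value, count)

# 3. pandas value_counts
print(pd.Series(data).value_counts())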

 

 

Shuffling and Splitting Operations

Shuffling is a pretty interesting operation in several scenarios, and different languages/platforms have interesting features built around it.

Examples:

  • Shuffle the contents of a C# List
  • Select random lines from a file (using the ‘shuf’ command in Linux)

Code:

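A Python analogue of the two examples above (the file name is a hypothetical placeholder):

import random

# 1. Shuffle the contents of a list in place.
items = list(range(10))
random.shuffle(items)
print(items)

# 2. Select k random lines from a file, like `shuf -n 3 myfile.txt`.
#    'myfile.txt' is a placeholder; the file needs at least k lines.
with open('myfile.txt') as f:
    lines = f.readlines()
print(random.sample(lines, 3))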

Comparing ML Algos: Multi-Armed Bandit, Contextual Bandit, Logistic Regression, Online Learning

We have a system running Multi-Armed Bandit.

So when it came to selecting the next generation of ML algo to try out, we had a few choices:

  1. Multi-Armed Bandit (we had this running)
    • This entails ranking the items based on their respective conversion rates up to that point in time (sketched under Code below).
  2. Contextual Bandit
    • We use Vowpal Wabbit for this.
    • Internally Vowpal Wabbit treats contextual bandit in 2 distinct ways:
      • Without Action Dependent Features (non-ADF)
      • With Action Dependent Features (ADF)
    • Interestingly there is a difference between non-ADF and ADF modes.
      • In non-ADF mode, VW creates multiple models (i.e. one model per class).
      • In ADF mode, VW creates a single model.
  3. Logistic Regression.
    • This entails reducing the problem to a binary classification problem.
    • Then using the model to score the items, and finally ranking the items by model score.
  4. Online ML
    • Again treating this as a binary classification problem, except this time we update the model in an online fashion.

 

Interestingly, on the dataset I was using I didn’t see much of a difference in algorithmic performance across the 4 different algorithms above.

[Figure: performance comparison across the four algorithms]

 

Code:

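A minimal sketch of option 1, the conversion-rate ranking that the Multi-Armed Bandit step entails. The item names and counter columns are hypothetical placeholders:

import pandas as pd

# Hypothetical per-item counters accumulated so far.
stats = pd.DataFrame({
    'item': ['x', 'y', 'z'],
    'impressions': [1000, 800, 50],
    'conversions': [30, 28, 5],
})

# Rank items by their conversion rate up to this point in time.
stats['conv_rate'] = stats['conversions'] / stats['impressions']
ranked = stats.sort_values('conv_rate', ascending=False)
print(ranked[['item', 'conv_rate']])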

 

Applying operations over pandas dataframes.

There are three methods to consider when thinking about applying operations over pandas dataframes:

  • map
  • apply
  • applymap
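In brief: ‘map’ is elementwise over a Series, ‘apply’ works along an axis of a DataFrame, and ‘applymap’ is elementwise over the whole DataFrame. A quick sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# map: elementwise on a single Series (a column).
print(df['a'].map(lambda x: x * 2))

# apply: along an axis of the DataFrame (here, sums each column).
print(df.apply(sum))

# applymap: elementwise over every cell of the DataFrame.
print(df.applymap(lambda x: x + 1))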

 

References:

  1. http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
  2. http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
  3. http://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map

 

Using Groupby in Pandas

Tips:

  1. upon doing a groupby, we either get a SeriesGroupBy object, or a DataFrameGroupBy object.
    • “This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df[‘key1’]. The idea is that this object has all of the information needed to then apply some operation to each of the groups.” – Python for Data Analysis
  2. using aggregate functions on the grouped object.
    • some common aggregations are provided by default as instance methods on the GroupBy object
      • .sum()
      • .mean()
      • .size()
        • size() has a slightly different output than the others
        • there are some examples which show using count(), but I had trouble using count()
    • applying multiple functions / applying different functions to different columns
      • look up the relevant section in reference [1]
  3. column selection in group by.
    • In [37]: grouped = df.groupby(['A'])
      
      In [38]: grouped_C = grouped['C']
      
      In [39]: grouped_D = grouped['D']
      

      This is mainly syntactic sugar for the alternative and much more verbose:

      In [40]: df['C'].groupby(df['A'])
      Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x129fce310>
  4.  as_index=False
    • do note that using as_index=False still returns a groupby object
  5. reset_index
    • there are some oddities when using groupby (reference [3]). In those cases, using reset_index will be useful
  6. using unstack()
    • the typical use of unstack is to remove the effects of hierarchical indexing
    • see reference [2] for a nice example
  7. iterate operations over groups

    # Group the dataframe by regiment, and for each regiment:
    for name, group in df.groupby('regiment'):
        # print the name of the regiment
        print(name)
        # print the data of that regiment
        print(group)

  8. applying multiple functions at once
    • look up the section in reference [1] on applying multiple functions

References:

  1. http://pandas.pydata.org/pandas-docs/stable/groupby.html
  2. http://chrisalbon.com/python/pandas_apply_operations_to_groups.html
  3. http://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
  4. http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

 

Code:
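A minimal sketch touching tips 1, 2, 4, and 5 above, with a made-up dataframe:

import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b'],
    'data1': [1, 2, 3, 4],
})

# Tip 1: groupby returns a lazy GroupBy object; nothing is computed yet.
grouped = df.groupby('key1')

# Tip 2: common aggregations as instance methods.
print(grouped.sum())
print(grouped.mean())
print(grouped.size())   # a Series of group sizes, unlike sum()/mean()

# Tip 4: as_index=False keeps the group key as a regular column after
# aggregating (until then it is still a groupby object).
print(df.groupby('key1', as_index=False).sum())

# Tip 5: reset_index flattens the index of an aggregated result.
print(grouped.mean().reset_index())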

Data Wrangling Using Pandas

As part of a data wrangling exercise, this is what I had to do recently:

  1. Crack open a 2.7 GB file. The file has rows and columns.
  2. Filter this file to extract the rows satisfying some conditions.
    • Conditions were imposed on a couple of columns with specific values.
  3. Write out the result to a new file.

Tips / Insights:

  • Approach 1: The file can be read in line by line, with the filters applied as we go.
    • Below I have shown the code in both Python and Perl.
  • Approach 2: With pandas, it’s about two lines of code.
    • Go pandas!!

Code:
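A minimal Python sketch of both approaches, assuming a comma-separated file and hypothetical column names ‘colA’/‘colB’ with required values ‘value1’/‘value2’:

import pandas as pd

# Approach 1: stream the file line by line and apply the filters.
with open('big_input.csv') as fin, open('filtered.csv', 'w') as fout:
    header = fin.readline()
    fout.write(header)
    cols = header.rstrip('\n').split(',')
    ia, ib = cols.index('colA'), cols.index('colB')
    for line in fin:
        fields = line.rstrip('\n').split(',')
        if fields[ia] == 'value1' and fields[ib] == 'value2':
            fout.write(line)

# Approach 2: with pandas it is essentially two lines.
# (If the file does not fit in memory, read_csv's chunksize argument helps.)
df = pd.read_csv('big_input.csv')
df[(df['colA'] == 'value1') & (df['colB'] == 'value2')].to_csv('filtered.csv', index=False)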

 

Handling Missing Data in Pandas

Oftentimes when working with missing data, I prefer pandas, just because it makes things so much easier.

Tip:

  1. dropna has arguments subset and how:
df2.dropna(subset=['three', 'four', 'five'], how='all')

As the names suggest:

  • how='all' requires every column (of subset) in the row to be NaN in order to be dropped, as opposed to the default 'any'.
  • subset is those columns to inspect for NaNs.

 

Code:
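A small demo of the dropna call above on a made-up frame. A row is dropped only when all three inspected columns are NaN:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'one':   [1, 2, 3],
    'three': [np.nan, np.nan, 3],
    'four':  [np.nan, 4, 4],
    'five':  [np.nan, np.nan, 5],
})

# Drop a row only if 'three', 'four', AND 'five' are all NaN.
print(df2.dropna(subset=['three', 'four', 'five'], how='all'))
# Row 0 is dropped; row 1 survives because 'four' is non-NaN.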

 

References:

  1. http://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-of-certain-column-is-nan
  2. http://stackoverflow.com/questions/14991195/how-to-remove-rows-with-null-values-from-kth-column-onward-in-python
  3. http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data-basics

 

Google While Coding – 3: Mapping Pandas Columns

There was this requirement to map the values of a column into a new column, e.g. 0->0, 1->0, 2->1, 3->1.

Some would look to Excel. I looked towards pandas.

Some queries that came out:

  1. “pandas map”
  2. “python pandas map column to another”
  3. “pandas write to csv”

It surprises me how much learning comes out of something as simple as this. For instance, here are the new things that came out of it:

  1. Python’s lambda, filter, map, reduce operations
  2. Pandas function mapping
  3. apply, applymap and map for pandas
  4. replace, update, put
  5. Another example of adding a column to an existing dataframe in pandas

 

Code:
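A minimal sketch of the mapping itself, with a hypothetical column name, plus the write-to-CSV step from the queries above:

import pandas as pd

df = pd.DataFrame({'grade': [0, 1, 2, 3, 2]})

# Map old values to new ones via a dict; unmapped values become NaN.
df['grade_binned'] = df['grade'].map({0: 0, 1: 0, 2: 1, 3: 1})

# "pandas write to csv"
df.to_csv('mapped.csv', index=False)
print(df)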