Data Wrangling Using Pandas

As part of a data wrangling exercise, this is what I had to do recently:

  1. Crack open a 2.7GB file.  File has rows and columns.
  2. Filter this file to extract rows which were satisfying some conditions.
    • Conditions were imposed on couple of columns with specific values
  3. Write out the result to a new file.

Tips / Insights:

  • Approach 1 : The file can be read in line by line, and the filters applied etc.
    • Below I have shown the code in both python and perl
  • Approach 2 : With pandas its a 2 line code
    • Go pandas!!

Code:

 

Advertisements

Handling Missing Data in Pandas

Oftentimes while working with missing data, I prefer working with pandas. Just because pandas makes things so much easier.

Tip:

  1. dropna has arguments subset and how:
df2.dropna(subset=['three', 'four', 'five'], how='all')

As the names suggests:

  • how='all' requires every column (of subset) in the row to be NaN in order to be dropped, as opposed to the default 'any'.
  • subset is those columns to inspect for NaNs.

 

Code:

 

References:

  1. http://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-of-certain-column-is-nan
  2. http://stackoverflow.com/questions/14991195/how-to-remove-rows-with-null-values-from-kth-column-onward-in-python
  3. http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data-basics

 

Google While Coding – 3 : Mapping Pandas columns

There was this requirement to map the values of a column into a new column. e.g 0->0, 1->0, 2->1, 3->1

Some would look to excel. I looked towards pandas.

Some queries that came out:

  1. “pandas map”
  2. “python pandas map column to another”
  3. “pandas write to csv”

it surprises me how much of learning comes out from something as simple as this. For instance here are new things that came out of this :

  1. Pythons lambda, filter, map, reduce operations
  2. Pandas function mapping
  3. apply, applymap and map for pandas
  4. replace, update, put
  5. Another example of adding an existing dataframe in pandas

 

Code:

 

Pandas. Views versus copy

In pandas, it can be tricky setting the values of a dataframe. One can hit the ‘SettingWithCopy’ warning message.

The crux of the problem is this:

  • “The reason for having the SettingWithCopy warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array”

–> Note, Pandas throws this warning because of the ambiguity. It doesnt know if the operation was successful or not etc. In some cases it actually fails, but in other cases it works fine.  So Pandas just warns user that its operating on a copy.

  • Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going on. Theseare the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:
    def do_something(df):
       foo = df[['bar', 'baz']]  
       # Is foo a view? A copy? Nobody knows!
       # ... many lines here ...
       foo['quux'] = value
       # We don't know whether this will modify df or not!
       return foo
    

    Yikes!

  • In particular, the warning can be a misnomer if there is no chained indexing going on (as described above)

__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

  • In case I know I  want to work with a copy,  I can create a copy explicitly (using .copy(). see below). This suppresses the warning.

 

Code:

 

Reference: