Googling While Coding – 2 : Parsing CSV file, extracting some URLs, and downloading images

I recently had a curious problem: parsing a CSV file, extracting some URLs from it, and downloading the images they point to onto my machine.

Anyway, I think it would be instructive to list the Google searches involved in coming up with the final code:

  1. “write pandas dataframe to file”
  2. “split pandas column and get last column”
  3. “merge 2 dataframes pandas”
  4. “iterate over pandas dataframe rows”
  5. “python download image from url”
  6. “python write tab delimited file”
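
To make the searches above concrete, here is a minimal sketch of how the pieces might fit together. It is not the original code: the column names ("id", "image_url", "label"), the sample data, and the helper names build_table and download_images are all assumptions made for illustration.

```python
import io
import urllib.request

import pandas as pd

# Hypothetical input: a CSV with an "id" column and an "image_url" column.
CSV_TEXT = """id,image_url
1,http://example.com/images/cat.jpg
2,http://example.com/images/dog.png
"""

# Hypothetical second table to merge in, keyed on the same "id".
LABELS = pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})


def build_table(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    # "split pandas column and get last column":
    # split the URL on "/" and keep the last piece as the local file name.
    df["filename"] = df["image_url"].str.split("/").str[-1]
    # "merge 2 dataframes pandas": join on the shared key.
    return df.merge(LABELS, on="id", how="left")


def download_images(df, fetch=urllib.request.urlretrieve):
    # "iterate over pandas dataframe rows" + "python download image from url".
    # fetch is a parameter so the network call can be swapped out when testing.
    for _, row in df.iterrows():
        fetch(row["image_url"], row["filename"])


table = build_table(CSV_TEXT)
# "python write tab delimited file": with no path, to_csv returns the text;
# pass a path as the first argument to write it to disk instead.
tsv_text = table.to_csv(sep="\t", index=False)
```

Calling download_images(table) would then fetch each URL to a file named after the last segment of its path.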

One additional search, since I wanted to work with only 1% of the data I had:

  1. “python get random number in range”
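
One way a random number in a range can be used to keep roughly 1% of the rows is sketched below. The function name sample_rows and its parameters are assumptions for illustration, and the result is an approximate 1% sample, not an exact one.

```python
import random


def sample_rows(rows, percent=1, seed=None):
    """Keep each row with probability percent/100."""
    rng = random.Random(seed)
    # randint(1, 100) includes both ends, so "<= percent" is true
    # for `percent` out of the 100 equally likely values.
    return [row for row in rows if rng.randint(1, 100) <= percent]


rows = list(range(100_000))
subset = sample_rows(rows, percent=1, seed=0)
```

Passing a seed makes the sample repeatable, which helps when rerunning the rest of the pipeline on the same subset.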

 


Scraping Search Engines

Scraping Search Engines is Fun.

Google.

  1. site:match.com/Profile
  2. site:facebook.com/*/about
  3. inurl:profile site:match.com 
  4. female inurl:profile inurl:about site:match.com

Note:

  • The use of the keywords ‘inurl’ and ‘site’. ‘site’ is straightforward: look only at that specific site. ‘inurl’ requires that the given word be present in the URL.
  • Notice the use of the wildcard operator (*) in 2 above.
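
Queries like the ones above can also be assembled in code. The sketch below only builds and URL-encodes the query string; the helper name build_query and its parameters are assumptions, and how the encoded string is sent to a search engine is left out.

```python
from urllib.parse import quote_plus


def build_query(site=None, inurl=(), terms=()):
    """Combine plain terms with 'inurl:' and 'site:' operators."""
    parts = list(terms)
    parts += [f"inurl:{word}" for word in inurl]
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


# Rebuilds query 3 from the list above.
query = build_query(site="match.com", inurl=["profile"])
# URL-encode the query so it can be used as a query-string parameter.
encoded = quote_plus(query)
```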

Bing.

  • Interestingly, Bing doesn’t support ‘inurl’ or wildcards (*).
  • So 2 & 3 above don’t work.

Google API

  • I played around with the Google APIs to scrape some profiles. However, they seemed to limit the number of results sent back.
  • The code below should give you a sense of my trials in scraping Google programmatically.
    • Google detects fairly easily that you are a bot, and throttles you.
  • Also, Google sometimes redirects, so be careful with the URL format you use for querying.

Bing API

For the Bing API, I use the one they expose in the Azure Marketplace.

 

Code:

 

 

Pending Posts

 

References:

  1. http://www.codeproject.com/Tips/805923/Asynchronous-programming-in-Web-API-ASP-NET-MVC
  2. http://stackoverflow.com/questions/8463809/customize-the-authorization-http-header

Learning Curves. What to try next in ML ?

A very interesting problem in ML is: what to try next? Andrew Ng has some very interesting insights on this topic (see the References section below).

  • Nowadays most ML platforms, e.g. AzureML, give you the ability to do parameter sweeps.
    • Most of the time they also do cross validation when doing sweeps.
    • This simplifies model selection: the platform automatically selects the parameters that give the best accuracy/AUC on the cross validation dataset.
    • This is usually the first thing to do for pretty much all ML problems.

 

  • However, an interesting question still remains, especially from a practical standpoint:
    • Should I focus more on feature engineering, i.e. adding more features, or on getting more data?
    • For these cases I generally use learning curves.
    • There are some nuances, so let me explain what I usually do.

 

  • Plot the training error against the cross validation (CV) error.
    • This usually indicates whether I am currently suffering from a high bias (underfit) problem or a high variance (overfit) problem.
    • High bias (underfit):
      • High training error, high generalization (CV) error.
    • High variance (overfit):
      • Low training error, high generalization (CV) error.
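
The rule of thumb above can be written down as a small check. This is only a sketch: the function name diagnose and both thresholds (target_error and gap_tolerance) are illustrative assumptions, and sensible values depend on the problem.

```python
def diagnose(train_error, cv_error, target_error=0.10, gap_tolerance=0.05):
    """Classify a model as underfit or overfit from its two error rates.

    target_error and gap_tolerance are assumed thresholds, not standard values.
    """
    # High training error: the model cannot even fit the data it has seen.
    if train_error > target_error:
        return "high bias (underfit)"
    # Low training error but a much higher CV error: it does not generalize.
    if cv_error - train_error > gap_tolerance:
        return "high variance (overfit)"
    return "looks fine"


print(diagnose(0.30, 0.32))
print(diagnose(0.02, 0.25))
```

The first call reports high bias because the training error is itself high; the second reports high variance because the training error is low while the CV error is far above it.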

[Figure: learning curve]

  • High variance (overfitting): plot how the error/accuracy varies with increasing data.
    • A good idea here is to use a log-base-2 scale on the x-axis.
    • A log-base-2 scale gives a good sense of how much the error/accuracy will decrease/increase each time the data doubles.
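
A log-base-2 x-axis amounts to evaluating the model at training set sizes that double each time, then reading off how much the error moves per doubling. The sketch below assumes the CV errors have already been measured at each size; the function names and the example numbers are made up for illustration.

```python
import math


def doubling_sizes(n_examples, start=64):
    """Training set sizes start, 2*start, 4*start, ... capped at n_examples."""
    sizes = []
    size = start
    while size < n_examples:
        sizes.append(size)
        size *= 2
    sizes.append(n_examples)
    return sizes


def error_drop_per_doubling(sizes, errors):
    """Drop in error between consecutive sizes, per doubling of the data."""
    drops = []
    for i in range(len(sizes) - 1):
        doublings = math.log2(sizes[i + 1] / sizes[i])
        drops.append((errors[i] - errors[i + 1]) / doublings)
    return drops


sizes = doubling_sizes(1000, start=100)
# Illustrative CV errors (made up) at the first three sizes.
drops = error_drop_per_doubling([100, 200, 400], [0.30, 0.25, 0.22])
```

If the drops are still sizeable at the largest sizes, more data is likely to help; if they have shrunk to nearly zero, the curve has flattened.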

[Figure: log plot]

  • Based on the intuition above the following steps can be taken. 

 

What to try next?

  • Getting more training examples
    • Underfit (high bias): No
    • Overfit (high variance): Yes
  • Try smaller set of features
    • Underfit (high bias): No
    • Overfit (high variance): Yes. But first see if you can get more training examples.
  • Additional features
    • Underfit (high bias): Yes
    • Overfit (high variance): Maybe. If we get a feature that gives a strong signal then yes, add it. But also invest in more data collection in parallel.

 

Code:

 

References:

  1. https://class.coursera.org/ml-005/lecture

Regex in C#

Recently I had to ramp up on my regex skills. I was using C#, so I decided to explore the regex libraries in C#.

Some use cases.

  • Finding matches.
    • Let's say I want to extract all the numbers present in a string.
    • First I identify the proper regex to use; in this case, “\d+” (one or more digits).
    • Then I use Regex.Matches, e.g.
var digitsGroups = Regex.Matches(username, @"\d+", RegexOptions.IgnoreCase);

 

Code.

 

Also, check this out : http://www.ultrapico.com/Expresso.htm

References.

Dynamic Programming Pb

It's been a while since I revisited the theory and practice of Dynamic Programming.

Recently, however, I came across a couple of problems that required some theory similar to DP. Interesting problems; check them out.

 

Code:

 

Made me realize a few things:

  1. I need to revisit DP.
  2. Think the problem through thoroughly before coding.
  3. Simple code oftentimes wins the day.