Scraping Search Engines

Scraping search engines is fun.

Google.

  1. site:match.com/Profile
  2. site:facebook.com/*/about
  3. inurl:profile site:match.com 
  4. female inurl:profile inurl:about site:match.com

Note:

  • Note the use of the keywords ‘site’ and ‘inurl’. ‘site’ is straightforward: it restricts results to that specific site. ‘inurl’ requires that the given word be present in the URL.
  • Notice the use of the wildcard operator (*) in 2 above.
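The four queries above can also be assembled in code rather than typed by hand. A minimal sketch in Python, assuming a hypothetical helper named `build_query` and the ordinary `https://www.google.com/search?q=...` URL form (neither comes from the original code):

```python
from urllib.parse import urlencode

# Assumed base URL of an ordinary Google web search (not an official API endpoint).
GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_query(site, inurl=(), terms=()):
    """Combine free-text terms, 'inurl:' filters and one 'site:' filter."""
    parts = list(terms)
    parts += [f"inurl:{word}" for word in inurl]
    parts.append(f"site:{site}")
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so ':', '/', '*' and spaces survive the request."""
    return f"{GOOGLE_SEARCH_URL}?{urlencode({'q': query})}"


# The four example queries from the list above, in the same order.
queries = [
    build_query("match.com/Profile"),
    build_query("facebook.com/*/about"),
    build_query("match.com", inurl=["profile"]),
    build_query("match.com", inurl=["profile", "about"], terms=["female"]),
]

for q in queries:
    print(q, "->", search_url(q))
```

Encoding the query matters because the operators contain characters (':', '/', '*') that are not safe to paste into a URL as they are.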

Bing.

  • Interestingly, Bing doesn’t support ‘inurl’ or wildcards (*).
  • So 2 & 3 above (and 4, which also uses ‘inurl’) don’t work.
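If the same query list is sent to both engines, it helps to flag the ones Bing will not handle. A small sketch, assuming only the observation above (no ‘inurl’, no wildcard); the name `works_on_bing` is made up for illustration:

```python
# Operators that, per the observation above, Bing does not honour.
UNSUPPORTED_ON_BING = ("inurl:", "*")


def works_on_bing(query):
    """Return True if the query avoids 'inurl:' and the '*' wildcard."""
    return not any(token in query for token in UNSUPPORTED_ON_BING)


queries = [
    "site:match.com/Profile",
    "site:facebook.com/*/about",
    "inurl:profile site:match.com",
    "female inurl:profile inurl:about site:match.com",
]

for number, query in enumerate(queries, start=1):
    status = "ok" if works_on_bing(query) else "needs rewriting for Bing"
    print(number, query, "->", status)
```

This is only a substring check, so a search term that happens to contain an asterisk would be flagged too.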

Google API

  • I played around with the Google APIs to scrape some profiles. However, they seemed to limit the number of results sent back.
  • The code below should give you a sense of my trials in scraping Google programmatically.
    • Google detects fairly easily that you are a bot and throttles you.
  • Also, Google sometimes redirects, so be careful with the URL format you use for querying.
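One common way to cope with throttling is to retry with exponential backoff and to check where each request actually ended up. The following is a sketch of that general technique, not the code from my trials. It assumes throttling shows up as HTTP 429 or 503, and it takes a caller-supplied `fetch(url)` function returning `(status, final_url, body)`; both are assumptions made for illustration.

```python
import random
import time

# Assumption: throttling is signalled with one of these HTTP status codes.
THROTTLE_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while throttled.

    fetch must return (status, final_url, body). The caller should compare
    final_url with url to notice a redirect.
    """
    for attempt in range(max_retries + 1):
        status, final_url, body = fetch(url)
        if status not in THROTTLE_STATUSES:
            return status, final_url, body
        if attempt < max_retries:
            # Wait 1s, 2s, 4s, ... plus a little jitter before the next try.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"still throttled after {max_retries} retries: {url}")


# Demo with a stand-in fetcher: throttled twice, then a redirected success.
start_url = "https://example.com/search?q=x"
responses = iter([
    (429, start_url, ""),
    (503, start_url, ""),
    (200, "https://example.com/other?q=x", "<html></html>"),
])
delays = []  # records the waits instead of actually sleeping
status, final_url, body = fetch_with_backoff(
    lambda url: next(responses), start_url, sleep=delays.append
)
print(status, "redirected:", final_url != start_url, "waits:", len(delays))
```

In real use, `fetch` would wrap whatever HTTP client you prefer, and `sleep` would be left at its default so the waits actually happen.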

Bing API

For the Bing API, I use the one they expose in the Azure Marketplace.
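Calls to that API need an Authorization header. The sketch below shows how such a header can be built in Python, assuming the service accepts HTTP Basic authentication with the account key used as both user name and password. That scheme and the helper name `basic_auth_header` are assumptions for illustration, so check the current Bing documentation before relying on them.

```python
import base64


def basic_auth_header(account_key):
    """Build an HTTP Basic 'Authorization' header from an account key.

    Assumption: the key serves as both user name and password.
    """
    credentials = f"{account_key}:{account_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


headers = basic_auth_header("YOUR-ACCOUNT-KEY")  # placeholder, not a real key
print(headers)
```

The resulting dictionary can be passed as request headers to whichever HTTP client is used.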

 

Code:

References:

  1. http://www.codeproject.com/Tips/805923/Asynchronous-programming-in-Web-API-ASP-NET-MVC
  2. http://stackoverflow.com/questions/8463809/customize-the-authorization-http-header