Scraping Search Engines

Scraping search engines is fun.

Google.

  1. site:match.com/Profile
  2. site:facebook.com/*/about
  3. inurl:profile site:match.com 
  4. female inurl:profile inurl:about site:match.com

Note:

  • The use of the keywords ‘inurl’ and ‘site’. ‘site’ is straightforward: it restricts results to that specific site. ‘inurl’ requires that the given word be present in the URL.
  • Notice the use of the wildcard operator (*) in 2 above.
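
Queries like the ones above can also be assembled in code rather than typed by hand. The following is a minimal sketch, not a definitive implementation: the helper names `build_query` and `search_url` are made up for illustration, and the `https://www.google.com/search?q=...` URL format is an assumption about how the query would be sent.

```python
from urllib.parse import urlencode


def build_query(site, inurl=(), terms=()):
    """Compose a search string from free-text terms, inurl: filters and a site: filter."""
    parts = list(terms)
    parts += [f"inurl:{word}" for word in inurl]
    parts.append(f"site:{site}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query so that ':' and spaces survive the trip (base URL is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"


# Rebuilds query 4 from the list above.
query = build_query("match.com", inurl=["profile", "about"], terms=["female"])
print(query)
print(search_url(query))
```

Keeping the operators as separate arguments makes it easy to generate many variations of the same query (different sites, different URL words) in a loop.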

Bing.

  • Interestingly, Bing doesn’t support ‘inurl’ or wildcards (*).
  • So 2, 3 and 4 above don’t work.
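
If the same list of queries is fed to both engines, it helps to filter out the ones Bing cannot handle. The sketch below simply applies the rule stated above (no ‘inurl’, no wildcard) as a plain substring check; the name `works_on_bing` and the `UNSUPPORTED_ON_BING` tuple are hypothetical, not part of any Bing tooling.

```python
# Assumption taken from the observation above: Bing rejects 'inurl:' and '*'.
UNSUPPORTED_ON_BING = ("inurl:", "*")


def works_on_bing(query):
    """Return True when the query uses none of the operators listed as unsupported."""
    return not any(token in query for token in UNSUPPORTED_ON_BING)


queries = [
    "site:match.com/Profile",
    "site:facebook.com/*/about",
    "inurl:profile site:match.com",
    "female inurl:profile inurl:about site:match.com",
]

for number, query in enumerate(queries, start=1):
    print(number, query, works_on_bing(query))
```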

Google API

  • I played around with the Google APIs to scrape some profiles. However, they seemed to limit the number of results sent back.
  • The code at the end of this post should give you a sense of my trials in scraping Google programmatically.
    • Google detects fairly easily that you are a bot, and throttles you.
  • Also, Google sometimes redirects. So be careful with the URL format you use for querying.
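
One common way to cope with the throttling mentioned above is to pause and retry with growing delays. The following is a rough sketch of that idea, not the code used in the trials described here. It assumes throttling shows up as HTTP status 429 or 503, and `fetch_with_backoff`, its parameters and the stubbed `fetch` callable are all hypothetical names. The HTTP call is passed in as a function, so the example runs without touching the network.

```python
import random
import time

# Assumption: these status codes are treated as "you are being throttled".
THROTTLE_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); on throttling, wait longer each time and retry."""
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status not in THROTTLE_STATUSES:
            return status, body
        if attempt < max_tries - 1:
            # Exponential delay (2s, 4s, 8s, ...) plus up to 1s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return status, body


# Stub standing in for a real HTTP call: throttled twice, then a normal response.
responses = iter([(429, ""), (429, ""), (200, "<html>results</html>")])
pauses = []  # records the requested delays instead of actually sleeping

status, body = fetch_with_backoff(
    lambda url: next(responses),
    "https://www.google.com/search?q=site%3Amatch.com%2FProfile",
    sleep=pauses.append,
)
print(status, len(pauses))
```

In real use, `fetch` would wrap whatever HTTP client you prefer and `sleep` would be left at its default, so the pauses actually happen.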

Bing API

For the Bing API, I use the one they expose in the Azure Marketplace.

 

Code:

 

 
