Scraping Search Engines is Fun.
- inurl:profile site:match.com
- female inurl:profile inurl:about site:match.com
- Note the keywords ‘inurl’ and ‘site’. ‘site’ is straightforward: it restricts results to that specific site. ‘inurl’ requires that the word be present in the URL.
- Notice the use of the wildcard operator (*) in 3 above.
- Interestingly, Bing doesn’t support ‘inurl’ or wildcards (*).
- So 2 & 3 above don’t work on Bing.
- I played around with the Google APIs to scrape some profiles. However, they seemed to limit the number of results returned.
- The code below should give you a sense of my trials in scraping Google programmatically.
- Google detects fairly easily that you are a bot and throttles you.
- Also, Google sometimes redirects, so be careful with the URL format you use for querying.
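To make the query format above concrete, here is a minimal Python sketch of how such queries could be assembled and URL-encoded. It is an illustration, not the scraping code I actually ran: the helper names (`build_query`, `build_search_url`, `backoff_delays`) are made up for this example, and the base URL and the `start` paging parameter are assumptions about the public search page that you should check before relying on them. The backoff helper only computes wait times; how long Google actually throttles you is something I can only guess at.

```python
from urllib.parse import urlencode


def build_query(site, inurl_terms=(), keywords=()):
    """Assemble a query string such as 'female inurl:profile site:match.com'."""
    parts = list(keywords)
    # Each 'inurl:' term asks for that word to appear in the result URL.
    parts += [f"inurl:{term}" for term in inurl_terms]
    # 'site:' restricts results to a single site.
    parts.append(f"site:{site}")
    return " ".join(parts)


def build_search_url(query, base="https://www.google.com/search", start=0):
    """Build a search URL with the query properly percent-encoded.

    'base' and 'start' are assumptions for illustration only.
    """
    params = {"q": query}
    if start:
        params["start"] = start
    return f"{base}?{urlencode(params)}"


def backoff_delays(base=2.0, factor=2.0, retries=4):
    """Return increasing wait times (in seconds) to use between retries."""
    return [base * factor ** attempt for attempt in range(retries)]


if __name__ == "__main__":
    query = build_query("match.com", ["profile", "about"], ["female"])
    print(query)
    print(build_search_url(query))
    print(backoff_delays())
```

Encoding the query with `urlencode` rather than concatenating strings by hand keeps the colons and spaces in the operators from producing a malformed URL, which is one less reason to get redirected.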
For the Bing API, I use the one they expose in the Azure Marketplace.