I was recently hacking on a way to retrieve username, age, and gender from OkCupid. OkCupid has no API that I could use to get what I wanted, so I had to rely on scraping web pages directly.
It turns out there is a nice .NET library for this called HtmlAgilityPack.
Since I am still ramping up on web-related technologies, I was not sure what to expect. However, I was pleasantly surprised: it took me less than 10 minutes to accomplish what I was trying to do.
This exercise also made me realize there are several areas I should get familiar with. These will be the focus of my next set of blog posts:
- basic html
- xml processing
- json processing
Using XPath: http://www.4guysfromrolla.com/articles/011211-1.aspx
Specifically, see the use of ‘SelectNodes’ with XPath expressions. It has come to my rescue on more than one occasion.
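To give a feel for what selecting nodes with an XPath expression looks like, here is a minimal Python sketch. It does not use HtmlAgilityPack. Instead it assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. The markup, the class names, and the select_text helper are all made up for illustration and are not OkCupid's actual page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """<html>
  <body>
    <div class="profile">
      <span class="username">alice</span>
      <span class="age">29</span>
    </div>
    <div class="profile">
      <span class="username">bob</span>
      <span class="age">34</span>
    </div>
  </body>
</html>"""


def select_text(markup, xpath):
    """Return the text of every element matched by the XPath expression."""
    root = ET.fromstring(markup)
    # findall accepts ElementTree's limited XPath subset,
    # including attribute predicates like [@class='...'].
    return [node.text for node in root.findall(xpath)]


usernames = select_text(PAGE, ".//span[@class='username']")
ages = select_text(PAGE, ".//span[@class='age']")
print(usernames)
print(ages)
```

Real-world HTML is often not well-formed, which is why a forgiving parser is usually a better fit for actual scraping than a strict XML parser like this one.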
Or use LINQ
Equivalent code in Python:
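links = []  # `soup` is a BeautifulSoup object parsed earlier
for a in soup.find_all("a"):
    # keep only anchors that carry a non-empty href
    if a.has_attr("href") and len(a["href"]) > 0:
        links.append(a["href"])  # presumably: collect the link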
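If BeautifulSoup is not installed, the same idea can be sketched with only the standard library. This is a swapped-in technique, not the BeautifulSoup loop itself: it assumes html.parser.HTMLParser, and the LinkCollector class, the extract_links function, and the sample markup are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute and "" for href=""
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return every non-empty href found on an <a> tag, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup: two usable links, one empty href, one anchor without href.
SAMPLE = (
    '<p><a href="/profile/alice">alice</a> '
    '<a href="">empty</a> '
    '<a name="x">no href</a> '
    '<a href="http://example.com/">site</a></p>'
)
print(extract_links(SAMPLE))
```

The check on value plays the same role as the has_attr and length test in the BeautifulSoup version: anchors with a missing or empty href are skipped.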