HTML Scraping in CSharp

I was recently hacking on a way to retrieve the username, age, and gender from OkCupid profiles. OkCupid has no API that I could use to get this data, so I had to rely on scraping the web pages directly.

It turns out there is a nice .NET library for parsing HTML called HtmlAgilityPack.

Since I am still ramping up on web-related technologies, I was not sure what to expect. However, I was pleasantly surprised: it took me less than 10 minutes to accomplish what I was trying to do.

This exercise also made me realize there are several areas I should get familiar with. These will be the focus of my next set of blog posts:

  • basic HTML
  • jQuery
  • XML processing
  • JSON processing



[1]  Using XPath:

Specifically, see the usage of ‘SelectNodes’ with XPath expressions. It has come to my rescue on more than one occasion.
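To give a feel for the kind of query that ‘SelectNodes’ takes, here is a minimal sketch in Python. It is not HtmlAgilityPack: it swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup. The profile markup, the class names, and the extract_profiles helper are all made up for illustration; the real OkCupid page structure is not shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed profile markup (an assumption, not OkCupid's real HTML).
SAMPLE = """
<html><body>
  <div class="profile">
    <span class="username">alice</span>
    <span class="age">29</span>
    <span class="gender">F</span>
  </div>
  <div class="profile">
    <span class="username">bob</span>
    <span class="age">34</span>
    <span class="gender">M</span>
  </div>
</body></html>
"""


def extract_profiles(markup):
    """Return one dict per profile block, located with XPath-style queries."""
    root = ET.fromstring(markup)
    profiles = []
    # Matches every <div class="profile"> anywhere below the root element.
    for div in root.findall(".//div[@class='profile']"):
        profiles.append({
            # Each query below is relative to the current <div>.
            "username": div.findtext("span[@class='username']"),
            "age": int(div.findtext("span[@class='age']")),
            "gender": div.findtext("span[@class='gender']"),
        })
    return profiles


profiles = extract_profiles(SAMPLE)
print(profiles)
```

The shape is the same one I would expect with ‘SelectNodes’: one expression selects the repeating container, then shorter relative expressions pull the fields out of each match. Real-world HTML is rarely well-formed, which is the reason to reach for a tolerant parser such as HtmlAgilityPack in the first place.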

[2]  Or use LINQ

[3]  Equivalent code in Python.

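# Assumes: from bs4 import BeautifulSoup; soup = BeautifulSoup(html, "html.parser")
links = []
for a in soup.find_all("a"):
    if a.has_attr("href") and len(a["href"]) > 0:
        links.append(a["href"])  # assumed body; the original snippet is cut off here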
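For a version of the same idea that runs without any third-party package, here is a rough sketch using only Python's built-in html.parser. The LinkCollector class, the collect_links helper, and the sample markup are names I made up for illustration; the logic assumes the goal is simply to collect every non-empty href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has a non-empty href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


def collect_links(html):
    """Parse an HTML string and return the collected hrefs in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup: two real links, one empty href, one anchor without href.
sample = (
    '<p><a href="/profile/alice">alice</a> <a href="">empty</a> '
    '<a name="top">anchor</a> <a href="/profile/bob">bob</a></p>'
)
links = collect_links(sample)
print(links)
```

The empty href and the anchor without an href are both skipped, which mirrors the has_attr and length checks in the snippet above.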








