HTML Scraping in CSharp

I was recently hacking on a way to retrieve the username, age, and gender from OkCupid profiles. OkCupid has no API that exposes what I wanted, so I had to scrape the web pages directly.

It turns out there is a nice library for this called HtmlAgilityPack.

Since I am still ramping up on web-related technologies, I was not sure what to expect. However, I was pleasantly surprised: it took me less than 10 minutes to accomplish what I was trying to do.

This exercise also made me realize there are several areas I should get familiar with. These will be the focus of my next set of blog posts:

  • basic HTML
  • jQuery
  • XML processing
  • JSON processing

 

Tips:

[1]  Using XPath: http://www.4guysfromrolla.com/articles/011211-1.aspx

In particular, see how ‘SelectNodes’ is used with XPath expressions. It has come to my rescue on more than one occasion.

[2]  Or use LINQ

[3]  Equivalent code in Python.

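from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # assumes html holds the page source as a string
links = []
for a in soup.find_all("a"):
    # keep only anchors that carry a non-empty href
    if a.has_attr("href") and len(a["href"]) > 0:
        links.append(a["href"])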
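To tie tips [1] and [3] together, here is a rough sketch of XPath-style selection in Python using only the standard library. It is not the HtmlAgilityPack code, just an illustration of the same idea: ask for nodes with a path expression instead of walking the tree by hand. The names PAGE and select_links and the sample markup are made up for this example. Note that xml.etree.ElementTree only understands a subset of XPath and needs well-formed markup, so real pages usually call for a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for this sketch);
# real-world HTML is often not valid XML and would fail to parse here.
PAGE = """
<html>
  <body>
    <div class="profile">
      <a href="/profile/alice">alice</a>
      <a href="">empty link</a>
      <a name="anchor-only">no href</a>
    </div>
    <a href="/about">About</a>
  </body>
</html>
"""


def select_links(markup):
    """Return the non-empty href values of all <a> elements, in document order."""
    root = ET.fromstring(markup)
    # ".//a[@href]" matches every <a> descendant that has an href attribute.
    anchors = root.findall(".//a[@href]")
    # Drop anchors whose href is present but empty.
    return [a.get("href") for a in anchors if a.get("href")]


links = select_links(PAGE)
print(links)  # ['/profile/alice', '/about']
```

The path expression does the filtering that the loop in tip [3] does with has_attr, which is roughly the convenience that ‘SelectNodes’ offers on the C# side.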

 

Code:

 

 

 

 

 

 
