The crawler and the headless browser are in place, so it’s time to get some parsing done. So far I only have the raw source of the pages I crawled and scraped. Now I want to extract the specific content and data I’m interested in, and for that I need to parse the source of each page and pick out specific elements.
There are multiple parsing frameworks out there, and Abot, the crawler I chose, includes both AngleSharp and HtmlAgilityPack. That’s enough for me, so I won’t spend any time looking for something else.
Elements can be located using XPath or by selecting on element class, type or name.
Let’s say we have the following HTML source code:
<html>
  <head>
    <title>Razor Blackwidow - BestStore</title>
  </head>
  <body>
    <div id="header">
      <h1>Razor Blackwidow</h1>
      <p class="shortdescription">
        The Razor Blackwidow is the coolest keyboard ever!
      </p>
    </div>
    <div id="main">
      <div class="section features">
        <h2>Great new features</h2>
        <ul>
          <li>Mechanical keyboard</li>
          <li>Cool lights</li>
          <li>Solid metal casing</li>
        </ul>
      </div>
    </div>
  </body>
</html>
From this I want to get the name of the product. As you can see, it is found in an h1 element inside a div with the id “header”.
With AngleSharp this would look like this:
var productName = angleSharpHtmlDocument.QuerySelector("div#header h1").TextContent;
I could have used only “h1” in the selector, since there’s only one h1 on the page, but I added the div to show how queries are built up.
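To make the one-liner above a bit more concrete, here is a minimal sketch that parses a page source and pulls out both the product name and the feature list. It assumes AngleSharp 0.10 or later (older versions expose the parser under a different namespace and method name), and the class and method names are my own, made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical helper class, not part of Abot or AngleSharp.
public static class ProductPageParser
{
    // Returns the text of the <h1> inside <div id="header">,
    // or null when the page has no such element.
    public static string GetProductName(string html)
    {
        var document = new HtmlParser().ParseDocument(html);

        // "#header" matches on id; ".header" would match on class instead.
        return document.QuerySelector("div#header h1")?.TextContent.Trim();
    }

    // Returns the text of every <li> inside the div with class "features".
    public static List<string> GetFeatures(string html)
    {
        var document = new HtmlParser().ParseDocument(html);

        return document.QuerySelectorAll("div.features li")
            .Select(li => li.TextContent.Trim())
            .ToList();
    }
}
```

The null-conditional operator matters in a crawler: not every page will have the element, and QuerySelector returns null when nothing matches.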
The code would differ a little had I chosen HtmlAgilityPack instead, but AngleSharp is enough for me.
When running the headless browser, I have to use my WebDriver to parse the page instead. That looks like this:
IWebElement element = driver.FindElement(By.XPath("//div[@id='header']/h1")); var productName = element.GetAttribute("textContent");
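For comparison, this is roughly what the same lookup could look like with HtmlAgilityPack, which works with XPath rather than CSS selectors. Treat it as a sketch: the class and method names are hypothetical, and the XPath expression is simply the id-based lookup for the sample page above.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, used only for illustration.
public static class HapProductParser
{
    // Returns the text of the <h1> directly inside <div id="header">,
    // or null when no such element exists.
    public static string GetProductName(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var heading = document.DocumentNode
            .SelectSingleNode("//div[@id='header']/h1");

        // InnerText keeps HTML entities such as &amp; as-is, so decode them.
        return heading == null
            ? null
            : HtmlEntity.DeEntitize(heading.InnerText).Trim();
    }
}
```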
In this case I use XPath just to show the difference. XPath is more powerful than what’s currently available in AngleSharp, so if AngleSharp is not enough for you, there are other frameworks.
When it comes to finding the XPath for an element in a page, there’s a very helpful Chrome extension called SelectorGadget (http://selectorgadget.com/).
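One example of that extra power is selecting by text content, which plain CSS selectors cannot do. The sketch below uses HtmlAgilityPack as the "other framework" and finds the list that follows a heading with a given text. The class name, method name and parameter are assumptions of mine for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, used only for illustration.
public static class FeatureListParser
{
    // Returns the <li> texts of the first <ul> that follows an <h2>
    // whose text equals headingText. Assumes headingText contains no
    // single quote, since it is embedded in the XPath string as-is.
    public static List<string> GetItemsUnderHeading(string html, string headingText)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var xpath = "//h2[normalize-space()='" + headingText + "']"
                  + "/following-sibling::ul[1]/li";

        // SelectNodes returns null (not an empty collection) on no match.
        var items = document.DocumentNode.SelectNodes(xpath);
        if (items == null)
        {
            return new List<string>();
        }

        return items
            .Select(li => HtmlEntity.DeEntitize(li.InnerText).Trim())
            .ToList();
    }
}
```

Calling it with the sample page and the heading “Great new features” should give back the three list items. Matching on heading text is more fragile than matching on an id or class, so I would only reach for it when the markup offers nothing better.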
Now I have all the parts I need in order to put my solution in place.