For part of our business we rely a great deal on reliable web scraping. The solution in place is Scrapy on Python. Scrapy is a great framework for crawling and scraping, especially for general purposes. However, looking forward we have some more specific needs, at least that’s my feeling.

This is my journey to a hopefully solid crawler:

  1. Dipping my toes
  2. Head on with headless browsers
  3. Diving into scraping
  4. Deep dive for a solution
  5. Making it reusable
  6. Getting things running on schedule

Adapting our existing solution will take some work, and since I have already had to make some changes to it, it feels a bit too much like a patchwork. I want to rebuild the logic to be more adaptable and configurable, and we need functionality to execute JavaScript.

With this in mind, and with my primary platform being .NET, not Python, I’ve decided to look at some different solutions to rebuild our logic around and run it in Azure.

The first step is to find a good crawler. One option is to build my own, and while the basics are pretty easy, crawling can become very complicated once you take bot traps, recurring links, and more into account. So instead I will look at ready-made solutions, both free and commercial.
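To give a feel for why recurring links and bot traps matter, here is a minimal sketch in Python, the language of our current Scrapy setup. It is not how Abot or Scrapy work internally, just an illustration: the `FAKE_SITE` dictionary, the `normalize` and `crawl` helpers, and the `max_depth` and `max_pages` limits are all made up for this example, and no real HTTP requests are made.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory "site": page URL -> links found on that page.
FAKE_SITE = {
    "http://example.com/": ["/a", "/b", "/#top"],
    "http://example.com/a": ["/", "/b", "/calendar?page=1"],
    "http://example.com/b": ["/a"],
    # A simple bot trap: every page links to yet another "next" page.
    "http://example.com/calendar?page=1": ["/calendar?page=2"],
    "http://example.com/calendar?page=2": ["/calendar?page=3"],
    "http://example.com/calendar?page=3": ["/calendar?page=4"],
}


def normalize(base, href):
    """Resolve a link against the page it was found on and drop the #fragment."""
    absolute = urljoin(base, href)
    return urldefrag(absolute).url


def crawl(start, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl that skips seen URLs and caps depth and page count."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper (limits bot traps)
        for href in get_links(url):
            link = normalize(url, href)
            if link not in seen:  # recurring links are queued only once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("http://example.com/", lambda u: FAKE_SITE.get(u, []))
print(pages)
```

Even this toy version needs URL normalization, a set of seen URLs, and depth and page limits, and it still ignores things like robots.txt, politeness delays, retries, and redirects. That is the part I would rather get from a ready-made crawler.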

My first stop is the open source framework Abot, and its commercial version AbotX.
This seems to be a very stable and nice framework with activity in the repo. The commercial version is for when you need to execute/render JavaScript, plus some other nice features. The examples are okay and I got a trial of AbotX working, but I’m not really sure about the JavaScript execution.

Then I tried a commercial product, Visual Web Ripper from Sequentum.
I’m sure this is a great product and I found some good reviews, but the primary use is, as the name says, visual, and I want a solution that is built for server use. Moving on.

NightmareJS is a browser automation library using Electron to parse websites. It runs on Node.js and is easy to install and get going. Unfortunately this is a framework for building automations, not a web crawler. If I use this I have to build the whole logic for the crawler myself, and that’s what I don’t want to do.

Web Content Extractor from Newprosoft is another visual crawler and extractor. Not what I’m looking for, but I’m mentioning it here anyway in case someone is interested.

The last crawler I’m looking at is once again Scrapy. I’ve previously implemented Selenium in one of our spiders to execute JavaScript, with an okay result. But our solution is old and needs improvements, and there are now other options, for example using Splash or Portia instead of Selenium. However, this runs best in Python, and even if it’s a good solution I’m still a better .NET programmer than Python programmer.

In my mind I’m starting to get a good picture of what I need to do. I need a web crawler, a headless browser, a parsing framework and a way to schedule and run my spiders.

After trying out the different solutions I’ve chosen Abot as my crawler. In my tests it worked great. Instead of going with the commercial version AbotX to get JavaScript execution I will add a headless browser.
More about that in the next post.
