Part of our business relies heavily on dependable web scraping. The solution currently in place is Scrapy on Python. Scrapy is a great framework for crawling and scraping, especially for general-purpose work. Looking forward, however, we have some more specific needs, or at least that's my feeling.
This is my journey to a hopefully solid crawler:
- Dipping my toes
- Head on with headless browsers
- Diving into scraping
- Deep dive for a solution
- Making it reusable
- Getting things running on schedule
With this in mind, and with .NET rather than Python as my primary platform, I've decided to look at some different solutions to rebuild our logic around and run in Azure.
The first step is to find a good crawler. One option is to build my own, and while the basics are pretty easy, crawling becomes very complicated once you take bot traps, recurring links, and more into account. So instead I will look at ready-made solutions, free and commercial.
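To illustrate why I don't want to go down that road, here is a deliberately naive crawler sketch. The seed URL is a placeholder, and everything that makes real crawling hard is missing; the comments point out what a production crawler would have to add.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class NaiveCrawler
{
    static void Main()
    {
        var client = new WebClient();
        var queue = new Queue<Uri>();
        var seen = new HashSet<string>();
        queue.Enqueue(new Uri("https://example.com/")); // hypothetical seed

        // The "easy" part: fetch, find links, repeat. No politeness delays,
        // no robots.txt, no bot-trap detection, no URL canonicalization,
        // no retry logic -- all things a real crawler needs.
        while (queue.Count > 0 && seen.Count < 100)
        {
            Uri current = queue.Dequeue();
            if (!seen.Add(current.AbsoluteUri))
                continue; // naive dedup; misses recurring URLs that differ only trivially

            string html;
            try { html = client.DownloadString(current); }
            catch (WebException) { continue; } // real code would log and retry

            // Crude regex link extraction; only absolute http(s) links,
            // relative links are silently ignored.
            foreach (Match m in Regex.Matches(html, "href=\"(https?://[^\"]+)\""))
                if (Uri.TryCreate(m.Groups[1].Value, UriKind.Absolute, out Uri found))
                    queue.Enqueue(found);
        }
    }
}
```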
My first stop is the open source framework Abot, and its commercial version, AbotX.
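To give a feel for Abot, here is roughly what a minimal crawl looks like, adapted from the example in Abot's README (Abot 1.x-style API; details may differ between versions, and the URL is a placeholder):

```csharp
using System;
using System.Net;
using Abot.Crawler;
using Abot.Poco;

class AbotExample
{
    static void Main()
    {
        // PoliteWebCrawler adds politeness (crawl delays etc.) on top of the core crawler.
        var crawler = new PoliteWebCrawler();

        crawler.PageCrawlCompletedAsync += (sender, e) =>
        {
            CrawledPage page = e.CrawledPage;
            if (page.WebException != null || page.HttpWebResponse.StatusCode != HttpStatusCode.OK)
                Console.WriteLine("Crawl of {0} failed", page.Uri.AbsoluteUri);
            else
                Console.WriteLine("Crawled {0}", page.Uri.AbsoluteUri);
        };

        // Blocks until the crawl completes.
        CrawlResult result = crawler.Crawl(new Uri("https://example.com/"));
        Console.WriteLine(result.ErrorOccurred ? "Crawl ended with an error" : "Crawl completed");
    }
}
```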
Then I tried a commercial product, Visual Web Ripper from Sequentum.
I'm sure this is a great product, and I found some good reviews, but its primary use is, as the name suggests, visual, and I want a solution built for server-side use. Moving on.
NightmareJS is a browser automation library that uses Electron to render websites. It runs on Node and is easy to install and get going with. Unfortunately, it's a framework for building the automation itself, not a web crawler; if I used it, I would have to build all the crawler logic myself, and that's exactly what I don't want to do.
Web Content Extractor from Newprosoft is another visual crawler and extractor. It's not what I'm looking for, but I'm mentioning it here anyway in case someone is interested.
In my mind I'm starting to get a good picture of what I need: a web crawler, a headless browser, a parsing framework, and a way to schedule and run my spiders.
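To make the parsing piece concrete, here is what extraction could look like with HtmlAgilityPack, a common .NET HTML parser (my placeholder choice for this example; nothing is decided yet):

```csharp
using System;
using HtmlAgilityPack;

class ParseExample
{
    static void Main()
    {
        // Load a page directly for demo purposes; in the final pipeline the
        // HTML would come from the crawler or the headless browser instead.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/"); // placeholder URL

        // XPath query for all anchor tags with an href attribute.
        // SelectNodes returns null when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return;

        foreach (HtmlNode link in links)
            Console.WriteLine("{0} -> {1}",
                link.InnerText.Trim(),
                link.GetAttributeValue("href", string.Empty));
    }
}
```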
More about that in the next post.