Nestor Encinas, 6.10.2017

Best Scraping Tools in 2017

Data scraping is a technique for extracting data from human-readable output produced by another program. Extracting data from websites is called web scraping; it is sometimes also referred to as web harvesting or web data extraction.
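As a minimal illustration of the idea (a toy sketch, not any of the tools reviewed below), a few lines of Python with only the standard library can already pull structured data out of an HTML page. The markup and class names here are made up for the example:

```python
from html.parser import HTMLParser

# A toy page such as a scraper might download; in practice this
# string would come from an HTTP request.
PAGE = """
<ul>
  <li class="product">Keyboard</li>
  <li class="product">Mouse</li>
  <li class="ad">Buy now!</li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_data(self, data):
        if self.in_product:
            self.products.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

parser = ProductParser()
parser.feed(PAGE)
print(parser.products)  # → ['Keyboard', 'Mouse']
```

Real pages are of course messier than this, which is exactly why the tools below exist.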

Here at Kurzor we have completed numerous projects using various web scraping techniques. We tried a DOM parsing approach using the Selenium driver. This approach requires a programmer to define the whole sequence of steps and actions needed to extract data from a web page. It takes expert skills in a programming language, HTML, the DOM structure and various selector types (XPath, CSS, jQuery).
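To give an idea of what the selector side of such a script involves, here is a minimal Python sketch. A real Selenium script additionally needs a live browser and a driver binary; the XPath idea itself can be tried with nothing but the standard library, using made-up markup (note that `ElementTree` supports only a small XPath subset):

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed markup standing in for a downloaded page.
HTML = """
<div>
  <article><h2>First post</h2></article>
  <article><h2>Second post</h2></article>
</div>
"""

root = ET.fromstring(HTML)

# The same kind of path expression you would hand to Selenium's
# XPath locator: every <h2> inside an <article>, anywhere in the tree.
titles = [h2.text for h2 in root.findall(".//article/h2")]
print(titles)  # → ['First post', 'Second post']
```

In a Selenium script the equivalent step is a `find_elements` call with an XPath or CSS locator, wrapped in the navigation and click logic the tool has to perform first.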

A simpler but less powerful approach is to use a service that does not require you to be an IT guru. So if you need a script to grab all products from an e-shop or all articles from a blog, or to collect some images, it is easy to try one of the following tools.

Webscraper.io

Web Scraper is a company specializing in data extraction from web pages. It offers two options: a free Google Chrome extension, and a cloud-based Web Scraper service.

Web Scraper Extension (Free!)

Using the extension, you can create a plan (sitemap) of how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper navigates the site accordingly and extracts all the data. Scraped data can later be exported as CSV.
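To give an idea of what such a sitemap contains, here is a rough sketch built as a Python dictionary and dumped to JSON. The exact schema belongs to the extension, so treat the field names below as illustrative, not authoritative:

```python
import json

# Roughly the shape of an exported Web Scraper sitemap: a start URL
# plus a list of selectors describing what to extract.
# Field names are illustrative approximations of the real schema.
sitemap = {
    "_id": "example-blog",
    "startUrl": ["https://example.com/blog"],
    "selectors": [
        {
            "id": "title",
            "type": "SelectorText",
            "parentSelectors": ["_root"],
            "selector": "h2.post-title",
            "multiple": True,
        }
    ],
}

print(json.dumps(sitemap, indent=2))
```

In practice you never write this JSON by hand: the extension generates it as you click elements on the page, and you can export it to share or reuse the scraping plan.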

Cloud Web Scraper

The cloud version lets you extract large amounts of data, run multiple scraping jobs at once, and even run them on a set schedule.


Image: Webscraper.io Chrome extension in action 

Pricing: The Chrome extension is free of charge. Prices for the cloud service start at $50 for a 100,000-page credit and go up to $250 for a 2,000,000-page credit.

Pros:

  • You develop scripts quickly for pages with a regular structure.
  • You can play the script and see the behavior directly in Chrome browser.
  • You can define most elements on a page by just clicking on them.
  • An API that calls your webhooks when a scraping job finishes is coming soon.
  • You don't need programming knowledge to prepare the scripts.
  • The data never expire.

Cons:

  • A script that works in the Chrome extension sometimes produces different output when run in the cloud service.
  • Some advanced selectors need to be defined by the user as XPath or jQuery selectors.
  • You can't scrape pages that require a login.

Import.io

Import.io is a web-based platform to extract data from websites without writing any code.

Users enter a URL and the app extracts the data it thinks you need. If the extracted data is not what you wanted, an interface lets you click and select the specific data you want to extract. Data collected by users is stored on Import.io's cloud servers and can be downloaded as CSV, Excel, Google Sheets or JSON, or accessed via API.


Image: Import.io interface

Pricing: 

  • The desktop application is free of charge but has limited options.

  • Essential: $299 for 5k queries (expires after one month).

  • Professional: $1,999 for 100k queries (expires after one year).

  • Enterprise: $4,999 for 500k queries (expires after one year).

Pros:

  • No coding.

  • Automatic data & image extraction.

  • Get data from behind logins.

  • Public APIs.

  • Desktop application works on Windows, Mac and Linux.

Cons:

  • A bit too expensive for what it offers.

  • Queries have a limited lifetime and expire after a set number of days.

  • Doesn't work on dynamic pages.

ParseHub

ParseHub is web scraping software that supports complicated data extraction from sites that use AJAX, JavaScript, redirects and cookies. It is equipped with machine learning technology that can read and analyse documents on the web to deliver relevant data. ParseHub is available as a desktop client for Windows, macOS and Linux, and there is also a web app that you can use within the browser. You can have up to 5 crawl projects with the free plan.


Image: ParseHub interface for scraping

Pricing:

  • Everyone: Free, 200 pages, data retention for 14 days.

  • Standard: $149 per month, 10k pages, 20 private projects and data retention for 14 days.

  • Professional: $499 per month, unlimited pages, 120 private projects and data retention for 30 days.

  • Enterprise: contact the company for pricing details; unlimited pages, unlimited projects and data retention for 30 days.

Pros:

  • Desktop client for Windows, Mac and Linux.

  • REST API and webhooks.

  • Get data from behind logins.

  • No coding.

Cons:

  • Short data retention period.

  • Takes a while to learn how to use properly.

  • XPath selectors.

Agenty

Agenty is a hosted web scraping tool. It offers three options: a hosted application, a desktop application and a Chrome extension.

Hosted Application

A cloud-hosted web scraping app for crawling the web at scale, extracting data from static and dynamic websites automatically. It is API-ready, requires no programming, and has a free plan.

Desktop Application

Lightning-fast, self-service data extraction software for Windows, designed to extract data from websites using CSS selectors or regular expressions in minutes.

Advanced Web Scraper (Chrome extension)

A data scraping extension by Agenty that extracts data from websites using point-and-click CSS selectors, with a real-time preview of the extracted data and quick export to JSON/CSV/TSV.


Image: Agenty agents overview

Pricing:

  • Starter: $29 per month or $296 per year, 5k monthly pages, up to 3 scraping agents (an agent is a container holding the configuration, such as fields, selectors and URLs, for scraping a particular website), 30 days of data history, 1 user.

  • Basic: $49 per month or $500 per year, 25k monthly pages, up to 10 scraping agents, 30 days of data history, 3 users.

  • Professional: $99 per month or $1,010 per year, 100k monthly pages, up to 25 scraping agents, 60 days of data history, 5 users.

  • Enterprise: contact the company for pricing details; page volume customizable to your needs, unlimited scraping agents, 180 days of data history, unlimited users.

Pros:

  • Get data behind logins.

  • Get data from form submission pages.

  • Scheduled runs.

  • Write your own script to modify the scraped data into your choice of format.

Cons:

  • Short data retention period.

  • The number of scraping agents included is low; you need to pay for the Professional plan to get a decent number of agents for scraping pages.

  • Desktop app only works on Windows.

Octoparse

Octoparse is a cloud-based web crawler that helps you extract web data without coding, in real time. It simulates human operation to interact with web pages. You can use the point-and-click UI to bulk-extract data from web pages (including those using AJAX and JavaScript) and export it in various formats of your choice, such as CSV, Excel, HTML, TXT, or a database (MySQL, SQL Server and Oracle).

Octoparse’s cloud service (available in paid editions) can extract and store large amounts of data to meet large-scale extraction needs.

Pricing:

  • Basic: Free to use. You can run 10 scripts and extract an unlimited number of web pages.

  • Standard Plan: $75 per month when billed annually or $89 when billed monthly. You can run 100 scripts on 6 cloud servers and extract unlimited web pages.

  • Professional Plan: $158 per month when billed annually or $189 when billed monthly. You can run 200 scripts on 14 cloud servers and extract unlimited web pages.

  • Professional Data Service: starting from $299. Contact the company and they will do the work for you.

Pros:

  • Select the data to be scraped with mouse clicks. No coding.

  • It has API.

  • Works on dynamic pages.

  • Automatically generates XPath selectors.

Cons:

  • Only works on Windows.

  • Requires some tutorials to learn how to use it.

  • The program hangs on some pages.

Dexi.io

Dexi.io is a web scraping tool aimed at IT professionals. With its web data extraction and robotic process automation (RPA) features, you can extract and transform data from almost any source.


Image: Dexi interface in action

Pricing:

  • Free trial.

  • Standard: $119 per month (or $105 per month if paid annually); you can only run one script at a time.

  • Professional: $399 per month (or $355 per month if paid annually); you can run three scripts at a time.

  • Corporate: $699 per month (or $625 per month if paid annually); you can run six scripts at a time.

Pros:

  • Easy to use GUI.

  • Run executions on schedules.

  • Supports any website.

  • Integration with Amazon S3, Box, DropBox, Google Drive and Web Hooks.

Cons:

  • Lack of documentation.

  • Quite expensive.

  • Limited number of scripts you can run at the same time.

Apifier

Apifier is a web scraping tool that extracts structured data from pages.

This tool doesn't have a user interface where you select the data to extract by clicking with your mouse; instead, you describe the data you want in JavaScript, which makes it well suited for scraping websites that don't have a regular structure.

This is the main difference between the previous tools and this one.

Pricing:

  • Developer: Free, 10k monthly pages, 1 parallel request (the maximum number of web pages that can be requested at a time by all your crawlers) and 7 days of data retention.

  • XS: $19 per month, 25k monthly pages, 3 parallel requests and 30 days of data retention.

  • S: $49 per month, 100k monthly pages, 5 parallel requests and 30 days of data retention.

  • Business M: $129 per month, 400k monthly pages, 10 parallel requests and 30 days of data retention.

  • L: $349 per month, 1.5M monthly pages, 20 parallel requests and 30 days of data retention.

  • XL: $999 per month, 5M monthly pages, 50 parallel requests and 30 days of data retention.

  • Enterprise: contact the company for pricing details; unlimited monthly pages, unlimited parallel requests and unlimited data retention.

Pros:

  • Has a lot of documentation.

  • Can scrape pages with irregular structure.

  • Works on dynamic websites.

  • Supports any website.

  • jQuery integration.

  • You can schedule the script.

Cons:

  • No point-and-click user interface; you need programming skills.

  • Short data retention period.

Conclusion

Each of these services offers a slightly different approach and pricing; some will suit your project better than others. The main goal is to select the best scraping service for your project. 2017 can definitely be seen as the year when data extraction from web pages earned its place in Kurzor's portfolio.

Disclaimer: Prices in the article are from October 2017. We are not paid or otherwise advantaged by any of the services mentioned for promoting them.

Nestor Encinas

Web developer