Scrape a website efficiently, block by block, page by page.
This is a Cheerio-based scraper, useful for extracting data from a website using CSS selectors.
The motivation behind this package is to provide a simple Cheerio-based scraping tool that can divide a website into blocks and transform each block into a JSON object using CSS selectors.
Related projects:

- https://github.com/cheeriojs/cheerio
- https://github.com/chriso/curlrequest
- https://github.com/kriskowal/q
- https://github.com/dharmafly/noodle
Install the module with: npm install cheers
Configuration options:
- config.url: the URL to scrape
- config.blockSelector: the CSS selector to apply on the page to divide it into scraping blocks. This field is optional ("body" is used by default)
- config.scrape: the definition of what you want to extract in each block. Each key has two mandatory attributes: selector (a CSS selector, or "." to stay on the current node) and extract. The possible values for extract are text, html, outerHTML, or the name of an attribute of the HTML element (e.g. "href")
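For instance, since blockSelector is optional, a minimal configuration can rely on the "body" default and simply name the fields to extract. The sketch below is illustrative only: the URL, the key names (heading, firstLink) and the selectors are placeholders chosen for the example, not part of the library.

var cheers = require('cheers');

// no blockSelector here, so the whole page body is treated as a single block
var minimalConfig = {
  url: "http://example.com/",   // placeholder URL
  scrape: {
    heading: {
      selector: "h1",           // CSS selector applied inside the block
      extract: "text"           // extract the element's text content
    },
    firstLink: {
      selector: "a",
      extract: "href"           // extract the value of the href attribute
    }
  }
};

cheers.scrape(minimalConfig).then(function (results) {
  console.log(JSON.stringify(results));
}).catch(function (error) {
  console.error(error);
});

The fuller example below scrapes the Echo JS front page, dividing it into one block per article: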
var cheers = require('cheers');

// let's scrape this excellent JS news website
var config = {
  url: "http://www.echojs.com/",
  blockSelector: "article",
  scrape: {
    title: {
      selector: "h2 a",
      extract: "text"
    },
    link: {
      selector: "h2 a",
      extract: "href"
    },
    articleInnerHtml: {
      selector: ".",
      extract: "html"
    },
    articleOuterHtml: {
      selector: ".",
      extract: "outerHTML"
    }
  }
};

cheers.scrape(config).then(function (results) {
  console.log(JSON.stringify(results));
}).catch(function (error) {
  console.error(error);
});
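The exact shape of results is not spelled out above, but since each block is turned into a JSON object whose keys come from config.scrape, a reasonable assumption is an array with one object per matched article block. Under that assumption, the results from the example above could be consumed like this:

cheers.scrape(config).then(function (results) {
  // assumption: results is an array with one object per "article" block,
  // carrying the keys defined in config.scrape (title, link, ...)
  results.forEach(function (article) {
    console.log(article.title + " -> " + article.link);
  });
}).catch(function (error) {
  console.error(error);
});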
Roadmap:

- Website pagination
- Option to use a headless browser
- Unit tests
Cheers!
Copyright (c) 2014 Fabien Allanic
Licensed under the MIT license.