GitHub - ReedD/crawler: Chromium / Puppeteer site crawler

Chromium / Puppeteer site crawler


This crawler performs a breadth-first search (BFS) starting from a given site entry point. It does not leave the entry point's domain, and it does not crawl a page more than once. Given a shared Redis host or cluster, the crawler can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with a URL, its outbound URLs, and a radius from the origin.
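The traversal described above can be illustrated with a minimal in-memory sketch. It is not the project's actual code: the `crawl` function and the `getOutboundUrls` callback are hypothetical names. A plain `Set` and array stand in for the shared Redis state, and the returned array stands in for the MongoDB collection. The sketch also assumes that "radius" means the number of link hops from the entry point and that "domain" means an exact hostname match.

```javascript
// Hypothetical sketch of a same-domain BFS crawl; no Puppeteer, Redis, or MongoDB involved.
// getOutboundUrls(url) is an assumed callback that returns the links found on a page.
function crawl(entryUrl, getOutboundUrls) {
  const origin = new URL(entryUrl);
  const start = origin.href;

  const seen = new Set([start]);             // guarantees each page is crawled at most once
  const queue = [{ url: start, radius: 0 }]; // FIFO queue gives breadth-first order
  const pages = [];                          // stand-in for the stored page documents

  while (queue.length > 0) {
    const { url, radius } = queue.shift();

    // Resolve relative links against the current page URL.
    const outboundUrls = getOutboundUrls(url).map((href) => new URL(href, url).href);
    pages.push({ url, outboundUrls, radius });

    for (const next of outboundUrls) {
      // Assumption: "same domain" is an exact hostname match.
      if (new URL(next).hostname !== origin.hostname) continue;
      if (seen.has(next)) continue;
      seen.add(next);
      queue.push({ url: next, radius: radius + 1 });
    }
  }

  return pages;
}
```

In a distributed setup, the `seen` set and the `queue` would have to live in a store that all workers share, which is presumably the role Redis plays here.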

Installation

Usage

Basic

./crawl -u https://www.dadoune.com

Distributed

Terminal 1

./crawl -u https://www.dadoune.com

Debug

DEBUG=crawler:* ./crawl -u https://www.dadoune.com

Options