GitHub - brave/pagegraph-crawl: Gather pagegraph data from all over the internet
Command line tool for crawling web pages with PageGraph.
Install
Requires a recent version of node (current testing is done on v23.4.0).
npm install
npm run build
Test
The tests are defined in test/test.js. Test parameters are defined in test/config.js and can be overridden via environment variables. You need to specify a PageGraph binary path.
Usage
Since PageGraph is built as part of Brave Nightly, you can point the binary path at your local Brave Nightly installation.
npm run crawl -- \
  -b /Applications/Brave\ Browser\ Nightly.app/Contents/MacOS/Brave\ Browser\ Nightly \
  -u https://brave.com \
  -t 5 \
  -o output/ \
  --debug debug
The -t option specifies how many seconds to crawl the URL given in -u, using the PageGraph binary given in -b.
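To crawl more than one page, the same flags can be wrapped in a small loop. The following is a minimal sketch, not part of the project: the `crawl_all` function, the `PAGEGRAPH_BINARY` and `DRY_RUN` environment variables, and the second URL are hypothetical names introduced for illustration. It uses only the `-b`, `-u`, `-t`, and `-o` flags shown above and assumes it is run from the repository root after the build step.

```shell
#!/usr/bin/env bash
# Sketch: crawl several URLs one after another with the flags from the example above.
set -euo pipefail

# Assumption: PAGEGRAPH_BINARY is a hypothetical variable for this script only;
# the default is the macOS Brave Nightly path from the usage example.
BINARY="${PAGEGRAPH_BINARY:-/Applications/Brave Browser Nightly.app/Contents/MacOS/Brave Browser Nightly}"
SECONDS_PER_URL=5   # value passed to -t
OUT_DIR="output/"   # value passed to -o

crawl_all() {
  local url
  for url in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # DRY_RUN=1 (hypothetical) prints the command instead of running it.
      printf 'npm run crawl -- -b %q -u %q -t %q -o %q\n' \
        "$BINARY" "$url" "$SECONDS_PER_URL" "$OUT_DIR"
    else
      npm run crawl -- -b "$BINARY" -u "$url" -t "$SECONDS_PER_URL" -o "$OUT_DIR"
    fi
  done
}

# Example: DRY_RUN=1 crawl_all https://brave.com https://example.com
```

Running with `DRY_RUN=1` first is a cheap way to check the binary path and quoting before starting real crawls.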
You can see all supported options:
NOTE: PageGraph currently does not track puppeteer / automation scripts, and so modifying or interacting with the document through devtools/puppeteer while recording a PageGraph file will likely fail.