Getting Started
1. Install
go get github.com/antchfx/antch
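The walkthrough below uses the antch and htmlquery packages plus a few packages from the standard library. Assuming you keep the whole example in a single main package, the import block would look roughly like this (htmlquery may need its own go get depending on your module setup):
import (
	"encoding/json"
	"net/http"
	"os"

	"github.com/antchfx/antch"
	"github.com/antchfx/htmlquery"
)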
2. Defining our Item
type item struct {
	Title string `json:"title"`
	Link  string `json:"link"`
	Desc  string `json:"desc"`
}
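Note that item is a plain struct with no methods to implement: the spider in the next step simply sends *item values on a chan<- antch.Item, so any struct you define can serve as an item.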
3. Our first Spider
Create a struct called dmozSpider that implements antch's Handler interface.
type dmozSpider struct {}
func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}
dmozSpider extracts data from each received page and passes it into the Pipeline:
doc, err := antch.ParseHTML(res)
if err != nil {
	return
}
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
	v := new(item)
	v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
	v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
	v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
	c <- v
}
The htmlquery package supports extracting data with XPath expressions; each extracted item is then sent to the Go channel c.
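One thing to watch for: htmlquery.FindOne returns nil when an XPath expression matches nothing, so a slightly more defensive spider could guard each lookup with a small helper. A minimal sketch (the innerTextOf name is just for illustration; node is a *html.Node from golang.org/x/net/html):
// innerTextOf returns the inner text of the first node matching expr,
// or an empty string when there is no match.
func innerTextOf(node *html.Node, expr string) string {
	if n := htmlquery.FindOne(node, expr); n != nil {
		return htmlquery.InnerText(n)
	}
	return ""
}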
4. Our first Pipeline
Create a new Pipeline called jsonOutputPipeline that implements the PipelineHandler interface.
jsonOutputPipeline serializes each received Item to JSON and prints it to the console.
type jsonOutputPipeline struct {}
func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	os.Stdout.Write(b)
}
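If you prefer one JSON object per line on the console, a small variant of the same method can use json.Encoder, which appends a newline after each value:
func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
	// Encode writes the JSON encoding of v to stdout, followed by a newline.
	if err := json.NewEncoder(os.Stdout).Encode(v); err != nil {
		panic(err)
	}
}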
5. Crawler
Create a new web crawler instance.
crawler := antch.NewCrawler()
You can enable middleware for HTTP cookies or robots.txt if you want:
- enable the cookies middleware for the web crawler (see the sketch after this list).
- you can even register custom middleware for the web crawler:
crawler.UseMiddleware(CustomMiddleware())
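As a minimal sketch of the built-in middleware, and assuming the Crawler in your installed version exposes the UseCookies and UseRobotstxt helper methods, enabling both looks like this:
crawler.UseCookies()   // enable the HTTP cookies middleware
crawler.UseRobotstxt() // respect robots.txt rules while crawling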
6. Register Spider and Pipeline
Register dmozSpider with the web crawler instance.
dmozSpider will process every page whose URL matches the dmoztools.net pattern:
crawler.Handle("dmoztools.net", &dmozSpider{})
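Handle can be called more than once if you crawl several sites; for example, with a hypothetical second spider otherSpider:
crawler.Handle("example.org", &otherSpider{})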
Register jsonOutputPipeline with the web crawler instance. The call below also chains in newTrimSpacePipeline(), an additional pipeline that is not defined in this walkthrough:
crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())
7. Running
startURLs := []string{
	"http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
	"http://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)
Enjoy it.