Getting Started

1. Install

go get github.com/antchfx/antch

2. Defining our Item

type item struct {
    Title string `json:"title"`
    Link  string `json:"link"`
    Desc  string `json:"desc"`
}
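To see what the json struct tags do, here is a minimal sketch that marshals one item with the standard encoding/json package. The sample values and the encodeItem helper are illustrative assumptions and are not part of antch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// item mirrors the struct defined above; the tags set the JSON key names.
type item struct {
	Title string `json:"title"`
	Link  string `json:"link"`
	Desc  string `json:"desc"`
}

// encodeItem is a hypothetical helper that returns the JSON form of an item.
func encodeItem(it item) (string, error) {
	b, err := json.Marshal(it)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Sample values, made up for illustration.
	s, err := encodeItem(item{Title: "Example Book", Link: "http://example.com/", Desc: "A sample entry"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Without the tags, the keys would be the exported field names (Title, Link, Desc) rather than the lowercase names.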

3. Our first Spider

Create a struct called dmozSpider that implements the Handler interface.

type dmozSpider struct {}

func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}

dmozSpider extracts data from the received pages and passes it to the Pipeline.

doc, err := antch.ParseHTML(res)
if err != nil {
    return // skip pages that cannot be parsed
}
// Each matching div is one listing entry.
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
    v := new(item)
    v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
    v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
    v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
    c <- v // hand the item over to the pipeline
}

The htmlquery package supports extracting data with XPath expressions. Each extracted Item is then sent to the Go channel c.
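The channel hand-off can be sketched with the standard library alone. In the sketch below, Item is a local stand-in for antch.Item (assumed here to accept any value), and produce and collect are hypothetical helpers. In a real crawl the framework owns the channel and drains it into the pipeline.

```go
package main

import "fmt"

// Item is a local stand-in for antch.Item (assumption: any value can be an Item).
type Item interface{}

// produce plays the role of a spider: it only sends, so it takes a send-only channel.
func produce(c chan<- Item, titles []string) {
	for _, t := range titles {
		c <- t
	}
}

// collect plays the role of the crawler: it starts the producer and drains the channel.
func collect(titles []string) []Item {
	c := make(chan Item)
	go func() {
		defer close(c) // closing the channel ends the range loop below
		produce(c, titles)
	}()
	var out []Item
	for v := range c {
		out = append(out, v)
	}
	return out
}

func main() {
	for _, v := range collect([]string{"first", "second"}) {
		fmt.Println(v)
	}
}
```

The send-only type chan<- in the signature means the spider can send items but cannot receive from or close the channel, which keeps ownership with the caller.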

4. Our first Pipeline

Create a new Pipeline called jsonOutputPipeline that implements the PipelineHandler interface.

jsonOutputPipeline serializes each received Item as JSON and prints it to the console.

type jsonOutputPipeline struct {}

func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
    // Serialize the item; the json tags on item control the key names.
    b, err := json.Marshal(v)
    if err != nil {
        panic(err)
    }
    os.Stdout.Write(b)
}
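One thing to note about the pipeline above: os.Stdout.Write(b) emits no separator, so successive items run together on one line. A possible variation, sketched below with only the standard library, uses json.Encoder, which appends a newline after each value. writeJSONLine and encodeLines are hypothetical helper names, and the map values are sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// writeJSONLine encodes v as JSON followed by a newline.
func writeJSONLine(w io.Writer, v interface{}) error {
	return json.NewEncoder(w).Encode(v)
}

// encodeLines writes every value to an in-memory buffer, one JSON document per line.
func encodeLines(values ...interface{}) (string, error) {
	var sb strings.Builder
	for _, v := range values {
		if err := writeJSONLine(&sb, v); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	out, err := encodeLines(
		map[string]string{"title": "first"},
		map[string]string{"title": "second"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Passing os.Stdout to writeJSONLine instead of the in-memory buffer would print each item on its own line in the console.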

5. Crawler

Create a new web crawler instance.

crawler := antch.NewCrawler()

You can enable middleware for HTTP cookies or robots.txt if you want.

crawler.UseMiddleware(CustomMiddleware())

6. Register Spider and Pipeline

Register dmozSpider to the web crawler instance.

dmozSpider will process every page that matches the dmoztools.net pattern.

crawler.Handle("dmoztools.net", &dmozSpider{})
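Conceptually, Handle associates a pattern with a handler so that each response can be routed to the right spider. The sketch below illustrates that idea with an exact host lookup. It is not antch's actual matching logic, and registry, handle, and lookup are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
)

// registry maps a host pattern to the name of the spider that handles it.
type registry map[string]string

// handle registers a spider name under the given pattern.
func (r registry) handle(pattern, spider string) { r[pattern] = spider }

// lookup returns the spider registered for the host of rawURL, if any.
func (r registry) lookup(rawURL string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	name, ok := r[u.Hostname()]
	return name, ok
}

func main() {
	r := registry{}
	r.handle("dmoztools.net", "dmozSpider")

	name, ok := r.lookup("http://dmoztools.net/Computers/Programming/Languages/Python/Books/")
	fmt.Println(name, ok)

	_, ok = r.lookup("http://example.com/")
	fmt.Println(ok)
}
```

Under this simplified model, pages from any host without a registered pattern are left unhandled.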

Register jsonOutputPipeline to the web crawler instance.

crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())
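newTrimSpacePipeline is not defined in this guide. Judging by its name, it presumably strips surrounding whitespace from each field before the item reaches jsonOutputPipeline. A minimal sketch of such a step, with trimItem as a hypothetical helper and made-up sample values, might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// item mirrors the fields of the struct from step 2.
type item struct {
	Title string
	Link  string
	Desc  string
}

// trimItem removes leading and trailing whitespace from every field in place.
func trimItem(v *item) {
	v.Title = strings.TrimSpace(v.Title)
	v.Link = strings.TrimSpace(v.Link)
	v.Desc = strings.TrimSpace(v.Desc)
}

func main() {
	v := &item{Title: "\n  Example Book ", Link: " http://example.com/ ", Desc: "\tA sample entry\n"}
	trimItem(v)
	fmt.Printf("%q %q %q\n", v.Title, v.Link, v.Desc)
}
```

Text extracted with InnerText often carries the indentation and line breaks of the HTML source, which is why a cleanup step like this is useful before serializing.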

7. Running

startURLs := []string{
    "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
    "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)

END

Enjoy it.

Source Code

https://github.com/antchfx/antch-getstarted