-
Notifications
You must be signed in to change notification settings - Fork 42
Getting Started
zhengchun edited this page Dec 29, 2017
·
4 revisions
go get github.com/antchfx/antchtype item struct {
Title string `json:"title"`
Link string `json:"link"`
Desc string `json:"desc"`
}Create a struct called dmozSpider that implement Handler interface.
type dmozSpider struct {}
func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}dmozSpider will extracting data from received pages and pass data into Pipeline.
doc, err := antch.ParseHTML(res)
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
v := new(item)
v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
c <- v
}htmlquery package, that supports XPath expression extracting data,
and then send Item toGo'Channel c.
c <- vCreate new Pipeline called jsonOutputPipeline, implements PipelineHandler interface.
jsonOutputPipeline serialize received Item data as JSON format print into console.
type jsonOutputPipeline struct {}
func (p *jsonOutputPipeline) ServePipeline(v Item) {
b, err := json.Marshal(v)
if err != nil {
panic(err)
}
os.Stdout.Write(b)
}Create a new web crawler instance.
crawler := antch.NewCrawler()You can enables middleware for HTTP cookies or robots.txt if you want.
- enable cookies middleware for web crawler.
crawler.UseCookies()- you even registers custom middleware for web crawler.
crawler.UseMiddleware(CustomMiddleware())Register dmozSpider to the web crawler instance.
dmozSpider will process all matches pages if its matches by dmoztools.net pattern.
crawler.Handle("dmoztools.net", &dmozSpider{})Register jsonOutputPipeline to the web crawler instance.
crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())startURLs := []string{
"http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
"http://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)go run main.goEnjoy it.