$ ./sm -s https://google.com | jq
2021/12/31 14:57:26 Using mode: concurrent
2021/12/31 14:57:26 Crawling https://google.com with depth 1
2021/12/31 14:57:26 visiting URL https://google.com at depth 0 with parent https://google.com
2021/12/31 14:57:26 Elapsed milliseconds: 263
[
  {
    "URL": "https://google.com",
    "Links": [
      "https://accounts.google.com/ServiceLogin",
      "https://drive.google.com/",
      "https://mail.google.com/mail/",
      "https://news.google.com/",
      "https://play.google.com/",
      "https://www.google.com/advanced_search",
      "https://www.google.com/intl/en/about.html",
      "https://www.google.com/intl/en/ads/",
      "https://www.google.com/intl/en/policies/privacy/",
      "https://www.google.com/intl/en/policies/terms/",
      "https://www.google.com/preferences",
      "https://www.google.com/search",
      "https://www.google.com/services/",
      "https://www.google.com/setprefdomain"
    ]
  }
]
I’ve really been enjoying my journey of becoming more familiar with Go. One of Go’s strengths is its built-in concurrency primitives, namely goroutines and channels. I decided to practice my Go skills by building a tool that explores some of these features. SiteMapper accepts a root URL and a maximum depth, and then uses one of three modes of operation (synchronous, concurrent and concurrent-limited) to crawl the site, building out a sitemap listing links to internal pages.
Part 1 of this project details the stand-alone CLI tool implementation written in Go. Part 2 details an implementation using Kubernetes, where site crawling is performed by ephemeral Kubernetes job pods.
Link to GitHub repo: https://github.com/dinofizz/sitemapper
Notes on the usage example above:
- The output is piped to jq to improve readability.
- Increasing the depth to 2 with the -d 2 flag means that each of the link results above will also be crawled for google.com links.
Modes of operation
Three modes of operation are available: synchronous, concurrent and concurrent-limited. Each mode is implemented as a recursive method which checks the current depth against the maximum depth to determine when to stop recursing.
Synchronous
Not much to explain here. In this mode a standard recursive traversal is performed for each link at each depth below the maximum.
func (c *SynchronousCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.crawl(urlLink, root, u, depth)
    }
}
Concurrent
- For the concurrent mode I run the recursive call as a goroutine wrapped in a closure.
- A WaitGroup is used to monitor for all goroutines to complete.
- For each goroutine created the WaitGroup count is incremented using Add().
- A corresponding decrement is performed with Done() once the goroutine completes.
- The caller of the crawl method waits for all goroutines to complete using the Wait() method (a stand-alone illustration of this pattern follows the code below).
func (c *ConcurrentCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.WG.Add(1)
        go func(urlLink, root, parent string, d int) {
            defer c.WG.Done()
            c.crawl(urlLink, root, parent, d)
        }(urlLink, root, u, depth)
    }
}
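As a stand-alone illustration of the Add/Done/Wait bookkeeping described above, here is a minimal, self-contained sketch of the same pattern. It is separate from the SiteMapper code; the URLs and print statements are just placeholders.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}

    for _, u := range urls {
        wg.Add(1) // increment the counter before starting the goroutine
        go func(u string) {
            defer wg.Done() // decrement the counter when the goroutine completes
            fmt.Println("crawling", u)
        }(u)
    }

    wg.Wait() // block until every Done() call has been made
    fmt.Println("all goroutines finished")
}

In SiteMapper the caller of the crawl method plays the role of main here, blocking on Wait() until every goroutine spawned during the crawl has called Done().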
Concurrent Limited
The concurrent-limited mode is similar to the concurrent mode, but uses a buffered channel to gate the number of goroutines running (a pattern I learnt from reading “Learning Go” by Jon Bodner).
Limiter code
- A buffered channel is created and filled with empty structs.
- Each time a new crawl goroutine is created, the limiter code reads from the buffered channel, runs the crawl function, and then tops up the buffered channel.
- If there are no structs available to be read from the channel, the limiter returns an error.
type Limiter struct {
    ch    chan struct{}
    limit int
}

func NewLimiter(limit int) *Limiter {
    ch := make(chan struct{}, limit)
    for i := 0; i < limit; i++ {
        ch <- struct{}{}
    }
    return &Limiter{
        ch:    ch,
        limit: limit,
    }
}

func (l *Limiter) RunFunc(f func()) error {
    select {
    case <-l.ch:
        f()
        l.ch <- struct{}{}
        return nil
    default:
        return errors.New("limit reached, try again later")
    }
}
Concurrent limited crawl implementation
The operation is the same as for the concurrent crawl engine; however, the crawl method is passed as a parameter to the limiter implementation. If an error is returned by the limiter, I wait a random amount of time before trying again.
func (c *ConcurrentLimitedCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.WG.Add(1)
        go func(urlLink, root, parent string, d int) {
            defer c.WG.Done()
            for {
                err := c.limiter.RunFunc(func() {
                    c.crawl(urlLink, root, parent, d)
                })
                if err != nil {
                    // Limit reached: back off for a random interval before retrying.
                    n := rand.Intn(500)
                    log.Printf("task limited for URL %s, sleeping for %d milliseconds\n", urlLink, n)
                    time.Sleep(time.Duration(n) * time.Millisecond)
                } else {
                    break
                }
            }
        }(urlLink, root, u, depth)
    }
}
Other features
- To keep track of the links found at each URL I’m using a map data structure, protected by a sync.Mutex which is locked and unlocked around reads and writes (see the sketch after this list).
- URLs that have already been visited are not visited again.
- The sitemap struct implements the io.WriterTo interface, which allows me to write the output to stdout.
- Custom JSON marshalling methods are provided for a more user-friendly JSON structure for the sitemap.
- Tests exist for the crawler and the sitemap-related functionality.
- I’m using cobra for command line argument handling.
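As a rough illustration of the mutex-protected map-of-maps idea, here is a minimal sketch. It is not the actual SiteMapper implementation; the type and method names are mine.

package main

import (
    "fmt"
    "sync"
)

// siteMap is a sketch of a mutex-protected "map of sets": each visited URL maps
// to the set of links found on that page, with map[string]struct{} standing in
// for a set type.
type siteMap struct {
    mu    sync.Mutex
    pages map[string]map[string]struct{}
}

func newSiteMap() *siteMap {
    return &siteMap{pages: make(map[string]map[string]struct{})}
}

// AddURL records a page, returning false if it has already been visited.
func (s *siteMap) AddURL(u string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.pages[u]; ok {
        return false
    }
    s.pages[u] = make(map[string]struct{})
    return true
}

// AddLink records a link found on page u.
func (s *siteMap) AddLink(u, link string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.pages[u]; !ok {
        s.pages[u] = make(map[string]struct{})
    }
    s.pages[u][link] = struct{}{}
}

func main() {
    sm := newSiteMap()
    sm.AddURL("https://example.com")
    sm.AddLink("https://example.com", "https://example.com/about")
    fmt.Println(sm.AddURL("https://example.com")) // false: already visited
}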
Things I would do differently
- The sitemap data structure is actually a map of maps (done this way to model a map of sets, as Go has no built-in set type). I can probably come up with a better structure.
- I decided to key my map with URLs that have their trailing slash trimmed, as I was seeing crawl results showing links for both http://www.example.com/about and http://www.example.com/about/. I might want to revert that decision.
- Rules for parsing internal URLs are trickier than expected (handling schemes, anchors, and relative and absolute links). The current implementation is very trial and error (see the sketch after this list).
- Add more tests
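To give a flavour of the URL handling, here is a minimal sketch of resolving raw href values against a base URL with the standard library’s net/url package. This is not the SiteMapper implementation; the function name and the normalisation choices (dropping anchors, trimming trailing slashes, treating subdomains as internal) are just one way to approach it.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// resolveInternal is a hypothetical helper: it resolves a raw href against the
// page it was found on and reports whether the result is an internal link.
func resolveInternal(rawHref, pageURL, rootURL string) (string, bool) {
    base, err := url.Parse(pageURL)
    if err != nil {
        return "", false
    }
    root, err := url.Parse(rootURL)
    if err != nil {
        return "", false
    }
    ref, err := url.Parse(rawHref)
    if err != nil {
        return "", false
    }

    abs := base.ResolveReference(ref) // handles both relative and absolute links
    abs.Fragment = ""                 // drop #anchors

    if abs.Scheme != "http" && abs.Scheme != "https" {
        return "", false // skip mailto:, javascript:, etc.
    }
    if !strings.HasSuffix(abs.Host, root.Host) {
        return "", false // external link; suffix match keeps subdomains internal
    }
    return strings.TrimSuffix(abs.String(), "/"), true
}

func main() {
    fmt.Println(resolveInternal("/about/", "https://example.com/", "https://example.com"))
    fmt.Println(resolveInternal("https://news.example.com/", "https://example.com/", "https://example.com"))
    fmt.Println(resolveInternal("https://other.org/page", "https://example.com/", "https://example.com"))
}

The suffix check on the host mirrors the google.com example above, where subdomains such as accounts.google.com appear in the results.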