$ ./sm -s https://google.com | jq
2021/12/31 14:57:26 Using mode: concurrent
2021/12/31 14:57:26 Crawling https://google.com with depth 1
2021/12/31 14:57:26 visiting URL https://google.com at depth 0 with parent https://google.com
2021/12/31 14:57:26 Elapsed milliseconds: 263
[
  {
    "URL": "https://google.com",
    "Links": [
      "https://accounts.google.com/ServiceLogin",
      "https://drive.google.com/",
      "https://mail.google.com/mail/",
      "https://news.google.com/",
      "https://play.google.com/",
      "https://www.google.com/advanced_search",
      "https://www.google.com/intl/en/about.html",
      "https://www.google.com/intl/en/ads/",
      "https://www.google.com/intl/en/policies/privacy/",
      "https://www.google.com/intl/en/policies/terms/",
      "https://www.google.com/preferences",
      "https://www.google.com/search",
      "https://www.google.com/services/",
      "https://www.google.com/setprefdomain"
    ]
  }
]
I’ve really been enjoying my journey of becoming more familiar with Go. One of Go’s strengths is its built-in concurrency primitives, namely goroutines and channels. I decided to practice my Go skills by building a tool that explores some of these features. SiteMapper accepts a root URL and a maximum depth, and then uses one of three modes of operation (synchronous, concurrent and concurrent-limited) to crawl the site, building out a sitemap listing links to internal pages.
Part 1 of this project details the stand-alone CLI tool implementation written in Go. Part 2 details an implementation using Kubernetes, where site crawling is performed by ephemeral Kubernetes job pods.
Link to GitHub repo: https://github.com/dinofizz/sitemapper
Notes on the usage example above:
- The output is piped to jq to improve readability.
- Increasing the depth to 2 with the -d 2 flag means that each of the link results above will also be crawled for google.com links.
Modes of operation
Three modes of operation are available: synchronous, concurrent and concurrent-limited. Each mode is implemented as a recursive method which checks the current depth against the maximum depth to determine when to stop recursing.
Synchronous
Not much to explain here. In this mode a standard recursive traversal is performed for each link at each depth below the maximum.
func (c *SynchronousCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.crawl(urlLink, root, u, depth)
    }
}
Concurrent
- For the concurrent mode I run the recursive call as a goroutine wrapped in a closure.
- A WaitGroup is used to monitor for all goroutines to complete.
- For each goroutine created the WaitGroup count is incremented using Add().
- A corresponding decrement is performed with Done() once the goroutine completes.
- The caller of the crawl method waits for all goroutines to complete using the Wait() method (a stand-alone illustration of this pattern follows the code below).
func (c *ConcurrentCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.WG.Add(1)
        go func(urlLink, root, parent string, d int) {
            defer c.WG.Done()
            c.crawl(urlLink, root, parent, d)
        }(urlLink, root, u, depth)
    }
}
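As a stand-alone illustration of the Add/Done/Wait bookkeeping described above, here is a minimal, self-contained sketch of the same pattern. It is separate from the SiteMapper code; the URLs and print statements are just placeholders.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}

    for _, u := range urls {
        wg.Add(1) // increment the counter before starting the goroutine
        go func(u string) {
            defer wg.Done() // decrement the counter when the goroutine completes
            fmt.Println("crawling", u)
        }(u)
    }

    wg.Wait() // block until every Done() call has been made
    fmt.Println("all goroutines finished")
}

In SiteMapper the caller of the crawl method plays the role of main here, blocking on Wait() until every goroutine spawned during the crawl has called Done().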
Concurrent Limited
The concurrent-limited mode is similar to the concurrent mode, but uses a buffered channel to gate the number of goroutines running (a pattern I learnt from reading “Learning Go” by Jon Bodner).
Limiter code
- A buffered channel is created and filled with empty structs.
- Each time a new crawl goroutine is created, the limiter code reads from the buffered channel, runs the crawl function, and then tops up the buffered channel.
- If there are no structs available to be read from the channel, the limiter returns an error.
type Limiter struct {
    ch    chan struct{}
    limit int
}

func NewLimiter(limit int) *Limiter {
    ch := make(chan struct{}, limit)
    for i := 0; i < limit; i++ {
        ch <- struct{}{}
    }
    return &Limiter{
        ch:    ch,
        limit: limit,
    }
}

func (l *Limiter) RunFunc(f func()) error {
    select {
    case <-l.ch:
        f()
        l.ch <- struct{}{}
        return nil
    default:
        return errors.New("limit reached, try again later")
    }
}
Concurrent limited crawl implementation
The operation is the same as for the concurrent crawl engine; however, the crawl method is passed as a parameter to the limiter implementation. If an error is returned by the limiter, I wait a random amount of time before trying again.
func (c *ConcurrentLimitedCrawlEngine) crawl(u, root, parent string, depth int) {
    if c.maxDepth == depth {
        return
    }

    urls := getLinks(u, root, parent, depth, c.sm)
    depth++

    for _, urlLink := range urls {
        c.WG.Add(1)
        go func(urlLink, root, parent string, d int) {
            defer c.WG.Done()
            for {
                err := c.limiter.RunFunc(func() {
                    c.crawl(urlLink, root, parent, d)
                })
                if err != nil {
                    // Limit reached: back off for a random interval before retrying.
                    n := rand.Intn(500)
                    log.Printf("task limited for URL %s, sleeping for %d milliseconds\n", urlLink, n)
                    time.Sleep(time.Duration(n) * time.Millisecond)
                } else {
                    break
                }
            }
        }(urlLink, root, u, depth)
    }
}
Other features
- To keep track of the links found at each URL I’m using a map data structure, protected by a sync.Mutex which is locked and unlocked around reads and writes (see the sketch after this list).
- URLs that have already been visited are not visited again.
- The sitemap struct implements the io.WriterTo interface, which allows me to write the output to stdout.
- Custom JSON marshalling methods are provided for a more user-friendly JSON structure for the sitemap.
- Tests exist for the crawler and the sitemap-related functionality.
- I’m using cobra for command line argument handling.
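As a rough illustration of the mutex-protected map-of-maps idea, here is a minimal sketch. It is not the actual SiteMapper implementation; the type and method names are mine.

package main

import (
    "fmt"
    "sync"
)

// siteMap is a sketch of a mutex-protected "map of sets": each visited URL maps
// to the set of links found on that page, with map[string]struct{} standing in
// for a set type.
type siteMap struct {
    mu    sync.Mutex
    pages map[string]map[string]struct{}
}

func newSiteMap() *siteMap {
    return &siteMap{pages: make(map[string]map[string]struct{})}
}

// AddURL records a page, returning false if it has already been visited.
func (s *siteMap) AddURL(u string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.pages[u]; ok {
        return false
    }
    s.pages[u] = make(map[string]struct{})
    return true
}

// AddLink records a link found on page u.
func (s *siteMap) AddLink(u, link string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.pages[u]; !ok {
        s.pages[u] = make(map[string]struct{})
    }
    s.pages[u][link] = struct{}{}
}

func main() {
    sm := newSiteMap()
    sm.AddURL("https://example.com")
    sm.AddLink("https://example.com", "https://example.com/about")
    fmt.Println(sm.AddURL("https://example.com")) // false: already visited
}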
Things I would do differently
- The sitemap data structure is actually a map of maps (done this way to model a map of sets, as Go has no built-in set type). I can probably come up with a better structure.
- I decided to key my map with URLs that have their trailing slash trimmed, as I was seeing crawl results showing links for both http://www.example.com/about and http://www.example.com/about/. I might want to revert that decision.
- Rules for parsing internal URLs are trickier than expected (handling schemes, anchors, and relative and absolute links). The current implementation is very trial and error (see the sketch after this list).
- Add more tests
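To give a flavour of the URL handling, here is a minimal sketch of resolving raw href values against a base URL with the standard library’s net/url package. This is not the SiteMapper implementation; the function name and the normalisation choices (dropping anchors, trimming trailing slashes, treating subdomains as internal) are just one way to approach it.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// resolveInternal is a hypothetical helper: it resolves a raw href against the
// page it was found on and reports whether the result is an internal link.
func resolveInternal(rawHref, pageURL, rootURL string) (string, bool) {
    base, err := url.Parse(pageURL)
    if err != nil {
        return "", false
    }
    root, err := url.Parse(rootURL)
    if err != nil {
        return "", false
    }
    ref, err := url.Parse(rawHref)
    if err != nil {
        return "", false
    }

    abs := base.ResolveReference(ref) // handles both relative and absolute links
    abs.Fragment = ""                 // drop #anchors

    if abs.Scheme != "http" && abs.Scheme != "https" {
        return "", false // skip mailto:, javascript:, etc.
    }
    if !strings.HasSuffix(abs.Host, root.Host) {
        return "", false // external link; suffix match keeps subdomains internal
    }
    return strings.TrimSuffix(abs.String(), "/"), true
}

func main() {
    fmt.Println(resolveInternal("/about/", "https://example.com/", "https://example.com"))
    fmt.Println(resolveInternal("https://news.example.com/", "https://example.com/", "https://example.com"))
    fmt.Println(resolveInternal("https://other.org/page", "https://example.com/", "https://example.com"))
}

The suffix check on the host mirrors the google.com example above, where subdomains such as accounts.google.com appear in the results.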