Building a Concurrent Web Crawler in Go with Redis and Docker

In this blog post, we'll build a concurrent web crawler in Go, leaning on goroutines and a sync.WaitGroup for concurrency. We'll use Redis to keep track of the pages we've already crawled, so we never fetch the same page twice, and we'll use Docker and Docker Compose to set up the Redis server. Our crawler will fetch HTML pages, follow the links (href attributes) it finds to subpages, and log each crawled URL to standard output.

Prerequisites

Before we get started, ensure you have the following installed on your machine:

  • Go (latest version)
  • Docker
  • Docker Compose

Setting Up Redis with Docker Compose

First, let's set up a Redis server using Docker Compose. Create a docker-compose.yml file in your project directory:

version: '3.8'

services:
  redis:
    image: redis:latest
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  redis_data:

Start the Redis server by running the following command in the same directory as your docker-compose.yml file:

docker-compose up -d

Building the Web Crawler

Next, let's create our web crawler in Go. We'll start by creating a new Go module and installing the necessary packages:

mkdir go-web-crawler
cd go-web-crawler
go mod init go-web-crawler
go get github.com/go-redis/redis/v8
go get golang.org/x/net/html

Create a main.go file in your project directory with the following content:

Initialization

First, we need to initialize our Redis client and set up some global variables.

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "strings"
    "sync"

    "github.com/go-redis/redis/v8"
    "golang.org/x/net/html"
)

var (
    ctx        = context.Background()
    redisClient *redis.Client
    visited     sync.Map
)

func init() {
    redisClient = redis.NewClient(&redis.Options{
        Addr: "localhost:6379",
    })
}

In this code, we import every package main.go will need, then declare a shared context, the Redis client, and a sync.Map for tracking visited URLs in memory. The init function configures the client to connect to the Redis server listening on localhost:6379, the port we exposed in Docker Compose.
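
Connection problems are easier to debug when they surface at startup. As an optional addition (not part of the original code), you can ping Redis from init and abort immediately if the server is unreachable:

func init() {
    redisClient = redis.NewClient(&redis.Options{
        Addr: "localhost:6379",
    })

    // Optional: fail fast if Redis is not reachable.
    if err := redisClient.Ping(ctx).Err(); err != nil {
        log.Fatalf("could not connect to Redis: %v", err)
    }
}

Ping is part of the go-redis client, so this needs no extra imports beyond the log package already used elsewhere in main.go.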

Main Function

Next, we define the main function which starts the crawling process.

func main() {
    startURL := "http://example.com"
    var wg sync.WaitGroup
    wg.Add(1)
    go crawl(startURL, &wg)
    wg.Wait()
}

The main function sets the starting URL and initializes a wait group to manage our goroutines. We add one to the wait group and start the crawling process by calling the crawl function in a new goroutine.
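
If you'd rather not hard-code the start URL, here's a small optional variation of main that reads it from a command-line flag instead (add "flag" to the import block if you use it):

func main() {
    // The -url flag defaults to the same example URL used above.
    startURL := flag.String("url", "http://example.com", "URL to start crawling from")
    flag.Parse()

    var wg sync.WaitGroup
    wg.Add(1)
    go crawl(*startURL, &wg)
    wg.Wait()
}

You could then run the crawler with go run main.go -url https://example.org.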

Crawl Function

The crawl function performs the actual crawling, fetching pages, and processing links.

func crawl(url string, wg *sync.WaitGroup) {
    defer wg.Done()
    if isVisited(url) {
        return
    }
    markVisited(url)

    resp, err := http.Get(url)
    if err != nil {
        log.Println("Error fetching URL:", err)
        return
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        log.Println("Error: Non-OK HTTP status:", resp.StatusCode)
        return
    }

    fmt.Println("Crawled:", url)
    doc, err := html.Parse(resp.Body)
    if err != nil {
        log.Println("Error parsing HTML:", err)
        return
    }

    for _, link := range extractLinks(doc) {
        if strings.HasPrefix(link, "http") {
            wg.Add(1)
            go crawl(link, wg)
        }
    }
}

The crawl function:

  1. Checks if the URL has already been visited. If it has, the function returns early.
  2. Marks the URL as visited.
  3. Fetches the page using http.Get.
  4. Parses the HTML content of the page.
  5. Extracts links from the page and recursively crawls each link in a new goroutine.
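
Note that crawl spawns a new goroutine for every link it finds, with no upper bound, which can overwhelm your machine or the target site on link-heavy pages. One common remedy is a buffered-channel semaphore; the sketch below shows where it would slot in (the sem channel is an addition, not part of the code above):

// sem limits the crawler to at most 10 concurrent fetches; the cap is arbitrary.
var sem = make(chan struct{}, 10)

func crawl(url string, wg *sync.WaitGroup) {
    defer wg.Done()

    sem <- struct{}{}        // block until a slot is free
    defer func() { <-sem }() // release the slot when this crawl finishes

    // ... the rest of the function stays exactly as shown above ...
}

Because the channel has a capacity of 10, at most ten goroutines can get past the send at any one time; the rest simply wait their turn.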

We need a function to extract links from the HTML content.

func extractLinks(n *html.Node) []string {
    var links []string
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, attr := range n.Attr {
            if attr.Key == "href" {
                links = append(links, attr.Val)
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        links = append(links, extractLinks(c)...)
    }
    return links
}

The extractLinks function recursively traverses the HTML node tree and collects the href attribute of every <a> element it encounters.
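
Because crawl only follows links that start with http, relative links such as /about are silently ignored. If you want to follow them as well, you can resolve each href against the URL of the page it appeared on using the standard net/url package. Here's a sketch of such a helper (resolveLink is not part of the original code, and you'd need to add "net/url" to the import block):

// resolveLink turns a possibly relative href into an absolute URL, using the
// page it was found on as the base. It returns "" if either URL is invalid.
func resolveLink(base, href string) string {
    baseURL, err := url.Parse(base)
    if err != nil {
        return ""
    }
    hrefURL, err := url.Parse(href)
    if err != nil {
        return ""
    }
    return baseURL.ResolveReference(hrefURL).String()
}

Inside the link loop in crawl you would call resolveLink(url, link) before deciding whether to follow the result.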

Visited Functions

Finally, we need functions to check and mark URLs as visited.

func isVisited(url string) bool {
    if _, ok := visited.Load(url); ok {
        return true
    }
    result, err := redisClient.Get(ctx, url).Result()
    if err == redis.Nil || result == "" {
        return false
    }
    return true
}

func markVisited(url string) {
    visited.Store(url, true)
    err := redisClient.Set(ctx, url, true, 0).Err()
    if err != nil {
        log.Println("Error setting URL in Redis:", err)
    }
}

The isVisited function checks both the in-memory sync.Map and Redis before treating a URL as new; the map gives a fast check within the current run, while Redis persists visited URLs across runs. The markVisited function records the URL in both places.
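
One subtlety worth knowing about: checking and marking in two separate steps leaves a small window in which two goroutines can both conclude that a URL is new and fetch it twice. For this crawler that only costs a duplicate request, but if you need a stronger guarantee, Redis's SET with the NX option (exposed as SetNX in go-redis) combines the check and the mark into one atomic operation. Here's a sketch of that alternative (not the approach used above):

// tryMarkVisited returns true only for the first caller to claim the URL.
// SetNX sets the key only if it does not already exist, so the check and
// the mark happen atomically inside Redis.
func tryMarkVisited(url string) bool {
    ok, err := redisClient.SetNX(ctx, url, true, 0).Result()
    if err != nil {
        log.Println("Error marking URL in Redis:", err)
        return false
    }
    return ok
}

With this helper, crawl would fetch a page only when tryMarkVisited returns true, and the separate isVisited check becomes unnecessary.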

Running the Web Crawler

To run the web crawler, ensure your Redis server is running (via Docker Compose), then execute the following command:

go run main.go

This starts the crawler at the specified URL and prints each crawled URL to standard output as the crawler follows links to subpages.

Conclusion

In this blog post, we built a concurrent web crawler in Go using goroutines and a sync.WaitGroup. We used Redis to keep track of visited URLs so the same page is never crawled twice, and Docker Compose to bring up a Redis server with a single command. The result is a small but extensible crawler that relies on Go's concurrency primitives for parallel fetching and on Redis for durable state tracking.

For more information and detailed documentation, see the go-redis documentation (https://pkg.go.dev/github.com/go-redis/redis/v8) and the golang.org/x/net/html package documentation (https://pkg.go.dev/golang.org/x/net/html). Happy coding!