# Building a Concurrent Web Crawler in Go with Redis and Docker
In this blog post, we'll explore how to build a concurrent web crawler in Go using Go's powerful concurrency model. We'll use Redis to keep track of the pages we've already crawled, ensuring we don't crawl the same page twice, and Docker Compose to set up the Redis server. Our crawler will fetch HTML pages, follow any links (`href` attributes in anchor tags) to subpages, and log each crawled URL to standard output.
## Prerequisites
Before we get started, ensure you have the following installed on your machine:
- Go (latest version)
- Docker
- Docker Compose
## Setting Up Redis with Docker Compose
First, let's set up a Redis server using Docker Compose. Create a `docker-compose.yml` file in your project directory:
```yaml
version: '3.8'

services:
  redis:
    image: redis:latest
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```
Start the Redis server by running the following command in the same directory as your `docker-compose.yml` file:
```bash
docker-compose up -d
```
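To confirm the container is actually accepting connections, you can ping the server through `redis-cli` inside the container (assuming the service name `redis` from the Compose file above):

```bash
# Should print PONG if the server is up and accepting connections
docker-compose exec redis redis-cli ping
```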
## Building the Web Crawler
Next, let's create our web crawler in Go. We'll start by creating a new Go project and installing the necessary packages:

```bash
mkdir go-web-crawler
cd go-web-crawler
go mod init go-web-crawler
go get github.com/go-redis/redis/v8
go get golang.org/x/net/html
```
Create a `main.go` file in your project directory with the following content:
### Initialization
First, we need to initialize our Redis client and set up some global variables.
```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"strings"
	"sync"

	"github.com/go-redis/redis/v8"
	"golang.org/x/net/html"
)

var (
	ctx         = context.Background()
	redisClient *redis.Client
	visited     sync.Map // in-memory cache of visited URLs
)

func init() {
	// Connect to the Redis server started by Docker Compose.
	redisClient = redis.NewClient(&redis.Options{
		Addr: "localhost:6379",
	})
}
```
In this code, we set up a shared context, the Redis client, and a `sync.Map` to keep track of visited URLs in memory. The `init` function configures the Redis client to connect to the Redis server running on `localhost:6379`. Note that the import block already pulls in everything the rest of the file will need (`fmt`, `log`, `net/http`, `strings`, and `golang.org/x/net/html`).
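One detail worth knowing: go-redis connects lazily, so `init` succeeds even if Redis is down, and the first `Get` or `Set` during a crawl would be the point of failure. If you'd rather fail fast, a minimal variant pings the server at startup. This is an optional tweak, not part of the code above:

```go
func init() {
	redisClient = redis.NewClient(&redis.Options{
		Addr: "localhost:6379",
	})
	// Optional: verify the connection now, instead of surfacing the
	// error on the first Get/Set during a crawl.
	if err := redisClient.Ping(ctx).Err(); err != nil {
		log.Fatalf("could not connect to Redis: %v", err)
	}
}
```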
### Main Function
Next, we define the main function which starts the crawling process.
```go
func main() {
	startURL := "http://example.com"

	var wg sync.WaitGroup
	wg.Add(1)
	go crawl(startURL, &wg)
	wg.Wait()
}
```
The main function sets the starting URL and initializes a wait group to manage our goroutines. We add one to the wait group, start the crawling process by calling the `crawl` function in a new goroutine, and then block on `wg.Wait()` until every spawned crawl has finished.
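One caveat: spawning a goroutine per link can overwhelm both your machine and the target site on a large crawl. A common pattern is to bound concurrency with a buffered channel used as a counting semaphore. Here is a minimal sketch of that idea; the `sem` and `fetchWithLimit` names and the limit of 10 are arbitrary illustrative choices, not part of the crawler above:

```go
// A buffered channel as a counting semaphore: once the buffer is full,
// sends block, capping how many fetches are in flight at once.
var sem = make(chan struct{}, 10)

// fetchWithLimit wraps http.Get so at most cap(sem) requests run
// concurrently. crawl could call this instead of http.Get directly.
func fetchWithLimit(url string) (*http.Response, error) {
	sem <- struct{}{}        // acquire a slot (blocks when full)
	defer func() { <-sem }() // release the slot when the fetch returns
	return http.Get(url)
}
```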
### Crawl Function
The `crawl` function performs the actual crawling: fetching pages and processing their links.
```go
func crawl(url string, wg *sync.WaitGroup) {
	defer wg.Done()

	if isVisited(url) {
		return
	}
	markVisited(url)

	resp, err := http.Get(url)
	if err != nil {
		log.Println("Error fetching URL:", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Println("Error: Non-OK HTTP status:", resp.StatusCode)
		return
	}

	fmt.Println("Crawled:", url)

	doc, err := html.Parse(resp.Body)
	if err != nil {
		log.Println("Error parsing HTML:", err)
		return
	}

	// Follow every absolute link on the page in its own goroutine.
	for _, link := range extractLinks(doc) {
		if strings.HasPrefix(link, "http") {
			wg.Add(1)
			go crawl(link, wg)
		}
	}
}
```
The `crawl` function:
- Checks whether the URL has already been visited; if it has, the function returns early.
- Marks the URL as visited.
- Fetches the page using `http.Get`.
- Parses the HTML content of the page.
- Extracts links from the page and recursively crawls each absolute link in a new goroutine.
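Note that the `strings.HasPrefix(link, "http")` check silently drops relative links like `/about`, which are common in practice. If you want to follow them too, the standard library's `net/url` package can resolve a link against the page it was found on. A minimal sketch, with `resolveLink` as a hypothetical helper rather than part of the crawler above:

```go
// resolveLink turns a possibly relative href into an absolute URL by
// resolving it against the page (base) it was found on.
// Requires the standard library's "net/url" package.
func resolveLink(base, href string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return baseURL.ResolveReference(ref).String(), nil
}
```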
### Extract Links
We need a function to extract links from the HTML content.
```go
func extractLinks(n *html.Node) []string {
	var links []string
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, attr := range n.Attr {
			if attr.Key == "href" {
				links = append(links, attr.Val)
			}
		}
	}
	// Recurse into the children and collect their links as well.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = append(links, extractLinks(c)...)
	}
	return links
}
```
The `extractLinks` function recursively traverses the HTML node tree and collects the `href` attribute of every `<a>` tag it finds.
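To see `extractLinks` in action on its own, you can feed it a literal snippet, since `html.Parse` accepts any `io.Reader`. A quick standalone check (for example, inside a scratch `main` or a test, using the same imports as above):

```go
// Parse a small HTML string and print the hrefs extractLinks finds.
doc, err := html.Parse(strings.NewReader(
	`<p><a href="/a">one</a> <a href="https://example.com/b">two</a></p>`))
if err != nil {
	log.Fatal(err)
}
fmt.Println(extractLinks(doc)) // prints [/a https://example.com/b]
```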
### Visited Functions
Finally, we need functions to check and mark URLs as visited.
```go
func isVisited(url string) bool {
	// Check the fast in-memory cache first.
	if _, ok := visited.Load(url); ok {
		return true
	}
	// Fall back to Redis, which persists across runs.
	result, err := redisClient.Get(ctx, url).Result()
	if err == redis.Nil || result == "" {
		return false
	}
	return true
}

func markVisited(url string) {
	visited.Store(url, true)
	err := redisClient.Set(ctx, url, true, 0).Err()
	if err != nil {
		log.Println("Error setting URL in Redis:", err)
	}
}
```
The `isVisited` function checks whether a URL has been visited, consulting both the in-memory map and Redis; the `markVisited` function records a URL as visited in both places.
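One subtlety: calling `isVisited` and then `markVisited` is not atomic, so two goroutines that reach the same URL at nearly the same moment can both pass the check and crawl it twice. Redis's `SETNX` ("set if not exists") collapses the check and the write into one atomic operation. A sketch of that alternative; the `tryMarkVisited` name is ours, not from the code above:

```go
// tryMarkVisited atomically records the URL in Redis and reports
// whether this caller was the first to do so. SetNX only writes the
// key if it does not already exist.
func tryMarkVisited(url string) bool {
	first, err := redisClient.SetNX(ctx, url, true, 0).Result()
	if err != nil {
		log.Println("Error marking URL in Redis:", err)
		return false // on error, skip the URL rather than re-crawl it
	}
	return first
}
```

With this variant, `crawl` would replace the `isVisited`/`markVisited` pair with a single `if !tryMarkVisited(url) { return }`.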
## Running the Web Crawler
To run the web crawler, ensure your Redis server is running (via Docker Compose), then execute the following command:
```bash
go run main.go
```
This will start the crawler at the specified URL and log each crawled page's URL to standard output.
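You can also peek at what the crawler has recorded so far by listing the keys in Redis:

```bash
# List every URL the crawler has stored. Fine for a demo, but avoid
# KEYS * against large production datasets, as it scans everything.
docker-compose exec redis redis-cli keys '*'
```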
## Conclusion
In this blog post, we built a concurrent web crawler in Go using Go's concurrency model. We used Redis to keep track of visited URLs so the same page is never crawled twice, and Docker Compose made it easy to stand up a Redis server for the application. The result is a small but scalable crawler that leans on Go's goroutines for concurrency and on Redis for fast, persistent state.
For more information and detailed documentation, visit the [go-redis documentation](https://pkg.go.dev/github.com/go-redis/redis/v8) and the [golang.org/x/net/html package documentation](https://pkg.go.dev/golang.org/x/net/html). Happy coding!