Build a Web Scraper in Python
Introduction
Web scraping is the automated process of extracting data from websites. It's used for price monitoring, lead generation, research, and much more.
In this tutorial, you'll build a web scraper in Python that can extract product information from e-commerce sites. You'll learn the fundamentals of HTML parsing and data extraction.
By the end, you'll have a web scraper that can:
- Extract product titles and prices
- Handle pagination
- Save data to CSV/JSON
- Handle errors gracefully
Along the way, you'll learn:
- HTTP requests with the requests library
- HTML parsing with Beautiful Soup
- CSS selectors for element selection
- Data storage to CSV/JSON
- Web scraping ethics and best practices
How Web Scraping Works
Web scraping involves three main steps:
1. Make HTTP Request
Send a request to the website's server to fetch the HTML content of a page.
2. Parse HTML
Analyze the HTML structure and find the elements containing the data you want.
3. Extract Data
Pull out the relevant information and save it in a usable format.
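Here's that cycle in miniature, as a runnable sketch (books.toscrape.com is the demo site used throughout this tutorial):
import requests
from bs4 import BeautifulSoup

# 1. Make HTTP request
response = requests.get("http://books.toscrape.com/", timeout=10)

# 2. Parse HTML
soup = BeautifulSoup(response.text, 'lxml')

# 3. Extract data - every HTML page has a <title> tag
print(soup.title.text.strip())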
Always check a website's robots.txt and terms of service before scraping. Some websites explicitly prohibit scraping. Use scraping responsibly and don't overload servers with requests.
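Python's standard library can read robots.txt for you; a minimal sketch (the URL is just an example):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://books.toscrape.com/robots.txt")
rp.read()

# can_fetch() returns False if the rules disallow this user agent/path
print(rp.can_fetch("*", "http://books.toscrape.com/catalogue/page-1.html"))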
Project Overview
We'll build a product price scraper that:
- Searches for products on a sample e-commerce page
- Extracts product name, price, and rating
- Handles multiple pages of results
- Saves data to CSV for analysis
Technical Stack
- requests - HTTP library
- Beautiful Soup 4 - HTML parser
- CSV/JSON - Data storage
Prerequisites
- Python 3.8+ installed
- Basic Python knowledge
- Code editor
Project Setup
# Create project
mkdir web-scraper
cd web-scraper
# Create virtual environment
python -m venv venv
# Activate
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate
# Install libraries
pip install requests beautifulsoup4 lxml
HTML Basics
Before scraping, let's understand HTML structure.
<!-- HTML structure example -->
<div class="product">
    <h2 class="product-title">Product Name</h2>
    <span class="price">$99.99</span>
    <div class="rating" data-value="4.5">4.5 stars</div>
</div>
CSS Selectors
- .class - Select by class
- #id - Select by ID
- tag - Select by tag name
- parent child - Select descendants (use parent > child for direct children)
- [attr] - Select by attribute
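Beautiful Soup accepts all of these through its select() method; a quick sketch using the product markup above:
from bs4 import BeautifulSoup

html = '<div class="product"><span class="price">$99.99</span></div>'
soup = BeautifulSoup(html, 'lxml')

# select() takes any CSS selector and returns a list of matches
for price in soup.select('.product .price'):
    print(price.text)  # $99.99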
Beautiful Soup Basics
Let's learn Beautiful Soup fundamentals.
from bs4 import BeautifulSoup

# Sample HTML
html = """
<html>
<body>
    <div class="product">
        <h2>Laptop</h2>
        <span class="price">$999</span>
    </div>
    <div class="product">
        <h2>Phone</h2>
        <span class="price">$599</span>
    </div>
</body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html, 'lxml')

# Find all products
products = soup.find_all('div', class_='product')

for product in products:
    title = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f"Title: {title}, Price: {price}")

# Output:
# Title: Laptop, Price: $999
# Title: Phone, Price: $599
Key Beautiful Soup methods and attributes:
- find() - First matching element
- find_all() - All matching elements
- .text - Get text content
- .get('attr') - Get attribute
- .parent - Navigate to parent
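Attribute access and parent navigation, continuing from the soup object above:
# ['attr'] raises KeyError when missing; .get() returns None instead
price = soup.find('span', class_='price')
print(price.get('class'))         # ['price'] - class is always a list

# .parent walks up to the enclosing <div class="product">
print(price.parent.get('class'))  # ['product']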
Scraping Static Pages
Now let's build a real scraper for a books website.
# scraper.py
import csv
import time

import requests
from bs4 import BeautifulSoup


class BookScraper:
    """Scraper for books.toscrape.com - a demo e-commerce site"""

    def __init__(self):
        self.base_url = "http://books.toscrape.com/catalogue/"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_page(self, page=1):
        """Fetch a page of books"""
        url = f"{self.base_url}page-{page}.html"
        response = requests.get(url, headers=self.headers, timeout=10)
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'lxml')
        return None

    def extract_book_data(self, book_element):
        """Extract data from a single book element"""
        try:
            # Get book title (the full title lives in the link's title attribute)
            title = book_element.find('h3').find('a')['title']

            # Get price
            price = book_element.find('p', class_='price_color').text

            # Get rating: the element's second class is the word "One" through "Five"
            rating = book_element.find('p', class_='star-rating')['class'][1]

            # Get availability
            availability = book_element.find('p', class_='instock').text.strip()

            return {
                'title': title,
                'price': price,
                'rating': rating,
                'availability': availability
            }
        except Exception as e:
            print(f"Error extracting book: {e}")
            return None

    def scrape_pages(self, num_pages=2):
        """Scrape multiple pages"""
        all_books = []

        for page in range(1, num_pages + 1):
            print(f"Scraping page {page}...")
            soup = self.get_page(page)

            if not soup:
                print(f"Failed to get page {page}")
                break

            # Each book sits in an <article class="product_pod"> element
            books = soup.find_all('article', class_='product_pod')

            for book in books:
                book_data = self.extract_book_data(book)
                if book_data:
                    all_books.append(book_data)

            # Be polite - wait between requests
            time.sleep(1)

        return all_books

    def save_to_csv(self, books, filename='books.csv'):
        """Save books to CSV file"""
        if not books:
            print("No books to save")
            return

        keys = books[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(books)

        print(f"Saved {len(books)} books to {filename}")


# Run the scraper
if __name__ == "__main__":
    scraper = BookScraper()
    books = scraper.scrape_pages(num_pages=2)
    scraper.save_to_csv(books)
This scraper bakes in several good habits:
- Always set a User-Agent header
- Add delays between requests
- Handle exceptions gracefully
- Check response status codes
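Combining these points, here's a sketch of a standalone fetch helper with retries and a simple backoff (the retry count and delays are arbitrary choices, not part of the scraper above):
import time
import requests

def fetch(url, retries=3, delay=2):
    """Fetch a URL, retrying with a growing delay on failure."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx status codes
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(delay * attempt)  # linear backoff between retries
    return None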
Handling Dynamic Content
Some websites load content with JavaScript, so the HTML that requests returns doesn't contain the data you want. For those, one common tool is Selenium, which drives a real browser.
# Install Selenium (for dynamic content)
pip install selenium
# Selenium 4.6+ ships with Selenium Manager, which downloads a matching
# ChromeDriver automatically; on older versions, download it yourself from
# https://chromedriver.chromium.org/
# dynamic_scraper.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Setup Chrome; point Service at your chromedriver binary
# (on Selenium 4.6+, plain webdriver.Chrome() also works)
service = Service('path/to/chromedriver')
driver = webdriver.Chrome(service=service)

# Navigate to page
driver.get('https://example.com')

# Wait for content to load (explicit waits are better; see below)
time.sleep(3)

# Get page source (includes JS-rendered content)
html = driver.page_source

# Parse with Beautiful Soup
soup = BeautifulSoup(html, 'lxml')

# Or use Selenium directly
titles = driver.find_elements(By.CSS_SELECTOR, '.product-title')
for title in titles:
    print(title.text)

# Close browser
driver.quit()
Selenium is slower and more resource-intensive. Use it only when:
- Content loads with JavaScript
- You need to interact with the page (clicks, scrolls)
- Requests-based scraping doesn't work
For static pages, requests with Beautiful Soup is faster and preferred.
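If you do reach for Selenium, explicit waits are more reliable than a fixed time.sleep(); a sketch (the URL and selector are placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Block until matching elements appear, up to a 10-second timeout
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.product-title'))
)
for title in titles:
    print(title.text)

driver.quit()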
Storing Scraped Data
Let's also add JSON export. This method (and the database one below) is meant to live on the BookScraper class alongside save_to_csv.
import json

def save_to_json(self, books, filename='books.json'):
    """Save books to JSON file"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(books, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(books)} books to {filename}")
def save_to_database(self, books):
    """Save books to a SQLite database"""
    import sqlite3

    conn = sqlite3.connect('books.db')
    cursor = conn.cursor()

    # Create table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            rating TEXT,
            availability TEXT
        )
    ''')

    # Insert data
    for book in books:
        cursor.execute('''
            INSERT INTO books (title, price, rating, availability)
            VALUES (?, ?, ?, ?)
        ''', (book['title'], book['price'],
              book['rating'], book['availability']))

    conn.commit()
    conn.close()
    print(f"Saved {len(books)} books to books.db")
Best Practices
- Check robots.txt - Respect website rules
- Read Terms of Service - Some sites prohibit scraping
- Add delays - Don't overload servers
- Set proper headers - Identify your scraper
- Cache responses - Don't re-fetch unchanged data (see the sketch after this list)
- Handle errors - Graceful degradation
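One convenient way to cache responses is the third-party requests-cache library (pip install requests-cache); a sketch, assuming an on-disk SQLite cache is acceptable:
import requests
import requests_cache

# Patches requests so responses are cached on disk for an hour
requests_cache.install_cache('scraper_cache', expire_after=3600)

response = requests.get('http://books.toscrape.com/')
print(response.from_cache)  # True once served from the cache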
Before scraping, check if the website offers:
- Public APIs - Official data access
- RSS Feeds - For blogs/news
- Data Exports - Some sites sell data
Summary
Congratulations! You've built a complete web scraper.
What You Built
- HTTP Request Handler - Fetch web pages
- HTML Parser - Extract data with Beautiful Soup
- Multi-page Scraper - Handle pagination
- Data Exporter - Save to CSV/JSON/Database
Next Steps
- Add proxy rotation
- Implement caching
- Move to the Scrapy framework for larger projects
- Learn about API design