
Build a Web Scraper in Python

Difficulty: Beginner · Estimated time: ~2 hours

Introduction

Web scraping is the automated process of extracting data from websites. It's used for price monitoring, lead generation, research, and much more.

In this tutorial, you'll build a web scraper in Python that can extract product information from e-commerce sites. You'll learn the fundamentals of HTML parsing and data extraction.

What You'll Build

A web scraper that can:

  • Extract product titles and prices
  • Handle pagination
  • Save data to CSV/JSON
  • Handle errors gracefully

What You'll Learn

  • HTTP requests with the requests library
  • HTML parsing with Beautiful Soup
  • CSS selectors for targeting elements
  • Saving data to CSV and JSON
  • Web scraping ethics and best practices

How Web Scraping Works

Web scraping involves three main steps:

1. Make HTTP Request

Send a request to the website's server to fetch the HTML content of a page.

2. Parse HTML

Analyze the HTML structure and find the elements containing the data you want.

3. Extract Data

Pull out the relevant information and save it in a usable format.
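
The three steps map onto just a few lines of code. Here is a minimal sketch using example.com as a stand-in target (it uses the requests and Beautiful Soup libraries we'll install later in this tutorial):

Python
import requests
from bs4 import BeautifulSoup

# 1. Make HTTP request
response = requests.get('https://example.com')

# 2. Parse HTML
soup = BeautifulSoup(response.text, 'lxml')

# 3. Extract data
heading = soup.find('h1').text
print(heading)  # "Example Domain"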

Legal Considerations

Always check a website's robots.txt and terms of service before scraping. Some websites explicitly prohibit scraping. Use scraping responsibly and don't overload servers with requests.
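
Python's standard library can read robots.txt for you. A quick pre-flight check, pointed here at the demo site used later in this tutorial, might look like this:

Python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser('http://books.toscrape.com/robots.txt')
robots.read()

# can_fetch() reports whether a given user agent may request a URL
print(robots.can_fetch('*', 'http://books.toscrape.com/catalogue/page-1.html'))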

Project Overview

We'll build a product price scraper that:

  • Scrapes product listings from a demo e-commerce site
  • Extracts each product's title, price, rating, and availability
  • Handles multiple pages of results
  • Saves the data to CSV for analysis

Technical Stack

  • requests - HTTP library
  • Beautiful Soup 4 - HTML parser
  • CSV/JSON - Data storage

Prerequisites

  • Python 3.8+ installed
  • Basic Python knowledge
  • Code editor

Project Setup

Bash
# Create project
mkdir web-scraper
cd web-scraper

# Create virtual environment
python -m venv venv

# Activate
# Windows:
venv\Scripts\activate

# Mac/Linux:
source venv/bin/activate

# Install libraries
pip install requests beautifulsoup4 lxml

HTML Basics

Before scraping, let's understand HTML structure.

HTML
<!-- HTML structure example -->
<div class="product">
    <h2 class="product-title">Product Name</h2>
    <span class="price">$99.99</span>
    <div class="rating" data-value="4.5">4.5 stars</div>
</div>

CSS Selectors

  • .class - Select by class
  • #id - Select by ID
  • tag - Select by tag name
  • parent child - Select descendants of an element
  • [attr] - Select by attribute
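
Beautiful Soup (introduced in the next section) supports these selectors through its select() method. Applied to the product snippet above:

Python
from bs4 import BeautifulSoup

html = '''
<div class="product">
    <h2 class="product-title">Product Name</h2>
    <span class="price">$99.99</span>
    <div class="rating" data-value="4.5">4.5 stars</div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

print(soup.select('.price')[0].text)                  # class -> $99.99
print(soup.select('div.product h2')[0].text)          # descendant -> Product Name
print(soup.select('[data-value]')[0]['data-value'])   # attribute -> 4.5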

Beautiful Soup Basics

Let's learn Beautiful Soup fundamentals.

Python
from bs4 import BeautifulSoup
import requests

# Sample HTML
html = """
<html>
<body>
    <div class="product">
        <h2>Laptop</h2>
        <span class="price">$999</span>
    </div>
    <div class="product">
        <h2>Phone</h2>
        <span class="price">$599</span>
    </div>
</body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html, 'lxml')

# Find all products
products = soup.find_all('div', class_='product')

for product in products:
    title = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f"Title: ${title}, Price: ${price}")

# Output:
# Title: Laptop, Price: $999
# Title: Phone, Price: $599

Key Methods

  • find() - First matching element
  • find_all() - All matching elements
  • .text - Get text content
  • .get('attr') - Get an attribute value
  • .parent - Navigate to the parent element
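
Using the soup object parsed in the example above, the remaining methods look like this:

Python
# Continuing with the soup parsed above
first_price = soup.find('span', class_='price')

print(first_price.text)                     # $999
print(first_price.get('class'))             # ['price']
print(first_price.parent.find('h2').text)   # Laptop (via the parent <div>)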

Scraping Static Pages

Now let's build a real scraper for a books website.

Python
# scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time


class BookScraper:
    """Scraper for books.toscrape.com - a demo e-commerce site"""
    
    def __init__(self):
        self.base_url = "http://books.toscrape.com/catalogue/"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    
    def get_page(self, page=1):
        """Fetch a page of books"""
        url = f"{self.base_url}page-{page}.html"
        
        response = requests.get(url, headers=self.headers)
        
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'lxml')
        return None
    
    def extract_book_data(self, book_element):
        """Extract data from a single book element"""
        
        try:
            # Get book title
            title = book_element.find('h3').find('a')['title']
            
            # Get price
            price = book_element.find('p', class_='price_color').text
            
            # Get rating: the second CSS class holds the word, e.g. 'Three'
            rating = book_element.find('p', class_='star-rating')['class'][1]
            
            # Get availability
            availability = book_element.find('p', class_='instock').text.strip()
            
            return {
                'title': title,
                'price': price,
                'rating': rating,
                'availability': availability
            }
        except Exception as e:
            print(f"Error extracting book: ${e}")
            return None
    
    def scrape_pages(self, num_pages=2):
        """Scrape multiple pages"""
        all_books = []
        
        for page in range(1, num_pages + 1):
            print(f"Scraping page ${page}...")
            
            soup = self.get_page(page)
            
            if not soup:
                print(f"Failed to get page ${page}")
                break
            
            # Find all book elements
            books = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
            
            for book in books:
                book_data = self.extract_book_data(book)
                if book_data:
                    all_books.append(book_data)
            
            # Be polite - wait between requests
            time.sleep(1)
        
        return all_books
    
    def save_to_csv(self, books, filename='books.csv'):
        """Save books to CSV file"""
        
        if not books:
            print("No books to save")
            return
        
        keys = books[0].keys()
        
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(books)
        
        print(f"Saved ${len(books)} books to ${filename}")


# Run the scraper
if __name__ == "__main__":
    scraper = BookScraper()
    books = scraper.scrape_pages(num_pages=2)
    scraper.save_to_csv(books)

Key Points

  • Always set a User-Agent header
  • Add delays between requests
  • Handle exceptions gracefully
  • Check response status codes
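
One way to bake several of these points into the scraper is a requests.Session with automatic retries, which also reuses connections between requests. A sketch:

Python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session applies headers to every request and reuses connections
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# Retry transient server errors with an increasing back-off
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('http://books.toscrape.com/')
response.raise_for_status()  # raise for 4xx/5xx instead of failing silently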

Handling Dynamic Content

Some websites load content with JavaScript. For those, we need Selenium.

Bash
# Install Selenium (for dynamic content)
pip install selenium

# Selenium 4.6+ downloads a matching ChromeDriver automatically
# via Selenium Manager, so no separate driver install is needed.

Python
# dynamic_scraper.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

# Setup Chrome (Selenium Manager finds the driver automatically)
driver = webdriver.Chrome()

# Navigate to page
driver.get('https://example.com')

# Wait for content to load
time.sleep(3)

# Get page source (includes JS-rendered content)
html = driver.page_source

# Parse with Beautiful Soup
soup = BeautifulSoup(html, 'lxml')

# Or use Selenium directly
titles = driver.find_elements(By.CSS_SELECTOR, '.product-title')
for title in titles:
    print(title.text)

# Close browser
driver.quit()

When to Use Selenium

Selenium is slower and more resource-intensive. Use it only when:

  • Content loads with JavaScript
  • You need to interact with the page (clicks, scrolls)
  • Plain requests + Beautiful Soup doesn't work

For static pages, Beautiful Soup is faster and preferred.
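
If you do reach for Selenium, an explicit wait is more reliable than the fixed time.sleep(3) used above: it returns as soon as the element appears and fails loudly if it never does. A sketch, reusing the hypothetical .product-title selector from the earlier snippet:

Python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Wait up to 10 seconds for the element instead of sleeping blindly
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.product-title'))
)
print(element.text)

driver.quit()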

Storing Scraped Data

Let's also add JSON export functionality.

Python
import json


# Add these as extra methods on the BookScraper class
def save_to_json(self, books, filename='books.json'):
    """Save books to JSON file"""
    
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(books, f, indent=2, ensure_ascii=False)
    
    print(f"Saved ${len(books)} books to ${filename}")


def save_to_database(self, books):
    """Save books to SQLite database"""
    
    import sqlite3
    
    conn = sqlite3.connect('books.db')
    cursor = conn.cursor()
    
    # Create table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            rating TEXT,
            availability TEXT
        )
    ''')
    
    # Insert data
    for book in books:
        cursor.execute('''
            INSERT INTO books (title, price, rating, availability)
            VALUES (?, ?, ?, ?)
        ''', (book['title'], book['price'],
              book['rating'], book['availability']))
    
    conn.commit()
    conn.close()
    print(f"Saved ${len(books)} books to database")

Best Practices

  • Check robots.txt - Respect website rules
  • Read Terms of Service - Some sites prohibit scraping
  • Add delays - Don't overload servers
  • Set proper headers - Identify your scraper
  • Cache responses - Don't re-fetch unchanged data (see the sketch after this list)
  • Handle errors - Graceful degradation
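
Caching can be as simple as saving each response to disk, keyed by a hash of the URL. A minimal sketch (the requests-cache package is a more complete drop-in alternative):

Python
import hashlib
from pathlib import Path

import requests

CACHE_DIR = Path('cache')
CACHE_DIR.mkdir(exist_ok=True)


def get_cached(url):
    """Fetch a URL, reusing a local copy if we've downloaded it before."""
    cache_file = CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + '.html')
    
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    
    response = requests.get(url)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding='utf-8')
    return response.text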

Alternatives to Scraping

Before scraping, check if the website offers:

  • Public APIs - Official data access
  • RSS Feeds - For blogs/news
  • Data Exports - Some sites sell data

Summary

Congratulations! You've built a complete web scraper.

What You Built

  • HTTP Request Handler - Fetch web pages
  • HTML Parser - Extract data with Beautiful Soup
  • Multi-page Scraper - Handle pagination
  • Data Exporter - Save to CSV/JSON/Database

Next Steps

  • Add proxy rotation
  • Implement response caching
  • Use the Scrapy framework for larger projects
  • Learn about API design

Continue Learning

Try these tutorials next:

  • Build a Password Manager
  • Build a REST API
  • Build a Markdown Editor