Build a Web Scraper in Python
Introduction
Web scraping is the automated process of extracting data from websites. It's used for price monitoring, lead generation, research, and much more.
In this tutorial, you'll build a web scraper in Python that can extract product information from e-commerce sites. You'll learn the fundamentals of HTML parsing and data extraction.
By the end, you'll have a web scraper that can:
- Extract product titles and prices
- Handle pagination
- Save data to CSV/JSON
- Handle errors gracefully
Along the way, you'll learn:
- HTTP requests with the requests library
- HTML parsing with Beautiful Soup
- CSS selectors for element selection
- Data storage to CSV/JSON
- Web scraping ethics and best practices
How Web Scraping Works
Web scraping involves three main steps:
1. Make HTTP Request
Send a request to the website's server to fetch the HTML content of a page.
2. Parse HTML
Analyze the HTML structure and find the elements containing the data you want.
3. Extract Data
Pull out the relevant information and save it in a usable format.
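Here's that cycle in miniature, as a runnable sketch (books.toscrape.com is the demo site used throughout this tutorial):
import requests
from bs4 import BeautifulSoup

# 1. Make HTTP request
response = requests.get("http://books.toscrape.com/", timeout=10)

# 2. Parse HTML
soup = BeautifulSoup(response.text, 'lxml')

# 3. Extract data - every HTML page has a <title> tag
print(soup.title.text.strip())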
Always check a website's robots.txt and terms of service before scraping. Some websites explicitly prohibit scraping. Use scraping responsibly and don't overload servers with requests.
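Python's standard library can read robots.txt for you; a minimal sketch (the URL is just an example):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://books.toscrape.com/robots.txt")
rp.read()

# can_fetch() returns False if the rules disallow this user agent/path
print(rp.can_fetch("*", "http://books.toscrape.com/catalogue/page-1.html"))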
Project Overview
We'll build a product price scraper that:
- Searches for products on a sample e-commerce page
- Extracts product name, price, and rating
- Handles multiple pages of results
- Saves data to CSV for analysis
Technical Stack
- requests - HTTP library
- Beautiful Soup 4 - HTML parser
- CSV/JSON - Data storage
Prerequisites
- Python 3.8+ installed
- Basic Python knowledge
- Code editor
Project Setup
# Create project
mkdir web-scraper
cd web-scraper
# Create virtual environment
python -m venv venv
# Activate
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate
# Install libraries
pip install requests beautifulsoup4 lxml
HTML Basics
Before scraping, let's understand HTML structure.
<!-- HTML structure example -->
<div class="product">
    <h2 class="product-title">Product Name</h2>
    <span class="price">$99.99</span>
    <div class="rating" data-value="4.5">4.5 stars</div>
</div>
CSS Selectors
- .class - Select by class
- #id - Select by ID
- tag - Select by tag name
- parent child - Select descendants (use parent > child for direct children)
- [attr] - Select by attribute
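Beautiful Soup accepts all of these through its select() method; a quick sketch using the product markup above:
from bs4 import BeautifulSoup

html = '<div class="product"><span class="price">$99.99</span></div>'
soup = BeautifulSoup(html, 'lxml')

# select() takes any CSS selector and returns a list of matches
for price in soup.select('.product .price'):
    print(price.text)  # $99.99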
Beautiful Soup Basics
Let's learn Beautiful Soup fundamentals.
from bs4 import BeautifulSoup

# Sample HTML
html = """
<html>
<body>
    <div class="product">
        <h2>Laptop</h2>
        <span class="price">$999</span>
    </div>
    <div class="product">
        <h2>Phone</h2>
        <span class="price">$599</span>
    </div>
</body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html, 'lxml')

# Find all products
products = soup.find_all('div', class_='product')

for product in products:
    title = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f"Title: {title}, Price: {price}")

# Output:
# Title: Laptop, Price: $999
# Title: Phone, Price: $599
Key Beautiful Soup methods and attributes:
- find() - First matching element
- find_all() - All matching elements
- .text - Get text content
- .get('attr') - Get attribute
- .parent - Navigate to parent
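Attribute access and parent navigation, continuing from the soup object above:
# ['attr'] raises KeyError when missing; .get() returns None instead
price = soup.find('span', class_='price')
print(price.get('class'))         # ['price'] - class is always a list

# .parent walks up to the enclosing <div class="product">
print(price.parent.get('class'))  # ['product']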
Scraping Static Pages
Now let's build a real scraper for a books website.
# scraper.py
import csv
import time

import requests
from bs4 import BeautifulSoup


class BookScraper:
    """Scraper for books.toscrape.com - a demo e-commerce site"""

    def __init__(self):
        self.base_url = "http://books.toscrape.com/catalogue/"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_page(self, page=1):
        """Fetch a page of books"""
        url = f"{self.base_url}page-{page}.html"
        response = requests.get(url, headers=self.headers, timeout=10)
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'lxml')
        return None

    def extract_book_data(self, book_element):
        """Extract data from a single book element"""
        try:
            # Get book title (the full title lives in the link's title attribute)
            title = book_element.find('h3').find('a')['title']

            # Get price
            price = book_element.find('p', class_='price_color').text

            # Get rating: the element's second class is the word "One" through "Five"
            rating = book_element.find('p', class_='star-rating')['class'][1]

            # Get availability
            availability = book_element.find('p', class_='instock').text.strip()

            return {
                'title': title,
                'price': price,
                'rating': rating,
                'availability': availability
            }
        except Exception as e:
            print(f"Error extracting book: {e}")
            return None

    def scrape_pages(self, num_pages=2):
        """Scrape multiple pages"""
        all_books = []

        for page in range(1, num_pages + 1):
            print(f"Scraping page {page}...")
            soup = self.get_page(page)

            if not soup:
                print(f"Failed to get page {page}")
                break

            # Each book sits in an <article class="product_pod"> element
            books = soup.find_all('article', class_='product_pod')

            for book in books:
                book_data = self.extract_book_data(book)
                if book_data:
                    all_books.append(book_data)

            # Be polite - wait between requests
            time.sleep(1)

        return all_books

    def save_to_csv(self, books, filename='books.csv'):
        """Save books to CSV file"""
        if not books:
            print("No books to save")
            return

        keys = books[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(books)

        print(f"Saved {len(books)} books to {filename}")


# Run the scraper
if __name__ == "__main__":
    scraper = BookScraper()
    books = scraper.scrape_pages(num_pages=2)
    scraper.save_to_csv(books)
This scraper bakes in several good habits:
- Always set a User-Agent header
- Add delays between requests
- Handle exceptions gracefully
- Check response status codes
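Combining these points, here's a sketch of a standalone fetch helper with retries and a simple backoff (the retry count and delays are arbitrary choices, not part of the scraper above):
import time
import requests

def fetch(url, retries=3, delay=2):
    """Fetch a URL, retrying with a growing delay on failure."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx status codes
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(delay * attempt)  # linear backoff between retries
    return None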
Handling Dynamic Content
Some websites load content with JavaScript, so the HTML that requests returns doesn't contain the data you want. For those, one common tool is Selenium, which drives a real browser.
# Install Selenium (for dynamic content)
pip install selenium
# Selenium 4.6+ ships with Selenium Manager, which downloads a matching
# ChromeDriver automatically; on older versions, download it yourself from
# https://chromedriver.chromium.org/
# dynamic_scraper.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Setup Chrome; point Service at your chromedriver binary
# (on Selenium 4.6+, plain webdriver.Chrome() also works)
service = Service('path/to/chromedriver')
driver = webdriver.Chrome(service=service)

# Navigate to page
driver.get('https://example.com')

# Wait for content to load (explicit waits are better; see below)
time.sleep(3)

# Get page source (includes JS-rendered content)
html = driver.page_source

# Parse with Beautiful Soup
soup = BeautifulSoup(html, 'lxml')

# Or use Selenium directly
titles = driver.find_elements(By.CSS_SELECTOR, '.product-title')
for title in titles:
    print(title.text)

# Close browser
driver.quit()
Selenium is slower and more resource-intensive. Use it only when:
- Content loads with JavaScript
- You need to interact with the page (clicks, scrolls)
- Requests-based scraping doesn't work
For static pages, requests with Beautiful Soup is faster and preferred.
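If you do reach for Selenium, explicit waits are more reliable than a fixed time.sleep(); a sketch (the URL and selector are placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Block until matching elements appear, up to a 10-second timeout
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.product-title'))
)
for title in titles:
    print(title.text)

driver.quit()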
Storing Scraped Data
Let's also add JSON export. This method (and the database one below) is meant to live on the BookScraper class alongside save_to_csv.
import json

def save_to_json(self, books, filename='books.json'):
    """Save books to JSON file"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(books, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(books)} books to {filename}")
def save_to_database(self, books):
    """Save books to a SQLite database"""
    import sqlite3

    conn = sqlite3.connect('books.db')
    cursor = conn.cursor()

    # Create table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            rating TEXT,
            availability TEXT
        )
    ''')

    # Insert data
    for book in books:
        cursor.execute('''
            INSERT INTO books (title, price, rating, availability)
            VALUES (?, ?, ?, ?)
        ''', (book['title'], book['price'],
              book['rating'], book['availability']))

    conn.commit()
    conn.close()
    print(f"Saved {len(books)} books to books.db")
Best Practices
- Check robots.txt - Respect website rules
- Read Terms of Service - Some sites prohibit scraping
- Add delays - Don't overload servers
- Set proper headers - Identify your scraper
- Cache responses - Don't re-fetch unchanged data (see the sketch after this list)
- Handle errors - Graceful degradation
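One convenient way to cache responses is the third-party requests-cache library (pip install requests-cache); a sketch, assuming an on-disk SQLite cache is acceptable:
import requests
import requests_cache

# Patches requests so responses are cached on disk for an hour
requests_cache.install_cache('scraper_cache', expire_after=3600)

response = requests.get('http://books.toscrape.com/')
print(response.from_cache)  # True once served from the cache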
Before scraping, check if the website offers:
- Public APIs - Official data access
- RSS Feeds - For blogs/news
- Data Exports - Some sites sell data
Summary
Congratulations! You've built a complete web scraper.
What You Built
- HTTP Request Handler - Fetch web pages
- HTML Parser - Extract data with Beautiful Soup
- Multi-page Scraper - Handle pagination
- Data Exporter - Save to CSV/JSON/Database
Next Steps
- Add proxy rotation
- Implement caching
- Move to the Scrapy framework for larger projects
- Learn about API design