BooksToScrape — Python Web Scraper

A small, easy-to-understand web scraper written in Python 3 that crawls the BooksToScrape demo site and extracts book data (title, price, availability, description, star rating, image URL, and category). It also downloads book cover images and saves the collected data to a CSV file.

This repository is intended as a learning example for web scraping with Requests and BeautifulSoup.


Features

  • Scrapes all categories on BooksToScrape
  • Automatically follows pagination
  • Extracts details for every book:
    • Title
    • Price
    • Availability
    • Product description
    • Star rating
    • Image URL
    • Category
  • Downloads book cover images into category-based folders
  • Saves all data to a CSV file
  • Basic error handling and polite request delays
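
The extraction listed above relies on BooksToScrape's HTML conventions; in particular, the star rating is encoded as a CSS class (e.g. `star-rating Three`) rather than as text, and pagination exposes a `li.next` link. A minimal parsing sketch against a sample listing snippet (the helper name and field layout are illustrative, not the repository's actual code):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one BooksToScrape listing page entry.
SAMPLE = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="a-light-in-the-attic_1000/index.html">A Light in ...</a></h3>
  <p class="star-rating Three"></p>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
"""

RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_book(article) -> dict:
    """Extract title, price, availability, and star rating from one listing."""
    rating_classes = article.select_one("p.star-rating")["class"]
    word = next(c for c in rating_classes if c in RATINGS)
    return {
        "title": article.h3.a["title"],
        "price": article.select_one("p.price_color").get_text(strip=True),
        "availability": article.select_one("p.availability").get_text(strip=True),
        "star_rating": RATINGS[word],
    }

soup = BeautifulSoup(SAMPLE, "html.parser")
book = parse_book(soup.select_one("article.product_pod"))
next_link = soup.select_one("li.next a")  # None on the last page of a category
```

Following pagination is then a loop: fetch a page, parse every `article.product_pod`, and continue while `next_link` is not `None`.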

Requirements

  • Python 3.7 or newer
  • requests
  • beautifulsoup4

Install the dependencies:

pip install requests beautifulsoup4

(Optionally, create a virtual environment before installing.)


Installation

  1. Clone the repository:
     git clone https://github.com/Asiwaju24/Scraping.git
     cd Scraping
  2. Install the required packages:
     pip install requests beautifulsoup4

Usage

Run the scraper:

python scrape.py

After the script completes:

  • A CSV file named books_full_scrape.csv will be created in the project root.
  • All book cover images will be saved inside an images/ folder, grouped by category (e.g. images/Travel/).
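
The images/<Category>/ layout described above can be derived from the image URL and the book's category alone. A small sketch of that path mapping (the helper name and `images` root are assumptions based on this description, not the script's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

def image_path(image_url: str, category: str, root: str = "images") -> Path:
    """Map a cover image URL to images/<Category>/<filename>."""
    filename = Path(urlparse(image_url).path).name  # last path segment
    return Path(root) / category / filename

p = image_path("https://books.toscrape.com/media/cache/2c/da/cover.jpg", "Travel")
```

The downloaded bytes would then be written to `p` after creating its parent directory with `p.parent.mkdir(parents=True, exist_ok=True)`.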

To avoid affecting your system Python packages, consider running the script inside a virtual environment.


Output

  • CSV file: books_full_scrape.csv
    Example CSV columns (header row):
title, price, availability, description, star_rating, image_url, category
  • Images folder: images/<Category>/<image-files>
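
Given the header row above, the CSV can be produced with the standard library's csv.DictWriter. A hedged sketch (the helper name is illustrative; the repository's script may structure this differently):

```python
import csv
import tempfile
from pathlib import Path

# Column order matching the header row described above.
FIELDS = ["title", "price", "availability", "description",
          "star_rating", "image_url", "category"]

def write_books(rows, path):
    """Write scraped book dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

out = Path(tempfile.mkdtemp()) / "books_full_scrape.csv"
write_books([{
    "title": "A Light in the Attic", "price": "£51.77",
    "availability": "In stock", "description": "...",
    "star_rating": 3, "image_url": "http://example.com/cover.jpg",
    "category": "Poetry",
}], out)
```

Using `newline=""` prevents the csv module from writing blank lines between rows on Windows.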

Project structure

├── scrape.py
├── books_full_scrape.csv    # Generated after running the script
├── images/
│   ├── Travel/
│   ├── Mystery/
│   ├── Fiction/
│   └── ...
└── README.md

Notes on ethics & legality

The target site, https://books.toscrape.com, is intentionally provided for scraping practice. Respect robots.txt and site owners when scraping real sites. Limit request rate and avoid excessive parallel requests that could cause problems for servers.
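
For real sites, the robots.txt check mentioned above can be automated with the standard library's urllib.robotparser. In this sketch the rules are fed in directly so the example runs offline; in practice you would point the parser at the site's /robots.txt URL and call `read()`:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Offline example: parse hypothetical rules instead of fetching /robots.txt.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(user_agent, url) reports whether the rules permit the request.
ok = rp.can_fetch("*", "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch("*", "https://example.com/private/data")
```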


Error handling & rate limiting

The script includes simple error handling and request delays to be polite to the server. If you extend the scraper or apply it to other sites, add more robust retry logic, exponential backoff, and careful handling of network errors or site structure changes.
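
If you add retry logic, exponential backoff doubles the wait after each failed attempt. A minimal sketch, not the repository's implementation; the `sleep` parameter is injectable so the logic can be tested without real delays:

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponential backoff.

    Waits base_delay, then 2*base_delay, 4*base_delay, ... between
    attempts, and re-raises the last error once retries are exhausted.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

In the scraper, `fetch` would be a small wrapper around `requests.get` that raises on bad status codes (e.g. via `response.raise_for_status()`).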


Contributing

Contributions and improvements are welcome. Suggested improvements you could add:

  • Add a requirements.txt or pyproject.toml
  • Add CLI flags (output filename, delay, categories to limit to)
  • Add logging and more robust retry/backoff logic
  • Add unit tests for parsing functions

If you make changes, please open a pull request with a clear description of what changed and why.


License

This project is provided for learning and demonstration purposes. You may reuse or adapt the code for non-commercial or educational uses. Add a formal license (e.g., MIT) if you want to publish this repository for wider reuse.


Target website

BooksToScrape: https://books.toscrape.com — a site provided specifically for practicing and testing scraping techniques.
