A small, easy-to-understand web scraper written in Python 3 that crawls the BooksToScrape demo site and extracts book data (title, price, availability, description, rating, image URL and category). It also downloads book cover images and saves the collected data to a CSV file.
This repository is intended as a learning example for web scraping with Requests and BeautifulSoup.
- Features
- Requirements
- Installation
- Usage
- Output
- Project structure
- Notes on ethics & legality
- Error handling & rate limiting
- Contributing
- License
- Scrapes all categories on BooksToScrape
- Automatically follows pagination
- Extracts details for every book:
- Title
- Price
- Availability
- Product description
- Star rating
- Image URL
- Category
- Downloads book cover images into category-based folders
- Saves all data to a CSV file
- Basic error handling and polite request delays
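To make the extraction step above concrete, here is a small standalone sketch of the general Requests + BeautifulSoup pattern the scraper is built around: fetch a listing page, pull a few fields per book, and follow the "next" pagination link. This is illustrative only, not code copied from scrape.py; the CSS selectors match the BooksToScrape markup.

```python
# Illustrative sketch only -- not the actual scrape.py.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "https://books.toscrape.com/catalogue/page-1.html"

def scrape_listing(start_url: str) -> None:
    url = start_url
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        for article in soup.select("article.product_pod"):
            title = article.h3.a["title"]
            price = article.select_one("p.price_color").get_text(strip=True)
            # The star rating is encoded as a CSS class, e.g. "star-rating Three"
            rating = article.select_one("p.star-rating")["class"][1]
            print(title, price, rating)

        # Follow pagination via the "next" link until it disappears
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1)  # polite delay between requests

scrape_listing(START_URL)
```

The real scraper also visits each product page to collect the description, availability, image URL, and category, and downloads the cover images; the sketch omits those steps for brevity.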
- Python 3.7+
- requests
- beautifulsoup4
Install the dependencies:

`pip install requests beautifulsoup4`

(Optionally, create a virtual environment before installing.)
- Clone the repository:
  `git clone https://github.com/Asiwaju24/Scraping.git`
  `cd Scraping`
- Install the required packages:
  `pip install requests beautifulsoup4`

Run the scraper:

`python scrape.py`

After the script completes:
- A CSV file named `books_full_scrape.csv` will be created in the project root.
- All book cover images will be saved inside an `images/` folder, grouped by category (e.g. `images/Travel/`).
If you want, run the script inside a virtual environment to avoid impacting your system Python packages.
- CSV file: `books_full_scrape.csv`
  Example CSV columns (header row): `title, price, availability, description, star_rating, image_url, category`
- Images folder: `images/<Category>/<image-files>`
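Once the scrape has finished, the CSV can be inspected with the standard library. A quick sketch, assuming the header row listed above:

```python
# Quick check of the generated CSV (column names taken from the header row above).
import csv

with open("books_full_scrape.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["title"], row["price"], row["category"])
```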
├── scrape.py
├── books_full_scrape.csv # Generated after running the script
├── images/
│ ├── Travel/
│ ├── Mystery/
│ ├── Fiction/
│ └── ...
└── README.md
The target site, https://books.toscrape.com, is intentionally provided for scraping practice. Respect robots.txt and site owners when scraping real sites. Limit request rate and avoid excessive parallel requests that could cause problems for servers.
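For real sites, Python's standard library can check robots.txt before you fetch a URL. A minimal sketch (the domain and user-agent string here are placeholders):

```python
# Minimal robots.txt check using the standard library (illustrative only).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("MyScraperBot/1.0", "https://example.com/some/page.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```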
The script includes simple error handling and request delays to be polite to the server. If you extend the scraper or apply it to other sites, add more robust retry logic, exponential backoff, and careful handling of network errors or site structure changes.
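If you adapt the scraper to other sites, a simple retry helper with exponential backoff might look like the following. This is an illustrative sketch, not part of scrape.py:

```python
# Illustrative retry helper with exponential backoff; not part of scrape.py.
import time

import requests

def fetch_with_retries(url: str, max_retries: int = 3, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            wait = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Request failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)
```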
Contributions and improvements are welcome. Suggested improvements you could add:
- Add a `requirements.txt` or `pyproject.toml`
- Add CLI flags (output filename, request delay, a category filter); one possible layout is sketched after this list
- Add logging and more robust retry/backoff logic
- Add unit tests for parsing functions
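For the CLI-flags suggestion, one possible argparse layout is shown below. The flag names are hypothetical; scrape.py does not currently accept any arguments:

```python
# Hypothetical CLI layout for the flags suggested above; not implemented in scrape.py.
import argparse

parser = argparse.ArgumentParser(description="Scrape books.toscrape.com")
parser.add_argument("--output", default="books_full_scrape.csv",
                    help="path of the CSV file to write")
parser.add_argument("--delay", type=float, default=1.0,
                    help="seconds to wait between requests")
parser.add_argument("--categories", nargs="*", default=None,
                    help="limit scraping to these category names")
args = parser.parse_args()
print(args.output, args.delay, args.categories)
```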
If you make changes, please open a pull request with a clear description of what changed and why.
This project is provided for learning and demonstration purposes. You may reuse or adapt the code for non-commercial or educational uses. Add a formal license (e.g., MIT) if you want to publish this repository for wider reuse.
BooksToScrape: https://books.toscrape.com — a site provided specifically for practicing and testing scraping techniques.