Z786ZA/Instagram-web-scraper

Instagram Web Scraper

A production-ready boilerplate to collect publicly available Instagram web data (profiles, posts, hashtags) using safe automation patterns, rotating proxies, and human-like delays. Built for agencies, researchers, and growth teams that want reliable scraping with lower block risk.

For discussion, queries, and freelance work, reach out via Telegram, Discord, WhatsApp, or Gmail.


Introduction

This repository provides a modular Instagram web scraping starter that focuses on resilience (anti-detect flows, rotating proxies, session reuse) and clarity (typed schema, storage adapters). It’s ideal for analysts, SaaS builders, and agencies that need compliant, rate-aware scraping of public pages.

instagram-web-scraper.png

Key Benefits

  1. Saves time with prebuilt Playwright/Selenium runners.
  2. Scales from single run to distributed jobs.
  3. Safer with proxy rotation, backoff, fingerprint & session logic.

Features

| Feature | Details |
| --- | --- |
| Headless/Visible Browsers | Playwright or Selenium drivers with toggleable headless mode |
| Proxy Rotation | Supports residential/mobile proxies with per-request rotation |
| Session Persistence | Reuse cookies/storage to reduce challenges and CAPTCHAs |
| Human-like Throttling | Randomized delays, jitter, scrolling, and viewport variance |
| Target Modules | Profile, post, and hashtag pages (public data) with parsers |
| Output Formats | JSONL, CSV, SQLite/Postgres adapters |
| Error/Retry Logic | Exponential backoff, soft-fail queues, resumable runs |
| CLI Runner | `scrape profiles`, `scrape hashtag`, and `resume` subcommands |
| Dockerized | Reproducible runs with one-line Docker start |
| Env-First Config | `.env` for proxies, rate limits, storage, and headless flags |
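The human-like throttling feature above can be sketched as a randomized sleep with jitter. `human_delay` is a hypothetical helper (not part of this repo's API); its default bounds mirror the `RATE_MIN_MS`/`RATE_MAX_MS` values shown in the sample `.env` later in this README:

```python
import random
import time

def human_delay(min_ms: int = 800, max_ms: int = 2200) -> float:
    """Sleep for a uniformly random interval to mimic human pacing.

    Returns the delay actually used, in seconds.
    """
    delay = random.uniform(min_ms, max_ms) / 1000.0
    time.sleep(delay)
    return delay

# Example: pace a batch of page fetches
# for url in urls:
#     fetch(url)       # your request function
#     human_delay()
```

Drawing each delay independently (rather than sleeping a fixed interval) avoids the perfectly regular request cadence that detection systems flag.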

Use Cases

  • Competitive research and trend tracking
  • Social listening for public hashtags
  • Creator discovery & lead lists (public info)
  • Academic/market research on public engagement

FAQs

Q: How can I reduce scraping warnings?
A: Scraping warnings (blocks/challenges) often result from aggressive request rates, reused fingerprints, or IP reputation. Reduce concurrency, add randomized delays, persist sessions, rotate high-quality residential/mobile proxies, and lower fetch depth. Clearing cookies blindly can worsen flags—prefer stable sessions per account/profile, rotate user-agents with consistent device signatures, and implement exponential backoff on 4xx/429 responses.
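The exponential backoff advice above can be sketched as follows. `backoff_delay` and `fetch_with_retry` are illustrative names, and `fetch` is a placeholder for whatever request function the runner uses; this uses "full jitter" (a random wait up to the exponential cap), a common variant:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a wait in [0, min(cap, base * 2**attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_retry(fetch, url, max_attempts: int = 5):
    """Retry `fetch` when it signals a soft block (HTTP 429), backing off between tries."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        time.sleep(backoff_delay(attempt))
    return status, body  # give up, surface the last response
```

The jitter matters: if several workers back off by the same deterministic amount, they retry in lockstep and re-trigger the rate limit together.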

Q: Does Instagram allow web scraping?
A: Accessing or collecting data is governed by Instagram’s Terms and your local laws. This boilerplate is for educational and compliance-oriented uses on publicly available pages. Always review and follow the platform’s terms and applicable regulations before running any scraper.

Q: Can web scraping be detected?
A: Yes. Platforms detect patterns like high request rates, identical fingerprints, datacenter IPs, and scripted navigation. Mitigate via residential/mobile proxies, realistic browser automation (Playwright/Selenium), randomized timings, scroll/viewport simulation, and consistent sessions. Even with safeguards, detection risk can’t be eliminated—only reduced.
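Per-request proxy rotation, mentioned in the answers above, can be as simple as a round-robin cycle over a pool. The class and the endpoint URLs below are illustrative, not part of this repo's API:

```python
import itertools

class ProxyRotator:
    """Round-robin over a pool of proxy URLs, yielding one per request."""

    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._pool)

# Hypothetical residential endpoints
rotator = ProxyRotator([
    "http://user:pass@res1.example:8000",
    "http://user:pass@res2.example:8000",
])
```

Real setups often layer health checks and per-proxy cooldowns on top of this, but round-robin is the core idea.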


Results


  • 10x faster posting schedules
  • 80% engagement increase on group campaigns
  • Fully automated lead response system

Performance Metrics


Average Performance Benchmarks:

  • Speed: 2x faster than manual posting
  • Stability: 99.2% uptime
  • Ban Rate: <0.5% with safe automation mode
  • Throughput: 100+ posts/hour per session

## Do you have a custom project for us? Contact Us


Installation

Prerequisites

  • Node.js or Python
  • Git
  • Docker (optional)

Steps

```bash
# Clone the repo
git clone https://github.com/yourusername/instagram-web-scraper.git
cd instagram-web-scraper

# Install dependencies
# Node (Playwright)
npm install
npx playwright install

# or Python (Selenium/Playwright)
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# then edit .env to set:
# PROXY_URL=           # e.g. http://user:pass@host:port
# DRIVER=playwright    # or selenium
# HEADLESS=true
# RATE_MIN_MS=800
# RATE_MAX_MS=2200
# STORAGE_DIR=.storage
# OUT_FORMAT=jsonl     # csv|jsonl|sqlite|postgres

# Run (examples)
# Scrape a hashtag page (public)
npm run scrape:hashtag -- --tag "travel" --limit 50
# or
python main.py hashtag --tag "travel" --limit 50
```
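The `.env` keys above can be read into a typed settings object on the Python side. This is a sketch under stated assumptions: the `Settings` dataclass and `load_settings` are hypothetical helpers (not this repo's actual config loader), with defaults mirroring the sample values:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    proxy_url: str
    driver: str
    headless: bool
    rate_min_ms: int
    rate_max_ms: int
    storage_dir: str
    out_format: str

def load_settings(env=os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the sample defaults."""
    return Settings(
        proxy_url=env.get("PROXY_URL", ""),
        driver=env.get("DRIVER", "playwright"),
        headless=env.get("HEADLESS", "true").lower() == "true",
        rate_min_ms=int(env.get("RATE_MIN_MS", "800")),
        rate_max_ms=int(env.get("RATE_MAX_MS", "2200")),
        storage_dir=env.get("STORAGE_DIR", ".storage"),
        out_format=env.get("OUT_FORMAT", "jsonl"),
    )
```

Passing the environment mapping in as a parameter keeps the loader easy to test without touching the real process environment.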

Example Output

```jsonl
{"type":"post","shortcode":"CxyZ12A","likes":1243,"comments":57,"caption":"Sunset shots #travel","timestamp":"2025-10-11T14:22:10Z","author":"@example"}
{"type":"profile","username":"example","followers":10422,"following":312,"posts":87,"bio":"Photographer | Traveler"}
```
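Downstream tools can consume this JSONL output one record per line. `read_jsonl` below is an illustrative reader (not part of this repo) that groups records by their `type` field:

```python
import json

def read_jsonl(lines):
    """Parse JSONL records and group them by their `type` field."""
    by_type = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        by_type.setdefault(record["type"], []).append(record)
    return by_type

# Example using records shaped like the output above
sample = [
    '{"type":"post","shortcode":"CxyZ12A","likes":1243,"comments":57}',
    '{"type":"profile","username":"example","followers":10422}',
]
records = read_jsonl(sample)
```

Because each line is an independent JSON document, the same function works on a live file handle, so partial or resumed runs stay readable.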

License

MIT License
