# EasyScrape Tutorial

A comprehensive, step-by-step guide to mastering web scraping with EasyScrape.

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Installation & Setup](#2-installation--setup)
3. [Your First Scraper](#3-your-first-scraper)
4. [CSS Selectors Deep Dive](#4-css-selectors-deep-dive)
5. [Structured Data Extraction](#5-structured-data-extraction)
6. [Configuration & Customisation](#6-configuration--customisation)
7. [Handling Pagination](#7-handling-pagination)
8. [JavaScript-Rendered Pages](#8-javascript-rendered-pages)
9. [Asynchronous Scraping](#9-asynchronous-scraping)
10. [Sessions & Authentication](#10-sessions--authentication)
11. [Error Handling](#11-error-handling)
12. [Data Export](#12-data-export)
13. [Best Practices](#13-best-practices)
14. [Real-World Project](#14-real-world-project)

---

## 1. Introduction

### What is Web Scraping?

Web scraping is the automated extraction of data from websites. Instead of manually copying information, you write code that:

1. **Fetches** web pages (like a browser)
2. **Parses** the HTML structure
3. **Extracts** the specific data you need
4. **Stores** it in a useful format (CSV, JSON, database)

### Why EasyScrape?

EasyScrape was designed with three principles:

1. **Simplicity**: Common tasks should be one-liners
2. **Safety**: Security features built-in, not bolted on
3. **Speed**: Async support for high-performance scraping

### Prerequisites

- Python 3.9 or higher
- Basic Python knowledge (variables, functions, loops)
- Understanding of HTML (tags, attributes, classes)

---

## 2. Installation & Setup

### Basic Installation

```bash
pip install easyscrape
```

### Verify Installation

```python
# test_install.py
import easyscrape
print(f"EasyScrape version: {easyscrape.__version__}")
print("Installation successful!")
```

Run it:

```bash
python test_install.py
# Output: EasyScrape version: 0.1.0
# Output: Installation successful!
```

### Optional Dependencies

```bash
# For JavaScript-rendered pages
pip install easyscrape[browser]

# For stealth mode (bypass bot detection)
pip install easyscrape[stealth]

# For Excel/Parquet export
pip install easyscrape[export]

# Everything
pip install easyscrape[all]
```

---

## 3. Your First Scraper

Let's scrape a real website step by step.

### Step 1: Import and Fetch

```python
from easyscrape import scrape

# Fetch a web page
result = scrape("https://example.com")
```

That's it! One line to fetch a page. The `result` object contains everything you need.

### Step 2: Check the Response

```python
# Did it work?
print(f"Status code: {result.status_code}")  # 200 = success
print(f"OK: {result.ok}")                     # True if status < 400
print(f"URL: {result.url}")                   # Final URL (after redirects)
```

### Step 3: View the Content

```python
# See the raw HTML
print(result.text[:500])  # First 500 characters

# Or just the title
print(f"Page title: {result.title()}")
```

### Step 4: Extract Data

```python
# Get specific elements
heading = result.css("h1")
print(f"Main heading: {heading}")

# Get a paragraph
paragraph = result.css("p")
print(f"First paragraph: {paragraph}")
```

### Complete Example

```python
"""
my_first_scraper.py - A complete beginner example
"""
from easyscrape import scrape

def main():
    # Fetch the page
    print("Fetching https://example.com...")
    result = scrape("https://example.com")
    
    # Check if successful
    if not result.ok:
        print(f"Error: {result.status_code}")
        return
    
    # Extract data
    title = result.title()
    heading = result.css("h1")
    paragraph = result.css("p")
    links = result.links()
    
    # Display results
    print(f"\nPage Title: {title}")
    print(f"Main Heading: {heading}")
    print(f"First Paragraph: {paragraph[:100]}...")
    print(f"Number of Links: {len(links)}")

if __name__ == "__main__":
    main()
```

---

## 4. CSS Selectors Deep Dive

CSS selectors are patterns that identify HTML elements. Master these to extract any data.

### Basic Selectors

| Selector | Meaning | Example |
|----------|---------|---------|
| `tag` | Element by tag name | `h1`, `p`, `div` |
| `.class` | Element by class | `.price`, `.title` |
| `#id` | Element by ID | `#header`, `#main` |
| `[attr]` | Element with attribute | `[href]`, `[src]` |
| `[attr=val]` | Attribute equals value | `[type="text"]` |

### Combinators

| Selector | Meaning | Example |
|----------|---------|---------|
| `A B` | B inside A (any level) | `div p` |
| `A > B` | B directly inside A | `ul > li` |
| `A + B` | B immediately after A | `h1 + p` |
| `A, B` | A or B | `h1, h2, h3` |

### Pseudo-Selectors (EasyScrape Extensions)

| Selector | Returns | Example |
|----------|---------|---------|
| `::text` | Text content | `p::text` |
| `::attr(name)` | Attribute value | `a::attr(href)` |
| `::html` | Inner HTML | `div::html` |

### Practical Examples

```python
from easyscrape import scrape

result = scrape("https://books.toscrape.com")

# Get all book titles (attribute value)
titles = result.css_list("h3 a::attr(title)")

# Get all prices (text content)
prices = result.css_list(".price_color::text")

# Get star ratings (class name contains rating)
ratings = result.css_list(".star-rating::attr(class)")

# Get book URLs (combine with base URL)
urls = result.css_list("h3 a::attr(href)")
full_urls = [result.urljoin(url) for url in urls]

# Print first 3 books
for i in range(3):
    print(f"{titles[i]}: {prices[i]}")
```

### Finding the Right Selector

1. **Open browser DevTools** (F12 or right-click > Inspect)
2. **Select the element** (Ctrl+Shift+C, then click)
3. **Look at the HTML** - note the tag, classes, and structure
4. **Build your selector** - start simple, add specificity if needed

**Pro tip**: In Chrome DevTools, right-click an element > Copy > Copy selector

---

## 5. Structured Data Extraction

Instead of extracting one field at a time, extract complete records.

### Single Item Extraction

```python
from easyscrape import scrape

result = scrape("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

# Extract multiple fields at once
book = result.extract({
    "title": "h1",
    "price": ".price_color",
    "availability": ".availability::text",
    "description": "#product_description + p",
    "upc": "tr:nth-child(1) td",
})

print(book)
# {
#     "title": "A Light in the Attic",
#     "price": "GBP 51.77",
#     "availability": "In stock (22 available)",
#     "description": "It's hard to imagine a world without...",
#     "upc": "a897fe39b1053632"
# }
```

### Multiple Items Extraction

```python
from easyscrape import scrape

result = scrape("https://books.toscrape.com")

# Extract all books on the page
books = result.extract_all(".product_pod", {
    "title": "h3 a::attr(title)",
    "price": ".price_color::text",
    "rating": ".star-rating::attr(class)",
    "url": "h3 a::attr(href)",
})

print(f"Found {len(books)} books")
for book in books[:3]:
    print(f"  - {book['title']}: {book['price']}")
```

### Nested Extraction

```python
# For complex structures, use nested schemas
result = scrape("https://example.com/products")

products = result.extract_all(".product", {
    "name": ".name",
    "price": ".price",
    "specs": {
        "_selector": ".specifications",  # Container
        "weight": ".weight",
        "dimensions": ".dimensions",
    },
    "reviews": {
        "_selector": ".reviews .review",
        "_multiple": True,
        "author": ".author",
        "rating": ".rating",
    }
})
```

---

## 6. Configuration & Customisation

### Creating a Configuration

```python
from easyscrape import scrape, Config

config = Config(
    timeout=60.0,       # Wait longer for slow sites
    max_retries=5,      # Retry more times
    rate_limit=1.0,     # Be polite: 1 request/second
)

result = scrape("https://example.com", config=config)
```

### Common Configuration Patterns

#### Development Mode

```python
dev_config = Config(
    cache_enabled=True,   # Don't re-download pages
    cache_ttl=86400,      # Cache for 24 hours
    timeout=60.0,         # Patient timeouts
)
```

#### Production Mode

```python
prod_config = Config(
    max_retries=5,
    retry_delay=2.0,
    backoff_factor=2.0,   # 2s, 4s, 8s, 16s, 32s
    rate_limit=2.0,       # 2 requests/second
    rotate_ua=True,       # Vary User-Agent
    respect_robots=True,  # Honour robots.txt
)
```

#### Stealth Mode

```python
stealth_config = Config(
    use_stealth=True,     # TLS fingerprint bypass
    rotate_ua=True,       # Random User-Agent
    headers={
        "Accept-Language": "en-GB,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)
```

### Custom Headers

```python
config = Config(
    headers={
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value",
        "Referer": "https://google.com",
    }
)
```

### Using Proxies

```python
config = Config(
    proxies=[
        "http://user:pass@proxy1.com:8080",
        "http://user:pass@proxy2.com:8080",
    ],
    proxy_mode="round-robin",  # or "random"
)
```

---

## 7. Handling Pagination

Most websites split content across multiple pages. Here's how to handle them.

### Method 1: Follow "Next" Links

```python
from easyscrape import paginate

all_items = []

for page in paginate(
    "https://books.toscrape.com",
    next_selector=".next a",
    max_pages=10,
):
    items = page.css_list("h3 a::attr(title)")
    all_items.extend(items)
    print(f"Page {page.url}: {len(items)} items")

print(f"Total: {len(all_items)} items")
```

### Method 2: Parameter-Based Pagination

```python
from easyscrape import paginate_param

for page in paginate_param(
    "https://example.com/search",
    param="page",
    start=1,
    end=10,
):
    results = page.css_list(".result")
    print(f"Page {page}: {len(results)} results")
```

### Method 3: Offset-Based Pagination

```python
from easyscrape import paginate_offset

for page in paginate_offset(
    "https://example.com/api/items",
    offset_param="offset",
    limit_param="limit",
    limit=20,
    max_offset=200,
):
    items = page.json()["items"]
    print(f"Offset {page}: {len(items)} items")
```

### Method 4: Manual Control

```python
from easyscrape import scrape

page_num = 1
all_books = []

while True:
    url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
    result = scrape(url)
    
    if not result.ok:
        break  # No more pages
    
    books = result.css_list("h3 a::attr(title)")
    if not books:
        break  # Empty page
    
    all_books.extend(books)
    print(f"Page {page_num}: {len(books)} books")
    
    page_num += 1
    if page_num > 50:  # Safety limit
        break

print(f"Total: {len(all_books)} books")
```

---

## 8. JavaScript-Rendered Pages

Many modern websites use JavaScript to load content. EasyScrape handles this with Playwright.

### Installation

```bash
pip install easyscrape[browser]
playwright install chromium
```

### Basic Usage

```python
from easyscrape import scrape, Config

config = Config(javascript=True)
result = scrape("https://quotes.toscrape.com/js/", config=config)

quotes = result.css_list(".quote .text")
print(f"Found {len(quotes)} quotes")
```

### Wait for Content

```python
config = Config(
    javascript=True,
    js_wait=3.0,              # Wait 3 seconds after load
    js_wait_for=".quote",     # Or wait for this selector
)
```

### Advanced Browser Control

```python
from easyscrape import Browser

async def scrape_dynamic_page():
    async with Browser(headless=True) as browser:
        page = await browser.goto("https://example.com")
        
        # Wait for specific element
        await page.wait_for(".content-loaded")
        
        # Click a button
        await page.click("#load-more")
        
        # Wait for new content
        await page.wait(1.0)
        
        # Extract data
        items = await page.css_list(".item")
        return items
```

---

## 9. Asynchronous Scraping

For scraping many pages quickly, use async.

### Why Async?

```
Synchronous (100 pages, 1s each):  100 seconds
Asynchronous (100 pages, 10 concurrent): ~10 seconds
```

### Basic Async

```python
import asyncio
from easyscrape import async_scrape

async def main():
    result = await async_scrape("https://example.com")
    print(result.title())

asyncio.run(main())
```

### Scraping Many Pages

```python
import asyncio
from easyscrape import async_scrape_many, Config

async def scrape_all_books():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    
    config = Config(
        concurrent_limit=10,  # Max 10 at a time
        rate_limit=5.0,       # 5 requests/second
    )
    
    results = await async_scrape_many(urls, config=config)
    
    all_books = []
    for result in results:
        if result.ok:
            books = result.css_list("h3 a::attr(title)")
            all_books.extend(books)
    
    return all_books

books = asyncio.run(scrape_all_books())
print(f"Scraped {len(books)} books")
```

### With Progress Tracking

```python
import asyncio
from easyscrape import async_scrape, Config

async def scrape_with_progress(urls):
    config = Config(rate_limit=5.0)
    results = []
    
    for i, url in enumerate(urls, 1):
        result = await async_scrape(url, config=config)
        results.append(result)
        print(f"Progress: {i}/{len(urls)} ({100*i/len(urls):.1f}%)")
    
    return results
```

---

## 10. Sessions & Authentication

### Maintaining Cookies

```python
from easyscrape import Session

with Session() as session:
    # First request sets cookies
    session.get("https://example.com")
    
    # Subsequent requests include those cookies
    result = session.get("https://example.com/dashboard")
```

### Login Flow

```python
from easyscrape import Session

with Session() as session:
    # Step 1: Get the login page (may set CSRF token)
    login_page = session.get("https://example.com/login")
    
    # Step 2: Extract CSRF token if needed
    csrf = login_page.css("input[name='csrf']::attr(value)")
    
    # Step 3: Submit login form
    response = session.post(
        "https://example.com/login",
        data={
            "username": "myuser",
            "password": "mypass",
            "csrf": csrf,
        }
    )
    
    # Step 4: Check if login worked
    if "Welcome" in response.text:
        print("Login successful!")
        
        # Step 5: Access protected content
        profile = session.get("https://example.com/profile")
        print(profile.css(".user-name"))
```

---

## 11. Error Handling

### The Exception Hierarchy

```
EasyScrapeError (catch all)
+-- NetworkError (connection issues)
|   +-- RequestTimeout
+-- HTTPError (4xx, 5xx responses)
+-- InvalidURLError
+-- RateLimitHit (429 Too Many Requests)
+-- RetryExhausted
+-- ExtractionError
```

### Basic Error Handling

```python
from easyscrape import scrape
from easyscrape.exceptions import EasyScrapeError

try:
    result = scrape("https://example.com")
except EasyScrapeError as e:
    print(f"Scraping failed: {e}")
```

### Specific Error Handling

```python
from easyscrape import scrape
from easyscrape.exceptions import (
    NetworkError,
    HTTPError,
    RateLimitHit,
    RequestTimeout,
)

def safe_scrape(url):
    try:
        return scrape(url)
    except RateLimitHit:
        print("Rate limited! Waiting...")
        time.sleep(60)
        return scrape(url)  # Retry
    except RequestTimeout:
        print("Timeout - skipping")
        return None
    except HTTPError as e:
        print(f"HTTP {e.status_code}")
        return None
    except NetworkError as e:
        print(f"Network error: {e}")
        return None
```

---

## 12. Data Export

### Export to CSV

```python
from easyscrape import scrape, to_csv

result = scrape("https://books.toscrape.com")
books = result.extract_all(".product_pod", {
    "title": "h3 a::attr(title)",
    "price": ".price_color::text",
})

to_csv(books, "books.csv")
```

### Export to JSON

```python
from easyscrape import to_json

to_json(books, "books.json", indent=2)
```

### Export to Excel

```python
from easyscrape import to_excel

to_excel(books, "books.xlsx")
```

### Export to DataFrame

```python
from easyscrape import to_dataframe

df = to_dataframe(books)
print(df.head())
print(df.describe())
```

---

## 13. Best Practices

### 1. Rate Limiting

```python
# Always limit your request rate
config = Config(rate_limit=1.0)  # 1 request/second
```

### 2. Respect robots.txt

```python
config = Config(respect_robots=True)
```

### 3. Identify Yourself

```python
config = Config(
    headers={"User-Agent": "MyBot/1.0 (contact@example.com)"}
)
```

### 4. Handle Errors

```python
# Never let one error crash your whole scrape
for url in urls:
    try:
        result = scrape(url)
        process(result)
    except EasyScrapeError:
        continue
```

### 5. Cache During Development

```python
config = Config(cache_enabled=True, cache_ttl=86400)
```

### 6. Use Async for Large Jobs

```python
# 100 pages: 10x faster with async
await async_scrape_many(urls, config=Config(concurrent_limit=10))
```

---

## 14. Real-World Project

Let's build a complete book scraper that:

1. Scrapes all 50 pages of books.toscrape.com
2. Extracts title, price, rating, and availability
3. Handles errors gracefully
4. Exports to CSV and JSON

```python
"""
complete_book_scraper.py

A production-ready scraper for books.toscrape.com
"""

import asyncio
from easyscrape import (
    async_scrape_many,
    Config,
    to_csv,
    to_json,
)
from easyscrape.exceptions import EasyScrapeError


def create_urls(num_pages: int) -> list[str]:
    """Generate URLs for all pages."""
    return [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, num_pages + 1)
    ]


def parse_rating(class_name: str) -> int:
    """Convert 'star-rating Three' to 3."""
    ratings = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    for word, num in ratings.items():
        if word in class_name:
            return num
    return 0


def parse_price(price_str: str) -> float:
    """Convert 'GBP 51.77' to 51.77."""
    return float(price_str.replace("GBP ", "").replace("$", ""))


async def scrape_books(num_pages: int = 50) -> list[dict]:
    """Scrape all books from the website."""
    
    # Configuration
    config = Config(
        timeout=30.0,
        max_retries=3,
        rate_limit=5.0,
        concurrent_limit=10,
        cache_enabled=True,
        rotate_ua=True,
    )
    
    # Generate URLs
    urls = create_urls(num_pages)
    print(f"Scraping {len(urls)} pages...")
    
    # Fetch all pages
    try:
        results = await async_scrape_many(urls, config=config)
    except EasyScrapeError as e:
        print(f"Fatal error: {e}")
        return []
    
    # Parse results
    all_books = []
    errors = []
    
    for i, result in enumerate(results, 1):
        if not result.ok:
            errors.append({"page": i, "status": result.status_code})
            continue
        
        books = result.extract_all(".product_pod", {
            "title": "h3 a::attr(title)",
            "price": ".price_color::text",
            "rating": ".star-rating::attr(class)",
            "availability": ".availability::text",
            "url": "h3 a::attr(href)",
        })
        
        # Clean and transform data
        for book in books:
            book["price_numeric"] = parse_price(book["price"])
            book["rating_numeric"] = parse_rating(book["rating"])
            book["availability"] = book["availability"].strip()
            book["url"] = f"https://books.toscrape.com/catalogue/{book['url']}"
        
        all_books.extend(books)
        
        # Progress
        if i % 10 == 0:
            print(f"Processed {i}/{len(urls)} pages...")
    
    # Report
    print(f"\nComplete!")
    print(f"  Books scraped: {len(all_books)}")
    print(f"  Errors: {len(errors)}")
    
    if errors:
        print(f"  Failed pages: {[e['page'] for e in errors]}")
    
    return all_books


def export_data(books: list[dict]) -> None:
    """Export books to multiple formats."""
    
    # CSV
    to_csv(books, "output/books.csv")
    print("Exported to output/books.csv")
    
    # JSON
    to_json(books, "output/books.json", indent=2)
    print("Exported to output/books.json")
    
    # Summary
    if books:
        prices = [b["price_numeric"] for b in books]
        print(f"\nSummary:")
        print(f"  Total books: {len(books)}")
        print(f"  Price range: GBP {min(prices):.2f} - {max(prices):.2f}")
        print(f"  Average price: GBP {sum(prices)/len(prices):.2f}")


async def main():
    """Main entry point."""
    print("=" * 60)
    print("Book Scraper - books.toscrape.com")
    print("=" * 60)
    
    books = await scrape_books(num_pages=50)
    
    if books:
        export_data(books)
    else:
        print("No books scraped!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Conclusion

You've learned:

- Basic scraping with `scrape()`
- CSS selectors for data extraction
- Structured data extraction with schemas
- Configuration and customisation
- Pagination handling
- JavaScript rendering
- Async scraping for speed
- Sessions and authentication
- Error handling
- Data export

### Next Steps

1. **Practice**: Scrape a website you're interested in
2. **Read**: Check the API Reference for all available methods
3. **Explore**: Try the Cookbook recipes for specific tasks
4. **Contribute**: Found a bug? Want a feature? Open an issue!

Happy scraping!