Scrapy Spider Tutorial: Extracting Product Prices

1. Setting Up the Environment:
  • Install Scrapy:
    pip install scrapy
    
2. Creating a New Scrapy Project:
  • Navigate to where you want to create your project:

    cd /desired/path/
    
  • Create a new Scrapy project:

    scrapy startproject price_scraper
    
3. Creating a Spider:

Generate the spider from inside the project directory (the one that contains scrapy.cfg). Scrapy writes the new file to price_scraper/spiders/price_spider.py:

  • Navigate to the project directory:

    cd price_scraper
    
  • Create a new spider:

    scrapy genspider price_spider www.example.com
    

    Replace www.example.com with the target website.

4. Defining Spider Logic:

In price_scraper/spiders/price_spider.py, define the spider:

import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = [
        'https://www.example.com/product-page/'
    ]

    def parse(self, response):
        yield {
            'original_price': response.css('CSS_SELECTOR_FOR_ORIGINAL_PRICE::text').get(),
            'sale_price': response.css('CSS_SELECTOR_FOR_SALE_PRICE::text').get(),
        }

Replace CSS_SELECTOR_FOR_ORIGINAL_PRICE and CSS_SELECTOR_FOR_SALE_PRICE with the actual CSS selectors of the data you want to extract.
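
The selectors return raw text such as " $1,299.99 ", so comparing prices usually needs a numeric value. A minimal sketch of that conversion follows; parse_price is a hypothetical helper (not part of Scrapy), and it assumes "." is the decimal separator and "," a thousands separator, which will not hold for every site.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert scraped price text such as ' $1,299.99 ' to a Decimal, or None."""
    if text is None:
        # .get() returns None when the selector matched nothing.
        return None
    # Assumption: '.' is the decimal separator; everything that is not
    # a digit or '.' (currency symbols, commas, spaces) is discarded.
    cleaned = re.sub(r"[^\d.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Raised for text with no usable number, e.g. 'Sold out'.
        return None
```

Inside parse, the yielded values could be wrapped as parse_price(response.css(...).get()) if numeric prices are wanted instead of strings.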

5. Pipeline to Send Email:

In price_scraper/pipelines.py, add:

from .mail import send_email

class EmailPipeline:
    def process_item(self, item, spider):
        subject = "Product Price Update"
        body = f"Original Price: {item['original_price']}, Sale Price: {item['sale_price']}"
        from_email = spider.settings.get('FROM_EMAIL')
        from_password = spider.settings.get('FROM_PASSWORD')
        to_email = spider.settings.get('TO_EMAIL')
        send_email(subject, body, from_email, from_password, to_email)
        return item
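
When a selector matches nothing, .get() returns None, and the f-string above would then print "None" in the email. One possible way to make the body more readable is sketched below; build_body and the "not found" placeholder are illustrative choices, not something Scrapy provides.

```python
def build_body(item):
    """Format the email body, tolerating missing or padded price text."""

    def clean(value):
        # Strip surrounding whitespace; fall back to a placeholder for None or ''.
        return value.strip() if value else "not found"

    return (
        f"Original Price: {clean(item.get('original_price'))}, "
        f"Sale Price: {clean(item.get('sale_price'))}"
    )
```

In process_item, body = build_body(item) would then replace the inline f-string.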

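Create mail.py inside the price_scraper package directory (next to pipelines.py):

import smtplib
from email.message import EmailMessage

def send_email(subject, body, from_email, from_password, to_email):
    # EmailMessage sets the headers and encodes non-ASCII text such as currency symbols.
    message = EmailMessage()
    message['Subject'] = subject
    message['From'] = from_email
    message['To'] = to_email
    message.set_content(body)
    # The with block closes the connection even if login or sending fails.
    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()  # upgrade to an encrypted connection before logging in
        server.login(from_email, from_password)
        server.send_message(message)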

6. Activate Pipeline:

In price_scraper/settings.py, activate the pipeline:

ITEM_PIPELINES = {
    'price_scraper.pipelines.EmailPipeline': 1,
}
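
The pipeline reads FROM_EMAIL, FROM_PASSWORD and TO_EMAIL from the Scrapy settings. Passing them with -s on the command line works, but it leaves the password in the crontab and in the process list. As an alternative sketch, the same settings could be given defaults in settings.py that come from environment variables. The variable names below are made up for illustration, and the variables would also have to be defined in the environment the cron job runs in.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
# A value passed with -s on the command line still overrides these.
FROM_EMAIL = os.environ.get("PRICE_SCRAPER_FROM_EMAIL", "")
FROM_PASSWORD = os.environ.get("PRICE_SCRAPER_FROM_PASSWORD", "")
TO_EMAIL = os.environ.get("PRICE_SCRAPER_TO_EMAIL", "")
```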

7. Run the Spider with Cron Job:
  • Open your crontab:

    crontab -e
    
  • Add your cron job:

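    0 0 * * * cd /path/to/your/scrapy/project && /usr/local/bin/scrapy crawl price_spider -s FROM_EMAIL=sender@example.com -s FROM_PASSWORD="password" -s TO_EMAIL=recipient@example.com >> /path/to/logfile.log 2>&1
    

    The schedule 0 0 * * * runs the spider daily at midnight; adjust the time as needed. cron runs jobs with a minimal PATH, which is why the job calls scrapy by its absolute path (run which scrapy to find yours). Replace the placeholder paths, email addresses and password with your own values.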

8. Test & Monitor:
  • Before scheduling it, run the spider manually to confirm it completes without errors:

    cd /path/to/your/scrapy/project
    scrapy crawl price_spider -s FROM_EMAIL=sender@example.com -s FROM_PASSWORD="password" -s TO_EMAIL=recipient@example.com
    
  • After each scheduled run, check the log file named in the cron job for errors.

9. Important Notes:
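  • Handle exceptions and errors explicitly, for example SMTP login or connection failures in send_email, so that problems are logged clearly.
  • Respect the target site's robots.txt. Projects created with scrapy startproject set ROBOTSTXT_OBEY = True in settings.py; leave it enabled unless you have permission to crawl the disallowed paths.
  • If using Gmail, log in with an app password (this requires 2-Step Verification on the account), since Google no longer offers the “less secure apps” option. Alternatively, consider a dedicated email service for sending notifications.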
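
As an illustration of the first note, the email call in the pipeline could be wrapped so that a mail failure is logged with a clear message and the item is still returned. This is only a sketch: send_safely is a hypothetical helper, and send_func stands in for send_email from mail.py.

```python
import logging
import smtplib

logger = logging.getLogger(__name__)


def send_safely(send_func, *args):
    """Call send_func(*args), logging mail or network errors instead of raising.

    Returns True if the call succeeded, False otherwise.
    """
    try:
        send_func(*args)
    except (smtplib.SMTPException, OSError) as exc:
        # Covers authentication failures, refused connections, timeouts, etc.
        logger.error("Could not send price email: %s", exc)
        return False
    return True
```

In process_item, the direct call would become send_safely(send_email, subject, body, from_email, from_password, to_email), followed by return item as before.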
