πŸ•·οΈ Neural Crawler - Web Crawler

Neural Crawler Banner

A Scrapy-powered web crawler with a user-friendly Flask interface for efficient, one-click data extraction



πŸ“– Overview

Neural Crawler is a modern web scraping application that combines the power of Scrapy with a user-friendly Flask web interface. It enables users to crawl websites and extract structured data with just a single click.

The default configuration targets Books to Scrape, extracting book titles, prices, and availability informationβ€”but can be easily customized to crawl any website.

✨ Features

  - One-click crawling from a simple Flask web interface
  - Structured data extraction (book titles, prices, availability) via Scrapy
  - Results saved as JSON to output.json
  - Easily customizable spider rules, settings, and output fields

πŸ› οΈ Tech Stack

Technology Purpose
Python 3.8+ Core programming language
Scrapy Web crawling framework
Flask Web application framework
HTML/CSS Frontend interface

πŸ“ Project Structure

Neural-Crawler-Web-Crawler/
β”œβ”€β”€ app.py                  # Flask application entry point
β”œβ”€β”€ crawler_runner.py       # Spider execution handler
β”œβ”€β”€ output.json             # Crawled data output
β”œβ”€β”€ scrapy.cfg              # Scrapy configuration
β”œβ”€β”€ neuralcrawling/         # Scrapy project module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ items.py            # Data models
β”‚   β”œβ”€β”€ middlewares.py      # Spider/Downloader middlewares
β”‚   β”œβ”€β”€ pipelines.py        # Data processing pipelines
β”‚   β”œβ”€β”€ settings.py         # Scrapy settings
β”‚   └── spiders/
β”‚       β”œβ”€β”€ __init__.py
β”‚       └── crawling_spider.py  # Main crawler spider
β”œβ”€β”€ static/
β”‚   └── style.css           # Application styles
β”œβ”€β”€ templates/
β”‚   └── index.html          # Web interface template
└── images/
    └── crawler-2.jpg       # Banner image
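
The spider launcher can be as small as a subprocess wrapper. A minimal sketch of what crawler_runner.py might look like (the function names build_crawl_command and run_spider are hypothetical; the real module may instead use Scrapy's Python API):

```python
import subprocess

def build_crawl_command(spider="mycrawler", output="output.json"):
    """Assemble the Scrapy CLI invocation for the given spider."""
    # Mirrors the manual command: scrapy crawl mycrawler -o output.json
    return ["scrapy", "crawl", spider, "-o", output]

def run_spider(spider="mycrawler", output="output.json"):
    """Run the spider as a child process and return its exit code."""
    result = subprocess.run(build_crawl_command(spider, output))
    return result.returncode
```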

πŸš€ Installation

Prerequisites

  - Python 3.8 or higher
  - pip (bundled with recent Python versions)

Setup Steps

  1. Clone the repository
    git clone https://github.com/yourusername/Neural-Crawler-Web-Crawler.git
    cd Neural-Crawler-Web-Crawler
    
  2. Create a virtual environment (recommended)
    python -m venv venv
       
    # Windows
    venv\Scripts\activate
       
    # macOS/Linux
    source venv/bin/activate
    
  3. Install dependencies
    pip install flask scrapy
    

πŸ’» Usage

Running the Web Application

  1. Start the Flask server
    python app.py
    
  2. Access the interface

    Open your browser and navigate to: http://127.0.0.1:5000

  3. Start crawling

    Click the β€œStart Crawling” button to begin data extraction.

Running the Spider Directly

You can also run the Scrapy spider directly from the command line:

scrapy crawl mycrawler -o output.json

βš™οΈ Configuration

Spider Settings

Modify neuralcrawling/spiders/crawling_spider.py to customize:

Setting Default Description
allowed_domains books.toscrape.com Domains the spider can crawl
start_urls http://books.toscrape.com/ Starting URLs for crawling
CLOSESPIDER_PAGECOUNT 10 Maximum pages to crawl
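
For example, raising the page cap could look like this inside the spider class (a sketch showing only the relevant class attributes; the domain and URL values mirror the defaults in the table above):

```python
# neuralcrawling/spiders/crawling_spider.py (relevant attributes only)
allowed_domains = ["books.toscrape.com"]
start_urls = ["http://books.toscrape.com/"]

custom_settings = {
    "CLOSESPIDER_PAGECOUNT": 25,  # raise the page cap from the default 10
}
```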

Global Settings

Edit neuralcrawling/settings.py to adjust global crawler behavior, for example:

  - ROBOTSTXT_OBEY: whether to respect robots.txt rules
  - DOWNLOAD_DELAY: delay between consecutive requests
  - CONCURRENT_REQUESTS: number of requests handled in parallel
  - USER_AGENT: the user-agent string sent with each request
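
A sketch of the kind of values these settings take (the numbers here are illustrative, not the project's actual configuration):

```python
# neuralcrawling/settings.py (illustrative values)
ROBOTSTXT_OBEY = True     # respect robots.txt rules
DOWNLOAD_DELAY = 0.5      # seconds to wait between requests
CONCURRENT_REQUESTS = 16  # parallel requests across all domains
```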

πŸ“€ Output Format

Crawled data is saved to output.json in the following structure:

[
  {
    "title": "A Light in the Attic",
    "price": "Β£51.77",
    "availability": "In"
  },
  {
    "title": "Tipping the Velvet",
    "price": "Β£53.74",
    "availability": "In"
  }
]
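
Because the output is plain JSON, downstream code can read it with the standard library alone. A small sketch that parses records shaped like the ones above (the data is inlined here so the snippet is self-contained; in practice you would open output.json):

```python
import json

# Sample records in the same shape as output.json
sample = '''[
  {"title": "A Light in the Attic", "price": "Β£51.77", "availability": "In"},
  {"title": "Tipping the Velvet", "price": "Β£53.74", "availability": "In"}
]'''

books = json.loads(sample)
titles = [book["title"] for book in books]
```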

🎨 Customization

Crawling a Different Website

  1. Update allowed_domains and start_urls in crawling_spider.py
  2. Modify the rules to match the target site’s URL patterns
  3. Update parse_item() to extract the desired data fields
  4. Adjust the HTML template in templates/index.html to display new fields

Example: Custom Data Extraction

def parse_item(self, response):
    # CSS selectors are examples; adapt them to the target site's markup
    yield {
        "title": response.css("h1::text").get(),
        "price": response.css(".price::text").get(),
        "description": response.css(".description p::text").get(),
        "url": response.url,
    }
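
Raw fields often need cleanup before use. A hedged sketch of an item pipeline (the class name PriceToFloatPipeline is hypothetical) that could live in neuralcrawling/pipelines.py to turn price strings such as "Β£51.77" into floats:

```python
import re

class PriceToFloatPipeline:
    """Convert a raw price string like 'Β£51.77' into a float."""

    def process_item(self, item, spider):
        price = item.get("price")
        if isinstance(price, str):
            # Keep only the numeric part, dropping the currency symbol
            match = re.search(r"[\d.]+", price)
            if match:
                item["price"] = float(match.group())
        return item
```

To activate it, the pipeline would also need an entry in ITEM_PIPELINES in settings.py.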

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Acknowledgments