A powerful, Flask-powered web crawler built with Scrapy for efficient data extraction
Neural Crawler is a modern web scraping application that combines the power of Scrapy with a user-friendly Flask web interface. It enables users to crawl websites and extract structured data with just a single click.
The default configuration targets Books to Scrape, extracting book titles, prices, and availability information, but it can easily be customized to crawl any website.
Page Limit Control - Configurable page count limits
| Technology | Purpose |
|---|---|
| Python 3.8+ | Core programming language |
| Scrapy | Web crawling framework |
| Flask | Web application framework |
| HTML/CSS | Frontend interface |
```
Neural-Crawler-Web-Crawler/
├── app.py                      # Flask application entry point
├── crawler_runner.py           # Spider execution handler
├── output.json                 # Crawled data output
├── scrapy.cfg                  # Scrapy configuration
├── neuralcrawling/             # Scrapy project module
│   ├── __init__.py
│   ├── items.py                # Data models
│   ├── middlewares.py          # Spider/Downloader middlewares
│   ├── pipelines.py            # Data processing pipelines
│   ├── settings.py             # Scrapy settings
│   └── spiders/
│       ├── __init__.py
│       └── crawling_spider.py  # Main crawler spider
├── static/
│   └── style.css               # Application styles
├── templates/
│   └── index.html              # Web interface template
└── images/
    └── crawler-2.jpg           # Banner image
```
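`crawler_runner.py` is described above as the spider execution handler but is not shown here. One plausible minimal sketch (the function names and arguments are assumptions, not the project's actual code) shells out to the Scrapy CLI so the Flask worker is not blocked by Scrapy's Twisted reactor:

```python
# Hypothetical sketch of crawler_runner.py -- runs the spider in a
# subprocess. Names and signatures are assumptions for illustration.
import subprocess
import sys


def build_crawl_command(spider="mycrawler", output="output.json"):
    """Assemble the scrapy CLI invocation (split out so it is easy to test)."""
    # -O overwrites the output file on each run (Scrapy >= 2.1)
    return [sys.executable, "-m", "scrapy", "crawl", spider, "-O", output]


def run_spider(spider="mycrawler", output="output.json"):
    """Run the spider to completion and return True if it exited cleanly."""
    result = subprocess.run(build_crawl_command(spider, output))
    return result.returncode == 0
```

Running each crawl in a fresh process also sidesteps Twisted's "reactor already running" error, which occurs if a `CrawlerProcess` is started twice in the same interpreter.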
```bash
# Clone the repository
git clone https://github.com/yourusername/Neural-Crawler-Web-Crawler.git
cd Neural-Crawler-Web-Crawler

# Create and activate a virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install flask scrapy

# Run the application
python app.py
```
Access the interface
Open your browser and navigate to: http://127.0.0.1:5000
Start crawling
Click the "Start Crawling" button to begin data extraction.
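Under the hood, the button presumably posts to a Flask route that kicks off the spider and then renders the results. A minimal sketch, assuming hypothetical route names (`/crawl`, `/results`) and the spider name `mycrawler` from the CLI example; the project's actual `app.py` may differ:

```python
# Sketch of the Flask side. Route names and wiring are assumptions.
import json
import os
import subprocess
import sys

from flask import Flask, jsonify, redirect, url_for

app = Flask(__name__)


@app.route("/crawl", methods=["POST"])
def crawl():
    """Run the spider in a subprocess, then show the results."""
    # -O overwrites output.json on each run (Scrapy >= 2.1)
    subprocess.run([sys.executable, "-m", "scrapy", "crawl", "mycrawler",
                    "-O", "output.json"], check=False)
    return redirect(url_for("results"))


@app.route("/results")
def results():
    """Serve the crawled data, or an empty list before the first run."""
    if not os.path.exists("output.json"):
        return jsonify([])
    with open("output.json", encoding="utf-8") as f:
        return jsonify(json.load(f))


if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000 by default
```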
You can also run the Scrapy spider directly from the command line:
```bash
scrapy crawl mycrawler -o output.json
```
Modify `neuralcrawling/spiders/crawling_spider.py` to customize:
| Setting | Default | Description |
|---|---|---|
| `allowed_domains` | `books.toscrape.com` | Domains the spider can crawl |
| `start_urls` | `http://books.toscrape.com/` | Starting URLs for crawling |
| `CLOSESPIDER_PAGECOUNT` | `10` | Maximum pages to crawl |
Edit `neuralcrawling/settings.py` to adjust:

- `ROBOTSTXT_OBEY` - Respect robots.txt rules (default: `True`)
- `DOWNLOAD_DELAY` - Delay between requests (default: 1 second)
- `CONCURRENT_REQUESTS_PER_DOMAIN` - Parallel requests per domain (default: 1)

Crawled data is saved to `output.json` in the following structure:
```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In"
  },
  {
    "title": "Tipping the Velvet",
    "price": "£53.74",
    "availability": "In"
  }
]
```
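Once a run completes, the file can be consumed like any JSON document. A small sketch, assuming the structure shown above (the price parsing is an illustration, not part of the project):

```python
import json

# Inline sample in the shape shown above; with a real run you would use
# json.load(open("output.json", encoding="utf-8")) instead.
sample = '[{"title": "A Light in the Attic", "price": "£51.77", "availability": "In"}]'

books = json.loads(sample)
# Strip the currency symbol to work with prices numerically
prices = [float(book["price"].lstrip("£")) for book in books]
print(prices)  # → [51.77]
```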
To adapt the crawler to a different site:

1. Update `allowed_domains` and `start_urls` in `crawling_spider.py`
2. Adjust `rules` to match the target site's URL patterns
3. Modify `parse_item()` to extract the desired data fields
4. Update `templates/index.html` to display new fields

For example, a `parse_item()` for a generic product page might look like:

```python
def parse_item(self, response):
    yield {
        "title": response.css("h1::text").get(),
        "price": response.css(".price::text").get(),
        "description": response.css(".description p::text").get(),
        "url": response.url,
    }
```
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the project
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request