Improve Stability and Implement Data Compression
Issue Description
We are currently experiencing frequent bans and "Access Denied" responses from srf.ch, which limits our ability to crawl the website effectively. In addition, the initial data load already amounts to over 40GB, so we need a strategy for storing the data efficiently for the long term, such as LZMA compression.
Expected Behavior
- Improve crawler behavior to avoid bans and "Access Denied" responses from the website.
- After downloading the HTML, compress the data using LZMA before saving it for long-term storage (see the sketch after this list).
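A minimal sketch of the compression step, assuming the crawler hands over the downloaded HTML as a string; the function names and storage path are illustrative, not existing code:

```python
import lzma
from pathlib import Path

def save_compressed_html(html: str, target: Path) -> None:
    """Compress downloaded HTML with LZMA and write it to long-term storage."""
    compressed = lzma.compress(html.encode("utf-8"), preset=6)  # preset trades speed vs. ratio
    target.write_bytes(compressed)

def load_compressed_html(source: Path) -> str:
    """Decompress a stored page back into the original HTML string."""
    return lzma.decompress(source.read_bytes()).decode("utf-8")
```

Since `lzma.compress` produces the `.xz` container format by default, files stored this way stay readable with the standard `xz` command-line tools as well as from Python.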
Current Behavior
Currently, our bot is getting banned frequently, and no data compression is applied to the downloaded pages, leading to large storage requirements.
Definition of Done (DoD)
- Improved crawler behavior to avoid bans and "Access Denied" responses.
- Implementation of LZMA compression on the downloaded data.
- All compressed data is correctly saved for long-term storage.
- The bot handles errors and exceptions gracefully, logging them for review (see the sketch after this list).
- Code has been reviewed and approved.
- All tests pass.
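A possible shape for the graceful error handling and logging called for above, assuming a `requests`-based crawler; the retry count and backoff values are placeholders to be tuned:

```python
import logging
import time
import requests

logger = logging.getLogger("srf_crawler")

def fetch_page(url: str, retries: int = 3, backoff: float = 10.0) -> str | None:
    """Fetch a page, retrying on transient errors and logging every failure for review."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logger.warning("Attempt %d/%d for %s failed: %s", attempt, retries, url, exc)
            time.sleep(backoff * attempt)  # simple linear backoff between retries
    logger.error("Giving up on %s after %d attempts", url, retries)
    return None
```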
Additional Information
We should ensure that the crawler mimics human browsing behavior as closely as possible to avoid triggering anti-bot measures (a sketch follows below). The compression must not lose any necessary information, and the stored data should be easy to decompress when it is needed for analysis.
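As a sketch of the "mimic human behavior" point, assuming sequential fetching with `requests`; the delay range and header values are assumptions to be tuned against srf.ch's actual behavior, not settings from the existing crawler:

```python
import random
import time
import requests

# Example browser-like headers; the exact User-Agent string is an assumption.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "de-CH,de;q=0.9,en;q=0.8",
}

def polite_get(session: requests.Session, url: str) -> requests.Response:
    """Wait a randomized interval before each request to avoid a machine-like access pattern."""
    time.sleep(random.uniform(2.0, 8.0))  # assumed delay range, tune as needed
    return session.get(url, headers=HEADERS, timeout=30)

session = requests.Session()  # a persistent session keeps cookies across requests
```

Reusing one `requests.Session` keeps cookies and connection state across requests, which looks more like a single browser session than a series of unrelated hits.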