real projects · cohort · intermediate
Ethical Web Automation
Fetch public data respectfully, throttle politely, and document consent boundaries.
4 weeks · 26 guided hours · rolling · 11,200 THB (informational)
Tool stack
Python · httpx · parsel
Description
You will compare robots.txt interpretations, cache politely, and build scrapers that degrade gracefully when DOMs shift. Parsing exercises include Thai-language tokenization quirks, so learners stop blaming encoding ghosts.
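As a rough illustration of what reading robots.txt can look like in code, the sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the `course-lab-bot` user agent, and the example.com URLs are all made up for this example. Other parsers may interpret edge cases differently, which is the kind of comparison the course description refers to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent name for a lab scraper.
USER_AGENT = "course-lab-bot"


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched (or cached)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this path allowed for our user agent?
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/public/page")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/page")

# Crawl-delay is a non-standard directive; the stdlib parser returns None if absent.
crawl_delay = parser.crawl_delay(USER_AGENT)
```

A scraper would typically skip any URL for which `can_fetch` returns `False` and treat `crawl_delay`, when present, as a lower bound on the pause between requests.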
What is included
- robots.txt and terms-of-use reading checklist
- Polite throttling with adaptive backoff
- DOM change detection with snapshot tests
- Structured extraction with parsel and readability helpers
- Archiving outputs with provenance metadata
- Mentor review of your consent memo draft
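The throttling item above can be sketched as a small delay controller. This is one possible shape for adaptive backoff, not the course's reference implementation. The class name `AdaptiveThrottle`, the default delays, and the choice to double on HTTP 429 or 503 and halve on success are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveThrottle:
    """Hypothetical helper that tracks how long to wait between requests."""

    base_delay: float = 1.0   # assumed floor, in seconds
    max_delay: float = 60.0   # assumed cap for our own doubling
    delay: float = 1.0        # current wait between requests

    def update(self, status_code: int, retry_after: Optional[float] = None) -> float:
        """Adjust the delay after a response and return the new value."""
        if status_code in (429, 503):
            # The server is signalling overload: double our wait, up to the cap.
            backed_off = min(self.delay * 2, self.max_delay)
            # If the server sent Retry-After (already parsed to seconds),
            # honor it even when it is longer than our own cap.
            self.delay = max(backed_off, retry_after or 0.0)
        else:
            # Things look healthy: relax gradually, never below the base delay.
            self.delay = max(self.base_delay, self.delay / 2)
        return self.delay
```

In a fetch loop built on httpx, you might call `time.sleep(throttle.update(response.status_code))` after each response, so the pause grows while the site is struggling and shrinks again once it recovers.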
Outcomes
- Publish a consent memo for a sample public dataset
- Ship a scraper with monitored failure alerts
- Demonstrate a de-identification pass on stored HTML
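For a sense of what a de-identification pass might involve, here is a deliberately simple sketch that masks email addresses and phone-like numbers in stored HTML. The regular expressions, the `deidentify` function name, and the sample markup are assumptions for illustration. Real de-identification usually needs broader patterns plus manual review.

```python
import re

# Simplified email pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed phone shape: 9 or 10 digits starting with 0, with optional
# hyphens or spaces between groups (for example 081-234-5678).
PHONE_RE = re.compile(r"\b0\d{1,2}[-\s]?\d{3}[-\s]?\d{4}\b")


def deidentify(html: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    html = EMAIL_RE.sub("[EMAIL]", html)
    return PHONE_RE.sub("[PHONE]", html)


# Hypothetical stored snippet containing made-up contact details.
sample = "<p>Contact: somchai@example.com, 081-234-5678</p>"
cleaned = deidentify(sample)
print(cleaned)  # <p>Contact: [EMAIL], [PHONE]</p>
```

Running the pass before archiving, rather than after, keeps raw identifiers out of stored snapshots in the first place.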
FAQ
Will you teach bypassing paywalls?
No. That is outside Bridgemesh policy and will stop mentor review.
Do you provide legal advice?
No. We provide frameworks; your counsel signs off on production use.
What if a site blocks us?
Labs cover pivoting to official APIs or pausing collection. We do not teach circumvention tricks.
Experience notes
“Consent memo template saved awkward conversations with marketing—wish week two readings were shorter.”