"GDPR-compliant data aggregation at the Observatory"

2026-05-12 9 min read
gdpr data-engineering compliance eu-law privacy

"A technical walkthrough of the Observatory's compliance architecture: PII audit and purge pipeline, robots.txt watcher, GDPR-by-design Telegram bot, and what EU data law actually looks like in a scraping project."

Data scraping projects accumulate personal data almost by accident. You focus on prices and model names, and later discover you have been storing seller usernames, feedback scores, and full raw JSON blobs containing whatever the API returned — including fields you never asked for and have no use for.

This is what happened during an internal audit of the Observatory's database. This case study documents the findings, the technical response, and the ongoing compliance mechanisms now built into the pipeline.

The regulatory context

Three legal frameworks apply simultaneously:

GDPR (Regulation (EU) 2016/679): Applies to the processing of personal data of individuals in the EU. Seller usernames are personal data under GDPR. "Roberto P." in a listing description is personal data. A combination of eBay username + feedbackScore + sellerAccountType can allow indirect identification of an individual seller, and GDPR obligations attach to that combination even when no single field is identifying on its own.

Database Directive (Directive 96/9/EC): Protects the investment in a database, not its individual contents. Commercial scraping of a database that represents substantial investment can infringe sui generis database rights even if individual data points are public. This applies to Hispasonic and Audiofanzine, which have years of accumulated listing data.

eBay API License Agreement (§3.1.b, §9.5, §9.10): Explicit contractual prohibitions on persisting eBay listings in a database, training ML models on eBay data, and aggregating eBay data in ways that compete with eBay. These are not just legal obligations — violating them voids API access. This is where the audit found the most concrete, actionable violations.

Design decisions driven by compliance

Before detailing the purge, it is worth noting the structural decisions made earlier:

  • No personal data in the public dataset: the CC BY 4.0 dataset on GitHub contains only aggregated statistics (P25/P50/P75 per model, model catalogue, monthly market stats). No listing URLs, no seller data, no individual observations.
  • 72-hour embargo on listings: new observations are not published via the API or RSS for 72 hours after scraping. This provides a buffer against real-time market disruption and reduces the commercial competitiveness argument.
  • Rate limiting: 1.5 seconds minimum between requests per source. Documented at /about/bot.
  • User-Agent disclosure: crawlers identify as SynthObservatoryBot/2.0 with a contact URL. Not anonymous, not deceptive.
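
To make the embargo and rate-limiting rules above concrete, here is a minimal sketch of how they might be expressed in Python. It is not the Observatory's actual code: the function and class names are illustrative, and the contact URL in the User-Agent string is a placeholder, since the post does not give the real one.

```python
import time
from datetime import datetime, timedelta, timezone

EMBARGO = timedelta(hours=72)   # embargo window stated in the post
MIN_INTERVAL_S = 1.5            # minimum gap between requests per source
# The contact URL below is a placeholder, not the project's real URL.
USER_AGENT = "SynthObservatoryBot/2.0 (+https://example.org/about/bot)"


def is_publishable(scraped_at, now=None):
    """True once an observation is at least 72 hours old."""
    now = now or datetime.now(timezone.utc)
    return now - scraped_at >= EMBARGO


class SourceThrottle:
    """Remembers the last request time per source and sleeps to keep the gap."""

    def __init__(self, min_interval=MIN_INTERVAL_S,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the throttle can be tested
        self._sleep = sleep
        self._last = {}          # source name -> time of last request

    def wait(self, source):
        """Block until min_interval has passed since the last call for source."""
        now = self._clock()
        last = self._last.get(source)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
            if delay:
                self._sleep(delay)
        self._last[source] = now + delay
        return delay
```

A scraper would call `throttle.wait("hispasonic")` before each request, and the API and RSS layers would filter observations through `is_publishable` before serving them.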

The PII audit findings

Two columns in staging_observations were the problem.

user: stored seller usernames from Hispasonic, Soundsmarket, and Audiofanzine, and from eBay a combination of username + feedbackScore + sellerAccountType. The eBay combination in particular allowed indirect identification of individual sellers, in direct violation of §9.10 of the API License Agreement.

raw_json: stored the full API or scraping response — every field returned by the source, including Soundsmarket listing descriptions like "Roberto P. — selling due to relocation", Audiofanzine contact information embedded in descriptions, and the complete eBay item record with all fields that the License Agreement prohibits storing.

Listing titles themselves were determined to be not personal data: they describe the product, not the person. They were retained.

Two fields inside raw_json had genuine analytical value and needed to be extracted before deletion:

  • condition from Noiz.gr (instrument condition: "Usado - Menta", "Nuevo")
  • description from Audiofanzine and Soundsmarket (listing descriptions used in ML training)

The PII purge pipeline (Migration 053)

The approach: extract useful fields from raw_json before purging, then null both user and raw_json across the entire table.

-- Add new columns for extracted fields
ALTER TABLE staging_observations ADD COLUMN condition   TEXT;
ALTER TABLE staging_observations ADD COLUMN description TEXT;

-- Extract condition from Noiz raw_json (1,975 rows)
UPDATE staging_observations
SET condition = json_extract(raw_json, '$.condition')
WHERE source_id = 7;

-- Extract description from Audiofanzine (16 rows)
UPDATE staging_observations
SET description = json_extract(raw_json, '$.description')
WHERE source_id = 4;

-- Extract description from Soundsmarket (108 rows)
UPDATE staging_observations
SET description = json_extract(raw_json, '$.desc')
WHERE source_id = 6;

-- Purge PII columns
UPDATE staging_observations SET user     = NULL;  -- 20,482 rows
UPDATE staging_observations SET raw_json = NULL;  -- 23,006 rows
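
Because extraction must happen before the purge, it helps to run both in a single transaction so a failure halfway cannot leave the table with raw_json nulled but nothing extracted. The sketch below shows one way to do that with Python's sqlite3 module. It assumes the database is SQLite (the post does not say so, although json_extract suggests it) and that the two new columns already exist. The function name run_purge is made up for illustration.

```python
import sqlite3

# (source_id, target column, JSON path), taken from the migration above
EXTRACTIONS = [
    (7, "condition", "$.condition"),
    (4, "description", "$.description"),
    (6, "description", "$.desc"),
]


def run_purge(conn):
    """Extract the useful fields, then null the PII columns, atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        for source_id, column, path in EXTRACTIONS:
            # Column names come from the constant above, never from input.
            conn.execute(
                f"UPDATE staging_observations "
                f"SET {column} = json_extract(raw_json, ?) "
                f"WHERE source_id = ?",
                (path, source_id),
            )
        conn.execute(
            'UPDATE staging_observations SET "user" = NULL, raw_json = NULL'
        )
    # Verification: count rows where either PII column survived.
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM staging_observations "
        'WHERE "user" IS NOT NULL OR raw_json IS NOT NULL'
    ).fetchone()
    return remaining
```

A return value of 0 is the evidence worth recording in the audit trail: it shows the purge covered every row, not just the rows the migration expected to find.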

After the purge, all six scrapers were patched to not collect user or raw_json in future runs.
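
One way to keep patched scrapers from drifting back is an allowlist applied just before insertion, so that fields the source starts returning later are dropped by default. This is a sketch of that idea, not the Observatory's actual patch. Apart from condition and description, the field names are hypothetical.

```python
# Hypothetical allowlist: only these keys may reach staging_observations.
ALLOWED_FIELDS = frozenset({
    "source_id", "title", "price", "currency",
    "condition", "description", "scraped_at",
})


def sanitize(record):
    """Keep allowlisted keys only; user, raw_json and unknown fields are dropped."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
```

An allowlist fits the failure mode described at the start of this post better than a blocklist would: the problem was fields nobody asked for, and a blocklist can only exclude fields someone already knows about.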

eBay: 6,571 rows that had been persisted in staging_observations in violation of the API License Agreement were deleted entirely. The eBay scraper was refactored to operate without persistence — data fetched via the Browse API is used for in-memory price reference only and never written to the database. The 19,409 eBay observations present at the time of the audit are excluded from all public datasets.

The robots.txt watcher

Checking robots.txt once at project setup is not sufficient — sources can change their policies. The Observatory checks continuously.

cron/robots_watcher.py runs daily at 06:00 UTC, two hours before scrapers start at 08:00. For each active source, it:

  1. Fetches /robots.txt using SynthObservatoryBot as the user-agent
  2. Checks the actual paths used by each scraper (e.g., /anuncios/teclados-sintetizadores for Hispasonic), not just the root
  3. Computes a SHA-256 hash of the full robots.txt content
  4. Compares against the hash stored in robots_txt_log

Three outcomes:

| Outcome | Action |
| --- | --- |
| Source now blocks our user-agent | UPDATE sources SET active = 0 + Telegram notification to admin |
| Content changed, still permitted | Telegram notification for manual review |
| Network error | Log and assume permitted (transient errors don't justify stopping the scraper) |
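
The decision logic behind these outcomes can be sketched with the standard library's urllib.robotparser and hashlib. This is an illustration under stated assumptions, not the contents of cron/robots_watcher.py: the function name, the outcome labels, and the extra "unchanged" case (same hash, still permitted) are mine, and fetching, database writes and Telegram notifications are left out.

```python
import hashlib
from urllib.robotparser import RobotFileParser

USER_AGENT = "SynthObservatoryBot"


def check_source(base_url, paths, robots_text, previous_hash):
    """Classify one source and return (outcome, hash_to_store).

    robots_text is None when the fetch failed with a network error.
    """
    if robots_text is None:
        # Log and assume permitted; keep the previously stored hash.
        return "network_error", previous_hash

    new_hash = hashlib.sha256(robots_text.encode("utf-8")).hexdigest()

    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # Check the paths the scraper really uses, not just the site root.
    if not all(parser.can_fetch(USER_AGENT, base_url + p) for p in paths):
        return "blocked", new_hash    # deactivate the source, notify admin
    if new_hash != previous_hash:
        return "changed", new_hash    # still permitted, flag for manual review
    return "unchanged", new_hash
```

Checking permission before comparing hashes means a newly added Disallow rule is always reported as "blocked" rather than as a generic content change.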

The scraper checks active = 1 before executing and exits cleanly if the source has been deactivated. Maximum detection gap: up to 24 hours (one watcher cycle), acceptable given that sources are scraped twice a week.

The takedown mechanism

Two processes, documented publicly at /about/takedown:

For operators (marketplaces requesting scraping to stop): robots.txt is the primary mechanism. The watcher detects disallow rules within 24 hours and deactivates the scraper automatically. Email contact is the fallback, with a response SLA of 5 business days.

For individuals (GDPR Art. 17 right to erasure): since listing titles are not personal data and seller usernames and descriptions have already been purged, this primarily applies to the Telegram bot. The /forgetme command cascades deletion across all tables immediately, with no confirmation prompt. This is intended to satisfy the Art. 12(2) duty to facilitate the exercise of data subject rights and the Art. 17 requirement to erase without undue delay.

The Telegram bot: GDPR by design

The Observatory bot was rebuilt from scratch with privacy as an explicit constraint, not a retrofit.

Minimal data collection: only chat_id (a numeric Telegram identifier) is stored per user. No name, no username, no phone number.

Pseudonymisation in logs: bot_operations_log stores SHA-256(chat_id) rather than the chat_id itself. Logs are auditable without linking back to an individual.
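
A minimal sketch of the hashing step follows. The post only states SHA-256(chat_id), so encoding the chat_id as a decimal string is an assumption. The keyed variant is a common hardening and is not something the Observatory is stated to use.

```python
import hashlib
import hmac


def pseudonymise(chat_id):
    """SHA-256 over the decimal string form of chat_id (encoding is an assumption)."""
    return hashlib.sha256(str(chat_id).encode("utf-8")).hexdigest()


def pseudonymise_keyed(chat_id, key):
    """HMAC-SHA-256 variant: the mapping cannot be recomputed without the key."""
    return hmac.new(key, str(chat_id).encode("utf-8"), hashlib.sha256).hexdigest()
```

Telegram chat IDs are numeric and enumerable, so a plain hash can in principle be reversed by hashing candidate IDs. That is why hashed identifiers count as pseudonymised rather than anonymised data, and why a secret key stored apart from the logs strengthens the separation.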

Automated triple purge:

| Data | Retention |
| --- | --- |
| Price alerts | 90 days |
| Inactive users | 6 months |
| Pseudonymised log entries | 12 months |
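
The three retention windows could be enforced by a single scheduled job along these lines. This is a sketch assuming SQLite and ISO-8601 timestamps. Of the names used, only bot_operations_log appears in this post; the other table names and all column names are hypothetical.

```python
import sqlite3

# (table, timestamp column, SQLite date modifier).
# Only bot_operations_log is a real name from the post; the rest is assumed.
RETENTION = [
    ("price_alerts", "created_at", "-90 days"),
    ("bot_users", "last_active_at", "-6 months"),
    ("bot_operations_log", "logged_at", "-12 months"),
]


def purge_expired(conn, now="now"):
    """Delete rows older than their retention window; return counts per table."""
    deleted = {}
    with conn:  # one transaction for all three purges
        for table, column, modifier in RETENTION:
            cur = conn.execute(
                f"DELETE FROM {table} "
                f"WHERE datetime({column}) < datetime(?, ?)",
                (now, modifier),
            )
            deleted[table] = cur.rowcount
    return deleted
```

Passing `now` explicitly makes the job reproducible in tests, and the returned counts can be written to an audit table as a record that the purge ran.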

Art. 13 GDPR compliance: the full legal notice (data collected, legal basis, retention periods, rights, contact) is delivered in the /start message, before any personal data is collected.

Self-service erasure: /forgetme deletes all data immediately, cascades across all tables, and replies with a confirmation message. No waiting period, no "are you sure?" prompt.
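
One way to get cascading deletion is to let the database enforce it through foreign keys declared with ON DELETE CASCADE, so the handler deletes a single row and cannot forget a dependent table. The sketch below assumes SQLite and a hypothetical bot_users table keyed by chat_id; the Observatory's actual schema and handler are not shown in this post.

```python
import sqlite3


def forget_user(conn, chat_id):
    """Delete one user; dependent rows go via ON DELETE CASCADE.

    Returns True if a user row existed and was removed.
    """
    # SQLite ignores foreign keys unless this is enabled per connection.
    # It must run outside an open transaction to take effect.
    conn.execute("PRAGMA foreign_keys = ON")
    with conn:
        cur = conn.execute(
            "DELETE FROM bot_users WHERE chat_id = ?", (chat_id,)
        )
    return cur.rowcount > 0
```

Each dependent table would declare something like `chat_id INTEGER NOT NULL REFERENCES bot_users(chat_id) ON DELETE CASCADE`. The boolean return value lets the bot send the same confirmation whether or not the user had any data stored.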

Auditability

What exists to demonstrate compliance:

  • robots_txt_log table: complete history of checks with timestamps, content hashes, and outcomes
  • subscriber_events table: audit log of all user lifecycle events (register, tier_change, unsubscribe, delete)
  • bot_operations_log: SHA-256 pseudonymised log of all bot commands
  • Five public legal pages at /about/legal/ documenting data processing in plain language
  • CITATION.cff and CC BY 4.0 license in the public dataset repository at GitHub

What is missing: a formal Data Protection Impact Assessment (DPIA) under Art. 35 GDPR, and review by an EU-qualified data law practitioner. These remain on the roadmap before any commercial use of the platform.

What I learned about EU data law as a developer

The hardest part was not the technical implementation — it was identifying what counts as personal data. Seller usernames felt harmless until reviewing the eBay API License and recognising that username + feedbackScore + sellerAccountType is a combination that narrows the field to one identifiable account. The general principle: any combination of attributes that singles out an individual is personal data under GDPR, regardless of whether any single attribute is identifying on its own.
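
The singling-out principle can be tested mechanically: count how many rows share each combination of attributes and flag the combinations that occur exactly once. The sketch below uses invented toy rows, not Observatory data, and the function name is illustrative.

```python
from collections import Counter


def singled_out(rows, quasi_identifiers):
    """Return the attribute combinations that match exactly one row."""
    counts = Counter(
        tuple(row[attr] for attr in quasi_identifiers) for row in rows
    )
    return [combo for combo, n in counts.items() if n == 1]


# Toy data: no single attribute is unique, but every pair is.
rows = [
    {"feedbackScore": 120, "sellerAccountType": "INDIVIDUAL"},
    {"feedbackScore": 120, "sellerAccountType": "BUSINESS"},
    {"feedbackScore": 35, "sellerAccountType": "INDIVIDUAL"},
    {"feedbackScore": 35, "sellerAccountType": "BUSINESS"},
]
unique_by_score = singled_out(rows, ["feedbackScore"])
unique_by_pair = singled_out(rows, ["feedbackScore", "sellerAccountType"])
```

Here `unique_by_score` is empty while `unique_by_pair` contains all four combinations. Run against a real table before deciding which columns to keep, the same check turns "does this feel identifying?" into a number.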

The robots.txt watcher was the most satisfying piece to build. It turns a one-time manual check into a continuous, auditable process — the kind of infrastructure that signals to a potential partner source that compliance is operational, not aspirational.