"How a three-layer hybrid architecture — semantic rules, TF-IDF logistic regression, and post-process correction — classifies second-hand synthesiser listings across five European languages, reaching 96.5% accuracy."
The European second-hand synthesiser market spans five countries and as many languages. A listing for a Roland Juno-106 might appear as "Roland Juno-106 en perfecto estado" on Hispasonic, "Roland Juno 106 σε άριστη κατάσταση" on Noiz.gr, or "Roland Juno-106 très bon état" on Audiofanzine. And mixed into those listings will be a Juno-106 patch cable, a Juno-106 service manual, and a bundle that mentions a Juno-106 in the title.
The classifier's job is to separate these: instrument, accessory, manual, digital content. Getting it right matters directly for data quality — a service manual priced at 8€ included in the price pool would collapse the fair price (P50) for the Juno-106 toward zero.
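A small worked example makes the effect on the median concrete. The prices below are hypothetical and chosen only for illustration; the point is that the P50 jumps from the instrument range to the accessory range once cheap non-instrument listings make up more than half of the pool.

```python
from statistics import median

# Hypothetical prices in EUR, for illustration only.
instrument_prices = [1450, 1500, 1600, 1700, 1800]
misfiled_prices = [8, 12, 15, 20, 25, 30]  # manuals, cables, knobs

clean_p50 = median(instrument_prices)
polluted_p50 = median(instrument_prices + misfiled_prices)

print(clean_p50)     # 1600
print(polluted_p50)  # 30
```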
At the start of the project, around 550 of 9,253 listings tagged as accessories were actually complete instruments, miscategorised. The cause: the word "accessories" appeared in titles like "Elektron Digitone — FM Synthesizer with box and accessories", and a naive keyword match flagged the whole listing.
Beyond mis-tagging, the distribution of the training data was unbalanced: roughly 60% of non-eBay listings came from Hispasonic in Spanish, with Greek (Noiz.gr), French (Audiofanzine) and multilingual English/German (eBay) as smaller sources. A word-level classifier trained primarily on Spanish listings generalises poorly to Greek.
Five sources, four primary languages:
| Source | Language(s) | Role |
|---|---|---|
| Hispasonic | Spanish | ~60% of non-eBay listings |
| Noiz.gr | Greek | ~15% |
| Audiofanzine | French + English | ~10% |
| Soundsmarket | Spanish | ~10% |
| eBay Browse API | EN/DE/FR multilingual | Reference prices only |
Listing titles average 6–12 words. They include brand names, model codes, abbreviations, condition notes, and incidental context. Spelling is irregular: "Sintezatör", "Synthesizer", "synthèse", "sintetizador", "συνθεσάιζερ". At project start, 16,406 observations had is_accessory IS NULL and required classification.
Three options were evaluated:
Option 3 was chosen because it aligns confidence with explainability. Rules handle the obvious cases transparently. ML handles ambiguity. The correction layer addresses known failure modes without touching the rest.
Layer 1: semantic rules (db/04_classify_accessories.py)

Fifteen semantic categories with multilingual keyword lists (ES/EN/FR/DE/IT), using word boundary matching. A special INSTRUMENT_CONTEXT set detects phrases like "with case", "incluye funda", "bundle" that negate an accessory signal: a listing titled "Roland D-50 with original case" is an instrument, not a case.
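The interaction between word-boundary matching and the instrument-context check can be sketched as follows. This is a minimal illustration, not the contents of the real script: ACCESSORY_TERMS, the regex form of the INSTRUMENT_CONTEXT entries, and the function name rule_is_accessory are all assumptions, and the term lists are heavily shortened.

```python
import re

# Hypothetical, shortened term lists; the real script uses fifteen categories.
ACCESSORY_TERMS = ["case", "funda", "cable", "manual", "housse"]
# Assumed regex form of the context phrases; allows up to two words before "case".
INSTRUMENT_CONTEXT = [r"with (?:\w+ ){0,2}case", r"incluye funda", r"bundle"]

ACCESSORY_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ACCESSORY_TERMS)) + r")\b", re.IGNORECASE
)
CONTEXT_RE = re.compile(r"\b(?:" + "|".join(INSTRUMENT_CONTEXT) + r")\b", re.IGNORECASE)


def rule_is_accessory(title):
    """True on a rule hit, False when instrument context negates it, None when no rule fires."""
    if not ACCESSORY_RE.search(title):
        return None
    if CONTEXT_RE.search(title):
        return False
    return True


print(rule_is_accessory("Roland D-50 with original case"))    # False
print(rule_is_accessory("Flight case for Roland D-50"))       # True
print(rule_is_accessory("Korg Minilogue showcase demo unit")) # None
```

The last call shows what the word boundaries buy: "showcase" contains "case" as a substring but is not matched as a word.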
One critical bug found during data review: the word "accessories" (plural) was in GENERIC_ACCESSORY_TERMS. A title like "Elektron Digitone — FM Synthesizer with box and accessories" matched the rule and was tagged as an accessory.
# Before: caused false positives on instrument titles
GENERIC_ACCESSORY_TERMS = ["accessory", "accessories", "accessoires", "accesori"]
# After: only unambiguous singulars
GENERIC_ACCESSORY_TERMS = ["accessory", "accessoire", "accesorio"]
Layer 2: TF-IDF logistic regression (ml/classifier.py)

Feature space:

- TF-IDF on the title: analyzer='char_wb', ngram_range=(2, 4), max_features=5000
- price_eur (normalised)
- price_eur / price_p50 ratio (1.0 when no fair price exists: neutral, not zero)

char_wb was chosen over word-level tokenisation for two reasons: it handles spelling variations and abbreviations across all languages without a language-specific tokeniser, and character n-grams (2–4) capture meaningful substrings ("rack" for eurorack, "anl" for bedienungsanleitung) even in misspelled words.
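To see why character n-grams tolerate misspellings, the sketch below approximates what analyzer='char_wb' produces: n-grams taken inside each space-padded word. It is a pure-Python approximation for illustration, not scikit-learn's implementation, and the helper names char_wb_ngrams and jaccard are hypothetical.

```python
def char_wb_ngrams(text, min_n=2, max_n=4):
    """Approximate analyzer='char_wb': n-grams inside space-padded, lowercased words."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


def jaccard(a, b):
    """Share of distinct n-grams that two strings have in common."""
    sa, sb = set(char_wb_ngrams(a)), set(char_wb_ngrams(b))
    return len(sa & sb) / len(sa | sb)


print(jaccard("eurorack", "eurorak"))          # misspelling: large overlap
print(jaccard("synthesizer", "sintetizador"))  # cross-language: partial overlap
print("eurorack" == "eurorak")                 # word-level comparison: False
```

A word-level tokeniser treats "eurorack" and "eurorak" as unrelated tokens, while the n-gram sets still share most of their members.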
Model: Logistic Regression with class_weight="balanced" to handle the class imbalance between instruments and accessories.
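For reference, scikit-learn documents the "balanced" mode as weighting each class by n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python; the 800/200 label split is a hypothetical distribution, not the project's actual class ratio.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * count), as in class_weight='balanced'."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 0 = instrument, 1 = accessory.
labels = [0] * 800 + [1] * 200
print(balanced_class_weights(labels))  # {0: 0.625, 1: 2.5}
```

Errors on the minority class cost four times as much as errors on the majority class in this example, which offsets the 4:1 imbalance.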
Confidence thresholds differ by source type:
| Source type | Instrument threshold | Accessory threshold |
|---|---|---|
| eBay / Thomann (reference only) | ≥ 0.58 | ≤ 0.42 |
| Hispasonic / Noiz / Audiofanzine | ≥ 0.85 | ≤ 0.15 |
The rationale: listings shown to users require higher confidence. eBay listings only affect the P50 calculation, where a mislabelled instrument degrades data quality but does not create a visible error in the interface. Listings between the two thresholds for their source type go to is_accessory IS NULL for manual review via /admin/review.
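The routing logic can be sketched as a small function. The threshold values come from the table; the source-type keys "reference" and "displayed", the function name decide, and the assumption that the model score is the predicted probability of "instrument" are illustrative choices, not taken from the project code.

```python
# Thresholds from the table, keyed by a hypothetical source-type label.
THRESHOLDS = {
    "reference": (0.58, 0.42),  # eBay / Thomann
    "displayed": (0.85, 0.15),  # Hispasonic / Noiz / Audiofanzine
}


def decide(source_type, p_instrument):
    """Map an instrument probability to is_accessory: 0, 1, or None (manual review)."""
    instrument_min, accessory_max = THRESHOLDS[source_type]
    if p_instrument >= instrument_min:
        return 0
    if p_instrument <= accessory_max:
        return 1
    return None


print(decide("reference", 0.70))  # 0
print(decide("displayed", 0.70))  # None
print(decide("displayed", 0.10))  # 1
```

The same score of 0.70 is accepted for a reference-only listing but sent to review for a listing that users will see.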
Layer 3: post-process correction (run_synth_correction())

Acts on all is_accessory=1 records after ML classification. Flips to is_accessory=0 only when:

- the title contains an explicit instrument term (synthesizer, fm synthesizer, sintetizador, etc.)
- the title contains no accessory-context term (eurorack, floppy, pcb, memory card, für yamaha, etc.)

Conservative by design: it only corrects clear contradictions, never makes autonomous decisions on ambiguous cases.
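A minimal sketch of such a check, assuming the flip requires an instrument term in the title and the absence of every accessory-context term. The list names, the helper _any_term, and should_flip_to_instrument are hypothetical, and the lists contain only the examples quoted in the text.

```python
import re

# Hypothetical, shortened lists built from the examples quoted in the text.
INSTRUMENT_TERMS = ["synthesizer", "fm synthesizer", "sintetizador"]
BLOCKING_TERMS = ["eurorack", "floppy", "pcb", "memory card", "für yamaha"]


def _any_term(terms, title):
    """True if any term occurs in the title as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", title, re.IGNORECASE)
        for term in terms
    )


def should_flip_to_instrument(title):
    """Flip only on a clear contradiction: instrument term present, no blocking term."""
    return _any_term(INSTRUMENT_TERMS, title) and not _any_term(BLOCKING_TERMS, title)


print(should_flip_to_instrument("Elektron Digitone - FM Synthesizer with box and accessories"))  # True
print(should_flip_to_instrument("Memory card for Yamaha DX7 synthesizer"))                       # False
print(should_flip_to_instrument("Roland Juno-106 patch cable"))                                  # False
```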
To enrich the training set without collecting new data, a translation pipeline (ml/translate_titles.py) converted Spanish and Greek titles to English via the LLM API:
This doubled the English-language signal in the training data and improved generalisation.
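The augmentation step can be pictured roughly as follows. This is a sketch under assumptions: the (title, label, lang) row shape, the function name augment_with_translations, and the offline stand-in fake_translate are all hypothetical, and the real pipeline calls an LLM API where the stand-in uses a lookup table.

```python
def augment_with_translations(rows, translate):
    """Return the original (title, label, lang) rows plus an English copy of each non-English row."""
    augmented = list(rows)
    for title, label, lang in rows:
        if lang != "en":
            # The translated copy keeps the label of the original listing.
            augmented.append((translate(title, lang), label, "en"))
    return augmented


def fake_translate(title, source_lang):
    """Stand-in for the real LLM call so the sketch runs offline."""
    lookup = {"Roland Juno-106 en perfecto estado": "Roland Juno-106 in perfect condition"}
    return lookup.get(title, title)


rows = [
    ("Roland Juno-106 en perfecto estado", 0, "es"),
    ("Roland Juno-106 service manual", 1, "en"),
]
for row in augment_with_translations(rows, fake_translate):
    print(row)
```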
| Stage | Accuracy | Training samples |
|---|---|---|
| Baseline (word-level, English only) | 88.1% | ~6,800 |
| After word boundary + threshold increase | 94.1% | ~8,100 |
| After multilingual augmentation | 96.5% | 22,531 |
After applying run_synth_correction() to the full dataset: 2,321 corrections — listings reclassified from accessory to instrument. Of 9,253 listings previously tagged as accessories, 6,932 remained as genuine accessories after correction.
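Python's re module with re.IGNORECASE did not match uppercase Greek titles as expected. For str patterns the flag does apply Unicode case-insensitive matching, but Greek written in capitals usually drops the tonos (accent), so "ΑΚΟΡΝΤΕΟΝ" is not the uppercase form of "ακορντεόν" (that would be "ΑΚΟΡΝΤΕΌΝ") and the two are treated as different strings. The workaround is to add explicit unaccented uppercase variants to regex patterns for Greek terms.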
# Works for ASCII, fails silently for Greek uppercase variants
re.compile(r'\bακορντεόν\b', re.IGNORECASE)
# Required for Greek
re.compile(r'\bακορντεόν\b|\bΑΚΟΡΝΤΕΟΝ\b')
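An alternative to listing variants by hand, offered here as a sketch rather than as the project's code, is to normalise both the search term and the title before matching: decompose to NFD, drop combining accents, then case-fold. The helper names normalise and term_pattern are hypothetical.

```python
import re
import unicodedata


def normalise(text):
    """Strip combining accents, then case-fold, so accented and unaccented forms compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def term_pattern(term):
    """Compile a word-boundary pattern for the normalised form of a term."""
    return re.compile(r"\b" + re.escape(normalise(term)) + r"\b")


ACCORDION = term_pattern("ακορντεόν")

for title in ["ακορντεόν Hohner", "ΑΚΟΡΝΤΕΟΝ Hohner", "ΑΚΟΡΝΤΕΌΝ Hohner"]:
    print(bool(ACCORDION.search(normalise(title))))  # True for all three
```

The trade-off is that accents are removed in every language, so "für" becomes "fur"; term lists then have to be normalised the same way before they are compiled.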
The price_eur / price_p50 ratio feature is circular: it requires a fair price to exist, which itself depends on correct classification. Listings without a fair price use ratio=1.0 (neutral), reducing the feature's discriminative power for synthesiser models that are new to the catalogue. A better approach would be to compute this ratio only during scheduled retrain cycles, after fair prices have stabilised over several pipeline runs.
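The neutral fallback can be sketched in a few lines. The function name price_ratio and the treatment of a non-positive P50 as "no fair price" are assumptions made for this illustration.

```python
def price_ratio(price_eur, price_p50):
    """price_eur / price_p50, or the neutral value 1.0 when no usable fair price exists."""
    if price_p50 is None or price_p50 <= 0:
        return 1.0
    return price_eur / price_p50


print(price_ratio(8.0, 1600.0))   # 0.005
print(price_ratio(1500.0, None))  # 1.0
```

With a fair price in place, an 8€ manual stands out immediately; without one, the same listing looks exactly like a fairly priced instrument, which is the loss of discriminative power described above.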
The translation pipeline adds some noise: machine-translated titles do not always preserve the syntactic patterns the classifier uses. Confidence filtering on the translations (discarding low-confidence outputs) would improve precision at a small cost to recall.
The classified product catalogue (3,000+ canonical models) is published at: github.com/albertjimrod/eusynth-market-data — CC BY 4.0.