"Building a multilingual classifier for noisy synth listings"

2026-05-12 9 min read

ml nlp multilingual classification python

"How a three-layer hybrid architecture — semantic rules, TF-IDF logistic regression, and post-process correction — classifies second-hand synthesiser listings across five European languages, reaching 96.5% accuracy."

The European second-hand synthesiser market spans five countries and as many languages. A listing for a Roland Juno-106 might appear as "Roland Juno-106 en perfecto estado" on Hispasonic, "Roland Juno 106 σε άριστη κατάσταση" on Noiz.gr, or "Roland Juno-106 très bon état" on Audiofanzine. And mixed into those listings will be a Juno-106 patch cable, a Juno-106 service manual, and a bundle that mentions a Juno-106 in the title.

The classifier's job is to separate these: instrument, accessory, manual, digital content. Getting it right matters directly for data quality — a service manual priced at 8€ included in the price pool would collapse the fair price (P50) for the Juno-106 toward zero.
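
To make the effect on P50 concrete, here is a small sketch with invented prices (none of them come from the dataset), assuming the fair price is simply the median of the price pool:

```python
from statistics import median

# Hypothetical asking prices in EUR for complete Juno-106 units.
instrument_prices = [1450, 1500, 1600, 1650, 1700]

# Hypothetical non-instrument listings that mention "Juno-106" in the title
# (manual, patch cable, replacement parts, and so on).
misclassified_prices = [8, 12, 15, 20, 25, 30]

clean_p50 = median(instrument_prices)
polluted_p50 = median(instrument_prices + misclassified_prices)

print(f"P50, instruments only:       {clean_p50} EUR")
print(f"P50, with accessories mixed: {polluted_p50} EUR")
```

Because a median tolerates an isolated outlier, the size of the shift depends on how many non-instrument listings leak into the pool.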

The problem

At the start of the project, around 550 of 9,253 listings tagged as accessories were actually complete instruments, miscategorised. The cause: the word "accessories" appeared in titles like "Elektron Digitone — FM Synthesizer with box and accessories", and a naive keyword match flagged the whole listing.

Beyond mis-tagging, the distribution of the training data was unbalanced: roughly 60% of non-eBay listings came from Hispasonic in Spanish, with Greek (Noiz.gr), French (Audiofanzine) and multilingual English/German (eBay) as smaller sources. A word-level classifier trained primarily on Spanish listings generalises poorly to Greek.

The data

Five sources, four primary languages:

Source	Language(s)	Share of non-eBay listings / role
Hispasonic	Spanish	~60% of non-eBay listings
Noiz.gr	Greek	~15%
Audiofanzine	French + English	~10%
Soundsmarket	Spanish	~10%
eBay Browse API	EN/DE/FR multilingual	Reference prices only

Listing titles average 6–12 words. They include brand names, model codes, abbreviations, condition notes, and incidental context. Spelling is irregular: "Sintezatör", "Synthesizer", "synthèse", "sintetizador", "συνθεσάιζερ". At project start, 16,406 observations had is_accessory set to NULL (unclassified) and required classification.

Approach

Three options were evaluated:

1. Rules only: keyword lists per category in multiple languages. Fast, interpretable, fragile at the edges.
2. ML only: train a text classifier end-to-end. Good generalisation, black-box errors.
3. Three-layer hybrid: rule-based first pass → ML for the grey zone → post-process correction for systematic errors.

Option 3 was chosen because it aligns confidence with explainability. Rules handle the obvious cases transparently. ML handles ambiguity. The correction layer addresses known failure modes without touching the rest.

Implementation

Layer 1 — Semantic rules (`db/04_classify_accessories.py`)

Fifteen semantic categories with multilingual keyword lists (ES/EN/FR/DE/IT), using word boundary matching. A special INSTRUMENT_CONTEXT set detects phrases like "with case", "incluye funda", "bundle" that negate an accessory signal: a listing titled "Roland D-50 with original case" is an instrument, not a case.

One critical bug found during data review: the word "accessories" (plural) was in GENERIC_ACCESSORY_TERMS. A title like "Elektron Digitone — FM Synthesizer with box and accessories" matched the rule and was tagged as an accessory.

# Before: caused false positives on instrument titles
GENERIC_ACCESSORY_TERMS = ["accessory", "accessories", "accessoires", "accesori"]

# After: only unambiguous singulars
GENERIC_ACCESSORY_TERMS = ["accessory", "accessoire", "accesorio"]
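
A minimal sketch of how the Layer 1 pieces could fit together. The term lists are small illustrative subsets, CASE_TERMS and the function names are hypothetical, and the real script may order its checks differently:

```python
import re

# Illustrative subsets only; the real lists cover fifteen categories.
GENERIC_ACCESSORY_TERMS = ["accessory", "accessoire", "accesorio"]
CASE_TERMS = ["case", "funda", "housse"]  # hypothetical category
INSTRUMENT_CONTEXT = ["with case", "with original case", "incluye funda", "bundle"]


def _compile(terms):
    # \b keeps "case" from matching inside "showcase".
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


ACCESSORY_RE = _compile(GENERIC_ACCESSORY_TERMS + CASE_TERMS)
CONTEXT_RE = _compile(INSTRUMENT_CONTEXT)


def rule_says_accessory(title: str) -> bool:
    """True only when an accessory term matches and no instrument context negates it."""
    if CONTEXT_RE.search(title):
        return False
    return ACCESSORY_RE.search(title) is not None


print(rule_says_accessory("Roland D-50 with original case"))
print(rule_says_accessory("Funda para Roland D-50"))
print(rule_says_accessory("Elektron Digitone - FM Synthesizer with box and accessories"))
```

With the plural removed from the generic list, the Digitone title no longer matches any accessory term, so the rule layer leaves it alone.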

Layer 2 — ML classifier (`ml/classifier.py`)

Feature space:

TF-IDF on listing title with analyzer='char_wb', ngram_range=(2, 4), max_features=5000
price_eur (normalised)
price_eur / price_p50 ratio (1.0 when no fair price exists — neutral, not zero)

char_wb was chosen over word-level tokenisation for two reasons: it handles spelling variations and abbreviations across all languages without a language-specific tokeniser, and character n-grams (2–4) capture meaningful substrings ("rac" for eurorack, "bed" for Bedienungsanleitung) that survive even when the rest of the word is misspelled.
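
The following sketch approximates what a char_wb analyser extracts, in plain Python, to show why two spellings with no word in common still share features. The helper char_wb_ngrams is hypothetical and only mimics the idea (pad each word with spaces, slide a window of 2 to 4 characters); it is not scikit-learn's implementation.

```python
def char_wb_ngrams(text: str, min_n: int = 2, max_n: int = 4) -> set:
    """Collect character n-grams inside word boundaries, one padded word at a time."""
    grams = set()
    for word in text.lower().split():
        padded = f" {word} "  # the padding marks word start and end
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams


english = char_wb_ngrams("synthesizer")
spanish = char_wb_ngrams("sintetizador")

# Word-level features overlap in nothing; character n-grams share several pieces.
shared = english & spanish
print(sorted(shared))

# A misspelling keeps most of its n-grams in common with the correct form.
print(sorted(char_wb_ngrams("eurorack") & char_wb_ngrams("eurorak")))
```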

Model: Logistic Regression with class_weight="balanced" to handle the class imbalance between instruments and accessories.
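
A rough sketch of this feature space and model in scikit-learn, on toy data invented for illustration. The log scaling of price_eur is an assumption (the normalisation method is not specified), as are the helper name build_features and the label encoding (1 = accessory).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration. Label 1 = accessory, 0 = instrument.
titles = [
    "Roland Juno-106 en perfecto estado",
    "Korg Minilogue sintetizador analogico",
    "Yamaha DX7 FM synthesizer",
    "Juno-106 service manual",
    "Cable MIDI 3m",
    "Funda para teclado 61 teclas",
]
labels = np.array([0, 0, 0, 1, 1, 1])
price_eur = np.array([1500.0, 420.0, 600.0, 8.0, 5.0, 25.0])
price_p50 = np.array([1550.0, 400.0, np.nan, 1550.0, np.nan, np.nan])  # NaN = no fair price

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=5000)


def build_features(titles, price_eur, price_p50, fit=False):
    text = vectorizer.fit_transform(titles) if fit else vectorizer.transform(titles)
    # Assumption: log scaling stands in for the unspecified price normalisation.
    price_norm = np.log1p(price_eur)
    # Neutral ratio of 1.0 when no fair price exists.
    ratio = np.where(np.isnan(price_p50), 1.0, price_eur / price_p50)
    numeric = csr_matrix(np.column_stack([price_norm, ratio]))
    return hstack([text, numeric]).tocsr()


X = build_features(titles, price_eur, price_p50, fit=True)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, labels)

# Columns of predict_proba follow model.classes_, here [0, 1].
new_X = build_features(["Roland Juno-106 manual"], np.array([10.0]), np.array([1550.0]))
proba = model.predict_proba(new_X)
print(proba)
```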

Confidence thresholds differ by source type:

Source type	Instrument threshold	Accessory threshold
eBay / Thomann (reference only)	≥ 0.58	≤ 0.42
Hispasonic / Noiz / Audiofanzine	≥ 0.85	≤ 0.15

The rationale: listings shown to users require higher confidence. eBay listings only feed the P50 calculation, where a mislabelled instrument degrades data quality but does not create a visible error in the interface. A listing whose score falls between the two thresholds for its source type keeps is_accessory = NULL and is queued for manual review via /admin/review.
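
The routing logic can be sketched as below, assuming both thresholds apply to the model's predicted probability that a listing is an instrument. The function name route and the source-type keys are hypothetical; the threshold values are the ones in the table.

```python
from typing import Optional

# (instrument threshold, accessory threshold), both applied to P(instrument).
THRESHOLDS = {
    "reference": (0.58, 0.42),    # eBay / Thomann
    "user_facing": (0.85, 0.15),  # Hispasonic / Noiz / Audiofanzine
}


def route(p_instrument: float, source_type: str) -> Optional[int]:
    """Return is_accessory as 0 or 1, or None when the listing needs manual review."""
    instrument_min, accessory_max = THRESHOLDS[source_type]
    if p_instrument >= instrument_min:
        return 0
    if p_instrument <= accessory_max:
        return 1
    return None


# The same score is accepted for a reference source but held back for a user-facing one.
print(route(0.60, "reference"))
print(route(0.60, "user_facing"))
```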

Layer 3 — Post-process correction (`run_synth_correction()`)

Acts on all is_accessory=1 records after ML classification. Flips to is_accessory=0 only when:

Title contains a strong instrument indicator (synthesizer, fm synthesizer, sintetizador, etc.)
AND title does not contain a definitive accessory indicator (eurorack, floppy, pcb, memory card, für yamaha, etc.)

Conservative by design: it only corrects clear contradictions, never makes autonomous decisions on ambiguous cases.
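
A minimal sketch of the flip condition, using only the example indicators named above. The real run_synth_correction() works on database records; the per-title function corrected_label here is a hypothetical simplification.

```python
import re

# Illustrative subsets of the two indicator lists.
STRONG_INSTRUMENT = ["synthesizer", "fm synthesizer", "sintetizador"]
DEFINITIVE_ACCESSORY = ["eurorack", "floppy", "pcb", "memory card", "für yamaha"]


def _any_term(terms):
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


INSTRUMENT_RE = _any_term(STRONG_INSTRUMENT)
ACCESSORY_RE = _any_term(DEFINITIVE_ACCESSORY)


def corrected_label(title: str, is_accessory: int) -> int:
    """Flip 1 -> 0 only on a clear contradiction; otherwise return the label unchanged."""
    if is_accessory == 1 and INSTRUMENT_RE.search(title) and not ACCESSORY_RE.search(title):
        return 0
    return is_accessory


print(corrected_label("Elektron Digitone - FM Synthesizer with box and accessories", 1))
print(corrected_label("Eurorack synthesizer voice module", 1))
```

The function never turns an instrument into an accessory, which keeps the correction one-directional as described.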

Multilingual training augmentation

To enrich the training set without collecting new data, a translation pipeline (ml/translate_titles.py) converted Spanish and Greek titles to English via an LLM API:

6,825 titles processed
1,378 successfully translated
0 translation errors
Each translated title was added as an independent training sample with the same label as the original

This doubled the English-language signal in the training data and improved generalisation.
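
The augmentation step itself is simple to sketch. The function below is hypothetical and assumes the translations are already available as a mapping from original title to English title; the LLM call is left out.

```python
def augment_with_translations(samples, translations):
    """samples: list of (title, label); translations: dict of original title -> English title."""
    augmented = list(samples)
    for title, label in samples:
        english = translations.get(title)
        if english:
            # The translated title becomes an independent sample with the original label.
            augmented.append((english, label))
    return augmented


samples = [
    ("Sintetizador Korg en buen estado", 0),
    ("Funda para teclado", 1),
    ("Roland Juno-106", 0),  # no translation available, kept as-is
]
translations = {
    "Sintetizador Korg en buen estado": "Korg synthesizer in good condition",
    "Funda para teclado": "Keyboard case",
}
training_set = augment_with_translations(samples, translations)
print(len(training_set))
```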

Results

Stage	Accuracy	Training samples
Baseline (word-level, English only)	88.1%	~6,800
After word boundary + threshold increase	94.1%	~8,100
After multilingual augmentation	96.5%	22,531

After applying run_synth_correction() to the full dataset: 2,321 corrections — listings reclassified from accessory to instrument. Of 9,253 listings previously tagged as accessories, 6,932 remained as genuine accessories after correction.

Known failure: Greek case folding

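Python's re module with re.IGNORECASE does not cover every way Greek is written in capitals. For str patterns the flag does match Greek letters case-insensitively, but it treats accented and unaccented letters as different characters, and Greek written in all capitals conventionally drops the tonos. The pattern "ακορντεόν" (with ό) therefore never matches "ΑΚΟΡΝΤΕΟΝ" (with a plain Ο). The workaround is to add explicit unaccented uppercase variants to regex patterns for Greek terms.

import re

# Matches ακορντεόν and Ακορντεόν, but silently misses the unaccented all-caps ΑΚΟΡΝΤΕΟΝ
re.compile(r'\bακορντεόν\b', re.IGNORECASE)

# Required for Greek: list the unaccented uppercase spelling explicitly
re.compile(r'\bακορντεόν\b|\bΑΚΟΡΝΤΕΟΝ\b', re.IGNORECASE)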
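
One possible alternative to maintaining explicit variants, sketched under the assumption that accent-insensitive matching is acceptable for these keywords: normalise both the keyword and the title before matching by decomposing the text (NFD), dropping combining marks, and applying str.casefold(). The helper names are hypothetical.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Strip accents and case so that ακορντεόν, Ακορντεόν and ΑΚΟΡΝΤΕΟΝ compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def keyword_pattern(keyword: str) -> re.Pattern:
    # The keyword is folded once at compile time; titles are folded at match time.
    return re.compile(r"\b" + re.escape(fold(keyword)) + r"\b")


ACCORDION_RE = keyword_pattern("ακορντεόν")


def mentions_accordion(title: str) -> bool:
    return ACCORDION_RE.search(fold(title)) is not None


print(mentions_accordion("ΑΚΟΡΝΤΕΟΝ Hohner"))
print(mentions_accordion("ακορντεόν Hohner"))
```

The trade-off is that words differing only by an accent become indistinguishable, so this fits keyword lists better than free-text comparison.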

What I'd do differently

The price_eur / price_p50 ratio feature is circular: it requires a fair price to exist, which itself depends on correct classification. Listings without a fair price use ratio=1.0 (neutral), which reduces the feature's discriminative power for synth models that are new to the catalogue. A better approach would be to compute this ratio only during scheduled retrain cycles, after fair prices have stabilised over several pipeline runs.

The translation pipeline adds some noise: machine-translated titles do not always preserve the syntactic patterns the classifier uses. Confidence filtering on the translations (discarding low-confidence outputs) would improve precision at a small cost to recall.

Open dataset

The classified product catalogue (3,000+ canonical models) is published at: github.com/albertjimrod/eusynth-market-data — CC BY 4.0.
