
Building an Exercise Database: Multi-Source Data Normalization


Anyone who has tried to build a fitness application quickly discovers a frustrating reality: exercise data is scattered across dozens of sources, each with its own naming conventions, field structures, and quality standards. One API calls it “Dumbbell Bench Press,” another lists “Flat Bench Dumbbell Press,” and a third simply says “DB Bench.” They all mean the same exercise, but your database sees three distinct entries.

This article walks through building a normalized exercise database by aggregating data from the Wger API, ExerciseDB, and Kaggle datasets. The challenge extends beyond simple data merging - it requires entity resolution for exercises and equipment, fuzzy string matching for inconsistent naming, and a carefully designed PostgreSQL schema that maintains referential integrity while accommodating the inherent messiness of real-world fitness data.

By the end, you will have a working data pipeline architecture and practical code examples for tackling multi-source data normalization in any domain.

The data quality challenge

Before diving into solutions, understanding the specific problems helps frame the engineering decisions. Here is what raw data from three sources typically looks like:

Source        Exercise Name               Equipment           Muscle Group
Wger API      "Bench Press"               "Barbell"           "Chest"
ExerciseDB    "barbell bench press"       "barbell"           "pectorals"
Kaggle        "BENCH PRESS (BARBELL)"     "Barbell, Bench"    "Pectoralis Major"

The same exercise appears with different casing, naming conventions, and muscle terminology. Equipment fields might be a single string, a comma-separated list, or completely missing. Some sources include detailed instructions while others provide only names.

Common data quality issues include:

  • Inconsistent naming: “Lat Pulldown” vs “Cable Lat Pull-Down” vs “Pulldown - Lat”
  • Missing fields: ExerciseDB lacks difficulty ratings; Wger often omits equipment
  • Duplicate exercises: Same movement appears under multiple names or variations
  • Conflicting metadata: Sources disagree on primary muscle groups
  • Equipment ambiguity: “Machine” could mean dozens of different devices

A naive approach of simply concatenating datasets creates a bloated, inconsistent database. The solution requires normalization at multiple levels.

Fetching data from multiple APIs

The first step is building reliable data extraction from each source. Each API has quirks that the extraction layer must handle.

ETL pipeline showing extraction from multiple APIs through transformation to PostgreSQL

extractors/wger.py

import requests
from typing import Iterator
from dataclasses import dataclass
from time import sleep


@dataclass
class RawExercise:
    """Raw exercise data before normalization."""
    source: str
    external_id: str
    name: str
    description: str | None
    equipment: list[str]
    muscles: list[str]
    category: str | None
    raw_data: dict  # Original payload for debugging


class WgerExtractor:
    """Extract exercise data from Wger API."""

    BASE_URL = "https://wger.de/api/v2"

    def __init__(self, language: str = "en"):
        self.language = language
        self.session = requests.Session()
        self._equipment_cache: dict[int, str] = {}
        self._muscle_cache: dict[int, str] = {}

    def _fetch_lookup_tables(self) -> None:
        """Pre-fetch equipment and muscle name mappings."""
        # Equipment mapping (id -> name)
        resp = self.session.get(f"{self.BASE_URL}/equipment/")
        for item in resp.json()["results"]:
            self._equipment_cache[item["id"]] = item["name"]
        # Muscle mapping (id -> name)
        resp = self.session.get(f"{self.BASE_URL}/muscle/")
        for item in resp.json()["results"]:
            self._muscle_cache[item["id"]] = item["name"]

    def extract(self) -> Iterator[RawExercise]:
        """Yield all exercises from the Wger API."""
        self._fetch_lookup_tables()
        url = f"{self.BASE_URL}/exercise/"
        params = {"language": 2, "limit": 100}  # language=2 is English
        while url:
            resp = self.session.get(url, params=params)
            resp.raise_for_status()
            data = resp.json()
            for exercise in data["results"]:
                # category may arrive as a nested object or a bare id,
                # depending on the endpoint version; handle both
                category = exercise.get("category")
                if isinstance(category, dict):
                    category = category.get("name")
                elif category is not None:
                    category = str(category)
                yield RawExercise(
                    source="wger",
                    external_id=str(exercise["id"]),
                    name=exercise["name"],
                    description=exercise.get("description"),
                    equipment=[
                        self._equipment_cache.get(eid, f"unknown_{eid}")
                        for eid in exercise.get("equipment", [])
                    ],
                    muscles=[
                        self._muscle_cache.get(mid, f"unknown_{mid}")
                        for mid in exercise.get("muscles", [])
                    ],
                    category=category,
                    raw_data=exercise,
                )
            url = data.get("next")
            params = {}  # Pagination URL already includes query params
            sleep(0.5)  # Rate limiting

ExerciseDB requires a RapidAPI key and has a different structure:

extractors/exercisedb.py

import requests
from typing import Iterator

from extractors.wger import RawExercise


class ExerciseDBExtractor:
    """Extract exercise data from ExerciseDB via RapidAPI."""

    BASE_URL = "https://exercisedb.p.rapidapi.com"

    def __init__(self, api_key: str):
        self.headers = {
            "X-RapidAPI-Key": api_key,
            "X-RapidAPI-Host": "exercisedb.p.rapidapi.com",
        }

    def extract(self) -> Iterator[RawExercise]:
        """Yield all exercises from ExerciseDB."""
        resp = requests.get(
            f"{self.BASE_URL}/exercises",
            headers=self.headers,
            params={"limit": 2000},
        )
        resp.raise_for_status()
        for exercise in resp.json():
            yield RawExercise(
                source="exercisedb",
                external_id=exercise["id"],
                name=exercise["name"],
                description=(exercise.get("instructions") or [None])[0],
                equipment=[exercise["equipment"]] if exercise.get("equipment") else [],
                muscles=[exercise["target"]] if exercise.get("target") else [],
                category=exercise.get("bodyPart"),
                raw_data=exercise,
            )

Note: Always store the raw_data payload. When normalization logic changes, having the original data avoids re-fetching from rate-limited APIs.
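During development it helps to make that payload replayable. A minimal sketch, assuming the raw payloads are dumped to a local JSONL file; the path and helper names are illustrative, not part of the extractors above:

import json
from pathlib import Path

def dump_raw(exercises, path: str = "raw_exercises.jsonl") -> None:
    """Append each raw exercise's source, external id, and payload to a JSONL file."""
    with Path(path).open("a", encoding="utf-8") as f:
        for ex in exercises:
            record = {"source": ex.source, "external_id": ex.external_id, "raw_data": ex.raw_data}
            f.write(json.dumps(record) + "\n")

def replay_raw(path: str = "raw_exercises.jsonl"):
    """Yield stored payloads so normalization can be re-run without touching the APIs."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Usage: dump_raw(WgerExtractor().extract()) once, then iterate replay_raw()
# whenever the normalization logic changes.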

Fuzzy matching for equipment normalization

Equipment names vary wildly across sources. “Dumbbell,” “dumbbells,” “Dumbells” (misspelled), and “Free Weights - Dumbbell” should all resolve to the same canonical equipment entry.

Entity resolution showing similar equipment names merged into canonical form

The solution uses fuzzy string matching with the rapidfuzz library, combined with a manually curated list of canonical equipment names:

normalization/equipment.py

from rapidfuzz import fuzz, process
from dataclasses import dataclass


@dataclass
class CanonicalEquipment:
    """Canonical equipment with known aliases."""
    id: int
    name: str
    category: str
    aliases: list[str]


# Curated list of canonical equipment
CANONICAL_EQUIPMENT = [
    CanonicalEquipment(1, "Barbell", "free_weights",
                       ["barbell", "olympic bar", "standard bar", "bar"]),
    CanonicalEquipment(2, "Dumbbell", "free_weights",
                       ["dumbbell", "dumbbells", "dumbells", "db", "free weights - dumbbell"]),
    CanonicalEquipment(3, "Cable Machine", "machines",
                       ["cable", "cable machine", "pulley", "cable pulley", "cable stack"]),
    CanonicalEquipment(4, "Bodyweight", "bodyweight",
                       ["bodyweight", "body weight", "none", "no equipment", "body only"]),
    CanonicalEquipment(5, "Kettlebell", "free_weights",
                       ["kettlebell", "kettle bell", "kb"]),
    CanonicalEquipment(6, "Resistance Band", "accessories",
                       ["band", "bands", "resistance band", "elastic band", "loop band"]),
    # ... 30+ more entries
]


class EquipmentNormalizer:
    """Normalize equipment names to canonical forms."""

    def __init__(self, threshold: int = 80):
        self.threshold = threshold
        self._build_lookup()

    def _build_lookup(self) -> None:
        """Build alias -> canonical mapping."""
        self.alias_map: dict[str, CanonicalEquipment] = {}
        self.all_aliases: list[str] = []
        for equip in CANONICAL_EQUIPMENT:
            for alias in equip.aliases:
                normalized = alias.lower().strip()
                self.alias_map[normalized] = equip
                self.all_aliases.append(normalized)

    def normalize(self, raw_name: str) -> CanonicalEquipment | None:
        """Resolve raw equipment name to canonical form."""
        if not raw_name:
            return None
        cleaned = raw_name.lower().strip()
        # Exact match
        if cleaned in self.alias_map:
            return self.alias_map[cleaned]
        # Fuzzy match
        match = process.extractOne(
            cleaned,
            self.all_aliases,
            scorer=fuzz.token_sort_ratio,
        )
        if match and match[1] >= self.threshold:
            return self.alias_map[match[0]]
        return None  # Unknown equipment

The token_sort_ratio scorer handles word reordering (“Bench Press” matches “Press Bench”), which is common in exercise naming. A threshold of 80% catches typos while avoiding false positives.
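To make that concrete, here is a quick comparison of scorers on a few representative pairs (exact scores vary slightly between rapidfuzz versions):

from rapidfuzz import fuzz

pairs = [
    ("bench press", "press bench"),          # word reordering
    ("dumbbell", "dumbells"),                # misspelling
    ("cable machine", "leg press machine"),  # different equipment - should score lower than the matches above
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: ratio={fuzz.ratio(a, b):.0f}, "
          f"token_sort_ratio={fuzz.token_sort_ratio(a, b):.0f}")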

Pro Tip: Log unmatched equipment names to a file. Reviewing these periodically reveals new aliases to add to the canonical list, improving match rates over time.
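One way to wire that in without touching the normalizer itself is a thin wrapper; the log file name and the wrapper function below are illustrative:

import logging

from normalization.equipment import EquipmentNormalizer

unmatched_logger = logging.getLogger("equipment.unmatched")
unmatched_logger.addHandler(logging.FileHandler("unmatched_equipment.log"))
unmatched_logger.setLevel(logging.INFO)

def normalize_with_logging(normalizer: EquipmentNormalizer, raw_name: str):
    """Normalize an equipment name, logging misses so new aliases can be curated."""
    result = normalizer.normalize(raw_name)
    if result is None and raw_name:
        unmatched_logger.info(raw_name.strip().lower())
    return result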

Entity resolution for exercises

Equipment normalization is relatively straightforward because the vocabulary is limited. Exercise resolution is harder - there are thousands of exercises with subtle variations.

The approach combines multiple signals:

  1. Fuzzy name matching for catching spelling variations
  2. Equipment overlap - exercises with the same name but different equipment are distinct
  3. Muscle group similarity - additional confidence signal

normalization/exercises.py

import hashlib
import re
from typing import NamedTuple

from rapidfuzz import fuzz

from extractors.wger import RawExercise
from normalization.equipment import EquipmentNormalizer


class ExerciseSignature(NamedTuple):
    """Unique signature for exercise deduplication."""
    name_normalized: str
    equipment_ids: frozenset[int]
    primary_muscle: str | None


class ExerciseResolver:
    """Resolve exercises across sources to canonical entries."""

    def __init__(self, equipment_normalizer: EquipmentNormalizer):
        self.equipment_normalizer = equipment_normalizer
        self.canonical_exercises: dict[str, dict] = {}  # signature_hash -> exercise
        self.merge_log: list[dict] = []  # Track merge decisions for review

    def _compute_signature(self, exercise: RawExercise) -> ExerciseSignature:
        """Compute deduplication signature for exercise."""
        # Normalize name: lowercase, remove parentheticals, strip whitespace
        name = exercise.name.lower()
        name = re.sub(r'\([^)]*\)', '', name)    # Remove (barbell), (dumbbell), etc.
        name = re.sub(r'[^a-z0-9\s]', '', name)  # Remove punctuation
        name = ' '.join(name.split())            # Normalize whitespace
        # Resolve equipment to canonical IDs
        equip_ids = set()
        for raw_equip in exercise.equipment:
            canonical = self.equipment_normalizer.normalize(raw_equip)
            if canonical:
                equip_ids.add(canonical.id)
        # Primary muscle (first listed)
        primary = exercise.muscles[0].lower() if exercise.muscles else None
        return ExerciseSignature(name, frozenset(equip_ids), primary)

    def resolve(self, exercise: RawExercise) -> str:
        """Resolve exercise to canonical ID, creating if new."""
        sig = self._compute_signature(exercise)
        sig_hash = hashlib.md5(str(sig).encode()).hexdigest()[:12]
        # Check for an exact signature match
        if sig_hash in self.canonical_exercises:
            existing = self.canonical_exercises[sig_hash]
            self.merge_log.append({
                "action": "merged",
                "new_source": exercise.source,
                "new_name": exercise.name,
                "canonical_name": existing["name"],
                "signature": sig,
            })
            return sig_hash
        # Check fuzzy match against existing entries (catches near-duplicates)
        for existing_hash, existing in self.canonical_exercises.items():
            name_score = fuzz.token_sort_ratio(
                sig.name_normalized, existing["signature"].name_normalized
            )
            equip_overlap = len(sig.equipment_ids & existing["signature"].equipment_ids)
            if name_score > 90 and equip_overlap > 0:
                self.merge_log.append({
                    "action": "fuzzy_merged",
                    "new_source": exercise.source,
                    "new_name": exercise.name,
                    "canonical_name": existing["name"],
                    "name_score": name_score,
                })
                return existing_hash
        # New canonical exercise
        self.canonical_exercises[sig_hash] = {
            "name": exercise.name,  # Keep first-seen name as canonical
            "signature": sig,
            "sources": [exercise.source],
        }
        return sig_hash

The merge log is critical for debugging. When users report missing exercises or incorrect merges, the log shows exactly why decisions were made.
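A small helper makes that review practical: dump the log to disk and flag the fuzzy merges closest to the threshold first. The output path and the 95-point cutoff below are illustrative choices, not part of the resolver above:

import json

from normalization.exercises import ExerciseResolver

def export_merge_log(resolver: ExerciseResolver, path: str = "merge_log.json") -> None:
    """Write all merge decisions to disk and flag fuzzy merges worth a manual look."""
    with open(path, "w", encoding="utf-8") as f:
        # default=str handles values json cannot serialize (e.g. the frozenset of equipment ids)
        json.dump(resolver.merge_log, f, indent=2, default=str)
    needs_review = [
        entry for entry in resolver.merge_log
        if entry["action"] == "fuzzy_merged" and entry.get("name_score", 100) < 95
    ]
    print(f"{len(needs_review)} fuzzy merges scored below 95 - review these first")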

PostgreSQL schema design

The normalized schema separates concerns into distinct tables with proper foreign key relationships:

schema.sql

-- Canonical equipment entries
CREATE TABLE equipment (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    category VARCHAR(50) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Muscle groups with hierarchy
CREATE TABLE muscle_groups (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES muscle_groups(id),
    body_region VARCHAR(50)  -- upper, lower, core
);

-- Canonical exercises (deduplicated)
CREATE TABLE exercises (
    id VARCHAR(12) PRIMARY KEY,  -- Hash-based ID for determinism
    name VARCHAR(200) NOT NULL,
    description TEXT,
    difficulty VARCHAR(20),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Many-to-many: exercises <-> equipment
CREATE TABLE exercise_equipment (
    exercise_id VARCHAR(12) REFERENCES exercises(id) ON DELETE CASCADE,
    equipment_id INTEGER REFERENCES equipment(id) ON DELETE CASCADE,
    is_primary BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (exercise_id, equipment_id)
);

-- Many-to-many: exercises <-> muscle groups
CREATE TABLE exercise_muscles (
    exercise_id VARCHAR(12) REFERENCES exercises(id) ON DELETE CASCADE,
    muscle_group_id INTEGER REFERENCES muscle_groups(id) ON DELETE CASCADE,
    involvement VARCHAR(20) DEFAULT 'primary',  -- primary, secondary, stabilizer
    PRIMARY KEY (exercise_id, muscle_group_id, involvement)
);

-- Source tracking for provenance
CREATE TABLE exercise_sources (
    id SERIAL PRIMARY KEY,
    exercise_id VARCHAR(12) REFERENCES exercises(id) ON DELETE CASCADE,
    source_name VARCHAR(50) NOT NULL,
    external_id VARCHAR(100),
    original_name VARCHAR(200),
    raw_data JSONB,
    fetched_at TIMESTAMP DEFAULT NOW(),
    UNIQUE (source_name, external_id)
);

-- Indexes for common queries
CREATE INDEX idx_exercises_name_trgm ON exercises USING gin (name gin_trgm_ops);
CREATE INDEX idx_exercise_sources_exercise ON exercise_sources(exercise_id);
CREATE INDEX idx_exercise_equipment_equip ON exercise_equipment(equipment_id);
CREATE INDEX idx_exercise_muscles_muscle ON exercise_muscles(muscle_group_id);

Key design decisions:

  • Hash-based exercise IDs ensure the same exercise gets the same ID across pipeline runs
  • JSONB raw_data column preserves original payloads without schema rigidity
  • Trigram index (gin_trgm_ops) enables fast fuzzy search in PostgreSQL
  • Involvement levels for muscles capture that some exercises work a muscle as stabilizer vs. primary mover

Pro Tip: Enable the pg_trgm extension with CREATE EXTENSION pg_trgm; before creating the trigram index.
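With the extension and index in place, fuzzy lookups can run directly in SQL. A minimal sketch from application code, assuming a psycopg2 connection; "%%" is psycopg2's escaping of the pg_trgm "%" similarity operator:

def search_exercises(conn, query: str, limit: int = 10):
    """Return (id, name, score) rows for names trigram-similar to the query."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, name, similarity(name, %s) AS score
            FROM exercises
            WHERE name %% %s
            ORDER BY score DESC
            LIMIT %s
            """,
            (query, query, limit),
        )
        return cur.fetchall()

# Example: search_exercises(conn, "bench pres") should surface "Bench Press"
# variants despite the typo.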

Data pipeline architecture

The complete pipeline follows an Extract-Transform-Load pattern with clear stage separation:

pipeline/orchestrator.py

import logging
from dataclasses import dataclass
from datetime import datetime

from psycopg2.extras import Json

from extractors.wger import RawExercise, WgerExtractor
from extractors.exercisedb import ExerciseDBExtractor
from extractors.kaggle import KaggleExtractor  # Kaggle CSV extractor, not shown in this article
from normalization.equipment import EquipmentNormalizer
from normalization.exercises import ExerciseResolver


@dataclass
class PipelineConfig:
    """Configuration for the ETL pipeline."""
    wger_enabled: bool = True
    exercisedb_api_key: str | None = None
    kaggle_path: str | None = None
    fuzzy_threshold: int = 80
    batch_size: int = 100


class ExercisePipeline:
    """Orchestrate the exercise data ETL pipeline."""

    def __init__(self, config: PipelineConfig, db_connection):
        self.config = config
        self.db = db_connection
        self.equipment_normalizer = EquipmentNormalizer(config.fuzzy_threshold)
        self.exercise_resolver = ExerciseResolver(self.equipment_normalizer)
        self.logger = logging.getLogger(__name__)

    def run(self) -> dict:
        """Execute full pipeline, return statistics."""
        stats = {"extracted": 0, "merged": 0, "loaded": 0, "errors": 0}
        start_time = datetime.now()

        # Extract from all configured sources
        raw_exercises = []
        if self.config.wger_enabled:
            self.logger.info("Extracting from Wger API...")
            for ex in WgerExtractor().extract():
                raw_exercises.append(ex)
                stats["extracted"] += 1
        if self.config.exercisedb_api_key:
            self.logger.info("Extracting from ExerciseDB...")
            for ex in ExerciseDBExtractor(self.config.exercisedb_api_key).extract():
                raw_exercises.append(ex)
                stats["extracted"] += 1
        if self.config.kaggle_path:
            self.logger.info("Extracting from Kaggle dataset...")
            for ex in KaggleExtractor(self.config.kaggle_path).extract():
                raw_exercises.append(ex)
                stats["extracted"] += 1
        self.logger.info(f"Extracted {stats['extracted']} raw exercises")

        # Transform: resolve to canonical exercises
        canonical_map: dict[str, list[RawExercise]] = {}
        for raw in raw_exercises:
            try:
                canonical_id = self.exercise_resolver.resolve(raw)
                if canonical_id not in canonical_map:
                    canonical_map[canonical_id] = []
                canonical_map[canonical_id].append(raw)
            except Exception as e:
                self.logger.error(f"Resolution failed for {raw.name}: {e}")
                stats["errors"] += 1
        stats["merged"] = stats["extracted"] - len(canonical_map)
        self.logger.info(f"Resolved to {len(canonical_map)} canonical exercises")

        # Load to PostgreSQL
        self._load_equipment()
        self._load_exercises(canonical_map)
        stats["loaded"] = len(canonical_map)

        # Save merge log for review
        self._save_merge_log()

        stats["duration_seconds"] = (datetime.now() - start_time).total_seconds()
        return stats

    def _load_exercises(self, canonical_map: dict[str, list[RawExercise]]) -> None:
        """Load canonical exercises with all relationships."""
        with self.db.cursor() as cur:
            for canonical_id, sources in canonical_map.items():
                # Use first source's data as primary, enrich from others
                primary = sources[0]
                # Merge descriptions (take longest)
                description = max(
                    (s.description for s in sources if s.description),
                    key=len, default=None
                )
                # Insert canonical exercise
                cur.execute("""
                    INSERT INTO exercises (id, name, description)
                    VALUES (%s, %s, %s)
                    ON CONFLICT (id) DO UPDATE SET
                        description = COALESCE(EXCLUDED.description, exercises.description),
                        updated_at = NOW()
                """, (canonical_id, primary.name, description))
                # Insert source tracking
                for source in sources:
                    cur.execute("""
                        INSERT INTO exercise_sources
                            (exercise_id, source_name, external_id, original_name, raw_data)
                        VALUES (%s, %s, %s, %s, %s)
                        ON CONFLICT (source_name, external_id) DO NOTHING
                    """, (canonical_id, source.source, source.external_id,
                          source.name, Json(source.raw_data)))
                # Insert equipment relationships
                for raw_equip in primary.equipment:
                    canonical_equip = self.equipment_normalizer.normalize(raw_equip)
                    if canonical_equip:
                        cur.execute("""
                            INSERT INTO exercise_equipment (exercise_id, equipment_id)
                            VALUES (%s, %s)
                            ON CONFLICT DO NOTHING
                        """, (canonical_id, canonical_equip.id))
        self.db.commit()

The pipeline is idempotent - running it multiple times produces the same result thanks to ON CONFLICT clauses. This allows safe re-runs when adding new sources or updating normalization logic.
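A minimal invocation sketch, assuming a psycopg2 connection; the DSN and environment variable name are placeholders, not values from this project:

import os
import psycopg2

from pipeline.orchestrator import ExercisePipeline, PipelineConfig

# Placeholder connection string and API key variable; adjust for your setup.
conn = psycopg2.connect("dbname=exercises user=postgres")
config = PipelineConfig(
    wger_enabled=True,
    exercisedb_api_key=os.environ.get("RAPIDAPI_KEY"),
    kaggle_path=None,  # skip the Kaggle source on this run
    fuzzy_threshold=80,
)
stats = ExercisePipeline(config, conn).run()
print(stats)  # counts for extracted, merged, loaded, errors, plus duration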

Handling edge cases with machine learning

For exercises that fuzzy matching cannot resolve, a lightweight ML classifier provides additional signal:

normalization/ml_classifier.py

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


class ExerciseSimilarityModel:
    """ML-based exercise similarity for edge cases."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(
            ngram_range=(1, 3),
            max_features=5000,
            stop_words='english'
        )
        self.exercise_vectors = None
        self.exercise_ids = []

    def fit(self, exercises: list[tuple[str, str]]) -> None:
        """Fit on (id, name+description) tuples."""
        self.exercise_ids = [ex[0] for ex in exercises]
        texts = [ex[1] for ex in exercises]
        self.exercise_vectors = self.vectorizer.fit_transform(texts)

    def find_similar(self, name: str, description: str = "", top_k: int = 5) -> list[tuple[str, float]]:
        """Find similar exercises by text similarity."""
        query_text = f"{name} {description}"
        query_vec = self.vectorizer.transform([query_text])
        similarities = cosine_similarity(query_vec, self.exercise_vectors).flatten()
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            (self.exercise_ids[i], float(similarities[i]))
            for i in top_indices
            if similarities[i] > 0.3  # Minimum threshold
        ]

This model runs as a secondary check when fuzzy matching confidence is low, providing human reviewers with likely matches to verify.
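A sketch of how that secondary check might be wired up; the 0.6 review cutoff and the way candidates are surfaced are illustrative choices, not part of the pipeline above:

from normalization.equipment import EquipmentNormalizer
from normalization.exercises import ExerciseResolver
from normalization.ml_classifier import ExerciseSimilarityModel

resolver = ExerciseResolver(EquipmentNormalizer())
# ... resolve all raw exercises first so canonical_exercises is populated ...

model = ExerciseSimilarityModel()
model.fit([
    (exercise_id, entry["name"])
    for exercise_id, entry in resolver.canonical_exercises.items()
])

# For a name the fuzzy matcher could not place confidently, surface likely
# matches for a human to confirm instead of merging automatically.
for exercise_id, score in model.find_similar("incline db chest press"):
    if score >= 0.6:
        print(f"possible match: {exercise_id} (cosine similarity {score:.2f})")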

Key takeaways

Building a normalized exercise database from multiple sources requires layered approaches:

  • Separate extraction from transformation - APIs change; keeping raw data allows re-processing without re-fetching
  • Build canonical entity lists for equipment - the vocabulary is finite and worth curating manually
  • Use fuzzy matching with logging - the rapidfuzz library with token_sort_ratio handles most naming variations; logs enable continuous improvement
  • Design schemas for provenance - tracking which source contributed what data aids debugging and data quality monitoring
  • Make pipelines idempotent - ON CONFLICT clauses in PostgreSQL enable safe re-runs
  • Reserve ML for edge cases - simple heuristics solve 90% of matches; ML handles the ambiguous remainder

The patterns here generalize beyond fitness data. Any domain with multiple data sources - product catalogs, medical records, geographic entities - benefits from this normalization architecture.

The complete pipeline reduces thousands of duplicate exercise entries to a clean, searchable database with full provenance tracking. More importantly, it establishes a foundation that improves over time as new aliases and edge cases are discovered and incorporated.