
Modeling CAD Assembly Hierarchies as Knowledge Graphs in Neo4j


Your CAD documentation lives in PDFs nobody reads, spreadsheets that go stale, and tribal knowledge that walks out the door when engineers leave. Meanwhile, questions like “which assemblies use this deprecated fastener?” require hours of manual searching across disconnected systems. A graph database changes this entirely—queries that once took an afternoon now complete in milliseconds.

This matters because engineering decisions cascade. Change one part, and you need to understand every assembly it touches, every material specification it depends on, and every revision that led to its current form. Relational databases force you to think in tables and joins. Graph databases let you think in relationships—exactly how engineers already reason about assemblies.

The core insight behind knowledge graphs is that relationships are first-class citizens, not afterthoughts bolted onto tables through foreign keys. When you ask “what happens if we change this bearing specification?”, you’re really asking about a web of connections: which parts use this bearing, which assemblies contain those parts, which products ship with those assemblies, and which customers have those products in the field. A knowledge graph makes these connections explicit and queryable.


Why Relational Databases Fail CAD Documentation

CAD assemblies are trees of arbitrary depth. A top-level product contains subassemblies, which contain components, which reference fasteners, which specify materials. The depth varies wildly: a simple bracket has two levels, a complete machine tool has fifteen. Relational databases handle this with recursive CTEs or adjacency lists, both of which become performance nightmares at scale.

The fundamental mismatch runs deeper than just hierarchy depth. Relational databases were designed for transactional data with well-defined schemas and predictable query patterns. CAD assemblies exhibit none of these characteristics. An engineer might query for all parts made from a specific aluminum alloy, then pivot to asking which of those parts appear in assemblies shipping to automotive customers. These ad-hoc traversals are natural in engineering work but punishing in relational systems.

Consider the relationships in a typical assembly. Parts don’t just “belong to” assemblies—they mate with other parts at specific faces, they substitute for alternative components, they depend on tolerances defined elsewhere, and they constrain the positions of neighboring parts. Each relationship type carries different semantics and different query patterns. A mate constraint between two parts has properties like mate type (coincident, concentric, parallel), the specific faces involved, and offset values. A substitution relationship needs to capture compatibility conditions and any required adapter components.

In a relational model, you’d create junction tables for each relationship type. Want to find all parts affected by a material change? That’s a join across parts, materials, assemblies, and the junction tables connecting them. Add revision history, and you’re joining against temporal tables too. The query becomes a maintenance burden before it becomes a performance problem.

The join explosion gets worse as your data model matures. Real engineering documentation needs to capture:

  • Which parts can substitute for each other (with compatibility conditions)
  • Which specifications constrain which features on which parts
  • Which revisions supersede which previous versions (and why)
  • Which suppliers can provide which parts at which lead times
  • Which test results validate which design assumptions

Each of these relationship types adds another junction table, another set of indexes, and another join in your queries.

Relational approach (simplified):

parts ←→ part_assemblies ←→ assemblies ←→ assembly_materials ←→ materials
  ↓              ↓                ↓                   ↓
part_revisions  constraints   assembly_revisions  material_specs
  ↓              ↓                ↓
suppliers   mate_definitions  test_results

Graph traversal matches how engineers actually think. “Show me everything connected to this part” is a natural graph query. “Find all paths from this fastener to top-level assemblies” is a single Cypher statement. You’re not fighting the data model to express engineering questions.

The performance difference becomes stark with deep hierarchies. A relational database executing a 10-level recursive query performs 10 separate operations, each scanning indexes. A graph database traverses 10 hops in a single operation, following pointers directly in memory. At 10,000 parts with 50,000 relationships, graph queries return in milliseconds where relational queries take seconds. At 100,000 parts with 500,000 relationships—a realistic scale for complex products—the gap widens to minutes versus milliseconds.
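To see why traversal cost tracks the neighborhood rather than the total dataset, here is a toy illustration in plain Python: ancestors are found by following child-to-parent pointers, so the work done depends only on how many nodes the walk actually reaches. The dictionary-based graph and part names are hypothetical, standing in for Neo4j's internal adjacency storage.

```python
from collections import deque

def ancestors(part_id, contained_by):
    """Walk CONTAINS edges upward, breadth-first.

    contained_by maps each node to the assemblies that contain it;
    the walk touches only reachable nodes, so its cost is independent
    of how many other parts exist in the graph.
    """
    seen, queue = set(), deque([part_id])
    while queue:
        for parent in contained_by.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Hypothetical four-level hierarchy: fastener -> bracket -> frame -> product
hierarchy = {
    'fastener': ['bracket'],
    'bracket': ['frame'],
    'frame': ['product'],
}
print(sorted(ancestors('fastener', hierarchy)))
# ['bracket', 'frame', 'product']
```

Adding a million unrelated parts to `hierarchy` would not change the number of steps this walk performs, which is the essence of index-free adjacency.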

This isn’t about Neo4j being universally better than PostgreSQL. Relational databases excel at transactional integrity, complex aggregations, and tabular reporting. If you need to generate a bill of materials sorted by cost, a relational database handles that elegantly. But for relationship-heavy queries across hierarchical structures—impact analysis, dependency tracing, constraint propagation—graphs win decisively. The smart approach uses both: a relational database as the system of record for transactional operations, and a graph database as a queryable index over the relationship structure.


Designing Your CAD Knowledge Graph Schema

A well-designed schema captures engineering semantics without over-complicating queries. Start with five core node types: Part, Assembly, Material, Specification, and Revision. Each serves a distinct purpose in the knowledge graph.

Parts are the atomic units—individual components with properties like part number, name, mass, and bounding box dimensions. Every manufactured or purchased item that appears on a drawing gets a Part node. Parts have identities that persist across revisions: when you update a bracket’s dimensions, it’s still the same bracket from an identity perspective, just at a new revision.

Assemblies are special parts that contain other parts; they share most properties but add assembly-specific metadata like total component count and assembly sequence information. The distinction between Part and Assembly could be modeled with a single node type and a boolean flag, but separate labels enable more precise queries. When you want all leaf-level parts (not assemblies), the label-based filter is cleaner than a property check.

Materials define what parts are made from, with properties for density, yield strength, and material standards compliance. Material nodes serve as aggregation points: query all parts using a specific alloy, compute total mass by material type, or trace supply chain exposure to a single material source. Material properties should include both the specification (what’s required) and the actual values (what was measured or certified).

Specifications capture engineering requirements: tolerances, surface finishes, heat treatments, and any other requirement that constrains how a part must be manufactured or must perform. Separating specifications from parts enables powerful queries: which parts share the same tolerance requirements, which specifications are most frequently violated in production, which suppliers can meet a given specification set.

Revisions provide temporal context, linking parts to their historical versions. Unlike parts, which have persistent identity, revisions are immutable snapshots. Once released, a revision never changes—any modification creates a new revision. This immutability is essential for traceability: you need to know exactly what was shipped, not what the current version looks like.

schema-creation.cypher
// Core node constraints for data integrity
CREATE CONSTRAINT part_id IF NOT EXISTS
FOR (p:Part) REQUIRE p.partNumber IS UNIQUE;

CREATE CONSTRAINT assembly_id IF NOT EXISTS
FOR (a:Assembly) REQUIRE a.assemblyNumber IS UNIQUE;

CREATE CONSTRAINT material_id IF NOT EXISTS
FOR (m:Material) REQUIRE m.materialCode IS UNIQUE;

// Indexes for common query patterns
CREATE INDEX part_name IF NOT EXISTS FOR (p:Part) ON (p.name);
CREATE INDEX revision_date IF NOT EXISTS FOR (r:Revision) ON (r.createdAt);

// Composite index for status-based queries
CREATE INDEX part_status IF NOT EXISTS FOR (p:Part) ON (p.status, p.partNumber);

// Example node creation with properties
CREATE (p:Part {
  partNumber: 'BRK-001-A',
  name: 'Mounting Bracket',
  mass: 0.245,                    // kg
  boundingBox: [50.0, 30.0, 5.0], // mm [x, y, z]
  status: 'RELEASED',
  createdAt: datetime(),
  createdBy: '[email protected]'
})
CREATE (m:Material {
  materialCode: 'AL-6061-T6',
  name: 'Aluminum 6061-T6',
  density: 2.70,      // g/cm³
  yieldStrength: 276, // MPa
  standard: 'ASTM B209',
  costPerKg: 3.50
})
CREATE (p)-[:MADE_FROM {percentage: 100}]->(m)

Relationship types encode engineering semantics. CONTAINS links assemblies to their children with properties for quantity and position. MATES_WITH captures geometric constraints between parts, storing mate type (coincident, concentric, distance) and constraint values. SUPERSEDES tracks part evolution, pointing from newer revisions to older ones. REQUIRES links parts to specifications they must satisfy.

Relationship direction deserves thought, though not for raw speed: Neo4j stores each relationship once and traverses it equally fast in either direction. Direction matters for semantic clarity and consistent queries. Pick a convention and apply it everywhere: assemblies point to the parts they contain (not parts pointing to the assemblies that contain them), and current revisions point to the revisions they supersede.

relationships.cypher
// Assembly containment with quantity and position
MATCH (a:Assembly {assemblyNumber: 'ASM-100'})
MATCH (p:Part {partNumber: 'BRK-001-A'})
CREATE (a)-[:CONTAINS {
  quantity: 4,
  positions: [[0,0,0], [100,0,0], [0,100,0], [100,100,0]],
  insertedAt: datetime()
}]->(p);

// Geometric mate relationship
MATCH (p1:Part {partNumber: 'BRK-001-A'})
MATCH (p2:Part {partNumber: 'BASE-PLATE-001'})
CREATE (p1)-[:MATES_WITH {
  mateType: 'COINCIDENT',
  face1: 'bottom',
  face2: 'top',
  offset: 0.0
}]->(p2);

// Revision chain
MATCH (current:Part {partNumber: 'BRK-001-A'})
MATCH (previous:Revision {revisionId: 'BRK-001-A-REV-B'})
CREATE (current)-[:SUPERSEDES {
  changeReason: 'Increased material thickness for load requirements',
  changedBy: '[email protected]',
  changedAt: datetime()
}]->(previous);

// Specification requirement with criticality
MATCH (p:Part {partNumber: 'BRK-001-A'})
MATCH (s:Specification {specId: 'TOL-FLATNESS-001'})
CREATE (p)-[:REQUIRES {
  criticality: 'HIGH',
  verificationMethod: 'CMM_INSPECTION'
}]->(s)

💡 Pro Tip: Store revision history as a chain of SUPERSEDES relationships rather than creating new Part nodes for each revision. This keeps your part count manageable while preserving complete history. The current Part node always represents the latest released version; traverse SUPERSEDES to walk back through history.

Handling revision history without graph explosion requires discipline. Create Revision nodes only when a part's released version changes. Draft changes stay as properties on the Part node until release. This prevents your graph from ballooning with every save operation. A part saved 50 times shouldn't create 50 nodes; it should create one Part node with a chain of Revision nodes capturing the release milestones.

Property modeling deserves careful thought. Store properties that frequently appear in WHERE clauses directly on nodes and index them. Properties that only appear in results (never in filters) need no index. Consider computed properties: the total mass of an assembly can be stored (for fast access) or computed on demand (for accuracy), depending on your update patterns.
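As a sketch of the computed option, a quantity-weighted mass rollup over CONTAINS edges might look like the following. The in-memory dictionaries stand in for graph lookups and the part numbers are illustrative:

```python
def total_mass(item_id, masses, contains):
    """Sum mass recursively over CONTAINS edges, weighting each
    child by its quantity in the parent assembly."""
    mass = masses.get(item_id, 0.0)
    for child_id, quantity in contains.get(item_id, []):
        mass += quantity * total_mass(child_id, masses, contains)
    return mass

# Hypothetical data: an assembly holding 4 brackets and 1 base plate
masses = {'BRK-001-A': 0.245, 'BASE-PLATE-001': 1.2}
contains = {'ASM-100': [('BRK-001-A', 4), ('BASE-PLATE-001', 1)]}
print(round(total_mass('ASM-100', masses, contains), 2))  # 2.18
```

If reads vastly outnumber writes, the result of this rollup is a good candidate for a stored property refreshed on release.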


Extracting Graph Data from CadQuery Models

CadQuery models contain implicit graph structure in their assembly hierarchies. Extracting this structure programmatically turns your CAD models into knowledge graph source data. The extraction process needs to be deterministic and idempotent—running it twice on the same CAD file should produce identical graph updates.

The extraction process walks the assembly tree, capturing parent-child relationships and geometric constraints. Each compound in CadQuery represents either a part or a subassembly, and the children() method reveals the hierarchy. The key challenge is generating stable identifiers: you need to recognize when a part you’re importing already exists in the graph versus when it’s a new part.

Geometric fingerprinting provides one approach to stable identification. Parts with identical bounding boxes, volumes, and centers of mass are likely the same part. This works for initial prototypes but has obvious limitations—two different parts could have identical envelopes. Production systems need content-addressable hashing of the actual B-rep geometry, which is more complex to implement but provides reliable identification.

extract_assembly_graph.py
import hashlib
import json
from typing import Any, Dict, List

import cadquery as cq


def generate_part_id(shape) -> str:
    """Generate a stable ID from shape geometry.

    Uses the bounding box as a simple geometric fingerprint.
    Production systems should hash the actual B-rep topology
    for reliable identification of identical parts.
    """
    bb = shape.BoundingBox()
    geometry_string = f"{bb.xlen:.4f}_{bb.ylen:.4f}_{bb.zlen:.4f}"
    return hashlib.md5(geometry_string.encode()).hexdigest()[:12]


def extract_graph_from_assembly(
    assembly: cq.Assembly,
    parent_id: str = None
) -> Dict[str, List[Dict[str, Any]]]:
    """
    Extract nodes and relationships from a CadQuery assembly.

    Walks the assembly tree recursively, creating node dictionaries
    for each part and relationship dictionaries for each containment.
    Returns a dictionary with 'nodes' and 'relationships' lists
    suitable for Neo4j batch import via APOC procedures.
    """
    nodes = []
    relationships = []
    for child in assembly.children:
        # children is a list of Assembly objects, each carrying its own name
        is_assembly = len(child.children) > 0
        node_label = 'Assembly' if is_assembly else 'Part'
        # Extract geometric properties from the compound shape
        shape = child.toCompound()
        bb = shape.BoundingBox()
        center = shape.Center()
        node = {
            'id': generate_part_id(shape),
            'label': node_label,
            'properties': {
                'name': child.name,
                'boundingBox': [round(bb.xlen, 3), round(bb.ylen, 3), round(bb.zlen, 3)],
                'volume': round(shape.Volume(), 6),
                'centerOfMass': [
                    round(center.x, 3),
                    round(center.y, 3),
                    round(center.z, 3)
                ]
            }
        }
        nodes.append(node)
        # Create containment relationship to parent
        if parent_id:
            # Location.toTuple() returns ((x, y, z), (rx, ry, rz))
            (tx, ty, tz), _rotation = child.loc.toTuple()
            relationships.append({
                'type': 'CONTAINS',
                'from_id': parent_id,
                'to_id': node['id'],
                'properties': {
                    'quantity': 1,
                    'position': [round(tx, 3), round(ty, 3), round(tz, 3)]
                }
            })
        # Recursively process subassemblies
        if is_assembly:
            sub_result = extract_graph_from_assembly(child, node['id'])
            nodes.extend(sub_result['nodes'])
            relationships.extend(sub_result['relationships'])
    return {'nodes': nodes, 'relationships': relationships}


def export_for_neo4j(graph_data: Dict, output_path: str):
    """Export graph data as JSON for Neo4j import.

    The output format is compatible with apoc.load.json
    for efficient batch importing without individual transactions.
    """
    with open(output_path, 'w') as f:
        json.dump(graph_data, f, indent=2)
    print(f"Exported {len(graph_data['nodes'])} nodes and "
          f"{len(graph_data['relationships'])} relationships to {output_path}")

Parametric dependencies require special handling. When one part’s dimensions depend on another’s parameters, capture this as a DEPENDS_ON relationship with the parameter names and expressions stored as properties. These dependencies are critical for impact analysis—changing a parameter on one part can cascade through dependent parts.

Extracting parametric dependencies from CadQuery models requires parsing the source code or inspecting the model’s parameter registry. A simplified approach uses regex pattern matching; production systems would use Python’s AST module for reliable parsing.

extract_parametric_deps.py
import re
from typing import Any, Dict, List


def extract_parametric_dependencies(
    model_source: str,
    part_registry: Dict[str, str]
) -> List[Dict[str, Any]]:
    """
    Parse CadQuery source to find parametric dependencies.

    Looks for patterns where one part's dimensions reference
    another part's parameters. This enables change propagation
    queries: if I change this parameter, what else changes?

    This is a simplified approach; production code would use AST
    parsing for reliable identification of cross-references.
    """
    dependencies = []
    # Pattern: references to other parts' dimensions
    # e.g., bracket_width = base_plate.width - 10
    pattern = r'(\w+)\s*=\s*(\w+)\.(\w+)'
    for match in re.finditer(pattern, model_source):
        local_param, referenced_part, referenced_param = match.groups()
        if referenced_part in part_registry:
            dependencies.append({
                'type': 'DEPENDS_ON',
                'from_id': 'current_part',  # Replace with actual part ID
                'to_id': part_registry[referenced_part],
                'properties': {
                    'localParameter': local_param,
                    'referencedParameter': referenced_param,
                    'expression': match.group(0)
                }
            })
    return dependencies
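For comparison, here is a sketch of the AST-based approach the docstring mentions, using only the standard library. The registry contents and source snippet are hypothetical, and a production version would also need to handle tuple assignments and aliased names:

```python
import ast
from typing import Any, Dict, List

def extract_dependencies_ast(source: str,
                             part_registry: Dict[str, str]) -> List[Dict[str, Any]]:
    """Find assignments whose right-hand side references another
    part's attribute, e.g. bracket_width = base_plate.width - 10."""
    deps = []
    for node in ast.walk(ast.parse(source)):
        # Only simple single-target assignments like `x = ...`
        if not (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            continue
        # Scan the right-hand side for <part>.<parameter> references
        for ref in ast.walk(node.value):
            if (isinstance(ref, ast.Attribute)
                    and isinstance(ref.value, ast.Name)
                    and ref.value.id in part_registry):
                deps.append({
                    'localParameter': node.targets[0].id,
                    'referencedPartId': part_registry[ref.value.id],
                    'referencedParameter': ref.attr,
                })
    return deps

source = "bracket_width = base_plate.width - 10\n"
print(extract_dependencies_ast(source, {'base_plate': 'part-123'}))
# [{'localParameter': 'bracket_width', 'referencedPartId': 'part-123',
#   'referencedParameter': 'width'}]
```

Unlike the regex version, this survives line breaks, nested expressions, and comments, because it works on the parsed structure rather than the raw text.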

⚠️ Warning: Geometric fingerprinting with bounding boxes works for initial prototypes but fails when parts have identical envelopes. Production systems need content-addressable hashing of the actual B-rep geometry. Consider using OpenCASCADE’s shape hash or serializing the BREP and hashing the result.

The batch import structure aligns with Neo4j’s apoc.load.json procedure, enabling efficient bulk loading without repeated individual transactions. Batch imports are dramatically faster than individual CREATE statements—the difference can be 100x or more for large datasets.
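The same principle applies on the driver side: group rows into chunks and send each chunk through a single parameterized UNWIND statement instead of one transaction per row. A minimal chunking helper, with an illustrative batch size and query text:

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

def batches(rows: Iterable[Dict[str, Any]],
            size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Yield fixed-size chunks so each transaction stays bounded."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Each chunk would back one parameterized statement, e.g.:
#   UNWIND $rows AS row
#   MERGE (p:Part {id: row.id}) SET p += row.properties
rows = [{'id': f'part-{i}'} for i in range(2500)]
print([len(chunk) for chunk in batches(rows)])  # [1000, 1000, 500]
```

With the official neo4j Python driver, each chunk would be passed as the `rows` parameter of one `session.run` call, turning thousands of round trips into a handful.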

For incremental updates, use MERGE instead of CREATE to avoid duplicating nodes. The pattern below handles both new parts and updates to existing parts:

incremental_import.cypher
// Merge nodes to avoid duplicates; pass properties as both
// onCreate and onMatch so re-imports update existing parts
CALL apoc.load.json('file:///assembly_graph.json') YIELD value
UNWIND value.nodes AS node
CALL apoc.merge.node([node.label], {id: node.id},
                     node.properties, node.properties) YIELD node AS n
RETURN count(n) AS nodesProcessed;

// Merge relationships
CALL apoc.load.json('file:///assembly_graph.json') YIELD value
UNWIND value.relationships AS rel
MATCH (from {id: rel.from_id})
MATCH (to {id: rel.to_id})
CALL apoc.merge.relationship(from, rel.type, {},
                             rel.properties, to, rel.properties) YIELD rel AS r
RETURN count(r) AS relationshipsProcessed;

Cypher Queries That Answer Real Engineering Questions

The value of a knowledge graph emerges in the queries it enables. Four query patterns cover most engineering documentation needs: impact analysis, dependency tracing, conflict detection, and revision history. Each pattern exploits graph traversal to answer questions that would require complex recursive CTEs or multiple queries in a relational database.

Impact analysis answers “what breaks if I change this part?” The query traverses upward from a part through all containing assemblies, returning the complete blast radius of a proposed change. This is the query engineers run most frequently, and it’s the one that justifies the entire knowledge graph investment.

impact_analysis.cypher
// Find all assemblies affected by changing a specific part
// Returns each assembly once, ordered by proximity (nearest first)
MATCH path = (p:Part {partNumber: 'FASTENER-M8-001'})<-[:CONTAINS*]-(a:Assembly)
WITH a, min(length(path)) AS depth
ORDER BY depth
RETURN a.assemblyNumber AS assembly,
       a.name AS name,
       depth AS levelsUp,
       a.status AS status;

// More detailed: include all sibling parts that mate with the changed part
// This identifies parts that might need geometric updates
MATCH (changed:Part {partNumber: 'FASTENER-M8-001'})
MATCH (changed)-[:MATES_WITH]-(sibling:Part)
MATCH (sibling)<-[:CONTAINS*]-(topLevel:Assembly)
WHERE NOT (topLevel)<-[:CONTAINS]-()
RETURN topLevel.assemblyNumber AS topLevelAssembly,
       collect(DISTINCT sibling.partNumber) AS affectedSiblings,
       count(DISTINCT sibling) AS siblingCount;

// Full impact analysis with categorization
MATCH (changed:Part {partNumber: 'FASTENER-M8-001'})
OPTIONAL MATCH (changed)<-[:CONTAINS]-(directParent:Assembly)
OPTIONAL MATCH (changed)<-[:CONTAINS*2..]-(indirectParent:Assembly)
OPTIONAL MATCH (changed)-[:MATES_WITH]-(matedPart:Part)
RETURN changed.partNumber AS changedPart,
       collect(DISTINCT directParent.assemblyNumber) AS directlyAffected,
       collect(DISTINCT indirectParent.assemblyNumber) AS indirectlyAffected,
       collect(DISTINCT matedPart.partNumber) AS geometricallyConstrained

Material dependency tracing follows the supply chain impact of material changes. When a vendor discontinues a material or new regulations restrict its use, you need every part and assembly affected. This query pattern also supports cost analysis: if aluminum prices rise, which products are most exposed?

material_trace.cypher
// Find all parts and their assemblies using a specific material
MATCH (m:Material {materialCode: 'AL-6061-T6'})
MATCH (p:Part)-[:MADE_FROM]->(m)
OPTIONAL MATCH (p)<-[:CONTAINS*]-(a:Assembly)
WITH p, collect(DISTINCT a.assemblyNumber) AS containingAssemblies
RETURN p.partNumber AS part,
       p.name AS partName,
       p.status AS status,
       containingAssemblies
ORDER BY size(containingAssemblies) DESC;

// Aggregate material usage across all parts
// Useful for supply chain planning and cost estimation
MATCH (m:Material)<-[r:MADE_FROM]-(p:Part)
WITH m, count(DISTINCT p) AS partCount,
     sum(p.mass * r.percentage / 100) AS totalMass
RETURN m.materialCode AS material,
       m.name AS materialName,
       partCount,
       round(totalMass, 2) AS totalMassKg
ORDER BY totalMassKg DESC;

// Material substitution analysis
MATCH (restricted:Material {materialCode: 'AL-6061-T6'})
MATCH (p:Part)-[:MADE_FROM]->(restricted)
OPTIONAL MATCH (p)-[:CAN_USE_ALTERNATIVE]->(alt:Material)
WITH p, collect(alt.materialCode) AS alternatives
RETURN p.partNumber AS part,
       p.name AS partName,
       alternatives,
       CASE WHEN size(alternatives) > 0 THEN 'HAS_ALTERNATIVES' ELSE 'NO_ALTERNATIVES' END AS status

Circular dependency detection catches constraint conflicts before they cause assembly failures. Parts that mutually depend on each other create unsolvable constraint systems—the CAD software will fail to regenerate the model. Detecting these cycles in the graph prevents wasted engineering time.

circular_dependency_check.cypher
// Find circular dependencies in mate constraints
// Any cycle indicates a potential constraint solver failure
MATCH path = (p:Part)-[:MATES_WITH*2..10]->(p)
RETURN [node IN nodes(path) | node.partNumber] AS circularChain,
       length(path) AS chainLength;

// Find parts with conflicting parametric dependencies
// Mutual dependencies can cause infinite update loops
MATCH (p1:Part)-[d1:DEPENDS_ON]->(p2:Part)-[d2:DEPENDS_ON]->(p1)
RETURN p1.partNumber AS part1,
       p2.partNumber AS part2,
       d1.localParameter AS param1DependsOn,
       d2.localParameter AS param2DependsOn;

// Detect transitive dependency cycles (A -> B -> C -> A)
MATCH path = (start:Part)-[:DEPENDS_ON*3..8]->(start)
RETURN [node IN nodes(path) | node.partNumber] AS dependencyCycle,
       [rel IN relationships(path) | rel.localParameter] AS parameters

Revision history queries reconstruct the evolution of a design, showing what changed and why. These queries support both engineering investigations (“why does this part look like this?”) and compliance requirements (“prove that this change was reviewed and approved”).

revision_history.cypher
// Get complete revision history for a part
// Change metadata lives on the SUPERSEDES relationships
MATCH (current:Part {partNumber: 'BRK-001-A'})-[:SUPERSEDES*0..]->()-[s:SUPERSEDES]->(rev:Revision)
RETURN rev.revisionId AS revision,
       s.changedAt AS date,
       s.changedBy AS engineer,
       s.changeReason AS reason
ORDER BY s.changedAt DESC;

// Compare the current part against an earlier revision snapshot
MATCH (newer:Part {partNumber: 'BRK-001-A'})
MATCH (older:Revision {revisionId: 'BRK-001-A-REV-B'})
RETURN newer.mass - older.mass AS massDelta,
       newer.boundingBox[0] - older.boundingBox[0] AS widthDelta,
       newer.boundingBox[1] - older.boundingBox[1] AS heightDelta,
       newer.boundingBox[2] - older.boundingBox[2] AS depthDelta;

// Find all changes by a specific engineer in a time range
MATCH (p:Part)-[s:SUPERSEDES]->(r:Revision)
WHERE s.changedBy = '[email protected]'
  AND s.changedAt > datetime('2025-01-01')
RETURN p.partNumber AS part,
       s.changeReason AS reason,
       s.changedAt AS date
ORDER BY s.changedAt DESC

📝 Note: These queries assume proper indexing on frequently queried properties. Without indexes on partNumber, materialCode, and createdAt, performance degrades significantly beyond 10,000 nodes. Run SHOW INDEXES (or CALL db.indexes() on Neo4j 4.x and earlier) to verify your indexes are in place.


Integrating with Your Engineering Workflow

A knowledge graph that engineers don’t use is worthless. Integration with existing tools—version control, CAD systems, engineering portals—makes the graph a natural part of the workflow rather than another system to maintain. The goal is invisible automation: the graph updates itself when engineers do their normal work.

Git hooks trigger graph updates when CAD files are committed. This ensures the knowledge graph stays synchronized without requiring engineers to remember manual update steps. The hook pattern works for any version control system; the example uses Git because it’s ubiquitous.

The key design principle is asynchronous processing. The git hook should complete quickly (under a second) to avoid disrupting the engineer’s workflow. The actual graph update happens in a background job, potentially minutes later. If the update fails, it should alert someone rather than blocking the commit.

git_hook_updater.py
#!/usr/bin/env python3
"""
Post-commit hook that updates Neo4j when CAD files change.
Place in .git/hooks/post-commit and make executable.

This hook queues changed files for processing rather than
processing them synchronously, keeping commit times fast.
"""
import subprocess
import sys

from neo4j import GraphDatabase

# Configuration - in production, read from environment
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "your-password"
CAD_EXTENSIONS = {'.step', '.stp', '.cq.py', '.cadquery'}


def get_changed_files() -> list[str]:
    """Get files changed in the last commit."""
    result = subprocess.run(
        ['git', 'diff-tree', '--no-commit-id', '--name-only', '-r', 'HEAD'],
        capture_output=True, text=True
    )
    return [line for line in result.stdout.strip().split('\n') if line]


def is_cad_file(filename: str) -> bool:
    """Check if file is a CAD-related file.

    Uses endswith rather than Path.suffix so multi-part
    extensions like '.cq.py' match correctly.
    """
    return any(filename.lower().endswith(ext) for ext in CAD_EXTENSIONS)


def update_graph_for_file(driver, filepath: str):
    """Queue a CAD file for graph extraction.

    Rather than processing immediately, we mark existing nodes
    as needing verification and queue the file for background
    processing. This keeps the commit fast.
    """
    with driver.session() as session:
        # Mark existing nodes as potentially stale
        session.run("""
            MATCH (p:Part {sourceFile: $filepath})
            SET p.lastChecked = datetime(), p.needsUpdate = true
        """, filepath=filepath)
    # In production, this would publish to a message queue
    # For simplicity, we just log the intent
    print(f"Queued {filepath} for graph extraction")


def main():
    changed_files = get_changed_files()
    cad_files = [f for f in changed_files if is_cad_file(f)]
    if not cad_files:
        sys.exit(0)
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    try:
        for filepath in cad_files:
            update_graph_for_file(driver, filepath)
        print(f"Updated graph for {len(cad_files)} CAD files")
    finally:
        driver.close()


if __name__ == '__main__':
    main()

A lightweight query API exposes graph queries to engineering tools without requiring Cypher knowledge. Flask provides a minimal wrapper that translates REST calls into graph queries. Engineers can integrate these endpoints into scripts, dashboards, or other tools.

The API should expose high-level engineering concepts, not raw Cypher. An engineer asking “what’s affected by this change?” shouldn’t need to understand graph traversal. They should call /api/impact/PART-001 and get an answer.

cad_graph_api.py
from flask import Flask, jsonify, request
from neo4j import GraphDatabase

app = Flask(__name__)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


@app.route('/api/impact/<part_number>')
def get_impact_analysis(part_number: str):
    """Return all assemblies affected by changing a part.

    This is the most frequently used endpoint—engineers call it
    before making any change to understand the blast radius.
    """
    with driver.session() as session:
        result = session.run("""
            MATCH path = (p:Part {partNumber: $pn})<-[:CONTAINS*]-(a:Assembly)
            WITH a, length(path) AS depth
            RETURN a.assemblyNumber AS assembly,
                   a.name AS name,
                   depth
            ORDER BY depth
        """, pn=part_number)
        return jsonify({
            'partNumber': part_number,
            'affectedAssemblies': [dict(record) for record in result]
        })


@app.route('/api/materials/<material_code>/usage')
def get_material_usage(material_code: str):
    """Return all parts using a specific material.

    Called during supply chain analysis and material substitution
    planning. Returns parts ordered by mass for prioritization.
    """
    with driver.session() as session:
        result = session.run("""
            MATCH (m:Material {materialCode: $mc})<-[:MADE_FROM]-(p:Part)
            RETURN p.partNumber AS part, p.name AS name, p.mass AS mass
            ORDER BY p.mass DESC
        """, mc=material_code)
        return jsonify({
            'materialCode': material_code,
            'parts': [dict(record) for record in result]
        })


@app.route('/api/search')
def search_parts():
    """Full-text search across part names and descriptions.

    Supports the engineering portal's search box, enabling
    engineers to find parts by name fragments.
    """
    query = request.args.get('q', '')
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Part)
            WHERE p.name CONTAINS $query OR p.description CONTAINS $query
            RETURN p.partNumber AS part, p.name AS name
            LIMIT 50
        """, query=query)
        return jsonify({'results': [dict(record) for record in result]})


if __name__ == '__main__':
    app.run(port=5000)

Caching strategies prevent query latency from degrading the user experience. For large assemblies with deep hierarchies, pre-compute common traversals and store them as materialized paths. This trades write-time computation for read-time speed—appropriate when reads vastly outnumber writes, as they do in most documentation systems.

materialized_paths.cypher
// Pre-compute top-level assembly membership for fast filtering
// Run this nightly or after bulk imports
MATCH (p:Part)
MATCH (p)<-[:CONTAINS*]-(top:Assembly)
WHERE NOT (top)<-[:CONTAINS]-()
WITH p, collect(DISTINCT top.assemblyNumber) AS topLevelAssemblies
SET p.topLevelAssemblies = topLevelAssemblies;

// Now queries can filter without traversal
// This turns a multi-hop traversal into a property check
MATCH (p:Part)
WHERE 'PRODUCT-001' IN p.topLevelAssemblies
RETURN p.partNumber, p.name

💡 Pro Tip: Run materialized path updates in a nightly batch job rather than on every change. Real-time updates create transaction contention; batch updates keep writes fast. The slight staleness (up to 24 hours) is acceptable for most documentation queries.


Avoiding Common Pitfalls

Building CAD knowledge graphs exposes several traps that seem reasonable initially but cause pain at scale. Learn from these mistakes before making them.

Over-normalizing relationships fragments information that’s always queried together. If you always need a material’s density when querying parts, don’t force a separate lookup. Denormalize density onto the MADE_FROM relationship or even onto Part nodes directly. The graph purist in you will object; your query performance will thank you. The rule of thumb: if you always traverse a relationship when accessing a node, consider merging the target node’s key properties onto the relationship.

Storing geometry in the graph sounds appealing but creates bloat. B-rep geometry for a moderately complex part can be megabytes. Multiply by thousands of parts and revisions, and your graph becomes unwieldy—slow to backup, slow to query, expensive to host. Instead, store geometry in files (STEP, BREP, or your CAD system’s native format) and keep only the file reference in the graph. The graph holds metadata and relationships; external storage holds geometry.

Good pattern:
(Part)-[:GEOMETRY_AT]->(GeometryRef {
  path: '/cad-files/parts/BRK-001-A/geometry.step',
  checksum: 'sha256:abc123...',
  format: 'STEP AP214'
})

Bad pattern:
(Part {
  geometry: <5MB of OCCT serialization>,
  ...
})

Unbounded revision history causes graph explosion. If every save creates a new Revision node, a part edited 50 times has 50 Revision nodes. Across 10,000 parts, that’s potentially 500,000 Revision nodes—most capturing trivial changes like moving a dimension witness line. Instead, create Revision nodes only at release milestones. Draft changes update Part properties in place; released versions get permanent Revision snapshots. This keeps revision count proportional to meaningful engineering changes, not to save frequency.

Ignoring query patterns during schema design leads to expensive rewrites. Before finalizing your schema, write the ten most important queries your engineers will run. If a query requires traversing through five intermediate nodes, consider adding a direct relationship. Schema design is query design. The best schema is the one that makes your common queries simple and fast.

Treating the graph as a source of truth rather than a derived index causes synchronization nightmares. The CAD files are the source of truth. The graph is a queryable index over them. If they diverge, the graph is wrong. Design your update pipeline with this principle: always re-derive from source rather than incrementally patching the graph. When in doubt, delete and rebuild.

Neglecting to plan for schema migration locks you into early design decisions. Build your import pipeline to support schema versions. When you need to add a new relationship type or split a node type in two, a versioned pipeline lets you re-import everything cleanly rather than writing brittle migration scripts. Version your schema explicitly with metadata nodes or properties that track what schema version each node was created under.


Key Takeaways

  • Model assemblies as subgraphs with CONTAINS relationships and parts as nodes with material/specification edges to enable impact analysis queries that complete in milliseconds instead of hours
  • Use APOC procedures for batch importing CAD data and MERGE operations to handle incremental updates without duplicating nodes—this prevents graph explosion from repeated imports
  • Start with three core queries: affected-assemblies, material-trace, and revision-diff, then expand your query library based on actual engineering questions rather than anticipated needs
  • Trigger graph updates from your version control system using post-commit hooks rather than polling CAD files—this keeps documentation synchronized automatically and eliminates manual update steps
  • Store geometry references in the graph, not geometry itself—keep file paths and checksums as node properties while storing actual B-rep data in external file storage

Resources