Building Knowledge Graphs with Neo4j for CAD Documentation: A Practical Guide


Your CAD documentation lives in scattered PDFs, tribal knowledge, and the head of that one engineer who's about to retire. When someone asks "which parts use this bearing specification?" or "what assemblies will break if we change this tolerance?", the answer takes hours of archaeology through disconnected systems. A spec change that should take minutes to assess becomes a multi-day investigation across spreadsheets, file servers, and hallway conversations. Knowledge graphs solve this by making the relationships between CAD entities explicit, queryable, and traversable.

The knowledge graph approach transforms documentation from static files into a living network of connected information. Instead of searching through folders hoping to find the right document, engineers query the graph directly: “Show me everything connected to this bearing.” The graph returns not just the bearing’s specification sheet, but every assembly using it, every tolerance that depends on it, and every engineer who has modified it. This shift from document-centric to relationship-centric documentation fundamentally changes how engineering teams access and trust their technical knowledge.

Why Relational Databases Fail CAD Documentation

CAD data is inherently graph-shaped. A housing assembly contains a motor sub-assembly, which references a bearing specification, which depends on a material grade, which was approved by a specific engineer during a particular revision cycle. These relationships form chains that extend five, ten, or fifteen levels deep. Relational databases can represent this structure, but the representation becomes increasingly painful as depth increases.

Consider a straightforward question: “Which assemblies use parts that reference specification SPEC-4521?” In a relational model, you need parts, assemblies, specifications, and the join tables connecting them. The query requires multiple JOINs, and the performance degrades as your part catalog grows. Add revision history, and you’re now joining against temporal tables. Add material dependencies, and the query becomes a maintenance nightmare. What starts as a simple lookup transforms into a query that spans eight tables with complex join conditions that only the original author understands.

The fundamental issue is that relational databases treat relationships as second-class citizens. Connections between entities exist only through foreign keys and join operations. Every traversal requires the database to compute set intersections at runtime. When your engineering team asks “what breaks if we change this?”, the database has to repeatedly rediscover the same relationship paths. This computational overhead compounds with each level of depth, turning millisecond queries into multi-second operations that frustrate users and discourage exploration.

Graph databases invert this model entirely. Relationships are stored directly alongside the nodes they connect, pre-computed and indexed for instant traversal. Moving from a part to its specification to dependent assemblies requires following pointers, not computing joins. The query "find all parts within 5 hops of this specification" executes in time proportional to the size of the subgraph it touches, not the size of the total database. Whether your database contains ten thousand parts or ten million, the traversal speed remains consistent because you're following direct connections rather than searching through tables.

This architectural difference becomes critical for CAD documentation because the most valuable queries are relationship-intensive. Impact analysis, dependency tracking, design reuse discovery—these operations traverse the graph repeatedly. A query that takes 30 seconds in PostgreSQL executes in milliseconds in Neo4j because the relationship structure is pre-computed and indexed. Engineers who previously avoided running impact analyses because of wait times suddenly run them routinely, catching potential issues before they propagate through the design.

Graph databases also model the domain more naturally. When you describe CAD relationships to a new team member, you draw boxes and arrows on a whiteboard. Parts contain sub-parts. Specifications constrain tolerances. Engineers approve revisions. Materials satisfy requirements. This mental model maps directly to nodes and relationships in a graph database. Your schema documentation becomes the same diagram you’d draw anyway, reducing the cognitive gap between how engineers think about their data and how the database stores it.

The decision to use a graph database isn’t about abandoning relational systems entirely. Your ERP system, your ordering database, your financial records—these remain relational because their query patterns favor tabular operations. The knowledge graph supplements these systems by providing a query layer optimized for the relationship-heavy questions that CAD documentation demands. Think of it as adding a specialized index optimized for connectivity queries, one that coexists with your existing infrastructure rather than replacing it.

Modeling CAD Entities as a Knowledge Graph

A well-designed graph schema captures engineering reality without over-complicating the model. Start with core node types and expand only when your queries demand additional structure. The temptation to model every conceivable entity upfront leads to schemas that are difficult to maintain and queries that are harder to write than necessary.

Part nodes represent individual components with properties like part number, description, status, and creation date. These are the atomic units of your design—the bearings, fasteners, housings, and brackets that compose larger structures. Assembly nodes represent collections of parts and sub-assemblies, capturing the hierarchical structure of your products. Specification nodes capture the engineering requirements that constrain parts—tolerances, materials, surface finishes, and performance criteria. Material nodes define the substances from which parts are manufactured, linking to supplier data and certification requirements. Engineer nodes represent the people who design, approve, and modify components, enabling attribution and expertise discovery. Revision nodes track the temporal evolution of parts through their design lifecycle.

Relationships carry as much meaning as nodes, and in many cases, more. CONTAINS connects assemblies to their constituent parts and sub-assemblies, with properties capturing quantity and position within the assembly. REFERENCES links parts to the specifications they must satisfy, indicating whether the reference is mandatory or advisory. SUPERSEDES creates revision chains, connecting newer versions to the parts they replace while preserving the historical record. DESIGNED_BY attributes ownership to engineers, supporting expertise queries like “who knows about hydraulic actuators?” REQUIRES_TOLERANCE captures the specific tolerance bands applied to part-specification pairs, because the same specification might apply differently to different parts. MADE_FROM connects parts to materials, enabling supply chain impact analysis when material availability changes.

schema.cypher
// Create constraints for unique identifiers
CREATE CONSTRAINT part_number IF NOT EXISTS
FOR (p:Part) REQUIRE p.partNumber IS UNIQUE;
CREATE CONSTRAINT spec_id IF NOT EXISTS
FOR (s:Specification) REQUIRE s.specId IS UNIQUE;
CREATE CONSTRAINT assembly_id IF NOT EXISTS
FOR (a:Assembly) REQUIRE a.assemblyId IS UNIQUE;
CREATE CONSTRAINT engineer_id IF NOT EXISTS
FOR (e:Engineer) REQUIRE e.employeeId IS UNIQUE;
CREATE CONSTRAINT material_id IF NOT EXISTS
FOR (m:Material) REQUIRE m.materialId IS UNIQUE;
// Create indexes for frequently queried properties
CREATE INDEX part_status IF NOT EXISTS
FOR (p:Part) ON (p.status);
CREATE INDEX spec_category IF NOT EXISTS
FOR (s:Specification) ON (s.category);
CREATE INDEX part_current IF NOT EXISTS
FOR (p:Part) ON (p.current);
// Example node creation with relationships
CREATE (housing:Assembly {
  assemblyId: 'ASM-001',
  name: 'Motor Housing Assembly',
  revision: 'C',
  status: 'RELEASED',
  createdDate: date('2024-01-10')
})
CREATE (bearing:Part {
  partNumber: 'PRT-4521',
  description: '6205-2RS Deep Groove Ball Bearing',
  status: 'ACTIVE',
  createdDate: date('2024-03-15'),
  current: true
})
CREATE (bearingSpec:Specification {
  specId: 'SPEC-4521',
  title: 'Bearing Load Rating Requirements',
  category: 'MECHANICAL',
  revision: 'B',
  minLoadRating: 14000,
  maxOperatingTemp: 120
})
CREATE (steel:Material {
  materialId: 'MAT-1020',
  name: 'Chrome Steel 52100',
  supplier: 'MetalWorks Inc',
  certificationRequired: true
})
CREATE (engineer:Engineer {
  employeeId: 'ENG-042',
  name: 'Sarah Chen',
  department: 'Mechanical Design',
  expertise: ['bearings', 'rotating machinery']
})
CREATE (housing)-[:CONTAINS {quantity: 2, position: 'FRONT'}]->(bearing)
CREATE (bearing)-[:REFERENCES {criticality: 'HIGH', mandatory: true}]->(bearingSpec)
CREATE (bearing)-[:MADE_FROM {percentage: 100}]->(steel)
CREATE (bearing)-[:DESIGNED_BY {role: 'LEAD', date: date('2024-03-15')}]->(engineer)

Property placement matters more than it might initially seem. Put intrinsic attributes on nodes: part numbers, descriptions, revision letters, material compositions. These properties belong to the entity itself regardless of its relationships. Put contextual attributes on relationships: quantities within assemblies, criticality ratings for specification references, effective dates for supersession, roles for engineer attribution. This distinction keeps queries clean and prevents property duplication across multiple relationship instances.

Revision history requires careful modeling to avoid graph explosion. Rather than creating new nodes for every property change, use SUPERSEDES relationships to chain major revisions while storing minor changes as property updates. A part that goes through revisions A, B, and C becomes three nodes connected by SUPERSEDES relationships. Each node carries its own specification references and assembly memberships valid for that revision. This approach preserves complete traceability while keeping the graph navigable.
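
This dual structure stays easy to query. Here is a sketch of walking a part's revision history backward through the supersession chain, assuming each revision carries its own part number as in the supersession example later in this article (the *0.. bound includes the starting revision itself):

revision_history.cypher
// Walk the supersession chain backward from a given revision
MATCH (start:Part {partNumber: $partNumber})
MATCH path = (start)-[:SUPERSEDES*0..]->(rev:Part)
RETURN rev.partNumber AS revision,
       rev.status AS status,
       length(path) AS generationsBack
ORDER BY generationsBack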

💡 Pro Tip: Add a current boolean property to Part nodes and maintain it during imports. This lets you quickly filter to active parts without traversing supersession chains for every query. When a new revision is created, set current = false on the old revision and current = true on the new one in a single transaction.

Avoid the temptation to model everything upfront. Your initial schema should support your first three use cases. As engineers request new query types, extend the model. A schema that grows with usage stays cleaner than one designed for hypothetical future needs. You can always add node types and relationship types later; removing them is much harder once queries depend on them.

Setting Up Neo4j and Ingesting CAD Metadata

Neo4j offers two deployment paths suited to different stages of your project. Neo4j Desktop provides a local development environment with full feature access—use it for prototyping, schema experimentation, and development. The desktop application bundles a complete Neo4j instance with management tools, making it easy to create, start, stop, and reset databases as you iterate on your design. AuraDB delivers managed cloud hosting with automatic backups, scaling, and high availability—use it for production deployments where team-wide access and reliability matter. For CAD documentation systems, start with Desktop to validate your schema and refine your queries, then migrate to AuraDB when you’re ready for production.

Install the Python driver and supporting libraries to programmatically interact with your database:

Terminal window
pip install neo4j python-dotenv pydantic
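
The connection code below reads its credentials from environment variables. For local development, a .env file alongside your scripts might look like this (placeholder values shown):

.env
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=replace-with-your-password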

The following script establishes a connection pattern you’ll reuse throughout your ingestion pipeline. The context manager pattern ensures connections are properly closed even when exceptions occur, preventing connection pool exhaustion during long-running imports:

neo4j_connection.py
from neo4j import GraphDatabase
from contextlib import contextmanager
import os
from dotenv import load_dotenv
import logging

load_dotenv()
logger = logging.getLogger(__name__)


class CADGraphClient:
    """Client for interacting with the CAD knowledge graph in Neo4j."""

    def __init__(self):
        uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
        user = os.getenv("NEO4J_USER", "neo4j")
        password = os.getenv("NEO4J_PASSWORD")
        if not password:
            raise ValueError("NEO4J_PASSWORD environment variable is required")
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        # Verify connectivity on initialization
        self.driver.verify_connectivity()
        logger.info(f"Connected to Neo4j at {uri}")

    def close(self):
        """Close the driver connection."""
        self.driver.close()
        logger.info("Neo4j connection closed")

    @contextmanager
    def session(self, database: str = "neo4j"):
        """Context manager for Neo4j sessions."""
        session = self.driver.session(database=database)
        try:
            yield session
        finally:
            session.close()

    def batch_import_parts(self, parts: list[dict], batch_size: int = 1000):
        """
        Import parts in batches to manage memory usage.

        Uses MERGE to make imports idempotent - safe to re-run
        without creating duplicates.
        """
        query = """
        UNWIND $parts AS part
        MERGE (p:Part {partNumber: part.partNumber})
        SET p.description = part.description,
            p.status = part.status,
            p.createdDate = date(part.createdDate),
            p.material = part.material,
            p.current = part.current,
            p.lastModified = datetime()
        """
        total_imported = 0
        with self.session() as session:
            for i in range(0, len(parts), batch_size):
                batch = parts[i:i + batch_size]
                session.run(query, parts=batch)
                total_imported += len(batch)
                logger.info(f"Imported parts {i} to {i + len(batch)} ({total_imported} total)")
        return total_imported

    def batch_import_assemblies(self, assemblies: list[dict], batch_size: int = 500):
        """Import assembly nodes with their properties."""
        query = """
        UNWIND $assemblies AS asm
        MERGE (a:Assembly {assemblyId: asm.assemblyId})
        SET a.name = asm.name,
            a.revision = asm.revision,
            a.status = asm.status,
            a.lastModified = datetime()
        """
        with self.session() as session:
            for i in range(0, len(assemblies), batch_size):
                batch = assemblies[i:i + batch_size]
                session.run(query, assemblies=batch)

    def create_containment_relationships(self, relationships: list[dict]):
        """Create CONTAINS relationships between assemblies and parts."""
        query = """
        UNWIND $rels AS rel
        MATCH (a:Assembly {assemblyId: rel.assemblyId})
        MATCH (p:Part {partNumber: rel.partNumber})
        MERGE (a)-[r:CONTAINS]->(p)
        SET r.quantity = rel.quantity,
            r.position = rel.position,
            r.lastModified = datetime()
        """
        with self.session() as session:
            session.run(query, rels=relationships)
            logger.info(f"Created {len(relationships)} CONTAINS relationships")

    def create_specification_references(self, references: list[dict]):
        """Link parts to their specifications."""
        query = """
        UNWIND $refs AS ref
        MATCH (p:Part {partNumber: ref.partNumber})
        MATCH (s:Specification {specId: ref.specId})
        MERGE (p)-[r:REFERENCES]->(s)
        SET r.criticality = ref.criticality,
            r.mandatory = ref.mandatory,
            r.lastModified = datetime()
        """
        with self.session() as session:
            session.run(query, refs=references)
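
A quick way to verify the client works end to end is a small driver script. This sketch imports a single made-up part record using the class above:

import_example.py
from neo4j_connection import CADGraphClient

client = CADGraphClient()
try:
    parts = [{
        "partNumber": "PRT-4521",
        "description": "6205-2RS Deep Groove Ball Bearing",
        "status": "ACTIVE",
        "createdDate": "2024-03-15",
        "material": "MAT-1020",
        "current": True
    }]
    imported = client.batch_import_parts(parts)
    print(f"Imported {imported} parts")
finally:
    client.close()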

CAD systems export metadata in STEP and IGES formats. These files contain geometric data alongside product structure information that reveals part relationships. The steputils library parses STEP files to extract part hierarchies, though you’ll need to map the STEP entity structure to your graph schema:

step_parser.py
from steputils import p21
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import gc
import logging

logger = logging.getLogger(__name__)


@dataclass
class PartMetadata:
    part_number: str
    description: str
    step_entity_id: int


@dataclass
class AssemblyRelation:
    parent_id: str
    child_id: str
    quantity: int = 1
    position: Optional[str] = None


def extract_step_metadata(step_file: Path) -> dict:
    """
    Extract part and assembly metadata from a STEP file.

    STEP files encode product structure through PRODUCT entities
    and NEXT_ASSEMBLY_USAGE_OCCURRENCE relationships. This function
    extracts both and maps them to our graph schema.
    """
    logger.info(f"Parsing STEP file: {step_file}")
    step_data = p21.readfile(str(step_file))

    parts = []
    assemblies = []
    relationships = []
    entity_to_part = {}  # Map STEP entity IDs to part numbers
    assembly_ids = set()  # Track which parts act as assembly parents

    # First pass: extract all PRODUCT entities
    for entity in step_data.get_entities():
        if entity.entity_name == "PRODUCT":
            part_id = entity.id
            # PRODUCT params: (id, name, description, frame_of_reference);
            # the first parameter is typically the part number
            part_name = entity.params[0] if entity.params else f"PART-{part_id}"
            description = entity.params[1] if len(entity.params) > 1 else ""
            # Clean up part number - remove quotes and whitespace
            part_name = str(part_name).strip("'\"").strip()
            parts.append({
                "partNumber": part_name,
                "description": str(description).strip("'\""),
                "status": "ACTIVE",
                "current": True,
                "createdDate": "2024-01-01",  # Default, override from PDM
                "material": None
            })
            entity_to_part[part_id] = part_name

    # Second pass: extract assembly relationships
    for entity in step_data.get_entities():
        if entity.entity_name == "NEXT_ASSEMBLY_USAGE_OCCURRENCE":
            # NAUO params: (id, name, description, relating_product_definition,
            # related_product_definition, reference_designator) - the parent and
            # child references are params[3] and params[4]. Real STEP files route
            # these through PRODUCT_DEFINITION entities, so production code may
            # need an extra resolution pass to reach the PRODUCT entities.
            parent_ref = entity.params[3] if len(entity.params) > 3 else None
            child_ref = entity.params[4] if len(entity.params) > 4 else None
            # Resolve references to actual part numbers
            parent_part = entity_to_part.get(parent_ref)
            child_part = entity_to_part.get(child_ref)
            if parent_part and child_part:
                relationships.append({
                    "assemblyId": parent_part,
                    "partNumber": child_part,
                    "quantity": 1,
                    "position": None
                })
                # Mark parent as an assembly if not already
                if parent_part not in assembly_ids:
                    assembly_ids.add(parent_part)
                    assemblies.append({
                        "assemblyId": parent_part,
                        "name": parent_part,
                        "revision": "A",
                        "status": "ACTIVE"
                    })

    logger.info(f"Extracted {len(parts)} parts, {len(assemblies)} assemblies, "
                f"{len(relationships)} relationships")
    return {
        "parts": parts,
        "assemblies": assemblies,
        "relationships": relationships
    }


def process_step_directory(directory: Path, client) -> dict:
    """Process all STEP files in a directory and import to Neo4j."""
    total_parts = 0
    total_relationships = 0
    # STEP files commonly use both .step and .stp extensions
    step_files = list(directory.glob("**/*.step")) + list(directory.glob("**/*.stp"))
    for step_file in step_files:
        try:
            metadata = extract_step_metadata(step_file)
            client.batch_import_parts(metadata["parts"])
            client.batch_import_assemblies(metadata["assemblies"])
            client.create_containment_relationships(metadata["relationships"])
            total_parts += len(metadata["parts"])
            total_relationships += len(metadata["relationships"])
            # Free memory between files for large imports
            gc.collect()
        except Exception as e:
            logger.error(f"Failed to process {step_file}: {e}")
            continue
    return {"parts": total_parts, "relationships": total_relationships}

⚠️ Warning: Large STEP files can consume significant memory during parsing. Process files individually rather than loading an entire directory into memory simultaneously, and use Python’s garbage collector between files if memory becomes constrained. For files exceeding 500MB, consider streaming parsers or processing on machines with adequate RAM.

For initial bulk imports exceeding 100,000 nodes, use Neo4j’s LOAD CSV command with pre-processed CSV exports from your CAD system. This bypasses the Python driver for faster throughput, using Neo4j’s optimized CSV import pipeline. Export your PDM system to CSV, stage the files on a network path accessible to Neo4j (or use a file:// URI for local files), and execute the import directly in Cypher. This approach can import millions of nodes in minutes rather than hours.
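
As a sketch, assuming a parts.csv export with partNumber, description, and status columns staged in Neo4j's import directory, the bulk load might look like this (Neo4j 5 syntax; Neo4j 4.x uses USING PERIODIC COMMIT instead):

bulk_import.cypher
// Bulk-load parts from a CSV export, committing in batches.
// CALL ... IN TRANSACTIONS requires an auto-commit transaction
// (use :auto in Neo4j Browser).
LOAD CSV WITH HEADERS FROM 'file:///parts.csv' AS row
CALL {
  WITH row
  MERGE (p:Part {partNumber: row.partNumber})
  SET p.description = row.description,
      p.status = row.status,
      p.current = true,
      p.lastModified = datetime()
} IN TRANSACTIONS OF 10000 ROWS;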

Cypher Queries That Answer Real Engineering Questions

The value of a knowledge graph materializes through queries that would be impractical in relational systems. Impact analysis—determining what breaks when something changes—is the highest-value query category for CAD documentation. These queries justify the infrastructure investment by preventing costly design errors that would otherwise propagate through production.

Finding all parts affected by a specification change requires traversing inward from the specification to find all parts that reference it, then outward to find all assemblies containing those parts. This bidirectional traversal executes in milliseconds regardless of database size:

impact_analysis.cypher
// Find all assemblies impacted by a specification change
// The *1..5 syntax traverses 1 to 5 levels of CONTAINS relationships
MATCH (s:Specification {specId: $specId})<-[:REFERENCES]-(p:Part)<-[:CONTAINS*1..5]-(a:Assembly)
WHERE p.current = true
RETURN DISTINCT a.assemblyId AS assembly,
       a.name AS assemblyName,
       collect(DISTINCT p.partNumber) AS affectedParts,
       count(DISTINCT p) AS partCount
ORDER BY partCount DESC

The *1..5 syntax controls traversal depth—here we’re looking up to five levels of assembly containment. Adjust this based on your product structure depth. For deeply nested assemblies like aircraft or automotive systems, you might need *1..10 or higher. The query remains performant because Neo4j follows indexed relationships rather than computing joins.

For comprehensive impact analysis that includes engineer notification, extend the query to capture who designed the affected parts:

impact_with_engineers.cypher
// Impact analysis with engineer attribution for notification
MATCH (s:Specification {specId: $specId})<-[:REFERENCES]-(p:Part)
WHERE p.current = true
OPTIONAL MATCH (p)-[:DESIGNED_BY]->(e:Engineer)
OPTIONAL MATCH (p)<-[:CONTAINS*1..5]-(a:Assembly)
RETURN p.partNumber AS part,
       p.description AS description,
       collect(DISTINCT e.name) AS engineers,
       collect(DISTINCT e.employeeId) AS engineerIds,
       collect(DISTINCT a.assemblyId) AS affectedAssemblies
ORDER BY size(affectedAssemblies) DESC

Bill of materials queries flatten the graph into hierarchical output suitable for traditional BOM reports while preserving the depth information that indicates assembly structure:

bom_query.cypher
// Generate a complete bill of materials with depth tracking
MATCH path = (root:Assembly {assemblyId: $assemblyId})-[:CONTAINS*]->(component)
WHERE component:Part OR component:Assembly
WITH component,
     length(path) AS depth,
     [node IN nodes(path) |
       CASE WHEN node:Assembly THEN node.assemblyId ELSE node.partNumber END
     ] AS hierarchy,
     [rel IN relationships(path) | rel.quantity] AS quantities
RETURN component.partNumber AS partNumber,
       component.description AS description,
       depth,
       hierarchy,
       reduce(total = 1, q IN quantities | total * coalesce(q, 1)) AS totalQuantity,
       CASE WHEN component:Assembly THEN 'ASSEMBLY' ELSE 'PART' END AS nodeType
ORDER BY hierarchy

Orphaned parts—components that exist but belong to no assembly—indicate data quality issues or deprecated inventory that should be reviewed:

orphan_detection.cypher
// Find parts not contained in any assembly
MATCH (p:Part)
WHERE p.current = true
AND NOT EXISTS { (a:Assembly)-[:CONTAINS]->(p) }
RETURN p.partNumber, p.description, p.createdDate, p.status
ORDER BY p.createdDate DESC

Circular dependencies in CAD data usually indicate modeling errors—an assembly that contains itself through some chain of sub-assemblies. These should never exist and warrant immediate investigation:

circular_dependency.cypher
// Detect circular containment relationships
MATCH path = (a:Assembly)-[:CONTAINS*]->(a)
RETURN a.assemblyId AS circularAssembly,
       length(path) AS cycleLength,
       [node IN nodes(path) |
         CASE WHEN node:Assembly THEN node.assemblyId ELSE node.partNumber END
       ] AS cyclePath

Design reuse queries identify parts with similar specification profiles, suggesting candidates for standardization that reduce inventory costs and simplify maintenance:

design_reuse.cypher
// Find parts sharing multiple specifications (reuse candidates)
MATCH (p1:Part)-[:REFERENCES]->(s:Specification)<-[:REFERENCES]-(p2:Part)
WHERE p1.partNumber < p2.partNumber // Avoid duplicate pairs
  AND p1.current = true AND p2.current = true
WITH p1, p2, collect(s.specId) AS sharedSpecs, count(s) AS specCount
WHERE specCount >= 3 // Threshold for meaningful similarity
RETURN p1.partNumber AS part1,
       p1.description AS description1,
       p2.partNumber AS part2,
       p2.description AS description2,
       sharedSpecs,
       specCount
ORDER BY specCount DESC
LIMIT 20

Material supply chain impact queries help assess the blast radius when a material becomes unavailable or requires substitution:

material_impact.cypher
// Find all parts and assemblies affected by material shortage
MATCH (m:Material {materialId: $materialId})<-[:MADE_FROM]-(p:Part)
WHERE p.current = true
OPTIONAL MATCH (p)<-[:CONTAINS*1..5]-(a:Assembly)
RETURN m.name AS material,
       collect(DISTINCT p.partNumber) AS affectedParts,
       collect(DISTINCT a.assemblyId) AS affectedAssemblies,
       count(DISTINCT p) AS partCount,
       count(DISTINCT a) AS assemblyCount

Building a Search API with Python and Neo4j

Wrapping your knowledge graph in an API layer provides controlled access for engineering tools, PLM integrations, and custom dashboards. FastAPI delivers the performance and developer experience appropriate for internal tooling, with automatic OpenAPI documentation that helps other teams integrate with your service.

api.py
from fastapi import FastAPI, HTTPException, Query, Depends
from pydantic import BaseModel, Field
from neo4j_connection import CADGraphClient
from typing import Optional
from contextlib import asynccontextmanager


# Initialize client at startup, close at shutdown
@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.client = CADGraphClient()
    yield
    app.state.client.close()


app = FastAPI(
    title="CAD Knowledge Graph API",
    description="Query interface for CAD documentation knowledge graph",
    version="1.0.0",
    lifespan=lifespan
)


def get_client() -> CADGraphClient:
    return app.state.client


class ImpactAnalysisResponse(BaseModel):
    specification_id: str
    affected_parts: list[str]
    affected_assemblies: list[dict]
    total_parts_affected: int
    engineers_to_notify: list[str] = Field(default_factory=list)


class PartSearchResponse(BaseModel):
    parts: list[dict]
    total_count: int
    query: str


class BOMComponent(BaseModel):
    part_number: Optional[str]
    assembly_id: Optional[str]
    description: str
    depth: int
    quantity: int
    node_type: str
    specifications: list[str] = Field(default_factory=list)


class BOMResponse(BaseModel):
    root_assembly: str
    components: list[BOMComponent]
    total_components: int


@app.get("/impact-analysis/{spec_id}", response_model=ImpactAnalysisResponse)
def get_impact_analysis(
    spec_id: str,
    max_depth: int = Query(default=5, le=10, ge=1),
    include_engineers: bool = Query(default=False),
    client: CADGraphClient = Depends(get_client)
):
    """
    Analyze the impact of changing a specification.

    Returns all parts that reference the specification and all assemblies
    that contain those parts, up to the specified traversal depth.
    """
    # Cypher does not allow parameters in variable-length bounds, so the
    # validated integer (constrained to 1..10 above) is interpolated directly.
    query = f"""
    MATCH (s:Specification {{specId: $specId}})<-[:REFERENCES]-(p:Part)
    WHERE p.current = true
    OPTIONAL MATCH (p)<-[:CONTAINS*1..{max_depth}]-(a:Assembly)
    OPTIONAL MATCH (p)-[:DESIGNED_BY]->(e:Engineer)
    WITH s,
         collect(DISTINCT p.partNumber) AS parts,
         collect(DISTINCT {{
             assemblyId: a.assemblyId,
             name: a.name,
             revision: a.revision
         }}) AS assemblies,
         collect(DISTINCT e.name) AS engineers
    RETURN s.specId AS specId,
           s.title AS specTitle,
           parts,
           [asm IN assemblies WHERE asm.assemblyId IS NOT NULL] AS assemblies,
           engineers
    """
    with client.session() as session:
        result = session.run(query, specId=spec_id)
        record = result.single()
        if not record:
            raise HTTPException(
                status_code=404,
                detail=f"Specification {spec_id} not found"
            )
        return ImpactAnalysisResponse(
            specification_id=record["specId"],
            affected_parts=record["parts"],
            affected_assemblies=record["assemblies"],
            total_parts_affected=len(record["parts"]),
            engineers_to_notify=record["engineers"] if include_engineers else []
        )


@app.get("/parts/search", response_model=PartSearchResponse)
def search_parts(
    q: str = Query(..., min_length=2, description="Search query"),
    status: Optional[str] = Query(None, description="Filter by status"),
    current_only: bool = Query(True, description="Only show current revisions"),
    limit: int = Query(default=50, le=200, ge=1),
    client: CADGraphClient = Depends(get_client)
):
    """
    Full-text search across part numbers and descriptions.

    Requires full-text index to be created first:
    CREATE FULLTEXT INDEX partSearch FOR (p:Part) ON EACH [p.description, p.partNumber]
    """
    query = """
    CALL db.index.fulltext.queryNodes('partSearch', $searchTerm)
    YIELD node, score
    WHERE ($status IS NULL OR node.status = $status)
      AND ($currentOnly = false OR node.current = true)
    RETURN node.partNumber AS partNumber,
           node.description AS description,
           node.status AS status,
           node.current AS current,
           node.createdDate AS createdDate,
           score
    ORDER BY score DESC
    LIMIT $limit
    """
    with client.session() as session:
        result = session.run(
            query,
            searchTerm=q,
            status=status,
            currentOnly=current_only,
            limit=limit
        )
        parts = [dict(record) for record in result]
    return PartSearchResponse(
        parts=parts,
        total_count=len(parts),
        query=q
    )


@app.get("/assemblies/{assembly_id}/bom", response_model=BOMResponse)
def get_bill_of_materials(
    assembly_id: str,
    max_depth: int = Query(default=10, le=20, ge=1),
    include_specifications: bool = Query(default=False),
    client: CADGraphClient = Depends(get_client)
):
    """
    Retrieve complete bill of materials for an assembly.

    Returns all components (parts and sub-assemblies) contained within
    the specified assembly, with depth and quantity information.
    """
    # First verify the assembly exists
    check_query = "MATCH (a:Assembly {assemblyId: $assemblyId}) RETURN a"
    with client.session() as session:
        if not session.run(check_query, assemblyId=assembly_id).single():
            raise HTTPException(
                status_code=404,
                detail=f"Assembly {assembly_id} not found"
            )
    # As above, the validated max_depth is interpolated because
    # variable-length bounds cannot be parameterized
    query = f"""
    MATCH path = (root:Assembly {{assemblyId: $assemblyId}})-[:CONTAINS*1..{max_depth}]->(component)
    WHERE component:Part OR component:Assembly
    WITH component,
         length(path) AS depth,
         reduce(qty = 1, rel IN relationships(path) | qty * coalesce(rel.quantity, 1)) AS quantity
    OPTIONAL MATCH (component)-[:REFERENCES]->(spec:Specification)
    RETURN component.partNumber AS partNumber,
           component.assemblyId AS assemblyId,
           component.description AS description,
           depth,
           quantity,
           CASE WHEN component:Assembly THEN 'ASSEMBLY' ELSE 'PART' END AS nodeType,
           collect(spec.specId) AS specifications
    ORDER BY depth, partNumber, assemblyId
    """
    with client.session() as session:
        result = session.run(query, assemblyId=assembly_id)
        components = []
        for record in result:
            comp = BOMComponent(
                part_number=record["partNumber"],
                assembly_id=record["assemblyId"],
                description=record["description"] or "",
                depth=record["depth"],
                quantity=record["quantity"],
                node_type=record["nodeType"],
                specifications=record["specifications"] if include_specifications else []
            )
            components.append(comp)
    return BOMResponse(
        root_assembly=assembly_id,
        components=components,
        total_components=len(components)
    )


@app.get("/health")
def health_check(client: CADGraphClient = Depends(get_client)):
    """Verify database connectivity."""
    try:
        with client.session() as session:
            session.run("RETURN 1")
        return {"status": "healthy", "database": "connected"}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Database unavailable: {e}")

📝 Note: Parameterized queries (using $variable syntax) prevent Cypher injection and enable query plan caching. Never concatenate user input directly into query strings. Neo4j compiles and caches query plans for parameterized queries, significantly improving performance for repeated query patterns.

The API returns graph fragments rather than flattened tables when relationships matter to the consumer. For BOM queries, include the hierarchical structure with depth information. For simple searches, flatten to a list with relevance scores. Match the response shape to how downstream systems will consume the data—a web dashboard might want nested JSON while a spreadsheet export needs flat rows.

Create the full-text index before deploying the search endpoint:

CREATE FULLTEXT INDEX partSearch IF NOT EXISTS
FOR (p:Part) ON EACH [p.description, p.partNumber]
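
To try the service locally, run it with uvicorn (FastAPI's usual ASGI server); the auto-generated OpenAPI docs are then available at http://localhost:8000/docs:

Terminal window
pip install fastapi "uvicorn[standard]"
uvicorn api:app --reload --port 8000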

Keeping the Graph Current with Your CAD System

A knowledge graph that diverges from engineering reality becomes worse than useless—it becomes actively misleading. Engineers who learn to distrust the graph will revert to hallway conversations and tribal knowledge, wasting the infrastructure investment. Synchronization strategy determines whether your graph remains a trusted source or becomes another stale documentation artifact.

Event-driven updates provide the tightest synchronization and should be your target architecture. Modern PLM systems like Teamcenter, Windchill, and Fusion 360 emit events when parts change. Subscribe to these events through webhooks or message queues (Kafka, RabbitMQ, or cloud-native equivalents), and apply updates to the graph within seconds of the source change. This approach requires integration work upfront but eliminates synchronization lag and ensures engineers always see current data. The graph becomes an extension of the PLM system rather than a separate artifact to maintain.
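
Event payloads vary by PLM system, so the following is only a sketch: a webhook endpoint that assumes a generic JSON body with eventType and part fields, and upserts the change with the same idempotent MERGE pattern used during imports:

plm_webhook.py
from fastapi import FastAPI, Request
from neo4j_connection import CADGraphClient

app = FastAPI()
client = CADGraphClient()

@app.post("/webhooks/plm")
async def handle_plm_event(request: Request):
    """Apply a PLM part event to the graph. Assumed payload shape:
    {"eventType": "PART_UPDATED", "part": {"partNumber": ..., ...}}"""
    event = await request.json()
    if event.get("eventType") in ("PART_CREATED", "PART_UPDATED"):
        part = event["part"]
        with client.session() as session:
            session.run(
                """
                MERGE (p:Part {partNumber: $partNumber})
                SET p.description = $description,
                    p.status = $status,
                    p.lastModified = datetime()
                """,
                partNumber=part["partNumber"],
                description=part.get("description"),
                status=part.get("status", "ACTIVE"),
            )
    return {"received": True}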

When event-driven integration isn’t feasible—perhaps your PLM system lacks event support, or organizational constraints prevent real-time integration—periodic full syncs provide a pragmatic alternative. Export your PDM data nightly, diff against the previous export, and apply only the changes. Use MERGE statements throughout your import scripts—they create nodes that don’t exist and update nodes that do, making your imports idempotent and safe to re-run. A failed import at 2 AM can be re-executed at 6 AM without creating duplicate data.
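
A minimal diff pass over two nightly exports, assuming CSV files keyed by partNumber, might look like this sketch; changed rows feed batch_import_parts (which MERGE makes idempotent), while removed part numbers feed the archival query discussed next:

nightly_diff.py
import csv
from pathlib import Path

def load_export(path: Path) -> dict[str, dict]:
    """Load a PDM CSV export into a dict keyed by part number."""
    with open(path, newline="") as f:
        return {row["partNumber"]: row for row in csv.DictReader(f)}

def diff_exports(previous: Path, current: Path) -> tuple[list[dict], list[str]]:
    """Return (added or changed rows, removed part numbers) between exports."""
    old, new = load_export(previous), load_export(current)
    changed = [row for pn, row in new.items() if old.get(pn) != row]
    removed = [pn for pn in old if pn not in new]
    return changed, removed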

Handling deletions requires explicit policy decisions that align with your engineering organization’s traceability requirements. When a part is removed from the CAD system, should the graph node disappear entirely, or transition to an ARCHIVED status? For traceability purposes, most engineering organizations prefer soft deletion: set a status property to ARCHIVED and exclude archived nodes from standard queries using a WHERE clause. This preserves the historical record—critical for regulatory compliance and failure analysis—while keeping active queries clean and performant.
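
Given the removed part numbers from a diff like the one above, the soft-delete pass is a single parameterized statement:

archive_parts.cypher
// Soft-delete parts that disappeared from the latest export
MATCH (p:Part)
WHERE p.partNumber IN $removedPartNumbers
SET p.status = 'ARCHIVED',
    p.archivedDate = date()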

Superseded parts follow a similar pattern that preserves revision history while maintaining query simplicity. When revision B replaces revision A, create the new node, establish the SUPERSEDES relationship, and set current = false on the old revision within a single transaction. Queries that need current parts filter on current = true. Queries that need revision history traverse the supersession chain backward. This dual-access pattern supports both everyday engineering work and historical investigation.

supersession_handling.cypher
// Create new revision and update supersession chain
MATCH (oldPart:Part {partNumber: $oldPartNumber, current: true})
CREATE (newPart:Part {
  partNumber: $newPartNumber,
  description: $description,
  revision: $newRevision,
  status: 'ACTIVE',
  current: true,
  createdDate: date()
})
CREATE (newPart)-[:SUPERSEDES {effectiveDate: date(), reason: $reason}]->(oldPart)
SET oldPart.current = false
// Copy specification references to the new revision. The CALL subquery
// keeps the row alive even when the old part has no references.
CALL {
  WITH oldPart, newPart
  MATCH (oldPart)-[r:REFERENCES]->(s:Specification)
  CREATE (newPart)-[:REFERENCES {criticality: r.criticality, mandatory: r.mandatory}]->(s)
}
RETURN newPart.partNumber AS newPart, oldPart.partNumber AS superseded

Version control for the graph itself presents unique challenges. Neo4j doesn’t have native branching or rollback like Git. For critical changes, export the affected subgraph to Cypher scripts before modifying, providing a restoration path if the update goes wrong. The apoc.export.cypher.query procedure exports query results as executable Cypher statements. For development work, use separate database instances rather than trying to branch within a single instance—AuraDB makes spinning up development databases straightforward.
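
For example, snapshotting a specification's neighborhood before a risky change might look like this (APOC must be installed, with apoc.export.file.enabled=true set in the server config):

backup_subgraph.cypher
// Export a subgraph as executable Cypher for later restoration
CALL apoc.export.cypher.query(
  "MATCH (s:Specification {specId: 'SPEC-4521'})<-[r:REFERENCES]-(p:Part) RETURN s, r, p",
  "backup-spec-4521.cypher",
  {format: 'cypher-shell'}
)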

Monitor graph health through scheduled queries that surface data quality issues before they affect users. Track node counts by type, relationship counts by type, and orphan percentages over time. Sudden drops in part counts indicate sync failures. Growing orphan percentages suggest that assembly relationships aren’t updating correctly when parts move between assemblies. Build dashboards using Grafana or your organization’s monitoring tools that surface these metrics to the team maintaining the integration.
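
Both metrics are cheap to compute with plain Cypher; a sketch of the two scheduled checks:

graph_health.cypher
// Node counts by label - sudden drops suggest a sync failure
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS nodeCount
ORDER BY nodeCount DESC;

// Orphan rate among current parts (OPTIONAL MATCH returns a row even
// when there are zero orphans)
MATCH (p:Part {current: true})
WITH count(p) AS total
OPTIONAL MATCH (o:Part {current: true})
WHERE NOT EXISTS { (:Assembly)-[:CONTAINS]->(o) }
RETURN total, count(o) AS orphans,
       round(100.0 * count(o) / total, 2) AS orphanPct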

Query performance monitoring reveals optimization opportunities and catches regressions early. Neo4j’s query log captures execution times and query plans. Review slow queries weekly, add indexes for properties that appear frequently in WHERE clauses, and refactor queries that cause full graph scans. A knowledge graph that takes 30 seconds to answer impact questions loses adoption to informal hallway conversations. Target sub-second response times for interactive queries and document acceptable latencies for batch operations.

Key Takeaways

  • Start your graph model with three node types (Part, Assembly, Specification) and expand only when queries demand it—premature modeling creates maintenance burden without delivering value
  • Use MERGE instead of CREATE in Cypher to make your import scripts idempotent and re-runnable, eliminating duplicate data from failed or repeated imports
  • Index properties you filter on frequently—part numbers, specification IDs, and the current boolean should be indexed from day one
  • Build impact analysis queries first; they deliver immediate value by answering “what breaks if we change this?” and justify the infrastructure investment to stakeholders
  • Store contextual attributes on relationships (quantities, positions, criticality) rather than duplicating them across nodes, keeping your model clean and queries intuitive
  • Preserve revision history through SUPERSEDES relationships and a current flag rather than deleting old revisions, maintaining traceability while keeping queries simple
  • Monitor graph health metrics continuously—orphan counts, node counts, and query latencies indicate data quality issues before they affect engineering decisions