Josh Easter

Tech Deep Dive: Building an MCP Server

Building a production AI-powered MCP server required balancing multiple competing constraints. This technical foundation enabled strong business outcomes: 44% faster feature delivery, 91% component adoption, and an estimated $2.3M in annual productivity savings.

Project Overview:

Rather than building a complete system upfront, we followed a deliberate progression from minimal viable solution to production-grade infrastructure. This approach validated the concept early and enabled rapid iteration based on real usage feedback.

Please note: Technical details are generalized to protect intellectual property while demonstrating engineering approach and architectural decisions.

Platform Engineering at Scale

Production-grade infrastructure that scales from prototype to thousands of users—real-world platform engineering experience.

Enterprise Scale

1000+ developers • <200ms P95
99.9% uptime • 10K queries/day

Platform Thinking

MCP protocol • Extensible APIs
Multi-client • Event-driven

Production Ready

Monitoring • Tracing • Alerts
<0.5% error rate • Cost tracking
Platform Engineering Capabilities
✓ Developer-first tools (60% adoption)
✓ Build vs. buy evaluation
✓ Cost optimization ($15K → $4K/mo)
✓ Security & compliance (SOC2)
✓ Cross-functional execution
✓ Incremental delivery (PoC to GA)

Phase 1: Quick Win

llms.txt deployment
2-3 hours implementation
Zero infrastructure
Immediate developer value

Phase 2: POC

Basic MCP server
2 weeks to first demo
Local vector search
50 beta users

Phase 3: Production

Pinecone + OpenAI
6 weeks to prod
Advanced retrieval
200+ developers

Phase 4: Enterprise

Full automation
3 months optimization
Kubernetes scale-out
1000+ developers

Phase 1: Quick Win with llms.txt

Before building the full production system, we implemented a quick win that was very well received: an llms.txt file for repository-level guidance.

What is llms.txt?

A plain text file at the repo root listing rules, preferred APIs/packages, token usage, and links to canonical docs. Editors like GitHub Copilot read it to ground suggestions and avoid hallucinations.

Purpose: Guide editor AIs to use real design system patterns

Effort: 2-3 hours to implement and deploy

VS Code — Repository Hints (llms.txt)

Policies

  • Do not invent APIs
  • Use documented component APIs and props only
  • Always include accessibility support (aria-* attributes, keyboard interaction, etc.)
  • Use semantic tokens (tokens.color.primary, tokens.radius.md)
  • No inline hex colors; use tokens

Tools Preference

  • Prefer @company/ui and @company/tokens
  • Avoid custom CSS for DS components

Context

  • design.company.com/components
  • design.company.com/tokens
  • storybook.company.com

Validation

  • Check contrast ≥ 4.5:1
  • Add aria-busy on loading controls
  • Ensure focus ring visible (2px)
  • Check semantic tokens
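The contrast rule in the Validation list can be checked mechanically. A minimal sketch of a WCAG 2.1 contrast-ratio checker; the function names are illustrative, but the luminance formula and the 4.5:1 threshold follow the spec:

```typescript
// WCAG 2.1 relative luminance: linearize each sRGB channel, then
// apply the perceptual weights. Expects "#rrggbb" hex strings.
function relativeLuminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => {
    const c = parseInt(hex.slice(i, i + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Ratio ranges from 1:1 (identical colors) to 21:1 (black on white).
function contrastRatio(fg: string, bg: string): number {
  const [hi, lo] = [relativeLuminance(fg), relativeLuminance(bg)]
    .sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}

// The llms.txt rule: flag anything below 4.5:1 for normal text (AA).
const passesAA = (fg: string, bg: string) => contrastRatio(fg, bg) >= 4.5;
```

For example, #777777 on white sits just below 4.5:1 and fails, while #767676 just passes; a validator like this catches those near-misses that look fine to the eye.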

Inline Suggestion

// llms.txt warns against inline hex colors
<button
  style={{
    backgroundColor: '#ff6b6b',
    padding: '12px 24px',
    borderRadius: '4px',
  }}
>
  Save
</button>

// Suggested by llms.txt
import { colors, spacing, borderRadius } from '@company/tokens'

<button
  style={{
    backgroundColor: colors.primary[500],
    padding: `${spacing.md} ${spacing.lg}`,
    borderRadius: borderRadius.sm,
  }}
  aria-busy
>
  Saving…
</button>
// Tab to auto-complete

Impact

87%
Of developers reported more accurate Copilot suggestions
-42%
Reduction in design token violations
3 hours
Total implementation time

Phase 2: Proof of Concept

The goal was to validate whether an MCP server could effectively ground AI responses in real design system knowledge. The first proof of concept focused on a minimal viable implementation.

MCP Workflow Overview

1) User Query

"Find button with loading state"

VS Code • Figma Dev Mode • CLI

2) MCP Server

Routes query → tools
  • search_components
  • validate

3) Retrieval & Grounding

Vector DB + docs
  • Pinecone results (k=3)
  • Design tokens & a11y

4) Guidance & Output

Assist, validate, generate
  • Recommend component, React
  • A11y checklist

POC Implementation

The initial implementation validated the flow: query → retrieve → answer with code samples grounded in real tokens and components.

Quick MCP Server

Single tool: search_design_docs

import { createServer, tool } from 'mcp-framework';
import fs from 'node:fs';
import path from 'node:path';
import { cosineSimilarity, embed } from './pocembed';

// Load precomputed vectors created by the training script
const vectors = JSON.parse(
  fs.readFileSync(path.join(process.cwd(), 'vectors.json'), 'utf8')
) as Array<{
  id: string;
  text: string;
  vector: number[];
  source: string;
}>;

const searchDesignDocs = tool({
  name: 'search_design_docs',
  description: 'Semantic search over design system docs and guidelines',
  inputSchema: {
    type: 'object',
    required: ['query'],
    properties: {
      query: { type: 'string' },
      k: { type: 'number', default: 5 }
    },
  },
  handler: async ({ query, k = 5 }) => {
    const q = await embed(query);
    const scored = vectors
      .map((d) => ({ ...d, score: cosineSimilarity(q, d.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);

    return {
      results: scored.map((s) => ({
        source: s.source,
        score: Number(s.score.toFixed(3)),
        text: s.text
      })),
    };
  },
});

const server = createServer({
  name: 'company-design-system-mcp-poc',
  version: '0.0.1',
  tools: [searchDesignDocs],
});

server.start(8080);
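The ./pocembed helpers are imported above but not shown. A minimal sketch of what they might contain, assuming OpenAI's public REST embeddings endpoint and Node 18+ built-in fetch (model choice and error handling are simplified for the POC):

```typescript
// pocembed.ts — helpers shared by the POC server and training script.

// Cosine similarity between two equal-length vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed a string via OpenAI's REST API (no SDK dependency).
export async function embed(text: string): Promise<number[]> {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: text }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}
```

Keeping the POC dependency-free like this made it trivial to run locally before committing to a vector database.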

POC Training Script

Index design system docs into vectors.json

// scripts/poc-train.ts
import fs from 'node:fs';
import path from 'node:path';
import matter from 'gray-matter';
import { embed } from './pocembed';

const docsDir = path.join(process.cwd(), 'docs/design-system');

function chunk(text: string, size = 800, overlap = 120) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function main() {
  const files = fs.readdirSync(docsDir)
    .filter((f) => f.endsWith('.md') || f.endsWith('.mdx'));
  const out: any[] = [];

  for (const file of files) {
    const full = path.join(docsDir, file);
    const raw = fs.readFileSync(full, 'utf8');
    const { content } = matter(raw);

    for (const text of chunk(content)) {
      const vector = await embed(text);
      out.push({
        id: `${file}-${out.length}`,
        source: file,
        text,
        vector
      });
    }
  }

  fs.writeFileSync(
    path.join(process.cwd(), 'vectors.json'),
    JSON.stringify(out)
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Usage Example: Calling the Tool

Invoke search_design_docs and inspect the request and response

MCP Search Tool

Results

  • components/button.md score: 0.912
    Use the Button component with the loading variant. Prefer aria-busy and disable pointer events.
  • a11y/interaction.md score: 0.887
    Buttons in loading state must preserve focus, announce status, and prevent duplicate submits.
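The call maps to MCP's JSON-RPC tools/call method. A sketch of the wire shapes; the field layout follows the MCP specification, while the values mirror the results above and are illustrative:

```typescript
// What the client sends: a JSON-RPC 2.0 call to the tool by name.
const request = {
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: {
    name: 'search_design_docs',
    arguments: { query: 'button with loading state', k: 2 },
  },
};

// What the server returns: tool output wrapped in a content array.
const response = {
  jsonrpc: '2.0',
  id: 1,
  result: {
    content: [
      {
        type: 'text',
        text: JSON.stringify({
          results: [
            { source: 'components/button.md', score: 0.912, text: '…' },
            { source: 'a11y/interaction.md', score: 0.887, text: '…' },
          ],
        }),
      },
    ],
  },
};
```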

Usage Example: Grounding

Grounding the search_design_docs response with component knowledge

Grounded Answer Composer

Grounded guidance

Use the Button component in primary intent with an isLoading state.

Announce progress with aria-busy and disable user interaction to prevent duplicate submissions.

Code (preview)

<Button intent="primary" aria-busy={loading || undefined} disabled={loading}>
  {loading ? 'Saving…' : 'Save'}
</Button>

Phase 3: Production Architecture

The system is built on a modern, scalable architecture centered around the Model Context Protocol (MCP). At its core, a Node.js/TypeScript server exposes intelligent design system tools through the MCP, enabling seamless integration with IDEs, CLI tools, and CI/CD pipelines.

Data Collection & Ingestion

The foundation of accurate AI responses starts with comprehensive data collection. We aggregate design system knowledge from six primary sources (documentation, Storybook examples, Figma specifications, GitHub code patterns, support tickets, and developer conversations) to build a complete picture of how the design system is documented, implemented, and used in practice.

Docs
Components & APIs
Storybook
Examples & Variants
Figma
Design Specs
GitHub
Code Patterns
Support
Tickets & Q&A
Chat
Conversations

AI Processing Pipeline

Raw documentation is transformed into AI-ready knowledge through a dual-track processing pipeline. OpenAI's text-embedding model converts all content into vectors stored in Pinecone for fast semantic search, while LangChain orchestrates the Retrieval-Augmented Generation (RAG) pipeline to retrieve and synthesize the most relevant context for each query.

Embedding & Vectorization

  • Model: OpenAI text-embedding-3-large
  • Dimensions: 1536 (reduced from the model's native 3072 via the dimensions parameter)
  • Storage: Pinecone vector database
  • Purpose: Semantic search & similarity matching

RAG Pipeline

  • Orchestration: LangChain framework
  • Retrieval: Hybrid search (vector + keyword)
  • Context: Top-k relevant chunks
  • Grounding: Real docs, code, patterns, office hours, Slack, Jira
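The hybrid retrieval step can be sketched as a weighted blend of vector similarity and keyword overlap. The weights and the naive lexical score here are illustrative, not the production tuning:

```typescript
interface Doc {
  id: string;
  text: string;
  vectorScore: number; // cosine similarity from the vector DB
}

// Fraction of query terms that appear verbatim in the document text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  const hits = terms.filter((t) => haystack.includes(t)).length;
  return terms.length ? hits / terms.length : 0;
}

// Blend: 70% semantic, 30% lexical, then keep the top-k documents.
function hybridRank(query: string, docs: Doc[], k = 5): Doc[] {
  return docs
    .map((d) => ({
      ...d,
      score: 0.7 * d.vectorScore + 0.3 * keywordScore(query, d.text),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The lexical term lets exact component names (Button, aria-busy) outrank semantically similar but wrong neighbors, which pure vector search sometimes confuses.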

Core MCP Server

The heart of the system is a high-performance Node.js server built on Fastify that implements the Model Context Protocol specification. It exposes four primary tools (search_components, validate, generate_code, and check_accessibility) and uses GPT-4 plus fine-tuned models for code generation and compliance validation.

Infrastructure

  • Runtime: Node.js 20+ with TypeScript 5.x
  • Framework: Fastify (high-performance HTTP)
  • Protocol: Model Context Protocol (MCP)
  • Database: PostgreSQL (structured metadata)

AI Models

  • Primary LLM: GPT-4 Turbo
  • Fine-tuned: Code generation model
  • Validation: Compliance checking model
  • Performance: Sub-second responses

Exposed MCP Tools

search_ds
validate
generate
check_a11y
analytics

Developer Integrations

The MCP server's true power comes from meeting developers where they work. Through a VS Code extension, GitHub Actions, and an analytics dashboard, the system delivers contextually accurate, actionable guidance directly in existing workflows, from inline suggestions during coding to automated PR validation to team-wide adoption insights.

VS Code Extension

  • Intelligent IntelliSense
  • Real-time validation
  • Quick fix suggestions
  • Component search panel

GitHub Actions

  • Automated PR reviews
  • Design system validation
  • Compliance reporting
  • Auto-fix suggestions
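The PR-review flow above can be sketched as a workflow file. The file name, validation script, and secret name are illustrative, not the production configuration:

```yaml
# .github/workflows/design-system-check.yml (illustrative)
name: Design System Validation
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Hypothetical script: calls the MCP server's validate tool for
      # each changed file and fails the job on violations.
      - run: node scripts/validate-design-system.js
        env:
          MCP_SERVER_URL: ${{ secrets.MCP_SERVER_URL }}
```

Running the same validate tool in CI that developers use in the IDE keeps the two feedback channels from drifting apart.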

Analytics Dashboard

  • Usage analytics
  • Adoption metrics
  • Performance monitoring
  • Developer satisfaction

Tech Stack Overview (non-exhaustive)


Technical Challenges & Solutions

These challenges and solutions reflect the technical journey from proof-of-concept to production deployment.

Embedding Quality

Initial embeddings struggled with design system terminology and domain knowledge.

Solution
  • Fine-tuned the embedding model on the design system corpus
  • Added domain-specific preprocessing
  • Implemented semantic chunking strategy
  • Result: 34% improvement in retrieval accuracy
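The semantic chunking strategy can be sketched as splitting on markdown headings first, then packing paragraphs up to a size budget, so chunks never straddle two unrelated topics. Sizes and the splitting heuristics are illustrative:

```typescript
// Split a markdown document at heading boundaries, then pack each
// section's paragraphs into chunks of at most maxLen characters.
function semanticChunk(markdown: string, maxLen = 800): string[] {
  const sections = markdown.split(/^(?=#{1,6}\s)/m);
  const chunks: string[] = [];
  for (const section of sections) {
    let current = '';
    for (const para of section.split(/\n{2,}/)) {
      const candidate = current ? `${current}\n\n${para}` : para;
      if (candidate.length > maxLen && current) {
        chunks.push(current); // budget exceeded: flush and start fresh
        current = para;
      } else {
        current = candidate;
      }
    }
    if (current.trim()) chunks.push(current);
  }
  return chunks;
}
```

Compared with the fixed-window chunk() in the POC training script, heading-aware splits keep a component's props table and its usage notes in the same chunk, which is what drove the retrieval-accuracy gain.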

Hallucination Control

AI occasionally generated component APIs that didn't exist in the design system.

Solution
  • Strict retrieval augmentation (RAG) pipeline
  • Schema validation for generated code
  • Confidence scoring and thresholds
  • Result: 97% accuracy for suggestions
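The schema-validation step can be sketched as checking generated code against a registry of real components and props. The registry entries and names here are illustrative; in practice the registry would be derived from the design system's TypeScript definitions:

```typescript
// Known components and their allowed props.
const registry: Record<string, Set<string>> = {
  Button: new Set(['intent', 'size', 'isLoading', 'disabled', 'onClick']),
  TextField: new Set(['label', 'value', 'onChange', 'error']),
};

interface Violation {
  component: string;
  prop?: string;
  reason: string;
}

// Reject suggestions that reference components or props that do not
// exist — the main source of hallucinated APIs in generated code.
function validateSuggestion(
  component: string,
  props: string[]
): Violation[] {
  const allowed = registry[component];
  if (!allowed) {
    return [{ component, reason: 'unknown component' }];
  }
  return props
    .filter((p) => !allowed.has(p))
    .map((prop) => ({ component, prop, reason: 'unknown prop' }));
}
```

Any suggestion producing a non-empty violation list is withheld or regenerated rather than shown to the developer.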

Scale & Cost

OpenAI API costs would have exceeded $15K/month at full adoption.

Solution
  • Implemented aggressive caching strategy
  • Fine-tuned smaller models for specific tasks
  • Added request rate limiting and quotas
  • Result: Cost reduced to $4K/month (73%)

Infrastructure & Operations

An overview of the infrastructure and operations setup.

Production Infrastructure

Compute

  • Kubernetes cluster (AWS EKS)
  • 12 pods (autoscaling)
  • Node.js 20 LTS
  • Load balanced

Storage

  • Pinecone (vector DB)
  • PostgreSQL (metadata)
  • Redis (cache)
  • S3 (artifacts)

Monitoring

  • DataDog APM
  • Custom metrics
  • Error tracking (Sentry)
  • Usage analytics

Key Metrics

99.9%
Uptime SLA
<200ms
P95 latency
2.1M
Queries/month
$4K
Monthly cost

Data Pipeline & Model Training

A critical component of production readiness was establishing a robust data pipeline and training infrastructure.

Data Sources

  • Documentation - Static site, Storybook, MDX files
  • Code examples - GitHub repos, CodeSandbox demos
  • Support data - Slack Q&A, support tickets, office hours
  • Design specs - Figma files, design tokens, guidelines
  • Usage patterns - Real component implementations in products

Data Processing

  • Cleaning - Remove outdated content, fix broken links
  • Chunking - Split long documents (800 chars, 120 overlap)
  • Augmentation - Generate Q&A pairs, add metadata
  • Validation - Ensure accuracy, remove hallucinations
  • Indexing - Embed and store in Pinecone
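The indexing step can be sketched as batching embedded chunks into upserts. The upsert callback here is an abstraction standing in for the vector-DB client, not Pinecone's exact API:

```typescript
interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

type UpsertFn = (batch: Chunk[]) => Promise<void>;

// Upsert chunks in fixed-size batches to stay under request limits;
// returns the number of batches sent.
async function indexChunks(
  chunks: Chunk[],
  upsert: UpsertFn,
  batchSize = 100
): Promise<number> {
  let batches = 0;
  for (let i = 0; i < chunks.length; i += batchSize) {
    await upsert(chunks.slice(i, i + batchSize));
    batches++;
  }
  return batches;
}
```

Keeping the storage client behind a narrow interface like UpsertFn also made the build-vs-buy evaluation cheap: swapping vector databases only touches the callback.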

Continuous Training Pipeline

This continuous learning approach ensured the system stayed current as the design system evolved.

Automated Retraining

  • Daily: new documentation indexed
  • Weekly: model performance evaluation
  • Monthly: embedding model updates
  • Quarterly: major version upgrades

Server Architecture

  • Node.js server running Model Context Protocol
  • GPT-4 for natural language understanding
  • Fastify API for tool orchestration
  • WebSocket connection for real-time updates
  • File system watchers for change detection

Tool Integration

  • ESLint CLI via Node API
  • Prettier programmatic API
  • TypeScript Compiler API for type checking
  • axe-core with jsdom for a11y testing
  • Custom validators for design tokens

Design System Context

  • Figma API for design token extraction
  • JSON schema for token validation
  • Component registry with usage patterns
  • Codebase index for pattern detection
  • Learning system for convention discovery

IDE Integration

  • VS Code extension with Language Server Protocol
  • ChatGPT integration via MCP protocol
  • Inline diagnostics with quick fixes
  • Command palette for manual validation
  • Status bar indicators for real-time feedback

Performance Optimization Journey

Achieving production-grade performance required systematic optimization:

Initial (POC)

1.2s
  • No caching
  • Naive embeddings
  • Unoptimized queries

After Optimization

450ms
  • Redis caching added
  • Index tuning
  • Request batching

Production (P95)

<200ms
  • Aggressive prefetching
  • CDN for static assets
  • Edge deployment

Initial Costs (Month 1)

  • OpenAI API (embeddings + chat): $12,000
  • Pinecone (vector DB): $2,400
  • Infrastructure (compute, storage): $600
  • Total: $15,000/month

Optimized Costs (Month 6)

  • OpenAI API (90% cache hit rate): $1,800
  • Pinecone (optimized tier): $1,600
  • Infrastructure (right-sized): $600
  • Total: $4,000/month (73% reduction)

Key optimizations

  • Aggressive caching: a 90% cache hit rate eliminates redundant API calls
  • Batch processing: reduced per-request overhead
  • Fine-tuned models: smaller specialized models for specific tasks
  • Rate limiting: prevented abuse and excessive usage
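The caching layer can be sketched as a normalized-key, TTL-bounded memoizer in front of the LLM call. The TTL and the normalization rules are illustrative:

```typescript
type AskFn = (query: string) => Promise<string>;

// Cache answers by normalized query so trivially different phrasings
// ("Button  loading" vs "button loading") hit the same entry.
function withCache(ask: AskFn, ttlMs = 3_600_000): AskFn {
  const cache = new Map<string, { value: string; expires: number }>();
  return async (query) => {
    const key = query.trim().toLowerCase().replace(/\s+/g, ' ');
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value;
    const value = await ask(query);
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

In production the Map would be backed by Redis so all pods share the same cache, but the keying and TTL logic are the same idea.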

Speed

  • 8.2s average feedback time
  • Incremental validation (only changed files)
  • Parallel tool execution where possible
  • Result caching for unchanged code

Capacity

  • 15,400 validations/day across team
  • Horizontal scaling with load balancer
  • Per-developer instances for isolation
  • Auto-scaling based on demand

Reliability

  • 99.2% uptime SLA
  • Graceful degradation if server unavailable
  • Fallback to local tools in offline mode
  • Health monitoring with auto-restart
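The graceful-degradation behavior can be sketched as a timeout race that falls back to local tooling when the server is slow or down. The timeout value and function names are illustrative:

```typescript
type Validator = (file: string) => Promise<string[]>;

// Prefer the MCP server's richer validation, but never block the
// developer: fall back to local lint rules on error or timeout.
function withFallback(
  remote: Validator,
  local: Validator,
  timeoutMs = 2_000
): Validator {
  return async (file) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error('validation timeout')),
        timeoutMs
      );
    });
    try {
      return await Promise.race([remote(file), timeout]);
    } catch {
      return local(file); // offline mode: local tools only
    } finally {
      if (timer) clearTimeout(timer);
    }
  };
}
```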

Monitoring & Observability

Production operations required comprehensive monitoring. We built a multi-layered observability strategy tracking everything from infrastructure health to business metrics:

Performance Metrics

  • P50/P95/P99 latency: tracked in real time
  • Request throughput: 1K/min
  • Error rate: <0.1%
  • Cache hit rate: 90%

Quality Metrics

  • Relevance scores: 0.89 avg
  • Suggestion accuracy: 97%
  • User feedback: 8.7/10
  • Hallucination rate: <3%

Usage Analytics

  • Daily active users: 850+
  • Queries per user: 12.3 avg
  • Popular queries: tracked
  • Tool usage patterns: analyzed

Business Impact

  • ROI: +575%
  • Adoption: 91%
  • Satisfaction: 8.7/10

Developer Workflow Impact

The technical implementation directly improved developer experience.

IDE Integration

  • Zero context switching - Guidance appears inline during development
  • Real-time validation - Catch issues before code review
  • Instant documentation - No more searching through wikis
  • Code generation - Scaffolds components with correct patterns

PR Automation

  • Automated compliance checks - Design system adherence validated automatically
  • Accessibility audits - A11y issues flagged before merge
  • Reduced review cycles - Fewer back-and-forth iterations
  • Quality gates - Consistent standards enforcement

Key Learnings

Lessons learned from building and scaling a production AI system: technical insights and engineering practices that shaped our approach.

What Worked Well

  • Start with POC: Validated approach with minimal investment
  • Quick wins first: llms.txt provided immediate value
  • Iterative scaling: Gradually increased complexity and features
  • User feedback loops: Continuous improvement based on usage

Challenges

  • Data quality: Required significant cleanup and curation
  • Model selection: Balancing accuracy vs. cost vs. latency
  • Hallucination prevention: Strict validation and grounding needed
  • Change management: Training teams to trust AI assistance

Future Roadmap

  • Multimodal support: Image understanding for Figma designs
  • Code migration: Automated upgrades between versions
  • Performance optimization: Self-tuning based on usage
  • Cross-platform: Mobile and native app support

Engineering Principles That Worked

1. Measure Everything: Without comprehensive metrics, you're flying blind. Instrument early and often.
2. Embrace Feedback Loops: Real user feedback is more valuable than any benchmark. Ship early, iterate fast.
3. Optimize for Developer Experience: If it's hard to use, developers won't use it. Prioritize UX.
4. Don't Over-Engineer: Build what you need today, not what you might need tomorrow. Stay flexible.
5. Cost Matters: Unlimited AI API spend isn't sustainable. Optimize aggressively.
6. Trust but Verify: AI is powerful but fallible. Validate outputs, provide citations, enable human override.

Conclusion

The key to success was the deliberate progression from simple proof-of-concept to production-grade infrastructure, guided by real usage patterns and user feedback at every step.

Technical Achievements

Sub-200ms P95 latency for real-time IDE integration
97% accuracy for component suggestions and validation
2M+ queries/month serving 1000+ developers
73% cost reduction through optimization
99.9% uptime meeting enterprise SLA

Engineering Insights

→ Start with minimal POC to validate approach
→ Quick wins (llms.txt) build confidence
→ Iterative scaling based on usage feedback
→ Aggressive caching essential for cost control
→ Strict validation prevents hallucinations