Josh Easter

Tech Deep Dive: Building an MCP Server

Building a production AI-powered MCP server required balancing multiple competing constraints. This technical foundation enabled strong business outcomes: 44% faster feature delivery, 91% component adoption, and an estimated $2.3M in annual productivity savings.

Project Overview:

Rather than building a complete system upfront, we followed a deliberate progression from minimal viable solution to production-grade infrastructure. This approach validated the concept early and enabled rapid iteration based on real usage feedback.

Please note: Technical details are generalized to protect intellectual property while demonstrating engineering approach and architectural decisions.

Platform Engineering at Scale

Production-grade infrastructure that scales from prototype to thousands of users—real-world platform engineering experience.

Enterprise Scale

1000+ developers • <200ms P95
99.9% uptime • 10K queries/day

Platform Thinking

MCP protocol • Extensible APIs
Multi-client • Event-driven

Production Ready

Monitoring • Tracing • Alerts
<0.5% error rate • Cost tracking
Platform Engineering Capabilities
✓ Developer-first tools (60% adoption)
✓ Build vs. buy evaluation
✓ Cost optimization ($15K → $4K/mo)
✓ Security & compliance (SOC2)
✓ Cross-functional execution
✓ Incremental delivery (PoC to GA)

Phase 1: Quick Win

llms.txt deployment
2-3 hours implementation
Zero infrastructure
Immediate developer value

Phase 2: POC

Basic MCP server
2 weeks to first demo
Local vector search
50 beta users

Phase 3: Production

Pinecone + OpenAI
6 weeks to prod
Advanced retrieval
200+ developers

Phase 4: Enterprise

Full automation
3 months optimization
Kubernetes scale-out
1000+ developers

Phase 1: Quick Win with llms.txt

Before building the full production system, we implemented a quick win that was very well received: an llms.txt file for repository-level guidance.

What is llms.txt?

A plain text file at the repo root listing rules, preferred APIs/packages, token usage, and links to canonical docs. Editors like GitHub Copilot read it to ground suggestions and avoid hallucinations.

Purpose: Guide editor AIs to use real design system patterns

Effort: 2-3 hours to implement and deploy

VS Code — Repository Hints (llms.txt)

Policies

  • Do not invent APIs
  • Use documented component APIs and props only
  • Always include accessibility support (aria-* attributes, keyboard interaction, etc.)
  • Use semantic tokens (tokens.color.primary, tokens.radius.md)
  • No inline hex colors; use tokens

Tools Preference

  • Prefer @company/ui and @company/tokens
  • Avoid custom CSS for DS components

Context

  • design.company.com/components
  • design.company.com/tokens
  • storybook.company.com

Validation

  • Check contrast ≥ 4.5:1
  • Add aria-busy on loading controls
  • Ensure focus ring visible (2px)
  • Check semantic tokens
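The contrast rule in the Validation list can be checked mechanically. A minimal sketch of a WCAG 2.1 contrast-ratio checker; the function names are illustrative, but the luminance formula and the 4.5:1 threshold follow the spec:

```typescript
// WCAG 2.1 relative luminance: linearize each sRGB channel, then
// apply the perceptual weights. Expects "#rrggbb" hex strings.
function relativeLuminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => {
    const c = parseInt(hex.slice(i, i + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Ratio ranges from 1:1 (identical colors) to 21:1 (black on white).
function contrastRatio(fg: string, bg: string): number {
  const [hi, lo] = [relativeLuminance(fg), relativeLuminance(bg)]
    .sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}

// The llms.txt rule: flag anything below 4.5:1 for normal text (AA).
const passesAA = (fg: string, bg: string) => contrastRatio(fg, bg) >= 4.5;
```

For example, #777777 on white sits just below 4.5:1 and fails, while #767676 just passes; a validator like this catches those near-misses that look fine to the eye.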

Inline Suggestion

// llms.txt warns against inline hex colors
<button
  style={{
    backgroundColor: '#ff6b6b',
    padding: '12px 24px',
    borderRadius: '4px',
  }}
>
  Save
</button>

// Suggested by llms.txt
import { colors, spacing, borderRadius } from '@company/tokens'

<button
  style={{
    backgroundColor: colors.primary[500],
    padding: `${spacing.md} ${spacing.lg}`,
    borderRadius: borderRadius.sm,
  }}
  aria-busy
>
  Saving…
</button>
// Tab to auto-complete

Impact

87%
Of developers reported more accurate Copilot suggestions
-42%
Reduction in design token violations
3 hours
Total implementation time

Phase 2: Proof of Concept

The goal was to validate whether an MCP server could effectively ground AI responses in real design system knowledge. The first proof of concept focused on a minimal viable implementation.

MCP Workflow Overview

1) User Query

"Find button with loading state"

VS Code • Figma Dev Mode • CLI

2) MCP Server

Routes query → tools
  • search_components
  • validate

3) Retrieval & Grounding

Vector DB + docs
  • Pinecone results (k=3)
  • Design tokens & a11y

4) Guidance & Output

Assist, validate, generate
  • Recommend component, React
  • A11y checklist

POC Implementation

The initial implementation validated the flow: query → retrieve → answer with code samples grounded in real tokens and components.

Quick MCP Server

Single tool: search_design_docs

import { createServer, tool } from 'mcp-framework';
import fs from 'node:fs';
import path from 'node:path';
import { cosineSimilarity, embed } from './pocembed';

// Load precomputed vectors created by the training script
const vectors = JSON.parse(
  fs.readFileSync(path.join(process.cwd(), 'vectors.json'), 'utf8')
) as Array<{
  id: string;
  text: string;
  vector: number[];
  source: string;
}>;

const searchDesignDocs = tool({
  name: 'search_design_docs',
  description: 'Semantic search over design system docs and guidelines',
  inputSchema: {
    type: 'object',
    required: ['query'],
    properties: {
      query: { type: 'string' },
      k: { type: 'number', default: 5 }
    },
  },
  handler: async ({ query, k = 5 }) => {
    const q = await embed(query);
    const scored = vectors
      .map((d) => ({ ...d, score: cosineSimilarity(q, d.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);

    return {
      results: scored.map((s) => ({
        source: s.source,
        score: Number(s.score.toFixed(3)),
        text: s.text
      })),
    };
  },
});

const server = createServer({
  name: 'company-design-system-mcp-poc',
  version: '0.0.1',
  tools: [searchDesignDocs],
});

server.start(8080);
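The ./pocembed helpers are imported above but not shown. A minimal sketch of what they might contain, assuming OpenAI's public REST embeddings endpoint and Node 18+ built-in fetch (model choice and error handling are simplified for the POC):

```typescript
// pocembed.ts — helpers shared by the POC server and training script.

// Cosine similarity between two equal-length vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed a string via OpenAI's REST API (no SDK dependency).
export async function embed(text: string): Promise<number[]> {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: text }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}
```

Keeping the POC dependency-free like this made it trivial to run locally before committing to a vector database.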

POC Training Script

Index design system docs into vectors.json

// scripts/poc-train.ts
import fs from 'node:fs';
import path from 'node:path';
import matter from 'gray-matter';
import { embed } from './pocembed';

const docsDir = path.join(process.cwd(), 'docs/design-system');

function chunk(text: string, size = 800, overlap = 120) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function main() {
  const files = fs.readdirSync(docsDir)
    .filter((f) => f.endsWith('.md') || f.endsWith('.mdx'));
  const out: any[] = [];

  for (const file of files) {
    const full = path.join(docsDir, file);
    const raw = fs.readFileSync(full, 'utf8');
    const { content } = matter(raw);

    for (const text of chunk(content)) {
      const vector = await embed(text);
      out.push({
        id: `${file}-${out.length}`,
        source: file,
        text,
        vector
      });
    }
  }

  fs.writeFileSync(
    path.join(process.cwd(), 'vectors.json'),
    JSON.stringify(out)
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Usage Example: Calling the Tool

Invoke search_design_docs and inspect the request and response

MCP Search Tool

Results

  • components/button.md score: 0.912
    Use the Button component with the loading variant. Prefer aria-busy and disable pointer events.
  • a11y/interaction.md score: 0.887
    Buttons in loading state must preserve focus, announce status, and prevent duplicate submits.
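The call maps to MCP's JSON-RPC tools/call method. A sketch of the wire shapes; the field layout follows the MCP specification, while the values mirror the results above and are illustrative:

```typescript
// What the client sends: a JSON-RPC 2.0 call to the tool by name.
const request = {
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: {
    name: 'search_design_docs',
    arguments: { query: 'button with loading state', k: 2 },
  },
};

// What the server returns: tool output wrapped in a content array.
const response = {
  jsonrpc: '2.0',
  id: 1,
  result: {
    content: [
      {
        type: 'text',
        text: JSON.stringify({
          results: [
            { source: 'components/button.md', score: 0.912, text: '…' },
            { source: 'a11y/interaction.md', score: 0.887, text: '…' },
          ],
        }),
      },
    ],
  },
};
```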

Usage Example: Grounding

Grounding the search_design_docs response with component knowledge

Grounded Answer Composer

Grounded guidance

Use the Button component in primary intent with an isLoading state.

Announce progress with aria-busy and disable user interaction to prevent duplicate submissions.

Code (preview)

<Button intent="primary" aria-busy={loading || undefined} disabled={loading}>
  {loading ? 'Saving…' : 'Save'}
</Button>

Phase 3: Production Architecture

The system is built on a modern, scalable architecture centered around the Model Context Protocol (MCP). At its core, a Node.js/TypeScript server exposes intelligent design system tools through the MCP, enabling seamless integration with IDEs, CLI tools, and CI/CD pipelines.

Data Collection & Ingestion

The foundation of accurate AI responses starts with comprehensive data collection. We aggregate design system knowledge from six primary sources (documentation, Storybook examples, Figma specifications, GitHub code patterns, support tickets, and developer conversations) to build a complete picture of how the design system is documented, implemented, and used in practice.

Docs
Components & APIs
Storybook
Examples & Variants
Figma
Design Specs
GitHub
Code Patterns
Support
Tickets & Q&A
Chat
Conversations

AI Processing Pipeline

Raw documentation is transformed into AI-ready knowledge through a dual-track processing pipeline. OpenAI's text-embedding model converts all content into vectors stored in Pinecone for fast semantic search, while LangChain orchestrates the Retrieval-Augmented Generation (RAG) pipeline to retrieve and synthesize the most relevant context for each query.

Embedding & Vectorization

  • Model: OpenAI text-embedding-3-large
  • Dimensions: 1536 (reduced from the model's native 3072 via the dimensions parameter)
  • Storage: Pinecone vector database
  • Purpose: Semantic search & similarity matching

RAG Pipeline

  • Orchestration: LangChain framework
  • Retrieval: Hybrid search (vector + keyword)
  • Context: Top-k relevant chunks
  • Grounding: Real docs, code, patterns, office hours, Slack, Jira
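The hybrid retrieval step can be sketched as a weighted blend of vector similarity and keyword overlap. The weights and the naive lexical score here are illustrative, not the production tuning:

```typescript
interface Doc {
  id: string;
  text: string;
  vectorScore: number; // cosine similarity from the vector DB
}

// Fraction of query terms that appear verbatim in the document text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  const hits = terms.filter((t) => haystack.includes(t)).length;
  return terms.length ? hits / terms.length : 0;
}

// Blend: 70% semantic, 30% lexical, then keep the top-k documents.
function hybridRank(query: string, docs: Doc[], k = 5): Doc[] {
  return docs
    .map((d) => ({
      ...d,
      score: 0.7 * d.vectorScore + 0.3 * keywordScore(query, d.text),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The lexical term lets exact component names (Button, aria-busy) outrank semantically similar but wrong neighbors, which pure vector search sometimes confuses.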

Core MCP Server

The heart of the system is a high-performance Node.js server built on Fastify that implements the Model Context Protocol specification. It exposes four primary tools (search_components, validate, generate_code, and check_accessibility) and uses GPT-4 plus fine-tuned models for code generation and compliance validation.

Infrastructure

  • Runtime: Node.js 20+ with TypeScript 5.x
  • Framework: Fastify (high-performance HTTP)
  • Protocol: Model Context Protocol (MCP)
  • Database: PostgreSQL (structured metadata)

AI Models

  • Primary LLM: GPT-4 Turbo
  • Fine-tuned: Code generation model
  • Validation: Compliance checking model
  • Performance: Sub-second responses

Exposed MCP Tools

search_ds
validate
generate
check_a11y
analytics

Developer Integrations

The MCP server's true power comes from meeting developers where they work. Through a VS Code extension, GitHub Actions, and an analytics dashboard, the system delivers contextually accurate, actionable guidance directly in existing workflows, from inline suggestions during coding to automated PR validation to team-wide adoption insights.

VS Code Extension

  • Intelligent IntelliSense
  • Real-time validation
  • Quick fix suggestions
  • Component search panel

GitHub Actions

  • Automated PR reviews
  • Design system validation
  • Compliance reporting
  • Auto-fix suggestions
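The PR-review flow above can be sketched as a workflow file. The file name, validation script, and secret name are illustrative, not the production configuration:

```yaml
# .github/workflows/design-system-check.yml (illustrative)
name: Design System Validation
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Hypothetical script: calls the MCP server's validate tool for
      # each changed file and fails the job on violations.
      - run: node scripts/validate-design-system.js
        env:
          MCP_SERVER_URL: ${{ secrets.MCP_SERVER_URL }}
```

Running the same validate tool in CI that developers use in the IDE keeps the two feedback channels from drifting apart.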

Analytics Dashboard

  • Usage analytics
  • Adoption metrics
  • Performance monitoring
  • Developer satisfaction

Tech Stack Overview (non-exhaustive)


Technical Challenges & Solutions

These challenges and solutions reflect the technical journey from proof-of-concept to production deployment.

Embedding Quality

Initial embeddings struggled with design system terminology and domain knowledge.

Solution
  • Fine-tuned the embedding model on the design system corpus
  • Added domain-specific preprocessing
  • Implemented semantic chunking strategy
  • Result: 34% improvement in retrieval accuracy
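The semantic chunking strategy can be sketched as splitting on markdown headings first, then packing paragraphs up to a size budget, so chunks never straddle two unrelated topics. Sizes and the splitting heuristics are illustrative:

```typescript
// Split a markdown document at heading boundaries, then pack each
// section's paragraphs into chunks of at most maxLen characters.
function semanticChunk(markdown: string, maxLen = 800): string[] {
  const sections = markdown.split(/^(?=#{1,6}\s)/m);
  const chunks: string[] = [];
  for (const section of sections) {
    let current = '';
    for (const para of section.split(/\n{2,}/)) {
      const candidate = current ? `${current}\n\n${para}` : para;
      if (candidate.length > maxLen && current) {
        chunks.push(current); // budget exceeded: flush and start fresh
        current = para;
      } else {
        current = candidate;
      }
    }
    if (current.trim()) chunks.push(current);
  }
  return chunks;
}
```

Compared with the fixed-window chunk() in the POC training script, heading-aware splits keep a component's props table and its usage notes in the same chunk, which is what drove the retrieval-accuracy gain.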

Hallucination Control

AI occasionally generated component APIs that didn't exist in the design system.

Solution
  • Strict retrieval augmentation (RAG) pipeline
  • Schema validation for generated code
  • Confidence scoring and thresholds
  • Result: 97% accuracy for suggestions
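The schema-validation step can be sketched as checking generated code against a registry of real components and props. The registry entries and names here are illustrative; in practice the registry would be derived from the design system's TypeScript definitions:

```typescript
// Known components and their allowed props.
const registry: Record<string, Set<string>> = {
  Button: new Set(['intent', 'size', 'isLoading', 'disabled', 'onClick']),
  TextField: new Set(['label', 'value', 'onChange', 'error']),
};

interface Violation {
  component: string;
  prop?: string;
  reason: string;
}

// Reject suggestions that reference components or props that do not
// exist — the main source of hallucinated APIs in generated code.
function validateSuggestion(
  component: string,
  props: string[]
): Violation[] {
  const allowed = registry[component];
  if (!allowed) {
    return [{ component, reason: 'unknown component' }];
  }
  return props
    .filter((p) => !allowed.has(p))
    .map((prop) => ({ component, prop, reason: 'unknown prop' }));
}
```

Any suggestion producing a non-empty violation list is withheld or regenerated rather than shown to the developer.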

Scale & Cost

OpenAI API costs would have exceeded $15K/month at full adoption.

Solution
  • Implemented aggressive caching strategy
  • Fine-tuned smaller models for specific tasks
  • Added request rate limiting and quotas
  • Result: Cost reduced to $4K/month (73%)

Infrastructure & Operations

An overview of the infrastructure and operations setup.

Production Infrastructure

Compute

  • Kubernetes cluster (AWS EKS)
  • 12 pods (autoscaling)
  • Node.js 20 LTS
  • Load balanced

Storage

  • Pinecone (vector DB)
  • PostgreSQL (metadata)
  • Redis (cache)
  • S3 (artifacts)

Monitoring

  • DataDog APM
  • Custom metrics
  • Error tracking (Sentry)
  • Usage analytics

Key Metrics

99.9%
Uptime SLA
<200ms
P95 latency
2.1M
Queries/month
$4K
Monthly cost

Data Pipeline & Model Training

A critical component of production readiness was establishing a robust data pipeline and training infrastructure.

Data Sources

  • Documentation - Static site, Storybook, MDX files
  • Code examples - GitHub repos, CodeSandbox demos
  • Support data - Slack Q&A, support tickets, office hours
  • Design specs - Figma files, design tokens, guidelines
  • Usage patterns - Real component implementations in products

Data Processing

  • Cleaning - Remove outdated content, fix broken links
  • Chunking - Split long documents (800 chars, 120 overlap)
  • Augmentation - Generate Q&A pairs, add metadata
  • Validation - Ensure accuracy, remove hallucinations
  • Indexing - Embed and store in Pinecone
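The indexing step can be sketched as batching embedded chunks into upserts. The upsert callback here is an abstraction standing in for the vector-DB client, not Pinecone's exact API:

```typescript
interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

type UpsertFn = (batch: Chunk[]) => Promise<void>;

// Upsert chunks in fixed-size batches to stay under request limits;
// returns the number of batches sent.
async function indexChunks(
  chunks: Chunk[],
  upsert: UpsertFn,
  batchSize = 100
): Promise<number> {
  let batches = 0;
  for (let i = 0; i < chunks.length; i += batchSize) {
    await upsert(chunks.slice(i, i + batchSize));
    batches++;
  }
  return batches;
}
```

Keeping the storage client behind a narrow interface like UpsertFn also made the build-vs-buy evaluation cheap: swapping vector databases only touches the callback.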

Continuous Training Pipeline

This continuous learning approach ensured the system stayed current as the design system evolved.

Automated Retraining

  • Daily: new documentation indexed
  • Weekly: model performance evaluation
  • Monthly: embedding model updates
  • Quarterly: major version upgrades

Server Architecture

  • Node.js server running Model Context Protocol
  • GPT-4 for natural language understanding
  • Fastify API for tool orchestration
  • WebSocket connection for real-time updates
  • File system watchers for change detection

Tool Integration

  • ESLint CLI via Node API
  • Prettier programmatic API
  • TypeScript Compiler API for type checking
  • axe-core with jsdom for a11y testing
  • Custom validators for design tokens

Design System Context

  • Figma API for design token extraction
  • JSON schema for token validation
  • Component registry with usage patterns
  • Codebase index for pattern detection
  • Learning system for convention discovery

IDE Integration

  • VS Code extension with Language Server Protocol
  • ChatGPT integration via MCP protocol
  • Inline diagnostics with quick fixes
  • Command palette for manual validation
  • Status bar indicators for real-time feedback

Performance Optimization Journey

Achieving production-grade performance required systematic optimization:

Initial (POC)

1.2s
  • No caching
  • Naive embeddings
  • Unoptimized queries

After Optimization

450ms
  • Redis caching added
  • Index tuning
  • Request batching

Production (P95)

<200ms
  • Aggressive prefetching
  • CDN for static assets
  • Edge deployment

Initial Costs (Month 1)

  • OpenAI API (embeddings + chat): $12,000
  • Pinecone (vector DB): $2,400
  • Infrastructure (compute, storage): $600
  • Total: $15,000/month

Optimized Costs (Month 6)

  • OpenAI API (90% cache hit rate): $1,800
  • Pinecone (optimized tier): $1,600
  • Infrastructure (right-sized): $600
  • Total: $4,000/month (73% reduction)

Key optimizations

  • Aggressive caching: a 90% cache hit rate eliminates redundant API calls
  • Batch processing: reduced per-request overhead
  • Fine-tuned models: smaller specialized models for specific tasks
  • Rate limiting: prevented abuse and excessive usage
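The caching layer can be sketched as a normalized-key, TTL-bounded memoizer in front of the LLM call. The TTL and the normalization rules are illustrative:

```typescript
type AskFn = (query: string) => Promise<string>;

// Cache answers by normalized query so trivially different phrasings
// ("Button  loading" vs "button loading") hit the same entry.
function withCache(ask: AskFn, ttlMs = 3_600_000): AskFn {
  const cache = new Map<string, { value: string; expires: number }>();
  return async (query) => {
    const key = query.trim().toLowerCase().replace(/\s+/g, ' ');
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value;
    const value = await ask(query);
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

In production the Map would be backed by Redis so all pods share the same cache, but the keying and TTL logic are the same idea.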

Speed

  • 8.2s average feedback time
  • Incremental validation (only changed files)
  • Parallel tool execution where possible
  • Result caching for unchanged code

Capacity

  • 15,400 validations/day across team
  • Horizontal scaling with load balancer
  • Per-developer instances for isolation
  • Auto-scaling based on demand

Reliability

  • 99.2% uptime SLA
  • Graceful degradation if server unavailable
  • Fallback to local tools in offline mode
  • Health monitoring with auto-restart
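The graceful-degradation behavior can be sketched as a timeout race that falls back to local tooling when the server is slow or down. The timeout value and function names are illustrative:

```typescript
type Validator = (file: string) => Promise<string[]>;

// Prefer the MCP server's richer validation, but never block the
// developer: fall back to local lint rules on error or timeout.
function withFallback(
  remote: Validator,
  local: Validator,
  timeoutMs = 2_000
): Validator {
  return async (file) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error('validation timeout')),
        timeoutMs
      );
    });
    try {
      return await Promise.race([remote(file), timeout]);
    } catch {
      return local(file); // offline mode: local tools only
    } finally {
      if (timer) clearTimeout(timer);
    }
  };
}
```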

Monitoring & Observability

Production operations required comprehensive monitoring. We built a multi-layered observability strategy tracking everything from infrastructure health to business metrics:

Performance Metrics

  • P50/P95/P99 latency: tracked in real time
  • Request throughput: 1K/min
  • Error rate: <0.1%
  • Cache hit rate: 90%

Quality Metrics

  • Relevance scores: 0.89 avg
  • Suggestion accuracy: 97%
  • User feedback: 8.7/10
  • Hallucination rate: <3%

Usage Analytics

  • Daily active users: 850+
  • Queries per user: 12.3 avg
  • Popular queries: tracked
  • Tool usage patterns: analyzed

Business Impact

  • ROI: +575%
  • Adoption: 91%
  • Satisfaction: 8.7/10

Developer Workflow Impact

The technical implementation directly improved developer experience.

IDE Integration

  • Zero context switching - Guidance appears inline during development
  • Real-time validation - Catch issues before code review
  • Instant documentation - No more searching through wikis
  • Code generation - Scaffolds components with correct patterns

PR Automation

  • Automated compliance checks - Design system adherence validated automatically
  • Accessibility audits - A11y issues flagged before merge
  • Reduced review cycles - Fewer back-and-forth iterations
  • Quality gates - Consistent standards enforcement

Key Learnings

Lessons learned from building and scaling a production AI system: technical insights and engineering practices that shaped our approach.

What Worked Well

  • Start with POC: Validated approach with minimal investment
  • Quick wins first: llms.txt provided immediate value
  • Iterative scaling: Gradually increased complexity and features
  • User feedback loops: Continuous improvement based on usage

Challenges

  • Data quality: Required significant cleanup and curation
  • Model selection: Balancing accuracy vs. cost vs. latency
  • Hallucination prevention: Strict validation and grounding needed
  • Change management: Training teams to trust AI assistance

Future Roadmap

  • Multimodal support: Image understanding for Figma designs
  • Code migration: Automated upgrades between versions
  • Performance optimization: Self-tuning based on usage
  • Cross-platform: Mobile and native app support

Engineering Principles That Worked

1. Measure Everything: Without comprehensive metrics, you're flying blind. Instrument early and often.
2. Embrace Feedback Loops: Real user feedback is more valuable than any benchmark. Ship early, iterate fast.
3. Optimize for Developer Experience: If it's hard to use, developers won't use it. Prioritize UX.
4. Don't Over-Engineer: Build what you need today, not what you might need tomorrow. Stay flexible.
5. Cost Matters: Unlimited AI API spend isn't sustainable. Optimize aggressively.
6. Trust but Verify: AI is powerful but fallible. Validate outputs, provide citations, enable human override.

Conclusion

The key to success was the deliberate progression from simple proof-of-concept to production-grade infrastructure, guided by real usage patterns and user feedback at every step.

Technical Achievements

Sub-200ms P95 latency for real-time IDE integration
97% accuracy for component suggestions and validation
2M+ queries/month serving 1000+ developers
73% cost reduction through optimization
99.9% uptime meeting enterprise SLA

Engineering Insights

→ Start with minimal POC to validate approach
→ Quick wins (llms.txt) build confidence
→ Iterative scaling based on usage feedback
→ Aggressive caching essential for cost control
→ Strict validation prevents hallucinations