
From AI-Integrated Systems to AI Platform Architect

What It Actually Takes and What I Built to Find Out

After 20 years in tech, you'd think at some point you get to sit back and say, "I know this stuff."

And then the AI stack explodes overnight and makes half your architecture knowledge feel like ancient history. 😅

The thing about something new showing up is that you have a choice: ignore it, side-step it, wait for it to mature, or take a peek. Just a peek.

The problem with peeking is that it pulls you in. One question leads to another. One repo leads to six. Two weeks later, you have a production-grade agentic AI platform, a full evaluation suite, and a very strong opinion about chunking strategies.

Curiosity is dangerous like that. I highly recommend it.


First, Some Context

I'm not new to AI. Over the past decade, I've shipped production systems that use AI: facial recognition and RFID tracking for a student safety system deployed in 50+ schools, augmented reality with ML Kit that lets users try on jewellery through a mobile app, and predictive analytics pipelines for healthcare operations.

But there's a real difference between integrating AI as a component and architecting the AI layer itself.

  1. Using AI as a component – calling a vision API, embedding a model into a mobile app, wiring a prediction service into a pipeline. AI is one ingredient in a larger system. You're a chef who uses a powerful appliance.

  2. Architecting the AI layer itself – designing the orchestration runtime, the retrieval pipeline, the tool interfaces, the evaluation framework, the observability stack. You're the one who builds the appliance.

Most engineers have done #1. Production AI platforms need #2.

I had deep experience with the first. The modern LLM and agentic stack required the second – and I knew it. So I treated it the way I treat any new infrastructure layer I need to own: build from primitives first, understand the tradeoffs, then use the frameworks with intention.


The Rules I Set

Two weeks. One rule: no claiming, only demonstrating.

  • Build from primitives before using frameworks β€” understand what the abstractions hide

  • Every component production-grade β€” proper error handling, logging, CI/CD, Docker, deployment config

  • Evaluate with real metrics β€” not "it seemed to work in the demo."

  • Everything public on GitHub β€” no hiding behind "I can't share the code."

Two weeks. Six repos. One flagship platform.


What I Built and Why in That Order

I started at the foundation and worked up:

llm-chat-api – baseline chat service with multi-provider abstraction across OpenAI, Anthropic, and Gemini. Before building anything agentic, I needed to understand the differences among providers at the API level and design a clean abstraction layer. Boring? Yes. Essential? Also yes.
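To make the provider-abstraction idea concrete, here's a minimal sketch of the pattern. The class and function names are illustrative, not the repo's actual API, and the vendor calls are stubbed out:

```python
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    """Common interface so callers never depend on a specific vendor SDK."""

    @abstractmethod
    def chat(self, prompt: str) -> str: ...

class OpenAIProvider(ChatProvider):
    def chat(self, prompt: str) -> str:
        # A real implementation would call the OpenAI SDK here.
        return f"[openai] {prompt}"

class AnthropicProvider(ChatProvider):
    def chat(self, prompt: str) -> str:
        # A real implementation would call the Anthropic SDK here.
        return f"[anthropic] {prompt}"

PROVIDERS = {"openai": OpenAIProvider, "anthropic": AnthropicProvider}

def make_provider(name: str) -> ChatProvider:
    """Factory: swapping vendors becomes a config change, not a code change."""
    return PROVIDERS[name]()
```

The payoff is that everything above this layer – agents, RAG, evaluation – talks to `ChatProvider` and never to a vendor SDK directly.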

rag-api – a complete RAG pipeline built entirely from primitives. Document loader, chunker, embedder, retriever – all built manually before touching LlamaIndex. This was the most valuable exercise in the entire two weeks. Building from scratch forces you to understand why chunking strategy matters, what retrieval actually does, and exactly where things break. You don't get that from a framework tutorial.
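As an illustration of those primitives, here's a toy end-to-end pipeline: fixed-size chunking with overlap, a stand-in bag-of-characters embedding where a real system would call an embedding model, and cosine-similarity retrieval. All names and sizes are hypothetical:

```python
import math

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character chunking with overlap: the simplest baseline."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy character-frequency embedding, L2-normalized.

    Real pipelines replace this with a model like text-embedding-3-small.
    """
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Cosine-similarity top-k over chunk embeddings."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(c))), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Even at this toy scale, the knobs that matter in production are visible: chunk size and overlap decide what a retrieval hit can contain, and the embedding quality decides whether the right chunk surfaces at all.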

semantic-search-api – embeddings, ChromaDB, hybrid search, health dashboard. A standalone search service you can drop into any system.
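Hybrid search is easier to reason about with a sketch. This toy version blends a lexical term-match score with a precomputed vector-similarity score per document; the weighting scheme and function names are illustrative, not the service's actual code:

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document (toy lexical score)."""
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def hybrid_rank(query: str, docs: list[str],
                vector_scores: dict[str, float], alpha: float = 0.5) -> list[str]:
    """Rank documents by a weighted blend of lexical and vector scores.

    alpha=1.0 is pure semantic search; alpha=0.0 is pure keyword search.
    """
    scored = [
        (alpha * vector_scores[d] + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)]
```

The blend matters because embeddings miss exact identifiers (error codes, ticket numbers) that keyword matching catches, and keywords miss paraphrases that embeddings catch.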

ai-service-kit – a shared library with 102 passing tests, 2-level provider fallback, deterministic mock providers (SHA-256 seeding), and cloud logging abstraction for AWS, Azure, GCP, and Datadog. Paired with ai-service-template so every new service starts production-ready on day one. Turns out LLMs don't change the rules of good software engineering. Factory patterns, provider abstraction, mock layers – still very much needed. More on this later.
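The deterministic-mock idea is simple enough to sketch: hashing the prompt with SHA-256 picks a canned reply, so the same prompt always yields the same response and tests are repeatable with no network calls. The fallback wrapper shows the shape of the 2-level pattern. Class names and canned replies here are hypothetical, not the kit's actual API:

```python
import hashlib

class MockProvider:
    """Deterministic fake LLM: identical prompts always get identical replies."""

    CANNED = ["Acknowledged.", "Here is a summary.", "I need more context."]

    def chat(self, prompt: str) -> str:
        # SHA-256 of the prompt gives stable pseudo-randomness across runs.
        digest = hashlib.sha256(prompt.encode()).digest()
        return self.CANNED[digest[0] % len(self.CANNED)]

class FallbackProvider:
    """Try providers in order; return the first successful response."""

    def __init__(self, providers):
        self.providers = providers

    def chat(self, prompt: str) -> str:
        last_error = None
        for p in self.providers:
            try:
                return p.chat(prompt)
            except Exception as e:  # real code would log and classify the error
                last_error = e
        raise RuntimeError("all providers failed") from last_error
```

Determinism is the whole point: a mock that answers randomly makes CI flaky, while a seeded mock lets you assert exact outputs.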

agents-api – a custom "ReAct" multi-agent system built deliberately without LangGraph. Planner → Worker → Reviewer pattern, model routing, semantic caching, guardrails, PII masking, Prometheus metrics. I built this before the flagship to understand the agent loop mechanics before abstracting them.
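Here's the Planner → Worker → Reviewer loop stripped to its skeleton, with hard-coded stand-ins where the real system prompts an LLM at each role. This is the shape of the pattern, not the repo's implementation:

```python
def planner(task: str) -> list[str]:
    """Break the task into steps; a real planner prompts an LLM for this."""
    return [f"research: {task}", f"draft: {task}"]

def worker(step: str) -> str:
    """Execute one step; a real worker would call tools or an LLM."""
    return f"done({step})"

def reviewer(results: list[str]) -> bool:
    """Approve only if every step produced output; otherwise the loop retries."""
    return all(r.startswith("done(") for r in results)

def run_agent(task: str, max_rounds: int = 3) -> list[str]:
    """Minimal plan -> work -> review loop with a bounded retry budget."""
    for _ in range(max_rounds):
        results = [worker(s) for s in planner(task)]
        if reviewer(results):
            return results
    raise RuntimeError("reviewer never approved")
```

The bounded retry budget is the part that bites in production: without it, a reviewer that never approves turns into an infinite (and expensive) loop.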

agentic-ai-platform – the flagship. LangGraph for stateful orchestration, LlamaIndex for RAG, Model Context Protocol (MCP) for tool standardization, LangSmith for observability, and RAGAS for evaluation. An IT Support AI Agent use case with a live Human-in-the-Loop (HITL) approval demo.
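The HITL idea reduces to a small gate: before executing a sensitive tool call, the agent pauses and asks a human. This sketch uses a plain callback where LangGraph would instead interrupt the graph and resume with the human's input; the names are illustrative:

```python
from typing import Callable

def hitl_gate(action: dict, approve: Callable[[dict], bool]) -> dict:
    """Execute an action only after human approval when it's flagged sensitive.

    `approve` stands in for a real approval channel (Slack button, web UI,
    or a LangGraph interrupt that resumes with the human's decision).
    """
    if action.get("sensitive") and not approve(action):
        return {"status": "rejected", "action": action["name"]}
    return {"status": "executed", "action": action["name"]}
```

In an IT support context, the split is intuitive: looking up documentation runs straight through, while resetting a user's password waits for a human click.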

All repos are at github.com/manisundaram. Full portfolio at hellomani.com/page/work.


Meet the Stack

Before I go deep on each component, here's the cast of characters. Each gets their own episode – but it helps to know who's who before the show starts.

If you come from the REST API and cloud architecture world, which is where I come from, these mappings will feel familiar:

| Component | What It Does | Familiar Equivalent |
| --- | --- | --- |
| LLM | The knowledge engine that generates responses | A very smart API endpoint |
| RAG | Gives the LLM access to current, specific data | The DB call your API makes before responding |
| Vector Database | Stores and searches data by meaning, not value | RDS, but you query by similarity |
| Prompt Engineering | How you structure the question matters | Your API request payload |
| AI Agent | Orchestrates tools and decisions autonomously | Backend service making multiple API calls |
| LangGraph | Manages complex agent workflows and state | AWS Step Functions for AI |
| MCP | Standardized interface for agent tools | REST API contracts for AI tools |
| RAGAS | Evaluates whether your AI actually works | NUnit and Selenium for AI |
| LangSmith | Full observability into what your AI is doing | CloudWatch for your AI layer |
| ai-service-kit | Shared library, abstractions, factory patterns | Your internal SDK |

Each of these is a distinct discipline. Some will feel immediately familiar. Some will feel new. All of them matter if you're building AI systems that actually hold up in production.


The Numbers

I evaluated the flagship with RAGAS. Real numbers, not marketing:

| Metric | Score |
| --- | --- |
| Agent Task Completion | 1.00 |
| Tool Call Accuracy | 1.00 |
| RAG Faithfulness | 1.00 |
| Hallucination Rate | 0.10 |
| Overall RAGAS Score | 0.613 |

The agent does what it's supposed to do, uses tools correctly, and doesn't hallucinate against retrieved context. Good.

The overall score of 0.613 is pulled down by context precision (0.067) and context recall (0.20). My retrieval strategy needs work, specifically the chunking approach and query rewriting. I know exactly where the system is weak. That's the point of measuring.

A system that "seems to work" tells you nothing. A score of 0.067 tells you exactly where to focus next.
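For intuition, here are order-agnostic simplifications of the two weak metrics. RAGAS's actual versions are LLM-judged (and context precision is rank-weighted); this is only the shape of the idea:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant.

    A low score means the retriever pads the context with noise.
    """
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the relevant chunks the retriever managed to surface.

    A low score means the answer's supporting evidence never reached the LLM.
    """
    if not relevant:
        return 1.0
    return sum(c in retrieved for c in relevant) / len(relevant)
```

Seen this way, a precision of 0.067 with recall of 0.20 says both things at once: most of what I retrieve is noise, and most of what I need never gets retrieved. Both point back at chunking and query rewriting.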


Why This Matters Beyond the Repos

The honest answer is I built this to close a gap between what I could claim and what I could demonstrate. In this market, that gap matters.

But the process gave me something more valuable than repos. Real opinions about real tradeoffs. LangGraph vs custom loops. LlamaIndex vs primitives. When HITL adds value vs when it adds friction. What RAGAS measures and what it misses.

Those opinions only come from building.

Between ignoring the new thing and peeking at it, I know which one keeps life interesting.


Next up: What is RAG – your LLM's first stop for current information?

