The Context (The “Nightmare” State)
The client, a Fortune 500 FinTech company, had a fragmented approach to Generative AI.
- Fragmented Access: 12 different product teams were purchasing their own OpenAI API keys, leading to zero visibility into aggregate spend or usage patterns.
- Security Risks: Developers were hardcoding API keys in `.env` files and committing them to internal repositories. There were no content filters for PII (Personally Identifiable Information).
- Operational Inefficiency: Onboarding a new microservice to use LLMs took 3 weeks of Jira tickets and security reviews.
They needed a way to democratize access to the best models (OpenAI, Anthropic) while strictly enforcing security, compliance, and cost controls.
The Architecture
We moved them from a “mesh” of direct connections to a “Hub and Spoke” model.
Before:
- Each service connected directly to external providers.
- No central logging.
- No standard retry/fallback logic.
After (The Golden Path):
- Traffic Proxy: All LLM traffic flows through a localized AI Gateway cluster (deployed on Kubernetes).
- Identity Federation: Authentication goes through the existing corporate SSO, and SSO identities are mapped to IAM roles that are used to create “Team” budgets.
- Policy Enforcement: Traffic is inspected before it leaves the boundary. Regex patterns block credit card numbers and PII.
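To make the policy-enforcement step concrete, here is a minimal Python sketch of a regex-based outbound filter. It is an illustration, not the client's actual rule set: the function names, the SSN pattern, and the Luhn checksum (added to cut false positives on order IDs and similar digit runs) are all assumptions.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Illustrative US SSN pattern (assumption: the real filter's PII rules are not shown here).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_violations(prompt: str) -> list[str]:
    """Return the kinds of sensitive data detected in an outbound prompt."""
    violations = []
    for match in CARD_RE.finditer(prompt):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            violations.append("credit_card")
    if SSN_RE.search(prompt):
        violations.append("ssn")
    return violations


def allow_request(prompt: str) -> bool:
    """Hypothetical gateway hook: forward the request only if nothing was flagged."""
    return not find_violations(prompt)
```

In this sketch, a prompt containing the well-known test number `4111 1111 1111 1111` is rejected, while a 16-digit string that fails the checksum passes through. Regex matching alone will miss free-form PII such as names and addresses, so it is best read as a first line of defense.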
The Implementation
We deployed a custom configuration of an AI Gateway integrated with their Internal Developer Platform (Backstage).
1. The Stack
- Gateway: Custom high-performance proxy written in Go, deployed as a sidecar or standalone service.
- Portal: Backstage plugin for provisioning API keys and viewing usage.
- Observability: Prometheus for metrics (token count, latency) and Datadog for tracing.
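The per-team budgets and token-count metrics in this stack imply some accounting step inside the gateway. The sketch below shows one way that check could work. It assumes budgets are expressed as monthly token caps and kept in memory; the class and method names are hypothetical, and a real deployment would persist the counters and export them to the metrics system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TeamBudgets:
    """In-memory token accounting per team (illustrative only)."""

    limits: dict[str, int]  # assumed unit: tokens per month, keyed by team name
    used: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def try_spend(self, team: str, tokens: int) -> bool:
        """Record usage and return True, or return False if it would exceed the cap."""
        limit = self.limits.get(team, 0)  # teams without a budget get nothing
        if self.used[team] + tokens > limit:
            return False
        self.used[team] += tokens
        return True

    def remaining(self, team: str) -> int:
        """Tokens left in the team's budget."""
        return self.limits.get(team, 0) - self.used[team]
```

A gateway built this way would call `try_spend` before forwarding a request and reject it once the team's cap is reached, which is what turns an uncontrolled bill into a capped one.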
2. The Hack: Semantic Caching
The biggest win came from implementing Semantic Caching. We realized that 30% of internal developer traffic was repetitive (testing the same prompts against the same models).
We implemented a Redis-backed semantic cache that intercepts these requests. If a similar prompt (cosine similarity > 0.95) was seen recently, the gateway returns the cached response instantly.
- Result: 30% reduction in traffic to OpenAI.
- Latency: Reduced from ~2.5s to <50ms for cached hits.
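The lookup logic can be sketched in a few lines of Python. This is a simplified stand-in, not the production code: it keeps entries in an in-memory list rather than Redis, takes precomputed embedding vectors as input (the embedding model is not specified here), and uses an assumed one-hour TTL for "recently". Only the 0.95 cosine threshold comes from the description above.

```python
import math
import time


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a stored response when a new prompt embedding is close enough."""

    def __init__(self, threshold=0.95, ttl_seconds=3600.0):
        self.threshold = threshold      # similarity cutoff from the article
        self.ttl_seconds = ttl_seconds  # assumed definition of "recently"
        self._entries = []              # (embedding, response, stored_at)

    def put(self, embedding, response, now=None):
        now = time.time() if now is None else now
        self._entries.append((list(embedding), response, now))

    def get(self, embedding, now=None):
        now = time.time() if now is None else now
        # Drop expired entries before searching.
        self._entries = [e for e in self._entries if now - e[2] <= self.ttl_seconds]
        best_score, best_response = 0.0, None
        for vec, response, _ in self._entries:
            score = cosine_similarity(embedding, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

On a miss (`get` returns `None`), the gateway would forward the request to the provider and `put` the response. The linear scan is fine for a sketch; at real traffic volumes the search would need a vector index.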
The Results
| Metric | Before | After |
|---|---|---|
| Token Cost | $50k/month (uncontrolled) | $30k/month (capped & optimized) |
| Onboarding Time | 3 weeks | 5 minutes (Self-Service) |
| Security Incidents | 2 Potential Leaks | 0 (Blocked by PII Filter) |
| Observability | None | Real-time Dashboard per Team |
Conclusion
By treating AI access as infrastructure rather than just an API key, we enabled the organization to scale its GenAI initiatives securely. The focus shifted from “How do I get an API key?” to “How do I build the best prompt?”