A multi-lingual text-to-speech AI pipeline supporting real-time audio previews and massive batch conversions on AWS.
Neurative acts as a production hub for content creators requiring instantaneous, high-fidelity voice synthesis across multiple languages and tonal characteristics. The platform supports real-time audio previews for quick iteration, as well as massive batch conversions for audiobook and podcast production. Users select voice profiles, upload scripts, and receive high-bitrate audio files delivered via a global CDN. The system needed to handle hundreds of concurrent users while keeping latency low and costs predictable.
Generating and streaming high-bitrate audio for hundreds of concurrent users was overwhelming the Node servers, leading to buffering, failed conversions, and runaway API costs. Pumping base64-encoded audio through JSON WebSocket payloads crashed the Node environment under load. Browsers also strictly limit Web Audio API contexts that start without direct user interaction, complicating the preview experience. The client needed a robust architecture that could scale without degrading quality or blowing the OpenAI API budget.
Decoupled the audio processing pipeline from the main thread. Implemented an AWS S3 + CloudFront architecture: OpenAI TTS generates the binary audio, which streams directly to S3, and the frontend receives edge-cached CloudFront URLs instead of raw payloads. Engineered a batch-conversion job queue on the Node.js backend to throttle OpenAI API load gracefully, and built a Lambda buffer to handle streaming without blocking the event loop. Latency dropped by roughly 15x, and the platform now supports non-blocking batch conversions with Stripe-tiered access control.
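The batch queue described above can be sketched as a simple concurrency limiter: jobs past the cap park on a promise until a slot frees up, so bursts of conversions never hammer the TTS API all at once. This is a minimal illustration; `BatchQueue` and `maxConcurrent` are names invented here, not the production API.

```typescript
type Job<T> = () => Promise<T>;

// Concurrency-throttled job queue: at most `maxConcurrent` jobs run at
// once; the rest wait in FIFO order for a slot to open.
class BatchQueue {
  private active = 0;
  private waiting: (() => void)[] = [];

  constructor(private maxConcurrent: number) {}

  async run<T>(job: Job<T>): Promise<T> {
    // Re-check after every wake-up in case another caller took the slot.
    while (this.active >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await job();
    } finally {
      this.active--;
      this.waiting.shift()?.(); // wake the next queued job, if any
    }
  }
}
```

Because each `run` call returns the job's own promise, callers still `await` their individual conversion while the queue silently spaces out the underlying API calls.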
System architecture and data flow diagrams illustrating the underlying infrastructure and request lifecycle.
| Metric | Requirement | Target |
|---|---|---|
| P99 Latency | < 250ms | < 100ms |
| System Uptime | 99.9% | 99.99% |
| Throughput | 10k ops/sec | 50k ops/sec |
Building a frontend wrapped around the Web Audio API was extremely demanding. Browsers strictly limit audio contexts that start without direct user interaction, to prevent autoplay spam.
The heaviest engineering problem wasn't even the UI; it was managing the raw binary data. Pumping base64 audio directly through JSON WebSocket payloads crashed the entire Node environment under load.
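A quick back-of-the-envelope shows why: base64 turns every 3 raw bytes into 4 characters, and `JSON.stringify` then copies that entire string through the event loop for every connected client. A short Node sketch of the overhead on a hypothetical ~3 MB render:

```typescript
import { Buffer } from "node:buffer";

// ~3 MB of audio: roughly a few minutes of high-bitrate MP3 (illustrative).
const raw = Buffer.alloc(3_000_000);

// Encoding to base64 and wrapping in JSON inflates the payload by a third,
// and the whole string must be built and copied per client, per message.
const asJson = JSON.stringify({ audio: raw.toString("base64") });

const overhead = asJson.length / raw.length; // ≈ 1.33 before compression
```

Multiply that 33% tax (plus the string allocations) by hundreds of concurrent WebSocket connections and the crashes stop being mysterious.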
The fix was an elegant 'Lambda Buffer'. We let OpenAI TTS generate the binary, streamed it directly to an S3 object, and handed the React frontend an edge-cached CloudFront URL instead. Latency dropped by roughly 15x.
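With that flow, the API response shrinks to a few hundred bytes: just a pointer into the CDN. A minimal sketch of the hand-off, where the distribution domain and the `renders/<jobId>` key scheme are both illustrative, not the production layout:

```typescript
// After the Lambda buffer writes rendered audio to S3, the API returns a
// tiny JSON body pointing at the CloudFront edge cache instead of the
// audio bytes themselves.
const CDN_DOMAIN = "d1234abcd.cloudfront.net"; // hypothetical distribution

function renderUrl(jobId: string, version: number): string {
  // Versioned keys let CloudFront cache aggressively with no stale hits:
  // a re-render gets a new URL instead of invalidating the old one.
  return `https://${CDN_DOMAIN}/renders/${jobId}/v${version}.mp3`;
}

const response = JSON.stringify({ url: renderUrl("job-42", 1) });
```

The browser then fetches the audio straight from the nearest edge, so the Node backend never touches the binary again after the initial S3 write.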
Never transmit raw binary payloads directly inside JSON REST or WebSocket responses. Use presigned URLs for large-asset streaming.
Strict tracking of OpenAI API usage, mapped back to each user's Stripe subscription, is essential to prevent billing leaks.
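The core of that tracking can be as simple as a per-user character meter checked before every TTS call. A sketch under assumed tier names and quotas (the `Tier` values and character limits here are illustrative, not the production Stripe schema):

```typescript
// Per-user TTS usage metering keyed to a Stripe subscription tier.
type Tier = "free" | "creator" | "studio";

// Illustrative monthly character quotas per tier.
const MONTHLY_CHAR_QUOTA: Record<Tier, number> = {
  free: 10_000,
  creator: 500_000,
  studio: 5_000_000,
};

class UsageMeter {
  private used = new Map<string, number>();

  // Returns false (and records nothing) when the request would exceed the
  // quota, so the overage is caught *before* the OpenAI call is billed.
  charge(userId: string, tier: Tier, chars: number): boolean {
    const current = this.used.get(userId) ?? 0;
    if (current + chars > MONTHLY_CHAR_QUOTA[tier]) return false;
    this.used.set(userId, current + chars);
    return true;
  }
}
```

In production this counter would live in a durable store and reset with the Stripe billing cycle; the key point is that the check happens upstream of the API call, not in a post-hoc reconciliation job.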