Section 5: Data Pipeline

Meeting content flows from live capture through transcription, semantic chunking, and embedding before landing in the vector store. Each processing step is handled by an abstracted provider adapter — swapping implementations is a configuration change, not a code change.

Ingestion Flow

The meeting bot and Job Runner operate independently. The bot captures audio and uploads to blob storage; the Job Runner handles all downstream processing asynchronously.
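This decoupling can be sketched as a queue of upload events that the Job Runner drains independently of capture. Everything here (the `RecordingUploaded` event, the queue, the `drain_jobs` helper) is illustrative, not part of the actual system:

```python
import queue
from dataclasses import dataclass


@dataclass
class RecordingUploaded:
    """Hypothetical event the bot emits after uploading audio to blob storage."""
    meeting_id: str
    blob_path: str


def drain_jobs(events: "queue.Queue[RecordingUploaded]", process) -> int:
    """Consume pending upload events and run downstream processing for each.

    The bot only enqueues events and moves on; transcription, chunking,
    embedding, and storage all happen here, asynchronously.
    """
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        process(event)  # transcription -> chunking -> embedding -> vector store
        handled += 1


# Bot side: capture finishes, an event is enqueued, the bot is done.
events: "queue.Queue[RecordingUploaded]" = queue.Queue()
events.put(RecordingUploaded("meeting-42", "recordings/meeting-42.webm"))

processed: list[RecordingUploaded] = []
n = drain_jobs(events, processed.append)
```

In production the queue would be a durable broker or a blob-storage event notification rather than an in-memory structure; the point is only that capture and processing share no call stack.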

[Diagram: ingestion flow across Capture, Processing, Storage, and Orchestration]

Transcription Provider Strategy

Transcription is a discrete pipeline step. The meeting bot delivers raw audio (WebM) to blob storage; the Job Runner picks it up and dispatches to the configured transcription provider. The interface is abstracted behind a provider adapter.
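A minimal sketch of that adapter, assuming a shared `transcribe` interface; the adapter classes and the config-driven registry are hypothetical stand-ins, and a real implementation would call each vendor's API instead of returning placeholder text:

```python
from typing import Protocol


class TranscriptionProvider(Protocol):
    """Common interface every transcription adapter must satisfy."""
    def transcribe(self, audio_blob: bytes) -> str: ...


class DeepgramAdapter:
    """Stub; a real adapter would POST the audio with diarize=true."""
    def transcribe(self, audio_blob: bytes) -> str:
        return f"[deepgram transcript of {len(audio_blob)} bytes]"


class AssemblyAIAdapter:
    """Stub; a real adapter would call AssemblyAI's transcript API."""
    def transcribe(self, audio_blob: bytes) -> str:
        return f"[assemblyai transcript of {len(audio_blob)} bytes]"


# Provider selection is configuration, not code.
PROVIDERS: dict[str, TranscriptionProvider] = {
    "deepgram": DeepgramAdapter(),
    "assemblyai": AssemblyAIAdapter(),
}


def transcribe(audio_blob: bytes, provider: str = "deepgram") -> str:
    return PROVIDERS[provider].transcribe(audio_blob)
```

The Job Runner calls `transcribe()` and never imports a vendor SDK directly, which is what makes swapping providers a one-line config change.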

At these price points, transcription output approximates what COTS providers such as Fireflies and MeetGeek deliver. Accuracy can be tuned upward, but returns diminish quickly and costs rise steeply.

Deepgram Nova-2 (recommended)
Cost: ~$0.0043 / min
Diarization: query parameter (diarize=true) at no extra cost
Cheapest option with strong accuracy

AssemblyAI
Cost: ~$0.006 / min
Diarization: automatic speaker count detection
Nano tier available for lower cost when top accuracy isn't needed

Build the provider adapter from day one. Swapping transcription providers should be a configuration change, not a code change.

Embedding Model Strategy

The embedding model is separate from the reasoning model used at query time. It converts semantic chunks into dense vectors for storage in Qdrant. Two strong candidates exist, each with different trade-offs.

OpenAI text-embedding-3-large (high quality)
Dimensions: 3072
Generally considered the highest-quality general-purpose embedding model

Voyage AI (Claude-optimized)
Dimensions: 1024
Anthropic has optimized Claude to work particularly well with Voyage embeddings

Which combination performs best is a moving target. OpenAI's embeddings tend to be higher quality in isolation, but Claude's optimization for Voyage means retrieval quality can be better in the full Claude + Voyage pipeline.

Abstract the embedding interface

Build the embedding provider behind an adapter from day one. Swapping models should be a configuration change.

Build re-indexing tooling early

We will almost certainly swap embedding models at least once. The ability to re-embed the entire corpus against a new model needs to exist before that day comes.
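A minimal sketch of that tooling, under the assumption that re-embedding writes into a fresh index and cuts over only when the rebuild completes (so queries never see mixed-dimension vectors); the `reindex` helper and the toy embedder below are illustrative, and a real version would write to a new Qdrant collection:

```python
def reindex(corpus: list[tuple[str, str]], embed, batch_size: int = 64) -> dict:
    """Re-embed every (chunk_id, text) pair in batches into a fresh index."""
    new_index: dict[str, list[float]] = {}
    for start in range(0, len(corpus), batch_size):
        batch = corpus[start:start + batch_size]
        vectors = embed([text for _, text in batch])
        for (chunk_id, _), vector in zip(batch, vectors):
            new_index[chunk_id] = vector
    return new_index


# Toy embedder for demonstration: a 4-dim vector of the text length.
def toy_embed(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] * 4 for t in texts]


index = reindex([("c1", "hello"), ("c2", "meeting notes")], toy_embed, batch_size=1)
```

Batching matters in practice: embedding APIs bill per token and rate-limit per request, so the corpus-wide rebuild should stream in batches rather than chunk-by-chunk.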