*Achieving 2K TPS with SOTA Models: Options and Considerations*
The author is seeking a state-of-the-art (SOTA) AI model capable of generating roughly 2,000 tokens per second (TPS) with minimal latency. The goal is to enable real-time conversation with a family member, with a focus on medical issues, and the model must maintain maximum intelligence across a conversation prompt of 30-60K tokens.
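The latency stakes of the TPS requirement can be sanity-checked with simple arithmetic. The sketch below is illustrative; the token counts and time-to-first-token figure are assumptions, not measurements from the author's setup.

```python
def response_latency_s(reasoning_tokens: int, answer_tokens: int,
                       decode_tps: float, ttft_s: float) -> float:
    """Wall-clock time until the full answer has streamed out.

    ttft_s: time to the first generated token (prefill + network).
    decode_tps: sustained decode throughput in tokens per second.
    """
    return ttft_s + (reasoning_tokens + answer_tokens) / decode_tps

# Assumed numbers: 500 hidden reasoning tokens before the visible answer,
# a 200-token answer, and a 0.5 s time to first token.
fast = response_latency_s(500, 200, 2000.0, 0.5)  # 2,000 TPS -> ~0.85 s
slow = response_latency_s(500, 200, 100.0, 0.5)   # 100 TPS   -> ~7.5 s
print(round(fast, 2), round(slow, 2))
```

At 2,000 TPS the whole exchange feels instantaneous; at a typical 100 TPS the same response takes several seconds, which is why the throughput target dominates the model choice here.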
Evaluating Open-Source Models
The author has identified several open-source models that could potentially meet their requirements:
* Qwen3.5 (27B parameters)
* Qwen3.5 397B-A17B (mixture-of-experts: 397B total parameters, 17B active per token)
* Kimi K2.5
* GLM-5
* Cerebras' GLM-4.7 (1K+ TPS, but considered an older model)
Analyzing Model Performance
While Cerebras' GLM-4.7 has demonstrated impressive performance at 1,000+ TPS, its age and the risk of API deprecation are concerns. The author is also considering OpenAI's "Spark" model, available on the Pro tier in Codex, but lacks confidence in it due to the absence of non-coding benchmarks.
Alternative Options and Considerations
The author has also explored alternative options, such as:
* Using a model like Opus 4.6 without reasoning for a quick time to first answer token, while accepting the loss of reasoning capability.
* The fast Claude API, which, despite its speed, misses the required 3-second time to first answer token once chain-of-thought (CoT) reasoning is included, due to latency.
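The CoT trade-off above reduces to a latency budget: every reasoning token must be generated before the first answer token appears. A back-of-envelope check, with assumed values for the reasoning length and prefill time, shows the decode throughput needed to stay inside a 3-second budget:

```python
def min_decode_tps(cot_tokens: int, budget_s: float, ttft_s: float) -> float:
    """Minimum sustained decode TPS so the chain-of-thought finishes
    within the budget and the first answer token arrives on time."""
    available = budget_s - ttft_s
    if available <= 0:
        raise ValueError("prefill alone exceeds the latency budget")
    return cot_tokens / available

# Assumed: 1,500 reasoning tokens, a 3 s budget, and 0.8 s of prefill
# for a 30-60K token prompt -> the model must decode at ~682 TPS or faster.
print(round(min_decode_tps(1500, 3.0, 0.8)))  # 682
```

This is why a merely "fast" API can still miss the 3-second target: with a long prompt and a nontrivial CoT, only very high decode throughput closes the gap.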
Next Steps
To achieve the desired performance, the author should consider the following:
* Evaluate the performance of each model using relevant benchmarks and metrics.
* Assess the feasibility of deploying the chosen model in a cloud-based environment to minimize infrastructure costs.
* Consider the trade-offs between model performance, latency, and cost to determine the most suitable option for their specific use case.
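For the benchmarking step, time to first token and sustained decode TPS can be measured against any provider that streams tokens. The harness below is a generic sketch over a hypothetical token iterator; the `fake_stream` generator stands in for a real streaming SDK response and would need to be replaced with the provider's actual client.

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and sustained decode TPS over
    any iterator that yields tokens, e.g. a streaming API response."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        count += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    decode = total - (ttft or 0.0)
    # Tokens after the first arrive during the decode interval.
    tps = (count - 1) / decode if decode > 0 and count > 1 else float("nan")
    return {"ttft_s": ttft, "tokens": count, "decode_tps": tps}

def fake_stream(n=50, delay=0.001, prefill=0.01):
    """Simulated provider stream: a prefill pause, then steady tokens."""
    time.sleep(prefill)
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

stats = measure_stream(fake_stream())
print(stats["tokens"])  # 50
```

Running the same harness with identical prompts across the candidate models gives directly comparable TTFT and TPS numbers, which is the evidence the trade-off decision in the steps above actually needs.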
Ultimately, the author's decision will depend on their specific requirements, budget, and willingness to invest in infrastructure.