Course Outline
Tencent Hunyuan Production Fundamentals
- Overview of Tencent Hunyuan model serving scenarios
- Production characteristics of large and MoE models
- Common latency, throughput, and cost bottlenecks
- Defining service-level objectives for inference workloads
Deployment Architecture and Serving Flow
- Core components of a production inference stack
- Choosing between containerized, on-premise, and cloud deployment models
- Model loading, request routing, and GPU allocation basics
- Designing for reliability and operational simplicity
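As a flavor of the hands-on material in this module, here is a minimal sketch of the simplest request-routing baseline, round-robin assignment across GPU replicas (replica names are illustrative; production routers typically add load- and cache-aware policies on top):

```python
import itertools

class RoundRobinRouter:
    """Minimal request router: assign each incoming request to the next
    GPU replica in turn. This is the simplest routing baseline, used
    here only to illustrate the routing step of a serving stack."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def route(self, request_id):
        # request_id is unused in pure round-robin; load- or
        # cache-aware policies would inspect it.
        return next(self._cycle)

router = RoundRobinRouter(["gpu0", "gpu1", "gpu2"])
assignments = [router.route(i) for i in range(5)]
# assignments == ["gpu0", "gpu1", "gpu2", "gpu0", "gpu1"]
```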
Latency Optimization in Practice
- Using optimized inference engines such as TensorRT where applicable
- KV-cache concepts and practical cache tuning
- Reducing startup, warmup, and response overhead
- Measuring time to first token and token generation speed
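To illustrate the measurement topic above: time to first token (TTFT) and token generation speed can be measured against any streaming endpoint with a small timing wrapper. The sketch below uses a simulated token stream so it is self-contained; in practice the iterator would wrap a real streaming API response.

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and decode speed for any
    iterator that yields tokens from a streaming endpoint."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    decode_time = total - ttft if ttft is not None else 0.0
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"ttft_s": ttft, "tokens": count, "tokens_per_s": tps}

# Simulated stream: first token after ~50 ms, then a token every ~10 ms.
def fake_stream():
    time.sleep(0.05)
    yield "Hello"
    for _ in range(9):
        time.sleep(0.01)
        yield "tok"

stats = measure_stream(fake_stream())
```

Separating TTFT from tokens-per-second matters because the two are dominated by different stages (prefill vs. decode) and respond to different optimizations.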
Throughput, Batching, and GPU Efficiency
- Continuous batching and request batching strategies
- Managing concurrency and queue behavior
- Improving GPU utilization without harming user experience
- Handling long-context and mixed-workload requests
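The continuous-batching idea covered in this module can be shown with a toy scheduler: finished sequences leave the batch at every decode step and queued requests are admitted immediately, instead of waiting for the whole batch to drain as in static batching. Request lengths and batch size here are purely illustrative.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous batching. `requests` maps request id -> number of
    tokens to generate. Returns the number of decode steps taken and
    the order in which requests completed."""
    queue = deque(requests.items())
    active = {}        # request id -> remaining tokens
    steps = 0
    completed = []
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step generates one token for every active sequence.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]      # slot is freed mid-flight
                completed.append(rid)
    return steps, completed

steps, order = continuous_batching(
    {"a": 3, "b": 1, "c": 5, "d": 2, "e": 2}, max_batch=2
)
# steps == 7; short requests finish and free their slots early.
```

The same workload under static batching would hold each slot until the longest sequence in its batch finished, which is exactly the waste continuous batching removes.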
Quantization and Cost Control
- Why quantization matters for production serving
- Practical trade-offs of FP16, INT8, and other common precision options
- Balancing model quality, latency, and infrastructure cost
- Building a simple cost optimization checklist
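A back-of-envelope calculation makes the quantization trade-off concrete: weight memory scales linearly with bits per parameter. The 70B parameter count below is a hypothetical example, not a statement about any specific Hunyuan checkpoint, and the figure ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billions, bits):
    """Approximate GPU memory for model weights alone:
    parameters * bytes-per-parameter, expressed in GiB."""
    return params_billions * 1e9 * (bits / 8) / 1024**3

# Illustrative 70B-parameter dense model at common precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(70, bits):.0f} GB")
```

Halving the precision halves the weight footprint, which is why INT8 often turns a multi-GPU deployment into a single-GPU one; the cost side is easy, and the course focuses on verifying the quality side.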
Operations, Monitoring, and Readiness Review
- Autoscaling triggers for inference services
- Monitoring latency, throughput, cache usage, and GPU health
- Logging, alerting, and incident response basics
- Reviewing a reference deployment and creating an improvement plan
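The autoscaling-triggers topic can be sketched as a simple decision rule: scale out when any pressure signal breaches its threshold, scale in only when all signals are comfortably low. All thresholds below are illustrative defaults, not recommendations for any particular workload.

```python
def autoscale_decision(gpu_util, queue_depth, p95_latency_ms,
                       util_high=0.85, queue_high=32, latency_slo_ms=2000,
                       util_low=0.30):
    """Toy autoscaling rule for an inference service.
    Any single breached threshold triggers scale-out; scale-in
    requires every signal to be well inside its limit, which adds
    hysteresis and avoids flapping."""
    if (gpu_util > util_high
            or queue_depth > queue_high
            or p95_latency_ms > latency_slo_ms):
        return "scale_out"
    if (gpu_util < util_low
            and queue_depth == 0
            and p95_latency_ms < latency_slo_ms / 2):
        return "scale_in"
    return "hold"
```

In practice such a rule would feed a cooldown timer and a replica-count controller; the asymmetry between the scale-out and scale-in conditions is the important design point.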
Requirements
- Basic understanding of large language model deployment and inference workflows
- Experience with containers, cloud or on-premise infrastructure, and API-based services
- Working knowledge of Python or system engineering tasks
Audience
- ML engineers deploying LLMs into production
- Platform engineers responsible for GPU-based inference services
- Solution architects designing scalable AI serving platforms
14 Hours
Custom Corporate Training
Training solutions designed exclusively for businesses.
- Customized Content: We adapt the syllabus and practical exercises to the real goals and needs of your project.
- Flexible Schedule: Dates and times adapted to your team's agenda.
- Format: Online (live), In-company (at your offices), or Hybrid.
Price per private group, online live training, starting from 3200 € + VAT*
Contact us for an exact quote and to hear about our latest promotions.