AMD Instinct MI355X Achieves MLPerf Inference v6.0 Gains with Over 1 Million Tokens per Second and Supports Scalable ROCm



April 15, 2026
AMD has announced its MLPerf Inference v6.0 benchmark results, positioning the Instinct MI355X GPU as a highly scalable inference platform capable of supporting single-node, multinode, and heterogeneous deployments. Beyond incremental performance gains, the submission introduces new workloads, demonstrates cluster-scale throughput exceeding 1 million tokens per second, and validates consistent performance reproducibility across an expanding partner ecosystem.

CDNA 4 Architecture Targets High-Capacity Inference


The Instinct MI355X is built on AMD’s CDNA 4 architecture, leveraging a TSMC dual-process chiplet design: compute dies (XCDs) use a 3nm node, while I/O dies utilize 6nm FinFET technology. The multi-chiplet package integrates 185 billion transistors and supports FP4 and FP6 data formats—critical for efficient large-model inference. Each GPU is equipped with up to 288GB of HBM3E memory (delivering 8 TB/sec of memory bandwidth), enabling support for models up to 520 billion parameters on a single device. AMD emphasizes that this combination of compute density and memory capacity eliminates the need for excessive model partitioning, a key advantage for large-scale inference workloads.
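As a rough illustration of why low-precision formats matter for single-device capacity, the weight footprint of a model can be estimated from its parameter count and bits per parameter. The sketch below uses only figures quoted above and ignores KV cache, activations, and runtime overhead, so it is a lower bound on real memory needs, not a deployment guide:

```python
# Back-of-envelope weight-memory estimate for a single MI355X (288 GB HBM3E).
# KV cache, activations, and framework overhead are deliberately ignored.

HBM_CAPACITY_GB = 288

def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed for model weights alone at a given precision."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 6, 4):
    size = weights_gb(520, bits)
    verdict = "fits" if size <= HBM_CAPACITY_GB else "does not fit"
    print(f"520B params @ FP{bits}: {size:.0f} GB -> {verdict} in {HBM_CAPACITY_GB} GB")
```

At FP4, 520 billion parameters occupy 260 GB of weights, which is consistent with the 520-billion-parameter figure fitting in 288 GB only at 4-bit precision.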

Available in UBB8 configurations, the platform offers both air-cooled and direct liquid-cooled options, aligning with diverse data center deployment requirements. Notably, the MI355X features a 1400W TBP (Total Board Power) with liquid cooling, delivering higher performance than its air-cooled counterpart, the MI350X.

Multinode Throughput Surpasses 1 Million Tokens per Second


A standout achievement from the MLPerf v6.0 round is AMD’s cluster-scale throughput exceeding 1 million tokens per second. Using Instinct MI355X GPUs, AMD hit this milestone with Llama 2 70B in both Server and Offline scenarios, as well as with GPT-OSS-120B in Offline mode.

[Figure: AMD MLPerf 1 million tokens per second]

These results reflect a growing industry shift toward evaluating inference performance at the cluster level, rather than per individual accelerator. Aggregate throughput and time-to-serve have become primary metrics for determining production readiness in large-scale AI deployments.

AMD also demonstrated exceptional scaling efficiency. For Llama 2 70B, an 11-node, 87-GPU configuration achieved over 1 million tokens per second across Offline, Server, and Interactive scenarios, with scale-out efficiency ranging from 93% to 98%. For GPT-OSS-120B, a 12-node, 94-GPU cluster delivered similar throughput with over 90% scaling efficiency—proving performance translates effectively as deployments expand beyond a single system.
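Scale-out efficiency as used here can be read as aggregate cluster throughput divided by an ideal linear scale-up of a single node. A minimal sketch of that calculation (the single-node baseline below is a hypothetical figure for illustration, not a number from the submission):

```python
def scale_out_efficiency(cluster_tps: float, single_node_tps: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved by a multinode cluster."""
    ideal = single_node_tps * nodes
    return cluster_tps / ideal

# Hypothetical: an 11-node cluster serving 1.0M tok/s, one node serving 96,000 tok/s.
eff = scale_out_efficiency(1_000_000, 96_000, 11)
print(f"scale-out efficiency: {eff:.1%}")  # ~94.7%, within the reported 93% to 98% band
```

Efficiencies near 1.0 indicate that interconnect and scheduling overheads consume little of the added hardware's throughput as nodes are added.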

Generational Gains and Competitive Single-Node Performance


AMD reported significant generational improvements, with the Instinct MI355X delivering 3.1x better performance on Llama 2 70B Server compared to the prior-generation Instinct MI325X, reaching 100,282 tokens per second. This improvement stems from both CDNA 4 architectural enhancements and ROCm software optimizations. Offline scores improved by 4.4x and Server scores by 4.8x compared to prior MLPerf rounds, primarily driven by FP4 quantization—a key feature of the MI355X that unlocks higher throughput for AI workloads.

[Figure: AMD Instinct MI355X inference results vs. previous generation]

In single-node comparisons against NVIDIA platforms, the MI355X demonstrated strong competitiveness. On Llama 2 70B, it matched NVIDIA B200 in Offline throughput, achieved near parity in Server performance, and outperformed it in Interactive mode. Against NVIDIA B300, the MI355X delivered 92% of Offline performance, 93% of Server performance, and exceeded it by 4% in Interactive mode. Notably, the MI355X also offers superior cost-efficiency, delivering 40% more tokens per dollar compared to the NVIDIA B200.
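"Tokens per dollar" normalizes throughput by platform cost. The sketch below shows the metric itself; the prices are placeholder values chosen only to reproduce a 40% relative advantage, not figures from the article:

```python
def tokens_per_dollar(tps: float, system_cost_usd: float, seconds: float = 1.0) -> float:
    """Tokens generated per dollar of system cost over a time window."""
    return tps * seconds / system_cost_usd

# Placeholder example: equal throughput, with the competing platform priced 40% higher.
platform_a = tokens_per_dollar(1_000_000, 1.0)
platform_b = tokens_per_dollar(1_000_000, 1.4)
print(f"relative tokens-per-dollar advantage: {platform_a / platform_b:.2f}x")
```

The same 1.4x ratio can arise from higher throughput at equal cost, lower cost at equal throughput, or any mix of the two.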

First-Time Model Enablement Expands Coverage


MLPerf Inference v6.0 introduced several new workloads, and AMD used this round to showcase rapid model enablement. GPT-OSS-120B, a mixture-of-experts model, made its MLPerf debut with the MI355X, achieving competitive results against NVIDIA systems in both Offline and Server scenarios.

AMD also submitted results for Wan-2.2 text-to-video generation, marking its entry into multimodal and generative video inference. While the official submission focused on Single Stream latency, the results were on par with existing platforms. Post-submission tuning further improved performance, highlighting room for optimization as the software stack matures.

These additions underscore AMD’s commitment to expanding beyond traditional LLM benchmarks to support emerging AI workloads across diverse use cases.

ROCm Software Enables Scaling and Heterogeneous Inference


AMD credits much of the MI355X’s performance and scalability to its ROCm software stack. Key enhancements include optimized FP4 execution, improved GPU-to-GPU communication for distributed inference, and support for dynamic workload distribution across heterogeneous environments—critical for mixed-GPU deployments.

[Figure: AMD MLPerf Inference results for Instinct MI355X]
A milestone heterogeneous submission—developed by Dell and MangoBoost—used three AMD Instinct GPU models: MI300X, MI325X, and MI355X. This configuration achieved 141,521 tokens per second on Llama 2 70B Server and 151,843 tokens per second on Llama 2 70B Offline. Notably, the MI355X platform was located in Dell’s U.S. lab, while the MI300X and MI325X systems were in Korea—demonstrating the ability to coordinate distributed systems across geographic locations.
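One generic way to serve a mixed fleet like this is to route requests in proportion to each GPU generation's relative throughput. The sketch below is a simple weighted scheduler, not MangoBoost's or Dell's actual implementation, and the per-generation weights are illustrative only:

```python
import random

# Illustrative relative serving weights per GPU generation (not measured values).
WEIGHTS = {"MI300X": 1.0, "MI325X": 1.3, "MI355X": 3.1}

def pick_backend(weights: dict[str, float]) -> str:
    """Choose a backend with probability proportional to its weight."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Route 10,000 simulated requests and observe the resulting distribution.
counts = {name: 0 for name in WEIGHTS}
for _ in range(10_000):
    counts[pick_backend(WEIGHTS)] += 1
print(counts)  # roughly proportional to 1.0 : 1.3 : 3.1
```

In a real deployment the weights would be derived from measured per-node throughput, and geographic latency (as in the U.S./Korea split above) would also factor into routing.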

Ecosystem Growth and Reproducibility


AMD’s partner ecosystem expanded significantly in this MLPerf round, with nine companies submitting results across multiple Instinct GPU generations. Participating vendors include Cisco, Dell, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Supermicro, and Red Hat—reflecting broad industry adoption of AMD’s inference solutions.

Partner submissions closely aligned with AMD’s internal results, typically within 4% and in some cases within 1%. This consistency confirms that MI355X performance is reproducible across OEM and cloud platforms, reducing deployment risk and boosting confidence in real-world performance outcomes.
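A claim like "within 4%" can be checked as the relative deviation of a partner score from AMD's reference score. A minimal sketch with hypothetical numbers (only the 100,282 tok/s reference appears earlier in the article; the partner score is invented for illustration):

```python
def relative_deviation(partner_tps: float, reference_tps: float) -> float:
    """Absolute relative difference between a partner result and the reference."""
    return abs(partner_tps - reference_tps) / reference_tps

# Hypothetical partner score of 98,500 tok/s against the 100,282 tok/s reference.
dev = relative_deviation(98_500, 100_282)
print(f"deviation: {dev:.2%}, within 4%: {dev <= 0.04}")
```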

Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang/Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com/www.storagesserver.com
Business Focus:
ICT Product Distribution/System Integration & Services/Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World.” Your Trusted ICT Product Service Provider!