[2412.19437] DeepSeek-V3 Technical Report

[Submitted on 27 Dec 2024 (v1), last revised 18 Feb 2025 (this version, v2)]

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W.L. Xiao, Wangding Zeng

et al. (100 additional authors not shown)

View a PDF of the paper titled DeepSeek-V3 Technical Report, by DeepSeek-AI and 199 other authors

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at this https URL.
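To put the headline figures in proportion, the following is a minimal arithmetic sketch using only the numbers quoted in the abstract. The constant and function names are illustrative and not taken from the paper. Note that the 2.788M GPU-hour figure covers the full training run, not pre-training alone, so the per-trillion-token ratio is only a rough indicator.

```python
# Figures quoted in the abstract (names below are illustrative assumptions).
TOTAL_PARAMS = 671e9             # total parameters
ACTIVE_PARAMS_PER_TOKEN = 37e9   # parameters activated for each token
PRETRAIN_TOKENS = 14.8e12        # pre-training tokens
FULL_TRAINING_GPU_HOURS = 2.788e6  # H800 GPU hours for the full training run


def activated_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def gpu_hours_per_trillion_tokens(gpu_hours: float, tokens: float) -> float:
    """Rough GPU-hour cost per trillion pre-training tokens.

    Assumption: the full-training GPU hours are divided by the
    pre-training token count, which slightly overstates the
    pre-training cost because later stages are included.
    """
    return gpu_hours / (tokens / 1e12)


if __name__ == "__main__":
    frac = activated_fraction(ACTIVE_PARAMS_PER_TOKEN, TOTAL_PARAMS)
    rate = gpu_hours_per_trillion_tokens(FULL_TRAINING_GPU_HOURS, PRETRAIN_TOKENS)
    print(f"Activated fraction per token: {frac:.1%}")
    print(f"GPU hours per trillion tokens (rough): {rate:,.0f}")
```

Under these figures, roughly one parameter in eighteen is active for a given token, which is the sense in which the MoE design keeps per-token compute well below that of a dense model of the same total size.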

Submission history

From: Wenfeng Liang
[v1] Fri, 27 Dec 2024 04:03:16 UTC (1,114 KB)
[v2] Tue, 18 Feb 2025 17:26:38 UTC (1,114 KB)
