Multi-Head Latent Attention: Boosting Inference Efficiency
Contents
- Introduction
- Method
  - Low-Rank Key-Value Joint Compression
  - Decoupled Rotary Position Embedding
- References

Introduction

The authors propose Multi-head Latent Attention (MLA), which compresses the keys and values into a compressed latent KV representation, shrinking the KV cache while preserving model accuracy.
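To make the core idea concrete, below is a minimal sketch of low-rank KV joint compression. The dimensions (d_model, d_latent, n_heads, d_head) and module structure are illustrative assumptions for this post, not DeepSeek's actual implementation; the point is only that a single small latent vector per token is cached, and keys/values are reconstructed from it.

```python
# Minimal sketch of low-rank KV joint compression (illustrative, not the official code).
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        # Down-projection: hidden state -> compressed latent KV (this is what gets cached)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent -> per-head keys and values
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):              # h: (batch, seq_len, d_model)
        c_kv = self.w_down_kv(h)       # (batch, seq_len, d_latent) -- cache only this
        k = self.w_up_k(c_kv)          # reconstruct keys from the latent
        v = self.w_up_v(c_kv)          # reconstruct values from the latent
        b, t, _ = h.shape
        k = k.view(b, t, self.n_heads, self.d_head)
        v = v.view(b, t, self.n_heads, self.d_head)
        return c_kv, k, v

# Caching c_kv (d_latent floats per token) instead of full K and V
# (2 * n_heads * d_head floats per token) is what reduces the KV cache.
```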