Abstract: Block-based inference engines, powered by noncontiguous key-value (KV) cache management, have emerged as a new paradigm for large language model (LLM) inference due to their efficient memory ...
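To make the idea concrete, the following is a minimal sketch of noncontiguous KV-cache block management in the spirit of PagedAttention-style engines; it is not the API of any particular system, and all names (`KVBlockManager`, `BLOCK_SIZE`, `Sequence`) are hypothetical.

```python
# Minimal sketch (assumed names) of block-based KV-cache management:
# each sequence's logically contiguous tokens are mapped onto physically
# noncontiguous fixed-size blocks, allocated on demand from a shared pool.

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


@dataclass
class Sequence:
    seq_id: int
    num_tokens: int = 0
    # Block table: logical block index -> physical block id.
    block_table: list[int] = field(default_factory=list)


class KVBlockManager:
    """Hands out physical blocks from a free pool; the per-sequence
    block table provides the logical-to-physical indirection."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids

    def append_token(self, seq: Sequence) -> None:
        # Allocate a new physical block only when the current one is full,
        # so memory is committed on demand rather than preallocated.
        if seq.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1

    def free(self, seq: Sequence) -> None:
        # Return all of a finished sequence's blocks to the shared pool.
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()
        seq.num_tokens = 0


if __name__ == "__main__":
    mgr = KVBlockManager(num_blocks=8)
    seq = Sequence(seq_id=0)
    for _ in range(40):  # decode 40 tokens -> ceil(40 / 16) = 3 blocks
        mgr.append_token(seq)
    print(seq.block_table)       # e.g. [7, 6, 5]: noncontiguous physical blocks
    mgr.free(seq)
    print(len(mgr.free_blocks))  # 8: all blocks reclaimed
```

Under this scheme, internal fragmentation is confined to the unfilled tail of each sequence's last block, which is what makes the memory use efficient relative to preallocating one contiguous region per sequence.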