An end-to-end speculative inference acceleration scheme is presented for OpenPangu-7B to address memory constraints and the lack of native support for speculative decoding on NPU hardware.
To mitigate the Memory Wall bottleneck that Large Language Models (LLMs) encounter during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.