Like quantization, sparsity reduces a model’s size, which lowers hosting and inference costs. This matters because a model’s size and computational needs usually grow as its performance improves. Sparsity prunes a model by removing redundant parameters; with SparseGPT, for example, GPT-family models can be pruned to at least 50% sparsity with a minimal decrease in accuracy. Sparsity can also allow SMEs to run models on-device in their applications. Meta’s Llama 3.1 and Mistral Large 2 exceed 229 GB in file size, making the cloud computing costs of using large foundation models difficult for SMEs to afford. An SME’s IT team can turn to sparsity to cut costs and energy consumption and speed up inference while maintaining an acceptable level of accuracy.
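To make the idea of removing redundant parameters concrete, here is a minimal sketch using PyTorch’s torch.nn.utils.prune utilities to apply 50% magnitude-based unstructured pruning to a toy layer. This is an illustrative assumption, not SparseGPT’s one-shot pruning algorithm, which uses a more sophisticated, accuracy-aware weight selection.

```python
# Minimal sketch: magnitude-based unstructured pruning with PyTorch.
# Illustrates the general idea of sparsity (zeroing low-importance weights),
# not SparseGPT's one-shot pruning method.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)  # toy layer standing in for a model weight matrix

# Zero out the 50% of weights with the smallest absolute value (L1 magnitude).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make the pruning permanent by removing the re-parameterization hooks.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.0%}")  # roughly 50% of weights are now zero
```

Note that zeroed weights only translate into smaller files and faster inference when the model is stored in a sparse format or run on hardware and kernels that can exploit the sparsity.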
How Sparsity Works
Sparsity is an alternative model compression technique to quantization. …