Many companies hope that AI will revolutionize their business, but that hope may soon be dashed by the enormous costs of training advanced AI systems. Engineering issues often stall progress, especially when it comes to optimizing hardware such as GPUs to efficiently handle the massive computational requirements of training and fine-tuning large language models.
While large IT companies can afford to spend millions, sometimes billions, of dollars on training and optimization, small and medium-sized enterprises and startups with limited funds are often priced out entirely. This article describes some strategies that may enable developers with limited resources to train AI models without incurring large costs.
In for a dime, in for a dollar
As we all know, the development and release of AI products, whether foundation models, large language models (LLMs), or fine-tuned downstream applications, depend heavily on dedicated AI chips, especially GPUs. These GPUs are so expensive and hard to obtain that SemiAnalysis coined the terms “GPU-rich” and “GPU-poor,” now common within the machine learning (ML) community. LLM training is expensive primarily because of the costs of acquiring and maintaining the hardware, rather than the ML algorithms or expertise.
Training these models requires extensive computation on powerful clusters, and larger models take even longer. For example, training Llama 2 70B involved exposing its 70 billion parameters to 2 trillion tokens, requiring at least 10^24 floating-point operations. So should you give up if you are GPU-poor?
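As a sanity check on that figure, here is a quick back-of-the-envelope calculation using the widely cited approximation of roughly 6 floating-point operations per parameter per token (the constant 6 is a standard heuristic, not a number from this article):

```python
# Rough training-compute estimate via the common ~6 * N * D heuristic
# (about 6 FLOPs per parameter per token; an approximation, not an exact count).
params = 70e9   # Llama 2 70B: 70 billion parameters
tokens = 2e12   # 2 trillion training tokens

flops = 6 * params * tokens
print(f"~{flops:.1e} FLOPs")  # ~8.4e+23, i.e. on the order of 10^24
```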
Alternative strategies
There are several strategies that technology companies are currently leveraging to find alternative solutions, reduce reliance on expensive hardware, and ultimately save costs.
One approach is to tweak and streamline training hardware. Although this approach is still experimental and requires investment, it holds promise for the future optimization of LLM training. Examples of such hardware-related solutions include custom AI chips from Microsoft and Meta, new semiconductor initiatives from NVIDIA and OpenAI, single compute clusters from Baidu, rental GPUs from Vast, and the Sohu chip from Etched, among others.
While this is an important step forward, this methodology is best suited for larger companies that can afford to invest heavily now to reduce future expenses, not for new entrants with limited funds who want to develop an AI product now.
What to do: Innovative software
With a low budget in mind, there is another way to optimize your LLM training and reduce costs through innovative software. This approach is more affordable and accessible to most ML engineers, whether they are seasoned professionals, AI enthusiasts, or software developers looking to enter the field. Let’s take a closer look at some of these code-based optimization tools.
Mixed Precision Training
What it is: Imagine your company has 20 employees, but you rent office space for 200. Clearly, that is a waste of resources. A similar inefficiency occurs in practice during model training, where ML frameworks often allocate more memory than is actually needed. Mixed precision training fixes this inefficiency, improving both speed and memory usage.
How it works: To achieve this, lower-precision bfloat16 and float16 arithmetic is combined with standard float32 operations, so each operation consumes fewer computational resources. This may sound like technical arcana to non-engineers, but it essentially means AI models can process data faster and require less memory without compromising accuracy.
Improvement metrics: This technique can deliver runtime improvements of up to 6x on GPUs, with additional gains on TPUs (Google’s Tensor Processing Units). Open-source frameworks such as Nvidia’s APEX and Meta AI’s PyTorch support mixed precision training and make pipeline integration straightforward. Implementing this methodology can help companies significantly reduce GPU costs while maintaining an acceptable level of model performance.
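To make this concrete, here is a minimal sketch of a mixed precision training step using PyTorch’s built-in automatic mixed precision utilities; the linear model, optimizer, and random batch are placeholders standing in for a real training pipeline:

```python
import torch
from torch import nn

# Placeholder model, optimizer, and batch -- substitute your own pipeline.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(32, 1024, device="cuda")
targets = torch.randn(32, 1024, device="cuda")

# GradScaler rescales the loss so small float16 gradients don't underflow.
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    optimizer.zero_grad()
    # autocast runs eligible ops in float16, keeping the rest in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor for next step
```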
Activation Checkpointing
What it is: If memory is your bottleneck but you can afford some extra training time, checkpointing may be the right technique. In short, it significantly reduces memory consumption by recomputing certain values instead of storing them, making LLM training possible without hardware upgrades.
How it works: The main idea of activation checkpointing is to store only a subset of essential values during model training and recalculate the rest when they are needed. That is, instead of keeping all intermediate data in memory, the system keeps only what is important, freeing up memory in the process. It is akin to the “cross that bridge when we come to it” principle: don’t bother with less urgent matters until they require attention.
Improvement metrics: In most cases, activation checkpointing reduces memory usage by up to 70%, while extending the training phase by roughly 15-25%. This fair tradeoff lets companies train large AI models on existing hardware without investing additional capital in infrastructure. The aforementioned PyTorch library supports checkpointing out of the box, making it easier to implement.
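As an illustration, here is a minimal sketch using PyTorch’s torch.utils.checkpoint; the two-block toy network is a hypothetical stand-in for a real model:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    """Toy network whose blocks discard their intermediate activations
    during the forward pass and recompute them during backward."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU())

    def forward(self, x):
        # checkpoint() skips storing activations inside each block;
        # they are recomputed only when gradients are needed.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

model = CheckpointedNet()
x = torch.randn(32, 1024, requires_grad=True)
model(x).sum().backward()  # recomputation happens here, saving peak memory
```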
Multi-GPU Training
What it is: Imagine a small bakery that needs to produce a large batch of baguettes quickly. One baker working alone will take a long time. With a second baker, the process speeds up. Add a third, and it goes faster still. Multi-GPU training works in much the same way.
How it works: Instead of one GPU, multiple GPUs are used simultaneously, so training of the AI model is distributed across them and they work in parallel. Logically, this is the opposite of the previous method, checkpointing, which reduced hardware costs at the expense of extended runtime. Here, we use more hardware but exploit it to the fullest, shortening execution time and lowering operational costs in return.
Improvement metrics: Below are three robust tools for training LLMs on a multi-GPU setup, listed in ascending order of efficiency based on experimental results; a minimal usage sketch follows the list.
- DeepSpeed: A library designed specifically for training AI models across multiple GPUs, capable of speeds up to 10x faster than traditional training approaches.
- FSDP: Fully Sharded Data Parallel, one of the most popular frameworks in PyTorch, addresses some of DeepSpeed’s inherent limitations and improves computational efficiency by a further 15-20%.
- YaFSDP: A recently released enhanced version of FSDP for model training that achieves 10-25% speedup over the original FSDP methodology.
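As a rough illustration of how such a setup looks in code, here is a minimal FSDP sketch in PyTorch; it assumes a launch via torchrun, and the model and data are placeholders rather than a real training pipeline:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Placeholder model -- in practice, a transformer with per-layer wrapping.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# so each GPU holds only a fraction of the full model state.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(32, 1024, device="cuda")
targets = torch.randn(32, 1024, device="cuda")

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()   # gradients are synchronized and sharded across GPUs
optimizer.step()

dist.destroy_process_group()
```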
Conclusion
Using techniques such as mixed precision training, activation checkpointing, and multi-GPU training, even small and medium-sized businesses can make significant advances in AI training, both fine-tuning existing models and creating their own. These tools increase computational efficiency, shorten execution times, and reduce overall costs. Additionally, they allow larger models to be trained on existing hardware, reducing the need for expensive upgrades. By democratizing access to advanced AI capabilities, these approaches enable a wider range of technology companies to innovate and compete in this rapidly evolving field.
There is a saying that “AI will never replace you, but someone using AI will replace you.” The time to embrace AI is now, and with the strategies above, you can do so even on a budget.
Ksenia Se is the founder of Turing Post.