Efficient LLM Finetuning
STADLE Value Proposition for LLMs
Outcomes:
- 8% - 14% reduction in time required to fine-tune LLMs, irrespective of the model training framework used (NeMo, DeepSpeed)
- 90% retention of learnings from older datasets when models are fine-tuned on newer datasets
How?
- Before STADLE: standard fine-tuning approaches treat the entire dataset as a single learning “task”
- After STADLE: STADLE works in parallel on multiple meaningful subsets of the dataset, each pertaining to a “subtask” of the overall learning task (e.g. data from a specific location)
STADLE + NeMo = Improved Training Efficiency
NeMo simplifies the deployment and management of distributed training tasks at scale, with support for many of the techniques used for efficient LLM pretraining and fine-tuning (3D parallelism, flash attention, PEFT, MoE)
STADLE, on the other hand, modifies the model update algorithm and model synchronization methodology, with a focus on reducing interference and redundant learning across nodes
This allows for:
- Data-efficient incremental learning
- Modified sharding based on reducing single-node training subtask complexity
- Reduction in necessary inter-node communication
Combining the higher-level optimizations from STADLE with the lower-level optimizations and orchestration from NeMo allows for improved training efficiency without significant infrastructure modifications
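STADLE's actual update and synchronization algorithm is not described here, but the communication-reduction idea can be sketched with a generic federated-averaging-style step: instead of synchronizing gradients every step, each subtask node trains independently and periodically contributes its parameters to a weighted merge. The function name `aggregate` and the toy tensors are assumptions for illustration.

```python
import numpy as np

def aggregate(subtask_weights, sample_counts):
    """Weighted average of per-subtask model parameters (FedAvg-style sketch).

    subtask_weights: list of dicts mapping parameter name -> np.ndarray
    sample_counts:   number of training samples behind each node's update
    """
    total = sum(sample_counts)
    merged = {}
    for name in subtask_weights[0]:
        merged[name] = sum(
            w[name] * (n / total)
            for w, n in zip(subtask_weights, sample_counts)
        )
    return merged

# Two subtask nodes, each holding one parameter tensor after local training.
node_a = {"layer.w": np.array([1.0, 3.0])}
node_b = {"layer.w": np.array([3.0, 5.0])}

merged = aggregate([node_a, node_b], sample_counts=[1, 3])
# → {"layer.w": array([2.5, 4.5])}
```

Because nodes exchange full parameter sets only at aggregation points rather than gradients at every step, inter-node traffic scales with the synchronization frequency instead of the step count.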