Train Your Large Model on Multiple GPUs with Tensor Parallelism
This article is divided into five parts; they are:

• An Example of Tensor Parallelism
• Setting Up Tensor Parallelism
• Preparing Model for Tensor Parallelism
• Train a Model with Tensor Parallelism
• Combining Tensor Parallelism with FSDP

Tensor parallelism originated from the Megatron-LM paper.
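To make the idea concrete, here is a minimal sketch of tensor parallelism using PyTorch's `torch.distributed.tensor.parallel` API. The two-layer feed-forward module, its dimensions, and the column-then-row split plan are illustrative assumptions, not taken from the article; launching with `torchrun` (one process per GPU) is assumed.

```python
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class FeedForward(nn.Module):
    """A toy two-layer MLP to be sharded across GPUs (illustrative sizes)."""
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
# One mesh dimension spanning all GPUs in the job
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

model = FeedForward().cuda()
# Shard w1 column-wise and w2 row-wise across the mesh
model = parallelize_module(model, mesh, {
    "w1": ColwiseParallel(),
    "w2": RowwiseParallel(),
})

out = model(torch.randn(8, 1024, device="cuda"))
```

Splitting the first linear layer by columns and the second by rows is the classic Megatron-LM pattern: the sharded intermediate activations never need to be gathered, and only the final output requires an all-reduce.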
Train Your Large Model on Multiple GPUs with Fully Sharded Data Parallelism
This article is divided into five parts; they are:

• Introduction to Fully Sharded Data Parallel
• Preparing Model for FSDP Training
• Training Loop with FSDP
• Fine-Tuning FSDP Behavior
• Checkpointing FSDP Models

Sharding is a term originally used in database management systems, where it refers to dividing a database into smaller units, called shards, to improve performance.
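As a preview of the FSDP workflow, here is a minimal sketch of wrapping a model with `FullyShardedDataParallel`; the toy model and optimizer settings are placeholders, and launching with `torchrun` is assumed so the NCCL process group can initialize.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)
).cuda()

# Each rank stores only a shard of the parameters; full weights are
# gathered on demand during forward/backward and freed afterwards
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

Because parameters, gradients, and optimizer state are all sharded, per-GPU memory drops roughly in proportion to the number of ranks.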
Beyond Short-term Memory: The 3 Types of Long-term Memory AI Agents Need
If you've built chatbots or worked with language models, you're already familiar with how AI systems handle memory within a single conversation.
Train Your Large Model on Multiple GPUs with Pipeline Parallelism
This article is divided into six parts; they are:

• Pipeline Parallelism Overview
• Model Preparation for Pipeline Parallelism
• Stage and Pipeline Schedule
• Training Loop
• Distributed Checkpointing
• Limitations of Pipeline Parallelism

Pipeline parallelism means splitting the model into a pipeline of stages.
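The following single-process sketch illustrates the core idea under simplifying assumptions: two hand-made stages on two GPUs and a batch split into micro-batches. Real pipeline schedules such as GPipe or 1F1B run the stages in separate processes and overlap their work.

```python
import torch
import torch.nn as nn

# Stage 0 on the first GPU, stage 1 on the second
stage0 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(2048, 512).to("cuda:1")

batch = torch.randn(32, 512, device="cuda:0")
outputs = []
for mb in batch.chunk(4):      # 4 micro-batches reduce pipeline "bubbles"
    h = stage0(mb)             # runs on cuda:0
    h = h.to("cuda:1")         # hand activations to the next stage
    outputs.append(stage1(h))  # runs on cuda:1
result = torch.cat(outputs)
```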
5 Python Libraries for Advanced Time Series Forecasting
Predicting the future has always been the holy grail of analytics.
Training a Model on Multiple GPUs with Data Parallelism
This article is divided into two parts; they are:

• Data Parallelism
• Distributed Data Parallelism

If you have multiple GPUs, you can combine them to work as one more powerful device: each GPU processes a different slice of every batch in parallel, speeding up training.
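Here is a minimal sketch of distributed data parallelism with PyTorch's `DistributedDataParallel`; the tiny model and random batch are placeholders, and launching with `torchrun` is assumed.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Every rank holds a full replica of the model
model = DDP(nn.Linear(128, 10).cuda(), device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# In practice each rank loads a different shard of the dataset
x = torch.randn(64, 128, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # gradients are all-reduced across ranks here
optimizer.step()  # every replica applies the same averaged update
```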
Train a Model Faster with torch.compile and Gradient Accumulation
This article is divided into two parts; they are:

• Using `torch.compile`
• Using Gradient Accumulation
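As a sketch of how the two techniques combine, the loop below compiles a toy model and averages gradients over several micro-batches before stepping; the model, data, and accumulation factor of 4 are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(128, 10).cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4  # effective batch = 4 x micro-batch size

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(16, 128, device="cuda")
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()  # scale so the summed gradients average out
optimizer.step()  # one update per accumulation window
```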
Training a Model with Limited Memory using Mixed Precision and Gradient Checkpointing
This article is divided into three parts; they are:

• Floating-point Numbers
• Automatic Mixed Precision Training
• Gradient Checkpointing

Let's get started! The default data type in PyTorch is the IEEE 754 32-bit floating-point format, also known as single precision.
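To preview the mixed-precision half of the recipe, here is a minimal sketch of one training step with `torch.autocast` and a gradient scaler; the model and data are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(64, 128, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

# Eligible ops run in float16 inside this context
with torch.autocast("cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss to avoid gradient underflow
scaler.step(optimizer)         # unscales gradients before stepping
scaler.update()
```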
Practical Agentic Coding with Google Jules
If you have an interest in agentic coding, there's a pretty good chance you've heard of Jules.
Evaluating Perplexity on Language Models
This article is divided into two parts; they are:

• What Is Perplexity and How to Compute It
• Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a measure of how well a language model predicts a sample of text.
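Concretely, perplexity is the exponential of the average per-token negative log-likelihood, $\mathrm{PPL} = \exp\big(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\big)$. The sketch below computes it from random logits standing in for a model's output.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 20
logits = torch.randn(seq_len, vocab_size)           # stand-in model outputs
targets = torch.randint(0, vocab_size, (seq_len,))  # actual next tokens

nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
perplexity = torch.exp(nll)
print(f"perplexity = {perplexity.item():.2f}")  # high: random logits carry no information
```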
3 Smart Ways to Encode Categorical Features for Machine Learning
If you spend any time working with real-world data, you quickly realize that not everything comes in neat, clean numbers.
Pretraining a Llama Model on Your Local GPU
This article is divided into three parts; they are:

• Training a Tokenizer with Special Tokens
• Preparing the Training Data
• Running the Pretraining

The model architecture you will use is the same as the one created in a previous article.
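As a taste of the first step, here is a minimal sketch of training a byte-level BPE tokenizer with special tokens using the Hugging Face `tokenizers` library; the vocabulary size, token names, and the `corpus.txt` file are hypothetical, not from the article.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],  # hypothetical choices
)
tokenizer.train(["corpus.txt"], trainer)  # hypothetical training corpus
tokenizer.save("tokenizer.json")
```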
Rotary Position Embeddings for Long Context Length
This article is divided into two parts; they are:

• Simple RoPE
• RoPE for Long Context Length

Compared to the sinusoidal position embeddings in the original Transformer paper, RoPE transforms the input tensor with a rotation:

$$
\begin{aligned}
X'_{n,i} &= X_{n,i} \cos(n\theta_i) - X_{n,\frac{d}{2}+i} \sin(n\theta_i) \\
X'_{n,\frac{d}{2}+i} &= X_{n,i} \sin(n\theta_i) + X_{n,\frac{d}{2}+i} \cos(n\theta_i)
\end{aligned}
$$

where $X_{n,i}$ is the $i$-th element of the vector at the $n$-th position in the sequence.
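The equations translate directly into code. The sketch below applies the rotation to a `(seq_len, dim)` tensor, assuming the common $\theta_i = 10000^{-2i/d}$ frequency schedule, which the excerpt does not specify.

```python
import torch

def rope(x: torch.Tensor) -> torch.Tensor:
    """Apply the rotary position embedding to x of shape (seq_len, dim)."""
    n_pos, dim = x.shape
    half = dim // 2
    # theta_i = 10000^(-2i/d), an assumed but common schedule
    theta = 10000.0 ** (-torch.arange(half, dtype=torch.float32) / half)
    angle = torch.arange(n_pos, dtype=torch.float32)[:, None] * theta  # n * theta_i
    x1, x2 = x[:, :half], x[:, half:]  # element i paired with element d/2 + i
    return torch.cat([
        x1 * torch.cos(angle) - x2 * torch.sin(angle),
        x1 * torch.sin(angle) + x2 * torch.cos(angle),
    ], dim=-1)

out = rope(torch.randn(8, 64))
```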
How to Fine-Tune a Local Mistral or Llama 3 Model on Your Own Dataset
Large language models (LLMs) like Mistral 7B and Llama 3 8B have shaken up the AI field, but their general-purpose nature limits how well they serve specialized domains.
5 Agentic Coding Tips & Tricks
Agentic coding only feels "smart" when it ships correct diffs, passes tests, and leaves a paper trail you can trust.
K-Means Cluster Evaluation with Silhouette Analysis
Clustering models in machine learning must be assessed by how well they separate data into meaningful groups with distinctive characteristics.
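For a concrete starting point, here is a short sketch of silhouette analysis with scikit-learn; the synthetic blobs and the choice of k=3 are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient in [-1, 1]; values near 1 indicate
# tight, well-separated clusters
print(silhouette_score(X, labels))
```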
The Complete Guide to Docker for Machine Learning Engineers
Machine learning models often behave differently across environments.
Preparing Data for BERT Training
This article is divided into four parts; they are:

• Preparing Documents
• Creating Sentence Pairs from Documents
• Masking Tokens
• Saving the Training Data for Reuse

BERT's pretraining is more complex than that of decoder-only models.
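As a sketch of the masking step, the function below applies the original BERT recipe of masking 15% of tokens, with the usual 80/10/10 split between `[MASK]`, a random token, and leaving the token unchanged; these rates follow the BERT paper and may differ from the article's choices.

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Mask tokens for masked language modeling, BERT-style."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose ~15% of positions to predict; ignore the rest in the loss
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100

    # 80% of chosen positions become [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_id

    # Half of the remainder (10% overall) become a random token;
    # the final 10% keep the original token
    randomized = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
        & masked & ~replaced
    )
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    return input_ids, labels
```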
BERT Models and Their Variants
This article is divided into two parts; they are:

• Architecture and Training of BERT
• Variations of BERT

BERT is an encoder-only model.
From Shannon to Modern AI: A Complete Information Theory Guide for Machine Learning
In 1948, Claude Shannon published a paper that changed how we think about information forever.