Blog

Reinforcement Learning Part 3: The Bellman Optimality Equation and Optimal Policies

Those joining us directly at Part 3 should be familiar with the Bellman Equation and how it can be used to compare two policies. We have defined optimal policies and optimal state values. We also know how to calculate state values iteratively, using a step known as Policy Evaluation, rather than computing a computationally heavy matrix inverse. Those interested in previous parts can find the links at the bottom.
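As a quick refresher, here is a minimal sketch of iterative Policy Evaluation in Python. It assumes the Bellman Equation for a fixed policy in the form v = r + γPv. The two-state transition matrix `P`, the reward vector `r`, the discount `gamma`, and the function name `policy_evaluation` are illustrative assumptions, not values taken from the earlier parts.

```python
def policy_evaluation(P, r, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iteratively apply v <- r + gamma * P v until the values stop changing.

    P[s][t] is the (assumed) probability of moving from state s to state t
    under the fixed policy, and r[s] is the expected immediate reward in s.
    """
    n = len(r)
    v = [0.0] * n  # arbitrary starting guess
    for _ in range(max_iters):
        v_new = [
            r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        # Stop once the largest change across states is below the tolerance.
        if max(abs(a - b) for a, b in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical two-state example: state 1 is absorbing with zero reward.
P = [[0.5, 0.5],
     [0.0, 1.0]]
r = [1.0, 0.0]
values = policy_evaluation(P, r)
print(values)
```

For this toy problem the closed-form solution is easy to check by hand: v(1) = 0 and v(0) = 1 / (1 - 0.9 × 0.5) ≈ 1.818, which the iteration approaches without ever inverting a matrix.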
The Bellman Optimality Equation
The Bellman Optimality Equation (BOE) builds on the Bellman Equation and tries to express the state values that an agent can achieve...
Reinforcement Learning Part 2: The Bellman Equation and its Importance

For those joining this series directly at Part 2, Part 1 contained a list of terminology that is commonly used in RL, all of which the reader should be familiar with. Part 1 is linked at the bottom of this post.
How do we know which policy is better?
Let us say for any given RL problem, we have to find the “best” policy that the agent should follow. To find the best policy, we first need to define what makes a policy better than another one. Intuitively speaking, a policy is better if the agent receives higher rewards. Mathematically, state values provide us with a way to...
Understanding MAISI: Synthetic CT Scan Generation with Medical AI Diffusion Models

Paper: MAISI: Medical AI for Synthetic Imaging (https://arxiv.org/pdf/2409.11169)
Oct 2024, NVIDIA
With generative AI being all the rage right now, diffusion models have emerged as a highly effective means to create high-fidelity, high-diversity images. While much of the research in this field is aimed at controlling the outputs of these models, a small section of research is devoted to generating radiological scans, with an even smaller section of research aimed at generating 3D radiological scans, specifically Computed Tomography (CT) scans.
One may question why I am even...
Reinforcement Learning Part 1: Basic Terminology and the Markov Decision Process

Reinforcement learning is a branch of AI that has recently gained renewed attention through cutting-edge research and experimentation. While traditionally considered a slow and unstable training paradigm, its role in advancing large language models and the broader pursuit of AGI has revealed new benefits, making RL a powerful strategy that every AI scientist should have in their toolkit.
Terminology
Let us begin with the most basic terminology that we will encounter throughout this series:
Agent: An entity that perceives its environment and takes actions autonomously...
Understanding Classifier-Free Guidance: Improving Control in Diffusion Models Without Additional…

Understanding Classifier-Free Guidance: Improving Control in Diffusion Models Without Additional Networks
Paper: CLASSIFIER-FREE DIFFUSION GUIDANCE (https://arxiv.org/pdf/2207.12598)
Jul 2022, Google Research
Goal
The authors of this paper set out to achieve the results of Classifier Guidance without the classifier that is that method's main innovation. Avoiding the disadvantages of this classifier is the primary motivation behind the paper. They also manage this with a single stage of training, compared to the two networks required to...
Understanding Classifier Guidance: Steering Diffusion Models with Gradient Signals

Paper: Diffusion Models Beat GANs on Image Synthesis (https://openreview.net/pdf?id=AAWuCvzaVt)
June 2021, OpenAI
Goal
It is well known that while GANs have superior image fidelity, they fall short on output diversity. Diffusion Models (DMs) are a noteworthy competitor in image synthesis and provide high output diversity, but they lack controllability. The authors of this paper give two reasons for this disparity: extensive research on GANs leading to better results, and the inherent ability of GANs to trade...
Understanding DETR: When Transformers Decide Bounding Boxes Are Just Another Sequence

Paper: End-to-End Object Detection with Transformers (https://arxiv.org/pdf/2005.12872)
May 2020, Facebook AI
DETR: Detection Transformer
Traditionally, detection models use complex strategies that combine CNNs with additional non-DL AI algorithms to generate bounding boxes and classifications for objects of interest. The authors of this paper broke this trend by targeting the same goal with a simple transformer-based architecture and a smart method to calculate the loss. The authors strip detection architectures of region proposal networks,...
Understanding LLaVa 1.5: Improvements in Training Recipes and Benchmarks

Paper: Improved Baselines with Visual Instruction Tuning (https://arxiv.org/pdf/2310.03744)
May 2024, UW-Madison, Microsoft Research
In this paper, the authors build upon their previous paper “Visual Instruction Tuning”. If you want to know more about LLaVa 1.0 i.e. the first paper, have a look here.
The core material presented in this paper can be divided into the following sections:
Comparison with other models
Architectural changes
Dataset changes
Results
Comparison with other models
The paper mainly focuses on the following competitors when it compares them with LLaVA 1....
Glossary: A Reference to All My Articles

I wanted to provide a method for readers to be able to find any and all of my articles in one place with some sort of segregation, which I find hard to do with Medium; this is why I have created a sort of index below which lists all the articles based on some defining topic.
Research Papers
Diffusion Models
Understanding DiT: Transformer-Driven Diffusion for Image Synthesis
Understanding ControlNet: A Way to Boss Around Your Diffusion Model
Understanding LDMs: Diffusion in Latent Space for Efficient Resource Utilization
Understanding Classifier Guidance: Steering Diffusion...
Understanding DiT: Transformer-Driven Diffusion for Image Synthesis

Paper: Scalable Diffusion Models with Transformers (https://arxiv.org/pdf/2212.09748)
DiT: Diffusion Transformer
Mar 2023, UCB, NYU
The authors of this paper had a single goal in mind: to show superior performance by transformers compared to conv UNets in the LDM diffusion process. Before this paper, very little research used transformers in generative vision modeling; this paper changed that by showing that the inductive bias inherent in U-Nets is not a requirement for diffusion models to work.
(to understand diffusion models, please read the...
Understanding LLaVa: Merging Visual Perception with Large Language Models

Paper: Visual Instruction Tuning (https://arxiv.org/pdf/2304.08485)
LLaVa: Large Language and Vision Assistant
Dec 2023, UW Madison, Microsoft, Columbia
While there are a lot of models that have been made to follow instructions in text format, there is not a lot of research in multimodal models where the instruction is followed based on the information present in the image. This paper aims to address this gap by generating a relevant dataset using powerful LLMs (such as GPT) and then creating a large multimodal model using pre-trained LLMs and ViTs that are then fine-tuned...
Understanding ControlNet: A Way to Boss Around Your Diffusion Model

Paper: Adding Conditional Control to Text-to-Image Diffusion Models (https://arxiv.org/pdf/2302.05543)
Goal
While umpteen models exist that allow image generation using texts, the outputs are usually different from what the user visualized in their mind, and it does take a bit of tweaking and prompt engineering to get a satisfactory image. ControlNet is a method that is introduced to add spatial conditioning to a pretrained text-to-image diffusion model that produces outputs that are much closer to the user’s imagination with minimal effort. This spatial conditioning can be...
Overview of all optimizers: Making Bad Choices Faster Since 1986

While PyTorch, Keras, and other similar libraries handle gradient graph calculation based on the forward pass, it is actually the optimizer’s job to update the weights of the model at every update step. There are several optimizers to choose from, each with their own advantages and disadvantages. We shall discuss the following optimizers here:
Stochastic Gradient Descent (SGD)
SGD is, by far, the simplest and easiest optimizer to understand as it’s the first method of updating model weights that anybody can think of intuitively.
In SGD, the weights are updated with a value...
Understanding PyTorch’s Distributed Communication Package: Powering all distributed model training

PyTorch provides a very powerful API to communicate between multiple accelerator (GPU, CPU, etc.) instances when performing distributed model training, leading to high efficiency and optimization.
This API is powered by one of multiple backends. Since most of today’s training occurs on GPUs, the NCCL backend is the most commonly used backend, especially when training LLMs which require InfiniBand for efficient training. The next most observed backend is Gloo, followed by MPI, which allow training on multiple CPUs. PyTorch takes care of interfacing with the backend for us,...
Understanding InfoVAE: Rethinking the ELBO to Avoid Posterior Collapse

Paper: InfoVAE: Balancing Learning and Inference in Variational Autoencoders (https://arxiv.org/pdf/1706.02262)
Goal: To modify the ELBO such that the opposing forces of the reconstruction loss and the KL divergence are decoupled while optimizing for generation.
The authors first start by giving a recap of VAEs. The math behind the vanilla VAE can be found here.
The authors then state that the ELBO can theoretically be maximized even with inaccurate estimated posteriors. They say that “good ELBO values do not imply accurate inference.” They go on to prove this mathematically,...
Understanding VampPrior: Identifying More Expressive VAE Priors

Paper: VAE with a VampPrior (https://arxiv.org/pdf/1705.07120)
Choosing a simple standard normal distribution as a prior for a VAE model has been shown to lead to over-regularization of the model and to insufficient utilization of the available latent space. The authors of this paper identify a new method to estimate a prior much closer to the true posterior while training the VAE, addressing the issues of the vanilla VAE.
The math behind it
A VAE aims to optimize for two opposing terms simultaneously: the negative log likelihood of our desired data distribution, and the...
Understanding SR3: Super-Resolution using Diffusion Models

Paper: Image Super-Resolution via Iterative Refinement (https://arxiv.org/pdf/2104.07636)
In this paper, the authors introduce a new method to perform super-resolution (SR) on low resolution (LR) images using diffusion models. The existing methods at that time involved a different set of algorithms (regressive, adversarial, flow-based, etc.), each of which had its own set of issues. This paper tries to solve these issues with diffusion models and shows better results based on human perception of the model outputs.
The training process
(to understand diffusion models, please read the...
Understanding MultiResUNet: Multi-Resolution Pathways for (Medical) Image Segmentation

Paper: MultiResUNet : Rethinking the U-Net Architecture for Multimodal Biomedical Image Segmentation (https://arxiv.org/pdf/1902.04049)
In this paper, the authors attempt to improve upon the classical U-Net architecture by identifying the aspects where it falls short and incorporating changes accordingly.
Modifying the UNet block
Instead of the two 3x3 conv layers in each scale of the U-Net encoder and decoder (which is approximately equivalent to a single 5x5 conv layer), they use three different kernel sizes (3, 5, and 7), and concatenate the outputs to get a single feature...
Understanding LDMs: Diffusion in Latent Space for Efficient Resource Utilization

Paper: High-Resolution Image Synthesis with Latent Diffusion Models (https://arxiv.org/pdf/2112.10752)
Diffusion models have provided us with a method to generate conditioned high quality samples from a model that trains in a stable manner, often achieving state-of-the-art results. But training diffusion models in the pixel-space can be costly in terms of time and resources. This paper aims to introduce latent diffusion models (LDM) that operate in the latent-space instead of the pixel-space to reduce the time and resource requirements drastically.
The two stages of training...