Blog

Understanding MedSyn: Another Method of Diffusion-Based Medical Image Generation

Paper: MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT Images (https://arxiv.org/pdf/2310.03559)
(note that there are two independent papers introducing “MedSyn”; the other one is about generating clinical notes using language models)
Oct 2024, Boston University, Stanford University, Carnegie Mellon University, University of Pittsburgh
We discuss another paper that deals with generative AI in 3D computer vision, specifically CT scans. This particular paper was released at approximately the same time as the MAISI paper, after the GenerateCT paper.
For those...
Reinforcement Learning Part 5: Monte Carlo Learning

Up until now, we have assumed we know the dynamics model of our environment. However, this is often not the case, as perfect knowledge of the world is rarely available. For all future articles, we shall assume that we do not know the transition probabilities beforehand, but that their effects can be observed when the agent takes a specified action. Note that we are still operating in a finite state space and action space.
Since we no longer definitively know the dynamics model, we shall learn to approximate it using experience. This is known as Monte Carlo (MC) estimation. MC...
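As a taste of what is to come, here is a minimal sketch of first-visit Monte Carlo value estimation; the episode format and function name are made up for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    # episodes: list of trajectories, each a list of (state, reward) pairs
    # collected while following a fixed policy (hypothetical format).
    returns = defaultdict(list)
    for episode in episodes:
        # Discounted return G_t for every time step, computed back to front.
        g, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            g = reward + gamma * g
            rets[t] = g
        # First-visit: keep only the return from the first time each state appears.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(rets[t])
    # The value estimate is simply the average of the sampled returns.
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}
```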
Understanding GenerateCT: Advancing Synthetic CT Reconstruction through Generative Modeling

Paper: Text-Conditional Generation of 3D Chest CT Volumes (https://arxiv.org/pdf/2305.16037)
Jul 2024, University of Zurich, Istanbul Medipol University, ETH Zurich, Imperial College London, University of Pennsylvania, Stanford University
We previously discussed the MAISI paper, which showcased excellent generative capabilities for CT scans, even with conditioning. This particular paper was actually released before MAISI and has a slightly different scope, but its authors take an interesting approach to achieve their goal, which is why I believe it is worth...
Understanding LoRA: Fine-Tuning Large Models Without Breaking the GPU

Paper: LoRA: Low-Rank Adaptation of Large Language Models (https://arxiv.org/pdf/2106.09685)
Oct 2021, Microsoft
Goal
In today’s world, there exist many highly capable general-purpose LLMs. However, most use cases require specialized versions of these LLMs, which are not easy to produce. Setting aside data curation and preparation, which is itself quite demanding, current LLMs are multi-billion (and sometimes even multi-trillion) parameter models, and fine-tuning them requires a costly, high-resource multi-node setup. The authors of this paper aim to...
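As a rough sketch of the core idea (not the paper’s actual implementation; names and defaults are illustrative), LoRA keeps the pretrained weight frozen and trains only a low-rank update BA:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Sketch of a LoRA-adapted linear layer: the pretrained weight W stays frozen,
    # and only the low-rank factors A (r x in) and B (out x r) are trained.
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at the start
        self.scaling = alpha / r

    def forward(self, x):
        base = x @ self.weight.T                      # frozen pretrained path
        update = (x @ self.lora_A.T) @ self.lora_B.T  # low-rank path, few trainable params
        return base + self.scaling * update
```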
Understanding InternVL: Vision-Language Alignment through Cross-Attention Mechanisms

Paper: InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (https://arxiv.org/pdf/2312.14238)
Jan 2024, OpenGVLab, China
Goal
With the rapid growth in VLMs, the authors of this paper aim to create a VLM with a large vision model (6B) that is aligned with a comparably sized LLM (8B).
Instead of connecting the vision and language models using simple projection layers, the authors use cross-attention and progressively train the network to reach better results.
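As a generic illustration of the connection mechanism (not the paper’s exact module; dimensions and names are made up), cross-attention lets query tokens on the language side attend over the visual tokens:

```python
import torch
import torch.nn as nn

# Generic cross-attention: language-side query tokens attend over image tokens.
cross_attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16, batch_first=True)

queries = torch.randn(2, 96, 1024)        # e.g. learnable query tokens on the language side
image_tokens = torch.randn(2, 256, 1024)  # visual features from the ViT encoder

fused, _ = cross_attn(query=queries, key=image_tokens, value=image_tokens)
print(fused.shape)  # (2, 96, 1024): the queries now carry visual information
```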
Architecture
Vision model
The authors use a simple ViT model for the image...
Reinforcement Learning Part 4: Value Iteration and Policy Iteration

By now we are comfortable with the terminology used in RL, the Bellman equation, the Bellman Optimality equation (BOE), and how we can solve one iteration of the BOE mathematically. Those who want to read the previous parts can find the links to the articles at the bottom. Now we will formally define three algorithms that allow us to achieve optimal policies. These algorithms have the following requirements:
The state space and action space must be finite.
The dynamics model must be known, i.e., the transition probabilities for each set of inputs must be known.
All three...
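To make these requirements concrete, here is a minimal value-iteration sketch that assumes the dynamics model is available as a lookup table; the data structures are illustrative, not from any particular library:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-6):
    # p[(s, a)] is a list of (prob, next_state, reward) tuples: the known dynamics model.
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: take the best action value under the current v.
            best = max(sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)]) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(actions, key=lambda a: sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)]))
              for s in states}
    return v, policy
```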
Reinforcement Learning Part 3: The Bellman Optimality Equation and Optimal Policies

Those joining us directly at Part 3 should be familiar with the Bellman Equation and how it can be used to compare two policies. We have defined optimal policies and optimal state values. We also know how to calculate state values iteratively, using a step known as Policy Evaluation, rather than computing a computationally heavy matrix inverse. Those interested in previous parts can find the links at the bottom.
The Bellman Optimality Equation
The Bellman Optimality Equation (BOE) builds on the Bellman Equation and tries to express the state values that an agent can achieve...
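For reference, in the standard notation the BOE reads:

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr]

i.e., the optimal value of a state is obtained by acting greedily with respect to the optimal values of its successor states.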
Reinforcement Learning Part 2: The Bellman Equation and its Importance

For those joining this series directly at Part 2, Part 1 contained a list of terminology that is commonly used in RL, all of which the reader should be familiar with. Part 1 is linked at the bottom of this post.
How do we know which policy is better?
Let us say that for any given RL problem, we have to find the “best” policy for the agent to follow. To find the best policy, we first need to define what makes one policy better than another. Intuitively speaking, a policy is better if the agent receives higher rewards. Mathematically, state values provide us with a way to...
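Recall that the state value of s under a policy π is the expected discounted return obtained by starting in s and following π:

v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]

so one policy is at least as good as another if its state values are greater than or equal to the other’s in every state.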
Understanding MAISI: Synthetic CT Scan Generation with Medical AI Diffusion Models

Paper: MAISI: Medical AI for Synthetic Imaging (https://arxiv.org/pdf/2409.11169)
Oct 2024, NVIDIA
With generative AI being all the rage right now, diffusion models have emerged as a highly effective means of creating high-fidelity, high-diversity images. While much of the research in this field is aimed at controlling the outputs of these models, a small section of research is devoted to generating radiological scans, with an even smaller section of research aimed at generating 3D radiological scans, specifically Computed Tomography (CT) scans.
One may question why I am even...
Reinforcement Learning Part 1: Basic Terminology and the Markov Decision Process

Reinforcement learning is a branch of AI that has recently gained renewed attention through cutting-edge research and experimentation. While traditionally considered a slow and unstable training paradigm, its role in advancing large language models and the broader pursuit of AGI has revealed new benefits, making RL a powerful strategy that every AI scientist should have in their toolkit.
Terminology
Let us begin this with the most basic terminology that we will encounter throughout this series:
Agent: An entity that perceives its environment and takes actions autonomously...
Understanding Classifier-Free Guidance: Improving Control in Diffusion Models Without Additional Networks
Paper: Classifier-Free Diffusion Guidance (https://arxiv.org/pdf/2207.12598)
Jul 2022, Google Research
Goal
The authors of this paper set out to achieve the results of Classifier Guidance without the classifier, which is the main innovation of that method. Avoiding the disadvantages of this classifier is the primary motivation behind this paper. Additionally, they are able to do this with a single stage of training, compared to the two networks required to...
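In the standard classifier-free guidance formulation, a single network is trained with the condition randomly dropped, and at sampling time the conditional and unconditional noise predictions are blended:

\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing)

where w controls the guidance strength.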
Understanding Classifier Guidance: Steering Diffusion Models with Gradient Signals

Paper: Diffusion Models Beat GANs on Image Synthesis (https://openreview.net/pdf?id=AAWuCvzaVt)
Jun 2021, OpenAI
Goal
It is well known that while GANs have superior image fidelity, they lack output diversity. Diffusion Models (DMs) are a noteworthy competitor in the domain of image synthesis, providing high diversity in their outputs, but they lack controllability. The authors of this paper express two reasons for this disparity: extensive research on GANs leading to better results, and the inherent capability of GANs to trade...
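For context, classifier guidance steers sampling by nudging the predicted noise with the gradient of a separately trained classifier; in the usual notation, roughly:

\hat{\epsilon} = \epsilon_\theta(x_t) - s\,\sqrt{1 - \bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t)

where s is the guidance scale.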
Understanding DETR: When Transformers Decide Bounding Boxes Are Just Another Sequence

Paper: End-to-End Object Detection with Transformers (https://arxiv.org/pdf/2005.12872)
May 2020, Facebook AI
DETR: Detection Transformer
Traditionally, detection models use complex strategies that employ CNNs along with additional non-DL algorithms to generate classified bounding boxes for objects of interest. The authors of this paper broke this trend by targeting the same goal with a simple transformer-based architecture and a smart method of calculating the loss. The authors strip detection architectures of region proposal networks,...
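As a small illustration of the set-prediction idea (heavily simplified; DETR’s real matching cost combines class probabilities and box terms such as GIoU), the one-to-one assignment between predictions and ground truth can be computed with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: cost[i, j] = matching cost between predicted box i and ground-truth object j.
# Random here purely for illustration.
cost = np.random.rand(5, 3)  # 5 predictions, 3 ground-truth objects

pred_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one assignment minimizing total cost
print(list(zip(pred_idx, gt_idx)))  # unmatched predictions are supervised as "no object"
```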
Understanding LLaVa 1.5: Improvements in Training Recipes and Benchmarks

Paper: Improved Baselines with Visual Instruction Tuning (https://arxiv.org/pdf/2310.03744)
May 2024, UW-Madison, Microsoft Research
In this paper, the authors build upon their previous paper “Visual Instruction Tuning”. If you want to know more about LLaVa 1.0 i.e. the first paper, have a look here.
The core material presented in this paper can be divided into the following sections:
Comparison with other models
Architectural changes
Dataset changes
Results
Comparison with other models
The paper mainly focuses on the following competitors when it compares them with LLaVA 1....
Glossary: A Reference to All My Articles

I wanted to provide a way for readers to find any and all of my articles in one place with some sort of organization, which I find hard to do on Medium; this is why I have created the index below, which lists all the articles grouped by topic.
Research Papers
Diffusion Models
Understanding DiT: Transformer-Driven Diffusion for Image Synthesis
Understanding ControlNet: A Way to Boss Around Your Diffusion Model
Understanding LDMs: Diffusion in Latent Space for Efficient Resource Utilization
Understanding Classifier Guidance: Steering Diffusion...
Understanding DiT: Transformer-Driven Diffusion for Image Synthesis

Paper: Scalable Diffusion Models with Transformers (https://arxiv.org/pdf/2212.09748)
DiT: Diffusion Transformer
Mar 2023, UCB, NYU
The authors of this paper had a single goal in mind: to show superior performance by transformers compared to convolutional U-Nets in the LDM diffusion process. Before this paper, very limited research existed that employed transformers in generative vision modeling; this paper changed that by showing that the inductive biases inherent in U-Nets are not a requirement for diffusion models to work.
(to understand diffusion models, please read the...
Understanding LLaVa: Merging Visual Perception with Large Language Models

Paper: Visual Instruction Tuning (https://arxiv.org/pdf/2304.08485)
LLaVa: Large Language and Vision Assistant
Dec 2023, UW Madison, Microsoft, Columbia
While there are a lot of models that have been built to follow instructions in text form, there is not a lot of research on multimodal models where the instruction is followed based on information present in an image. This paper aims to address this gap by generating a relevant dataset using powerful LLMs (such as GPT) and then creating a large multimodal model using pre-trained LLMs and ViTs that are then fine-tuned...
Understanding ControlNet: A Way to Boss Around Your Diffusion Model

Paper: Adding Conditional Control to Text-to-Image Diffusion Models (https://arxiv.org/pdf/2302.05543)
Goal
While umpteen models exist that allow image generation from text, the outputs are usually different from what the user visualized in their mind, and it does take a bit of tweaking and prompt engineering to get a satisfactory image. ControlNet is a method introduced to add spatial conditioning to a pretrained text-to-image diffusion model, producing outputs that are much closer to the user’s imagination with minimal effort. This spatial conditioning can be...
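As a rough sketch of the mechanism (simplified and with made-up names, not the official implementation), ControlNet adds a trainable copy of the pretrained blocks whose contribution enters through zero-initialized convolutions, so training starts from the unmodified model:

```python
import copy
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution initialized to zero, so the control branch contributes nothing at first.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    # Simplified: a frozen pretrained block plus a trainable copy fed with the conditioning signal
    # (assumes the condition has already been mapped to the same shape as x).
    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.trainable_copy = copy.deepcopy(pretrained_block)
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad = False
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x, condition):
        control = self.trainable_copy(x + self.zero_in(condition))
        return self.locked(x) + self.zero_out(control)
```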
Overview of all optimizers: Making Bad Choices Faster Since 1986

While PyTorch, Keras, and other similar libraries handle building the gradient graph from the forward pass and computing the gradients, it is actually the optimizer’s job to update the weights of the model at every update step. There are several optimizers to choose from, each with its own advantages and disadvantages. We shall discuss the following optimizers here:
Stochastic Gradient Descent (SGD)
SGD is, by far, the simplest and easiest optimizer to understand, as it is the first method of updating model weights that anybody would think of intuitively.
In SGD, the weights are updated with a value...
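For reference, the plain SGD update for parameters θ with learning rate η is:

\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t)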
Understanding PyTorch’s Distributed Communication Package: Powering all distributed model training

PyTorch provides a very powerful API for communication between multiple accelerator (GPU, CPU, etc.) instances when performing distributed model training, enabling efficient and well-optimized training.
This API is powered by one of several backends. Since most of today’s training occurs on GPUs, NCCL is the most commonly used backend, especially when training LLMs, which rely on fast interconnects such as InfiniBand for efficient multi-node training. The next most common backend is Gloo, followed by MPI, both of which allow training on multiple CPUs. PyTorch takes care of interfacing with the backend for us,...
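A minimal usage sketch (assuming the usual torchrun-provided environment variables and at least one GPU per process; error handling omitted):

```python
import torch
import torch.distributed as dist

# Typically launched with `torchrun --nproc_per_node=N script.py`, which sets
# RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT in the environment.
dist.init_process_group(backend="nccl")  # use "gloo" for CPU-only training

rank = dist.get_rank()
tensor = torch.ones(1, device=f"cuda:{rank % torch.cuda.device_count()}") * rank

# All-reduce: every process ends up with the sum of the tensors from all ranks.
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {tensor.item()}")

dist.destroy_process_group()
```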
Understanding InfoVAE: Rethinking the ELBO to Avoid Posterior Collapse

Paper: InfoVAE: Balancing Learning and Inference in Variational Autoencoders (https://arxiv.org/pdf/1706.02262)
Goal: To modify the ELBO such that the opposing forces of the reconstruction loss and the KL divergence are decoupled while optimizing for generation.
The authors first start by giving a recap of VAEs. The math behind the vanilla VAE can be found here.
The authors then state that the ELBO can theoretically be maximized even with inaccurate estimated posteriors. They say that “good ELBO values do not imply accurate inference.” They go on to prove this mathematically,...
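For reference, the standard ELBO under discussion is:

\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\bigl[ \log p_\theta(x \mid z) \bigr] - D_{\mathrm{KL}}\bigl( q_\phi(z \mid x) \,\|\, p(z) \bigr)

where the first term drives reconstruction and the KL term pulls the approximate posterior toward the prior.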
Understanding VampPrior: Identifying More Expressive VAE Priors

Paper: VAE with a VampPrior (https://arxiv.org/pdf/1705.07120)
Choosing a simple standard normal distribution as the prior for a VAE has been shown to lead to over-regularization of the model and insufficient utilization of the available latent space. The authors of this paper identify a new method to estimate a prior much closer to the true posterior while training the VAE, addressing these issues of the vanilla VAE.
The math behind it
A VAE aims to optimize for two opposing terms simultaneously: the negative log likelihood of our desired data distribution, and the...
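Concretely, the VampPrior replaces the standard normal prior with a mixture of variational posteriors evaluated at K learned pseudo-inputs u_k:

p_\lambda(z) = \frac{1}{K} \sum_{k=1}^{K} q_\phi(z \mid u_k)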
Understanding SR3: Super-Resolution using Diffusion Models

Paper: Image Super-Resolution via Iterative Refinement (https://arxiv.org/pdf/2104.07636)
In this paper, the authors introduce a new method to perform super-resolution (SR) on low-resolution (LR) images using diffusion models. The existing methods at the time involved a different set of algorithms (regression-based, adversarial, flow-based, etc.), each with its own set of issues. This paper tries to solve these issues with diffusion models and shows better results based on human perception of the model outputs.
The training process
(to understand diffusion models, please read the...
Understanding MultiResUNet: Multi-Resolution Pathways for (Medical) Image Segmentation

Paper: MultiResUNet: Rethinking the U-Net Architecture for Multimodal Biomedical Image Segmentation (https://arxiv.org/pdf/1902.04049)
In this paper, the authors attempt to improve upon the classical U-Net architecture by identifying the aspects where it lacks and incorporating changes accordingly.
Modifying the UNet block
Instead of the two 3x3 conv layers in each scale of the U-Net encoder and decoder (which together are approximately equivalent to a single 5x5 conv layer), they use three different kernel sizes (3, 5, and 7) and concatenate the outputs to get a single feature...
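A simplified PyTorch sketch of the idea as described above (illustrative only; the paper’s actual block differs in how the larger receptive fields and channel counts are realized):

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    # Simplified sketch: parallel 3x3 / 5x5 / 7x7 convolutions whose outputs are
    # concatenated into a single feature map (channel split is illustrative).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        c = out_ch // 3
        self.conv3 = nn.Conv2d(in_ch, c, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, c, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(in_ch, out_ch - 2 * c, kernel_size=7, padding=3)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(torch.cat([self.conv3(x), self.conv5(x), self.conv7(x)], dim=1))
```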
Understanding LDMs: Diffusion in Latent Space for Efficient Resource Utilization

Paper: High-Resolution Image Synthesis with Latent Diffusion Models (https://arxiv.org/pdf/2112.10752)
Diffusion models have provided us with a method to generate conditioned, high-quality samples from a model that trains in a stable manner, often achieving state-of-the-art results. But training diffusion models in pixel space can be costly in terms of time and resources. This paper introduces latent diffusion models (LDMs), which operate in the latent space instead of the pixel space to drastically reduce time and resource requirements.
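Schematically, an LDM first trains an autoencoder (E, D) and then runs diffusion on the latents z = E(x), with the usual noise-prediction objective applied in latent space:

\mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, I),\, t}\Bigl[ \bigl\| \epsilon - \epsilon_\theta(z_t, t) \bigr\|_2^2 \Bigr]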
The two stages of training...