Vision Architectures Documentation#

Vision Architectures is a comprehensive library that implements various deep learning architectures for computer vision tasks, with a particular focus on 3D vision models. The library includes implementations of state-of-the-art architectures such as Vision Transformers (ViT), Swin Transformers, and their 3D variants.

Getting Started#

To use this library, install it via pip:

pip install vision-architectures

or, to develop it, clone it from github and run:

pip install -e .

Basic Usage Example:

import torch
from vision_architectures.nets.vit_3d import ViT3DModel

# Create a 3D Vision Transformer model
model = ViT3DModel(
     dim=768,
     num_heads=12,
     mlp_ratio=4,
     patch_size=(16, 16, 16),
     in_channels=1,
     encoder_depth=12,
     num_class_tokens=1,
     layer_norm_eps=1e-6,
)

# Forward pass with 3D input
x = torch.randn(1, 1, 128, 128, 128)  # Batch, Channels, Depth, Height, Width
spacings = torch.tensor([[1.0, 1.0, 1.0]])  # Voxel spacing information
class_tokens, encodings = model(x, spacings)

Contents: