ViT3D
- pydantic model vision_architectures.nets.vit_3d.ViT3DEncoderConfig [source]
  Bases: Attention3DWithMLPConfig
  JSON schema:
{ "title": "ViT3DEncoderConfig", "type": "object", "properties": { "dim": { "description": "Dimension of the input and output features.", "title": "Dim", "type": "integer" }, "num_heads": { "description": "Number of query heads", "title": "Num Heads", "type": "integer" }, "ratio_q_to_kv_heads": { "default": 1, "description": "Ratio of query heads to key/value heads. Useful for MQA/GQA.", "title": "Ratio Q To Kv Heads", "type": "integer" }, "logit_scale_learnable": { "default": false, "description": "Whether the logit scale is learnable.", "title": "Logit Scale Learnable", "type": "boolean" }, "attn_drop_prob": { "default": 0.0, "description": "Dropout probability for attention weights.", "title": "Attn Drop Prob", "type": "number" }, "proj_drop_prob": { "default": 0.0, "description": "Dropout probability for the projection layer.", "title": "Proj Drop Prob", "type": "number" }, "max_attention_batch_size": { "default": -1, "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference. (This happens along batch dimension).", "title": "Max Attention Batch Size", "type": "integer" }, "rotary_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/RotaryPositionEmbeddings3DConfig" }, { "type": "null" } ], "default": null, "description": "Config for rotary position embeddings" }, "mlp_ratio": { "default": 4, "description": "Ratio of the hidden dimension in the MLP to the input dimension.", "title": "Mlp Ratio", "type": "integer" }, "activation": { "default": "gelu", "description": "Activation function for the MLP.", "title": "Activation", "type": "string" }, "mlp_drop_prob": { "default": 0.0, "description": "Dropout probability for the MLP.", "title": "Mlp Drop Prob", "type": "number" }, "norm_location": { "default": "post", "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.", "enum": [ "pre", "post" ], "title": "Norm Location", "type": "string" }, "layer_norm_eps": { "default": 1e-06, "description": "Epsilon value for the layer normalization.", "title": "Layer Norm Eps", "type": "number" }, "encoder_depth": { "description": "Number of encoder blocks.", "title": "Encoder Depth", "type": "integer" } }, "$defs": { "RotaryPositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "base": { "default": 10000.0, "description": "Base value for the exponent.", "title": "Base", "type": "number" }, "split": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "number" }, { "type": "number" }, { "type": "number" } ], "type": "array" }, { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" } ], "default": [ 0.3333333333333333, 0.3333333333333333, 0.3333333333333333 ], "description": "Split of the position embeddings. If float, converted to int based on self.dim", "title": "Split" } }, "title": "RotaryPositionEmbeddings3DConfig", "type": "object" } }, "required": [ "dim", "num_heads", "encoder_depth" ] }
- Config:
arbitrary_types_allowed: bool = True
extra: str = ignore
validate_default: bool = True
validate_assignment: bool = True
validate_return: bool = True
- Fields:
  - encoder_depth (int)
- field encoder_depth: int [Required]
  Number of encoder blocks.
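A minimal construction sketch: only the three required fields (dim, num_heads, encoder_depth) are supplied and everything else falls back to the defaults listed in the schema above. The concrete numbers are illustrative, not prescribed by the library.

```python
from vision_architectures.nets.vit_3d import ViT3DEncoderConfig

# Only dim, num_heads and encoder_depth are required; the remaining fields
# (mlp_ratio, norm_location, dropout probabilities, ...) keep the defaults
# shown in the schema above. Values here are illustrative.
config = ViT3DEncoderConfig(dim=96, num_heads=8, encoder_depth=4)

print(config.mlp_ratio)      # 4 (default)
print(config.norm_location)  # "post" (default)
```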
- pydantic model vision_architectures.nets.vit_3d.ViT3DEncoderWithPatchEmbeddingsConfig [source]
  Bases: ViT3DEncoderConfig, PatchEmbeddings3DConfig
  JSON schema:
{ "title": "ViT3DEncoderWithPatchEmbeddingsConfig", "type": "object", "properties": { "in_channels": { "description": "Number of input channels.", "title": "In Channels", "type": "integer" }, "out_channels": { "default": null, "title": "Out Channels", "type": "null" }, "kernel_size": { "default": null, "title": "Kernel Size", "type": "null" }, "padding": { "anyOf": [ { "type": "integer" }, { "items": { "type": "integer" }, "type": "array" }, { "type": "string" } ], "default": "same", "description": "Padding for the convolution. Can be 'same' or an integer/tuple of integers.", "title": "Padding" }, "stride": { "anyOf": [ { "type": "integer" }, { "items": { "type": "integer" }, "type": "array" } ], "default": 1, "description": "Stride for the convolution", "title": "Stride" }, "conv_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the convolution layer", "title": "Conv Kwargs", "type": "object" }, "transposed": { "default": false, "description": "Whether to perform ConvTranspose instead of Conv", "title": "Transposed", "type": "boolean" }, "normalization": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": "batchnorm3d", "description": "Normalization layer type.", "title": "Normalization" }, "normalization_pre_args": { "default": [], "description": "Arguments for the normalization layer before providing the dimension. Useful when using GroupNorm layers are being used to specify the number of groups.", "items": {}, "title": "Normalization Pre Args", "type": "array" }, "normalization_post_args": { "default": [], "description": "Arguments for the normalization layer after providing the dimension.", "items": {}, "title": "Normalization Post Args", "type": "array" }, "normalization_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the normalization layer", "title": "Normalization Kwargs", "type": "object" }, "activation": { "default": "gelu", "description": "Activation function for the MLP.", "title": "Activation", "type": "string" }, "activation_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the activation function.", "title": "Activation Kwargs", "type": "object" }, "sequence": { "default": "CNA", "description": "Sequence of operations in the block.", "enum": [ "C", "AC", "CA", "CD", "CN", "DC", "NC", "ACD", "ACN", "ADC", "ANC", "CAD", "CAN", "CDA", "CDN", "CNA", "CND", "DAC", "DCA", "DCN", "DNC", "NAC", "NCA", "NCD", "NDC", "ACDN", "ACND", "ADCN", "ADNC", "ANCD", "ANDC", "CADN", "CAND", "CDAN", "CDNA", "CNAD", "CNDA", "DACN", "DANC", "DCAN", "DCNA", "DNAC", "DNCA", "NACD", "NADC", "NCAD", "NCDA", "NDAC", "NDCA" ], "title": "Sequence", "type": "string" }, "drop_prob": { "default": 0.0, "description": "Dropout probability.", "title": "Drop Prob", "type": "number" }, "patch_size": { "description": "Size of the patches to extract from the input.", "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "title": "Patch Size", "type": "array" }, "dim": { "description": "Dimension of the input and output features.", "title": "Dim", "type": "integer" }, "norm_layer": { "default": "layernorm", "description": "Normalization layer to use.", "title": "Norm Layer", "type": "string" }, "num_heads": { "description": "Number of query heads", "title": "Num Heads", "type": "integer" }, "ratio_q_to_kv_heads": { "default": 1, "description": "Ratio of query heads to 
key/value heads. Useful for MQA/GQA.", "title": "Ratio Q To Kv Heads", "type": "integer" }, "logit_scale_learnable": { "default": false, "description": "Whether the logit scale is learnable.", "title": "Logit Scale Learnable", "type": "boolean" }, "attn_drop_prob": { "default": 0.0, "description": "Dropout probability for attention weights.", "title": "Attn Drop Prob", "type": "number" }, "proj_drop_prob": { "default": 0.0, "description": "Dropout probability for the projection layer.", "title": "Proj Drop Prob", "type": "number" }, "max_attention_batch_size": { "default": -1, "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference. (This happens along batch dimension).", "title": "Max Attention Batch Size", "type": "integer" }, "rotary_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/RotaryPositionEmbeddings3DConfig" }, { "type": "null" } ], "default": null, "description": "Config for rotary position embeddings" }, "mlp_ratio": { "default": 4, "description": "Ratio of the hidden dimension in the MLP to the input dimension.", "title": "Mlp Ratio", "type": "integer" }, "mlp_drop_prob": { "default": 0.0, "description": "Dropout probability for the MLP.", "title": "Mlp Drop Prob", "type": "number" }, "norm_location": { "default": "post", "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.", "enum": [ "pre", "post" ], "title": "Norm Location", "type": "string" }, "layer_norm_eps": { "default": 1e-06, "description": "Epsilon value for the layer normalization.", "title": "Layer Norm Eps", "type": "number" }, "encoder_depth": { "description": "Number of encoder blocks.", "title": "Encoder Depth", "type": "integer" }, "absolute_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/AbsolutePositionEmbeddings3DConfig" }, { "type": "null" } ], "default": {} }, "num_class_tokens": { "description": "Number of class tokens to be added.", "title": "Num Class Tokens", "type": "integer" } }, "$defs": { "AbsolutePositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "grid_size": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" }, { "type": "null" } ], "default": null, "description": "Size of entire patch matrix.", "title": "Grid Size" }, "learnable": { "default": false, "description": "Whether the position embeddings are learnable.", "title": "Learnable", "type": "boolean" } }, "title": "AbsolutePositionEmbeddings3DConfig", "type": "object" }, "RotaryPositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "base": { "default": 10000.0, "description": "Base value for the exponent.", "title": "Base", "type": "number" }, "split": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "number" }, { "type": "number" }, { "type": "number" } ], "type": "array" }, { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" } ], "default": [ 0.3333333333333333, 0.3333333333333333, 0.3333333333333333 
], "description": "Split of the position embeddings. If float, converted to int based on self.dim", "title": "Split" } }, "title": "RotaryPositionEmbeddings3DConfig", "type": "object" } }, "required": [ "in_channels", "patch_size", "dim", "num_heads", "encoder_depth", "num_class_tokens" ] }
- Config:
arbitrary_types_allowed: bool = True
extra: str = ignore
validate_default: bool = True
validate_assignment: bool = True
validate_return: bool = True
- Fields:
  - absolute_position_embeddings_config (AbsolutePositionEmbeddings3DConfig | None)
  - num_class_tokens (int)
- field absolute_position_embeddings_config: AbsolutePositionEmbeddings3DConfig | None = {}
- field num_class_tokens: int [Required]
  Number of class tokens to be added.
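Because this config merges the two parents, its required fields span both of them plus num_class_tokens. A sketch with illustrative values:

```python
from vision_architectures.nets.vit_3d import ViT3DEncoderWithPatchEmbeddingsConfig

# Required fields come from both parents: in_channels and patch_size from
# PatchEmbeddings3DConfig; dim, num_heads and encoder_depth from
# ViT3DEncoderConfig; plus num_class_tokens. All values are illustrative.
config = ViT3DEncoderWithPatchEmbeddingsConfig(
    in_channels=1,
    patch_size=(8, 16, 16),
    dim=96,
    num_heads=8,
    encoder_depth=4,
    num_class_tokens=1,
)
```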
- pydantic model vision_architectures.nets.vit_3d.ViT3DDecoderConfig [source]
  Bases: Attention3DWithMLPConfig
  JSON schema:
{ "title": "ViT3DDecoderConfig", "type": "object", "properties": { "dim": { "description": "Dimension of the input and output features.", "title": "Dim", "type": "integer" }, "num_heads": { "description": "Number of query heads", "title": "Num Heads", "type": "integer" }, "ratio_q_to_kv_heads": { "default": 1, "description": "Ratio of query heads to key/value heads. Useful for MQA/GQA.", "title": "Ratio Q To Kv Heads", "type": "integer" }, "logit_scale_learnable": { "default": false, "description": "Whether the logit scale is learnable.", "title": "Logit Scale Learnable", "type": "boolean" }, "attn_drop_prob": { "default": 0.0, "description": "Dropout probability for attention weights.", "title": "Attn Drop Prob", "type": "number" }, "proj_drop_prob": { "default": 0.0, "description": "Dropout probability for the projection layer.", "title": "Proj Drop Prob", "type": "number" }, "max_attention_batch_size": { "default": -1, "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference. (This happens along batch dimension).", "title": "Max Attention Batch Size", "type": "integer" }, "rotary_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/RotaryPositionEmbeddings3DConfig" }, { "type": "null" } ], "default": null, "description": "Config for rotary position embeddings" }, "mlp_ratio": { "default": 4, "description": "Ratio of the hidden dimension in the MLP to the input dimension.", "title": "Mlp Ratio", "type": "integer" }, "activation": { "default": "gelu", "description": "Activation function for the MLP.", "title": "Activation", "type": "string" }, "mlp_drop_prob": { "default": 0.0, "description": "Dropout probability for the MLP.", "title": "Mlp Drop Prob", "type": "number" }, "norm_location": { "default": "post", "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.", "enum": [ "pre", "post" ], "title": "Norm Location", "type": "string" }, "layer_norm_eps": { "default": 1e-06, "description": "Epsilon value for the layer normalization.", "title": "Layer Norm Eps", "type": "number" }, "decoder_depth": { "description": "Number of decoder blocks.", "title": "Decoder Depth", "type": "integer" } }, "$defs": { "RotaryPositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "base": { "default": 10000.0, "description": "Base value for the exponent.", "title": "Base", "type": "number" }, "split": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "number" }, { "type": "number" }, { "type": "number" } ], "type": "array" }, { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" } ], "default": [ 0.3333333333333333, 0.3333333333333333, 0.3333333333333333 ], "description": "Split of the position embeddings. If float, converted to int based on self.dim", "title": "Split" } }, "title": "RotaryPositionEmbeddings3DConfig", "type": "object" } }, "required": [ "dim", "num_heads", "decoder_depth" ] }
- Config:
arbitrary_types_allowed: bool = True
extra: str = ignore
validate_default: bool = True
validate_assignment: bool = True
validate_return: bool = True
- Fields:
  - decoder_depth (int)
- field decoder_depth: int [Required]
  Number of decoder blocks.
- pydantic model vision_architectures.nets.vit_3d.ViT3DDecoderWithPatchEmbeddingsConfig [source]
  Bases: ViT3DDecoderConfig, PatchEmbeddings3DConfig
  JSON schema:
{ "title": "ViT3DDecoderWithPatchEmbeddingsConfig", "type": "object", "properties": { "in_channels": { "description": "Number of input channels.", "title": "In Channels", "type": "integer" }, "out_channels": { "default": null, "title": "Out Channels", "type": "null" }, "kernel_size": { "default": null, "title": "Kernel Size", "type": "null" }, "padding": { "anyOf": [ { "type": "integer" }, { "items": { "type": "integer" }, "type": "array" }, { "type": "string" } ], "default": "same", "description": "Padding for the convolution. Can be 'same' or an integer/tuple of integers.", "title": "Padding" }, "stride": { "anyOf": [ { "type": "integer" }, { "items": { "type": "integer" }, "type": "array" } ], "default": 1, "description": "Stride for the convolution", "title": "Stride" }, "conv_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the convolution layer", "title": "Conv Kwargs", "type": "object" }, "transposed": { "default": false, "description": "Whether to perform ConvTranspose instead of Conv", "title": "Transposed", "type": "boolean" }, "normalization": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": "batchnorm3d", "description": "Normalization layer type.", "title": "Normalization" }, "normalization_pre_args": { "default": [], "description": "Arguments for the normalization layer before providing the dimension. Useful when using GroupNorm layers are being used to specify the number of groups.", "items": {}, "title": "Normalization Pre Args", "type": "array" }, "normalization_post_args": { "default": [], "description": "Arguments for the normalization layer after providing the dimension.", "items": {}, "title": "Normalization Post Args", "type": "array" }, "normalization_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the normalization layer", "title": "Normalization Kwargs", "type": "object" }, "activation": { "default": "gelu", "description": "Activation function for the MLP.", "title": "Activation", "type": "string" }, "activation_kwargs": { "additionalProperties": true, "default": {}, "description": "Additional keyword arguments for the activation function.", "title": "Activation Kwargs", "type": "object" }, "sequence": { "default": "CNA", "description": "Sequence of operations in the block.", "enum": [ "C", "AC", "CA", "CD", "CN", "DC", "NC", "ACD", "ACN", "ADC", "ANC", "CAD", "CAN", "CDA", "CDN", "CNA", "CND", "DAC", "DCA", "DCN", "DNC", "NAC", "NCA", "NCD", "NDC", "ACDN", "ACND", "ADCN", "ADNC", "ANCD", "ANDC", "CADN", "CAND", "CDAN", "CDNA", "CNAD", "CNDA", "DACN", "DANC", "DCAN", "DCNA", "DNAC", "DNCA", "NACD", "NADC", "NCAD", "NCDA", "NDAC", "NDCA" ], "title": "Sequence", "type": "string" }, "drop_prob": { "default": 0.0, "description": "Dropout probability.", "title": "Drop Prob", "type": "number" }, "patch_size": { "description": "Size of the patches to extract from the input.", "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "title": "Patch Size", "type": "array" }, "dim": { "description": "Dimension of the input and output features.", "title": "Dim", "type": "integer" }, "norm_layer": { "default": "layernorm", "description": "Normalization layer to use.", "title": "Norm Layer", "type": "string" }, "num_heads": { "description": "Number of query heads", "title": "Num Heads", "type": "integer" }, "ratio_q_to_kv_heads": { "default": 1, "description": "Ratio of query heads to 
key/value heads. Useful for MQA/GQA.", "title": "Ratio Q To Kv Heads", "type": "integer" }, "logit_scale_learnable": { "default": false, "description": "Whether the logit scale is learnable.", "title": "Logit Scale Learnable", "type": "boolean" }, "attn_drop_prob": { "default": 0.0, "description": "Dropout probability for attention weights.", "title": "Attn Drop Prob", "type": "number" }, "proj_drop_prob": { "default": 0.0, "description": "Dropout probability for the projection layer.", "title": "Proj Drop Prob", "type": "number" }, "max_attention_batch_size": { "default": -1, "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference. (This happens along batch dimension).", "title": "Max Attention Batch Size", "type": "integer" }, "rotary_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/RotaryPositionEmbeddings3DConfig" }, { "type": "null" } ], "default": null, "description": "Config for rotary position embeddings" }, "mlp_ratio": { "default": 4, "description": "Ratio of the hidden dimension in the MLP to the input dimension.", "title": "Mlp Ratio", "type": "integer" }, "mlp_drop_prob": { "default": 0.0, "description": "Dropout probability for the MLP.", "title": "Mlp Drop Prob", "type": "number" }, "norm_location": { "default": "post", "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.", "enum": [ "pre", "post" ], "title": "Norm Location", "type": "string" }, "layer_norm_eps": { "default": 1e-06, "description": "Epsilon value for the layer normalization.", "title": "Layer Norm Eps", "type": "number" }, "decoder_depth": { "description": "Number of decoder blocks.", "title": "Decoder Depth", "type": "integer" }, "absolute_position_embeddings_config": { "anyOf": [ { "$ref": "#/$defs/AbsolutePositionEmbeddings3DConfig" }, { "type": "null" } ], "default": {} }, "num_class_tokens": { "description": "Number of class tokens to be added.", "title": "Num Class Tokens", "type": "integer" } }, "$defs": { "AbsolutePositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "grid_size": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" }, { "type": "null" } ], "default": null, "description": "Size of entire patch matrix.", "title": "Grid Size" }, "learnable": { "default": false, "description": "Whether the position embeddings are learnable.", "title": "Learnable", "type": "boolean" } }, "title": "AbsolutePositionEmbeddings3DConfig", "type": "object" }, "RotaryPositionEmbeddings3DConfig": { "properties": { "dim": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Dimension of the position embeddings", "title": "Dim" }, "base": { "default": 10000.0, "description": "Base value for the exponent.", "title": "Base", "type": "number" }, "split": { "anyOf": [ { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "number" }, { "type": "number" }, { "type": "number" } ], "type": "array" }, { "maxItems": 3, "minItems": 3, "prefixItems": [ { "type": "integer" }, { "type": "integer" }, { "type": "integer" } ], "type": "array" } ], "default": [ 0.3333333333333333, 0.3333333333333333, 0.3333333333333333 
], "description": "Split of the position embeddings. If float, converted to int based on self.dim", "title": "Split" } }, "title": "RotaryPositionEmbeddings3DConfig", "type": "object" } }, "required": [ "in_channels", "patch_size", "dim", "num_heads", "decoder_depth", "num_class_tokens" ] }
- Config:
arbitrary_types_allowed: bool = True
extra: str = ignore
validate_default: bool = True
validate_assignment: bool = True
validate_return: bool = True
- Fields:
  - absolute_position_embeddings_config (AbsolutePositionEmbeddings3DConfig | None)
  - num_class_tokens (int)
- field absolute_position_embeddings_config: AbsolutePositionEmbeddings3DConfig | None = {}
- field num_class_tokens: int [Required]
  Number of class tokens to be added.
- class vision_architectures.nets.vit_3d.ViT3DEncoder(config={}, checkpointing_level=0, **kwargs) [source]
  Bases: Module, PyTorchModelHubMixin
  Vision Transformer encoder. This class is designed for 3D inputs, e.g. medical images, videos, etc.
- __init__(config={}, checkpointing_level=0, **kwargs) [source]
  Initialize the ViT3DEncoder.
  - Parameters:
    - config (ViT3DEncoderConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary, and the instance will be created automatically.
    - checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.
    - **kwargs – Additional keyword arguments for configuration.
- forward(embeddings, return_intermediates=False, channels_first=True, query_grid_shape=None, key_grid_shape=None) [source]
  Pass the input embeddings through the ViT encoder (self-attention).
  - Parameters:
    - embeddings (Tensor) – Tensor of shape (B, C, Z, Y, X), (B, Z, Y, X, C), or (B, N, C) representing the input features.
    - return_intermediates (bool) – Return intermediate outputs such as layer/block/stage outputs.
    - channels_first (bool) – Whether the inputs are in channels-first format (B, C, …) or not (B, …, C).
    - query_grid_shape (Optional[tuple[int, int, int]]) – Shape of the query tokens in 3D. Used to identify the actual 3D matrix and separate it from extra tokens (e.g. class tokens) when applying rotary position embeddings. Leading tokens are treated as extra tokens and only trailing tokens are used.
    - key_grid_shape (Optional[tuple[int, int, int]]) – Shape of the key tokens in 3D. Used to identify the actual 3D matrix and separate it from extra tokens (e.g. class tokens) when applying rotary position embeddings. Leading tokens are treated as extra tokens and only trailing tokens are used.
  - Return type:
    Tensor | tuple[Tensor, list[Tensor]]
  - Returns:
    Tensor of shape (B, C, Z, Y, X), (B, Z, Y, X, C), or (B, N, C) representing the output features. If return_intermediates is True, returns a tuple of the output embeddings and a list of intermediate embeddings in channels-last format.
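A minimal usage sketch based on the signatures above. The config is passed as a dictionary (the class builds the ViT3DEncoderConfig automatically); the batch size, embedding dimension, and grid sizes are assumptions chosen only for illustration.

```python
import torch
from vision_architectures.nets.vit_3d import ViT3DEncoder

# Config passed as a dict; dim=96, num_heads=8, encoder_depth=4 are illustrative.
encoder = ViT3DEncoder(
    config={"dim": 96, "num_heads": 8, "encoder_depth": 4},
    checkpointing_level=0,
)

# Channels-first 3D feature map: (B, C, Z, Y, X) = (2, 96, 4, 4, 4).
embeddings = torch.randn(2, 96, 4, 4, 4)
out = encoder(embeddings)  # output keeps the input layout

# With return_intermediates=True the call also returns the per-block outputs
# (in channels-last format, as described above).
out, intermediates = encoder(embeddings, return_intermediates=True)
```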
- class vision_architectures.nets.vit_3d.ViT3DDecoder(config={}, checkpointing_level=0, **kwargs) [source]
  Bases: Module, PyTorchModelHubMixin
  Vision Transformer decoder. This class is designed for 3D inputs, e.g. medical images, videos, etc.
- __init__(config={}, checkpointing_level=0, **kwargs) [source]
  Initialize the ViT3DDecoder.
  - Parameters:
    - config (ViT3DDecoderConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary, and the instance will be created automatically.
    - checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.
    - **kwargs – Additional keyword arguments for configuration.
- forward(q, kv, return_intermediates=False, channels_first=True, q_grid_shape=None, kv_grid_shape=None) [source]
  Pass the input embeddings through the ViT decoder (self-attention + cross-attention).
  - Parameters:
    - q (Tensor) – Input to the query matrix. Tensor of shape (B, C, Z, Y, X), (B, Z, Y, X, C), or (B, N, C) representing the input features.
    - kv (Tensor) – Input to the key and value matrices. Tensor of shape (B, C, Z, Y, X), (B, Z, Y, X, C), or (B, N, C) representing the input features.
    - return_intermediates (bool) – Return intermediate outputs such as layer/block/stage outputs.
    - channels_first (bool) – Whether the inputs are in channels-first format (B, C, …) or not (B, …, C).
    - q_grid_shape (Optional[tuple[int, int, int]]) – Shape of the query tokens in 3D. Used to identify the actual 3D matrix and separate it from extra tokens (e.g. class tokens) when applying rotary position embeddings. Leading tokens are treated as extra tokens and only trailing tokens are used.
    - kv_grid_shape (Optional[tuple[int, int, int]]) – Shape of the key/value tokens in 3D. Used to identify the actual 3D matrix and separate it from extra tokens (e.g. class tokens) when applying rotary position embeddings. Leading tokens are treated as extra tokens and only trailing tokens are used.
  - Return type:
    Tensor | tuple[Tensor, list[Tensor]]
  - Returns:
    Tensor of shape (B, C, Z, Y, X), (B, Z, Y, X, C), or (B, N, C) representing the output features. If return_intermediates is True, returns a tuple of the output embeddings and a list of intermediate embeddings in channels-last format.
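A minimal sketch of the decoder call, assuming the query and key/value streams share the embedding dimension of the config (an assumption for illustration; shapes and values below are not prescribed by the library).

```python
import torch
from vision_architectures.nets.vit_3d import ViT3DDecoder

decoder = ViT3DDecoder(config={"dim": 96, "num_heads": 8, "decoder_depth": 4})

# q drives the queries; kv drives the keys/values (e.g. features from an encoder).
# Both are channels-first here and share the embedding dimension dim=96.
q = torch.randn(2, 96, 4, 4, 4)   # (B, C, Z, Y, X)
kv = torch.randn(2, 96, 8, 8, 8)  # (B, C, Z, Y, X)
out = decoder(q, kv)              # same shape as q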
- class vision_architectures.nets.vit_3d.ViT3DEncoderWithPatchEmbeddings(config={}, checkpointing_level=0, **kwargs) [source]
  Bases: Module, PyTorchModelHubMixin
  Patchification of the input array followed by a ViT encoder. This class is designed for 3D inputs, e.g. medical images, videos, etc.
- __init__(config={}, checkpointing_level=0, **kwargs) [source]
  Initialize the ViT3DEncoderWithPatchEmbeddings.
  - Parameters:
    - config (ViT3DEncoderWithPatchEmbeddingsConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary, and the instance will be created automatically.
    - checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.
    - **kwargs – Additional keyword arguments for configuration.
- forward(pixel_values, spacings=None, channels_first=True, return_intermediates=False) [source]
  Patchify the input datapoint and then pass it through the ViT encoder (self-attention).
  - Parameters:
    - pixel_values (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the input features.
    - spacings (Optional[Tensor]) – Spacing information of shape (B, 3) for the input features.
    - channels_first (bool) – Whether the inputs are in channels-first format (B, C, …) or not (B, …, C).
    - return_intermediates (bool) – Return intermediate outputs such as layer/block/stage outputs.
  - Return type:
    tuple[Tensor, list[Tensor]] | tuple[Tensor, list[Tensor], list[Tensor]]
  - Returns:
    Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features. If return_intermediates is True, returns a tuple of the output embeddings and a list of intermediate embeddings.
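A usage sketch under the assumption that the input volume is evenly divisible by the patch size; the return value is not unpacked here, since its exact tuple structure follows the return type listed above. All config values are illustrative.

```python
import torch
from vision_architectures.nets.vit_3d import ViT3DEncoderWithPatchEmbeddings

# Illustrative config: a single-channel volume patchified into (8, 16, 16) patches.
model = ViT3DEncoderWithPatchEmbeddings(
    config={
        "in_channels": 1,
        "patch_size": (8, 16, 16),
        "dim": 96,
        "num_heads": 8,
        "encoder_depth": 4,
        "num_class_tokens": 1,
    }
)

# (B, C, Z, Y, X) = (2, 1, 32, 64, 64) -> a (4, 4, 4) grid of patch tokens.
pixel_values = torch.randn(2, 1, 32, 64, 64)
outputs = model(pixel_values)  # tuple structured as described in the return type above
```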
- class vision_architectures.nets.vit_3d.ViT3DDecoderWithPatchEmbeddings(config={}, checkpointing_level=0, **kwargs) [source]
  Bases: Module, PyTorchModelHubMixin
  Patchification of the input array followed by a ViT decoder. This class is designed for 3D inputs, e.g. medical images, videos, etc.
- __init__(config={}, checkpointing_level=0, **kwargs) [source]
  Initialize the ViT3DDecoderWithPatchEmbeddings.
  - Parameters:
    - config (ViT3DDecoderWithPatchEmbeddingsConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary, and the instance will be created automatically.
    - checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.
    - **kwargs – Additional keyword arguments for configuration.
- forward(pixel_values, kv=None, spacings=None, channels_first=True, return_intermediates=False) [source]
  Patchify the input datapoint and then pass it through the ViT decoder (self-attention + cross-attention).
  - Parameters:
    - pixel_values (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the input features.
    - kv (Optional[Tensor]) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the key/value features. This represents the cache from the encoder.
    - spacings (Optional[Tensor]) – Spacing information of shape (B, 3) for the input features.
    - channels_first (bool) – Whether the inputs are in channels-first format (B, C, …) or not (B, …, C).
    - return_intermediates (bool) – Return intermediate outputs such as layer/block/stage outputs.
  - Return type:
    tuple[Tensor, list[Tensor]] | tuple[Tensor, list[Tensor], list[Tensor]]
  - Returns:
    Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features. If return_intermediates is True, returns a tuple of the output embeddings and a list of intermediate embeddings.
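A usage sketch mirroring the encoder case. The kv tensor is assumed to be an encoder feature map already in the model's embedding dimension, which is how the "cache from the encoder" is interpreted here; shapes and values are illustrative.

```python
import torch
from vision_architectures.nets.vit_3d import ViT3DDecoderWithPatchEmbeddings

model = ViT3DDecoderWithPatchEmbeddings(
    config={
        "in_channels": 1,
        "patch_size": (8, 16, 16),
        "dim": 96,
        "num_heads": 8,
        "decoder_depth": 4,
        "num_class_tokens": 1,
    }
)

# Decoder-side volume, patchified internally into a (4, 4, 4) token grid.
pixel_values = torch.randn(2, 1, 32, 64, 64)  # (B, C, Z, Y, X)

# Assumed encoder cache: a feature map already in the embedding dimension (dim=96).
kv = torch.randn(2, 96, 4, 4, 4)

outputs = model(pixel_values, kv=kv)  # tuple structured as described in the return type above
```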