Transformer#

pydantic model vision_architectures.blocks.transformer.Attention1DMLPConfig[source]#

Bases: CustomBaseModel

JSON schema:
{
   "title": "Attention1DMLPConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      }
   },
   "required": [
      "dim"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field dim: int [Required]#

Dimension of the input and output features.

field mlp_ratio: int = 4#

Ratio of the hidden dimension in the MLP to the input dimension.

field activation: str = 'gelu'#

Activation function for the MLP.

field mlp_drop_prob: float = 0.0#

Dropout probability for the MLP.

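A minimal construction sketch (not part of the generated reference; it assumes the module path shown above and standard pydantic v2 behaviour for the config base class):

from vision_architectures.blocks.transformer import Attention1DMLPConfig

# Only dim is required; the remaining fields fall back to their defaults.
config = Attention1DMLPConfig(dim=256)
print(config.mlp_ratio, config.activation, config.mlp_drop_prob)  # 4 gelu 0.0

# With validate_assignment=True, later assignments are re-validated as well.
config.mlp_ratio = 2
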
pydantic model vision_architectures.blocks.transformer.Attention3DMLPConfig[source]#

Bases: Attention1DMLPConfig

JSON schema:
{
   "title": "Attention3DMLPConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      }
   },
   "required": [
      "dim"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


pydantic model vision_architectures.blocks.transformer.Attention1DWithMLPConfig[source]#

Bases: Attention1DMLPConfig, Attention1DConfig

JSON schema:
{
   "title": "Attention1DWithMLPConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "num_heads": {
         "description": "Number of query heads",
         "title": "Num Heads",
         "type": "integer"
      },
      "ratio_q_to_kv_heads": {
         "default": 1,
         "title": "Ratio Q To Kv Heads",
         "type": "integer"
      },
      "logit_scale_learnable": {
         "default": false,
         "title": "Logit Scale Learnable",
         "type": "boolean"
      },
      "attn_drop_prob": {
         "default": 0.0,
         "title": "Attn Drop Prob",
         "type": "number"
      },
      "proj_drop_prob": {
         "default": 0.0,
         "title": "Proj Drop Prob",
         "type": "number"
      },
      "max_attention_batch_size": {
         "default": -1,
         "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference.",
         "title": "Max Attention Batch Size",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      },
      "norm_location": {
         "default": "post",
         "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.",
         "enum": [
            "pre",
            "post"
         ],
         "title": "Norm Location",
         "type": "string"
      },
      "layer_norm_eps": {
         "default": 1e-06,
         "description": "Epsilon value for the layer normalization.",
         "title": "Layer Norm Eps",
         "type": "number"
      }
   },
   "required": [
      "dim",
      "num_heads"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field norm_location: Literal['pre', 'post'] = 'post'#

Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.

field layer_norm_eps: float = 1e-06#

Epsilon value for the layer normalization.

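A minimal construction sketch (same assumptions as the earlier config example); dim and num_heads are the only required fields, everything else keeps its default:

from vision_architectures.blocks.transformer import Attention1DWithMLPConfig

config = Attention1DWithMLPConfig(dim=384, num_heads=6, norm_location="pre")
print(config.ratio_q_to_kv_heads, config.mlp_ratio, config.layer_norm_eps)  # 1 4 1e-06
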
pydantic model vision_architectures.blocks.transformer.Attention3DWithMLPConfig[source]#

Bases: Attention3DMLPConfig, Attention3DConfig

JSON schema:
{
   "title": "Attention3DWithMLPConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "num_heads": {
         "description": "Number of query heads",
         "title": "Num Heads",
         "type": "integer"
      },
      "ratio_q_to_kv_heads": {
         "default": 1,
         "title": "Ratio Q To Kv Heads",
         "type": "integer"
      },
      "logit_scale_learnable": {
         "default": false,
         "title": "Logit Scale Learnable",
         "type": "boolean"
      },
      "attn_drop_prob": {
         "default": 0.0,
         "title": "Attn Drop Prob",
         "type": "number"
      },
      "proj_drop_prob": {
         "default": 0.0,
         "title": "Proj Drop Prob",
         "type": "number"
      },
      "max_attention_batch_size": {
         "default": -1,
         "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference.",
         "title": "Max Attention Batch Size",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      },
      "norm_location": {
         "default": "post",
         "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.",
         "enum": [
            "pre",
            "post"
         ],
         "title": "Norm Location",
         "type": "string"
      },
      "layer_norm_eps": {
         "default": 1e-06,
         "description": "Epsilon value for the layer normalization.",
         "title": "Layer Norm Eps",
         "type": "number"
      }
   },
   "required": [
      "dim",
      "num_heads"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field norm_location: Literal['pre', 'post'] = 'post'#

Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.

field layer_norm_eps: float = 1e-06#

Epsilon value for the layer normalization.

class vision_architectures.blocks.transformer.Attention1DMLP(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Module

The MLP that typically follows the attention operation. This class is designed for 1D inputs, e.g. language sequences.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize an Attention1DMLP block. Activation checkpointing level 2.

Parameters:
  • config (Attention1DMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(hidden_states)[source]#

Forward pass of the Attention1DMLP block.

Parameters:

hidden_states (Tensor) – Tensor of shape (B, T, C) representing the input features.

Return type:

Tensor

Returns:

Tensor of shape (B, T, C) representing the output features.
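
A minimal usage sketch (not from the source; it assumes a standard PyTorch setup and relies on the config-as-dict behaviour described in __init__ above):

import torch
from vision_architectures.blocks.transformer import Attention1DMLP

mlp = Attention1DMLP({"dim": 256, "mlp_ratio": 4})
x = torch.randn(2, 128, 256)  # (B, T, C)
y = mlp(x)
print(y.shape)  # expected torch.Size([2, 128, 256]), i.e. the input shape is preserved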

class vision_architectures.blocks.transformer.Attention3DMLP(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Attention1DMLP

The MLP that typically follows the attention operation. This class is designed for 3D inputs, e.g. medical images and videos.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize an Attention3DMLP block. Activation checkpointing level 2.

Parameters:
  • config (Attention3DMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(hidden_states, channels_first=True)[source]#

Forward pass of the Attention3DMLP block.

Parameters:
  • hidden_states (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the input features.

  • channels_first (bool) – Whether the inputs are in channels first format (B, C, …) or not (B, …, C).

Return type:

Tensor

Returns:

Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features.
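
A minimal usage sketch for the 3D variant (same assumptions as above); channels_first switches between the (B, C, Z, Y, X) and (B, Z, Y, X, C) layouts:

import torch
from vision_architectures.blocks.transformer import Attention3DMLP

mlp = Attention3DMLP(dim=96)  # per __init__, config fields can also be passed as keyword arguments
x = torch.randn(1, 96, 8, 16, 16)  # (B, C, Z, Y, X)
y = mlp(x, channels_first=True)
print(y.shape)  # expected torch.Size([1, 96, 8, 16, 16])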

class vision_architectures.blocks.transformer.Attention1DWithMLP(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Bases: Module

An attention block with an MLP. This class is designed for 1D inputs, e.g. language sequences.

__init__(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Initialize an Attention1DWithMLP block. Activation checkpointing level 3.

Parameters:
  • config (Attention1DWithMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • relative_position_bias (Union[RelativePositionEmbeddings3D, RelativePositionEmbeddings3DMetaNetwork, None]) – Relative position embeddings for the attention mechanism.

  • logit_scale (Optional[float]) – Optional scaling factor for the attention logits.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(query, key, value)[source]#

Forward pass of the Attention1DWithMLP block.

Parameters:
  • query (Tensor) – Tensor of shape (B, T, C) representing the query features.

  • key (Tensor) – Tensor of shape (B, T, C) representing the key features.

  • value (Tensor) – Tensor of shape (B, T, C) representing the value features.

Return type:

Tensor

Returns:

Tensor of shape (B, T, C) representing the output features.
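
A minimal usage sketch (same assumptions). Passing the same tensor as query, key, and value reduces the block to self-attention followed by the MLP:

import torch
from vision_architectures.blocks.transformer import Attention1DWithMLP

block = Attention1DWithMLP({"dim": 256, "num_heads": 8, "norm_location": "pre"})
tokens = torch.randn(2, 64, 256)  # (B, T, C)
out = block(tokens, tokens, tokens)  # query, key, value
print(out.shape)  # expected torch.Size([2, 64, 256])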

class vision_architectures.blocks.transformer.Attention3DWithMLP(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Bases: Module

An attention block with an MLP. This class is designed for 3D inputs, e.g. medical images and videos.

__init__(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Initialize an Attention3DWithMLP block. Activation checkpointing level 3.

Parameters:
  • config (Attention3DWithMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • relative_position_bias (Union[RelativePositionEmbeddings3D, RelativePositionEmbeddings3DMetaNetwork, None]) – Relative position embeddings for the attention mechanism.

  • logit_scale (Optional[float]) – Optional scaling factor for the attention logits.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(query, key, value, channels_first=True)[source]#

Forward pass of the Attention3DWithMLP block.

Parameters:
  • query (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the query features.

  • key (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the key features.

  • value (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the value features.

  • channels_first (bool) – Whether the inputs are in channels first format (B, C, …) or not (B, …, C).

Return type:

Tensor

Returns:

Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features.
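
A minimal usage sketch for the 3D variant (same assumptions), here with channels-last inputs:

import torch
from vision_architectures.blocks.transformer import Attention3DWithMLP

block = Attention3DWithMLP({"dim": 96, "num_heads": 3})
x = torch.randn(1, 8, 16, 16, 96)  # (B, Z, Y, X, C)
out = block(x, x, x, channels_first=False)
print(out.shape)  # expected torch.Size([1, 8, 16, 16, 96])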

class vision_architectures.blocks.transformer.TransformerEncoderBlock1D(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Bases: Attention1DWithMLP

A self-attention transformer block. This class is designed for 1D inputs, e.g. language sequences.

forward(qkv, *args, **kwargs)[source]#

Forward pass of the TransformerEncoderBlock1D block. Activation checkpointing level 3.

Parameters:

qkv (Tensor) – Tensor of shape (B, T, C) representing the input features. The same tensor is used for query, key, and value.

Return type:

Tensor

Returns:

Tensor of shape (B, T, C) representing the output features.
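
A minimal usage sketch (same assumptions). Because the block is self-attentive, only the single qkv tensor is passed:

import torch
from vision_architectures.blocks.transformer import (
    Attention1DWithMLPConfig,
    TransformerEncoderBlock1D,
)

config = Attention1DWithMLPConfig(dim=256, num_heads=8)
encoder = TransformerEncoderBlock1D(config)
tokens = torch.randn(4, 197, 256)  # (B, T, C)
out = encoder(tokens)
print(out.shape)  # expected torch.Size([4, 197, 256])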

class vision_architectures.blocks.transformer.TransformerEncoderBlock3D(config={}, relative_position_bias=None, logit_scale=None, checkpointing_level=0, **kwargs)[source]#

Bases: Attention3DWithMLP

A self-attention transformer block. This class is designed for 3D inputs, e.g. medical images and videos.

forward(qkv, *args, **kwargs)[source]#

Forward pass of the TransformerEncoderBlock3D block. Activation checkpointing level 3.

Parameters:

qkv (Tensor) – Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the input features. The same tensor is used for query, key, and value.

Return type:

Tensor

Returns:

Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features.
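
A minimal usage sketch for the 3D encoder block (same assumptions). It is additionally assumed that the *args/**kwargs of forward are passed through to the underlying Attention3DWithMLP.forward, so channels_first can be supplied as a keyword argument:

import torch
from vision_architectures.blocks.transformer import TransformerEncoderBlock3D

encoder = TransformerEncoderBlock3D({"dim": 48, "num_heads": 4})
volume = torch.randn(1, 48, 4, 8, 8)  # (B, C, Z, Y, X)
out = encoder(volume, channels_first=True)  # channels_first forwarded via **kwargs (assumption)
print(out.shape)  # expected torch.Size([1, 48, 4, 8, 8])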

class vision_architectures.blocks.transformer.TransformerDecoderBlock1D(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Module

A cross-attention transformer block. This class is designed for 1D inputs, e.g. language sequences.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize a TransformerDecoderBlock1D block. Activation checkpointing level 3.

Parameters:
  • config (Attention1DWithMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(q, kv)[source]#

Forward pass of the TransformerDecoderBlock1D block.

Parameters:
  • q (Tensor) – The query tensor, of shape (B, T, C).

  • kv (Tensor) – The tensor used for both key and value, of shape (B, T, C).

Return type:

Tensor

Returns:

Tensor of shape (B, T, C) representing the output features.
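
A minimal usage sketch (same assumptions). The block cross-attends from the query tensor q to the key/value tensor kv:

import torch
from vision_architectures.blocks.transformer import TransformerDecoderBlock1D

decoder = TransformerDecoderBlock1D({"dim": 256, "num_heads": 8})
q = torch.randn(2, 32, 256)   # query features, (B, T, C)
kv = torch.randn(2, 32, 256)  # key/value features, (B, T, C)
out = decoder(q, kv)
print(out.shape)  # expected torch.Size([2, 32, 256])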

class vision_architectures.blocks.transformer.TransformerDecoderBlock3D(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Module

A cross-attention transformer block. This class is designed for 3D inputs, e.g. medical images and videos.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize a TransformerDecoderBlock3D block. Activation checkpointing level 3.

Parameters:
  • config (Attention3DWithMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(q, kv, channels_first=True)[source]#

Forward pass of the TransformerDecoderBlock3D block.

Parameters:
  • q (Tensor) – The query tensor, of shape (B, C, Z, Y, X) or (B, Z, Y, X, C).

  • kv (Tensor) – The tensor used for both key and value, of shape (B, C, Z, Y, X) or (B, Z, Y, X, C).

  • channels_first (bool) – Whether the inputs are in channels first format (B, C, …) or not (B, …, C).

Return type:

Tensor

Returns:

Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) representing the output features.
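
A minimal usage sketch for the 3D decoder block (same assumptions), with channels-first inputs:

import torch
from vision_architectures.blocks.transformer import TransformerDecoderBlock3D

decoder = TransformerDecoderBlock3D({"dim": 64, "num_heads": 4})
q = torch.randn(1, 64, 4, 8, 8)   # (B, C, Z, Y, X)
kv = torch.randn(1, 64, 4, 8, 8)  # (B, C, Z, Y, X)
out = decoder(q, kv, channels_first=True)
print(out.shape)  # expected torch.Size([1, 64, 4, 8, 8])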