DETR3D#

pydantic model vision_architectures.nets.detr_3d.DETRDecoderConfig[source]#

Bases: Attention1DWithMLPConfig

JSON schema:
{
   "title": "DETRDecoderConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "num_heads": {
         "description": "Number of query heads",
         "title": "Num Heads",
         "type": "integer"
      },
      "ratio_q_to_kv_heads": {
         "default": 1,
         "title": "Ratio Q To Kv Heads",
         "type": "integer"
      },
      "logit_scale_learnable": {
         "default": false,
         "title": "Logit Scale Learnable",
         "type": "boolean"
      },
      "attn_drop_prob": {
         "default": 0.0,
         "title": "Attn Drop Prob",
         "type": "number"
      },
      "proj_drop_prob": {
         "default": 0.0,
         "title": "Proj Drop Prob",
         "type": "number"
      },
      "max_attention_batch_size": {
         "default": -1,
         "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference.",
         "title": "Max Attention Batch Size",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      },
      "norm_location": {
         "default": "post",
         "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.",
         "enum": [
            "pre",
            "post"
         ],
         "title": "Norm Location",
         "type": "string"
      },
      "layer_norm_eps": {
         "default": 1e-06,
         "description": "Epsilon value for the layer normalization.",
         "title": "Layer Norm Eps",
         "type": "number"
      },
      "num_layers": {
         "description": "Number of transformer decoder layers.",
         "title": "Num Layers",
         "type": "integer"
      }
   },
   "required": [
      "dim",
      "num_heads",
      "num_layers"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field num_layers: int [Required]#

Number of transformer decoder layers.

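As the module docs below note, configs can also be passed as plain dictionaries. A minimal sketch of a decoder config as a dict, with hypothetical values; only dim, num_heads, and num_layers are required, and every other field falls back to the defaults shown in the schema above:

```python
# Hypothetical decoder config as a plain dict; the module __init__ docs state
# that a dict is converted to a DETRDecoderConfig instance automatically.
decoder_config = {
    "dim": 256,        # feature dimension (required)
    "num_heads": 8,    # attention heads (required)
    "num_layers": 6,   # transformer decoder layers (required)
    # defaults such as mlp_ratio=4, activation="gelu", norm_location="post"
    # are filled in from the schema when omitted
}

required = {"dim", "num_heads", "num_layers"}
missing = required - decoder_config.keys()  # empty when the dict is valid
```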
pydantic model vision_architectures.nets.detr_3d.DETRBBoxMLPConfig[source]#

Bases: CustomBaseModel

JSON schema:
{
   "title": "DETRBBoxMLPConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input features.",
         "title": "Dim",
         "type": "integer"
      },
      "num_classes": {
         "description": "Number of classes for the bounding box predictions.",
         "title": "Num Classes",
         "type": "integer"
      }
   },
   "required": [
      "dim",
      "num_classes"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field dim: int [Required]#

Dimension of the input features.

field num_classes: int [Required]#

Number of classes for the bounding box predictions.

pydantic model vision_architectures.nets.detr_3d.DETR3DConfig[source]#

Bases: DETRDecoderConfig, DETRBBoxMLPConfig, AbsolutePositionEmbeddings3DConfig

JSON schema:
{
   "title": "DETR3DConfig",
   "type": "object",
   "properties": {
      "dim": {
         "description": "Dimension of the input and output features.",
         "title": "Dim",
         "type": "integer"
      },
      "grid_size": {
         "anyOf": [
            {
               "maxItems": 3,
               "minItems": 3,
               "prefixItems": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "integer"
                  },
                  {
                     "type": "integer"
                  }
               ],
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Grid Size"
      },
      "learnable": {
         "default": false,
         "title": "Learnable",
         "type": "boolean"
      },
      "num_classes": {
         "description": "Number of classes for the bounding box predictions.",
         "title": "Num Classes",
         "type": "integer"
      },
      "num_heads": {
         "description": "Number of query heads",
         "title": "Num Heads",
         "type": "integer"
      },
      "ratio_q_to_kv_heads": {
         "default": 1,
         "title": "Ratio Q To Kv Heads",
         "type": "integer"
      },
      "logit_scale_learnable": {
         "default": false,
         "title": "Logit Scale Learnable",
         "type": "boolean"
      },
      "attn_drop_prob": {
         "default": 0.0,
         "title": "Attn Drop Prob",
         "type": "number"
      },
      "proj_drop_prob": {
         "default": 0.0,
         "title": "Proj Drop Prob",
         "type": "number"
      },
      "max_attention_batch_size": {
         "default": -1,
         "description": "Runs attention by splitting the inputs into chunks of this size. 0 means no chunking. Useful for large inputs during inference.",
         "title": "Max Attention Batch Size",
         "type": "integer"
      },
      "mlp_ratio": {
         "default": 4,
         "description": "Ratio of the hidden dimension in the MLP to the input dimension.",
         "title": "Mlp Ratio",
         "type": "integer"
      },
      "activation": {
         "default": "gelu",
         "description": "Activation function for the MLP.",
         "title": "Activation",
         "type": "string"
      },
      "mlp_drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for the MLP.",
         "title": "Mlp Drop Prob",
         "type": "number"
      },
      "norm_location": {
         "default": "post",
         "description": "Location of the normalization layer in the attention block. Pre-normalization implies normalization before the attention operation, while post-normalization applies it after.",
         "enum": [
            "pre",
            "post"
         ],
         "title": "Norm Location",
         "type": "string"
      },
      "layer_norm_eps": {
         "default": 1e-06,
         "description": "Epsilon value for the layer normalization.",
         "title": "Layer Norm Eps",
         "type": "number"
      },
      "num_layers": {
         "description": "Number of transformer decoder layers.",
         "title": "Num Layers",
         "type": "integer"
      },
      "num_objects": {
         "description": "Maximum number of objects to detect.",
         "title": "Num Objects",
         "type": "integer"
      },
      "drop_prob": {
         "default": 0.0,
         "description": "Dropout probability for input embeddings.",
         "title": "Drop Prob",
         "type": "number"
      }
   },
   "required": [
      "dim",
      "num_classes",
      "num_heads",
      "num_layers",
      "num_objects"
   ]
}

Config:
  • arbitrary_types_allowed: bool = True

  • extra: str = ignore

  • validate_default: bool = True

  • validate_assignment: bool = True

  • validate_return: bool = True


field num_objects: int [Required]#

Maximum number of objects to detect.

field drop_prob: float = 0.0#

Dropout probability for input embeddings.

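Since DETR3DConfig inherits from DETRDecoderConfig, DETRBBoxMLPConfig, and AbsolutePositionEmbeddings3DConfig, its required fields are the union of the bases' required fields plus its own num_objects (grid_size and learnable have defaults, so the position-embedding base adds nothing required). A quick sanity check against the required list in the schema above:

```python
decoder_required = {"dim", "num_heads", "num_layers"}  # from DETRDecoderConfig
bbox_required = {"dim", "num_classes"}                 # from DETRBBoxMLPConfig
detr3d_extra = {"num_objects"}                         # declared on DETR3DConfig itself

# Union of inherited required fields matches the schema's "required" list.
detr3d_required = decoder_required | bbox_required | detr3d_extra
```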
class vision_architectures.nets.detr_3d.DETRDecoder(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Module, PyTorchModelHubMixin

DETR Transformer decoder.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize the DETRDecoder (activation checkpointing level 4).

Parameters:
  • config (DETRDecoderConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(object_queries, embeddings, return_intermediates=False)[source]#

Forward pass of the DETR decoder.

Parameters:
  • object_queries (Tensor) – Object query tokens. Tensor of shape (B, T, C).

  • embeddings (Tensor) – Embeddings of the encoded input. Tensor of shape (B, T, C).

  • return_intermediates (bool) – If True, also returns the outputs of all layers. Defaults to False.

Return type:

Tensor | tuple[Tensor, list[Tensor]]

Returns:

If return_intermediates is True, returns the final object embeddings and a list of outputs from all layers. Otherwise, returns only the final object embeddings.
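The return_intermediates contract can be sketched with a toy stand-in for the layer stack (the lambdas below are placeholders, not the real attention layers): the final output is always returned, and the per-layer outputs are additionally collected when the flag is set.

```python
def run_layers(x, layers, return_intermediates=False):
    # Toy sketch of the decoder's return convention: each layer transforms
    # the object queries, and per-layer outputs are collected as intermediates.
    intermediates = []
    for layer in layers:
        x = layer(x)
        intermediates.append(x)
    if return_intermediates:
        return x, intermediates
    return x

layers = [lambda v: v + 1, lambda v: v * 2]
final = run_layers(3, layers)                                    # 8
final, per_layer = run_layers(3, layers, return_intermediates=True)
# per_layer holds each layer's output: [4, 8]
```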

class vision_architectures.nets.detr_3d.DETRBBoxMLP(config={}, **kwargs)[source]#

Bases: Module

DETR Bounding Box MLP. This module predicts bounding boxes and class scores from object query embeddings.

__init__(config={}, **kwargs)[source]#

Initialize the DETRBBoxMLP.

Parameters:
  • config (DETRBBoxMLPConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • **kwargs – Additional keyword arguments for configuration.

forward(object_embeddings)[source]#

Forward pass of the DETRBBoxMLP.

Parameters:

object_embeddings (Tensor) – Object embeddings from the DETR decoder. Tensor of shape (B, T, C).

Return type:

Tensor

Returns:

A tensor of shape (B, num_objects, 1 + 6 + num_classes) containing the predicted bounding boxes and class scores: one objectness score, six bounding box parameters, and num_classes class scores per object.
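Taking the documented layout at face value (one objectness score, then six box parameters, then num_classes class scores; the exact channel order is an assumption based on the description above), a single object's prediction vector would be sliced like this:

```python
num_classes = 4
# One object's prediction vector, laid out as
# [objectness | 6 bbox params | num_classes class scores] (assumed order).
pred = [0.9] + [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] + [0.0] * num_classes

objectness = pred[0]
bbox = pred[1:7]
class_scores = pred[7:7 + num_classes]
```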

class vision_architectures.nets.detr_3d.DETR3D(config={}, checkpointing_level=0, **kwargs)[source]#

Bases: Module, PyTorchModelHubMixin

DETR 3D model. Also implements bipartite matching loss which is essential for DETR training.

__init__(config={}, checkpointing_level=0, **kwargs)[source]#

Initialize the DETR3D (activation checkpointing level 4).

Parameters:
  • config (DETR3DConfig) – An instance of the Config class that contains all the configuration parameters. It can also be passed as a dictionary and the instance will be created automatically.

  • checkpointing_level (int) – The level of checkpointing to use for activation checkpointing. Refer to ActivationCheckpointing for more details.

  • **kwargs – Additional keyword arguments for configuration.

forward(embeddings, spacings, channels_first=True, return_intermediates=False)[source]#

Forward pass of the DETR3D.

Parameters:
  • embeddings (Tensor) – Encoded input features. Tensor of shape (B, C, Z, Y, X) or (B, Z, Y, X, C) depending on channels_first.

  • spacings (Tensor) – Spacing information of shape (B, 3) of the input features.

  • channels_first (bool) – Whether the inputs are in channels first format (B, C, …) or not (B, …, C).

  • return_intermediates (bool) – If True, also returns the outputs of all layers. Defaults to False.

Return type:

Tensor | tuple[Tensor, Tensor, list[Tensor]]

Returns:

A tuple containing the bounding boxes, object embeddings, and all layer outputs if return_intermediates is True; otherwise, only the bounding boxes.
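The channels_first flag only changes where the channel axis sits; a small helper showing the shape permutation implied by the two accepted layouts (illustrative only, not part of the API):

```python
def to_channels_last_shape(shape):
    # (B, C, Z, Y, X) -> (B, Z, Y, X, C): move the channel axis to the end.
    b, c, *spatial = shape
    return (b, *spatial, c)

# A channels-first volume of 256-dim features on an 8x16x16 grid:
to_channels_last_shape((2, 256, 8, 16, 16))  # (2, 8, 16, 16, 256)
```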

static bipartite_matching_loss(pred, target, classification_cost_weight=1.0, bbox_l1_cost_weight=1.0, bbox_iou_cost_weight=1.0, reduction='mean')[source]#

Bipartite matching loss for DETR. The class predictions are treated as a multi-class classification problem and are expected to be raw logits, not probabilities.

Parameters:
  • pred (Tensor) – Predicted bounding boxes and class scores. It should be of shape (B, num_objects, 6 + 1 + num_classes). The number of objects and the number of classes are inferred from this tensor.

  • target (Tensor | list[Tensor]) – Target bounding boxes and class scores. If provided as a list, each element should be a tensor corresponding to one batch element of pred, so the list should have length B. Each tensor should contain at most as many objects as pred. The number of classes can either match pred exactly, or be 1, i.e. argmax (one-cold) decoded class labels.

  • classification_cost_weight (float) – Weight for the classification cost in Hungarian matching.

  • bbox_l1_cost_weight (float) – Weight for the bounding box L1 loss cost in Hungarian matching.

  • bbox_iou_cost_weight (float) – Weight for the bounding box IoU cost in Hungarian matching.

  • reduction (str) – Specifies the reduction to apply to the output.

Return type:

Tensor

Returns:

A tensor containing the bipartite matching loss with the shape depending on the reduction argument.
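The matching step behind this loss minimizes a weighted sum of the three documented costs (classification, bbox L1, bbox IoU) over all one-to-one assignments of predictions to targets. Implementations typically use scipy.optimize.linear_sum_assignment for this; the exhaustive sketch below just illustrates what is being minimized, on an arbitrary made-up cost matrix:

```python
from itertools import permutations

def min_cost_matching(cost_matrix):
    """Brute-force minimum-cost bipartite matching on a square cost matrix.

    Per the argument docs above, cost_matrix[i][j] would be
    classification_cost_weight * cls_cost(i, j)
    + bbox_l1_cost_weight * l1_cost(i, j)
    + bbox_iou_cost_weight * iou_cost(i, j).
    Exhaustive search is O(n!), fine only for tiny n.
    """
    n = len(cost_matrix)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost_matrix[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

cost = [
    [1.0, 2.0, 0.5],
    [0.25, 4.0, 2.0],
    [2.0, 0.125, 3.0],
]
matching, total = min_cost_matching(cost)  # matching == (2, 0, 1), total == 0.875
```

Here prediction 0 is matched to target 2, prediction 1 to target 0, and prediction 2 to target 1; the loss is then computed only over these matched pairs.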