
Improvement through Refactoring

Fixing conversion errors

Based on the error causes identified in the previous step, we now refactor the source code. In this tutorial, the refactored code is already provided, so we simply move to its directory.

cd ../refactor

Fixing Input Shape (Resolving Dynamic Shape)

In DETR, an error occurs in the IntermediateLayerGetter class (ResNet101), and its cause is dynamic shape: the original DETR resizes images while preserving aspect ratio, so the input tensor size varies from image to image.

Solution: Add padding so that the input size is always constant (e.g., 1333×1333).

  • Added a step that resizes images to the maximum size (1333×1333) and zero-pads the remaining area (see the sketch after this list).
  • Adjusted the coordinate transformation in post-processing (inverse normalization) to account for the padding margins.
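
A minimal sketch of this padding step is shown below, assuming CHW image tensors. The helper name pad_to_fixed_size and the mask convention are illustrative, not the tutorial's actual code.

import torch
import torch.nn.functional as F

def pad_to_fixed_size(image: torch.Tensor, size: int = 1333):
    """Zero-pad a CHW image so its spatial size is always (size, size).

    Assumes the image has already been resized so that h, w <= size.
    """
    _, h, w = image.shape
    # F.pad takes (left, right, top, bottom) for the last two dimensions.
    padded = F.pad(image, (0, size - w, 0, size - h), value=0.0)
    # Boolean mask of the padded region (True = padding), so later stages
    # can ignore padded pixels and post-processing can rescale boxes by
    # the original (h, w) rather than by `size`.
    mask = torch.ones(size, size, dtype=torch.bool)
    mask[:h, :w] = False
    return padded, mask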

Removing the custom NestedTensor class and organizing input/output

The top-level DETR class and related classes take the custom NestedTensor class as an input argument. We convert this to a format that ONNX can interpret.

Solution: In the data loader and model processing, separate the image tensor and the mask.

  • Removed NestedTensor and revised the code to explicitly pass two torch.Tensors, the “image tensor” and the “mask”, as arguments.
  • The interfaces of the following sub-modules are modified accordingly (see the sketch after this list):
    • Joiner
    • BackboneBase
    • PositionEmbeddingSine
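
The sketch below illustrates the new interface, assuming a batch of fixed-size images. The class internals are placeholders rather than the tutorial's actual code, but the mask interpolation mirrors what DETR does when unpacking a NestedTensor.

import torch
import torch.nn.functional as F
from torch import nn

class BackboneBase(nn.Module):
    # Illustrative: forward receives the image batch and the padding mask
    # as two plain torch.Tensors, so ONNX tracing never sees a NestedTensor.
    def __init__(self, body: nn.Module):
        super().__init__()
        self.body = body

    def forward(self, images: torch.Tensor, mask: torch.Tensor):
        # images: (N, 3, 1333, 1333) float, mask: (N, 1333, 1333) bool
        features = self.body(images)
        # Downsample the padding mask to the feature-map resolution.
        small = F.interpolate(mask[None].float(), size=features.shape[-2:])
        return features, small.to(torch.bool)[0]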

Correcting Output Format

After resolving the dynamic shape and input type issues, the next error occurs because the DETR class returns a dictionary.

Solution: Change the output to a simple list of tensors.

  • DETR class: Modified to return a tuple of tensors instead of a dictionary with the keys pred_logits and pred_boxes (see the sketch after this list).
  • Evaluator class: Added a wrapper that restores inference results from a tuple back to dictionary form when receiving them.
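
A sketch of this change is shown below. pred_logits and pred_boxes are DETR's actual output keys; the wrapper name is illustrative.

# Inside DETR.forward:
#   before: return {"pred_logits": pred_logits, "pred_boxes": pred_boxes}
#   after:  return pred_logits, pred_boxes   # a plain tuple ONNX can represent

def unwrap_outputs(outputs):
    """Evaluator-side wrapper: restore the dictionary format that the
    downstream COCO evaluation code expects."""
    pred_logits, pred_boxes = outputs
    return {"pred_logits": pred_logits, "pred_boxes": pred_boxes}
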
TIPS: Narrow Down Conversion Targets

AcuiRT performs recursive conversion starting from the top-level module, but during debugging you may want to convert and verify only a specific module. In that case, by specifying a config like the one below, you can target only specific modules (e.g., backbone.conv.stem1). This enables fast verification without being blocked by errors in higher-level modules.

config_debug.json
{
  "backbone.conv.stem1": {
    "rt_mode": "onnx",
    "auto": true
  }
}
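
For example, you can point a conversion run at this config via the same --trt-config flag used elsewhere in this tutorial (the engine directory here is illustrative):

python main.py --batch_size 1 --no_aux_loss --eval --backbone resnet101 --resume ../baseline/detr-r101-2c7b67e5.pth --coco_path /path/to/dataset/coco --trt-engine-dir exps/debug --trt-config config_debug.json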

Re-converting the Modified Model

Running the conversion again after the refactoring produces output like the following.

python main.py --batch_size 1 --no_aux_loss --eval --backbone resnet101 --resume ../baseline/detr-r101-2c7b67e5.pth --coco_path /path/to/dataset/coco --trt-engine-dir exps/refactor --trt-config aibooster_misc/config.json
Workflow Report
Conversion Rate: 447/447
Num Modules: 1
Accuracy: 0.5021068707623291
Latency: 66.03 ms
model [success] (<class 'models.detr.DETR'>)
Data
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ status  ┃ module_class_name ┃ device_time_total ┃ device_events_total ┃ error ┃ num_errors ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ success │ DETR              │ 144668.06         │ 3                   │ None  │ 0          │
└─────────┴───────────────────┴───────────────────┴─────────────────────┴───────┴────────────┘

Checking Num Modules, it shows the ideal value of 1, confirming that all layers have been consolidated into a single TensorRT engine. Inference accuracy and speed are also nearly on par with the results before applying AcuiRT.

As for accuracy, a decrease of about 3% from the baseline is observed. This is thought to be caused by the padding added to fix the input shape, which changes the convolution results near the image borders (the padded margins). Fine-tuning with padding enabled is expected to restore the lost accuracy.

Accelerating with Quantization

By adding quantization settings to the config, quantization is applied automatically during conversion to TensorRT. The setting below converts to TensorRT with all layers quantized to FP16.

config_quant_fp16.json
{
  "rt_mode": "onnx",
  "auto": true,
  "fp16": true
}

We now attempt the conversion with quantization applied. If you obtain output like the following, it succeeded: latency improves significantly while accuracy (AP) is maintained.

python main.py --batch_size 1 --no_aux_loss --eval --backbone resnet101 --resume ../baseline/detr-r101-2c7b67e5.pth --coco_path /path/to/dataset/coco --trt-engine-dir exps/refactor_fp16 --trt-config aibooster_misc/config_fp16.json
Workflow Report
Conversion Rate: 447/447
Num Modules: 1
Accuracy: 0.5025068956852554
Latency: 48.13 ms
model [success] (<class 'models.detr.DETR'>)
Data
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ status ┃ module_class_name ┃ device_time_total ┃ device_events_total ┃ error ┃ num_errors ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ success │ DETR │ 42507.566 │ 3 │ None │ 0 │
└─────────┴───────────────────┴───────────────────┴─────────────────────┴───────┴────────────┘

Final Result

The final optimized results, combining the refactoring so far (fixing the input shape and organizing input/output) with the quantization settings, are as follows. By modifying the model architecture and applying FP16 quantization, we achieved a significant speedup.

Model                               | AP (Accuracy) | Latency  | Notes
PyTorch model                       | 0.5310        | 60.92 ms | Baseline
AcuiRT + Refactoring + Quantization | 0.5025        | 48.13 ms | about 1.25× speedup; accuracy recoverable via fine-tuning