3D Harmonic Loss: Towards Task-Consistent and Time-Friendly 3D Object Detection on Edge for V2X Orchestration


Abstract—The use of edge computing for 3D perception has garnered interest in intelligent transportation systems (ITS) due to its potential to enhance Vehicle-to-Everything (V2X) orchestration through real-time traffic monitoring. The ability to accurately measure depth information in the environment using LiDAR has led to a growing emphasis on 3D detection based on this technology, which has significantly advanced the field of 3D perception. However, the computationally-intensive nature of these operations has made it challenging to meet the real-time deployment requirements using existing methods. The object detection task in the pointcloud domain is hindered by a substantial inconsistency problem caused by its high sparsity, which remains unaddressed. This article conducts an in-depth analysis of the issue, which has been brought to light by recent research on detecting inconsistency problems in image specialization. To address this problem, we propose a solution in the form of a 3D harmonic loss function, which aims to alleviate the inconsistent predictions based on pointcloud data. In addition, we showcase the viability of optimizing 3D harmonic loss mathematically. Our simulations employ the KITTI dataset and DAIR-V2X-I dataset, and our proposed approach significantly surpasses the performance of benchmark models. Additionally, we validate the efficiency of our proposed model through its deployment on an edge device (Jetson Xavier TX) in a simulated environment.

I. INTRODUCTION
Background: Edge computing-based computer vision technology has received global attention for strengthening V2X orchestration and autonomous driving systems (ADS).
In the interdisciplinary research areas of V2X and ADS, data is collected and analyzed by vehicles and infrastructure to enable intelligent decision-making for vehicle movement [1]. The decision-making system relies on data captured from the surrounding area, including road structure and traffic information. Through effective object detection methods deployed via Road Side Units (RSU) and On-Board Units (OBU), the data is analyzed to identify and localize traffic candidates, enabling the necessary decisions for vehicle movement.
Motivation: Consider the movement of a vehicle using event-triggered analysis based on traffic information. Traffic data is collected through surveillance devices or on-vehicle sensors such as LiDAR and cameras. Edge devices analyze the LiDAR data (pointcloud) to achieve the target with low latency, while computation-intensive services are offloaded to servers to meet application deadlines. The affordability, improved perception of distant objects, and robustness of LiDAR 3D object detection technology have made it prominent. To facilitate precise vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication, recognizing and localizing vehicles is vital for measuring the surroundings and infrastructure through RSU-LiDAR deployment. Therefore, developing and deploying an efficient and robust 3D detector based on LiDAR data is a crucial research direction for enhancing V2X efficiency.
Problem of task inconsistency and time delay: Modern object detection has branched out into various sub-tasks such as object localization, classification, and direction estimation. In the 2D image domain, most 2D detectors consider the sub-tasks independently, leading to inconsistent and unexpected predictions with high classification confidence but inadequate localization after post-processing (e.g., Non-Maximum Suppression), as shown in Fig. 1(a). Recently, researchers have addressed and partially solved this inconsistency problem for 2D object detection in [2], [3], [4], [5]. However, despite these advances in the image domain, the problem remains unaddressed for pointclouds. While recent lidar-based 3D object detection methods [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] focus on achieving the best mAP and consider it the benchmark for model accuracy, they fail to address other critical factors such as time consumption, quality of experience (QoE), and service reliability. For real-time applications like V2X, a cost-effective, task-friendly, and task-consistent detection solution with fast run-time and a low error rate is required. Some researchers [19], [20] have noticed the problem of task inconsistency, but their solutions rely on additional modules that increase inference time, which contradicts our goal of reducing the computational burden.

Fig. 4. Qualitative analysis of overall 3D detection performance. Predicted bboxes from PointPillar (baseline) (green bboxes) [21] and predicted bboxes from Harmonic PointPillar (Ours) (blue bboxes) are visualized in the same frames. Ground truths (red bboxes) are also drawn for qualitative checking. Harmonic PointPillar (Ours) shows a better recall rate and localization accuracy with fewer false positives than PointPillar (baseline).

Fig. 5. Qualitative analysis of inconsistent/consistent 3D detection. For a better view, the results are shown in BEV visualization (zoom in for detail). PointPillar (baseline) [21] suffers from the inconsistency problem in 3D detection, while Harmonic PointPillar (Ours) shows great robustness in keeping predictions consistent.
Other recent works like [21], [22], [23], [24], [25], [26] have attempted to improve deployment metrics like computational burden and execution latency, but they have not achieved a sufficient trade-off between detection accuracy and time consumption for edge device-based simulations in real-time applications.
Our solutions: We derive solutions from the learning-optimization perspective to address the above drawbacks and improve edge-computing object detection performance. First, drawing lessons from the inconsistent-prediction problem in camera-based 2D detection, we identify a similar inconsistency problem in lidar-based 3D detection. This problem gradually degrades prediction accuracy in actual applications and is worth uncovering and resolving. To alleviate inconsistent predictions of 3D detectors, we analyze the cause of the inconsistency problem through the respective characteristics of images and pointclouds. Inspired by the solution in the image domain [4], we extend the 2D solution to 3D detection and propose 3D harmonic loss, a task-consistent learning strategy for optimizing pointcloud-based 3D detectors. It is worth mentioning that, unlike previous solutions [19], [20], our 3D harmonic loss operates only during model training and adds no extra time cost to model inference. Second, a thorough mathematical analysis is conducted to explain and demonstrate the effectiveness of 3D harmonic loss. Experiments on the KITTI 3D/BEV detection dataset [27] further validate that the proposed strategy achieves a noticeable performance improvement. Third, our proposed model is deployed on an edge device (Jetson Xavier TX) for simulation and achieves an ideal trade-off between time efficiency and detection accuracy.
We deploy the proposed detector on edge devices (Jetson Xavier TX) for realistic simulations to meet the lightweight design and edge-computing benchmark metrics.
Our contributions are as follows.
1) We develop a 3D harmonic loss method for alleviating inconsistent predictions, inspired by related ideas from 2D detection. We thus lift the 2D solution to lidar-based 3D detection to improve the learning accuracy of both two-stage and one-stage 3D detection models without extra time cost at inference.
2) Experiments on the KITTI dataset [27] and the DAIR-V2X-I dataset [28] demonstrate the effectiveness of our proposed work for both on-vehicle and on-infrastructure object detection. In particular, industry-popular lidar-based detectors such as SECOND [6] and PointPillar [21] are considered to showcase the significant margin of mean average precision (mAP) improvement achieved by the proposed 3D harmonic loss.
3) Realistic simulations deploying our proposed lightweight detector on the Jetson Xavier device further verify that our solution is time-friendly and task-consistent for real 3D detection applications.

The rest of the article is organized as follows. Section II reviews the extant approaches and research gaps. Section III presents the proposed work in detail. Section IV evaluates the proposed method's effectiveness using qualitative and quantitative analysis. Section V concludes the manuscript.

II. RELATED WORK

A. LiDAR-Based 3D Object Detection
The popularity of 3D object detection has increased with the use of pointcloud-based deep learning models built on various frameworks. Typically, two types of frameworks exist: one-stage and two-stage. One-stage methods predict object 3D bounding boxes (bboxes) in a single pass. Some of these methods are point-based, like 3DSSD [9], which utilizes the PointNet [29] architecture, and PointGNN [7], which employs a graph neural network; they use raw lidar pointclouds to make 3D shape predictions. Alternatively, voxel-based methods, such as VoxelNet [10], first convert the lidar pointcloud into 3D voxels to decrease input memory usage. The voxel features are then fed into a region proposal network using 3D convolutions for 3D detection. SECOND [6] is a more time-efficient approach based on VoxelNet that proposes sparse 3D convolutions. However, the time performance of one-stage 3D detection remains unsatisfactory. VoTr [30] and VoxSeT [14] adopt a voxel-based one-stage design and introduce transformer architectures for improved accuracy; however, their heavy parameter counts and complicated operations significantly reduce the time performance of 3D detection. PointPillar [21], on the other hand, transforms 3D pointclouds into 2D voxels (pillars), followed by highly efficient 2D convolutions, achieving real-time performance and easy deployment of 3D detection [26].
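The pointcloud-to-voxel (or pillar) quantization step these methods share can be sketched as follows. This is an illustrative simplification of the idea, not the actual VoxelNet/PointPillar implementation; the grid bounds and cell sizes are made-up example values.

```python
from collections import defaultdict

def voxelize(points, voxel_size=(0.16, 0.16, 4.0), pc_range=(0.0, -40.0, -3.0)):
    """Group raw LiDAR points (x, y, z) into voxel bins.

    Returns a dict mapping integer grid coordinates to the list of points
    falling inside that cell. With a single z bin spanning the full height
    (as here), the cells behave like PointPillar-style pillars.
    """
    voxels = defaultdict(list)
    for x, y, z in points:
        ix = int((x - pc_range[0]) / voxel_size[0])
        iy = int((y - pc_range[1]) / voxel_size[1])
        iz = int((z - pc_range[2]) / voxel_size[2])
        voxels[(ix, iy, iz)].append((x, y, z))
    return voxels

# Two nearby points fall into the same pillar; a distant one does not.
pts = [(10.00, 0.00, -1.0), (10.05, 0.05, -0.5), (30.0, 5.0, -1.0)]
grid = voxelize(pts)
```

A per-voxel feature encoder (e.g., a small PointNet) would then turn each cell's point list into a fixed-size feature vector for the downstream convolutional backbone.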
In contrast, the first stage of two-stage detectors [8], [11], [12], [13], [17], [18] predicts Region-of-Interest (ROI) proposals, while the second stage utilizes a refinement network to detect objects with greater precision. Despite the advantages of some two-stage methods, such as CenterPoint [17], which incorporates fast feature encoding and a lightweight refinement head in its network design, they often fail to meet the speed requirements of real-time applications. As a result, a considerable time-cost gap remains between such two-stage detectors and one-stage detectors such as PointPillar [21].
Combining image and pointcloud data [31], [32], [33], [34], [35] is a suitable approach for enhancing 3D detection accuracy and surrounding perception. Nonetheless, fusion techniques are more intricate and time-consuming for real-time applications compared to pure lidar-based detection. Thus, while we consider some fusion methods in our experiments to demonstrate their accuracy benefits, they are not the primary focus of our discussion.

B. Inconsistency Problem in Object Detection
The issue of inconsistency was first observed in 2D object detection methods in the image domain, as demonstrated in Fig. 1(a). In early approaches, object classification and localization were treated independently during model training, leading to incongruous predictions during inference. To address this problem, recent studies [2], [3], [4], [5] have attempted to bridge the gap between these sub-tasks of 2D detection in the image domain. For example, [2], [3] proposed Generalized Focal Loss approaches that extend the focal loss [36] to ensure consistent 2D detection. [4] introduces a balanced loss function to reconcile prediction consistency. Moreover, [5] proposed a PAA method that includes an additional module for predicting IoU, which is useful for selecting positive training samples. Similarly, inconsistency in 3D pointcloud object detection systems can lower object detection reliability and quality of experience. Some 3D detection methods [19], [20] may slightly alleviate this issue, even though they are not fundamentally aware of the inconsistency problem. However, these methods modify the structure of the 3D models, requiring additional time and operations for predicting IoU and post-processing, which may not be feasible for real-time environments. Our work is relatively independent of the above works, mainly in two aspects: our work is the first to indicate the need to address the inconsistency problem of 3D detection in the pointcloud domain. Most importantly, our proposed solution, as a common optimization method for training 3D detectors, effectively addresses the inconsistency problem without introducing any extra burden during model inference and deployment.

III. PROPOSED WORK
This section presents the formulation of the proposed method, 3D harmonic loss, from both a theoretical and a mathematical-optimization standpoint.
As illustrated in Fig. 1(d), we aim to attain consistent predictions in 3D detection. The reason behind the inconsistency issue is explored by analyzing the learning loss function (1) used for a positive training sample $i$ in several existing methods [6], [8], [12], [21]. It reveals that the three sub-tasks of 3D object detection (classification, localization (regression), and direction estimation) are handled and supervised separately, resulting in the inconsistency problem:

$$H^i_{3D} = L_{cls}(p_i, p^{gt}_i) + L_{reg}(d_i, d^{gt}_i) + L_{dir}(\hat{p}_i, \hat{p}^{gt}_i) \tag{1}$$

where $p_i$ is the softmax classification score and $\hat{p}_i$ is the softmax direction score; $p^{gt}_i$ and $\hat{p}^{gt}_i$ are the ground truths for classification and direction estimation, respectively. The classification loss $L_{cls}(p_i, p^{gt}_i)$ for positive training samples ($p^{gt}_i = 1$) uses the focal loss [36]:

$$L_{cls}(p_i, p^{gt}_i) = -\alpha\,(1-p_i)^{\gamma}\log(p_i) \tag{2}$$

In continuation, the regression loss $L_{reg}$ uses Smooth$L_1$ [37]:

$$L_{reg}(d_i, d^{gt}_i) = \sum \mathrm{Smooth}L_1(\Delta d_i) \tag{3}$$

where $\Delta d_i$ is the difference between the set of attributes $(x_i, y_i, z_i, l_i, w_i, h_i, \theta_i)$ of the predicted offsets $d_i$ and the ground-truth offsets $d^{gt}_i$, determined by the parameters $(X^{gt}_i, Y^{gt}_i, Z^{gt}_i, L^{gt}_i, W^{gt}_i, H^{gt}_i, \alpha^{gt}_i)$ of the ground-truth boxes and the parameters $(X_i, Y_i, Z_i, L_i, W_i, H_i, \alpha_i)$ of the anchor boxes:

$$x_i = \frac{X^{gt}_i - X_i}{\sqrt{L_i^2 + W_i^2}},\quad y_i = \frac{Y^{gt}_i - Y_i}{\sqrt{L_i^2 + W_i^2}},\quad z_i = \frac{Z^{gt}_i - Z_i}{H_i},\quad l_i = \log\frac{L^{gt}_i}{L_i},\quad w_i = \log\frac{W^{gt}_i}{W_i},\quad h_i = \log\frac{H^{gt}_i}{H_i},\quad \theta_i = \alpha^{gt}_i - \alpha_i \tag{4}$$

The inconsistent handling of the various sub-tasks can result in inconsistent inference outcomes, which was addressed in prior research [4]. However, that research was limited to 2D detection from image sources and only focused on generalizing typical loss functions such as cross-entropy, L1, and IoU. Our proposed 3D harmonic loss, detailed in the Theorem, is specifically designed for lidar-based 3D detection, which involves a different data modality (pointcloud) and additional prediction dimensions (including direction estimation, height, and depth). To maintain consistency during model learning, three dynamic factors are employed: $1+\beta_r$, $1+\beta_c$, and $1-\frac{\beta_r+\beta_c}{\beta_{dir}}$. The factors $1+\beta_r$ and $1+\beta_c$ work in conjunction to ensure mutual consistency, while $1-\frac{\beta_r+\beta_c}{\beta_{dir}}$ guarantees intrinsic consistency. Our approach thus ensures both mutual and intrinsic consistency among the sub-tasks. When classification optimization falls short, the factor obtained from the classification part supervises the regression part, and vice versa. Additionally, our method guarantees that the classification and regression parts consistently supervise the direction estimation part. This is exemplified by the fact that an accurate direction estimation should align with distinct class recognition and unambiguous boundary regression. Our learning mechanism is well-suited to the loss functions commonly used in 3D detection training, including focal loss for object classification, Smooth$L_1$ for object localization, and binary cross-entropy for object direction estimation. In essence, our approach harmoniously addresses all sub-tasks.
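The anchor-to-ground-truth offset encoding that produces the $\Delta d_i$ attributes can be sketched as follows. This mirrors the standard SECOND/PointPillar-style box encoding described above; the numeric example values are hypothetical.

```python
import math

def encode_offsets(gt, anchor):
    """Encode a ground-truth box against an anchor box into the seven
    regression targets (x, y, z, l, w, h, theta). Boxes are tuples
    (X, Y, Z, L, W, H, alpha): center, size, and yaw angle."""
    Xg, Yg, Zg, Lg, Wg, Hg, Ag = gt
    Xa, Ya, Za, La, Wa, Ha, Aa = anchor
    d = math.hypot(La, Wa)  # anchor diagonal, normalizes the x/y offsets
    return (
        (Xg - Xa) / d, (Yg - Ya) / d, (Zg - Za) / Ha,
        math.log(Lg / La), math.log(Wg / Wa), math.log(Hg / Ha),
        Ag - Aa,
    )

# A perfectly matching anchor yields all-zero targets.
perfect = encode_offsets((1, 2, 0, 4, 2, 1.5, 0.3), (1, 2, 0, 4, 2, 1.5, 0.3))
```

The log-ratio form for sizes and the diagonal normalization for the planar offsets keep the targets scale-invariant across object sizes, which stabilizes the Smooth$L_1$ regression.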
Theorem (3D harmonic loss):

$$H^i_{3D\text{-}Har} = (1+\beta_r)\,L_{cls}(p_i, p^{gt}_i) + (1+\beta_c)\,L_{reg}(d_i, d^{gt}_i) + \Big(1-\frac{\beta_r+\beta_c}{\beta_{dir}}\Big)\,L_{dir}(\hat{p}_i, \hat{p}^{gt}_i)$$

where $\beta_r \in (0, 1]$ is a dynamic factor determined by the regression loss and $\beta_c \in (0, 1]$ a dynamic factor determined by the classification loss. The maximum value of $\beta_r$ (or $\beta_c$) cannot exceed 1 unless the regression part (or classification part) of the model has fully converged ($L_{reg}=0$ (or $L_{cls}=0$)). Consequently, we set $\beta_{dir}$ to 2 in our 3D harmonic loss function. Once the model has completely converged, the expression $1-\frac{\beta_r(=1)+\beta_c(=1)}{\beta_{dir}(=2)}$ evaluates to 0, indicating that further weighting is unnecessary. While the model is training, setting $\beta_{dir}=2$ in $1-\frac{\beta_r+\beta_c}{\beta_{dir}}$ ensures that the optimization feedback from the regression part and the classification part to the orientation estimation part is given equal importance.
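To make the weighting concrete, the following sketch combines the three sub-task losses with the dynamic factors. The exponential forms of $\beta_r$ and $\beta_c$ are assumed stand-ins, not necessarily the paper's exact definitions; they are chosen only because they satisfy the stated property that each factor lies in (0, 1] and reaches 1 exactly at full convergence of its loss.

```python
import math

def harmonic_loss(l_cls, l_reg, l_dir, beta_dir=2.0):
    """Sketch of the 3D harmonic loss weighting for one positive sample.

    beta_r (from regression) boosts the classification term, beta_c (from
    classification) boosts the regression term, and together they modulate
    the direction term, coupling the three sub-tasks during training.
    """
    beta_r = math.exp(-l_reg)  # = 1 only when L_reg = 0 (assumed form)
    beta_c = math.exp(-l_cls)  # = 1 only when L_cls = 0 (assumed form)
    w_dir = 1.0 - (beta_r + beta_c) / beta_dir
    return (1 + beta_r) * l_cls + (1 + beta_c) * l_reg + w_dir * l_dir
```

At full convergence ($L_{cls} = L_{reg} = 0$) both factors hit 1, the direction weight $1 - 2/2$ vanishes, and the total loss is zero, matching the convergence behavior stated in the Theorem.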
The effectiveness of 3D harmonic loss is mathematically proven and briefly explained below for a training sample positively supervised by the classification loss $L_{cls}(p_i, p^{gt}_i)$, the regression loss $L_{reg}(d_i, d^{gt}_i)$, and the direction loss $L_{dir}(\hat{p}_i, \hat{p}^{gt}_i)$.

Proof-1: Assuming that the $i$th training sample is positive, we can set $p^{gt}_i = 1$ and use the values $\alpha = 0.25$ and $\gamma = 2$ for the focal loss, so that $L_{cls}(p_i, p^{gt}_i) = -0.25\,(1-p_i)^2\log(p_i)$. This helps to analyze how effective the 3D harmonic loss is in reducing the classification loss.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
With the corresponding pointwise derivation based on (7), (8), and (9), the gradient backpropagation from the classification part is represented as (10).
Note that, with $\beta_{dir} = 2$ in our experiments, the gradient backpropagation from the classification result is closely tied to the regression and direction losses.

Analysis-1: Fig. 2(b) depicts the outcomes of (10) obtained by sampling ten thousand data points. The color intensity corresponds to the value of $\partial H^i_{3D\text{-}Har}/\partial p_i$ for the corresponding $[p_i, L_{reg}, L_{dir}]$, with the three axes representing the values of $L_{reg}$, $p_i$, and $L_{dir}$. Similarly, Fig. 2(a) represents $\partial H^i_{3D}/\partial p_i$ using the same axis representation. Note that for the common loss, the backpropagation gradient from the classification part is independent of the regression and direction estimation parts, as can be observed in Fig. 2(a) (with a better view in its vertical representation, Fig. 2(e)). When using the 3D harmonic loss (best viewed in Fig. 2(f)), a high regression loss suppresses the gradient from the classification loss (due to poor localization), resulting in relatively low confidence, which establishes mutual consistency between classification and localization. Furthermore, $L_{dir}$ gradually influences the gradient propagation to achieve a globally unique optimum, where $\partial H^i_{3D\text{-}Har}/\partial p_i = 0$ occurs only when $p_i = 1$, $L_{dir} = 0$, and $L_{reg} = 0$.
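The suppression effect described in Analysis-1 can be checked numerically with a finite-difference probe of the harmonic classification term. As in the Theorem sketch, $\beta_r = e^{-L_{reg}}$ is an assumed stand-in consistent with the stated bounds, not necessarily the paper's exact definition.

```python
import math

def focal_cls_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for a positive sample (p_gt = 1)."""
    return -alpha * (1 - p) ** gamma * math.log(p)

def harmonic_cls_term(p, l_reg):
    """Classification term of the harmonic loss, boosted by the
    regression-derived factor (assumed form beta_r = exp(-L_reg))."""
    return (1 + math.exp(-l_reg)) * focal_cls_loss(p)

def grad(f, x, h=1e-6):
    """Central finite difference, probing the backprop gradient w.r.t. p."""
    return (f(x + h) - f(x - h)) / (2 * h)

p = 0.6
g_good_loc = abs(grad(lambda q: harmonic_cls_term(q, l_reg=0.01), p))
g_poor_loc = abs(grad(lambda q: harmonic_cls_term(q, l_reg=5.0), p))
# Poor localization (large L_reg) damps the classification gradient,
# so confidence grows more slowly for badly localized boxes.
```

Under this assumed form, the gradient magnitude at a poorly localized sample is roughly half that at a well-localized one, which is the mutual-consistency behavior visualized in Fig. 2(f).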
Proof-2: The effectiveness of the 3D harmonic loss on the regression part is analyzed analogously; the gradient backpropagation from the regression result is given by (11).

Analysis-2: Fig. 2(d) depicts the results of (11) using ten thousand data samples. The color intensity reflects the value of $\partial H^i_{3D\text{-}Har}/\partial \Delta d_i$ for the corresponding $[\Delta d_i, L_{cls}, L_{dir}]$, with the three axes representing the values of $\Delta d_i$, $L_{cls}$, and $L_{dir}$, respectively. Fig. 2(c) shows the corresponding $\partial H^i_{3D}/\partial \Delta d_i$ with the same axis representation. In traditional 3D detection learning, the regression part's gradient backpropagation is independent of classification and direction estimation. With our proposed method, even for the same regression result (i.e., the same $\Delta d_i$), an increasing classification loss consistently restricts the gradient, maintaining synchronous learning of classification and regression (as shown in Fig. 2(h)). The globally unique optimum $\partial H^i_{3D\text{-}Har}/\partial \Delta d_i = 0$ is achieved only when $\Delta d_i = 0$, $L_{cls} = 0$, and $L_{dir} = 0$.
Proof-3: The effectiveness of the 3D harmonic loss on the direction part is derived as follows. Based on the $\hat{p}^{gt}_i$ status, the binary cross-entropy direction loss is

$$L_{dir}(\hat{p}_i, \hat{p}^{gt}_i) = -\hat{p}^{gt}_i\log(\hat{p}_i) - (1-\hat{p}^{gt}_i)\log(1-\hat{p}_i)$$

so the form of the direction loss, and hence its gradient, depends on whether $\hat{p}^{gt}_i = 0$ or $\hat{p}^{gt}_i = 1$.
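The binary cross-entropy direction loss and its per-case gradient can be written directly; the closed-form derivative below covers both ground-truth cases in one expression.

```python
import math

def bce_dir_loss(p, p_gt):
    """Binary cross-entropy used for direction estimation:
    L_dir = -p_gt*log(p) - (1 - p_gt)*log(1 - p), with p in (0, 1)."""
    return -p_gt * math.log(p) - (1 - p_gt) * math.log(1 - p)

def bce_dir_grad(p, p_gt):
    """Closed-form gradient dL_dir/dp. It reduces to -1/p when p_gt = 1
    and to 1/(1 - p) when p_gt = 0, matching the two cases in the text."""
    return -p_gt / p + (1 - p_gt) / (1 - p)
```

The gradient diverges as the prediction approaches the wrong extreme, which is why the harmonic weighting caps the direction term's influence until classification and regression have made progress.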
The gradient backpropagation from the direction result is given by (15).

Analysis-3: Fig. 3 depicts the outcomes of (15).

The standard approach for pointcloud-based 3D detection utilizes three distinct loss functions, namely $L_{cls}$, $L_{reg}$, and $L_{dir}$, which optimize the model's ability to predict the object's category, 3D position, and orientation. The three losses have a relatively independent relationship, since model training employs a direct sum of the losses. This independence can result in inconsistent predictions. The proposed 3D harmonic loss, in contrast, presents a unified formula that reconciles the three losses. To explain the harmonization mechanism, Proof-1, 2, and 3, as well as Analysis-1, 2, and 3, are provided. By implementing the 3D harmonic loss, the three losses become synchronized during model training, leading to simultaneous convergence and reduced prediction inconsistency.

TABLE I
MAP EVALUATION OF BEV OBJECT DETECTION ON CAR CLASS OF KITTI VALIDATION DATASET

IV. EXPERIMENTS AND ANALYSIS

A. Dataset and Evaluation Metrics
The performance of the proposed 3D harmonic loss method is assessed using the KITTI dataset [27] and the DAIR-V2X-I dataset [28], both of which contain LiDAR pointcloud data and 3D object annotations. The KITTI dataset includes 7481 training frames and 7518 test frames; the training frames are split into training (3712 frames) and validation (3769 frames) sets following previous works [6], [8], [12], [21]. Detection accuracy is evaluated using mean average precision (mAP) with 40 recall positions and Average Orientation Similarity (AOS) as metrics.
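The 40-recall-position AP metric mentioned above can be sketched as follows. This is a simplified illustration of the sampling scheme only; the official KITTI evaluation additionally handles IoU matching, difficulty levels, and don't-care regions.

```python
def average_precision_r40(recalls, precisions):
    """KITTI-style R40 AP: sample the best achievable precision at 40
    evenly spaced recall positions (1/40, 2/40, ..., 40/40) and average.

    `recalls` and `precisions` are matched lists describing the
    precision attained at each recall level of the PR curve.
    """
    ap = 0.0
    for i in range(1, 41):
        r = i / 40.0
        # best precision among operating points reaching recall >= r
        # (0 if that recall level is never reached)
        p = max((prec for rec, prec in zip(recalls, precisions) if rec >= r),
                default=0.0)
        ap += p / 40.0
    return ap
```

For example, a detector with perfect precision but only 50% maximum recall scores 0.5, since the 20 recall positions above 0.5 contribute zero.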
The DAIR-V2X dataset [28] facilitates infrastructure-based 3D object detection experiments by providing a sub-dataset called DAIR-V2X-I. This sub-dataset consists of 10,000 lidar pointcloud frames captured from the infrastructure side, containing annotated 3D objects (493k in total) belonging to three categories: car, pedestrian, and cyclist. To align our experiments with those in [28], we use the official DAIR toolkit to convert the DAIR-V2X-I dataset to the KITTI data format and employ the same evaluation metrics as used for the KITTI dataset.

B. Implementation
The experiments in this study were performed on a server equipped with a single NVIDIA GeForce RTX 2080Ti GPU. The KITTI dataset was used to evaluate the effectiveness of the proposed model, and five widely used models (one-stage detectors: PointPillar [21] and SECOND [6]; two-stage detectors: PointRCNN [8], Part-A2 [12], and PV-RCNN [13]) were adopted as baselines. These models were re-implemented and trained on the mmdetection3D platform [38] with the proposed 3D harmonic loss applied, using their original training settings and parameters.
We name our models Harmonic PointPillar, Harmonic SECOND, Harmonic PointRCNN, Harmonic Part-A2, and Harmonic PV-RCNN. During the evaluation stage, we kept the post-processing the same as the baselines. We submitted the results of Harmonic PointPillar to the KITTI official benchmark for testing on the KITTI test dataset. In our assessment, we compared the performance of our models to PointPillar (baseline) and other models [7], [9], [11], [14], [31], [33], [34], [35], [39], [40], [41], including two-stage lidar-based, one-stage lidar-based, and fusion-based methods. To assess the performance of the proposed model on the DAIR-V2X-I dataset, we used PointPillar [21] and SECOND [6], two widely used one-stage detectors, as baselines. We implemented our models and the baselines on the DAIR-V2X official benchmark [28] using the original training parameters and evaluation settings. To ensure a fair comparison, the only difference between our models and the baselines is the adoption of the proposed 3D harmonic loss.

C. Quantitative Analysis
Experimental results with thorough quantitative analysis are reported below.
Car detection: Detecting cars is a crucial aspect of intelligent transportation systems (ITS) applications such as V2V and V2X. Our method's ability to detect cars was evaluated in both on-vehicle and roadside settings, with the resulting mAP values shown in Tables I and II. Our proposed method outperformed the baseline models in terms of average mAP, particularly in BEV detection, where our models achieved significant mAP gains (at least 0.02% and up to 2.36% better than SECOND, and at least 0.15% and up to 2.07% better than PointPillar). We have also submitted our Harmonic model to the official KITTI test benchmark (refer to Table III), and its high time efficiency (as indicated in Table VIII) makes it a popular choice for industrial applications. Our method improves the baseline PointPillar model by 0.82% on Easy samples and 0.72% on Moderate samples, with only a 0.27% decrease on Hard samples. Because our method focuses on balancing and harmonizing the gradients from the different parts, the classification confidence for hard samples is usually low, as they typically consist of very sparse points scanned from objects, leading to a drop in mAP due to suppression of the regression part. Furthermore, extremely hard samples, such as outliers, can adversely affect the model's stability due to their large gradient variance.
The evaluation results for the DAIR-V2X-I dataset are presented in Tables IV and V. Our model achieves an average improvement of at least 0.04% and at most 0.69%. As most current LiDAR-based 3D car detection models were developed and tested from an on-vehicle LiDAR perspective, our future work will focus on developing more effective 3D detection techniques specifically tailored for on-infrastructure LiDAR.
Direction estimation: The performance of 3D direction estimation is evaluated using the average orientation similarity (AOS) index. Table VI presents the AOS evaluation results for different IoU thresholds (0.7 and 0.5). A higher AOS value indicates better direction estimation for 3D objects. Our proposed strategy achieved an average improvement of 0.14% from PV-RCNN to Harmonic PV-RCNN, 0.26% from PointPillar to Harmonic PointPillar, and 0.76% from SECOND to Harmonic SECOND. In particular, it significantly improved the performance of the vanilla models on easy-level objects (improvements ranging from 0.23% up to 1.99% under the 0.7 IoU threshold).

Vulnerable road users detection: In addition to car detection, detecting vulnerable road users such as pedestrians and cyclists is essential for enhancing the security monitoring capabilities of V2X applications. Based on our observations and analysis, the scanned pointcloud shapes of pedestrians and cyclists are more irregular, with varying postures, which poses a significant challenge for optimizing the detection models. Table VII presents a comparison of mAP scores for detecting pedestrians and cyclists using the official 0.5 IoU threshold.

D. Qualitative Analysis
Visualized results along with detailed qualitative analysis are presented below, depicting the overall performance of 3D detection.
Overall performance: A comparison of the overall performance is illustrated in Fig. 4.

Dealing with the inconsistency problem: To further confirm the effectiveness of the proposed method in resolving inconsistency issues in 3D detection, we present a more detailed qualitative visualization in Fig. 5, where viewers can zoom in for a closer inspection. The baseline model (PointPillar) failed to predict the targets in all example frames due to inconsistency between classification and localization. In contrast, our model demonstrated remarkable robustness in maintaining consistent 3D detection, producing the most accurate predictions in all example frames. This confirms that our method can construct a task-consistent 3D detector.

E. Simulations on Realistic Deployment
We utilized our previous experience with the PyTorch-style Harmonic PointPillar [26] to convert it into TensorRT format for deployment. The converted model was deployed on the Jetson Xavier TX using float16 quantization, and the same experiments as on the PC were conducted. The results show a notable 2x speed improvement (75.4 Hz on Jetson Xavier TX vs. 43.1 Hz on a PC with a single 2080Ti) with at most a 1% mAP drop. The Jetson results demonstrate that our proposed method is feasible for edge orchestration due to its consistent trade-off between time efficiency and model accuracy with low energy consumption. Fig. 6 presents a qualitative example of on-infrastructure detection using the TensorRT-format Harmonic PointPillar on Jetson Xavier TX.
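The throughput figures above (75.4 Hz vs. 43.1 Hz) can be reproduced with a simple wall-clock harness of the following shape. This is a generic measurement sketch, not the exact benchmarking code used in the paper; `fn` stands in for one forward pass of the deployed detector.

```python
import time

def measure_hz(fn, warmup=10, iters=100):
    """Measure the average invocation frequency (Hz) of a callable.

    Warm-up iterations are discarded so one-time costs (cache, JIT,
    engine initialization) do not skew the steady-state number.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Sanity check: 75.4 Hz corresponds to ~13.3 ms per frame.
ms_per_frame = 1000.0 / 75.4
```

On a real deployment, each `fn()` would also need to synchronize the GPU before the timer reads, otherwise asynchronous kernel launches make the measured latency optimistic.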

V. CONCLUSION
In this article, we propose a method to address the inconsistency problem in 3D object detection and achieve better results compared to state-of-the-art methods. Our simulations demonstrate that the proposed method is effective in strengthening V2X frameworks. We first analyze the causes of inconsistency among classification, localization, and direction estimation and derive theoretical and mathematical solutions. Second, we introduce the 3D harmonic loss function, which effectively resolves the inconsistency problem in the pointcloud domain and achieves higher mAP with a deployment speed of 75.4 Hz, surpassing baseline models. Mathematical derivations are provided to support the effectiveness of our proposed loss mechanism. Our comprehensive experiments demonstrate that our proposed method significantly improves detection accuracy without incurring extra inference time cost. In the future, we plan to focus on improving on-infrastructure detection.

Fig. 1. Illustration of the inconsistency problem in object detection. (a) Example of the inconsistency problem in the image domain: inconsistent bounding boxes with a high classification score but low IoU (compared to the ground truth (red box)) in 2D detection, leading to a suboptimal output (green box) after post-processing (NMS). (b) Conjecture of a similar inconsistency problem in pointcloud-based 3D detection. (c) Real example of the inconsistency problem from PointPillar [21]. (d) Expectation of consistent prediction: a better 3D detector is expected to harmonize the localization and classification of predicted objects, resulting in a reasonable output (blue box). Our work focuses on alleviating inconsistent predictions in the pointcloud domain to achieve the expected predictions in real-world applications.

Fig. 2. Visualization of the gradients from the 3D detection loss related to different sub-tasks (object classification and object localization (regression)). (a) Gradients from the classification part of the common 3D detection loss. (b) Gradients from the classification part of our proposed 3D harmonic loss. (c) Gradients from the regression part of the common 3D detection loss. (d) Gradients from the regression part of our proposed 3D harmonic loss. For a better view, (e), (f), (g), and (h) show their vertical forms. The color intensity indicates the value of the gradients (see colorbar). MATLAB was utilized to analyze the data and plot the diagrams.

Fig. 3. Visualization of the gradients from the direction estimation part, based on ten thousand data samples. The color intensity indicates the value of $\partial H^i_{3D\text{-}Har}/\partial \hat{p}_i$ for the corresponding $[\hat{p}_i, L_{cls}, L_{reg}]$. The three axes represent the values of $\hat{p}_i$, $L_{reg}$, and $L_{cls}$, respectively. For example, in Fig. 3(a), the globally unique optimum is $\partial H^i_{3D\text{-}Har}/\partial \hat{p}_i = 0$ when $\hat{p}^{gt}_i = 0$, because $L_{cls} = 0$ and $L_{reg} = 0$ when $\hat{p}_i = 0$. Similarly, Fig. 3(b) illustrates the same intrinsic-consistency paradigm for $\hat{p}^{gt}_i = 1$.
In Fig. 4(a) and (d), the Harmonic PointPillar model outperforms the baseline model (PointPillar) in terms of localization accuracy. In Fig. 4(b) and (c), our model detects more valid objects, which were missed by the baseline model. Additionally, our model has a lower false-positive (FP) ratio than the baseline model in all example frames.

TABLE II
MAP EVALUATION OF 3D OBJECT DETECTION ON CAR CLASS OF KITTI VALIDATION DATASET

TABLE III
MAP EVALUATION OF BEV OBJECT DETECTION ON CAR CLASS OF KITTI TEST BENCHMARK

TABLE IV
MAP EVALUATION OF BEV OBJECT DETECTION ON CAR CLASS OF DAIR-V2X-I DATASET

TABLE V
MAP EVALUATION OF 3D OBJECT DETECTION ON CAR CLASS OF DAIR-V2X-I DATASET

Time efficiency of 3D detection matters in V2X applications. Table VIII presents a runtime comparison of various state-of-the-art 3D detectors. On average, our proposed method, like PointPillar [21], is at least 1.5 times faster than other methods, since it works as a training optimizer and does not cause delays in detection inference. This measurement confirms that our method is highly time-friendly.

TABLE VIII
AVERAGE RUNTIME COMPARISON OF 3D/BEV OBJECT DETECTION