The CVPR 2020 papers are out. These are brief notes as I read through them. The categories follow amusi's list.

Detection

Object detection

D2Det: Towards High Quality Object Detection and Instance Segmentation

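This paper proposes a two-stage detector, D2Det. "High quality" refers to two things: more precise localization and more accurate classification.

Localization: for each object proposal, a dense local regression uses a fully convolutional network to predict multiple dense box offsets.
- Compared with traditional methods: because it regresses position-sensitive real-valued offsets, the number of keypoints in a fixed region is not limited (Faster R-CNN regresses a single global offset through several fully connected layers; Grid R-CNN searches for keypoints, i.e. object boundary points, in a fixed region).

  Traditional method: given a candidate object proposal $P(x_P, y_P, w_P, h_P)$ and a target ground-truth box $G(x_G, y_G, w_G, h_G)$, the offsets are:
  $\Delta_x = (x_G - x_P)/w_P,\ \Delta_y= (y_G - y_P)/h_P \\ \Delta_w = \log(w_G /w_P ),\ \Delta_h= \log(h_G /h_P )$
  This paper: $(x_l, y_t)$ and $(x_r, y_b)$ are the top-left and bottom-right corners of the ground-truth box, and $\hat l_i, \hat t_i, \hat r_i, \hat b_i$ represent the dense box offsets predicted by the local feature $p_i$ in the left, top, right, and bottom directions, respectively. The regression targets at location $(x_i, y_i)$ are:
  $l_i = (x_i - x_l)/w_P,\ r_i = (x_r - x_i)/w_P\\ t_i = (y_i - y_t)/h_P,\ b_i = (y_b - y_i)/h_P$
  Dense regression alone is still not accurate enough, because some locations inside the proposal should be ignored.
- A binary overlap prediction (the $m,\hat m$ in the figure) reduces the influence of the background. "Binary" means each location is labelled as either background or object region; it is trained with a binary cross-entropy loss.
- Goal: find a tight bounding box surrounding the object.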
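The dense regression targets above can be sketched in NumPy. This is only an illustration under assumptions: the proposal is sampled at the cell centers of a regular $k \times k$ grid, boxes are given as corner coordinates, and the name `dense_box_targets` is hypothetical rather than taken from the D2Det code.

```python
import numpy as np

def dense_box_targets(proposal, gt, k=7):
    """Per-location targets (l, t, r, b) on a k x k grid inside a proposal.

    proposal, gt: boxes as (x1, y1, x2, y2). The grid size k is an assumption.
    """
    px1, py1, px2, py2 = proposal
    xl, yt, xr, yb = gt
    w_p, h_p = px2 - px1, py2 - py1
    # Cell centers of a regular k x k grid over the proposal (assumed sampling)
    xs = px1 + (np.arange(k) + 0.5) * w_p / k
    ys = py1 + (np.arange(k) + 0.5) * h_p / k
    x_i, y_i = np.meshgrid(xs, ys)
    # Offsets to the four ground-truth sides, normalized by the proposal size
    l = (x_i - xl) / w_p
    r = (xr - x_i) / w_p
    t = (y_i - yt) / h_p
    b = (yb - y_i) / h_p
    # Binary overlap target m: 1 where the location lies inside the ground-truth box
    m = (l >= 0) & (r >= 0) & (t >= 0) & (b >= 0)
    return np.stack([l, t, r, b]), m.astype(np.float32)

targets, mask = dense_box_targets((0, 0, 10, 10), (2, 2, 12, 8), k=5)
```

Locations where `mask` is 0 are the ones the binary overlap prediction is meant to filter out.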
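Classification: obtain discriminative features
- A discriminative RoI pooling scheme samples different sub-regions of a proposal and performs adaptive weighted pooling to obtain discriminative features: first a light-weight (equal) offset predictor, then adaptive weighting $W(F)$ that up-weights the discriminative sampling points.
- Standard offset prediction: RoIAlign obtains features from $k \times k$ sub-regions and passes these features through three fully connected layers.
- Light-weight offset prediction: only requires a $\frac k 2 \times \frac k 2$ sized RoIAlign followed by the fully connected layers (light-weight due to the smaller input vector).
- For the original sampling points $F\in R^{2k\times 2k}$, a convolution predicts the weights $W(F)\in R^{2k\times 2k}$, so the weighted RoI feature is $\tilde F = W(F)\odot F$, where $\odot$ denotes the Hadamard product.

Results:

On MS COCO test-dev, D2Det reaches 45.4 AP (ResNet101) and 50.1 AP with multi-scale; on instance segmentation it obtains a mask AP of 40.2.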

Harmonizing Transferability and Discriminability for Adapting Object Detectors

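This paper reconciles the conflict between feature transferability and discriminability in conventional unsupervised adversarial domain adaptation. It regularizes adversarial domain learning by calibrating the transferability and discriminability of features at different levels, which gives fine-grained cross-domain feature alignment. The proposed model is the Hierarchical Transferability Calibration Network (HTCN).

Unsupervised domain adaptation (UDA) has two main families of methods: one reduces the discrepancy between statistical distributions, the other learns domain-invariant features adversarially. ~~My feeling is that the problem the authors want to solve is that, when learning invariance across the two domains, the discriminability between instances in an image may interfere and turn into negative transfer.~~

The network consists of three main modules:

Local feature masks: calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment. Motivated by the observation that some local regions of an image are more descriptive and dominant than others, the local feature masks are computed from shallow-layer features in both domains to approximately guide semantic consistency, which further enhances local discriminability. After alignment, the module can be viewed as an attention-like module that captures transferable regions in an unsupervised way.

A pixel-wise local domain discriminator $D_1$ generates the feature masks. Let the feature map produced by $G_1$ have height $H$ and width $W$. The pixel-wise adversarial training loss $L_{la}$ is defined as:
$L_{la}=\frac{1}{N_s\cdot HW}\sum_{i=1}^{N_s}\sum_{k=1}^{HW}\log(D_1(G_1(x_i^s)_k))^2+\frac{1}{N_t\cdot HW}\sum_{i=1}^{N_t}\sum_{k=1}^{HW}\log(1-D_1(G_1(x_i^t)_k))^2\\ G_1(x_i)_k\to r_i^k$
The output of $D_1$ is written $d_i^k=D_1(r_i^k)$ and its uncertainty is the entropy $v(r_i^k)=H(d_i^k)$. The mask is then defined as $m_f^k=2-v(r_i^k)$. With the weights redistributed in this way, the local feature becomes $\tilde r_i^k\leftarrow r_i^k\cdot m_f^k$.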
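As a rough sketch of the mask computation, assuming $H(\cdot)$ is the binary entropy in nats and that $D_1$ outputs a probability per pixel, the re-weighting could look like the following. The arrays are made-up toy values and the function names are hypothetical.

```python
import numpy as np

def binary_entropy(d, eps=1e-6):
    """H(d) = -d log d - (1 - d) log(1 - d), clipped for numerical safety."""
    d = np.clip(d, eps, 1.0 - eps)
    return -d * np.log(d) - (1.0 - d) * np.log(1.0 - d)

def local_feature_mask(d_map):
    """m = 2 - H(d) for a map of pixel-wise discriminator outputs in (0, 1)."""
    return 2.0 - binary_entropy(d_map)

d_map = np.array([[0.5, 0.99], [0.01, 0.7]])   # toy discriminator outputs (H x W)
features = np.ones((2, 2, 4))                  # toy H x W x C local features
mask = local_feature_mask(d_map)
reweighted = features * mask[..., None]        # broadcast the mask over channels
```

Under this formula the mask lies between $2-\ln 2$ and 2, with the smallest weight at $d=0.5$.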

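Feeding these features into $D_2$ gives the loss:
$L_{ma}= \mathbb E[\log(D_2(G_2(\hat f_i^s)))] + \mathbb E[1 - \log(D_2(G_2(\hat f_i^t)))]$

Importance Weighted Adversarial Training with input Interpolation (IWAT-I): strengthens the global discriminability (calibrates the global transferability) by re-weighting the interpolated image-level features. Motivation: not all samples are equally transferable, especially after the interpolation operation; without interpolation the model becomes source-biased.

Weights are reassigned according to cross-domain similarity: the more similar a sample is across domains, the more important it is for learning.
$d_i =D_2 (G_1 \circ G_2 (x_i ))\\ v_i = H(d_i) = -d_i\cdot \log(d_i) - (1-d_i)\cdot \log(1 - d_i)$
As in the previous module, $d_i$ is the output of $D_2$ and $v_i$ is the uncertainty of each $x_i$.

The weight of each image $x_i$ is defined as $1+v_i$ ~~(that is, if the uncertainty is high and $D_2$ can hardly tell the domains apart, the sample deserves more attention)~~, so the input to $G_3$ is $g_i = f_i \times (1 + v_i )$. It is passed through $G_3$ into $D_3$, and the adversarial loss is:
$L_{ga}= \mathbb E[\log(D_3(G_3(g_i^s)))] + \mathbb E[1 - \log(D_3(G_3(g_i^t)))]$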

Context-aware Instance-Level Alignment (CILA) module: enhances the local discriminability (calibrates the local transferability) by capturing the underlying complementary effect between the instance-level feature and the global context information for instance-level feature alignment. (The context vector is aggregated from the lower layers, which are relatively invariant, i.e. transferable, across domains.)

Directly concatenating the context features and the instance-level features treats them as independent and ignores the fact that they complement each other, so a non-linear fusion strategy is proposed:
$f_{fus} = [f_c^1, f_c^2, f_c^3] \otimes f_{ins}$
where $\otimes$ denotes the tensor product operation.

This causes a dimension explosion ($d_c\times d_{ins}$), so a randomized method is used instead:
$f_{fus}=\frac 1 {\sqrt{d}}(R_1f_c)\odot(R_2f_{ins})\\ f_c=[f_c^1, f_c^2, f_c^3]$
where $\odot$ stands for the Hadamard product. $R_1$ and $R_2$ are random matrices, sampled from a uniform distribution only once and not updated during training.
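A minimal NumPy sketch of the randomized fusion follows. The feature sizes, the output dimension `d`, and the sampling range $[-1, 1]$ are assumptions chosen for illustration; the notes only state that $R_1$ and $R_2$ are drawn once from a uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, d_ins, d = 384, 128, 256            # assumed feature sizes

# R1 and R2 are sampled once and then kept fixed (never trained)
R1 = rng.uniform(-1.0, 1.0, size=(d, d_c))
R2 = rng.uniform(-1.0, 1.0, size=(d, d_ins))

def fuse(f_c, f_ins):
    """f_fus = (R1 f_c) * (R2 f_ins) / sqrt(d), with * the Hadamard product."""
    return (R1 @ f_c) * (R2 @ f_ins) / np.sqrt(d)

f_c = rng.normal(size=d_c)       # stands in for the concatenated [f_c^1, f_c^2, f_c^3]
f_ins = rng.normal(size=d_ins)   # stands in for one instance-level feature
f_fus = fuse(f_c, f_ins)
```

The fused vector has `d` entries instead of the $d_c\times d_{ins}$ entries of the full tensor product, while remaining bilinear in the two inputs.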

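The loss for this part is defined as:
$L_{ins}= -\frac{1}{N_s}\sum^{N_s}_{i=1}\sum_{i,j}\log(D_{ins}(f_{fus}^{i,j})_s) -\frac{1}{N_t}\sum^{N_t}_{i=1}\sum_{i,j}\log(1-D_{ins}(f_{fus}^{i,j})_t)$

The total training objective is:

$\max_{D_1,D_2,D_3}\min_{G_1,G_2,G_3}\ \mathcal L_{cls}+\mathcal L_{reg}-\lambda(\mathcal L_{la}+\mathcal L_{ma}+\mathcal L_{ga}+\mathcal L_{ins})$

In terms of the upper bound on the target-domain error, the three modules reduce the distance between the source and target domains, and the multi-level feature transfer reduces the constant term.

3D object detection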

Learning Depth-Guided Convolutions for Monocular 3D Object Detection

Depth-guided Dynamic-Depthwise-Dilated local convolutional network ( $D^4$ LCN)

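The rough idea of this paper is to first obtain a depth representation from a monocular RGB image, then convert the depth representation into a pseudo-LiDAR representation, and finally apply a 3D point-cloud object detection method.

The pseudo-LiDAR part puzzles me a bit. If the LiDAR-like information is derived from the image, why does using pseudo-LiDAR work better than using the depth estimate directly? Is it because the LiDAR-style conversion is better, or because converting depth into LiDAR form extracts more information? I suspect my understanding of neural networks is still lacking: giving a network a pile of data and an architecture does not mean it will learn the mapping. Here there are two function mappings, and learning them separately is easier and faster. It is like something I struggled with for a long time recently, where the data had to be transformed into another coordinate system before being fed to the network. The coordinate transform is just a matrix multiplication, yet the network could not learn it well (possibly because the corresponding axes also changed). The paper says performance can be good even when the depth map is not very accurate. My impression is that the depth map serves as guidance, a bridge from 2D to 3D, rather than being converted directly, and a generation network then extracts higher-level information from it.

Overall, RGB-based methods lose spatial information, so it is taken from the depth map to imitate LiDAR information (LiDAR information is presumably a fusion of depth information, i.e. higher-level information). Spatial information alone loses semantics, so the RGB information is kept through feature extraction and the two are combined, which gives more complete 3D information.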

$D^4$LCN considers four aspects, all of which can be implemented with linear operators of shift and element-wise product:

- **exemplar kernel**: learns the scene-specific geometry of each image
- **local convolution**: distinguishes background from object at each pixel
- **depth-wise convolution**: uses different channel filters for different purposes and reduces computational complexity
- **exemplar dilation rate**: learns receptive fields suited to objects of different scales

![](CVPR-2020/Ding_1.png)

First, a **feature extraction network** ***extracts features*** $I_n \in \mathbb R^{h_n\times w_n\times c_n}$ from the RGB image ($n$ denotes the $n$-th block). In parallel, **depth estimation** produces a ***depth map*** (the authors use DORN: the depth map is generated first, and then both are passed as input to the D4LCN network). The depth map then goes through the **filter generation network** to produce the ***convolution kernels*** $D_n \in \mathbb R^{h_n\times w_n\times c_n}$ for the feature extraction network. The outputs of the two networks are fused by the **depth-guided filtering module** to give the rectified ***3D feature map***. A **3D object detection** head then produces the output, and **NMS and transform** post-processing gives the final ***3D detection results***.

**Structure:**

- Backbone
  - Feature extraction network: ResNet-50 without its final FC and pooling layer
  - Filter generation network: uses the first three blocks of ResNet-50; the learned kernels are ***sample-wise, position-wise, and depth-wise***

  The two networks have the same number of channels in each block.
- Depth-guided filtering module
  - The output of the generation network passes through a **shifting grid** with $k \times k$ elements, $\{(g_i,g_j)\}, g\in (int)[1-k/2,k/2-1]$. For example, $g\in\{-1, 0, 1\}$ when $k = 3$, and the feature map is moved towards nine directions with a horizontal or vertical step size of 0 or 1.

```
dynamic_local_filtering(x, depth, dilated=1)
dynamic_local_filtering(x, depth, dilated=2)
dynamic_local_filtering(x, depth, dilated=3)
```

<img src="CVPR-2020/Ding_4.png" style="zoom:50%;" />

One advantage of the shift grid is that **different receptive field sizes** can be set dynamically, which is friendlier to objects of different scales. After the outputs of the two branches are merged and shifted, the new feature map is $I'=\frac{1}{k\cdot k}\sum_{g_i,g_j}(I\odot D)^{(g_i,g_j)}$
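The shift-and-average operation can be sketched in NumPy as below. This is an illustration, not the repository's `dynamic_local_filtering`: borders wrap around via `np.roll` for brevity (the real implementation may pad differently), and the grid is taken as $\{-\lfloor k/2\rfloor,\dots,\lfloor k/2\rfloor\}$ to match the $k=3$ example.

```python
import numpy as np

def depth_guided_filtering(I, D, k=3, dilation=1):
    """I' = 1/(k*k) * sum over the shift grid of the shifted product I * D.

    I, D: arrays of shape (h, w, c). Borders wrap around (a simplification).
    """
    prod = I * D                                   # element-wise (Hadamard) product
    offsets = range(-(k // 2), k // 2 + 1)         # e.g. -1, 0, 1 for k = 3
    out = np.zeros_like(prod, dtype=float)
    for gi in offsets:
        for gj in offsets:
            shift = (gi * dilation, gj * dilation)
            out += np.roll(prod, shift=shift, axis=(0, 1))
    return out / (k * k)

I = np.zeros((5, 5, 1))
I[2, 2, 0] = 9.0                                   # a single active feature
D = np.ones_like(I)                                # trivial depth-generated filter
out = depth_guided_filtering(I, D)
```

With a single active location, the output spreads its value evenly over the surrounding $3\times 3$ window, and a larger `dilation` spreads it over a wider window.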

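To make the operation depth-wise, the feature network is followed by a shift pooling:

```python
def _get_p(self, offset, dtype):
    # offset has 2N channels: N sampling points, each with two coordinates
    N, h, w = offset.size(1)//2, offset.size(2), offset.size(3)
    # (1, 2N, 1, 1)
    p_n = self._get_p_n(N, dtype)
    # (1, 2N, h, w)
    p_0 = self._get_p_0(h, w, N, dtype)
    # final sampling positions
    p = p_0 + p_n + offset
    return p
```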

<img src="CVPR-2020/Ding_2.png" style="zoom:50%;" />

To make it depth-wise, so that different kernels can have different functions, an adaptive dilation rate is learned for each filter from $I$. That is, an adaptive function $\mathcal A$ is learned so that receptive fields of different sizes can be obtained. $\mathcal A$ consists of three layers ($d$ denotes the maximum dilation rate):

- an AdaptiveMaxPool2d layer with an output size of $d \times d$ and channel number $c$;
- a convolutional layer with a kernel size of $d \times d$ and channel number $d \times c$;
- a reshape and softmax layer to generate $d$ weights $A^w (I), w\in(int)[1, d]$ with a sum of 1 for each filter.

```python
weight = self.adaptive_layers(x).reshape(-1, 512, 1, 3)
weight = self.adaptive_softmax(weight)
```

The new feature map can then be written as:

$I'=\frac{1}{d\cdot k\cdot k}\cdot \sum_w A^w (I)\sum_{g_i,g_j}(I\odot D)^{(g_i*w,g_j*w)}$


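```python
# weighted sum of the three dilation rates, using the softmax weights
x = dynamic_local_filtering(x, depth, dilated=1) * weight[:, :, :, 0:1] \
    + dynamic_local_filtering(x, depth, dilated=2) * weight[:, :, :, 1:2] \
    + dynamic_local_filtering(x, depth, dilated=3) * weight[:, :, :, 2:3]
```

This addresses the **scale insensitivity and meaningless local structure** of 2D convolutions while also **making good use of the RGB information**.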

2D-3D detection head
- The overall loss contains a classification loss (standard cross-entropy (CE) loss), a 2D regression loss, a 3D regression loss and a 2D-3D corner loss (SmoothL1 regression losses), where $s_t$ and $\gamma$ denote the classification score of the target class and the focusing parameter, respectively:
  $L = (1 - s_t)^{\gamma} (L_{class} + L_{2d} + L_{3d} + L_{corner})\\ L_{class}= - \log(s_t)$

End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection

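~~Judging from the GitHub repo, this paper has predecessor papers, which I can read later.~~

The main contribution of this paper is the Change of Representation (CoR) modules, which put pseudo-LiDAR (PL) and depth estimation in one framework so that training is end-to-end. The motivation for end-to-end training is twofold: first, LiDAR-based 3D detectors rely heavily on the accuracy of the points and predict poorly for sparse faraway points; second, about 90% of the points in KITTI images are background, so keeping the depth estimator fixed hurts the detector. In terms of performance, it is state of the art among PL methods and even among image-based methods ~~(that is, better than doing 3D detection directly from depth estimates)~~.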

Two streams in terms of point cloud processing:

- directly operating on the unordered point clouds in 3D, mostly by applying PointNet and/or applying 3D convolution over neighbors;
- operating on quantized 3D/4D tensor data, which are generated by discretizing the locations of point clouds into some fixed grids.

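CoR covers two cases: subsampling and quantization (soft quantization is used here to overcome the inherent non-differentiability).

Quantization: the 3D points are discretized into a fixed grid, and only the occupation or densities are recorded in the resulting tensor(?).
- 2D and 3D convolutions can be directly applied to extract features from the tensor
- makes back-propagation difficult

The fixed center of grid cell $m$ is denoted $\hat p_m$, and the resulting tensor $T$ is defined as:
$T(m)=\begin{cases}1, & \text{if some point } p \text{ falls in cell } m\\ 0, & \text{otherwise}\end{cases}$

Intuitively, if a point $p$ falls inside cell $m$ the entry is 1, otherwise it is 0. Back-propagation through this is difficult. Take the partial derivative of the loss with respect to $T$: if it is positive, $T$ should decrease, i.e. fewer points should fall in $m$; if it is negative, more points should fall in $m$. How to move the points accordingly is the problem.

The paper therefore changes the forward pass and uses soft quantization, with a radial basis function (RBF) providing the weights.

$P_m$ is the set of points that fall in cell $m$, and $m'$ is a neighbouring cell of $m$. In back-propagation the gradient can then directly affect the points in $P_m$, and part of the error can also be absorbed by the neighbouring cells $m'$, which makes training more effective.

The paper's figure makes this more intuitive: the real point cloud is shown in the first panel and is first discretized into the fixed grid (hard quantization). Soft quantization pushes and pulls wrongly predicted points towards the ground truth. If a cell should not contain a point, the point is pushed to the neighbouring cells; if a cell is missing a point, one is pulled in from a neighbouring cell. A positive gradient means push; a negative gradient means pull.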
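The exact soft-quantization formula is not reproduced in these notes, so the following is only a sketch under stated assumptions: a 2D grid, an RBF weight $\exp(-\|p-\hat p_m\|^2/\sigma^2)$, a cell's own points contributing their mean weight, and the eight neighbouring cells contributing the average of their mean weights. The function name and the default `sigma` are hypothetical.

```python
import numpy as np

def soft_quantize(points, grid_size, cell=1.0, sigma=0.5):
    """Soft occupancy of an H x W grid from 2D points (assumed RBF form)."""
    H, W = grid_size
    idx = np.floor(points / cell).astype(int)      # hard cell index of each point
    T = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            center = (np.array([i, j]) + 0.5) * cell   # fixed cell center p_hat_m
            neighbours = []
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    in_cell = (idx[:, 0] == i + di) & (idx[:, 1] == j + dj)
                    if in_cell.any():
                        d2 = ((points[in_cell] - center) ** 2).sum(axis=1)
                        t = np.exp(-d2 / sigma ** 2).mean()
                    else:
                        t = 0.0                    # empty cell contributes nothing
                    if di == 0 and dj == 0:
                        T[i, j] += t               # the cell's own points
                    else:
                        neighbours.append(t)       # points of a neighbouring cell
            T[i, j] += np.mean(neighbours)
    return T

pts = np.array([[1.5, 1.5]])                       # one point at the center of cell (1, 1)
T = soft_quantize(pts, (3, 3))
```

Because every entry depends smoothly on the point coordinates, the same forward pass written in an autodiff framework would let gradients with respect to $T$ move the points, which is the push and pull behaviour described above.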
subsampling

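Network structure (training stages):

1. First train the depth estimation network (the SDN model).
2. Freeze the depth estimation network and train the 3D object detector. Two LiDAR-based detectors are used: PIXOR (voxel-based, with quantization) and PointRCNN (P-RCNN, point-cloud-based).
3. Finally, train the two jointly with balanced loss weights.

```python
# Method of a trainer object that holds self.model (3D detector), self.model_fn,
# self.depth_model and self.optimizer; one joint update on a single batch.
def train_one_epoch(self, batch):
    # switch both networks to training mode and clear old gradients
    self.model.train()
    self.depth_model.net.train()
    self.optimizer.zero_grad()
    self.depth_model.optimizer.zero_grad()

    # forward: the depth network returns its loss and the pseudo-LiDAR point cloud,
    # which is fed to the 3D detector
    depth_loss, point_cloud = self.depth_model.train(batch)
    loss, tb_dict, disp_dict = self.model_fn(self.model, point_cloud)
    disp_dict['depth_loss'] = depth_loss.item()
    loss = loss*0.01 + depth_loss

    # backward through both networks, then update their parameters
    loss.backward()
    self.optimizer.step()
    self.depth_model.optimizer.step()
```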

Loss:

$L = \lambda_{det} L_{det} + \lambda_{depth} L_{depth}\\ L_{det}= \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg}\\ L_{depth}=\frac{1}{|A|}\sum_{(u,v)\in A}l(Z(u, v) - Z^*(u, v))\\ l(x)=\begin{cases}0.5x^2, & \text{if } |x|<1\\|x|-0.5, & \text{otherwise}\end{cases}$
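For concreteness, the smooth-L1 depth loss and the weighted sum can be written in NumPy as follows. The default weights 0.01 and 1.0 mirror the `loss*0.01 + depth_loss` line in the training snippet; the function names and toy values are illustrative only.

```python
import numpy as np

def smooth_l1(x):
    """l(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def depth_loss(z_pred, z_gt, valid):
    """Mean smooth-L1 error over the pixel set A that has ground-truth depth."""
    return smooth_l1(z_pred[valid] - z_gt[valid]).mean()

def total_loss(l_det, l_depth, lam_det=0.01, lam_depth=1.0):
    """L = lam_det * L_det + lam_depth * L_depth."""
    return lam_det * l_det + lam_depth * l_depth

z_pred = np.array([[10.5, 20.0], [31.0, 0.0]])
z_gt = np.array([[10.0, 22.0], [30.0, 0.0]])
valid = z_gt > 0                      # pixels with a ground-truth depth value
l_depth = depth_loss(z_pred, z_gt, valid)
```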

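Predecessors:

Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving (CVPR 2019)

This paper argues that earlier methods mainly simulated LiDAR from the front-view camera image, which did not work well. It finds that a bird's-eye view obtained from stereo cameras significantly improves detection range and accuracy, and points out that the data representation, not the data quality, accounts for most of the accuracy gap. It raises detection accuracy for objects within 30 m from 22% to 74%.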

The error of stereo-based 3D depth estimation grows quadratically with the depth of an object, whereas for Time-of-Flight (ToF) approaches, such as LiDAR, this relationship is approximately linear.

To this end, we propose a two-step approach by first estimating the dense pixel depth from stereo (or even monocular) imagery and then back-projecting pixels into a 3D point cloud. By viewing this representation as pseudo-LiDAR signal, we can then apply any existing LiDAR-based 3D object detection algorithm. In other words, this is the method of the paper above with the depth model and the 3D object detection model separated into two steps.

(Note that the arrow here is one-directional: the two networks do not affect each other.)

Depth estimation (the paper uses the pyramid stereo matching network, PSMNet)

Given a pair of cameras with a horizontal offset (i.e., baseline) $b$, the stereo network outputs a disparity map $Y$: it treats the left image, $I_l$, as reference and records in $Y$ the horizontal disparity to $I_r$ for each pixel. With the horizontal focal length $f_U$ of the left camera known, the depth map is:
$D(u,v)=\frac{f_U\times b}{Y(u,v)}$

Pseudo-LiDAR generation

Derive the 3D location $(x, y, z)$ of each pixel $(u, v)$ from its depth:

$(depth)\ z = D(u, v)\\ (width)\ x= \frac{(u - c_U )\times z}{f_U}\\ (height)\ y= \frac{(v - c_V )\times z }{f_V}$

$(c_U, c_V)$ is the pixel location corresponding to the camera center and $f_V$ is the vertical focal length.

Since real LiDAR signals only reside in a certain range of heights, we disregard pseudo-LiDAR points beyond that range. As most objects of interest (e.g., cars and pedestrians) do not exceed this height range there is little information loss.

```python
"""
code from https://zhuanlan.zhihu.com/p/91479831
"""
from PIL import Image
import numpy as np
import pptk
import cv2

# camera intrinsics: focal lengths and principal point, in pixels
fu = 2301.3147
fv = 2301.3147
cu = 1489.8536
cv = 479.1750

disparity_map = Image.open('171206_034625454_Camera_5.png')
disparity_map = np.array(disparity_map, dtype=np.float64)
disparity_map = disparity_map / 200          # divide by 200 to recover the original values
disparity_map[disparity_map == 0] = 1e-3     # avoid division by zero
depth_map = fu / disparity_map               # fu is the horizontal focal length, b == 1
cv2.imwrite('depth_map.jpg', depth_map)

# slow version: rows index v (vertical), columns index u (horizontal)
rows, cols = depth_map.shape
point_cloud = np.empty(shape=(rows, cols, 3))
for v in range(rows):
    for u in range(cols):
        point_cloud[v, u, 2] = depth_map[v, u]                  # z
        point_cloud[v, u, 0] = (u - cu) * depth_map[v, u] / fu  # x
        point_cloud[v, u, 1] = (v - cv) * depth_map[v, u] / fv  # y

# fast version: same result, vectorized
u, v = np.meshgrid(np.arange(cols), np.arange(rows))
x = (u - cu) * depth_map / fu
y = (v - cv) * depth_map / fv
point_cloud = np.stack([x, y, depth_map], axis=-1)

# save as an (N, 3) array and visualize
point_cloud = point_cloud.reshape((-1, 3))
np.save('point_cloud.npy', point_cloud)
viewer = pptk.viewer(point_cloud)  # pptk can be used to visualize point clouds
```

3D object detection (the paper uses AVOD and Frustum PointNet)

This example goes to show how some operations the convolutional network might perform could border on the absurd.

Future work:

- higher resolution stereo images
- in this paper we did not focus on real-time image processing and the classification of all objects in one image takes on the order of 1s.
- it is likely that future work could improve the state-of-the-art in 3D object detection through sensor fusion of LiDAR and pseudo-LiDAR.

Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving (ICLR 2020): an improvement on the paper above that modifies the depth estimation network and uses a (cheaper) 4-beam LiDAR for correction.

more aligned with accurate depth estimation of faraway objects.

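Two main contributions:
1. Identify the disparity estimation as a main source of error for stereo-based systems and propose a novel approach to learn depth directly end-to-end instead of through disparity estimates.
2. Advocate that one should not use expensive LiDAR sensors to learn the local structure and depth of objects. Instead one can use commodity stereo cameras for the former and a cheap sparse LiDAR to correct the systematic bias in the resulting depth estimates.

So there are two directions. The first is to modify the depth estimation network, because experiments show that, for a given disparity error, the depth error grows quadratically with depth. This can also be seen from $Z(u,v)=\frac{f_U\times b}{D(u,v)}$ (here $D$ denotes the disparity and $Z$ the depth).

The depth error is therefore used directly as the loss (instead of the disparity error), and the 3D convolutions also operate in depth space.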
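A short worked step shows where the quadratic growth comes from. This is a first-order error propagation, with $\delta D$ a small disparity error (a symbol introduced here only for illustration):

$Z = \frac{f_U\, b}{D} \;\Rightarrow\; \frac{\partial Z}{\partial D} = -\frac{f_U\, b}{D^2} = -\frac{Z^2}{f_U\, b}$

$|\delta Z| \approx \frac{Z^2}{f_U\, b}\,|\delta D|$

So the same disparity error produces four times the depth error at twice the distance, which is why faraway objects suffer most.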

the depth cost volume $C_{depth}$ , in which $C_{depth}(u, v, z, :)$ will encode features describing how likely the depth $Z(u, v)$ of pixel $(u, v)$ is $z$ .

where $S_{depth}$ is used to predict the pixel depth in a similar way.

In principle, $C_{disp}(u, v, \frac{f_U\times b}{z}, :)$ and $C_{depth}(u, v, z, :)$ should hold the same cost; bilinear interpolation is used to convert between them.

The second direction is to correct the pseudo-LiDAR point cloud with a cheap LiDAR (~~presumably one with few laser beams that returns fairly sparse measurements~~).
- First, we characterize the local shapes by the directed K-nearest-neighbor (KNN) graph in the PL point cloud
- solve for these weights with the following constrained quadratic optimization problem:
  $W = \arg\min_W \|Z - WZ\|^2_2,\ \text{s.t. } W\mathbf 1 = \mathbf 1 \text{ and } W_{ij} = 0 \text{ if } j \notin N_i$
The two steps described in the main paper can be easily turned into two (sparse) linear systems and then solved by using Lagrange multipliers. For the first step, we solve a problem that is slightly modified from that described in the main paper (for more accurate reconstruction). For the second step, we use the Conjugate Gradient (CG) to iteratively solve the sparse linear system.
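The first step has the same form as the weight computation in locally linear embedding, so a dense toy version can be sketched with the standard closed-form solution obtained from a Lagrange multiplier. This is not the paper's sparse solver: the brute-force neighbour search, the regularization term `reg`, and the function name are assumptions for illustration.

```python
import numpy as np

def knn_reconstruction_weights(Z, k=3, reg=1e-3):
    """Row i of W reconstructs Z[i] from its k nearest neighbours; rows sum to 1."""
    n = Z.shape[0]
    W = np.zeros((n, n))
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]          # skip the point itself
        diff = Z[nbrs] - Z[i]                        # (k, 3) neighbour offsets
        C = diff @ diff.T                            # local Gram matrix
        C += reg * (np.trace(C) + 1e-12) * np.eye(k) # regularize for stability (assumption)
        w = np.linalg.solve(C, np.ones(k))           # stationarity condition of the Lagrangian
        W[i, nbrs] = w / w.sum()                     # enforce the sum-to-one constraint
    return W

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 3))                         # toy pseudo-LiDAR points
W = knn_reconstruction_weights(Z, k=3)
```

Each row of `W` is non-zero only on the neighbours $N_i$, matching the constraint $W_{ij}=0$ for $j\notin N_i$.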

MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships

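The paper exploits the spatial constraints between partially occluded objects and their neighbours. Inspired by CenterNet, it treats objects as points. The proposed detector makes uncertainty-aware predictions of the objects and of the 3D distances between neighbouring objects, which are then jointly optimized by nonlinear least squares. For run-time efficiency, the one-stage uncertainty-aware prediction structure (learned in an unsupervised manner) and the post-optimization module are integrated.

Three modules: 2D box detection, 3D box detection with uncertainty, and 3D pair distance with uncertainty; eleven output branches (the blue parts in the figure). The aleatoric uncertainty $\tilde\sigma(x)$ makes the loss more robust to noisy input in a regression task.

2D detection
- heatmap ($W \times H \times C$): predicts the object location $c^g = (u^g , v^g )$
- offset ($W \times H \times 2$): predicts the offset vector $(\delta^u, \delta^v)$ from the located keypoint $c^g$ to the bounding box center $c^b = (u^b , v^b)$ ($c^g$ is a local maximum of the heatmap, and adding the offset gives the box center $c^b$; in other words, this is the offset between the 2D center and $c^g$). L1 loss.
- dimension ($W \times H \times 2$): the size of the bounding box $(w^b , h^b)$. L1 loss.
Object Location Error:
3D detection
- depth ($W \times H \times 1$): the $z$ in panel (c), the depth from the car to the camera. Since regressing depth directly is difficult, the network outputs $\hat z$ and the depth is recovered by the inverse sigmoid transformation $z = 1/\sigma(\hat z ) - 1$. L1 loss.
- offset ($W \times H \times 2$): the $\Delta^u, \Delta^v$ in panel (b), the offset between the projected 3D center and $c^g$. The resulting $c^o$ is back-projected with the camera intrinsic matrix $K$ to the object center $c^w$ in world coordinates. L1 loss.
  $c^w=(\frac{u^g+\Delta^u-a_x}{f_x}z,\frac{v^g+\Delta^v-a_y}{f_y}z,z)\\ K=\left[\begin{matrix}f_x&0&a_x\\0&f_y&a_y\\0&0&1\end{matrix}\right]$
- dimension ($W \times H \times 3$): regresses $w, h, l$. L1 loss.
- depth $\sigma$ ($W \times H \times 1$):
- offset $\sigma$ ($W \times H \times 1$):
- rotation ($W \times H \times 8$): the object's local orientation $\alpha$ (yaw) and global orientation $\beta$ in the camera coordinate system. The relative rotation of the object to the camera viewing angle is $\gamma = \arctan(x/z)$ (arctan(positions[1]/position[0])). The orientation is represented using eight scalars, and the orientation branch is trained with the MultiBin loss. (~~Eight? The angles of the eight corner points? Or three scalars each for $\alpha,\beta$, with $\gamma$ represented by x and z?~~)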
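The depth decoding and the back-projection of the 3D center can be checked numerically with a small sketch. It uses the standard pinhole back-projection, in which the normalized image coordinates are multiplied by the depth $z$; the intrinsic values below are made up and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_depth(z_hat):
    """z = 1 / sigmoid(z_hat) - 1, which simplifies to exp(-z_hat)."""
    return 1.0 / sigmoid(z_hat) - 1.0

def back_project(u, v, z, K):
    """Pixel (u, v) at depth z -> 3D point, using the intrinsic matrix K."""
    f_x, f_y = K[0, 0], K[1, 1]
    a_x, a_y = K[0, 2], K[1, 2]
    return np.array([(u - a_x) * z / f_x, (v - a_y) * z / f_y, z])

K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])          # toy intrinsics, not from any dataset
z = decode_depth(-np.log(20.0))          # network output that decodes to 20 m
c_w = back_project(670.0, 215.0, z, K)   # (u^g + Delta^u, v^g + Delta^v) = (670, 215)
```

Projecting `c_w` back with `K` and dividing by `z` returns the original pixel, which is a quick way to validate the sign and scale conventions.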
Pair constraint

For a pair $(c_i^w,c_j^w)$, the midpoint of the segment joining them is $p_{ij}^w$; correspondingly, on the feature map the midpoint of $(c_i^b,c_j^b)$ is $p_{ij}^b$. (Note that $p_{ij}^b$ is not the projection of $p_{ij}^w$.)

N effective objects, M pair constraints, The proposed spatial constraint optimization is formulated as a nonlinear least square problem as

e is the Pairwise Constraint Error vector, $(e_{ij}^x, e_{ij}^y,e_{ij}^z)^T=\vec{\|\tilde k_{ij}^v-k_{ij}^v\|}$ .

W is the weight matrix for different errors. W is a diagonal matrix with dimension $3N^G + 3M$. The weight of the error is higher when the uncertainty is lower, which means we have more confidence in the predicted output.

For each vertex $\zeta_i$ , there are three variables $(u_i, v_i, z_i)$ , which are the projected center $(u_i, v_i)$ of the 3D bounding box on the feature map and the depth $z_i$ .
- distance ($W \times H \times 3$): The 3D absolute distance $k_{ij}^v = (k_x^v , k_y^v , k_z^v)_{ij}$ along the view point direction is taken as the regression target, which is the distance branch of the pair constraint output.

  For training, $k_{ij}^v$ can be easily collected through the groundtruth 3D object centers from the training data as:
  $k_{ij}^v=\vec{|R(\gamma_{ij})k_{ij}^w|}$
  where $|\vec ·|$ means extracting the absolute value of each entry in the vector, $k_{ij}^w = c_i^w - c_j^w$ is the 3D distance in camera coordinates, $\gamma_{ij} = \arctan(p_x^w / p_z^w)$ is the view direction of their middle point $p_{ij}^w$ , and $R(\gamma_{ij})$ is its rotation matrix along the Y axis:
  $R(\gamma_{ij})=\left[\begin{matrix}\cos(\gamma_{ij})&0&-\sin(\gamma_{ij})\\0&1&0\\\sin(\gamma_{ij})&0&\cos(\gamma_{ij})\end{matrix}\right]$
  $k_{ij}^w$ is not used directly because, in camera coordinates, the distance vector of the same pair changes with the viewing angle, as the paper's figure illustrates.
- distance $\sigma$ ($W \times H \times 1$):
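A small NumPy sketch of the distance target follows, using `arctan2` in place of `arctan` (the two agree for objects in front of the camera, $p_z > 0$). The function names and toy centers are illustrative.

```python
import numpy as np

def view_rotation(gamma):
    """Rotation about the Y axis by the view angle gamma, as in R(gamma_ij)."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0],
                     [s, 0.0, c]])

def pair_distance_target(c_i, c_j):
    """k^v = |R(gamma) (c_i - c_j)|, with gamma the view angle of the midpoint."""
    k_w = c_i - c_j
    p = 0.5 * (c_i + c_j)
    gamma = np.arctan2(p[0], p[2])        # arctan(p_x / p_z)
    return np.abs(view_rotation(gamma) @ k_w)

c_i = np.array([1.0, 0.0, 10.0])
c_j = np.array([-1.0, 0.0, 10.0])
k_v = pair_distance_target(c_i, c_j)

# The same pair seen under a different viewing angle (scene rotated about Y)
theta = 0.3
Ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
               [0.0, 1.0, 0.0],
               [-np.sin(theta), 0.0, np.cos(theta)]])
k_v_rotated = pair_distance_target(Ry @ c_i, Ry @ c_j)
```

The two targets coincide, which is the view-invariance that motivates regressing $k_{ij}^v$ instead of $k_{ij}^w$.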

Uncertainty:

Following the heteroscedastic aleatoric uncertainty setup, we represent a regression task with L1 loss as

$[\tilde y, \tilde \sigma] = f^\theta(x)\\ L(\theta) =\frac{\sqrt 2}{\tilde \sigma}\|y - \tilde y\| + \log \tilde \sigma$

where $\tilde\sigma$ is another output of the model and can represent the observation noise of the data $x$, and $\theta$ is the weight of the regression model.
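A minimal NumPy version of this loss is shown below. Predicting $\log\tilde\sigma$ rather than $\tilde\sigma$ is a common choice to keep the scale positive, and is an assumption here rather than something stated in the notes.

```python
import numpy as np

def aleatoric_l1_loss(y, y_pred, log_sigma):
    """L = sqrt(2) / sigma * |y - y_pred| + log(sigma), averaged over samples."""
    sigma = np.exp(log_sigma)
    return np.mean(np.sqrt(2.0) / sigma * np.abs(y - y_pred) + log_sigma)

y = np.array([2.0])
y_pred = np.array([1.0])                  # absolute error e = 1
candidates = np.linspace(-1.0, 2.0, 301)  # candidate values of log(sigma)
losses = [aleatoric_l1_loss(y, y_pred, np.array([s])) for s in candidates]
best_sigma = np.exp(candidates[int(np.argmin(losses))])
```

For a fixed error $e$, setting the derivative with respect to $\tilde\sigma$ to zero gives $\tilde\sigma=\sqrt 2\,e$, so the model is pushed to report larger uncertainty where it makes larger errors.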

~~In the final results the baseline is already higher than the other methods, which feels odd. Does the baseline here mean using only the 2D and 3D branches?~~

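Video object detection

Lane detection

Human-object interaction (HOI) detection

Object tracking

Segmentation

Semantic segmentation

Instance segmentation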

D2Det: Towards High Quality Object Detection and Instance Segmentation

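See Detection > Object detection.

Panoptic segmentation

Video object segmentation

Superpixel segmentation

Interactive image segmentation

CNN

NAS

GAN

Re-ID

3D point clouds (classification / segmentation / registration / tracking, etc.)

Faces (recognition / detection / reconstruction, etc.)

Human pose estimation (2D/3D)

Human parsing

Text

Scene text detection

Scene text recognition

Feature (point) detection and description

Super-resolution

Model compression / pruning

Video understanding / action recognition

Crowd counting

Depth estimation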

On the uncertainty of self-supervised monocular depth estimation

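Self-supervision is inherently uncertain. This paper proposes a way to evaluate uncertainty for depth estimation, measures how uncertain self-supervised depth estimation is and how uncertainty estimation affects depth accuracy, and proposes a Boot+Self (bootstrapped ensemble plus self-teaching) uncertainty estimation method that can also improve depth accuracy.

Uncertainty estimation methods:

Image flipping (Post): uncertainty is estimated by flipping the image. The original network needs no modification; inference simply runs two forward passes, which gives the variance over a small sample. Concretely, it compares the depth map of the current image with the flipped-back depth map of the flipped image.
$u_{post}=|d-d^{\rightleftharpoons}|$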
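The two-pass procedure can be sketched as follows, assuming a horizontal flip. `toy_depth` is a hypothetical stand-in for a depth network, made position-dependent on purpose so that the two passes disagree.

```python
import numpy as np

def post_uncertainty(predict_depth, image):
    """u_post = |d - flip(d_flipped)| from two forward passes."""
    d = predict_depth(image)
    d_flipped = predict_depth(image[:, ::-1])    # prediction on the flipped input
    return np.abs(d - d_flipped[:, ::-1])        # flip the prediction back, compare

def toy_depth(img):
    """Stand-in network: mean intensity plus a left-to-right ramp."""
    ramp = np.linspace(0.0, 1.0, img.shape[1])
    return img.mean(axis=-1) + ramp[None, :]

image = np.random.default_rng(0).uniform(size=(4, 5, 3))   # toy H x W x 3 image
u_post = post_uncertainty(toy_depth, image)
```

A network that is perfectly equivariant to flipping would give zero everywhere, so the map highlights where the prediction depends on image side.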
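Empirical estimation: multiple predictions
- Dropout Sampling (Drop): dropout is normally applied during training, randomly dropping information to prevent overfitting. Here it is also applied at test time: multiple random drops sample different models, and the mean and variance are computed over the resulting depth estimates.
- Bootstrapped Ensemble (Boot): models trained from different parameter initializations.
- Snapshot Ensemble (Snap): the learning rate is changed periodically with cosine annealing; $C$ is set to 20 in the experiments.
  $\lambda_t=\frac{\lambda_0}{2}\cdot(\cos (\frac{\pi\cdot mod(t-1, \lceil\frac T C\rceil)}{\lceil\frac T C\rceil})+1)$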
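The cyclic schedule is easy to check in plain Python. The values of `lr0` and `T` below are arbitrary examples; only $C=20$ comes from the notes.

```python
import math

def snapshot_lr(t, lr0, T, C):
    """Cosine-annealed learning rate at iteration t (1-based), restarted C times over T steps."""
    cycle = math.ceil(T / C)                       # length of one annealing cycle
    phase = ((t - 1) % cycle) / cycle              # position inside the current cycle
    return lr0 / 2.0 * (math.cos(math.pi * phase) + 1.0)

schedule = [snapshot_lr(t, lr0=0.1, T=100, C=20) for t in range(1, 11)]
```

The rate starts each cycle at `lr0`, decays towards zero, then jumps back; a snapshot of the model is taken at the end of each cycle.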
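Predictive estimation: learn a model that predicts the uncertainty
- Learned Reprojection (Repr): the projection relation is hard to exploit without labels, but it can be reconstructed (mimicking the case where it is known); the network regresses the reprojection error $F$ as its uncertainty $u_{Repr}$.
  $F(\tilde I, I) = F(\pi(I^\dagger, K^\dagger , R|t, K, d), I)=\alpha\frac {1-SSIM(\tilde I, I)}{2}+(1-\alpha)\|\tilde I- I\|\\ L_{Repr} = \beta \cdot |u_{Repr} - F(\tilde I, I)|\\ L_{Repr}(q) = \beta \cdot|u_{Repr}(q) - \min_{i\in[0..K]}\ F(\tilde I_i(q), I(q))|$
- Log-Likelihood Maximization (Log): estimates the mean and variance of the distribution $p(d^*|I,D)$. An L1 loss corresponds to a Laplacian distribution and an L2 loss to a Gaussian distribution. The predicted uncertainty accounts for both depth and pose uncertainty.
  $\log p(d^*|w) =\log p(d^*(q)|\Theta(I, w))\\ L_{Log}=\frac{|\mu(d)-d^*|}{\sigma(d)}+\log \sigma(d)$
  $w$: network weights. The log term prevents the trivial solution of driving the first term to zero, i.e. it discourages predicting infinite uncertainty for every pixel.

  Another paper defines the loss as:
  $L_{Log}=\frac{\min_{i\in[0..K]}F(\tilde I_i(q),I(q))}{u_{log}}+\log u_{log}$
- Self-Teaching (Self): decouples the depth and pose uncertainties. A teacher network T is first trained in a self-supervised way with the usual reprojection loss; a student network S is then trained with a log-likelihood loss on the depth values, using T's depth as the target.
  $L_{Self}=\frac{|\mu(d_S)-d_T|}{\sigma(d_S)}+\log \sigma(d_S)$
Bayesian estimation (combines empirical estimation and predictive uncertainty models): marginalizes over all possible $w$ instead of using a single point estimate
$p(d^*|I,D) \approx \sum_{i=1}^N p(d^*|\Theta(I, w_i))\\ \mu (d)=\frac 1 N \sum_{i=1}^N \mu _i(d_i)\\ \sigma^2(d)=\frac 1 N \sum_{i=1}^N \left[(\mu_i(d_i)-\mu(d))^2+\sigma_i^2(d_i)\right]$
Combination: e.g. the best predictive method combined with the best empirical method (Boot+Self).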
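The Bayesian combination of $N$ models reduces to a few array operations. In this sketch the inputs are stacked per-model means and variances, with toy numbers chosen so the result can be verified by hand.

```python
import numpy as np

def combine_predictions(mus, sigmas2):
    """Combine N models: mus and sigmas2 have shape (N, H, W).

    Returns the mean depth and the total variance, i.e. the spread of the
    model means plus the average predicted variance.
    """
    mu = mus.mean(axis=0)
    var = ((mus - mu) ** 2 + sigmas2).mean(axis=0)
    return mu, var

mus = np.array([[[1.0]], [[3.0]]])        # two models, one pixel
sigmas2 = np.array([[[0.5]], [[0.5]]])    # each model's own predicted variance
mu, var = combine_predictions(mus, sigmas2)
```

Here the mean is 2.0 and the variance is 1.5: 1.0 from disagreement between the models and 0.5 from their own predicted uncertainty.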

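Experiments:

Baseline model: Monodepth2
Depth metrics
- absolute relative error (Abs Rel)
- root mean square error (RMSE)
- the amount of inliers (δ < 1.25)
Uncertainty metrics: all pixels are sorted by uncertainty in descending order, and the curve obtained by progressively removing the most uncertain pixels is compared with the curve obtained by removing the pixels with the largest error.
- AUSE (Area Under the Sparsification Error, lower is better): the area between the uncertainty-based sparsification curve and the oracle (ideal) curve; it measures how well the uncertainty tracks the actual error.
- AURG (Area Under the Random Gain, higher is better): the gain over removing pixels at random; a larger gap shows that the uncertainty estimate is effective.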
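A simplified sketch of the two metrics is given below. It approximates each area by the mean over evenly spaced removal fractions and skips any normalization the paper may apply, so the absolute values are only indicative; the function names are hypothetical.

```python
import numpy as np

def sparsification_curve(error, order, steps=20):
    """Mean error of the remaining pixels after removing a growing fraction."""
    sorted_err = error[order]                     # pixels in removal order
    n = len(error)
    fractions = np.linspace(0.0, 1.0, steps, endpoint=False)
    return np.array([sorted_err[int(round(f * n)):].mean() for f in fractions])

def ause_aurg(error, uncertainty, steps=20):
    by_unc = sparsification_curve(error, np.argsort(-uncertainty), steps)
    oracle = sparsification_curve(error, np.argsort(-error), steps)
    random = np.full(steps, error.mean())         # expected curve for random removal
    ause = np.mean(by_unc - oracle)               # gap to the ideal ordering
    aurg = np.mean(random - by_unc)               # gain over random removal
    return ause, aurg

error = np.arange(1.0, 101.0)                     # toy per-pixel errors
ause_perfect, aurg_perfect = ause_aurg(error, uncertainty=error)
ause_bad, aurg_bad = ause_aurg(error, uncertainty=-error)
```

An uncertainty that ranks pixels exactly like the true error gives an AUSE of zero and a positive AURG, while a reversed ranking gives a large AUSE and a negative AURG.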
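Monocular results:
- Depth accuracy: most methods lower it or leave it unchanged, but Self improves it.
- Uncertainty: empirical methods beat Post, predictive methods do better still, and the Boot+Self combination is best.

What impresses me is how many experiments the authors ran, including many combined methods. They organized the whole landscape of uncertainty estimation methods, evaluated how uncertainty estimation affects depth estimation, and concluded that it can improve depth estimation. It feels like they found a fresh angle for a paper.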

Classification

Image classification

Video classification

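6D object pose estimation

Hand gesture estimation

Saliency detection

Optimization

Denoising

Deblurring

Dehazing

Feature point detection and description

Visual question answering (VQA)

Video question answering (VideoQA)

Vision-and-language navigation

Video compression

Video frame interpolation

Style transfer

Trajectory prediction

Motion prediction

Optical flow estimation

Image retrieval

Virtual try-on

HDR

Adversarial examples

3D reconstruction

Depth completion

Semantic scene completion

Image/video captioning

Wireframe parsing

Datasets

Other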