Research Area:  Motion Tracking  
Status:  In progress  
Project leaders:  Wang Yan  
Collaborators:  Xiaoye Han 
Description:  
Adaptive tracking-by-detection methods use previous tracking results to generate a new training set for the object's appearance, and update the current model to predict the object location in subsequent frames. Such approaches are typically bootstrapped by manual or semi-automatic initialization in the first several frames. However, most adaptive tracking-by-detection methods focus on tracking a single object or multiple unrelated objects. Although one can trivially run several single-object trackers to track multiple objects, such a solution is frequently suboptimal because it does not exploit the inter-object constraints or the object layout information [2].

In the multiple-object case, compared with the single-object version, we add the constraint that two or more objects cannot appear at the same location in one frame, and we incorporate the object layout information. The training set consists of a frame set \(\{ x_1, x_2, \ldots, x_n \}\) indexed by time and the corresponding set of structured labels \(\{ \mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n \}\), where \(\mathbf{Y}_i = (\mathbf{y}_i^{(1)}, \mathbf{y}_i^{(2)}, \ldots, \mathbf{y}_i^{(K)})\) gives the bounding boxes of the \(K\) objects in frame \(i\). If the \(k\)th object does not appear in the \(i\)th frame, \(\mathbf{y}_i^{(k)} = null\). 
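As a rough illustration of the structured labels and the exclusion constraint described above, the following Python sketch stores one frame's label \(\mathbf{Y}_i\) as a list of boxes, with `None` standing in for \(null\). The box format `(cx, cy, w, h)`, the helper names `iou` and `is_feasible`, and the overlap threshold `max_iou` are all assumptions made for this example; the text does not specify how "the same location" is measured.

```python
from typing import List, Optional, Tuple

# Assumed box format: (center_x, center_y, width, height).
# None plays the role of "null" (object absent from the frame).
Box = Optional[Tuple[float, float, float, float]]


def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def is_feasible(layout: List[Box], max_iou: float = 0.5) -> bool:
    """True if no two present objects overlap by more than max_iou.

    The threshold is a hypothetical stand-in for the constraint that
    two objects cannot occupy the same location in one frame.
    """
    present = [b for b in layout if b is not None]
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            if iou(present[i], present[j]) > max_iou:
                return False
    return True
```

A candidate layout such as `[(10, 10, 4, 4), None, (30, 30, 4, 4)]` passes this check, while a layout that places two objects on the same box does not.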
We design a function \(f(x, \mathbf{Y})\) such that the object locations \(\mathbf{Y}^*\) in frame \(x\) are given by maximizing
\begin{equation}
f(x, \mathbf{Y}) = \sum_{k = 1}^K \langle \mathbf{w}^{(k)}, \Psi(x|_{\mathbf{y}^{(k)}}) \rangle + \langle \mathbf{v}, \Phi(\mathbf{Y}; \mathbf{Y}_{i - 1}) \rangle, \tag{6}
\end{equation}
where \(\mathbf{Y}_{i-1}\) is the layout in the \((i-1)\)th frame and \(\Phi(\mathbf{Y}; \mathbf{Y}_{i - 1})\) is the layout feature of size \(\binom{K}{2} \times 2\), whose \((k, l, j)\)th element is
\begin{equation}
\Phi_{klj}(\mathbf{Y}; \mathbf{Y}_{i - 1}) = \left\{ \begin{array}{ll} \left| \left( \mathbf{y}_{i - 1}^{(k)}(j) - \mathbf{y}_{i - 1}^{(l)}(j) \right) - \left( \mathbf{y}^{(k)}(j) - \mathbf{y}^{(l)}(j) \right)\right| & \textrm{if $\mathbf{y}_{i - 1 \mid i}^{(k \mid l)} \neq null$}\\ 0 & \textrm{otherwise} \end{array} \right., \tag{7}
\end{equation}
where \(\mathbf{y}(1)\) and \(\mathbf{y}(2)\) are the horizontal and vertical coordinates of the center of bounding box \(\mathbf{y}\), respectively. The model leads to the following optimization problem. 
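The layout feature in (7) and the score in (6) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, each box is assumed to start with its center coordinates `(cx, cy)`, `None` stands in for \(null\), the condition in (7) is read as requiring both objects to be present in both frames, and the appearance features \(\Psi\) are assumed to be precomputed and passed in (with a zero vector for an absent object).

```python
from itertools import combinations


def layout_feature(Y, Y_prev):
    """Phi(Y; Y_prev) flattened to a list of length C(K, 2) * 2.

    For each pair k < l and coordinate j, the entry is the absolute change
    in the pairwise center offset between the previous and current frame,
    or 0 if either object is missing in either frame.
    """
    phi = []
    for k, l in combinations(range(len(Y)), 2):
        boxes = (Y[k], Y[l], Y_prev[k], Y_prev[l])
        for j in (0, 1):  # 0: horizontal center, 1: vertical center
            if any(b is None for b in boxes):
                phi.append(0.0)
            else:
                prev_offset = Y_prev[k][j] - Y_prev[l][j]
                cur_offset = Y[k][j] - Y[l][j]
                phi.append(abs(prev_offset - cur_offset))
    return phi


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def score(appearance_feats, Y, Y_prev, w, v):
    """f(x, Y): per-object appearance terms plus the layout term.

    appearance_feats[k] stands in for Psi evaluated on the patch of
    object k (assumed precomputed by the caller).
    """
    appearance = sum(dot(w_k, psi_k) for w_k, psi_k in zip(w, appearance_feats))
    return appearance + dot(v, layout_feature(Y, Y_prev))
```

For example, with `Y_prev = [(0, 0), (10, 0), None]` and `Y = [(1, 0), (10, 2), (5, 5)]`, only the pair of the first two objects contributes, giving `layout_feature(Y, Y_prev) == [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]`. In this toy setting, negative entries in `v` lower the score of candidate layouts whose pairwise offsets drift from the previous frame.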
\begin{gather}
\min_{\mathbf{w}, \mathbf{v}, \mathbf{\xi}, \mathbf{\eta}} \frac{1}{2}\left(\sum_{k = 1}^K \| \mathbf{w}^{(k)} \|^{2} + \| \mathbf{v} \|^2\right) + C_1 \sum_{i=2}^n \xi_i + C_2 \sum_{k = 1}^K \sum_{\mathbf{z}^{(k)} \in Z^{(k)}} \eta_{\mathbf{z}^{(k)}} \quad \mathrm{s.t.} \tag{8} \\
\begin{split} \sum_{k = 1}^K \langle \mathbf{w}^{(k)}, \Psi(x_i|_{\mathbf{y}_i^{(k)}}) - \Psi(x_i|_{\mathbf{y}^{(k)}}) \rangle + \langle \mathbf{v}, \Phi(\mathbf{Y}_i; \mathbf{Y}_{i - 1}) - \Phi(\mathbf{Y}; \mathbf{Y}_{i - 1}) \rangle \geq \Delta^M(\mathbf{Y}_i, \mathbf{Y}) - \xi_i,& \\ \quad \forall i, \quad \mathbf{Y} \neq \mathbf{Y}_i,& \end{split} \tag{9}\\
l_{\mathbf{z}^{(k)}}(\langle \mathbf{w}^{(k)}, \Psi(\mathbf{z}^{(k)}) \rangle + b^{(k)}) \geq 1 - \eta_{\mathbf{z}^{(k)}}, \quad \forall k, \quad \forall \mathbf{z}^{(k)} \in Z^{(k)}, \tag{10}
\end{gather}
Constraint (9) is the structured constraint, where \(\mathbf{Y}_i\) is the ground-truth set of object locations in frame \(i\), \(\mathbf{Y}\) is any set of locations other than the ground truth, and \(\Delta^M(\mathbf{Y}_i, \mathbf{Y}) = \sum_{k = 1}^K \Delta(\mathbf{y}_i^{(k)}, \mathbf{y}^{(k)})\) sums the losses of the individual objects. Constraint (10) is the binary constraint.

The following figures show the tracking results on two video clips, 'motinas_multi_face_fast' and 'toys'. The videos can be found here. In most cases, the proposed method significantly outperforms the adaptive single-object methods, which quickly adapt to wrong image patches. The only exception is the Struck result for the candy bag, because the bag is never occluded. For the face video, the non-adaptive multiple-object tracking method (Huang) works well because a good face detector is available. However, the same method performs poorly on the second video because neither enough training samples nor trained detectors are available. In contrast, the proposed method works equally well on both videos due to its adaptive nature.

References
Publication
