Goal: predicting the paths of moving objects ahead of the ego-vehicle
Datasets captured from an on-board (vehicle-mounted) camera viewpoint
• Samples: 1.8K
• Scenes: 53
• Object categories
  - pedestrian
• Additional annotations
  - vehicle information, infrastructure
A dataset recorded on public roads
Apolloscape Dataset
Figure 3. Example scenarios of the TITAN Dataset: pedestrian bounding boxes with tracking IDs, vehicle bounding boxes with IDs, and future locations are each shown in a distinct color. Action labels are shown in different colors following Figure 2.
…ego-centric views captured from a mobile platform.
In the TITAN dataset, every participant (individuals, vehicles, cyclists, etc.) in each frame is localized using a bounding box. We annotated 3 labels (person, 4-wheeled vehicle, 2-wheeled vehicle), 3 age groups for person (child, adult, senior), 3 motion-status labels for both 2- and 4-wheeled vehicles, and door/trunk status labels for 4-wheeled vehicles. For action labels, we created 5 mutually exclusive person action sets organized hierarchically (Figure 2). In the first action set in the hierarchy, the annotator is instructed to assign exactly one class label among 9 atomic whole body actions/postures that describe primitive action poses such as sitting, standing, bending, etc. The second action set includes 13 actions that involve single atomic actions with simple scene context such as jaywalking, waiting to cross, etc. The third action set includes 7 complex contextual actions that involve a sequence of atomic actions with higher contextual understanding, such as getting in/out of a 4-wheel vehicle, loading/unloading, etc. The fourth action set includes 4 transportive actions that describe the act of manually transporting an object.
The bounding box of agent $i$ at each past time step $t$ from 1 to $T_{obs}$ is denoted $x^i_t = \{c_u, c_v, l_u, l_v\}$, where $(c_u, c_v)$ and $(l_u, l_v)$ represent the center and the dimensions of the bounding box, respectively. The proposed TITAN framework requires three inputs as follows: $I^i_{t=1:T_{obs}}$ for the action detector, $x^i_t$ for both the interaction encoder and the past object location encoder, and $e_t = \{\alpha_t, \omega_t\}$ for the ego-motion encoder, where $\alpha_t$ and $\omega_t$ correspond to the acceleration and yaw rate of the ego-vehicle at time $t$, respectively.
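As a concrete illustration of this input format, the following minimal sketch (not the authors' code) packages the three inputs for a single agent; the `Observation` container and its field names are assumptions made for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Observation:
    """Per-agent inputs over the observation horizon T_obs (hypothetical container)."""
    images: np.ndarray        # I^i_{1:T_obs}: frames, shape (T_obs, H, W, 3)
    boxes: np.ndarray         # x^i_t = (c_u, c_v, l_u, l_v) per step, shape (T_obs, 4)
    ego_motion: np.ndarray    # e_t = (alpha_t, omega_t) per step, shape (T_obs, 2)


def make_observation(frames: List[np.ndarray],
                     boxes: List[Tuple[float, float, float, float]],
                     accel: List[float],
                     yaw_rate: List[float]) -> Observation:
    """Stack per-frame annotations into the three encoder inputs."""
    assert len(frames) == len(boxes) == len(accel) == len(yaw_rate)
    return Observation(
        images=np.stack(frames),                      # action detector input
        boxes=np.asarray(boxes, dtype=np.float32),    # interaction / location encoders
        ego_motion=np.stack([accel, yaw_rate], axis=1).astype(np.float32),
    )
```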
During inference, the multiple modes of future bounding box locations are sampled from a bi-variate Gaussian generated by the noise parameters, and the future ego-motions $\hat{e}_t$ are accordingly predicted, considering the multi-modal nature of the future prediction problem.
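A minimal sketch of this kind of multi-modal sampling, assuming the decoder outputs the usual five parameters of a bivariate Gaussian per future step (means, standard deviations, correlation); this is illustrative, not the TITAN implementation.

```python
import numpy as np


def sample_future_centers(mu_u, mu_v, sigma_u, sigma_v, rho, num_modes=20, rng=None):
    """Draw candidate future box centers from a bivariate Gaussian.

    mu_*, sigma_*, rho: arrays of shape (T_pred,) predicted for each future step.
    Returns an array of shape (num_modes, T_pred, 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(mu_u)
    samples = np.empty((num_modes, T, 2))
    for t in range(T):
        mean = np.array([mu_u[t], mu_v[t]])
        cov = np.array([
            [sigma_u[t] ** 2,                  rho[t] * sigma_u[t] * sigma_v[t]],
            [rho[t] * sigma_u[t] * sigma_v[t], sigma_v[t] ** 2],
        ])
        samples[:, t, :] = rng.multivariate_normal(mean, cov, size=num_modes)
    return samples
```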
Henceforth, the notation of the feature embedding function using a multi-layer perceptron (MLP) is as follows: $\Phi$ is without any activation, and $\Phi_r$, $\Phi_t$, and $\Phi_s$ are associated with ReLU, tanh, and a sigmoid function, respectively.
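For readers who prefer code, a small sketch of how these embedding functions could be realized in PyTorch; the layer sizes are placeholders and this is only an interpretation of the notation, not the authors' implementation.

```python
import torch.nn as nn


def make_phi(in_dim: int, out_dim: int, activation: str = "none") -> nn.Module:
    """Single-layer MLP embedding: Phi (no activation), Phi_r (ReLU),
    Phi_t (tanh), Phi_s (sigmoid)."""
    layers = [nn.Linear(in_dim, out_dim)]
    act = {"none": None, "relu": nn.ReLU(), "tanh": nn.Tanh(),
           "sigmoid": nn.Sigmoid()}[activation]
    if act is not None:
        layers.append(act)
    return nn.Sequential(*layers)


# Example: embed a bounding box x_t = (c_u, c_v, l_u, l_v) with Phi_r.
phi_r = make_phi(4, 64, activation="relu")
```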
4.1. Action Recognition
We use the existing state-of-the-art method as backbone
TITAN Dataset
Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is calculated over all predicted time steps; CMSE and CFMSE are the MSEs calculated over the center of the bounding boxes for the entire predicted sequence and only the last time step, respectively.

Method     | MSE 0.5s | MSE 1s | MSE 1.5s | CMSE | CFMSE | MSE 0.5s | MSE 1s | MSE 1.5s | CMSE | CFMSE
Linear     |      123 |    477 |     1365 |  950 |  3983 |      223 |    857 |     2303 | 1565 |  6111
LSTM       |      172 |    330 |      911 |  837 |  3352 |      289 |    569 |     1558 | 1473 |  5766
B-LSTM [5] |      101 |    296 |      855 |  811 |  3259 |      159 |    539 |     1535 | 1447 |  5615
PIEtraj    |       58 |    200 |      636 |  596 |  2477 |      110 |    399 |     1248 | 1183 |  4780
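A small sketch of how MSE, CMSE, and CFMSE could be computed from predicted and ground-truth bounding boxes, following the definitions in the caption; the (x1, y1, x2, y2) array layout is an assumption.

```python
import numpy as np


def box_prediction_errors(pred, gt):
    """pred, gt: arrays of shape (N, T, 4) with boxes as (x1, y1, x2, y2) in pixels.

    Returns (mse, cmse, cfmse):
      mse   - mean squared error over all box coordinates and predicted steps
      cmse  - MSE over box centers for the entire predicted sequence
      cfmse - MSE over box centers at the final time step only
    """
    mse = np.mean((pred - gt) ** 2)

    def centers(b):
        return np.stack([(b[..., 0] + b[..., 2]) / 2,
                         (b[..., 1] + b[..., 3]) / 2], axis=-1)

    c_pred, c_gt = centers(pred), centers(gt)
    cmse = np.mean((c_pred - c_gt) ** 2)
    cfmse = np.mean((c_pred[:, -1] - c_gt[:, -1]) ** 2)
    return mse, cmse, cfmse
```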
Table 4: Speed prediction errors (MSE) over varying time steps on the PIE dataset. "Last" stands for the last time step. The results are reported in km/h.

Method   | 0.5s | 1s   | 1.5s | Last
Linear   | 0.87 | 2.28 | 4.27 | 10.76
LSTM     | 1.50 | 1.91 | 3.00 |  6.89
PIEspeed | 0.63 | 1.44 | 2.65 |  6.77
is generally better on bounding box centers due to the fewer
degrees of freedom.
Context in trajectory prediction. We first evaluate the proposed speed prediction stream, PIEspeed, by comparing this model with two baseline models, a linear Kalman filter and a vanilla LSTM model. We use the MSE metric and report the results in km/h. Table 4 shows the results of our experiments. The linear model achieves reasonable perfor-
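For illustration, a minimal constant-acceleration Kalman filter of the kind that could serve as a linear baseline for speed prediction; the state layout, noise values, time step, and horizon are assumptions, not the PIE authors' settings.

```python
import numpy as np


def kalman_speed_forecast(observed_speeds, dt=0.1, horizon=15,
                          process_var=1e-2, meas_var=1e-1):
    """Fit a constant-acceleration Kalman filter to observed ego speeds (km/h)
    and roll it forward `horizon` steps without measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])      # state: [speed, speed change rate]
    H = np.array([[1.0, 0.0]])                 # we only observe the speed
    Q = process_var * np.eye(2)
    R = np.array([[meas_var]])

    x = np.array([observed_speeds[0], 0.0])
    P = np.eye(2)
    for z in observed_speeds:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the observed speed
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P

    # pure prediction over the future horizon
    future = []
    for _ in range(horizon):
        x = F @ x
        future.append(x[0])
    return np.array(future)
```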
PIE Dataset
Figure 5: Illustration of our TrafficPredict (TP) method on camera-based images. There are six scenarios with different road
conditions and traffic situations. We only show the trajectories of several instances in each scenario. The ground truth (GT) is
drawn in green and the prediction results of other methods (ED,SL,SA) are shown with different dashed lines. The prediction
trajectories of our TP algorithm (pink lines) are the closest to ground truth in most of the cases.
…instance layer to capture the trajectories and interactions for instances and use a category layer to summarize the simi-
A dataset recorded on public roads
• Samples: 81K
• Scenes: 100,000
• Object categories
  - pedestrian, car, cyclist
A dataset recorded on public roads
• Samples: 645K
• Scenes: 700
• Object categories
  - pedestrian, car, cyclist
• Additional annotations
  - action labels, pedestrian age
Y. Ma, et al., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” AAAI, 2019.
A. Rasouli, et al., “PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction,” ICCV, 2019.
S. Malla, et al., “TITAN: Future Forecast using Action Priors,” CVPR, 2020.
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu,
Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom
nuTonomy: an APTIV company
Abstract

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.
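As a pointer for readers, a short sketch of loading the dataset with the publicly available nuscenes-devkit, assuming the devkit is installed (pip install nuscenes-devkit) and the mini split has been downloaded; the dataroot path is a placeholder.

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split (placeholder dataroot; adjust to your local copy).
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

scene = nusc.scene[0]                                   # one 20 s scene record
print(scene['description'])                             # human-written scene description
first_sample = nusc.get('sample', scene['first_sample_token'])
print(len(first_sample['anns']), 'annotated 3D boxes in this keyframe')
```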
1. Introduction
Figure 1. An example from the nuScenes dataset. We see 6 different camera views, lidar and radar data, as well as the human annotated semantic map. At the bottom we show the human written scene description.
Multimodal datasets are of particular importance as no single type of sensor is sufficient and the sensor types are complementary. Cameras allow accurate measurements of edges, color and lighting enabling classification and localization on the image plane. However, 3D localization from images is challenging [13, 12, 57, 80, 69, 66, 73]. Lidar pointclouds, on the other hand, contain less semantic infor-
nuScenes Dataset
A dataset recorded on public roads
• Samples: 1.4M (camera images)
• Scenes: 1,000
• Object categories
  - truck, bicycle, car, etc.
• Additional annotations
  - sensor data, map data, point clouds, ego-motion
H. Caesar, et al., “nuScenes: A multimodal dataset for autonomous driving,” CVPR, 2020.