Proxy Virtual Worlds



Virtual KITTI dataset


Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.

Virtual KITTI contains 40 high-resolution videos (17,008 frames) generated from five different virtual worlds in urban settings under different imaging and weather conditions. These worlds were created using the Unity game engine and a novel real-to-virtual cloning method. These videos are automatically and fully annotated at the pixel level with category, instance, flow, and depth labels (cf. below for download links).

Our CVPR 2016 paper [ pdf , arxiv ] describes the dataset, the semi-automatic method used to build it, and experiments on measuring the real-to-virtual gap, deep learning with virtual data, and measuring the generalization performance under changes in imaging and weather conditions.




10 Aug. 2016: Update to scene ground truth (v.1.2.1). Small bug fix on poles and transparent shaders, affecting only a few pixels of the scene ground truth images. Everything else is unchanged.


Terms of Use and Reference


The Virtual KITTI Dataset (an Adaptation of the KITTI Dataset)


Copyrights in The Virtual KITTI Dataset are owned by Xerox.


The Virtual KITTI Dataset is provided by Xerox, may be used for non-commercial purposes only, and is subject to the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license, a summary of which is located here.


The Virtual KITTI Dataset is an Adaptation of the KITTI Vision Benchmark Suite. See also the publication by Andreas Geiger, Philip Lenz, and Raquel Urtasun, entitled "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite", in Computer Vision and Pattern Recognition (CVPR), 2012.


When using or referring to this dataset in your research, please cite Xerox as the originator of the Virtual KITTI Dataset and cite our CVPR 2016 paper [ pdf ] (6MB) [ arxiv ]. The full reference is given below:

Virtual Worlds as Proxy for Multi-Object Tracking Analysis
Adrien Gaidon, Qiao Wang, Yohann Cabon, Eleonora Vig
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

    author = {Gaidon, A and Wang, Q and Cabon, Y and Vig, E},
    title = {Virtual Worlds as Proxy for Multi-Object Tracking Analysis},
    booktitle = {CVPR},
    year = {2016}




We provide one .tar[.gz] archive per type of data as described below. Here is a list of all the URLs for batch download, and here is a list of the MD5 checksums for each archive. You can extract an archive to a folder via the command ‘tar xvf filename.tar’ (replace xvf by xzvf for compressed .tar.gz files). Windows users can use the 7-zip software to extract these archives.
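The download-and-extract step described above can also be done from Python. The following is a minimal sketch using only the standard library; the helper names `md5_of_file` and `verify_and_extract` are illustrative assumptions and not part of any official dataset tooling, and the expected checksum is whatever value the MD5 list gives for the archive.

```python
import hashlib
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, expected_md5, out_dir):
    """Extract a .tar or .tar.gz archive only if its MD5 matches (hypothetical helper)."""
    actual = md5_of_file(archive_path)
    if actual.lower() != expected_md5.strip().lower():
        raise ValueError(f"MD5 mismatch for {archive_path}: got {actual}")
    # Mode "r:*" lets tarfile detect gzip compression transparently.
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(out_dir)
```

Checking the digest first avoids extracting a truncated or corrupted download, which matters for the multi-gigabyte archives.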

In the following, "<version>" is the dataset version number (currently 1.2), and "<world>" is the name of a virtual world, which is the sequence number of the corresponding original “seed” real-world KITTI sequence (0001, 0002, 0006, 0018, 0020). The placeholder "<variation>" denotes one of the different rendering variations in terms of imaging or weather conditions:

    clone: rendering as close as possible to original real-world KITTI conditions
    15-deg-right: horizontal rotation of the camera 15 degrees to the right
    15-deg-left: horizontal rotation of the camera 15 degrees to the left
    morning: typical lighting conditions after dawn on a sunny day
    sunset: lighting typical of slightly before sunset
    overcast: typical overcast weather (diffuse shadows, strong ambient lighting)
    fog: fog effect implemented via a volumetric formula
    rain: simple rain particle effect (ignoring the refraction of water drops on the camera)

Note that our indexes always start from 0.
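As a quick illustration of how the "<world>" and "<variation>" placeholders combine, here is a small Python sketch that enumerates every video. The constant and function names are assumptions made for this example; the world and variation values are the ones listed above.

```python
from itertools import product

# World names and rendering variations as listed in this document.
WORLDS = ["0001", "0002", "0006", "0018", "0020"]
VARIATIONS = ["clone", "15-deg-right", "15-deg-left", "morning",
              "sunset", "overcast", "fog", "rain"]


def iter_videos():
    """Yield every (world, variation) pair, i.e. one entry per video."""
    yield from product(WORLDS, VARIATIONS)


videos = list(iter_videos())
print(len(videos))  # 5 worlds x 8 variations = 40 videos
```

The count matches the 40 videos mentioned in the introduction.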




Rendered RGB frames: link (11GB)


Each video is simply a folder in the format:





Object Detection and Multi-Object Tracking Ground Truth: link (3.3MB)


The MOT ground truth for each video consists of two CSV-like text files named:


The first set of files (used in our CVPR 2016 experiments) are in a KITTI-like format that can be loaded with the following one-liner in Python using the popular pandas library (assuming ‘import pandas as pd’):

     gt = pd.read_csv("<filename>", sep=" ", index_col=False)

Each line contains one object annotation with the following columns:

    frame: frame index in the video (starts from 0)
    tid: track identification number (unique for each object instance)
    label: name of the category of the object (Car, Van, DontCare)
    trunc: truncation flag, 0: not truncated, 1: truncated, 2: heavily truncated
               (2: ignored in evaluation and marked as “DontCare”)
    occ: occlusion flag, 0: not occluded, 1: occluded, 2: heavily occluded
            (2: ignored in evaluation and marked as “DontCare”)
    l, t, r, b: respectively left, top, right, bottom bounding box coordinates
                 (inclusive, (0,0) origin is on the upper left corner of the image)
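Once loaded, the annotations can be filtered and measured with ordinary pandas operations. The sketch below assumes the column names listed above; the sample rows are made up for illustration and are not real dataset values.

```python
import io
import pandas as pd

# Made-up sample rows in the column layout listed above (not real dataset values).
sample = io.StringIO(
    "frame tid label trunc occ l t r b\n"
    "0 0 Car 0 0 100 150 199 209\n"
    "0 1 Van 1 0 300 140 420 230\n"
    "0 2 DontCare 2 2 0 160 40 200\n"
)
gt = pd.read_csv(sample, sep=" ", index_col=False)

# Keep only the annotations that count in evaluation.
objs = gt[gt["label"] != "DontCare"].copy()

# Coordinates are inclusive, so a box spanning l..r covers r - l + 1 pixels.
objs["width"] = objs["r"] - objs["l"] + 1
objs["height"] = objs["b"] - objs["t"] + 1

# Number of distinct annotated objects per frame.
per_frame = objs.groupby("frame")["tid"].nunique()
```

The `+ 1` in the size computation follows from the inclusive box convention; omitting it would under-count each dimension by one pixel.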

The second set of files (with the ‘_raw’ suffix) contains our raw annotations, from which the first set is derived. They can be loaded in the same way, but no filtering or DontCare marking is applied, and the truncation (trunc) and occlusion (occ) flags are replaced by the automatically estimated truncation ratio (truncr) and occupancy ratio (occupr), respectively, as described in the paper.




Semantic and Instance-level Segmentation Ground Truth: link (391MB)


The per-pixel semantic and instance-level segmentation ground truth is encoded as per-frame .png files (standard 8-bit precision per channel) and a per-world per-variation text file giving the correspondence between RGB color codes and labels:


Each ‘vkitti_<version>_scenegt/<world>_<variation>_scenegt_rgb_encoding.txt’ file contains one line per category formatted like ‘<category>[:<tid>] <R> <G> <B>’, where ‘<category>’ is the name of the semantic category of that pixel (Building, Car, GuardRail, Misc, Pole, Road, Sky, Terrain, TrafficLight, TrafficSign, Tree, Truck, Van, Vegetation), ‘<tid>’ is the (optional) integer track identifier to differentiate between instances of the same category (e.g., ‘Car:0’, but Sky and Tree have no track ID), and ‘<R> <G> <B>’ is the color encoding of that label in the corresponding ‘vkitti_<version>_scenegt/<world>/<variation>/%05d.png’ ground truth images.
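The line format described above can be parsed with a few lines of standard-library Python. This is a minimal sketch: the function names are hypothetical, and skipping lines that do not match the pattern (for instance a possible header line) is an assumption of this example rather than something documented here.

```python
def parse_encoding_line(line):
    """Parse '<category>[:<tid>] <R> <G> <B>' into ((R, G, B), (category, tid)).

    Returns None for lines that do not match the pattern (assumed to be
    headers or blank lines; this is an assumption of the sketch).
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    name, *rgb = parts
    category, sep, tid = name.partition(":")
    try:
        color = tuple(int(v) for v in rgb)
        track_id = int(tid) if sep else None
    except ValueError:
        return None
    return color, (category, track_id)


def load_encoding(lines):
    """Build a {(R, G, B): (category, tid)} lookup from an iterable of lines."""
    table = {}
    for line in lines:
        parsed = parse_encoding_line(line)
        if parsed is not None:
            color, label = parsed
            table[color] = label
    return table
```

For example, `load_encoding(open(path))` on an encoding file yields a dictionary that maps each pixel color to its label, such as `("Car", 0)` for an instance or `("Sky", None)` for a category without track IDs. Note that OpenCV's `cv2.imread` returns channels in BGR order, so pixel values read that way need to be reversed before the lookup.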




Optical Flow Ground Truth: link (3.8GB)

Figure: Sintel visualization of Virtual KITTI flow ground truth

The optical flow ground truth from the current frame to the next frame for each video is a folder in the format:


where the index corresponds to the current frame (starting from 00000), and the flow can be decoded from the RGB values of each pixel (16 bits per channel) as:

    R = flow along x-axis normalized by image width and quantized to [0;2^16 - 1]
    G = flow along y-axis normalized by image height and quantized to [0;2^16 - 1]
    B = 0 for invalid flow (e.g., sky pixels)

Some example decoding code in Python using OpenCV and numpy:

import numpy as np
import cv2

def read_vkitti_png_flow(flow_fn):
    "Convert from .png to (h, w, 2) (flow_x, flow_y) float32 array"
    # read png to bgr in 16 bit unsigned short
    bgr = cv2.imread(flow_fn, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    h, w, _c = bgr.shape
    assert bgr.dtype == np.uint16 and _c == 3
    # b == invalid flow flag == 0 for sky or other invalid flow
    invalid = bgr[..., 0] == 0
    # g,r == flow_y,x normalized by height,width and scaled to [0;2**16 - 1]
    out_flow = 2.0 / (2**16 - 1.0) * bgr[..., 2:0:-1].astype('f4') - 1
    out_flow[..., 0] *= w - 1
    out_flow[..., 1] *= h - 1
    out_flow[invalid] = 0  # or another value (e.g., np.nan)
    return out_flow

Note that our ground truth flow .png files often look like a purple haze, in contrast to the visualization at the beginning of this section, which uses the common Sintel color wheel. This is normal: it is a consequence of our 16-bit encoding, designed to keep high precision, including for large displacements, in a standard lossless compressed format (PNG16).




Depth Ground Truth: link (4.1GB)


The depth ground truth for each video is stored in 16-bit grayscale PNG images as:


Depth values correspond to the depth relative to a fixed far plane of 655.35 meters, normalized and quantized to the [0;2^16 - 1] integer range. This means that a pixel intensity of 1 in these ground truth images corresponds to 1 cm, and the maximum intensity of 2^16 - 1 = 65535 corresponds to the far plane at 655.35 m. The depth map in centimeters can be directly loaded in Python with numpy and OpenCV via the one-liner (assuming “import cv2”):

    depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
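Many pipelines expect depth in meters rather than centimeters. The following numpy-only sketch converts a loaded depth map; the function name is hypothetical, and mapping far-plane pixels to infinity (treating them as having no finite depth, e.g. sky) is an assumption of this example, not a convention stated by the dataset.

```python
import numpy as np

FAR_PLANE_CM = 2**16 - 1  # 65535 cm = 655.35 m, the fixed far plane


def depth_png_to_meters(depth_cm):
    """Convert a uint16 depth map in centimeters to float32 meters.

    Pixels at the far plane are returned as np.inf; treating them as
    "no finite depth" is an assumption of this sketch.
    """
    depth_cm = np.asarray(depth_cm)
    assert depth_cm.dtype == np.uint16
    depth_m = depth_cm.astype(np.float32) / 100.0
    depth_m[depth_cm == FAR_PLANE_CM] = np.inf
    return depth_m
```

If a finite value is preferred for far-plane pixels, the last assignment can simply be dropped, in which case they come out as 655.35 m.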



Additional Information



The current version (1.2) is slightly improved w.r.t. the one used in our CVPR 2016 paper, but yields very close results (often with a smaller gap) and the same conclusions.



Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called “Virtual KITTI”, automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show these factors may affect drastically otherwise high-performing deep models for tracking.


Related Press Coverage

MIT Tech Review: “To Get Truly Smart, AI Might Need to Play More Video Games”

Wired: “Making AI Play Lots of Videogames Could Be Huge (No, Seriously)”

Forbes: “How Deep Learning Networks Can Use Virtual Worlds To Solve Real World Problems”


Related links

First international workshop on "Virtual/Augmented Reality for Visual Artificial Intelligence (VARVAI)", in conjunction with ECCV 2016



For questions, please contact Adrien Gaidon.