Proxy Virtual Worlds



Virtual KITTI dataset


Virtual KITTI is a photo-realistic synthetic video dataset designed for training and evaluating computer vision models on several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.

Virtual KITTI contains 50 high-resolution monocular videos (21,260 frames) generated from five different virtual worlds in urban settings under different imaging and weather conditions. These worlds were created using the Unity game engine and a novel real-to-virtual cloning method. These photo-realistic synthetic videos are automatically, exactly, and fully annotated for 2D and 3D multi-object tracking and at the pixel level with category, instance, flow, and depth labels (cf. below for download links).

Our CVPR 2016 paper [pdf, arxiv] describes the dataset, our semi-automatic method used to build it, and experiments on measuring the real-to-virtual gap, deep learning with virtual data, and measuring the generalization performance under changes in imaging and weather conditions.



23 Sept. 2016: New version (v.1.3.1) with 2 new variations (+/- 30 degrees camera rotation), new 3D object ground truth and camera parameters (intrinsic + pose), car meta-data (moving/not moving flag, color and make of cars, ...) and minor bug fixes on segmentation and optical flow edge cases (including on car wheels and intricate thin structures). The experimental conclusions are identical to those of our CVPR 2016 paper. In fact, the average gap in MOTA for DPMCF is even smaller now (81.0 on real KITTI, 81.2 on VKITTI 1.3.1 clones). The previous version of the files (1.2.1) can still be downloaded here (md5).

10 Aug. 2016: Update to scene ground truth (v.1.2.1). Small bug fix on poles and transparent shaders affecting only a few pixels of the scene ground truth images. Everything else is unchanged.


Terms of Use and Reference


The Virtual KITTI Dataset (an Adaptation of the KITTI Dataset)


Copyrights in The Virtual KITTI Dataset are owned by Xerox.


The Virtual KITTI Dataset is provided by Xerox, may be used for non-commercial purposes only, and is subject to the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license, a summary of which is located here.


The Virtual KITTI Dataset is an Adaptation of the KITTI Vision Benchmark Suite. See also the publication by Andreas Geiger, Philip Lenz, and Raquel Urtasun, entitled "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite", in Computer Vision and Pattern Recognition (CVPR), 2012.


When using or referring to this dataset in your research, please cite Xerox as the originator of the Virtual KITTI Dataset and cite our CVPR 2016 paper [pdf (6MB), arxiv], cf. also full reference below:

Virtual Worlds as Proxy for Multi-Object Tracking Analysis
Adrien Gaidon , Qiao Wang, Yohann Cabon, Eleonora Vig
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

@inproceedings{Gaidon:Virtual:CVPR2016,
    author = {Gaidon, A and Wang, Q and Cabon, Y and Vig, E},
    title = {Virtual Worlds as Proxy for Multi-Object Tracking Analysis},
    booktitle = {CVPR},
    year = {2016}
}




We provide one .tar[.gz] archive per type of data as described below. Here is a list of all the URLs for batch download, and here is a list of the MD5 checksums for each archive. You can extract an archive to a folder via the command ‘tar xvf filename.tar’ (replace xvf by xzvf for compressed .tar.gz files). Windows users can use the 7-zip software to extract these archives.
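For those who prefer to script the download check, the following is a minimal Python sketch that compares an archive's MD5 digest with the published checksum before extracting it. The function names `md5_of_file` and `verify_and_extract` are illustrative and not part of the dataset tooling; the expected checksum is assumed to come from the MD5 list mentioned above.

```python
import hashlib
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, expected_md5, out_dir="."):
    """Extract a .tar or .tar.gz archive only if its MD5 matches."""
    actual = md5_of_file(archive_path)
    if actual != expected_md5.strip().lower():
        raise ValueError("MD5 mismatch: %s != %s" % (actual, expected_md5))
    # mode "r:*" lets tarfile detect gzip compression transparently
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(out_dir)
```

Reading the file in chunks keeps memory usage low for the multi-gigabyte archives.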

In the following, "<version>" is the dataset version number (currently 1.3.1), and "<world>" is the name of a virtual world, which is the sequence number of the corresponding original “seed” real-world KITTI sequence (0001, 0002, 0006, 0018, 0020). The placeholder "<variation>" denotes one of the 10 different rendering variations in terms of imaging or weather conditions:

    clone: rendering as close as possible to original real-world KITTI conditions
    15-deg-right: horizontal rotation of the camera 15 degrees to the right
    15-deg-left: horizontal rotation of the camera 15 degrees to the left
    30-deg-right: horizontal rotation of the camera 30 degrees to the right
    30-deg-left: horizontal rotation of the camera 30 degrees to the left
    morning: typical lighting conditions after dawn on a sunny day
    sunset: lighting typical of slightly before sunset
    overcast: typical overcast weather (diffuse shadows, strong ambient lighting)
    fog: fog effect implemented via a volumetric formula
    rain: simple rain particle effect (ignoring the refraction of water drops on the camera)

Note that our indexes always start from 0.
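As an illustration of the naming scheme, the sketch below enumerates every (world, variation) pair in Python. The helper names are hypothetical, and the five-digit zero-padded frame name is an assumption borrowed from the '%05d.png' pattern given for the scene ground truth images.

```python
from itertools import product

WORLDS = ["0001", "0002", "0006", "0018", "0020"]
VARIATIONS = [
    "clone", "15-deg-right", "15-deg-left", "30-deg-right", "30-deg-left",
    "morning", "sunset", "overcast", "fog", "rain",
]


def list_videos():
    """Return all (world, variation) pairs, one per video."""
    return list(product(WORLDS, VARIATIONS))


def frame_name(index):
    """Zero-based frame index -> file name (assumed '%05d.png' pattern)."""
    return "%05d.png" % index


videos = list_videos()
print(len(videos))  # 5 worlds x 10 variations = 50 videos
```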




Rendered RGB frames: link  (14GB)


Each video is simply a folder in the format:





Object Detection (2D & 3D) and Multi-Object Tracking Ground Truth: link  (8.5MB)


New in version 1.3.1: we merged KITTI-like annotations and 'raw' ground truth in a single file that also contains new ground truth information (3D object pose in camera coordinates, 'moving' flag, car model name and color) and corrected the names/meanings of some columns (cf. below).

The MOT ground truth for each video consists of a CSV-like text file named:


These files are in a KITTI-like format that can be loaded with the following one-liner in Python using the popular pandas library (assuming ‘import pandas as pd’):

    motgt = pd.read_csv("<filename>", sep=" ", index_col=False)

Each line contains one object annotation with the following columns:

    frame: frame index in the video (starts from 0)
    tid: track identification number (unique for each object instance)
    label: KITTI-like name of the 'type' of the object (Car, Van, DontCare)
    truncated: (changed name in v1.3) KITTI-like truncation flag
                      (0: not truncated, 1: truncated, 2: heavily truncated, marked as “DontCare”)
    occluded: (changed name in v1.3) KITTI-like occlusion flag
                     (0: not occluded, 1: occluded, 2: heavily occluded, marked as “DontCare”)
    alpha: KITTI-like observation angle of the object in [-pi..pi]
    l, t, r, b: KITTI-like 2D 'bbox', respectively left, top, right, bottom bounding box in pixel coordinates
                 (inclusive, (0,0) origin is on the upper left corner of the image)
    w3d, h3d, l3d: KITTI-like 3D object 'dimensions', respectively width, height, length in meters
    x3d, y3d, z3d: KITTI-like 3D object 'location', respectively x, y, z in camera coordinates in meters
                             (center of bottom face of 3D bounding box)
    ry: KITTI-like 3D object 'rotation_y', rotation around Y-axis (yaw) in camera coordinates [-pi..pi]
         (KITTI convention is ry == 0 iff object is aligned with x-axis and pointing right)
    rx: rotation around X-axis (pitch) in camera coordinates [-pi..pi]
    rz: rotation around Z-axis (roll) in camera coordinates [-pi..pi]
    truncr: (changed in v1.3) object 2D truncation ratio in [0..1] (0: no truncation, 1: entirely truncated)
    occupr: object 2D occupancy ratio (fraction of non-occluded pixels) in [0..1]
                 (0: fully occluded, 1: fully visible, independent of truncation)
    orig_label: original KITTI-like name of the 'type' of the object ignoring the 'DontCare' rules
                     (gives the original type of DontCare objects)
    moving: 0/1 flag to indicate whether the object is really moving between this frame and the next one
    model: the name of the 3D model used to render the object (can be used for fine-grained recognition)
    color: the name of the color of the object
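A short sketch of how these columns might be used once loaded with pandas. The sample rows are made up and contain only a subset of the columns listed above; the assumption that the file starts with a header row naming the columns follows from the read_csv call, and the helper names are hypothetical.

```python
import io

import pandas as pd

# Made-up sample with only a subset of the documented columns.
SAMPLE = """frame tid label l t r b orig_label moving
0 0 Car 100 150 199 209 Car 1
0 1 DontCare 0 160 20 200 Van 0
1 0 Car 104 150 203 209 Car 1
"""


def load_motgt(source):
    """Load a space-separated MOT ground truth file (path or file object)."""
    return pd.read_csv(source, sep=" ", index_col=False)


def valid_boxes(motgt):
    """Drop DontCare rows and add box sizes (l, t, r, b are inclusive)."""
    out = motgt[motgt["label"] != "DontCare"].copy()
    out["width"] = out["r"] - out["l"] + 1
    out["height"] = out["b"] - out["t"] + 1
    return out


motgt = load_motgt(io.StringIO(SAMPLE))
boxes = valid_boxes(motgt)
# one DataFrame per track, ordered by frame
tracks = {tid: g.sort_values("frame") for tid, g in boxes.groupby("tid")}
```

Because the box coordinates are inclusive, the width is r - l + 1 rather than r - l.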

Remarks about 3D information

Internally, the 3D world is projected on the screen by using the Unity Engine rendering pipeline and various shaders. You can reproduce this by projecting points from the camera space (e.g., coordinates x3d,y3d,z3d) to the image pixels by using our camera intrinsic matrix (in pixels, constant, computed from our 1242x375 resolution and 29° fov):

        [[725,   0, 620.5],
    K =  [  0, 725, 187.0],
         [  0,   0,   1  ]]

In our system of 3D camera coordinates x is going to the right, y is going down, and z is going forward (the origin is the optical center of the camera).
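The projection described above can be sketched in Python with numpy as follows. This is a plain pinhole projection using the matrix K given above; the function name `project_to_pixels` is illustrative, and the sketch ignores any effect of the rendering pipeline beyond the intrinsic matrix.

```python
import numpy as np

K = np.array([[725.0, 0.0, 620.5],
              [0.0, 725.0, 187.0],
              [0.0, 0.0, 1.0]])


def project_to_pixels(points_cam, K=K):
    """Project (N, 3) camera-space points (meters) to (N, 2) pixels (u, v).

    Points must be in front of the camera (z > 0).
    """
    points_cam = np.asarray(points_cam, dtype=np.float64).reshape(-1, 3)
    if np.any(points_cam[:, 2] <= 0):
        raise ValueError("points must have z > 0")
    uvw = points_cam @ K.T           # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division


# A point on the optical axis maps to the principal point (620.5, 187.0).
print(project_to_pixels([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0]]))
```

Note that (x3d, y3d, z3d) is the center of the bottom face of the 3D bounding box, so projecting it gives the pixel under the object rather than its visual center.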




Camera pose (extrinsic parameters): link (1.1MB)

The 3D camera pose (rotation and translation) for each frame of a video is stored in one CSV-like text file per video named:


Each file can be loaded with the following one-liner in Python using the popular pandas library (assuming ‘import pandas as pd’):

    extgt = pd.read_csv("<filename>", sep=" ", index_col=False)

Each line consists of the frame index in the video (starts from 0) followed by the row-wise flattened 4x4 extrinsic matrix at that frame:

        [ r1,1  r1,2  r1,3  t1 ]
    M = [ r2,1  r2,2  r2,3  t2 ]
        [ r3,1  r3,2  r3,3  t3 ]
        [  0     0     0    1  ]

where ri,j are the coefficients of the camera rotation matrix R and ti are the coefficients of the camera translation vector t.

This matrix can be used to convert points from world space to camera space. For a point p = (x,y,z) in world space, written P = (x,y,z,1) in homogeneous coordinates, you get the coordinates in camera space by computing the matrix-vector product MP.

See section above for the camera intrinsic parameters and description of our camera coordinate system.
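The world-to-camera conversion can be sketched with numpy as below. The function names are hypothetical, and the 16 matrix values are assumed to have been extracted already from one row of the pose file (without the leading frame index). The inverse mapping additionally assumes that R is a pure rotation, so that its inverse is its transpose.

```python
import numpy as np


def extrinsic_from_row(values):
    """Reshape 16 row-wise flattened values into the 4x4 matrix M."""
    M = np.asarray(values, dtype=np.float64)
    if M.size != 16:
        raise ValueError("expected 16 values, got %d" % M.size)
    return M.reshape(4, 4)


def world_to_camera(M, points_world):
    """Map (N, 3) world points to (N, 3) camera points via M @ P."""
    pts = np.asarray(points_world, dtype=np.float64).reshape(-1, 3)
    P = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous (N, 4)
    return (P @ M.T)[:, :3]


def camera_to_world(M, points_cam):
    """Inverse mapping, assuming R is a rotation: p = R^T (p_cam - t)."""
    R, t = M[:3, :3], M[:3, 3]
    pts = np.asarray(points_cam, dtype=np.float64).reshape(-1, 3)
    return (pts - t) @ R
```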




Semantic and Instance-level Segmentation Ground Truth: link (488MB)


The per-pixel semantic and instance-level segmentation ground truth is encoded as per-frame .png files (standard 8-bit precision per channel) and a per-world per-variation text file giving the correspondence between RGB color codes and labels:


Each ‘vkitti_<version>_scenegt/<world>_<variation>_scenegt_rgb_encoding.txt’ file contains one line per category formatted like ‘<category>[:<tid>] <R> <G> <B>’, where ‘<category>’ is the name of the semantic category of that pixel (Building, Car, GuardRail, Misc, Pole, Road, Sky, Terrain, TrafficLight, TrafficSign, Tree, Truck, Van, Vegetation), ‘<tid>’ is the (optional) integer track identifier to differentiate between instances of the same category (e.g., ‘Car:0’, but Sky and Tree have no track ID), and ‘<R> <G> <B>’ is the color encoding of that label in the corresponding ‘vkitti_<version>_scenegt/<world>/<variation>/%05d.png’ ground truth images.
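A possible way to parse this encoding in Python is sketched below. The sample lines use made-up colors, the function names are hypothetical, and lines that do not match the ‘<category>[:<tid>] <R> <G> <B>’ pattern (for example a header line, if the file has one) are simply skipped.

```python
import numpy as np


def parse_rgb_encoding(lines):
    """Map (R, G, B) -> (category, tid or None) from encoding-file lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank or malformed line
        category, sep, tid = parts[0].partition(":")
        try:
            rgb = tuple(int(v) for v in parts[1:])
            tid = int(tid) if sep else None
        except ValueError:
            continue  # non-numeric fields, e.g., a header line
        mapping[rgb] = (category, tid)
    return mapping


def label_mask(rgb_image, rgb):
    """Boolean (h, w) mask of pixels equal to the given (R, G, B) color.

    The image must be in RGB channel order (cv2.imread returns BGR).
    """
    return np.all(np.asarray(rgb_image) == np.asarray(rgb), axis=-1)


sample = ["Car:0 200 200 200", "Sky 90 200 255", "Tree 0 199 0"]  # made-up colors
enc = parse_rgb_encoding(sample)
```

With such a mapping, the mask of one car instance in a frame is label_mask(image, color), where color is the key whose value is ('Car', tid).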




Optical Flow Ground Truth: link (4.8GB)

[Figure: Sintel color-wheel visualization of the Virtual KITTI flow ground truth]

The optical flow ground truth from the current frame to the next frame for each video is a folder in the format:


where each 3-channel PNG16 flow image (index starting from 00000) corresponds to the normalized, quantized, and masked (cf. below) flow from the current frame to the next frame (flow at t = frame t -> frame t+1). The flow values in pixels can be decoded from the RGB values of each pixel (16 bits per channel) as:

    R = flow along x-axis normalized by image width and quantized to [0;2^16 - 1]
    G = flow along y-axis normalized by image height and quantized to [0;2^16 - 1]
    B = 0 for invalid flow (e.g., sky pixels)

Some example decoding code in Python using OpenCV and numpy:

import numpy as np
import cv2

def read_vkitti_png_flow(flow_fn):
    """Convert from .png to (h, w, 2) (flow_x, flow_y) float32 array."""
    # read png to bgr in 16 bit unsigned short
    bgr = cv2.imread(flow_fn, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    if bgr is None:  # cv2.imread returns None instead of raising
        raise IOError("could not read flow image: %s" % flow_fn)
    h, w, _c = bgr.shape
    assert bgr.dtype == np.uint16 and _c == 3
    # b == invalid flow flag == 0 for sky or other invalid flow
    invalid = bgr[..., 0] == 0
    # g,r == flow_y,x normalized by height,width and scaled to [0;2**16 - 1]
    out_flow = 2.0 / (2**16 - 1.0) * bgr[..., 2:0:-1].astype('f4') - 1
    out_flow[..., 0] *= w - 1
    out_flow[..., 1] *= h - 1
    out_flow[invalid] = 0  # or another value (e.g., np.nan)
    return out_flow

Note that our ground truth normalized/quantized/masked flow .png files often look like a purple haze, in contrast to the visualization at the beginning of this section, which uses the common Sintel color wheel to display the flow in pixels. This is normal, and is a consequence of our 16-bit encoding, designed to keep high precision, including for large displacements, in a standard lossless compressed format (PNG16).




Depth Ground Truth: link (5.1GB)


The depth ground truth for each video is stored in 16-bit grayscale PNG images as:


Depth values are distances to the camera plane obtained from the z-buffer. They correspond to the z coordinate of each pixel in camera coordinate space (not the distance to the camera optical center). We use a fixed far plane of 655.35 meters, i.e., points at infinity such as sky pixels are clipped to a depth of 655.35 m. This allows us to truncate and normalize the z values to the [0;2^16 - 1] integer range such that a pixel intensity of 1 in our single-channel PNG16 depth images corresponds to a distance of 1 cm to the camera plane. The depth map in centimeters can be loaded directly in Python with numpy and OpenCV via the one-liner (assuming “import cv2”):

    depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
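Once loaded, the depth map can be converted to meters and, since it stores the z coordinate, back-projected to 3D camera coordinates with the intrinsic parameters given earlier. The following numpy sketch is one way to do this; the function names are illustrative, and it assumes that integer pixel indices (u, v) are the coordinates to which the principal point (620.5, 187.0) refers.

```python
import numpy as np

FX, FY, CX, CY = 725.0, 725.0, 620.5, 187.0  # from the intrinsic matrix K
FAR_CM = 2**16 - 1  # 65535 cm = 655.35 m, the far plane


def depth_cm_to_meters(depth_cm):
    """Convert a uint16 depth map in centimeters to float32 meters."""
    return np.asarray(depth_cm).astype(np.float32) / 100.0


def backproject(depth_cm):
    """Return (h, w, 3) camera-space points (x, y, z) in meters and a
    boolean mask of pixels closer than the far plane."""
    z = depth_cm_to_meters(depth_cm)
    h, w = z.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    valid = np.asarray(depth_cm) < FAR_CM
    return np.dstack([x, y, z]), valid
```

Pixels at the far plane (for example sky) are flagged as invalid in the returned mask so that they can be excluded from metrics or point clouds.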



Additional Information



Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called “Virtual KITTI”, automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show that these factors may drastically affect otherwise high-performing deep models for tracking.


Related Press Coverage

MIT Tech Review: “To Get Truly Smart, AI Might Need to Play More Video Games”

Wired: “Making AI Play Lots of Videogames Could Be Huge (No, Seriously)”

Forbes: “How Deep Learning Networks Can Use Virtual Worlds To Solve Real World Problems”


Related links

First international workshop on "Virtual/Augmented Reality for Visual Artificial Intelligence (VARVAI)", in conjunction with ECCV 2016



For questions, please contact Adrien Gaidon  and Yohann Cabon (first.lastname at