The authors argue that the amount and quality of training data often matter more for a system's performance than architecture and training details. Collecting, processing, and annotating real data at scale is difficult and expensive, and it raises additional concerns such as privacy, fairness, and legal issues. Synthetic data is a powerful tool with the potential to overcome these shortcomings. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts.
To address these problems, the authors introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes with rich annotations. It scales seamlessly to large jobs distributed over thousands of machines and generating terabytes of data. The authors demonstrate Kubric's effectiveness by presenting a series of 13 generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
The authors have released Kubric, the assets used, all of the generation code, and the rendered datasets for reuse and modification.