Building a scene-specific synthetic data generator with Omniverse Replicator

Bibliographic Details
Main Authors: Kokko, Aaro; Kuhno, Jani
Other Authors: Faculty of Information Technology, Informaatioteknologian tiedekunta, Jyväskylän yliopisto, University of Jyväskylä
Format: Master's thesis
Language: English
Published: 2024
Subjects:
Online Access: https://jyx.jyu.fi/handle/123456789/95348
Description
Summary: In today’s world of AI, the amount of training data is a critical factor in the success of model training. Where data acquisition is difficult, because events occur rarely or annotation is costly, synthetic data can supplement real data. Some computer vision tasks require pixel-wise annotation, which is labor-intensive and error-prone when done by hand. In this study, we use the eDSR methodology to design and evaluate a synthetic data generator intended as a reference for those who want to start synthetic visual data generation from scratch. A generator combining an Omniverse Replicator Python script with 3D assets is developed, and the quality of its synthetic output is measured by training three different neural networks to predict segmentation masks from a real-world scene. In addition to the generator, a model of a scene-specific synthetic data generation pipeline is presented to complement the reference generator as a source of knowledge for newcomers to the field. Two major processes in building a synthetic data generator are observed: domain gap bridging and domain randomization. Domain gap bridging aims to increase the visual similarity between the synthetic scene and the real world, while domain randomization aims to widen the distribution of the generated data. Because the main benefit of synthetic data is minimal annotation cost, optimization of generation speed should be integrated into the development process. The Python code developed is available at: https://github.com/jkuhno/reference-SDGenerator
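As an illustration of the domain randomization idea named in the summary, the following is a minimal sketch in plain Python. It is not the thesis code and does not use the Omniverse Replicator API. The class SceneParams, the functions sample_scene and generate, the parameter names, and all value ranges are hypothetical and chosen only to show the principle: each frame draws its scene settings from distributions, so the generated data covers a wider range of conditions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical per-frame scene settings (assumption, not from the thesis)."""
    light_intensity: float
    camera_yaw_deg: float
    object_xy: tuple[float, float]
    texture: str


# Illustrative texture names only; a real generator would reference 3D assets.
TEXTURES = ("concrete", "wood", "metal")


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one randomized scene configuration from assumed ranges."""
    return SceneParams(
        light_intensity=rng.uniform(500.0, 1500.0),
        camera_yaw_deg=rng.uniform(-30.0, 30.0),
        object_xy=(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)),
        texture=rng.choice(TEXTURES),
    )


def generate(num_frames: int, seed: int = 0) -> list[SceneParams]:
    """Sample settings for num_frames frames; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(num_frames)]


if __name__ == "__main__":
    for params in generate(3):
        print(params)
```

In a real generator, each sampled configuration would be applied to the 3D scene before rendering a frame together with its segmentation mask. Seeding the random number generator is a design choice that makes a generated dataset reproducible.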