Building a scene-specific synthetic data generator with Omniverse Replicator

Bibliographic details
Main authors: Kokko, Aaro; Kuhno, Jani
Other contributors: Faculty of Information Technology (Informaatioteknologian tiedekunta), University of Jyväskylä (Jyväskylän yliopisto)
Material type: Master's thesis (pro gradu)
Language: English
Published: 2024
Links: https://jyx.jyu.fi/handle/123456789/95348
Description
Summary: In today’s world of AI, the amount of training data is a critical factor in the success of model training. Especially where data acquisition is difficult, because events occur rarely or annotation is costly, synthetic data can be used to supplement data needs. In computer vision, some tasks require pixel-wise annotation, which is labor-intensive and error-prone when done by hand. In this study, we use the eDSR methodology to design and evaluate a synthetic data generator that serves as a reference for those who want to start generating synthetic visual data from scratch. A generator combining an Omniverse Replicator Python script and 3D assets is developed, and the quality of its synthetic output is measured by training three different neural networks to predict segmentation masks from a real-world scene. In addition to the generator, a model of a scene-specific synthetic data generation pipeline is presented, complementing the reference generator as a source of knowledge for newcomers to the field. Two major processes in building a synthetic data generator are observed: domain gap bridging and domain randomization. Domain gap bridging aims to increase the visual similarity between the synthetic scene and the real world, while domain randomization aims to widen the distribution of the generated data. Because the main benefit of synthetic data is its minimal annotation cost, optimization of generation speed should be integrated into the development process. The Python code developed is available at: https://github.com/jkuhno/reference-SDGenerator
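
As a rough illustration of the domain randomization idea mentioned in the summary, the sketch below samples a fresh set of scene parameters for every generated frame. It is not taken from the thesis or its repository and does not use the Omniverse Replicator API; the names SceneParams, sample_scene_params and generate_frames, the parameter choices, and all value ranges are hypothetical assumptions. In a real generator, each sampled set would be applied to the lights, camera, and materials before the frame is rendered.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical per-frame scene settings (assumed, not from the thesis code)."""
    light_intensity: float   # arbitrary units
    camera_yaw_deg: float    # rotation around the vertical axis, in degrees
    camera_height_m: float   # camera height above the floor, in metres
    texture_id: int          # index into a hypothetical pool of textures


def sample_scene_params(rng: random.Random, num_textures: int = 8) -> SceneParams:
    """Draw one random scene configuration; all ranges are illustrative assumptions."""
    return SceneParams(
        light_intensity=rng.uniform(500.0, 5000.0),
        camera_yaw_deg=rng.uniform(-30.0, 30.0),
        camera_height_m=rng.uniform(1.0, 2.5),
        texture_id=rng.randrange(num_textures),
    )


def generate_frames(num_frames: int, seed: int = 0) -> list[SceneParams]:
    """Sample one configuration per frame; a fixed seed makes the run reproducible."""
    rng = random.Random(seed)
    return [sample_scene_params(rng) for _ in range(num_frames)]


if __name__ == "__main__":
    for index, params in enumerate(generate_frames(3)):
        print(index, params)
```

Widening the sampling ranges increases the variety of the generated data, while narrowing them toward values measured in the real scene moves in the direction of domain gap bridging; how the two are balanced is a design decision for the specific scene.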