Automatic media generation (or synthesis) has advanced rapidly in recent years with the advent of deep generative models. Neural networks can now create text, images, and videos conditioned on class labels or other media. The usual goal is to generate content, but the feature representations learned on these tasks can also serve as a source of interpretability: which features matter for creating different content, and how can we interpret what the models are learning or attending to?
A central problem is how to learn efficient and rich representations for video data through deep generative tasks. I focus on two particular problems for learning effective representations:
- The first is semantic transfer between data modalities, in particular video and (written) language.
- The second is disentanglement within a single domain, that is, separating the different factors of variation in the data.
Separating semantics, both within and across domains, lets us better understand the kinds of features that different architectures learn on these tasks. My objective is to train deep generative models on different video reconstruction tasks and study their learning capabilities. I study several problems related to dividing the data into its factors of variation, and then how to use those factors.
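To make the disentanglement idea concrete, here is a toy sketch of what a partitioned latent space enables. Everything here is illustrative and assumed for the example (the dimension split, the function names, and the content/motion labels); a real model would learn these mappings with deep networks rather than manipulate raw vectors.

```python
# Toy sketch: a latent vector split into "content" (static) and
# "motion" (dynamic) factors. Disentanglement means these partitions
# can be recombined independently, e.g. to transfer one video's
# content onto another video's motion. Purely illustrative.

def split_latent(z, content_dim):
    """Partition a latent vector into content and motion factors."""
    return z[:content_dim], z[content_dim:]

def swap_content(z_a, z_b, content_dim):
    """Build a new latent with A's content and B's motion --
    the kind of factor recombination disentanglement allows."""
    content_a, _ = split_latent(z_a, content_dim)
    _, motion_b = split_latent(z_b, content_dim)
    return content_a + motion_b

z_a = [1.0, 2.0, 3.0, 4.0]   # hypothetical latent for video A
z_b = [5.0, 6.0, 7.0, 8.0]   # hypothetical latent for video B
z_mix = swap_content(z_a, z_b, content_dim=2)
print(z_mix)  # content of A, motion of B: [1.0, 2.0, 7.0, 8.0]
```

If the factors are truly disentangled, decoding `z_mix` would yield A's appearance moving like B; entangled representations would instead produce incoherent output under such swaps.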
- Learning Representations through Deep Generative Models on Video. São Paulo Research Foundation (FAPESP). 2020.