Our understanding of the world relies on high-level representations of it. For example, seeing the word "smile" makes us think of the action, or of a person with the expression we call a smile; at the same time, we recognize a picture of a person smiling, or a drawing of a smiling face, as the same concept. That is, the representation of a smile is shaped by our experiences and our way of perceiving the world and, more importantly, it describes heterogeneous sources of information in a common way.

Computers need to perform a similar task, in which the raw information (that is, letters, pixels in an image, or signals from the world) is converted into a representation that the computer can store and process. Commonly, that representation is a set of numbers, a so-called descriptor. However, not just any set of numbers can represent a particular piece of information, let alone the knowledge related to it. Thus, it is important to find good representations.

In past decades, this task was performed by feature designers (me included) who created algorithms to extract and represent information. Nowadays, representations are found automatically by algorithms that learn the best way to represent the information (according to a measure of "goodness"). In computer vision, we are interested in visual sources of information, like images and videos. These sources encode spatial information as 2D representations (like pictures) and temporal information as sequences (like videos, which are sequences of pictures). Some tasks naturally depend on the order of events. Imagine a person standing up and a person sitting down. These movements do not differ in their components, but in the order in which those components occur. For this type of information, time is a key discriminator, and an important ingredient of our descriptor of the world, since the arrow of time influences the description of events.
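The standing-up versus sitting-down example can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than a method from my work: each "frame" is reduced to a single hypothetical number (the height of the person's hips), and the two hand-made descriptors, `bag_descriptor` and `temporal_descriptor`, exist only to contrast a representation that ignores order with one that keeps it.

```python
from statistics import mean

def bag_descriptor(frames):
    """Order-insensitive descriptor: the sorted per-frame values."""
    return sorted(frames)

def temporal_descriptor(frames):
    """Order-aware descriptor: the average frame-to-frame change."""
    diffs = [b - a for a, b in zip(frames, frames[1:])]
    return mean(diffs)

# Hypothetical per-frame hip height (0 = seated, 1 = standing).
standing_up = [0.0, 0.25, 0.5, 0.75, 1.0]
# Sitting down is the same set of frames played in reverse.
sitting_down = list(reversed(standing_up))

# Both movements contain the same components, so the bag matches.
same_bag = bag_descriptor(standing_up) == bag_descriptor(sitting_down)

# The order-aware descriptor has opposite signs for the two movements.
up = temporal_descriptor(standing_up)
down = temporal_descriptor(sitting_down)

print(same_bag, up, down)
```

The order-insensitive descriptor cannot tell the two movements apart, while the order-aware one separates them by sign. Learned video representations are far richer than this, but they face the same requirement of preserving temporal order.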

In general, I am interested in describing the temporal information that comes from videos. In particular, I am interested in creating algorithms that learn representations from videos and use them to make decisions. Specifically, I explore video understanding through several applications, such as classification (facial expression, activity, and dynamic event recognition), semantic segmentation, and video generation.


Semantic Segmentation

Can machines detect objects in scenes as humans do? Can they tell apart several copies of the same object?