The problem Patang solves
The idea is to translate physical modalities into textual and visual art. This could be used to model synesthesia and could have applications in the metaverse space.
We aim to publish a paper in which we train the models powering the generation and collect data from a variety of motions, such as classical dance, a kite's flight trajectory, an auto-rickshaw moving through a city, and a dog's or cat's day-to-day movement.
Other ideas to explore:
- Pretend the vectors from the sensors are CLIP embeddings for Stable Diffusion, rather than tokenizing them with an LLM's tokenizer (see the first sketch after this list). This could get rid of the weird, uncommon words in the tokenized text prompts, because the embeddings would be more likely to correspond to real-world images.
- Use the vectors from the sensors to construct a tensor that we treat as a latent, then run this "latent" through Stable Diffusion's decoder to get an image (see the second sketch after this list).
- Create a video of a walk through the latent spaces of our generations
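As a rough sketch of the first idea: diffusers' StableDiffusionPipeline accepts precomputed prompt_embeds, so a sensor vector can be tiled into the (1, 77, 768) shape of CLIP text embeddings and passed in place of an encoded prompt. The checkpoint name, the tiling scheme, and the example sensor readings below are illustrative assumptions, not our trained setup.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# Checkpoint is a placeholder; any SD 1.x model with a 768-dim text encoder behaves the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def sensors_to_prompt_embeds(sensor_values):
    # Tile the raw sensor readings into the (1, 77, 768) tensor the pipeline
    # expects for CLIP text embeddings, standing in for an encoded prompt.
    flat = torch.tensor(sensor_values, dtype=torch.float32)
    needed = 77 * 768
    tiled = flat.repeat(needed // flat.numel() + 1)[:needed]
    return tiled.reshape(1, 77, 768).to(device)

# Hypothetical readings: accelerometer magnitude, gyro rate, temperature, pressure.
embeds = sensors_to_prompt_embeds([0.12, 9.81, 27.5, 1013.2])
image = pipe(prompt_embeds=embeds, num_inference_steps=25).images[0]
image.save("patang_frame.png")
```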
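For the second idea, the same sensor vector can be tiled into the 4x64x64 latent grid that Stable Diffusion's VAE decoder expects and decoded straight to pixels, skipping denoising entirely. Again, the standalone VAE checkpoint and the tiling are assumptions for illustration.

```python
import torch
from diffusers import AutoencoderKL

# Standalone SD VAE; the decoder bundled with any SD 1.x pipeline would also work.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

def sensors_to_frame(sensor_values):
    # Tile the readings into a 1x4x64x64 "latent", decode it, and rescale
    # the output from [-1, 1] to [0, 1] for display or saving.
    flat = torch.tensor(sensor_values, dtype=torch.float32)
    needed = 4 * 64 * 64
    latent = flat.repeat(needed // flat.numel() + 1)[:needed].reshape(1, 4, 64, 64)
    with torch.no_grad():
        decoded = vae.decode(latent / 0.18215).sample  # 0.18215 is SD's latent scaling factor
    return (decoded.clamp(-1, 1) + 1) / 2

frame = sensors_to_frame([0.12, 9.81, 27.5, 1013.2])  # shape (1, 3, 512, 512)
```

Interpolating between latents built from consecutive sensor windows would also give us the latent-space walk video from the last bullet.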
Challenges we ran into
- Creating a representation of motion and ambient sensor data that can condition diffusion models was quite a challenge.
We worked around it by rescaling the sensor values into tensors of token IDs that fall within the Flan-T5 tokenizer's vocabulary range, passing these to the model as input, and generating from them. This gives us a text output that can be passed to any text-to-X model (see the sketch below).
However, this is quite limiting; we would benefit greatly from training our own tokenizer and from a more direct way to feed the output into visual models.
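A minimal sketch of that workaround, assuming google/flan-t5-base and a simple linear rescaling of raw readings into the tokenizer's ID range (both are placeholder choices, not our exact pipeline):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def sensors_to_text(sensor_values, lo=-20.0, hi=1100.0):
    # Linearly rescale each raw reading into [0, vocab_size) and treat the
    # resulting integers as token IDs, i.e. as an already-tokenized input.
    vocab_size = tokenizer.vocab_size
    ids = [
        max(0, min(int((v - lo) / (hi - lo) * (vocab_size - 1)), vocab_size - 1))
        for v in sensor_values
    ]
    input_ids = torch.tensor([ids])
    output_ids = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical readings: accel magnitude, gyro rate, temperature, pressure.
prompt = sensors_to_text([0.12, 9.81, 27.5, 1013.2])  # text prompt for any text-to-X model
```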
- We also had a hard time figuring out how to use motion to condition diffusion denoising.
Again, this could be solved by training a CLIP-style model to produce embeddings from the motion, temperature, pulse-rate, and atmospheric-pressure channels (see the sketch below).
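A sketch of what that could look like: a small encoder maps a window of the four sensor channels into CLIP's embedding dimension and is trained against paired image embeddings with the symmetric contrastive loss CLIP uses. The architecture, window length, and embedding size below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    """Maps a window of sensor channels (motion, temperature, pulse rate,
    atmospheric pressure) into the same embedding space as CLIP."""
    def __init__(self, n_channels=4, window=128, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_channels * window, 1024),
            nn.GELU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x):  # x: (batch, n_channels, window)
        return F.normalize(self.net(x), dim=-1)

def clip_contrastive_loss(motion_emb, image_emb, temperature=0.07):
    # Symmetric InfoNCE, as in CLIP: matching (motion, image) pairs sit on
    # the diagonal of the similarity matrix and are treated as positives.
    logits = motion_emb @ image_emb.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# One training step on a dummy batch of 8 paired (sensor window, image embedding) samples.
encoder = MotionEncoder()
motion = torch.randn(8, 4, 128)
image_emb = F.normalize(torch.randn(8, 512), dim=-1)  # would come from a frozen CLIP image encoder
loss = clip_contrastive_loss(encoder(motion), image_emb)
loss.backward()
```

The resulting motion embeddings could then be used wherever the pipeline currently consumes CLIP text embeddings.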