Overview of the proposed approach. On the left, paired RGB images and haptic maps from representative materials (e.g., rock, grass, leather) are processed by modality-specific encoders. On the right, the resulting embeddings are projected into a latent space and analyzed with respect to human perceptual dimensions such as rough/smooth and hard/soft. Our work investigates how visual and tactile modalities contribute to shaping a latent representation that aligns with human material perception.
Extended Reality (XR) systems are increasingly incorporating multi-sensory stimuli to enhance realism and user immersion. Among these, the integration of tactile feedback plays a crucial role. Yet, the pipeline for acquiring, processing, and rendering haptic information—especially in synchrony with visual stimuli—remains largely unstandardized. A common strategy for capturing tactile data involves encoding it as haptic maps, essentially image-based representations of touch. However, the effectiveness of both visual and tactile modalities in modeling perceptual haptic properties is not yet fully understood.
In this study, we analyze the representational power of haptic maps and RGB images from the Touch and Go dataset using latent space analysis. Specifically, we investigate whether a neural network can structure the latent space in a way that reflects human perceptual attributes such as roughness, hardness, and colorfulness.
Our findings contribute to understanding whether haptic maps can serve as reliable proxies for tactile data and align with how humans perceive material properties, marking a step forward toward perceptually grounded haptic representations in XR environments.
We employ a Supervised Contrastive Learning (SCL) framework to learn a latent space that captures human perceptual attributes from paired RGB images and haptic maps. The architecture consists of two parallel encoders: one for processing RGB images and another for haptic maps. Each encoder extracts modality-specific features, which are then projected into a shared latent space. The SCL loss function encourages the model to cluster embeddings of the same material class while pushing apart those of different classes. This approach allows us to investigate how visual and tactile modalities contribute to shaping a latent representation that aligns with human material perception.
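The dual-encoder layout described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the backbone architecture, embedding dimension, input resolution, and the assumption that haptic maps are single-channel are all ours, chosen only to show how two modality-specific encoders project into one shared latent space.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Small CNN encoder; a stand-in for the modality-specific backbones
    (the actual architecture used in the paper may differ)."""
    def __init__(self, in_channels, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # projection head mapping backbone features into the shared latent space
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        z = self.proj(self.features(x))
        # unit-normalise so cosine similarity is a dot product in the latent space
        return nn.functional.normalize(z, dim=1)

# one encoder per modality, projecting into the same 128-d space
rgb_encoder = ModalityEncoder(in_channels=3)
haptic_encoder = ModalityEncoder(in_channels=1)  # assuming single-channel haptic maps

rgb = torch.randn(4, 3, 64, 64)      # batch of paired RGB images...
haptic = torch.randn(4, 1, 64, 64)   # ...and their haptic maps
z_rgb, z_hap = rgb_encoder(rgb), haptic_encoder(haptic)
print(z_rgb.shape, z_hap.shape)
```

Because both encoders end in the same projection dimensionality and normalisation, embeddings from either modality can be compared directly with a cosine similarity, which is what the contrastive loss operates on.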
The Supervised Contrastive Learning (SCL) loss function is designed to enhance the discriminative power of the learned embeddings in the latent space. For each anchor sample, the loss encourages the model to pull together embeddings of samples from the same material class (positives) while pushing apart embeddings from different classes (negatives). This is achieved by computing the cosine similarity between the anchor and all other samples in the batch, applying a temperature scaling factor to control the concentration of the distribution. The SCL loss effectively leverages label information to structure the latent space in a way that reflects human perceptual attributes such as roughness, hardness, and colorfulness.
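A NumPy sketch of this loss is given below, following the standard supervised contrastive formulation (Khosla et al., 2020): for each anchor, the log-probability of its positives under a temperature-scaled softmax over all other samples in the batch. The temperature value and the toy batch are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss on a batch of embeddings.
    Assumes every class in the batch has at least two samples."""
    # L2-normalise so the dot product below is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature              # temperature-scaled similarities
    n = len(labels)
    self_mask = ~np.eye(n, dtype=bool)       # exclude each anchor from its own denominator
    # numerically stable log-softmax over all non-anchor samples
    sim_max = np.max(np.where(self_mask, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * self_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    # positives: same-class samples, anchor excluded
    pos_mask = (labels[:, None] == labels[None, :]) & self_mask
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return -mean_log_prob_pos.mean()

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
loss_random = supcon_loss(rng.normal(size=(8, 16)), labels)
# perfectly clustered batch: two orthogonal directions, one per class
clustered = np.eye(16)[labels * 2]
loss_clustered = supcon_loss(clustered, labels)
print(loss_random, loss_clustered)  # clustered embeddings yield the lower loss
```

As expected, the loss drops when same-class embeddings collapse onto a common direction and different classes are orthogonal, which is exactly the pull-together / push-apart behaviour described above.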
t-SNE visualization comparing supervised and self-supervised learning approaches. The plot illustrates how the two methods differ in their ability to cluster similar samples in the latent space.
To assess whether the latent space representations align with human perceptual organization, as analyzed in [2, 9], we conduct three experiments by re-coloring the t-SNE projections according to perceptual classifications derived from these studies. In each visualization, darker colors indicate material categories explicitly included in the original studies, while lighter tones represent materials that were not directly evaluated but were manually assigned to the closest perceptual class based on our interpretation.
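The re-colouring analysis can be sketched as follows, with toy embeddings standing in for the real latent vectors; the cluster structure, dimensions, and perplexity are illustrative assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# toy stand-in for learned embeddings: 3 "material" clusters in a 32-d latent space
labels = np.repeat([0, 1, 2], 20)
centers = 4.0 * rng.normal(size=(3, 32))
emb = centers[labels] + rng.normal(size=(60, 32))

# project to 2-D; the resulting points can then be re-coloured by any
# perceptual grouping (e.g. rough/smooth, hard/soft) instead of material class
proj = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(proj.shape)
```

Re-colouring amounts to plotting the same `proj` coordinates while indexing the colour map with a perceptual label array in place of `labels`, so the geometry of the projection stays fixed across the three experiments.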
@inproceedings{stefani2025understanding,
title={Understanding Touch Through Latent Spaces: Can Images and Haptic Maps Reflect Human Perception?},
author={Stefani, Antonio Luigi and Baldoni, Sara and Bisagno, Niccol{\`o} and Battisti, Federica and Conci, Nicola and De Natale, Francesco},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
month={October},
year={2025}
}
@article{stefani2025signal,
title={Signal processing for haptic surface modeling: A review},
author={Stefani, Antonio Luigi and Bisagno, Niccol{\`o} and Rosani, Andrea and Conci, Nicola and De Natale, Francesco},
journal={Signal Processing: Image Communication},
pages={117338},
year={2025},
publisher={Elsevier}
}
In this work, we present a comprehensive overview of the state-of-the-art in haptic surface modeling, with a particular focus on the signal processing techniques employed in the field. We categorize existing approaches, describe the most important datasets in detail, and provide a thorough overview of the key tasks involved in haptic surface processing.
See the full paper for more information.
We acknowledge the support of the MUR PNRR project iNEST - Interconnected Nord-Est Innovation Ecosystem (ECS00000043) funded by the European Union under NextGenerationEU. In addition, this work was partially supported by the European Union under the Italian National Recovery and Resilience Plan (NRRP) Mission 4, Component 2, Investment 1.3, CUP C93C22005250001, partnership on “Telecommunications of the Future” (PE00000001 - program “RESTART”).