Publications

Perceiving Systems Conference Paper Predicting 4D Hand Trajectory from Monocular Videos Ye, Y., Feng, Y., Taheri, O., Feng, H., Black, M. J., Tulsiani, S. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
We present HAPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation.
project arXiv code BibTeX

Perceiving Systems Conference Paper Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis Danecek, R., Schmitt, C., Polikovsky, S., Black, M. J. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
project arXiv BibTeX

Perceiving Systems Conference Paper NeuralFur: Animal Fur Reconstruction from Multi-view Images Sklyarova, V., Kabadayi, B., Yiannakidis, A., Becherini, G., Black, M. J., Thies, J. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. In contrast to human hairstyle reconstruction, there are also no datasets that could be leveraged to learn a fur prior for different animals. In this work, we present the first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given calibrated multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a visual question answering (VQA) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal’s furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VQA system to guide the strands' growth direction and their relation to the gravity vector, which we incorporate as a loss. With this new scheme of using a VQA model to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types.
project arXiv code BibTeX

Perceiving Systems Article Textile suit for anywhere full-body motion capture Sun, H., Feng, Y., Kao, P., Black, M. J., Kramer-Bottiglio, R. Science Advances, 12(10):1-15, March 2026 (Published)
Wearable technology has shown notable promise for tracking human motion, offering valuable insights for fields ranging from biomechanics to healthcare. Traditional motion capture systems, however, are often bulky and disruptive, making them impractical for daily use. Advances in textile-based sensing offer a promising alternative, enabling seamless integration of air- and sweat-permeable sensors into everyday clothing. Here, a sensorized textile suit designed for unobtrusive full-body motion capture is presented. The suit is capable of accurately tracking complex movements without interfering with routine activities. This wearable, using an individual-customized network of fabric-based sensors, autonomously identifies and monitors movement angles and patterns, providing insights into physical range, activity frequency, and exertion levels. Language models are shown to interpret motion data into descriptive language, enhancing its potential for real-world applications. This sensorized textile suit and corresponding algorithms represent a step forward in accessible, continuous movement monitoring in the form of everyday clothing, opening avenues for studying human behavior and health in natural environments. YSuit is a fully textile suit for accurate, customizable, washable, and comfortable anywhere full-body motion capture.
pdf publisher site data DOI BibTeX

Haptic Intelligence Article Comparing Placement and Polarity Configurations of a Two-Magnet Fingertip Vibrotactile Device Gertler, I., Ballardini, G., Tangolar, D., Serhat, G., Kuchenbecker, K. J. Scientific Reports, March 2026 (Published)
Vibrotactile feedback enriches the use of wearable technologies for entertainment, navigation, and healthcare. The actuators of these portable systems, particularly fingertip devices, need to be compact, comfortable, and easy to integrate. Multiple vibrating elements could enhance perceptual realism, but how should they be arranged and oriented on the fingerpad? Here, we evaluate a simple approach that uses an audio input signal to drive an air coil that vibrates two magnets embedded in a soft fingertip sheath; the magnets are arranged in the radial-ulnar or proximal-distal direction with either the same or opposite polarity. We explore the effects of these new device configurations on both dynamic response and haptic perception. Experimental results indicate that the vibrations were perceived well across frequencies, with stronger sensations between 180 and 360 Hz, which aligns with the high vibration magnitudes our computational simulation predicts in this frequency range. Interestingly, perceptual responses showed that participants mainly classified vibrations based on the excitation frequency rather than the polarity of the magnets. Participants also rated vibrotactile feedback derived from recorded sounds and replayed for different interactions. Their evaluations offer promising evidence that this actuation approach could be used in extended-reality applications to improve transient user interactions with virtual objects.
DOI BibTeX

Haptic Intelligence Conference Paper Designing a Psychotherapy Support Robot for Young Children Diagnosed with Obsessive-Compulsive Disorder Mohan, M., L’Orsa, R., Grüninger, F., Stollhof, B., Klein, C. S., Dinauer, R., Burns, R. B., Renner, T. J., Hollmann, K., Kuchenbecker, K. J. In Companion Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), 1-6, Late-Breaking Report (LBR, 6 pages), Edinburgh, UK, March 2026, Mayumi Mohan and Rachael L'Orsa contributed equally to this publication (Published)
The gold-standard treatment for children diagnosed with obsessive-compulsive disorder (OCD) is therapist-guided cognitive behavioral therapy (CBT), which includes exposure and response prevention (ERP) sessions that teach children to overcome compulsive responses when exposed to their anxiety-inducing triggers. CBT requires children to report frequent self-assessments of tension during both therapist-supported and therapist-free self-management ERP sessions. Videoconferencing-delivered CBT (vCBT) enables a psychotherapist to treat a child remotely in their home, where OCD symptoms often arise, but these remote therapeutic interactions lack physical presence and can be challenging to run. We propose using a robot as an input/output device during vCBT for young children diagnosed with OCD, and we introduce a stationary table-top koala robot for this application. We further describe the first of three planned participatory design phases: a co-design study comprising two sessions where child and adolescent psychotherapists role-played vCBT ERP exercises with this robot to help define its role.
DOI BibTeX

Haptic Intelligence Ph.D. Thesis Haptify: A Measurement-Based System for Quantifying the Quality of Haptic Interfaces Fazlollahi, F. University of Tübingen, Tübingen, Germany, March 2026, Department of Computer Science (Published)
Grounded force-feedback (GFF) devices, exoskeletons, and other haptic robots modulate human movement through carefully engineered mechanical, electrical, and computational designs. Given their significant societal potential and often high cost, it is essential to fairly and efficiently assess the quality of these intimate cyber-physical interfaces. However, existing device specifications and low-level performance metrics often fail to capture the nuanced qualities that expert users perceive during hands-on experimentation. To address this gap, this thesis introduces Haptify, a comprehensive benchmarking system that can thoroughly, fairly, and noninvasively evaluate GFF haptic devices. Haptify integrates multiple sensing modalities - a seven-camera optical motion-capture system, a custom-built 60-cm-square force plate, and an instrumented end-effector that can be adapted to different devices - to record the interaction between the human hand, the device, and the ground during both passive and active experiments. With this setup, users hold the device end-effector and move it through a series of carefully designed tasks while Haptify measures kinematic and kinetic responses. From this process, we establish six key ways to assess GFF device performance: workspace shape, global free-space forces, global free-space vibrations, local dynamic forces and torques, frictionless surface rendering, and stiffness rendering. These benchmarks enable systematic evaluation and comparison across devices. We first apply Haptify to benchmark two GFF devices produced by 3D Systems: the widely used Touch and the more expensive Touch X. Results reveal that the Touch X offers a slightly smaller workspace than the Touch, but it produces smaller and more predictable free-space forces, reduced vibrations, more consistent dynamic forces and torques, and higher-quality rendering of both frictionless surfaces and stiff virtual objects. 
To further validate and extend our approach, we conducted a user study with sixteen expert hapticians who used Haptify to evaluate four commercial GFF devices: Novint Falcon, Force Dimension Omega.3, Touch, and Touch X. Experts tested the devices in unpowered mode and across five representative virtual benchmark environments, providing extensive quantitative ratings and qualitative feedback. We distilled recurring themes from their input and analyzed correlations between expert opinions and sensor-based measurements. Our findings show that expert judgments of fundamental haptic quality indicators align closely with the metrics derived from Haptify. Moreover, a device's performance, both unpowered and in active benchmarks, can be used to predict its suitability for more complex applications, such as teleoperated surgery. By linking expert assessments with external measurement data, this thesis establishes a combined qualitative-quantitative framework for benchmarking haptic robots. This approach not only enables fair comparison across diverse devices but also establishes a direct connection between objective measurements and the subjective expertise of experienced hapticians. In doing so, it lays the foundation for more rigorous, transparent, and application-relevant evaluation of haptic technologies.
BibTeX

Haptic Intelligence Miscellaneous Rendering Forces with a Modular Cable System, Motors, and Brakes Bartels, J. U., Achberger, A., Kuchenbecker, K. J., Sedlmair, M. Extended abstract (3 pages) presented at the German Robotics Conference (GRC), Cologne, Germany, March 2026 (Published)
We describe the hardware design, force-rendering approach, and evaluation of a new reconfigurable haptic interface consisting of a network of hybrid motor-brake actuation modules that apply forces via cables. Each module contains both a motor and a brake, enabling it to smoothly render active forces up to 6 N using its motor and collision forces up to 186 N using its passive one-way brake. The modular design, meanwhile, allows the system to deliver rich haptic feedback in a flexible number of DoF and widely ranging configurations.
BibTeX

Haptic Intelligence Dynamic Locomotion Ph.D. Thesis The Human Leg Catapult: Biological Mechanisms for Walking Gait Replicated in the EcoWalker Robot Kiss, B. University of Stuttgart, Stuttgart, Germany, March 2026, Faculty of Civil and Environmental Engineering (Published)
Humanoid robots and assistive devices have yet to match the efficiency and adaptability of able-bodied human walking in challenging environments. To bridge this performance gap, my projects explored the underlying mechanisms of human locomotion, focusing on the ankle push-off. Ankle push-off has a prominent role in walking due to its high power output at the end of the stance phase and due to the impact of its timing on adaptability to diverse environments. The human leg catapult analogy provides a framework for understanding and replicating the complex biological mechanisms that govern human walking gait. As a platform for this replication, the human-like bipedal EcoWalker robot was developed from version 1 to 3 across three consecutive projects, with iterative design and control updates tailored to each project's goals. Our findings provide insights into the separate roles of mono- and biarticular muscle-tendon units in the human leg catapult, and we also show functional details of the human leg catapult release mechanism through five distinct release processes on the EcoWalker robot. Utilizing the robot in the projects ensures that our findings are relevant to practical applications, allowing humanoid robot and assistive device developers to build on our insights, potentially reducing the performance gap in efficiency and adaptability between able-bodied human walking and artificial walking.
BibTeX

Haptic Intelligence Robotic Materials Medical Systems Article Functional Gradients Facilitate Tactile Sensing in Elephant Whiskers Schulz, A. K., Kaufmann, L. V., Smith, L. T., Philip, D. S., David, H., Lazovic, J., Brecht, M., Richter, G., Kuchenbecker, K. J. Science, 391(6786):712-718, February 2026, Lena V. Kaufmann and Lawrence T. Smith contributed equally to this work (Published)
Keratin composites enable animals to hike with hooves, fly with feathers, and sense with skin. Mammalian whiskers are elongated keratin rods attached to tactile skin structures that extend the animal's sensory volume. We investigated the whiskers that cover Asian elephant (Elephas maximus) trunks and found that they are geometrically and mechanically tailored to facilitate tactile perception by encoding contact location in the amplitude and frequency of the vibrotactile signal felt at the whisker base. Elephant whiskers emerge from armored trunk skin and shift from a thick, circular, porous, stiff base to a thin, ovular, dense, soft tip. These functional gradients of geometry, porosity, and stiffness independently tune the neuromechanics of elephant trunk touch to facilitate highly dexterous manipulation while ensuring whisker durability.
MPI-IS News Article YouTube Video Highlight Whisker Simulation Toolkit Edmond Data Repository Download Paper for Free Press Coverage DOI BibTeX

Physical Intelligence Article Optoacoustically augmented magnetic guidewire for radiation-free minimally invasive therapies Wang, F., Bao, X., Yildiz, E., Yu, Y., Deán-Ben, X. L., Kang, W., Zhang, S., Sheehan, D., Soon, R. H., Zinnanti, J., Sitti, M. Science Advances, 12:eaea0201, February 2026 (Published)
Endovascular interventions are essential for treating cerebrovascular diseases, yet their monitoring methods commonly rely on ionizing radiation and contrast agents, posing unnecessary risks to patients and clinicians. We present a multifunctional optoacoustically augmented magnetic guidewire (OptoMaG) that integrates optoacoustic imaging with magnetic navigation to enable radiation-free, image-guided interventions. The ~250-micrometer flexible guidewire incorporates a 460-nanometer luminescent core with an enhanced optoacoustic signature and a FePt magnetic tip for precise, steerable control. Proof-of-concept studies show that OptoMaG can be actively navigated with external magnetic fields to traverse a 3D human-scale cerebrovascular phantom and accurately reach target brain sites. Beyond navigation, the FePt tip enables localized thermal ablation under remote radiofrequency stimulation, highlighting its theranostic potential for tumor treatment. In addition, OptoMaG functions as a light source for photodynamic therapy, selectively activating photosensitizers to destroy tumor cells while preserving healthy tissue. Collectively, OptoMaG provides a safe, radiation-free platform merging real-time navigation with targeted therapeutic capabilities.
DOI URL BibTeX

Haptic Intelligence Ph.D. Thesis Modeling, Fabricating, and Evaluating Synergistic Soft-Rigid Actuators Gertler, I. University of Stuttgart, Stuttgart, Germany, February 2026, Faculty of Engineering Design, Production Engineering and Automotive Engineering (Published)
Soft actuators offer lightweight, compliant, and safe alternatives to traditional mechanisms, but they often incur complicated actuation schemes, bulky support systems, and limited functionality when made solely from soft materials. Soft-rigid designs that integrate rigid elements into primarily soft bodies are common, yet the potential of those rigid parts to shape actuation behavior without compromising the overall softness remains underexplored, and fabrication practices often lack reproducibility. This thesis presents two case studies of synergistic hybrid actuation systems that utilize the complementary roles of soft and rigid components to dictate temporal and spectral behavior in response to simple input commands. Between the soft and hard components, one is typically active, while the other is passive. The first case study implements a soft-active/rigid-passive approach for the medical robotics application of endoluminal locomotion. A thin hyperelastic balloon encased in an inextensible sleeve is coupled with a thicker, non-encased balloon on a single fluid supply to serve as front and rear anchors, respectively. Geometry and material selection reshape the pressure-stretch response so the rear anchor inflates and deflates before the front anchor, enabling asymmetric sequencing useful for peristaltic locomotion inside a lumen. Numerical simulation and experiments validate the characteristic curves of dip-molded balloons and alternating anchoring in rigid tubes. The approach can be extended to generate actuation patterns for sequential haptic feedback and other robotic applications. The second case study applies a soft-passive/rigid-active strategy in the domain of fingertip haptic actuation. A dip-molded silicone sheath with embedded miniature magnets, excited by a single air-core coil, produces localized, rich vibrotactile feedback.
Simulations, mechanical measurements, and user experiments with a single-magnet design show consistent frequency-dependent behavior and strong perceptual salience. In follow-on work, various dual-magnet arrangements were also simulated, fabricated, and thoroughly evaluated. Classification tests indicate that frequency content is more important for perception than magnet orientation, while a realism-rating experiment supports the feasibility of audio-driven simple commands for realistic haptic feedback. The device is demonstrated on the fingertip in virtual reality and could be adapted for other body locations for navigation, rehabilitation, or related applications. Together, these studies provide design rules, a simulation-fabrication-validation workflow, and reproducible fabrication practices for soft-rigid hybrid actuators that realize desired mechanical outputs from minimal actuation commands. The methods and findings generalize to other soft actuators and have potential applications in domains such as medical devices, wearable technologies, and soft sensing.
BibTeX

Perceiving Systems Ph.D. Thesis From Perception to Actions: Autonomous Exploration, Synthetic Data, and Dynamic Worlds Bonetto, E. January 2026 (Published)
This thesis explores innovative methods and frameworks to enhance intelligent systems' visual perception capabilities. Vision is the primary means by which many animals perceive, understand, learn, reason about, and interact with the world to achieve their goals. Unlike animals, intelligent systems must acquire these capabilities by processing raw visual data captured by cameras using computer vision and deep learning. First, we consider a crucial aspect of visual perception in intelligent systems: understanding the structure and layout of the environment. To enable applications such as object interaction or extended reality in previously unseen spaces, these systems are often required to estimate their own motion. When operating in novel environments, they must also construct a map of the space. Together, these tasks constitute the essence of the Simultaneous Localization and Mapping (SLAM) problem. However, pre-mapping environments can be impractical, costly, and unscalable in scenarios like disaster response or home automation. This makes it essential to develop robots capable of autonomously exploring and mapping unknown areas, a process known as Active SLAM. Active SLAM typically involves a multi-step process in which the robot acts on the available information to decide the next best actions. The goal is to autonomously and efficiently explore environments without using prior information. Despite an extensive history, Active SLAM methods have focused only on short- or long-term objectives, without considering the totality of the process or adapting to ever-changing states. Addressing these gaps, we introduce iRotate to capitalize on continuous information-gain prediction. Distinct from prevailing approaches, iRotate constantly (pre)optimizes camera viewpoints, acting on i) long-term, ii) short-term, and iii) real-time objectives.
By doing this, iRotate significantly reduces energy consumption and localization errors, thus diminishing the exploration effort - a substantial leap in efficiency and effectiveness. iRotate, like many other SLAM approaches, leverages the assumption of operating in a static environment. Dynamic components in the scene significantly impact SLAM performance in the localization, place recognition, and optimization steps, hindering the widespread adoption of autonomous robots. This stems from the difficulties of collecting diverse ground-truth information in the real world and the long-standing limitations of simulation tools. Testing directly in the real world is costly and risky without prior simulation validation. Datasets, meanwhile, are inherently static and non-interactive, making them unsuitable for developing autonomous approaches. Finally, existing simulation tools often lack the visual realism and flexibility to create and control fully customized experiments to bridge the gap between simulation and the real world. This thesis addresses the challenges of obtaining ground-truth data and simulating dynamic environments by introducing the GRADE framework. Through a photorealistic rendering engine, we enable online and offline testing of robotic systems and the generation of richly annotated synthetic ground-truth data. By ensuring flexibility and repeatability, we allow the extension of previous experiments through variations, for example, in scene content or sensor settings. Synthetic data can first be used to address several challenges in the context of Deep Learning (DL) approaches, e.g., mismatched data distributions between applications, the costs and limits of data-collection procedures, and errors caused by incorrect or inconsistent labeling in training datasets. However, the gap between the real and simulated worlds often limits the direct use of synthetic data, making style transfer, adaptation techniques, or real-world information necessary.
Here, we leverage the photorealism obtainable with GRADE to generate synthetic data and overcome these issues. First, since humans are significant sources of dynamic behavior in environments and the target of many applications, we focus on their detection and segmentation. We train models on real, synthetic, and mixed datasets, and show that using only synthetic data can lead to state-of-the-art performance in indoor scenarios. Then, we leverage GRADE to benchmark several Dynamic Visual SLAM methods. These often rely on semantic segmentation and optical-flow techniques to identify moving objects and exclude their visual features from the pose estimation and optimization processes. Our evaluations show that they tend to reject too many features, leading to failures in accurately and fully tracking camera trajectories. Surprisingly, we observed low tracking rates not only on simulated sequences but also in real-world datasets. Moreover, we also show that the performance of the segmentation and detection models used is not always positively correlated with that of the Dynamic Visual SLAM methods. These failures are mainly due to incorrect estimations, crowded scenes, and a failure to account for the different motion states that objects can have. Addressing this, we introduce DynaPix. This Dynamic Visual SLAM method estimates per-pixel motion probabilities and incorporates them into enhanced pose estimation and optimization processes within the SLAM backend, resulting in longer tracking times and lower trajectory errors. Finally, we use GRADE to address the challenge of limited and inaccurate annotations of wild zebras, particularly for their detection and pose estimation when observed by unmanned aerial vehicles. Leveraging the flexibility of GRADE, we introduce ZebraPose - the first full top-down synthetic-to-real detection and 2D pose estimation method.
Unlike previous approaches, ZebraPose demonstrates that both tasks can be performed using only synthetic data, eliminating the need for costly data collection campaigns, time-consuming annotation procedures, or synthetic-to-real transfer techniques. Ultimately, this thesis demonstrates how combining perception with action can overcome critical limitations in robotics and environmental perception, thereby advancing the deployment of intelligent and autonomous systems for real-world applications. Through innovations like iRotate, GRADE, and ZebraPose, it paves the way for more robust, flexible, and efficient intelligent systems capable of navigating dynamic environments.
Thesis: From Perception to Actions BibTeX

Perceiving Systems Ph.D. Thesis Physics-Informed Modeling of Dynamic Humans and Their Interactions Shashank, T. January 2026 (Published)
Building convincing digital humans is central to the vision of shared virtual worlds for AR, VR, and telepresence. Yet, despite rapid progress in 3D vision, today's virtual humans often fall into a physical "uncanny valley": bodies float above or penetrate objects, motions ignore balance and biomechanics, and human-object interactions miss the rich contact patterns that make behavior look real. Enforcing physics through simulation is possible, but remains too slow, restrictive, and brittle for real-world, in-the-wild settings. This thesis argues that physical realism does not require full simulation. Instead, it can emerge from the same principles humans rely on every day: intuitive physics and contact. Inspired by insights from biomechanics and cognitive science, I present a unified framework that embeds these ideas directly into learning-based 3D human modeling. In this thesis, I present a suite of methods that bridge the gap between 3D human reconstruction and physical plausibility. I first introduce IPMAN, which incorporates differentiable biomechanical cues, such as center of mass and center of pressure, to produce stable, balanced, and grounded static poses. I then extend this framework to dynamic motion with HUMOS, a shape-conditioned motion generation model that accounts for how individual physiology influences movement, without requiring paired training data. Moving beyond locomotion, I address complex human-object interactions with DECO, a 3D contact detector that estimates dense, vertex-level contact across the full body surface. Finally, I present PICO, which establishes contact correspondences between the human body and arbitrary objects to recover full 3D interactions from single images. Together, these contributions bring physics-aware human modeling closer to practical deployment. The result is a step toward digital humans that not only look right, but move and interact with the world in ways that feel intuitively real.
Thesis BibTeX

Physical Intelligence Article Perturbing dynamics of active emulsions and their collectives Khan, M. T. A., Gardi, G., Soon, R. H., Zhang, M., Sitti, M. Matter, 9:00, January 2026 (Published)
Controlling fluidic flows in active droplets is crucial in developing intelligent models to understand and mimic single-celled microorganisms. Typically, these fluidic flows are affected by the interfacial dynamics of chemical agents. We found that these flows can be reconfigured by the mere presence of an anisotropic solid boundary embedded within active droplets. Spontaneous fluidic flows dynamically orient an embedded magnetic cluster, and the magnetic cluster, when realigned, causes these flows to reorient, thus providing control over the propulsion dynamics of chemotactic emulsions. When continuously perturbed, achiral emulsions exhibit emergent chiral motion with rotating fluidic flows. Such solid-fluid interactions occur in a number of self-propelling oil droplet systems, thereby enabling control over the emergent collective behaviors of chemically distinct active droplets.
DOI URL BibTeX

Haptic Intelligence Robotics Article Open-Source Hardware and Software Platform for Vibrotactile Motion Guidance Rokhmanova, N., Martus, J., Faulkner, R., Fiene, J., Kuchenbecker, K. J. Device, 4(1):100966, January 2026 (Published)
Vibrotactile feedback can enhance motor learning, sports training, and rehabilitation, but a lack of standardized tools limits its adoption. We developed a modular open-source hardware and software platform for delivering vibrotactile feedback that is spatially and temporally precise. The prototype device uses medical adhesive, linear resonant actuators (LRAs), and rigid 3D-printed components to standardize skin contact, avoiding the variability introduced by straps. The platform was validated by using the device's built-in accelerometers to fit a dynamic model of mechanical actuator vibration and examine how the anatomical site and body composition affect perceived vibration strength in 20 participants. Then, the platform was integrated with an optical motion-capture system to teach six participants a toe-in gait, showing potential for real-time, tailored clinical studies. By openly sharing the platform's hardware and software, we provide tools for delivering standardized vibrations and benchmarking feedback strategies in diverse applications.
DOI BibTeX

Social Foundations of Computation Miscellaneous Scaling Open-Ended Reasoning To Predict the Future Chandak, N., Shashwat, G., Prabhu, A., Hardt, M., Geiping, J. January 2026 (Submitted)
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
arXiv BibTeX
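For readers unfamiliar with forecast calibration, one standard proper scoring rule for probabilistic predictions is the Brier score. It is shown here only as background, with made-up numbers; the paper's own evaluation metrics may differ:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    Lower is better; 0 means every probability exactly matched the outcome.
    """
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical forecasts for four yes/no questions (1 = event occurred).
forecasts = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]
print(brier_score(forecasts, outcomes))  # → 0.0375
```

A forecaster that is both calibrated (probabilities match observed frequencies) and sharp (probabilities near 0 or 1) minimizes this score.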

Social Foundations of Computation Conference Paper Train-before-Test Harmonizes Language Model Rankings Zhang, G., Dominguez-Olmedo, R., Hardt, M. The Fourteenth International Conference on Learning Representations (ICLR), oral, top 1.18%, January 2026 (Accepted)
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to another. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
arXiv BibTeX
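Agreement between model rankings, the central quantity in this abstract, is commonly measured with a rank correlation such as Kendall's tau; the paper's exact metric may differ, so the following is only an illustrative sketch with hypothetical model names:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same models.

    rank_a, rank_b map model name -> position (1 = best).
    Returns +1 for identical order, -1 for fully reversed order.
    """
    models = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(models, 2):
        sign_a = rank_a[x] - rank_a[y]
        sign_b = rank_b[x] - rank_b[y]
        if sign_a * sign_b > 0:
            concordant += 1  # pair ordered the same way on both benchmarks
        elif sign_a * sign_b < 0:
            discordant += 1  # pair ordered oppositely
    n = len(models)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical positions of four models on two benchmarks.
bench1 = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
bench2 = {"m1": 1, "m2": 3, "m3": 2, "m4": 4}
print(kendall_tau(bench1, bench2))  # ≈ 0.667: one of six pairs flips
```

Under train-before-test, the paper's claim is that such pairwise agreement between benchmark-induced rankings rises toward 1.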

Perceiving Systems Ph.D. Thesis Aerial Robot Formations for Dynamic Environment Perception Price, E. University of Tübingen, Tübingen, Germany, December 2025 (Published)
Perceiving moving subjects, like humans and animals, outside an enclosed and controlled environment in a lab is inherently challenging, since subjects could move outside the view and range of cameras and sensors that are static and extrinsically calibrated. Previous state-of-the-art methods for such perception in outdoor scenarios use markers or sensors on the subject, which are both intrusive and unscalable for animal subjects. To address this problem, we introduce robotic flying cameras that autonomously follow the subjects. To enable functions such as monitoring, behaviour analysis or motion capture, a single point of view is often insufficient due to self-occlusion, lack of depth perception and coverage from all sides. Therefore, we propose a team of such robotic cameras that fly in formation to provide continuous coverage from multiple view-points. The position of the subject must be determined using markerless, remote sensing methods in real time. To solve this, we combine a convolutional neural network-based detector to detect the subject with a novel cooperative Bayesian fusion method to track the detected subject from multiple robots. The robots need to then plan and control their own flight path and orientation relative to the subject to achieve and maintain continuous coverage from multiple view-points. This, we address with a model-predictive-control-based method to predict and plan the motion of every robot in the formation around the subject. A preliminary demonstrator is implemented with multi-rotor drones. However, drones are noisy and potentially unsafe for the observed subjects. To address this, we introduce non-holonomic lighter-than-air autonomous airships (blimps) as the robotic camera platform. This type of robot requires dynamically constrained orbiting formations to achieve omnidirectional visual coverage of a moving subject in the presence of wind. Therefore, we introduce a novel model-predictive formation controller for a team of airships. 
We demonstrate and evaluate our complete system in field experiments involving both humans and wild animals as subjects. The collected data enables both human outdoor motion capture and animal behaviour analysis. Additionally, we propose the application of our method to autonomous long-term wildlife monitoring. This dissertation covers the design and evaluation of aerial robots suited to this task, including computer vision/sensing, data annotation and network training, sensor fusion, planning, control, simulation, and modelling.
Thesis DOI BibTeX

Perceiving Systems Conference Paper BEDLAM2.0: Synthetic humans and cameras in motion Tesch, J., Becherini, G., Achar, P., Yiannakidis, A., Kocabas, M., Patel, P., Black, M. J. In Advances in Neural Information Processing Systems (NeurIPS), Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, December 2025 (Published)
Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
Project Paper Video URL BibTeX

Perceiving Systems Conference Paper HairFree: Compositional 2D Head Prior for Text-Driven 360° Bald Texture Synthesis Ostrek, M., Black, M., Thies, J. In Advances in Neural Information Processing Systems (NeurIPS), December 2025 (Published)
Synthesizing high-quality 3D head textures is crucial for gaming, virtual reality, and digital humans. Achieving seamless 360° textures typically requires expensive multi-view datasets with precise tracking. However, traditional methods struggle without back-view data or precise geometry, especially for human heads, where even minor inconsistencies disrupt realism. We introduce HairFree, an unsupervised texturing framework guided by textual descriptions and 2D diffusion priors, producing high-consistency 360° bald head textures—including non-human skin with fine details—without any texture, back-view, bald, non-human, or synthetic training data. We fine-tune a diffusion prior on a dataset of mostly frontal faces, conditioned on predicted 3D head geometry and face parsing. During inference, HairFree uses precise skin masks and 3D FLAME geometry as input conditioning, ensuring high 3D consistency and alignment. We synthesize the full 360° texture by first generating a frontal RGB image aligned to the 3D FLAME pose and mapping it to UV space. As the virtual camera moves, we inpaint and merge missing regions. A built-in semantic prior enables precise region separation—particularly for isolating and removing hair—allowing seamless integration with various assets like customizable 3D hair, eyeglasses, jewelry, etc. We evaluate HairFree quantitatively and qualitatively, demonstrating its superiority over state-of-the-art 3D head avatar generation methods. https://hairfree.is.tue.mpg.de/
PDF Project Poster BibTeX

Physical Intelligence Article Nuclear magnetic resonance for wireless magnetic tracking Efe Tiryaki, M., Esmaeili-Dokht, P., Lazovic, J., Pruessmann, K. P., Sitti, M. Nature Communications, 16:10840, December 2025 (Published)
Wireless trackers have emerged as a crucial technology in minimally invasive medical procedures with their remote localization capabilities. Existing trackers suffer from miniaturization issues and complex designs, which limit their integration into medical devices. We present nuclear magnetic resonance (NMR) magnetic sensing, a quantum sensing approach with nT sensitivity for wireless magnetic tracking. NMR magnetic sensing enables millimeter-scale tracking accuracy and versatile miniaturized tracker designs for minimally invasive medical devices in magnetic resonance imaging scanners. As examples, we demonstrate miniature magnetic trackers with submillimeter-scale diameters for guidewires and optic fibers, flexible magnetic trackers for soft devices, and ferrofluidic trackers for shape-morphing devices. With the demonstrated miniaturization and wide range of tracker design possibilities, wireless magnetic tracking with NMR is promising for future minimally invasive medical operations.
DOI URL BibTeX

Perceiving Systems Conference Paper GenLit: Reformulating Single Image Relighting as Video Generation Bharadwaj, S., Feng, H., Becherini, G., Abrevaya, V. F., Black, M. J. In SIGGRAPH Asia Conference Papers ’25, Association for Computing Machinery, SIGGRAPH Asia, December 2025 (To be published)
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing.
Project Page Paper DOI URL BibTeX

Empirical Inference Conference Paper A data and task-constrained mechanistic model of the mouse outer retina shows robustness to contrast variations Kadhim, K. L., Beck, J., Huang, Z., Macke, J. H., Rieke, F., Euler, T., Deistler, M., Berens, P. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) bioRxiv BibTeX

Empirical Inference Conference Paper Are Language Models Efficient Reasoners? A Perspective from Logic Programming Opedal, A., Zengaffinen, Y., Shirakami, H., Pasti, C., Sachan, M., Saparov, A., Cotterell, R., Schölkopf, B. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper CauSciBench: Assessing LLM Causal Reasoning for Scientific Research Acharya, S., Zhang, T. J., Kim, A., Haghighat, A., Sun, X., Shrestha, R. B., Mordig, M., Danisman, F., Jose, C., Qi, Y., Cobben, P., Schölkopf, B., Sachan, M., Jin, Z. NeurIPS 2025: 5th Workshop on Mathematical Reasoning and AI (Math-AI) and CauScien Workshop, December 2025 (Published) URL BibTeX

Empirical Inference Conference Paper Counterfactual reasoning: an analysis of in-context emergence Miller, M., Schölkopf, B., Guo, S. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Haptic Intelligence Article Creating an Affective Robot That Feels Both Touch and Emotion Burns, R. B., Richardson, B. A., Klingenberg, J., Kuchenbecker, K. J. IEEE Transactions on Affective Computing, 1-18, December 2025, Rachael Bevill Burns and Benjamin A. Richardson contributed equally to this publication (Published)
Despite the importance of sensitive skin for living creatures, most robots can feel contact on only a tiny fraction of their exterior, if at all. Furthermore, typical robot reactions to touch are limited to event-based acknowledgments, lacking perceptual richness, lifelike positive/negative responses, and temporal dynamics. We address these gaps by introducing a practical full-body tactile-perception system for social robots, turning a NAO robot into the Haptic Empathetic Robot Animal (HERA). The sixteen main regions of the robot's body are instrumented with soft resistive tactile sensors covered by a tailored koala suit. Windows of each time-varying sensor output are continually classified into five gestures at two intensities via a two-stage machine-learning model. On challenging testing data containing simultaneous contacts, touch detection achieves an F1 score of 0.773, and gesture recognition achieves 52.2% accuracy (5.2 times chance); considering the temporal, spatial, and semantic adjacency of the applied touches increases these metrics to 0.896 and 86.6%, respectively. In turn, each detected contact drives a real-time emotion model that represents the robot's affective state as a second-order dynamic system analogous to a mass-spring-damper. This model's parameters control the robot's disposition, stoicism, and calmness. We explain the connections between HERA's hardware and software subsystems and demonstrate their combined ability to create an affective robot that feels both touch and emotion.
DOI BibTeX
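The HERA abstract describes the robot's affective state as a second-order dynamic system analogous to a mass-spring-damper, with parameters governing disposition, stoicism, and calmness. As a rough illustration of that general idea (not the authors' implementation; the mapping of m, c, k to those traits is an assumption), a detected touch can be modeled as a force impulse driving a damped oscillator that then settles back to rest:

```python
def simulate_emotion(impulses, m=1.0, c=0.8, k=2.0, dt=0.01, steps=2000):
    """Semi-implicit Euler integration of m*x'' + c*x' + k*x = f(t).

    x is a scalar affective state; impulses maps step index -> applied force.
    Hypothetical mapping: m ~ stoicism, c ~ calmness, k ~ disposition.
    """
    x, v = 0.0, 0.0
    trace = []
    for t in range(steps):
        f = impulses.get(t, 0.0)
        a = (f - c * v - k * x) / m  # Newton's second law for the oscillator
        v += a * dt
        x += v * dt
        trace.append(x)
    return trace

# A single positive "touch" impulse: the state rises, oscillates, and decays.
trace = simulate_emotion({0: 50.0})
```

With underdamped parameters like these, the state overshoots and rings before returning to neutral, which gives the temporal dynamics the abstract contrasts with simple event-based acknowledgments.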

Empirical Inference Conference Paper Cultural Alien Sampler: Open-ended art generation balancing originality and coherence Hernandez, A., Yakura, H., Brinkmann, L., Sola, M. C., Alhaija, H. A., Serna, I., Rahaman, N., Schölkopf, B., Rahwan, I. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, Creative AI Track, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Do-PFN: In-Context Learning for Causal Effect Estimation Robertson*, J., Reuter*, A., Guo, S., Hollmann, N., Hutter, F., Schölkopf, B. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025, *equal contribution (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models Vetter, J., Gloeckler, M., Gedon, D., Macke, J. H. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper FNOPE: Simulation-based inference on function spaces with Fourier Neural Operators Moss, G., Muhle, L. S., Drews, R., Macke, J. H., Schröder, C. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Forecasting in Offline Reinforcement Learning for Non-stationary Environments Ada, S. E., Martius, G., Ugur, E., Oztop, E. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Identifying multi-compartment Hodgkin-Huxley models with high-density extracellular voltage recordings Tanoh, I. C., Deistler, M., Macke, J. H., Linderman, S. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Perceiving Systems Ph.D. Thesis Learning Hands in Action Fan, Z. December 2025 (Published)
Hands are our primary interface for acting on the world. From everyday tasks like preparing food to skilled procedures like surgery, human activity is shaped by rich and varied hand interactions. These include not only manipulation of external objects but also coordinated actions between both hands. For physical AI systems to learn from human behavior, assist in physical tasks, or collaborate safely in shared environments, they must perceive and understand hands in action, how we use them to interact with each other and with the objects around us. A key component of this understanding is the ability to reconstruct human hand motion and hand-object interactions in 3D from RGB images or videos. However, existing methods focus largely on estimating the pose of a single hand, often in isolation. They struggle with scenarios involving two hands in strong interactions or the interactions with objects, particularly when those objects are articulated or previously unseen. This is because reconstructing 3D hands in action poses significant challenges, such as severe occlusions, appearance ambiguities, and the need to reason about both hand and object geometry in dynamic configurations. As a result, current systems fall short in complex real-world environments. This dissertation addresses these challenges by introducing methods and data for reconstructing hands in action from monocular RGB inputs. We begin by tackling the problem of interacting hand pose estimation. We present DIGIT, a method that leverages a part-aware semantic prior to disambiguate closely interacting hands. By explicitly modeling hand part interactions and encoding the semantics of finger parts, DIGIT robustly recovers accurate hand poses, outperforms prior baselines and provides a step forward for more complete 3D hands in action understanding. Since hands frequently manipulate objects, jointly reconstructing both is crucial. 
Existing methods for hand-object reconstruction are limited to rigid objects and cannot handle tools with articulation, such as scissors or laptops. This severely restricts their ability to model the full range of everyday manipulations. We present the first method that jointly reconstructs two hands and an articulated object from a single RGB image, enabling unified reasoning across both rigid and articulated object interactions. To support this, we introduce ARCTIC, a large-scale motion capture dataset of humans performing dexterous bimanual manipulation with articulated tools. ARCTIC includes both articulated and fixed (rigid) configurations, along with accurate 3D annotations of hand poses and object motions. Leveraging this dataset, our method jointly infers object articulation states and hand poses, advancing the state of hand-object understanding in complex object manipulation settings. Finally, we address generalization to in-the-wild object interactions. Prior approaches either rely on synthetic data with limited realism or require object models at test time. We introduce HOLD, a self-supervised method that learns to reconstruct 3D hand-object interactions from monocular RGB videos, without paired 3D annotations or known object models. HOLD learns via an appearance- and motion-consistent objective across views and time, enabling strong generalization to unseen objects in interaction. Experiments demonstrate HOLD's ability to generalize to in-the-wild monocular settings, outperforming fully-supervised baselines trained on synthetic or lab-captured datasets. Together, DIGIT, ARCTIC, and HOLD advance the 3D understanding of hands in action, covering both hand-hand and hand-object interactions. 
These contributions improve robustness in interacting hand pose estimation, introduce a dataset for bimanual manipulation with rigid and articulated tools, and include the first single-image method for jointly reconstructing hands and articulated objects learned directly from this dataset. In addition, HOLD removes the need for object templates by enabling hand-object reconstruction in the wild. These developments move toward more scalable physical AI systems capable of interpreting and imitating human manipulation, with applications in teleoperation, human-robot collaboration, and embodied learning from demonstration.
PDF BibTeX

Empirical Inference Conference Paper Reparameterized LLM Training via Orthogonal Equivalence Transformation Qiu, Z., Buchholz, S., Xiao, T., Dax, M., Schölkopf, B., Liu, W. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Root Cause Analysis of Outliers with Missing Structural Knowledge Orchard, W. R., Okati, N., Garrido Mejia, S., Blöbaum, P., Janzing, D. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper SPARTAN: A Sparse Transformer World Model Attending to What Matters Lei, A., Schölkopf, B., Posner, I. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Organizational Leadership and Diversity Conference Paper Inclusive Leadership in the Age of AI: A Dataset and Comparative Study of LLMs vs. Real-Life Leaders in Workplace Action Planning Singh, V., Schulte im Walde, S., Keplinger, K. Findings of the Association for Computational Linguistics: EMNLP 2025, 19732-19753, Association for Computational Linguistics, Suzhou, China, Empirical Methods in Natural Language Processing, November 2025 (Published)
Generative Large Language Models have emerged as useful tools, reshaping professional workflows. However, their efficacy in inherently complex and human-centric tasks such as leadership and strategic planning remains under-explored. In this interdisciplinary study, we present a novel dataset and compare LLMs and human leaders in the context of workplace action planning, specifically focusing on translating the abstract idea of inclusion into actionable SMART goals. We developed the Leader Success Bot, a script-based chatbot co-designed with domain experts, to guide more than 250 real-life leaders in generating inclusive workplace action plans. We systematically prompted seven state-of-the-art chat-based LLMs to perform the same task using the socio-demographic data of real-life leaders and instructions co-developed with domain experts. Our publicly released dataset enables direct comparison between human and LLM-generated workplace action plans, offering insights into their respective strengths, biases, and limitations. Our findings highlight critical gaps and opportunities for LLMs in leadership applications, fostering interdisciplinary collaboration and NLP applications.
DOI URL BibTeX

Haptic Intelligence Perceiving Systems Ph.D. Thesis An Interdisciplinary Approach to Human Pose Estimation: Application to Sign Language Forte, M. University of Tübingen, Tübingen, Germany, November 2025, Department of Computer Science (Published)
Accessibility legislation mandates equal access to information for Deaf communities. While videos of human interpreters provide optimal accessibility, they are costly and impractical for frequently updated content. AI-driven signing avatars offer a promising alternative, but their development is limited by the lack of high-quality 3D motion-capture data at scale. Vision-based motion-capture methods are scalable but struggle with the rapid hand movements, self-occlusion, and self-touch that characterize sign language. To address these limitations, this dissertation develops two complementary solutions. SGNify improves hand pose estimation by incorporating universal linguistic rules that apply to all sign languages as computational priors. Proficient signers recognize the reconstructed signs as accurately as those in the original videos, but depth ambiguities along the camera axis can still produce incorrect reconstructions for signs involving self-touch. To overcome this remaining limitation, BioTUCH integrates electrical bioimpedance sensing between the wrists of the person being captured. Systematic measurements show that skin-to-skin contact produces distinctive bioimpedance reductions at high frequencies (240 kHz to 4.1 MHz), enabling reliable contact detection. BioTUCH uses the timing of these self-touch events to refine arm poses, producing physically plausible arm configurations and significantly reducing reconstruction error. Together, these contributions support the scalable collection of high-quality 3D sign language motion data, facilitating progress toward AI-driven signing avatars.
BibTeX

Empirical Inference Conference Paper Improving Large Language Model Safety with Contrastive Representation Learning Simko, S., Sachan, M., Schölkopf, B., Jin, Z. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 28166-28194, (Editors: Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet), Association for Computational Linguistics, November 2025 (Published) arXiv DOI URL BibTeX

Social Foundations of Computation Miscellaneous Policy Design in Long-run Welfare Dynamics Wu, J., Abebe, R., Hardt, M., Stoica, A. Proceedings of the Fifth ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), November 2025 (Published) URL BibTeX

Physical Intelligence Article Optoacoustic-Guided Magnetic Microrobot Platform for Precision Drug Delivery Wang, F., Yildiz, E., Deán-Ben, X. L., Yu, Y., Nozdriukhin, D., Kang, W., Zhang, S., Zinnanti, J., Sheehan, D., Soon, R. H., Sitti, M. Advanced Materials, 38:e11870, October 2025 (Published)
Precision drug delivery remains a significant challenge due to limitations in drug loading, targeted release, precise navigation, and real-time monitoring. Here, the study reports a magnetic microrobot platform (MMP) that integrates high-capacity drug loading, magnetically actuated collective navigation, controlled drug release, and real-time 3D optoacoustic imaging in a single system. The MMP exploits synergistic advantages by embedding hard-magnetic FePt nanoparticles in a degradable ZIF-8 shell, achieving a drug loading efficiency of ≈93.9% and enabling precise release in response to pH changes and radiofrequency-induced heating. Reconfigurable swarm behavior strategies significantly enhance the navigation efficiency of microrobots against physiological blood flows within complex cerebral vasculature. The ex vivo and in vivo experiments further demonstrate strong contrast characteristics of the microrobots, enabling high-resolution visualization of deep vascular structures and dynamic tracking of MMP with real-time 3D optoacoustic imaging. This multifunctional strategy paves the way for clinical translation and precision therapy in complex biological settings.
DOI URL BibTeX

Physical Intelligence Article Emergent Motility of Self-Organized Particle-Giant Unilamellar Vesicle Assembly Karaz, S., Gardi, G., Han, M., Baltaci, S. F., Akolpoglu, M. B., Sitti, M. Advanced Materials, xx:e12036, October 2025 (Published)
Giant unilamellar vesicles (GUVs), soft cell-sized compartments formed through the self-assembly of lipid molecules, have long been utilized as model systems and passive carriers in membrane biophysics and biomedical applications. However, their potential as dynamically responsive and motile systems remains largely untapped due to challenges in achieving controlled and sustained motion in soft, deformable structures. Here, an autonomous cell-like microrobot through the emergent self-assembly of GUVs (5-10 µm) and silica microparticles (1-3 µm) under alternating current electric fields is realized. Self-propulsion arises from asymmetric self-organization of the particles on the vesicle surface, enabling a reversible transformation of the assembly into an active structure. Unlike rigid colloidal systems, GUVs introduce unique features enabled by their soft lipid membranes: shape deformations, membrane tension-dependent motility, and field-triggered live bacteria release via vesicle bursting. Through experiments and simulations, the mechanisms underlying self-assembly and propulsion are investigated, and a dynamic phase diagram is constructed to map the motion regime as a function of field parameters. Finally, it is shown that these self-assembled structures are capable of reconfiguration in response to local constraints in the environment, suggesting potential applications in complex environments and advancing the potential of GUVs toward the rational design of cell-like microrobots or artificial cell systems.

Physical Intelligence Article Wireless nonresonant stimulation of neurons on a magnetoelectric film surface Aydin, A., Jahanshahi, A., Esmaeili-Dokht, P., Han, M., Gardi, G., Yu, Y., Soon, R. H., Temel, Y., Sitti, M. Science Advances, 11:eadx6829, October 2025 (Published)
Wireless neural interfaces are emerging as a minimally invasive treatment option for neurological disorders. Among the wireless technologies, magnetically powered systems are effective for targeting deep brain sites. However, dependence on high-frequency electromagnetic fields in such systems limits their safe implementation. In this study, we demonstrate the use of millimeter-scale magnetoelectric (ME) films as a direct neural interface for wireless neurostimulation, powered by static and alternating magnetic fields in the nonresonant regime (10 hertz). To accomplish this objective, electrical potential trends of the ME films under varying low-frequency magnetic fields are investigated and used to demonstrate neural stimulation by calcium imaging on primary neurons in vitro via a capacitive-like charge injection mechanism. In addition, electrical polarization orientation is revealed as a critical design parameter in direct neuron-ME interfaces. These findings collectively demonstrate the potential of nonresonant powering of ME films as a promising minimally invasive wireless neural stimulation technique.

Empirical Inference Article In silico biological discovery with large perturbation models Miladinovic*, D., Höppe*, T., Chevalley, M., Georgiou, A., Stuart, L., Mehrjou, A., Bantscheff, M., Schölkopf, B., Schwab, P. Nature Computational Science, October 2025, *equal contribution (Published)
Data generated in perturbation experiments link perturbations to the changes they elicit and therefore contain information relevant to numerous biological discovery tasks—from understanding the relationships between biological entities to developing therapeutics. However, these data encompass diverse perturbations and readouts, and the complex dependence of experimental outcomes on their biological context makes it challenging to integrate insights across experiments. Here we present the large perturbation model (LPM), a deep-learning model that integrates multiple, heterogeneous perturbation experiments by representing perturbation, readout and context as disentangled dimensions. LPM outperforms existing methods across multiple biological discovery tasks, including in predicting post-perturbation transcriptomes of unseen experiments, identifying shared molecular mechanisms of action between chemical and genetic perturbations, and facilitating the inference of gene–gene interaction networks. LPM learns meaningful joint representations of perturbations, readouts and contexts, enables the study of biological relationships in silico and could considerably accelerate the derivation of insights from pooled perturbation experiments.
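The key architectural idea in the LPM abstract, representing perturbation, readout, and context as disentangled dimensions so that unseen combinations can be queried, can be illustrated with a minimal toy sketch. This is not the authors' implementation: every size, weight, and function name below is invented for illustration, and a small random-weight network stands in for the trained deep model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of perturbations, readouts, contexts, embedding width.
N_PERT, N_READ, N_CTX, DIM = 100, 50, 10, 16

# Disentangled embedding tables, one per dimension of the experiment.
E_pert = rng.normal(size=(N_PERT, DIM))
E_read = rng.normal(size=(N_READ, DIM))
E_ctx = rng.normal(size=(N_CTX, DIM))

# Toy two-layer network mapping the concatenated embedding to a scalar readout.
W1 = 0.1 * rng.normal(size=(3 * DIM, 32))
W2 = 0.1 * rng.normal(size=32)

def predict(pert_id, read_id, ctx_id):
    """Predict a post-perturbation readout for a (perturbation, readout, context) triple."""
    x = np.concatenate([E_pert[pert_id], E_read[read_id], E_ctx[ctx_id]])
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    return float(h @ W2)

# Because the three factors are embedded separately, any triple can be queried,
# including combinations never observed together in a single experiment.
y = predict(pert_id=3, read_id=7, ctx_id=1)
```

The point of the factorization is the last line: a perturbation seen only in one cellular context can be combined with the embedding of another context, which is how such a model can predict post-perturbation transcriptomes for unseen experiments.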