Abstract: Enabling a robot to monitor a short-term, task-oriented multimodal interaction with human partners is a very challenging research objective. The usual approach consists in observing human-human interactions and trying to reproduce certain of their properties, such as the coordination of multimodal events associated with particular sub-tasks of the interaction chart. This policy faces two main issues: firstly, scaling human behaviors to the sensorimotor abilities of a particular robot is a tough challenge; secondly, human partners will certainly behave differently when facing a robot rather than a human, whatever social skills the robot may have. We will sketch the approach we put forward in the ANR project SOMBRERO, namely (1) immersive teleoperation for collecting ground-truth HRI data, (2) a learning framework for modelling joint interactive behaviors via discrete events, and (3) a set of gestural controllers converting these discrete events into multimodal trajectories. We will detail the research we have conducted over the last years to build a complete system around NINA, our iCub platform. We will in particular discuss the development and evaluation of some key controllers – notably gaze and speech – as well as the assessment of the whole multimodal score that orchestrates the gestural controllers.