In the evolving field of artificial intelligence, multi-modal models that process several types of data, such as text, images, and audio, have shown promise in tasks like healthcare diagnostics and visual question answering. However, these models often underperform compared to single-modality models, a phenomenon that has puzzled researchers. To address this, a team from NYU's Center for Data Science introduced the inter- and intra-modality modeling (I2M2) framework. This approach explicitly captures the relationships both between different data modalities (inter-modality) and within each modality (intra-modality), aiming to enhance the model's ability to integrate and interpret complex, multi-source information (nyudatascience.medium.com).
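To make the dual-path idea concrete, here is a minimal sketch of what an I2M2-style classifier could look like. This is an illustrative assumption, not the authors' implementation: the module names, layer sizes, and the additive combination of logits are all hypothetical choices. Each modality gets its own intra-modality branch, and a separate fusion branch models inter-modality interactions over the concatenated features.

```python
import torch
import torch.nn as nn


class I2M2Sketch(nn.Module):
    """Hypothetical I2M2-style classifier: intra-modality branches plus an
    inter-modality fusion branch, combined in logit space (an assumption)."""

    def __init__(self, dim_image: int, dim_text: int, num_classes: int, hidden: int = 128):
        super().__init__()
        # Intra-modality branches: capture label-relevant structure within each modality.
        self.image_branch = nn.Sequential(
            nn.Linear(dim_image, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )
        self.text_branch = nn.Sequential(
            nn.Linear(dim_text, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )
        # Inter-modality branch: models interactions across the concatenated modalities.
        self.fusion_branch = nn.Sequential(
            nn.Linear(dim_image + dim_text, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        intra_image = self.image_branch(image_feat)
        intra_text = self.text_branch(text_feat)
        inter = self.fusion_branch(torch.cat([image_feat, text_feat], dim=-1))
        # Combine intra- and inter-modality evidence additively
        # (one simple choice of combination rule, assumed for illustration).
        return intra_image + intra_text + inter


# Usage example with random features standing in for per-modality encoder outputs.
model = I2M2Sketch(dim_image=512, dim_text=256, num_classes=3)
logits = model(torch.randn(4, 512), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 3])
```

The key design point the sketch tries to convey is that the intra-modality paths can carry the prediction even when cross-modal interactions are weak, while the fusion path contributes when the modalities genuinely inform each other.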
The I2M2 framework was evaluated across several datasets, including knee MRI scans for diagnosing conditions like ACL injuries and meniscus tears, as well as vision-language tasks such as visual question answering. The results showed consistent performance improvements over traditional multi-modal models, highlighting the framework's versatility and effectiveness. By making the modeling of these dependencies explicit, I2M2 allows the AI system to better understand and exploit the intricate relationships inherent in multi-modal data, paving the way for more robust and accurate AI applications in diverse fields (nyudatascience.medium.com).