" " What Contributes Most to Multimodal Transformer Success? – Web 3 News Hubb " "
Web 3 News Hubb
  • Home
  • Edge Computing
  • Artificial Intelligence
  • Blockchain
  • Contact
No Result
View All Result
Web 3 News Hubb
  • Home
  • Edge Computing
  • Artificial Intelligence
  • Blockchain
  • Contact
No Result
View All Result
Web 3 News Hubb
No Result
View All Result
Home Artificial Intelligence

What Contributes Most to Multimodal Transformer Success?

by admin · August 28, 2023 · Artificial Intelligence


The ability to ground language in vision is a fundamental aspect of real-world AI systems; it is useful across a range of tasks (e.g., visual question answering) and applications (e.g., generating descriptions for visually impaired users). Multimodal models, pre-trained on image-language pairs, aim to address this grounding problem. A recent family of models, multimodal transformers (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020), has achieved state-of-the-art performance on a range of multimodal benchmarks, suggesting that the joint-encoder transformer architecture is better suited to capturing the alignment between image-language pairs than previous approaches, such as dual encoders.

In particular, multimodal transformers (joint encoders) are more sample efficient than the dual-encoder architecture, in which there is no cross-talk between the modalities. In the plot below, we see that, when tested on zero-shot image retrieval, an existing multimodal transformer (UNITER) performs similarly to a large-scale dual encoder (CLIP) that is trained on 100 times more data.

Figure: zero-shot image-retrieval performance against the amount of pretraining data. BOW-DE: Miech & Alayrac et al., arXiv 2021; MMT: Hendricks et al., TACL 2021; UNITER: Chen et al., ECCV 2020; CLIP: Radford et al., arXiv 2021; ALIGN: Jia et al., arXiv 2021.
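To make the architectural contrast concrete, here is a minimal PyTorch sketch of the two designs, not the authors' code: a dual encoder processes each modality separately and compares them only through a final similarity score, while a joint (single-stream) encoder concatenates image and text tokens so self-attention can mix the modalities at every layer. All module names, depths, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
DIM, HEADS, TXT_LEN, IMG_LEN = 256, 4, 16, 36

class DualEncoder(nn.Module):
    """Dual encoder: each modality is encoded separately; the only
    cross-modal interaction is a final cosine similarity."""
    def __init__(self):
        super().__init__()
        t_layer = nn.TransformerEncoderLayer(DIM, HEADS, batch_first=True)
        v_layer = nn.TransformerEncoderLayer(DIM, HEADS, batch_first=True)
        self.text_enc = nn.TransformerEncoder(t_layer, num_layers=2)
        self.image_enc = nn.TransformerEncoder(v_layer, num_layers=2)

    def forward(self, text_tokens, image_regions):
        t = self.text_enc(text_tokens).mean(dim=1)     # pooled text embedding
        v = self.image_enc(image_regions).mean(dim=1)  # pooled image embedding
        # Cosine similarity is the *only* point of cross-talk.
        return nn.functional.cosine_similarity(t, v)

class JointEncoder(nn.Module):
    """Joint (single-stream) encoder: image and text tokens are
    concatenated, so self-attention spans both modalities at every layer."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.match_head = nn.Linear(DIM, 1)  # image-text matching score

    def forward(self, text_tokens, image_regions):
        joint = torch.cat([text_tokens, image_regions], dim=1)
        h = self.encoder(joint)  # cross-modal attention happens here
        return self.match_head(h.mean(dim=1)).squeeze(-1)

text = torch.randn(2, TXT_LEN, DIM)   # stand-in for embedded caption tokens
image = torch.randn(2, IMG_LEN, DIM)  # stand-in for embedded image regions
print(DualEncoder()(text, image).shape, JointEncoder()(text, image).shape)
```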

In this work, we examine which aspects of multimodal transformers (attention, losses, and pretraining data) are important to their success at multimodal pretraining. We find that multimodal attention, where the language and image transformers attend to each other, is crucial for these models' success. Models with other types of attention, even with more depth or parameters, fail to match shallower, smaller models that use multimodal attention. Moreover, comparable results can be achieved without the image (masked region modelling) loss originally proposed for multimodal transformers. This suggests that current models do not tap into the useful signal in the image modality, presumably because of how the image loss is formulated.
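A minimal sketch of one such multimodal-attention block, in the spirit of two-stream co-attention models such as ViLBERT: each stream uses its own hidden states as queries and the other modality's states as keys and values. This is an illustration of the mechanism, not the paper's implementation; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Multimodal (co-)attention sketch: language queries attend over
    image keys/values, and image queries attend over text keys/values."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text, image):
        # Each modality attends to the *other* one.
        t2, _ = self.txt_attends_img(query=text, key=image, value=image)
        v2, _ = self.img_attends_txt(query=image, key=text, value=text)
        # Residual + norm, as in a standard transformer sublayer.
        return self.norm_t(text + t2), self.norm_v(image + v2)

text = torch.randn(2, 16, 256)   # embedded caption tokens (illustrative)
image = torch.randn(2, 36, 256)  # embedded image regions (illustrative)
t_out, v_out = CoAttentionBlock()(text, image)
print(t_out.shape, v_out.shape)  # (2, 16, 256) and (2, 36, 256)
```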

We also study different properties of multimodal datasets, such as their size and the degree to which the language describes its corresponding image (noisiness). We find that a dataset's size does not always predict multimodal transformers' performance; its noise level and the similarity of its language to that of the evaluation task are both important contributing factors. These results suggest that curating less noisy image-text datasets is important, despite the current trend of harvesting noisy datasets from the web.
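As a rough illustration of the second factor, one could proxy "language similarity to the evaluation task" by the vocabulary overlap between pretraining captions and evaluation-task text. This is not the paper's metric, just a toy sketch with made-up example strings.

```python
def vocab_jaccard(pretrain_captions, eval_texts):
    """Toy proxy for language similarity: Jaccard overlap between the
    word vocabularies of pretraining captions and evaluation text."""
    pre = {w for c in pretrain_captions for w in c.lower().split()}
    ev = {w for t in eval_texts for w in t.lower().split()}
    return len(pre & ev) / len(pre | ev)

# Descriptive captions vs. noisy web alt-text (toy examples).
clean = ["a dog catches a frisbee in the park", "two people ride bicycles"]
noisy = ["IMG_0042.jpg", "click here for best deals", "sale ends friday"]
eval_task = ["what is the dog catching", "how many people are riding"]

print("clean vs eval:", vocab_jaccard(clean, eval_task))
print("noisy vs eval:", vocab_jaccard(noisy, eval_task))
```

On these toy inputs, the descriptive captions share far more vocabulary with the evaluation questions than the noisy alt-text does, which is the intuition behind noise level and language similarity mattering beyond raw dataset size.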

Overall, our analysis shows that multimodal transformers are stronger than dual-encoder architectures (given the same amount of pretraining data), mainly because of the cross-talk enabled by multimodal attention. However, many open problems remain in designing multimodal models, including better losses for the image modality and robustness to dataset noise.


