Stable Diffusion

Discuss matters related to our favourite AI Art generation technology

Emerging Properties in Unified Multimodal Pretraining (infosec.pub)

submitted 2 days ago* (last edited 2 days ago) by Even_Adder@lemmy.dbzer0.com to c/stable_diffusion@lemmy.dbzer0.com

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Paper: https://arxiv.org/abs/2505.14683

Code: https://github.com/bytedance-seed/BAGEL

Demo: https://demo.bagel-ai.org/

Project Page: https://bagel-ai.org/

Model: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT
