Stable Diffusion


Discuss matters related to our favourite AI Art generation technology


This is a copy of the /r/stablediffusion wiki to help people who need access to that information.


Howdy and welcome to r/stablediffusion! I'm u/Sandcheeze and I have collected these resources and links to help you enjoy Stable Diffusion, whether you are here for the first time or looking to add more customization to your image generations.

If you'd like to show support, feel free to send us kind words or check out our Discord. Donations are appreciated but not necessary; being a great part of the community is all we ask for.

Note: The community resources provided here are not endorsed, vetted, nor provided by Stability AI.

#Stable Diffusion

Local Installation

Active community repos/forks to install on your PC and keep everything local.

Online Websites

Websites with usable Stable Diffusion right in your browser. No need to install anything.

Mobile Apps

Stable Diffusion on your mobile device.

Tutorials

Learn how to improve your Stable Diffusion skills, whether you are a beginner or an expert.

DreamBooth

How to train a custom model, plus resources for doing so.

Models

Models specially trained toward certain subjects and/or styles.

Embeddings

Tokens trained on specific subjects and/or styles.

Bots

Either bots you can self-host, or bots you can use directly on websites and services such as Discord, Reddit, etc.

3rd Party Plugins

SD plugins for programs such as Discord, Photoshop, Krita, Blender, Gimp, etc.

Other useful tools

#Community

Games

  • PictionAIry: (Video | 2-6 Players) - The image-guessing game where AI does the drawing!

Podcasts

Databases or Lists

Still updating this with more links as I collect them all here.

FAQ

How do I use Stable Diffusion?

  • Check out our guides section above!

Will it run on my machine?

  • Stable Diffusion requires a GPU with at least 4 GB of VRAM to run locally; beefier cards (10-, 20-, or 30-series Nvidia) are needed to generate high-resolution or high-step images. Alternatively, anyone can run it online through DreamStudio or by hosting it on their own GPU compute cloud server. (A quick local check script follows this list.)
  • Only Nvidia cards are officially supported.
  • AMD support is available unofficially.
  • Apple M1 support is available unofficially.
  • Intel-based Macs currently do not work with Stable Diffusion.
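
If you're not sure how much VRAM your GPU has, a quick way to check is with PyTorch. This is a minimal sketch, assuming PyTorch with CUDA support is already installed; it only reports what it finds and is not an official compatibility checker.

```python
# Minimal sketch: check whether a CUDA GPU is visible and how much VRAM it has.
# Assumes PyTorch is installed with CUDA support.
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU detected; consider DreamStudio or a cloud GPU instead.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 4:
        print("Below the ~4 GB minimum for running Stable Diffusion locally.")
    else:
        print("Meets the minimum; more VRAM helps with high-resolution or high-step images.")
```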

How do I get a website or resource added here?

If you have a suggestion for a website or a project to add to our list, or if you would like to contribute to the wiki, please don't hesitate to reach out to us via modmail or message me.


Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/.

Paper: https://arxiv.org/abs/2505.14683

Code: https://github.com/bytedance-seed/BAGEL

Demo: https://demo.bagel-ai.org/

Project Page: https://bagel-ai.org/

Model: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT
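
For anyone who wants the released weights locally, here is a minimal sketch of downloading the checkpoint listed above with huggingface_hub (an assumption: `pip install huggingface_hub` and an arbitrary local folder name; the actual inference code lives in the BAGEL repo linked above).

```python
# Minimal sketch: download the BAGEL-7B-MoT checkpoint from Hugging Face.
# Assumes huggingface_hub is installed; see the BAGEL repo for real inference code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # the model repo listed above
    local_dir="BAGEL-7B-MoT",               # arbitrary local folder (assumption)
)
print("Checkpoint downloaded to:", local_dir)
```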


Abstract

While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

Technical Report: https://arxiv.org/abs/2505.07747

Code: https://github.com/stepfun-ai/Step1X-3D

Demo: https://huggingface.co/spaces/stepfun-ai/Step1X-3D

Project Page: https://stepfun-ai.github.io/Step1X-3D/

Models: https://huggingface.co/stepfun-ai/Step1X-3D
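
The geometry stage above outputs TSDF (truncated signed distance function) representations. As a quick illustration of what a TSDF is (entirely independent of the Step1X-3D code), here is a tiny numpy sketch building one for a unit sphere on a voxel grid:

```python
# Tiny illustration of a TSDF: signed distance to a surface, clipped to a narrow band.
# The "surface" here is just a unit sphere on a 64^3 voxel grid; illustrative only.
import numpy as np

res, trunc = 64, 0.1                         # voxel grid resolution and truncation band
xs = np.linspace(-1.5, 1.5, res)
x, y, z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 1.0      # signed distance to a unit sphere
tsdf = np.clip(sdf, -trunc, trunc)           # truncate far values to the narrow band

print(tsdf.shape, float(tsdf.min()), float(tsdf.max()))   # (64, 64, 64) -0.1 0.1
```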


Abstract

Animation has gained significant interest in the film and TV industry in recent years. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they are far less effective at handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation benchmark. Supported by the data processing pipeline with over 10M high-quality samples, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos, with specifically developed metrics for animation video generation. Our entire project is publicly available at the links below.

Paper: https://arxiv.org/abs/2412.10255

Code: https://github.com/bilibili/Index-anisora/tree/main

Hugging Face: https://huggingface.co/IndexTeam/Index-anisora

Modelscope: https://www.modelscope.cn/organization/bilibili-index

Project Page: https://komiko.app/video/AniSora
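
The spatiotemporal mask module is what lets a single model cover image-to-video, frame interpolation, and localized image-guided animation: known frames or regions are marked in a conditioning mask over (time, height, width). The toy sketch below only illustrates that masking idea and is not AniSora's implementation.

```python
# Toy conditioning masks over (frames, height, width):
# 1 marks pixels given as guidance, 0 marks pixels the model must generate.
# Illustrative only; not AniSora's actual code.
import numpy as np

frames, h, w = 16, 64, 64   # toy video size: 16 frames of 64x64

i2v_mask = np.zeros((frames, h, w), dtype=np.float32)
i2v_mask[0] = 1.0                        # image-to-video: condition on the first frame

interp_mask = np.zeros((frames, h, w), dtype=np.float32)
interp_mask[0] = interp_mask[-1] = 1.0   # frame interpolation: first and last frames given

local_mask = np.zeros((frames, h, w), dtype=np.float32)
local_mask[:, 16:48, 16:48] = 1.0        # localized guidance: a fixed region in every frame

print(i2v_mask.mean(), interp_mask.mean(), local_mask.mean())
```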


Change Log for SD.Next

Highlights for 2025-04-28

Another major release with over 120 commits!
Highlights include the new Nunchaku inference engine, which runs FLUX.1 with 3-5x higher performance, and a new FramePack extension for high-quality I2V and FLF2V video generation with unlimited duration!

What else?

  • New UI History tab
  • New models: Flex.2, LTXVideo-0.9.6, WAN-2.1-14B-FLF2V; new schedulers: UniPC and LCM FlowMatch; new features: CFGZero
  • Major updates to: NNCF, OpenVINO, ROCm, ZLUDA
  • Cumulative fixes since last release

ReadMe | ChangeLog | Docs | WiKi | Discord


Abstract

Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT) with a decoupled design: a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing of self-conditioning between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

Paper: https://arxiv.org/abs/2504.05741

Code: https://github.com/MCG-NJU/DDT

Demo: https://huggingface.co/spaces/MCG-NJU/DDT
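
The inference speed-up from sharing the self-condition relies on the encoder's semantic features changing slowly between adjacent denoising steps, so they can be cached and reused rather than recomputed. The sketch below shows only that reuse pattern with dummy stand-ins for the condition encoder and velocity decoder; the paper picks which steps to recompute via a statistical dynamic-programming search, whereas this toy simply recomputes every other step.

```python
# Illustrative-only sketch of reusing the encoder output across adjacent denoising steps.
# `encode` and `decode` are dummy stand-ins, not DDT's modules.
import numpy as np

def encode(x, t):
    # stand-in for the condition encoder (extracts "semantic" features)
    return np.tanh(x).mean(axis=-1, keepdims=True) * (1.0 - t)

def decode(x, cond, t):
    # stand-in for the velocity decoder (one denoising update)
    return x - 0.1 * (x - cond)

def sample(x, steps=50, reuse_every=2):
    cond = None
    for i, t in enumerate(np.linspace(1.0, 0.0, steps)):
        if cond is None or i % reuse_every == 0:
            cond = encode(x, t)      # recompute the condition only on selected steps
        x = decode(x, cond, t)       # the decoder still runs at every step
    return x

print(sample(np.random.randn(4, 8)).shape)   # (4, 8)
```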


Abstract

Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on the state-of-the-art diffusion transformer FLUX.1-dev by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.

Paper: https://arxiv.org/abs/2504.16080

Code: https://github.com/Diffusion-CoT/ReflectionFlow

GenRef: https://huggingface.co/collections/diffusion-cot/reflectionflow-release-6803e14352b1b13a16aeda44

Project Page: https://diffusion-cot.github.io/reflection2perfection/
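
Reflection-level scaling is, at its core, an iterative generate-reflect-refine loop. The sketch below shows only that control flow; `generate`, `reflect`, and `score` are dummy placeholders rather than the ReflectionFlow API.

```python
# Control-flow sketch of reflection-level scaling: generate -> reflect -> refine.
# All three helpers are dummy placeholders, not the ReflectionFlow implementation.
import random

def generate(prompt, feedback=None):
    # stand-in for the diffusion model, optionally conditioned on a reflection
    return f"image({prompt!r}, fixes={feedback!r})"

def reflect(prompt, image):
    # stand-in for the reflection model that critiques the last result
    return f"add details missing from {image}"

def score(prompt, image):
    # stand-in for a verifier that rates prompt-image alignment
    return random.random()

def reflection_loop(prompt, rounds=3):
    best_image, best_score, feedback = None, -1.0, None
    for _ in range(rounds):
        image = generate(prompt, feedback)   # refine using the previous critique
        s = score(prompt, image)
        if s > best_score:
            best_image, best_score = image, s
        feedback = reflect(prompt, image)    # actionable reflection for the next round
    return best_image

print(reflection_loop("a cat reading a newspaper"))
```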


This was bound to happen. No centralized VC-backed company could stay uncensored like that forever.

CivitAI has already gathered all the growth it could; now it has to focus on making a profit, and therefore on pleasing credit card companies and advertisers.

Terms of Service Update

We’ve updated Section 9.6 of our Terms of Service (ToS) to explicitly prohibit content depicting:

  • Incest
  • Self-harm, including depictions of anorexia or bulimia
  • Content that promotes hate, harm, or extremist ideologies
  • Bodily excretions and related fetishes, including:
    • Urine
    • Vomit
    • Menstruation
    • Diapers

And to:

  • Require all mature content (X, XXX) to include generation metadata. You'll be prompted to add metadata when uploading new content.

Additionally, the following content depicted in any mature or suggestive context (X, XXX) is explicitly prohibited:

  • Firearms aimed at or pointed towards individuals.
  • Mind-altered states, including being drunk, drugged, under hypnosis, or mind control.
  • Depiction of illegal substances or regulated products (e.g. narcotics, pharmaceuticals).
