@jonlorraine.bsky.social
Research scientist @NVIDIA | PhD in machine learning @UofT. Previously @Google / @MetaAI. Opinions are my own.
Apply here: nvidia.eightfold.ai/careers?star...
I'm personally interested in multimodal generation and the tools that power it.
New NVIDIA Spatial Intelligence Lab internship postings for 2026.
Come work with us to advance foundational technologies that enable AI systems to model and interact meaningfully with the world!
Topics on our homepage: research.nvidia.com/labs/sil/
Application link below
Join us at #CVPR2025 for a preview of this #NVIDIA tech during a live-coding session. A #GPU back end will be reserved for all attending, so don't forget to bring your laptop for some hands-on fun!
Wed, Jun 11, 8am-noon, or join in at 10:20 after the break. tinyurl.com/nv-kaolin-cv...
We find a new set of use cases for Stable Audio Open ( @jordiponsdotme.bsky.social, @stabilityai.bsky.social, @hf.co) and other large pretrained audio generative models, like AudioLDM and beyond!
Our work is inspired by and builds on the SDS update of DreamFusion (dreamfusion3d.github.io/, @benmpoole.bsky.social, @ajayjain9.bsky.social, @jonbarron.bsky.social), and related updates (VSD, SDI @vincentsitzmann.bsky.social, SJC, many more!)
SDS treats any differentiable parameter set as optimizable from a prompt. Source-guided separation emerged when we brainstormed novel uses. We hope for similarly practical tasks to surface (e.g., automatic Foley layering?) as the community experiments.
Vision of the Future: Content designers easily use one video + audio diffusion backbone with SDS-style updates to nudge any differentiable task (impacts, lighting, cloth, fluids) until the joint model says "looks & sounds right", given powerful user controls, like text.
⚠️ Limitations ⚠️
Clip-Length Budget: We optimized on ≤10 s clips; minute-scale audio may have artifacts or blow up memory. A hierarchical/windowed Audio-SDS could help here.
⚠️ Limitations ⚠️
Audio-Model Bias: We rely on Stable Audio Open, so where it struggles (e.g., rare instruments, speech, audio without silence at the end, or out-of-domain SFX), our method can struggle too. Swapping in other diffusion models can help here.
This project was led by the great work of @jrichterpowell.bsky.social along with Antonio Torralba.
See more work from the NVIDIA Spatial Intelligence Lab: research.nvidia.com/labs/toronto...
Work supported indirectly by MIT CSAIL, @vectorinstitute.ai
#nvidia #mit
Results on Prompt-Guided Source Separation:
We report improved SDR against ground-truth sources, when available, and show improved CLAP scores after training.
Results on Tuning FM Synthesizers & Impact Synthesis:
We show CLAP scores improving over training for the target prompts, along with qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
Results on Fully-Automatic In-the-Wild Source Separation:
We demonstrate a pipeline that takes a video from the internet, captions the audio with a model (like AudioCaps), and provides that caption to an LLM assistant, which suggests source decompositions. We then run our method on the suggested decompositions.
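For concreteness, here is a rough sketch of that pipeline's control flow; `caption_audio`, `ask_llm`, and `run_audio_sds` are hypothetical stand-ins for the audio captioner, the LLM assistant, and the prompt-guided separation described above, not functions from our code:

```python
def auto_separate(video_path, caption_audio, ask_llm, run_audio_sds):
    """Hedged outline of the fully-automatic pipeline: caption the audio,
    ask an LLM for a plausible source decomposition, then separate."""
    caption = caption_audio(video_path)          # e.g. "jazz music on a busy street"
    source_prompts = ask_llm(
        f"List the distinct sound sources in: {caption}"
    )                                            # e.g. ["saxophone melody", "passing cars"]
    return run_audio_sds(video_path, prompts=source_prompts)
```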
Modifications to SDS for Audio Diffusion:
We use (a) an augmented Decoder-SDS in audio space, (b) a spectrogram emphasis to better weight transients, and (c) multiple denoising steps to increase fidelity.
This image highlights these in red in the detailed overview of our update.
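As a rough illustration of point (b), here is one way a spectrogram-emphasis weighting could look; the STFT settings and weighting form are assumptions, not the paper's exact formulation:

```python
import torch

def spectrogram_emphasis_loss(audio, target, n_fft=1024, hop=256, alpha=1.0):
    """Illustrative audio-space loss that upweights time-frequency bins where
    the target has strong energy, so short transients are not drowned out by
    long quiet stretches. The paper's exact weighting may differ."""
    window = torch.hann_window(n_fft)
    S = torch.stft(audio, n_fft, hop, window=window, return_complex=True).abs()
    T = torch.stft(target, n_fft, hop, window=window, return_complex=True).abs()
    weight = 1.0 + alpha * T / (T.amax() + 1e-8)   # emphasize loud target bins
    return (weight * (S - T) ** 2).mean()
```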
③ Prompt-Guided Source Separation:
Prompt-conditioned source separation of a given audio recording, such as separating a "sax …" and "cars …" from a music recording on a road, by applying the Audio-SDS update to each channel while forcing the sum of channels to reconstruct the original audio.
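A minimal sketch of what that objective could look like, assuming an `audio_sds_loss` stand-in that wraps the prompt-conditioned diffusion guidance (not our exact implementation):

```python
import torch

def separation_step(sources, mix, prompts, audio_sds_loss, optimizer,
                    recon_weight=10.0):
    """Schematic objective for prompt-guided separation: each learnable source
    channel is pulled toward its own text prompt by an SDS-style loss, while
    the sum of channels is constrained to reconstruct the recorded mixture."""
    loss = recon_weight * ((torch.stack(sources).sum(dim=0) - mix) ** 2).mean()
    for source, prompt in zip(sources, prompts):
        loss = loss + audio_sds_loss(source, prompt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```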
② Physical Impact Synthesis:
We generate impacts consistent with prompts like "hitting pot with wooden spoon" by convolving an impact with a learned object impulse and a learned reverb impulse. We learn parameterized forms of both impulses.
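A minimal sketch of one possible differentiable impact renderer in this spirit; the modal-style parameterization and names below are illustrative assumptions, not the paper's exact forms:

```python
import torch

def render_impact(freqs, decays, amps, reverb_decay, sr=16000, dur=1.0):
    """Sketch: the object's response is a sum of exponentially decaying
    sinusoids, convolved with a toy exponentially decaying reverb tail."""
    t = torch.arange(int(sr * dur)) / sr
    modes = (amps[:, None] * torch.exp(-decays[:, None] * t)
             * torch.sin(2 * torch.pi * freqs[:, None] * t))
    obj = modes.sum(dim=0)                      # learned object impulse response
    reverb = torch.exp(-reverb_decay * t)       # learned reverb impulse response
    n = obj.numel() + reverb.numel() - 1        # linear-convolution length
    out = torch.fft.irfft(torch.fft.rfft(obj, n) * torch.fft.rfft(reverb, n), n)
    return out / (out.abs().max() + 1e-8)       # normalize the rendered impact
```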
① FM Synthesis:
A toy setup where we generate settings that align with prompts like "kick drum, bass, reverb", using sine oscillators modulating each other's frequency, as in a synthesizer.
We visualize the final optimized parameters as the dial settings on a synthesizer instrument's user interface.
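For intuition, a toy two-operator FM voice could look like the sketch below (names and the envelope are illustrative assumptions, not our exact patch). Each argument can be a torch.nn.Parameter, so an SDS-style update can optimize the patch directly from a prompt.

```python
import torch

def render_fm(carrier_freq, mod_freq, mod_index, amp_decay, sr=16000, dur=0.5):
    """Toy FM voice: a modulator sine wobbles the carrier's phase, and an
    exponential envelope shapes the amplitude. Real patches chain more operators."""
    t = torch.arange(int(sr * dur)) / sr
    modulator = torch.sin(2 * torch.pi * mod_freq * t)
    carrier = torch.sin(2 * torch.pi * carrier_freq * t + mod_index * modulator)
    return torch.exp(-amp_decay * t) * carrier
```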
We propose three novel audio tasks: ① FM Synthesis, ② Physical Impact Synthesis, and ③ Prompt-Guided Source Separation.
This image briefly summarizes the use case, optimizable parameters, rendering function, and parameter update.
Intuitively, our update finds a direction to move the audio to increase its probability given the prompt, by noising and denoising with our diffusion model, then "nudging" our audio towards it by propagating the update through our differentiable rendering to our audio parameters.
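Here is a minimal, schematic sketch of one such update step (not the exact Audio-SDS recipe); `render`, `encode`, and `denoise` stand in for the differentiable renderer, the audio VAE encoder, and the prompt-conditioned diffusion denoiser:

```python
import torch

def sds_style_step(params, render, encode, denoise, prompt_emb, optimizer, sigma=0.5):
    """One SDS-style update: render audio, noise its latent, ask the
    prompt-conditioned model for a cleaner version, and nudge the render
    toward that target by backpropagating through the renderer's params."""
    audio = render(params)                     # differentiable audio render
    z = encode(audio)                          # latent of the rendered audio
    z_noisy = z + sigma * torch.randn_like(z)  # simplified forward noising
    with torch.no_grad():
        z_target = denoise(z_noisy, sigma, prompt_emb)  # prompt-guided estimate
    loss = 0.5 * ((z - z_target) ** 2).sum()   # pull the latent toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```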
New NVIDIA paper: Audio-SDS
We repurpose Score Distillation Sampling (SDS) for audio, turning any pretrained audio diffusion model into a tool for diverse tasks, including source separation, impact synthesis & more.
🎧 Demos, audio examples, paper: research.nvidia.com/labs/toronto...
🧵 below
What if you could control the weather in any video, just like applying a filter?
Meet WeatherWeaver, a video model for controllable synthesis and removal of diverse weather effects, such as rain, snow, fog, and clouds, for any input video.
Announcing meshgen (AI in Blender) Update 0.6
Now supports remote backends via litellm, e.g. Hugging Face, ollama (see the example call below)
UI/UX overhaul
This lays the foundation for Blender agents and more advanced 3D AI models (coming this year)
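For anyone curious what "remote backend via litellm" means in practice: litellm exposes a single completion() call that routes to different providers by model string. The model name and prompt below are examples only, not meshgen's actual defaults.

```python
from litellm import completion

# Example only: route a request to a local ollama server; swapping the model
# string (e.g. to a Hugging Face endpoint) switches providers without code changes.
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Suggest a low-poly chair mesh."}],
    api_base="http://localhost:11434",   # ollama's default local address
)
print(response.choices[0].message.content)
```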
New dataset of 50k+ low-poly OBJ meshes
Objaverse subsampled to <500-polygon models, converted to untextured OBJs
Suitable for training autoregressive transformer-based 3D models with limited context lengths, such as LLaMA-Mesh
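A rough sketch of the kind of filtering/conversion described above; the 500-face budget matches the post, but this trimesh-based snippet is an assumption, not the dataset's actual pipeline:

```python
import trimesh

def to_low_poly_obj(src_path, dst_path, max_faces=500):
    """Keep only meshes under a face budget and export them as untextured OBJ."""
    mesh = trimesh.load(src_path, force="mesh")      # flatten the scene into one mesh
    if mesh.faces.shape[0] >= max_faces:
        return False                                 # too detailed for this dataset
    mesh.visual = trimesh.visual.ColorVisuals(mesh)  # drop textures/materials
    mesh.export(dst_path, file_type="obj")
    return True
```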
Here's a recap of what happened in AI 3D in 2024
We envision a future where LLMs are universal generative tools capable of seamlessly producing content across multiple modalities, including text, images, videos, and 3D structures.
Integrating 3D mesh generation into LLMs opens exciting possibilities for interactive design. Users can converse with a model to create and manipulate 3D objects in real time.
We're excited to scale LLaMA-Mesh to handle more complex and detailed meshes by extending context lengths. Integrating textures and physical properties, exploring larger base models, part-based generation, and enabling dynamic generation are interesting ways forward!
Due to context length constraints, we're currently limited to meshes with up to 500 faces. We generate one 3D object per dialog due to our fine-tuning dataset construction. We see a slight degradation in language ability, perhaps due to using UltraChat in fine-tuning.
This project was led by Zhengyi Wang with Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, and Xiaohui Zeng.
See more work from the #NVIDIA Toronto AI Lab here: research.nvidia.com/labs/toronto...
Work supported by Tsinghua University, @vectorinst.bsky.social, @uoft.bsky.social #UofT #Tsinghua
We generate diverse and high-quality 3D meshes directly from textual prompts without expanding the vocabulary or introducing new tokenizers.
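To make the "no new tokenizer" point concrete, here is a toy illustration of serializing a mesh as plain OBJ-style text with quantized integer coordinates, the kind of representation an LLM can read and write directly; the exact quantization used by LLaMA-Mesh may differ.

```python
def mesh_to_text(vertices, faces, bins=64):
    """Toy OBJ-style serialization: vertices quantized to a small integer grid,
    faces written as 1-indexed triples, everything as ordinary text tokens."""
    coords = [c for v in vertices for c in v]
    lo, hi = min(coords), max(coords)
    scale = (bins - 1) / max(hi - lo, 1e-8)
    lines = []
    for x, y, z in vertices:
        qx, qy, qz = (int(round((c - lo) * scale)) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for a, b, c in faces:                      # OBJ face indices start at 1
        lines.append(f"f {a + 1} {b + 1} {c + 1}")
    return "\n".join(lines)


# e.g. a single triangle:
# print(mesh_to_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)]))
```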