
Cem Koç

@cemkoch.bsky.social

Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad

24 Followers  |  36 Following  |  9 Posts  |  Joined: 17.11.2024

Latest posts by cemkoch.bsky.social on Bluesky

Huge thanks to the amazing people:
@pavankumarvasu.bsky.social, Fartash Faghri, Chun-Liang Li, Hadi Pouransari, @onceltuzel.bsky.social, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Christopher Webb

07.05.2025 22:26 — 👍 0    🔁 0    💬 0    📌 0

Today we released the code and a demo iOS application for FastVLM, our extremely efficient and fast vision language model, which runs on your device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...

07.05.2025 22:20 — 👍 4    🔁 3    💬 1    📌 0

Join us! Registration is required.

simons.berkeley.edu/events/move-...

19.03.2025 03:42 — 👍 5    🔁 1    💬 0    📌 0

If you're looking for research scientist roles in Europe, check out Marco's post! The Paris team is fantastic and does diverse, idea-driven, and impactful research. In addition, MLR is highly collaborative across time zones, so you'd have a chance to work with many others too.

18.12.2024 17:14 — 👍 2    🔁 1    💬 0    📌 0
FastVLM: Efficient Vision Encoding for Vision Language Models Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders su...

For more, check out our paper on arxiv: arxiv.org/abs/2412.13303

With the amazing people: @pavankumarvasu.bsky.social , Fartash Faghri, Chun-Liang Li, Hadi Pouransari, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, and @onceltuzel.bsky.social

19.12.2024 19:22 — 👍 1    🔁 1    💬 0    📌 0

What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀

19.12.2024 19:10 — 👍 0    🔁 0    💬 0    📌 0

We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs) 👇
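For readers unfamiliar with the term: a configuration is Pareto-optimal if no other configuration is both faster and at least as accurate. A minimal sketch of how such a frontier can be extracted (the points below are placeholders, not results from the paper):

```python
# Each point is (latency_ms, accuracy_pct); values are hypothetical.
points = [(10.0, 60.0), (20.0, 65.0), (25.0, 63.0), (40.0, 70.0)]

def pareto_frontier(pts):
    """Keep points not dominated by another point that is faster (or tied) and at least as accurate."""
    return [p for p in pts
            if not any(q != p and q[0] <= p[0] and q[1] >= p[1] for q in pts)]

print(pareto_frontier(points))  # (25.0, 63.0) is dominated by (20.0, 65.0)
```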

19.12.2024 18:58 — 👍 0    🔁 0    💬 0    📌 0

Text-rich tasks require high image resolutions, which increase both the vision encoding latency and the number of image tokens, which in turn leads to higher LLM pre-filling time. Therefore, instead of an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
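To make the token-count argument concrete, here is a back-of-the-envelope sketch with my own illustrative numbers (not figures from the paper): an isotropic ViT with patch size 14 emits (res/14)² tokens, so token count grows quadratically with resolution, while a hybrid backbone with a larger assumed output stride emits far fewer tokens at the same resolution.

```python
# Back-of-the-envelope image-token counts vs. input resolution.
# Assumptions (illustrative only): an isotropic ViT with patch size 14,
# and a hybrid backbone with an overall output stride of 64.
PATCH_SIZE = 14   # assumed isotropic ViT patch size
STRIDE = 64       # assumed hybrid backbone output stride

for res in (256, 512, 768, 1024):
    vit_tokens = (res // PATCH_SIZE) ** 2
    hybrid_tokens = (res // STRIDE) ** 2
    print(f"{res:>4}px  isotropic ViT: {vit_tokens:>5} tokens  hybrid: {hybrid_tokens:>3} tokens")
```

Fewer image tokens means less work for both the vision encoder and the LLM's pre-fill, which is exactly the lever the hybrid backbone pulls.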

19.12.2024 18:50 — 👍 0    🔁 0    💬 0    📌 0

We measure time-to-first-token (TTFT) as the wait time until the first token of the VLM's response: the vision encoder latency plus the LLM pre-filling time (the time it takes the LLM to fill the KV cache and output its first token). At high resolutions, the vision encoder latency dominates.
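A minimal additive model of TTFT, assuming vision encoding and pre-filling run sequentially and pre-fill throughput is roughly constant; all numbers below are made up for illustration, not measurements:

```python
def ttft_s(vision_latency_s: float, num_image_tokens: int,
           prefill_tokens_per_s: float) -> float:
    """TTFT = vision encoder latency + LLM pre-fill time (simple sequential model)."""
    return vision_latency_s + num_image_tokens / prefill_tokens_per_s

# Hypothetical slow high-resolution encoder vs. a fast one: the encoder term dominates.
print(ttft_s(vision_latency_s=0.90, num_image_tokens=5000, prefill_tokens_per_s=20_000))  # 1.15
print(ttft_s(vision_latency_s=0.05, num_image_tokens=256,  prefill_tokens_per_s=20_000))  # ~0.06
```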

19.12.2024 18:42 — 👍 0    🔁 0    💬 0    📌 0

FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce encoding time for high-resolution images.

19.12.2024 18:34 — 👍 0    🔁 0    💬 0    📌 0

Excited about vision-language models? 🚀 Check out our latest work on FastVLM, a new family of efficient vision-language models that balances the tradeoff between high-resolution image understanding and latency without compromising accuracy!

arxiv.org/abs/2412.13303

19.12.2024 18:18 — 👍 1    🔁 1    💬 6    📌 0
WVD Pipeline

🤔Image-to-3D, monocular depth estimation, camera pose estimation, …, can we achieve all of this with just ONE model easily?

🚀Our answer is Yes -- Excited to introduce our latest work: World-consistent Video Diffusion (WVD) with Explicit 3D Modeling!

arxiv.org/abs/2412.01821

04.12.2024 13:41 — 👍 14    🔁 6    💬 1    📌 0

𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...

22.11.2024 08:32 — 👍 59    🔁 19    💬 3    📌 1

Looking for an alternative to RAG for personalization?

With PLUM, a pipeline for teaching LLMs to remember prior user conversations, we aim to enable your future personalization research! Joint work with @maartjeterhoeve.bsky.social, Katherine Metcalf and Yizhe Zhang from my internship at Apple.

🧵

21.11.2024 18:03 — 👍 10    🔁 2    💬 1    📌 2
