Piotr Skalski

@skalskip92.bsky.social

Open-source Lead @roboflow. VLMs. GPU poor. Dog person. Coffee addict. Dyslexic. | GH: https://github.com/SkalskiP | HF: https://huggingface.co/SkalskiP

630 Followers  |  155 Following  |  28 Posts  |  Joined: 19.11.2024

Latest posts by skalskip92.bsky.social on Bluesky

that's all the code you need to run detection and tracking
"how to track objects with SORT tracker" notebook: colab.research.google.com/github/robof...

25.04.2025 13:03 | 👍 2 | 🔁 0 | 💬 0 | 📌 0

it's built on top of the supervision package, so you can take advantage of all the tools we already created

25.04.2025 13:03 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

trackers v2.0.0 is out

combine object detectors from top model libraries with the multi-object tracker of your choice

for now we support SORT and DeepSORT; more trackers coming soon
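
To make the "detector + tracker" idea concrete, here is a minimal sketch of the association step a tracker performs on each frame. It is deliberately NOT SORT: SORT adds Kalman-filter motion prediction and Hungarian assignment, while this toy version only does greedy IoU matching. The `GreedyIoUTracker` class and its `update` method are hypothetical names for illustration, not the trackers package API.

```python
from itertools import count

Box = tuple  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Toy tracker (hypothetical): greedy IoU matching, no motion model."""

    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}        # tracker_id -> last seen box
        self._ids = count(1)    # source of fresh tracker ids

    def update(self, detections: list) -> list:
        """Return one tracker id per detection, in input order."""
        assigned = [None] * len(detections)
        used_tracks = set()
        # Score every (track, detection) pair and take the best ones first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break
            if tid in used_tracks or assigned[i] is not None:
                continue
            assigned[i] = tid
            used_tracks.add(tid)
        # Unmatched detections start new tracks; unmatched tracks are dropped.
        for i in range(len(detections)):
            if assigned[i] is None:
                assigned[i] = next(self._ids)
        self.tracks = dict(zip(assigned, detections))
        return assigned
```

In a real pipeline the `detections` list would come from whatever object detector you run on each frame; the tracker only needs boxes.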

link: github.com/roboflow/tra...

25.04.2025 13:03 | 👍 3 | 🔁 0 | 💬 1 | 📌 0

object detection example project: bsky.app/profile/skal...

11.12.2024 16:58 | 👍 2 | 🔁 0 | 💬 0 | 📌 0

image to JSON example project: bsky.app/profile/skal...

11.12.2024 16:58 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

you need to prepare your dataset in JSONL format; dataset includes three subsets: train, test, and valid

each subset contains images and an annotations.jsonl file, where each line is a valid JSON object with three keys: image, prefix, and suffix
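
A minimal sketch of writing and validating such a file, using only the standard library. The file names and the prefix/suffix values below are made-up illustrations (the location-token strings follow the style commonly used for PaliGemma detection, which is an assumption here); only the three key names come from the post.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"image", "prefix", "suffix"}


def write_annotations(path: Path, records: list) -> None:
    """Write one JSON object per line (JSONL)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_annotations(path: Path) -> list:
    """Read a JSONL file and check every line has the three expected keys."""
    records = []
    with path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            records.append(record)
    return records


# Hypothetical example records; image names and label strings are illustrative.
records = [
    {"image": "0001.jpg", "prefix": "detect cat ; dog",
     "suffix": "<loc0256><loc0128><loc0768><loc0896> cat"},
    {"image": "0002.jpg", "prefix": "detect cat ; dog",
     "suffix": "<loc0100><loc0200><loc0500><loc0600> dog"},
]

with tempfile.TemporaryDirectory() as root:
    train_dir = Path(root) / "train"  # same layout applies to test and valid
    train_dir.mkdir()
    annotations_path = train_dir / "annotations.jsonl"
    write_annotations(annotations_path, records)
    loaded = read_annotations(annotations_path)
```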

11.12.2024 16:58 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

to limit the memory (VRAM) usage during the training, we can use LoRA, QLoRA, or freeze parts of the graph
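
A rough back-of-the-envelope sketch of why LoRA helps: instead of updating a full `d_out x d_in` weight matrix, it trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer size (2048) and rank (8) below are hypothetical numbers picked for illustration, not values from the post or from PaliGemma 2.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters updated with LoRA: factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer and rank, for illustration only.
d_in = d_out = 2048
rank = 8

full = full_trainable_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%} of full)")
```

Fewer trainable parameters means fewer gradients and optimizer states to keep in VRAM; QLoRA additionally stores the frozen base weights in a quantized format.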

11.12.2024 16:58 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

fine-tuning large vision-language models like PaliGemma 2 can be resource-intensive. to put this into perspective, the largest variant of the recent YOLOv11 object detection model (YOLOv11x) has 56.9M parameters. in contrast, PaliGemma 2 models range from 3B to 28B parameters.
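
To put those numbers side by side, here is a small arithmetic sketch. The 2-bytes-per-parameter figure assumes weights stored in bf16/fp16, which is an assumption for illustration; it counts weights only and ignores gradients, optimizer state, and activations.

```python
YOLO11X_PARAMS = 56.9e6           # from the post
PG2_PARAMS = {"3B": 3e9, "28B": 28e9}
BYTES_PER_PARAM = 2               # assumption: bf16/fp16 weights


def size_ratio(params: float, baseline: float = YOLO11X_PARAMS) -> float:
    """How many times larger a model is than the baseline detector."""
    return params / baseline


def weight_memory_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for name, params in PG2_PARAMS.items():
    print(f"PaliGemma 2 {name}: ~{size_ratio(params):.0f}x YOLOv11x, "
          f"~{weight_memory_gb(params):.0f} GB of weights")
```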

11.12.2024 16:58 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

PG2 offers 9 pre-trained models with sizes of 3B, 10B, and 28B parameters and resolutions of 224, 448, and 896 pixels.

to pick the right variant, you need to take into account the vision-language task you are solving, the available hardware, the amount of data, and the inference speed you need
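
The 9 variants are just the 3 sizes crossed with the 3 resolutions. The sketch below enumerates them; the naming pattern is extrapolated from the one checkpoint named in these posts (google/paligemma2-3b-pt-448), so treat the other generated names as an assumption to verify before downloading.

```python
from itertools import product

SIZES = ["3b", "10b", "28b"]
RESOLUTIONS = [224, 448, 896]


def checkpoint_name(size: str, resolution: int) -> str:
    # Assumed pattern, based on "google/paligemma2-3b-pt-448" from the posts.
    return f"google/paligemma2-{size}-pt-{resolution}"


variants = [checkpoint_name(s, r) for s, r in product(SIZES, RESOLUTIONS)]

for name in variants:
    print(name)
```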

11.12.2024 16:58 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

PG2 combines a SigLIP-So400m vision encoder with a Gemma 2 language model to process images and text. the vision encoder turns the image into image tokens, which are then linearly projected and combined with the input text tokens. the Gemma 2 language model processes these combined tokens and generates output text tokens.
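
One practical consequence of this design is that the number of image tokens grows with input resolution. The sketch below assumes the vision encoder splits the image into 14x14-pixel patches and emits one token per patch; the patch size is not stated in the post and is taken as an assumption here.

```python
PATCH_SIZE = 14  # assumption: SigLIP-So400m with 14x14 pixel patches


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Tokens produced for a square image: one per non-overlapping patch."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2


def sequence_length(resolution: int, text_tokens: int) -> int:
    """Length of the combined sequence fed to the language model."""
    return image_token_count(resolution) + text_tokens


for resolution in (224, 448, 896):
    print(resolution, "px ->", image_token_count(resolution), "image tokens")
```

Going from 224 to 448 pixels quadruples the image tokens, which is part of why higher-resolution variants cost more memory and time.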

11.12.2024 16:58 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

new blog post is out: how to fine-tune PaliGemma 2

all I learned in a single blog

- PaliGemma 2 architecture
- dataset annotation and structure
- picking the right checkpoint
- memory optimization
- hyperparameter tuning

link: blog.roboflow.com/fine-tune-pa...

11.12.2024 16:58 | 👍 6 | 🔁 1 | 💬 1 | 📌 0

the paper suggests some nice strategies to increase the model's detection accuracy using fake boxes and <noise> special token; I plan to explore those in the coming days.

08.12.2024 16:26 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

we can see that PaliGemma2's object detection performance depends more on input resolution than model size. 3B 448 seems like a sweet spot.

08.12.2024 16:26 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

PG2 performs worse on the object detection task than specialized detectors; you can easily train a YOLOv11 model with 0.9 mAP on this dataset.

compared to PG1, it performs much better; datasets with a large number of classes were hard to fine-tune with the previous version

08.12.2024 16:26 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

PaliGemma2 for object detection on custom dataset

- used google/paligemma2-3b-pt-448 checkpoint
- trained on A100 with 40GB VRAM
- 1h of training
- 0.62 mAP on the validation set

colab with complete fine-tuning code: colab.research.google.com/github/robof...

08.12.2024 16:26 | 👍 5 | 🔁 0 | 💬 2 | 📌 1

also take into account that Gemini and Gemma are 2 different models; Gemma is a lot smaller, open-source and can run locally

08.12.2024 15:03 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

totally agree; it's not perfect! but
- there are still a lot bigger versions of the model, both in terms of parameters and input resolution
- I only trained it for 1 hour

08.12.2024 15:02 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

a multimodal dataset I used to fine-tune the model.

link: universe.roboflow.com/roboflow-jvu...

06.12.2024 16:18 | 👍 2 | 🔁 0 | 💬 0 | 📌 0

it looks like OCR-related metrics ST-VQA, TallyQA, and TextCaps... benefit more from increased resolution than model size. that's why I went from 224 to 448.

06.12.2024 16:18 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

PaliGemma2 for image to JSON data extraction

- used google/paligemma2-3b-pt-448 checkpoint; I tried to make it happen with 224, but 448 performed a lot better
- trained on A100 with 40GB VRAM
- trained with LoRA

colab with complete fine-tuning code: colab.research.google.com/github/robof...

06.12.2024 16:18 | 👍 14 | 🔁 2 | 💬 1 | 📌 1

how to prevent this in open-source projects?

- never allow github actions from first-time contributors.
- always require review for new contributors.
- never run important actions automatically via bots.
- protect release actions with unique cases and selected actors.

05.12.2024 21:21 | 👍 4 | 🔁 0 | 💬 0 | 📌 0

what happened?

malicious code was injected into the pypi deployment workflow (github action).

the source code itself wasn't infected. however, the resulting tar/wheel files were corrupted during the build process.

05.12.2024 21:21 | 👍 3 | 🔁 0 | 💬 1 | 📌 0

popular computer vision package ultralytics (home of yolov8 and yolo11) was compromised.

a crypto miner was injected into versions 8.3.41 and 8.3.42.
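
A quick way to check a local environment against the two versions named above, using only the standard library. This is a minimal sketch: it only knows about the versions listed in this post, so it is not a complete security audit.

```python
from importlib import metadata

# Versions named in the post as containing the injected crypto miner.
COMPROMISED_VERSIONS = {"8.3.41", "8.3.42"}


def is_compromised(version: str) -> bool:
    """True if the given ultralytics version is one of the flagged releases."""
    return version.strip() in COMPROMISED_VERSIONS


def installed_ultralytics_status() -> str:
    """Describe the ultralytics install in the current environment."""
    try:
        version = metadata.version("ultralytics")
    except metadata.PackageNotFoundError:
        return "ultralytics is not installed"
    if is_compromised(version):
        return f"ultralytics {version} is a flagged release: uninstall or change version"
    return f"ultralytics {version} is not one of the flagged releases"


print(installed_ultralytics_status())
```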

link: github.com/ultralytics/...

05.12.2024 21:21 | 👍 12 | 🔁 4 | 💬 1 | 📌 0

smart parking systems are just the beginning. roboflow workflows can be used for so much more. check out my clothes detection + sam2 + stabilityai inpainting workflow

27.11.2024 17:43 | 👍 3 | 🔁 0 | 💬 0 | 📌 0

custom python blocks in roboflow workflows are powerful. built a telegram bot connector for real-time alerts.
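
The core of such a connector can be sketched with the Telegram Bot API's `sendMessage` method and the standard library. How this gets wrapped into a workflows custom block is not shown here; the function names, token, and chat id below are hypothetical placeholders.

```python
import json
import urllib.request


def build_telegram_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(bot_token: str, chat_id: str, text: str, timeout: float = 10.0) -> dict:
    """Send the alert and return the decoded JSON response from Telegram."""
    request = build_telegram_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network; placeholders are hypothetical.
example = build_telegram_request("<BOT_TOKEN>", "<CHAT_ID>", "car entered the lot")
```

Calling `send_alert` with a real bot token and chat id performs the network request.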

27.11.2024 17:43 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

2 years ago i made a whole tutorial on coding this from scratch. now it's just 2 clicks in workflows.

link to my original line counting tutorial: www.youtube.com/watch?v=OS5q...

27.11.2024 17:43 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

I regret using OpenAI for license plate OCR.

expensive, slow, censors results, and refuses to read plates 20-30% of the time.

open-source models like florence2 are more reliable.

27.11.2024 17:43 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

built a smart parking system with roboflow workflows.

- license plate detection and ocr
- object tracking with bytetrack.
- counting cars entering and leaving the lot.
- real-time alerts via telegram.
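
The counting step can be sketched in a few lines once each car has a stable tracker id. This is a simplified illustration, not the workflows implementation: the `LineCounter` class is hypothetical, the line is treated as infinite, and which side counts as "in" is an arbitrary convention.

```python
Point = tuple  # (x, y)


def side_of_line(start: Point, end: Point, point: Point) -> float:
    """Cross product sign: positive on one side of the line, negative on the other."""
    return (end[0] - start[0]) * (point[1] - start[1]) - (end[1] - start[1]) * (point[0] - start[0])


class LineCounter:
    """Counts tracked objects crossing a line (hypothetical helper)."""

    def __init__(self, start: Point, end: Point):
        self.start = start
        self.end = end
        self.in_count = 0
        self.out_count = 0
        self._last_side = {}  # tracker_id -> True if last seen on the positive side

    def update(self, tracker_id: int, anchor: Point) -> None:
        """Feed the current anchor point (e.g. box center) of one tracked object."""
        value = side_of_line(self.start, self.end, anchor)
        if value == 0:
            return  # exactly on the line: wait for the next frame
        current = value > 0
        previous = self._last_side.get(tracker_id)
        self._last_side[tracker_id] = current
        if previous is None or previous == current:
            return  # first sighting, or no crossing
        if current:
            self.in_count += 1   # negative -> positive side
        else:
            self.out_count += 1  # positive -> negative side
```

Tracker ids matter here: without them, there is no way to tell one car crossing the line from two different cars seen on opposite sides.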

27.11.2024 17:43 | 👍 16 | 🔁 1 | 💬 3 | 📌 0
