v

@vsoch.bsky.social

I’m the Vanessasaurus! 🥑 https://vsoch.github.io

533 Followers  |  26 Following  |  198 Posts  |  Joined: 25.04.2023  |  1.8081

Latest posts by vsoch.bsky.social on Bluesky

Our work is using LLMs for jobspecs and it wouldn’t work to validate that with something else from an LLM. We need the workload manager to do it.

07.08.2025 04:09 | 👍 0    🔁 0    💬 0    📌 0

It would be good if we didn't have to alter the Slurm source code and do custom builds to get the functionality we need. It might come to that (still haven't heard from a Slurm developer) but hopefully it's not that!

07.08.2025 04:00 | 👍 1    🔁 0    💬 1    📌 0

Thanks @chromamagic.com! Yes that’s what we need. Is it possible a Slurm dev knows the final answer and/or could provide guidance for our use case? 🤔

06.08.2025 19:08 | 👍 0    🔁 0    💬 1    📌 0

For folks familiar with #SLURM: we are looking for a way to validate SBATCH directives. Slurm has --test-only, but it validates flags *and* checks against a cluster, and the two seem tied. In #Flux we can validate directives (via a directive parser) separately from resources (feasibility plugin). Ty!

06.08.2025 17:25 | 👍 2    🔁 1    💬 1    📌 0
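
To make the ask in the post above concrete, here is a minimal sketch of checking `#SBATCH` directives without contacting a cluster. It is not Slurm's own validation and not Flux's directive parser: the function names are hypothetical, and the allow-list is only a small illustrative subset of real `sbatch` options. It assumes directives are only read until the first non-comment line of the script.

```python
import re

# Assumption: an illustrative subset of sbatch options, not the full list.
KNOWN_LONG = {"job-name", "nodes", "ntasks", "time", "partition", "output", "cpus-per-task"}
KNOWN_SHORT = {"J", "N", "n", "t", "p", "o", "c"}

DIRECTIVE = re.compile(r"^#SBATCH\s+(.*)$")


def parse_directives(script: str) -> list[str]:
    """Collect option strings from #SBATCH lines before the first command."""
    found = []
    for line in script.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#!"):
            continue  # skip blank lines and the shebang
        match = DIRECTIVE.match(stripped)
        if match:
            found.append(match.group(1).strip())
        elif not stripped.startswith("#"):
            break  # first non-comment line: stop reading directives
    return found


def unknown_flags(directives: list[str]) -> list[str]:
    """Return the flag tokens that are not in the allow-list."""
    bad = []
    for directive in directives:
        token = directive.split()[0]
        if token.startswith("--"):
            name = token[2:].split("=", 1)[0]
            if name not in KNOWN_LONG:
                bad.append(token)
        elif token.startswith("-"):
            if token[1:2] not in KNOWN_SHORT:
                bad.append(token)
        else:
            bad.append(token)  # not a flag at all
    return bad


# Hypothetical job script with one misspelled flag (--tiem).
SCRIPT = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH -N 2
#SBATCH --tiem=00:10:00
srun hostname
#SBATCH --nodes=4
"""

if __name__ == "__main__":
    print(unknown_flags(parse_directives(SCRIPT)))
```

A check like this only catches unknown flag names. Whether the requested resources are satisfiable is a separate question, which is the split between directive parsing and feasibility that the post describes for Flux.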

This was 20 years ago. When we are young we often can't anticipate oncoming darkness. But we also can't anticipate our own resilience. What we come to realize with experience is that we are always in a mixture of light and shadow. It is a choice to not just see, but try to be a source of light. 🕯️

06.08.2025 05:13 | 👍 0    🔁 0    💬 0    📌 0

Yes, I'm going to paint it red! 🔴

Maybe I'll show up for the tutorial too! 😄

05.08.2025 01:59 | 👍 1    🔁 0    💬 1    📌 0

Flux Bird!? WHAT are you doing over there? 🦩

😅

05.08.2025 01:50 | 👍 2    🔁 0    💬 1    📌 0

Our #SC25 tutorial is up! 🎉

sc25.conference-program.com/presentation...

I'm already excited! We will be co-presenting with #AWS and teaching you how to run #HPC workloads using #Kubernetes with the Flux Operator. And a taste of MuMMI, a workload with AI/ML components. Hope to see you there! 🥳

04.08.2025 03:58 | 👍 8    🔁 1    💬 1    📌 0
YouTube video by vsoch: The Last One In the Data Center 😎

"When you're the last one in the data center. "

Me: I will do important, serious work.
But also: youtu.be/OImn6x2VQu8?...

🦩

07.07.2025 05:47 | 👍 2    🔁 0    💬 0    📌 0
YouTube video by vsoch: 😎 Flux Time

Of course the full music video "Flux Time" that was cut short in the live version.

youtu.be/N25GySogBeE?...

Thank you to everyone that attended! Please reach out to any of us with questions. 🦩

01.07.2025 17:34 | 👍 0    🔁 0    💬 0    📌 0
Orchestrating the Future: Ensemble Workloads in the Age of Converged Computing - HPCKP 05/06/2025

Our talk "Ensemble Workloads in the Age of Converged Computing" presented the #FluxFramework Operator, deployment of Flux in different cloud environments, and user-space Kubernetes "Usernetes."

hpckp.org/talks/orches...

01.07.2025 17:34 | 👍 0    🔁 0    💬 1    📌 0
Flux: Next-Generation Workload Management for High Performance Computing and Cloud - HPCKP
Flux is a next-generation workload manager, the primary means to run workloads on the first NNSA exascale system El Capitan that was just announced as the #1 machine on the Top500 list at the Superco...

If you missed the #FluxFramework workshop and talks at #HPCKP, they are online!

🥑 Flux Framework Workshop: hpckp.org/talks/flux-n...

The workshop includes an introduction to Flux, a talk on "Flux Environments," a hands-on tutorial (actually a container adventure), music video and Jeopardy! 🌀

01.07.2025 17:34 | 👍 0    🔁 1    💬 1    📌 0
Abstract: The rise of AI and the economic dominance of cloud computing have created a new nexus of innovation for high performance computing (HPC). In addition to performance needs, scientific workflows increasingly demand capabilities of cloud environments: portability, reproducibility, dynamism, and automation. Geopolitical changes that lead to cuts in scientific spending paired with resource contention introduce the new reality that portability is a new metric of performance. A strategy planning for flexible movement requires understanding of the strengths and weaknesses of different converged environments for the needs of HPC. In this talk I will present a cross-cloud usability study that assessed 11 different HPC proxy applications and benchmarks across three clouds (Microsoft Azure, Amazon Web Services, and Google Cloud), six environments, and two compute configurations (CPU and GPU), performing scaling tests of applications in all environments up to 28,672 CPUs and 256 GPUs. I will present insights for usability, work needed, and lessons learned from such an ambitious undertaking, and hope to inspire discussion about future vision for orchestration of HPC applications in cloud.

Please join us this Tuesday, July 1st, at 9am Pacific to learn about my team's work on "Cloud Usability for #HPC Applications" hosted by the #CASS software stewardship organization. Please message or email me for the calendar invite. Hope to see you there!

29.06.2025 18:25 | 👍 4    🔁 1    💬 0    📌 0

For the first time - user-space Kubernetes running under a Flux allocation on a production cluster. This is OSU and LAMMPS. This has been months of work and persistence. We got this working on an old kernel, and hugely strict security policy. Experiments and more details coming soon! 🥳

27.06.2025 01:20 | 👍 1    🔁 0    💬 0    📌 0

Regardless of speed of delivery, I'd say a better tactic is to reduce or just say no to more meetings. If you can drop with minimal impact to others, then either you don't need to be there in the first place, or the meeting was not necessary.

26.06.2025 14:14 | 👍 0    🔁 0    💬 1    📌 0

No, because then you can't participate, and it is assuming that time later is less valuable than time in the current moment. Later me would rather have participated in the meeting and be outside biking or running.

26.06.2025 14:14 | 👍 0    🔁 0    💬 1    📌 0
Request for Information–Future Generation High Performance Computing Center | HPC @ LLNL
This website enables public access to Request for Information No. HPC-007 (RFI) pertaining to a Future Generation High Performance Computing Center. The RFI points of contact are LLNS Contract Analyst...

We’ve got a request for information out on where we want to take Livermore Computing and other #HPC centers in the next five years.

hpc.llnl.gov/fg-hpcc-rfi

Check it out and send us your thoughts.

25.06.2025 23:21 | 👍 14    🔁 8    💬 2    📌 0

Science is increasingly using AI/ML paired nicely with traditional simulation. With this setup, your simulations can run on bare metal and interact with a model, database, or message queue via a service. All in a job. With bypass mechanisms we can even get around slirp4netns.

25.06.2025 08:22 | 👍 1    🔁 0    💬 1    📌 0
A Contention-Free Model for Converged Kubernetes on HPC
High performance computing (HPC) and cloud have traditionally been separate, and presented in an adversarial light. The conflict arises from disparate beginnings that led to two drastically different ...

We don’t want to ditch by any means. They work well together. We’ve deployed userspace Kubernetes with Flux on 3 clouds and our on-premises setup is underway! The entire K8s cluster comes up and is torn down in the lifecycle of a job.

arxiv.org/abs/2406.06995

This is converged computing.

25.06.2025 08:14 | 👍 1    🔁 0    💬 1    📌 0
Image Compatibility In Cloud Native Environments
In industries where systems must run very reliably and meet strict performance criteria such as telecommunication, high-performance or AI computing, containerized applications often need specific oper...

Our post on compatibility using #OCI artifacts in the #Kubernetes blog is hot off the press! 📰

kubernetes.io/blog/2025/06...

We are working on adding an exporter to #NFD for #HPC use cases (github.com/kubernetes-s...) and planning experiments. If anyone has ideas, please share in the thread! 👇

25.06.2025 06:13 | 👍 5    🔁 1    💬 0    📌 1

For those who missed the #ISC25 Flux Framework Tutorial, we just posted our slides online:

github.com/flux-framewo...

Thank you to those that attended, and see you next time! 👋

18.06.2025 01:29 | 👍 1    🔁 2    💬 0    📌 0
YouTube video by vsoch: 🏔️ 33 Miles

The biggest lie I tell myself... "Just a little further..."

💙💚

youtu.be/7m9mkqpSzXM?...

16.06.2025 03:15 | 👍 0    🔁 0    💬 0    📌 0

These running socks and leggings are channeling #FluxBird! 🦩

14.06.2025 22:07 | 👍 0    🔁 0    💬 0    📌 0

There are different goals and incentive structures. In #HPC, my sense is that we are leveraging #AI for research and scientific discovery. Industry is interested in products that lead to profit. Instead of ownership, maybe a more interesting question is: How do we best work together? 🤔

14.06.2025 06:56 | 👍 1    🔁 0    💬 1    📌 0

I don't mean to dampen your excitement. I think we can be excited about #AI and what it is doing for #HPC, and the innovations & scientific models we are contributing. But I also think we should be respectful that it is a fully fledged community in its own right, and not try to squash it under HPC.

14.06.2025 06:56 | 👍 1    🔁 0    💬 1    📌 0

We want to lead, so we try to minimize AI as "just a part of HPC" or claim it "is HPC" to ameliorate that. The building of LLMs (that we use) is being done by cloud hyperscalers and AI companies. They have the resources, and (personally knowing a lot of their engineers) the talent.

14.06.2025 06:56 | 👍 0    🔁 0    💬 2    📌 0

One of the hosts of the HPC Podcast says something similar, and I respectfully disagree @thoefler.bsky.social. While #HPC labs were involved with championing of GPUs for scientific computing, GPUs != AI. When I hear this statement it hints that we are sensitive to not leading the innovation space.

14.06.2025 06:56 | 👍 2    🔁 0    💬 2    📌 0
Infrastructure Engineering: A Missing, Undervalued Role in the Software Ecosystem
Research has become increasingly reliant on software, serving as the driving force behind bioinformatics, high performance computing, physics, machine learning and artificial intelligence, to name a f...

Interesting! I think this was something I saw coming. 👀

At least this area of work has been my passion for a long time, but there doesn't seem to be an area carved out for it in most research institutions.

14.06.2025 06:11 | 👍 1    🔁 0    💬 1    📌 0

What's coming next? Along with continued work on all of the above, the next item of interest is automated compatibility assessment via descriptive metadata or #OCI artifacts.

bsky.app/profile/vsoc...

I hope everyone had a wonderful week, whether you attended a conference or not! 😘

13.06.2025 18:30 | 👍 0    🔁 0    💬 0    📌 0

For the last taste of current work, we talk about running user-space Kubernetes alongside Flux, a project we call "The Bare Metal Bros." Although slirp4netns adds network overhead, when we use bypass mechanisms (InfiniBand and EFA) we get close to equivalent performance.

13.06.2025 18:30 | 👍 1    🔁 0    💬 1    📌 0
