Our work is using LLMs for jobspecs, and it wouldn't work to validate that with something else from an LLM. We need the workload manager to do it.
07.08.2025 04:09

It would be good if we didn't have to alter the Slurm source code and do custom builds to get the functionality we need. It might come to that (we still haven't heard from a Slurm developer), but hopefully it's not that!
07.08.2025 04:00

Thanks @chromamagic.com! Yes, that's what we need. Is it possible a Slurm dev knows the final answer and/or could provide guidance for our use case? 🤔
06.08.2025 19:08

For folks familiar with #SLURM: we are looking for a way to validate SBATCH directives. Slurm has --test-only, but it does validation for flags *and* against a cluster, and the two seem tied together. In #Flux we can validate directives (via a directive parser) separately from resources (via a feasibility plugin). Thank you!
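Since we have not found a way to get that separation from Slurm itself (hence the question above), here is a rough sketch of the kind of cluster-free directive check we are after. This is a hypothetical illustration with a tiny, made-up subset of sbatch flags, not Slurm code:

```python
# Hypothetical sketch: validate #SBATCH directive syntax without
# contacting a cluster. The flag subset below is illustrative only.
import argparse
import shlex

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sbatch", add_help=False)
    parser.add_argument("--nodes", "-N", type=int)
    parser.add_argument("--ntasks", "-n", type=int)
    parser.add_argument("--time", "-t")
    parser.add_argument("--partition", "-p")
    parser.add_argument("--job-name", "-J")
    return parser

def validate_directives(script_text: str) -> list[str]:
    """Return error messages for #SBATCH lines that do not parse."""
    parser = build_parser()
    errors = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("#SBATCH"):
            continue
        tokens = shlex.split(stripped[len("#SBATCH"):])
        try:
            _, unknown = parser.parse_known_args(tokens)
            if unknown:
                errors.append(f"line {lineno}: unknown option(s) {unknown}")
        except SystemExit:  # argparse exits on a malformed value
            errors.append(f"line {lineno}: malformed directive: {stripped}")
    return errors

script = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --walltime=1:00:00
srun hostname
"""
print(validate_directives(script) or "directives look valid")
```

Resource feasibility (does the partition exist, are there enough nodes?) would still need the cluster; the point is that syntax checking alone should not.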
06.08.2025 17:25

This was 20 years ago. When we are young, we often can't anticipate oncoming darkness. But we also can't anticipate our own resilience. What we come to realize with experience is that we are always in a mixture of light and shadow. It is a choice to not just see, but to try to be, a source of light. 🕯️
06.08.2025 05:13

Yes, I'm going to paint it red! 🔴
Maybe I'll show up for the tutorial too! 😄
Flux Bird!? WHAT are you doing over there? 🦩
Our #SC25 tutorial is up! 🎉
sc25.conference-program.com/presentation...
I'm already excited! We will be co-presenting with #AWS and teaching you how to run #HPC workloads using #Kubernetes with the Flux Operator. And a taste of MuMMI, a workload with AI/ML components. Hope to see you there! 🥳
"When you're the last one in the data center. "
Me: I will do important, serious work.
But also: youtu.be/OImn6x2VQu8?...
🦩
Of course, the full music video "Flux Time" that was cut short in the live version:
youtu.be/N25GySogBeE?...
Thank you to everyone who attended! Please reach out to any of us with questions. 🦩
Our talk "Ensemble Workloads in the Age of Converged Computing" presented the #FluxFramework Operator, deployment of Flux in different cloud environments, and user-space Kubernetes "Usernetes."
hpckp.org/talks/orches...
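As a small taste of what the talk covered, here is a minimal sketch of creating a Flux Operator MiniCluster from Python. The namespace, image, command, and the v1alpha2 CRD version are assumptions for illustration; check the Flux Operator docs for the current schema:

```python
# Sketch: submit a Flux Operator MiniCluster via the Kubernetes API.
# Assumes the Flux Operator is installed; image/command are placeholders.
from kubernetes import client, config

config.load_kube_config()

minicluster = {
    "apiVersion": "flux-framework.org/v1alpha2",  # assumed CRD version
    "kind": "MiniCluster",
    "metadata": {"name": "demo", "namespace": "default"},
    "spec": {
        "size": 4,  # number of pods, each running a Flux broker
        "containers": [
            {
                "image": "ghcr.io/example-org/app:latest",  # placeholder
                "command": "myapp --input data.txt",        # placeholder
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flux-framework.org",
    version="v1alpha2",
    namespace="default",
    plural="miniclusters",
    body=minicluster,
)
```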
If you missed the #FluxFramework workshop and talks at #HPCKP, they are online!
🎥 Flux Framework Workshop: hpckp.org/talks/flux-n...
The workshop includes an introduction to Flux, a talk on "Flux Environments," a hands-on tutorial (actually a container adventure), a music video, and Jeopardy! 😄
Abstract: The rise of AI and the economic dominance of cloud computing have created a new nexus of innovation for high performance computing (HPC). In addition to performance needs, scientific workflows increasingly demand capabilities of cloud environments: portability, reproducibility, dynamism, and automation. Geopolitical changes that lead to cuts in scientific spending, paired with resource contention, introduce a new reality: portability is itself a metric of performance. Planning a strategy for flexible movement requires understanding the strengths and weaknesses of different converged environments for the needs of HPC. In this talk I will present a cross-cloud usability study that assessed 11 different HPC proxy applications and benchmarks across three clouds (Microsoft Azure, Amazon Web Services, and Google Cloud), six environments, and two compute configurations (CPU and GPU), performing scaling tests of applications in all environments up to 28,672 CPUs and 256 GPUs. I will present insights on usability, work needed, and lessons learned from such an ambitious undertaking, and hope to inspire discussion about a future vision for orchestration of HPC applications in the cloud.
Please join us this Tuesday, July 1st, at 9am Pacific to learn about my team's work on "Cloud Usability for #HPC Applications" hosted by the #CASS software stewardship organization. Please message or email me for the calendar invite. Hope to see you there!
29.06.2025 18:25

For the first time: user-space Kubernetes running under a Flux allocation on a production cluster. This is OSU and LAMMPS. It has taken months of work and persistence. We got this working on an old kernel and a hugely strict security policy. Experiments and more details coming soon! 🥳
27.06.2025 01:20

Regardless of speed of delivery, I'd say a better tactic is to reduce meetings, or just say no to more of them. If you can drop out with minimal impact on others, then either you didn't need to be there in the first place, or the meeting was not necessary.
26.06.2025 14:14

No, because then you can't participate, and it assumes that time later is less valuable than time in the current moment. Later me would rather have participated in the meeting and be outside biking or running.
26.06.2025 14:14

We've got a request for information out on where we want to take Livermore Computing and other #HPC centers in the next five years.
hpc.llnl.gov/fg-hpcc-rfi
Check it out and send us your thoughts.
Science is increasingly using AI/ML, paired nicely with traditional simulation. With this setup, your simulations can run on bare metal and interact with a model, database, or message queue via a service, all within a single job. With bypass mechanisms we can even get around slirp4netns.
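To make the pattern concrete, here is a toy sketch of a simulation loop on bare metal talking to an in-job service. The endpoint and JSON schema are invented stand-ins; the service could equally be a model server, database, or message queue:

```python
# Toy sketch: a bare-metal simulation step consulting an in-job ML
# service. The URL and payload shape are hypothetical.
import json
from urllib import request

SERVICE_URL = "http://localhost:8080/predict"  # hypothetical in-job service

def advance(state: float) -> float:
    """Stand-in for one simulation timestep."""
    return state * 0.99 + 0.01

def ask_model(state: float) -> float:
    """POST the current state and read back a model correction."""
    payload = json.dumps({"state": state}).encode()
    req = request.Request(SERVICE_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["correction"]

state = 1.0
for step in range(100):
    state = advance(state)
    if step % 10 == 0:  # periodically steer the simulation with the model
        state += ask_model(state)
```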
25.06.2025 08:22

We don't want to ditch either one by any means; they work well together. We've deployed user-space Kubernetes with Flux on three clouds, and our on-premises setup is underway! The entire K8s cluster comes up and is torn down within the lifecycle of a job.
arxiv.org/abs/2406.06995
This is converged computing.
Our post on compatibility using #OCI artifacts in the #Kubernetes blog is hot off the press! 📰
kubernetes.io/blog/2025/06...
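For the curious, pulling that kind of compatibility metadata as an OCI artifact might look like the sketch below, assuming oras-py's client API; the artifact reference is a made-up placeholder:

```python
# Sketch: pull a compatibility-metadata artifact from an OCI registry
# with oras-py. The reference below is a placeholder, not a real artifact.
import oras.client

registry = oras.client.OrasClient()
files = registry.pull(target="ghcr.io/example-org/compat-metadata:latest")
for path in files:
    print("pulled:", path)
```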
We are working on adding an exporter to #NFD (Node Feature Discovery) for #HPC use cases: github.com/kubernetes-s... We are also planning experiments. If anyone has ideas, please share in the thread! 🎉
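Until a dedicated exporter lands, one existing hook is NFD's local feature source, which turns key=value pairs in a features.d file into node labels. A rough sketch, with made-up HPC feature names standing in for real detection logic:

```python
# Sketch: expose HPC node features to NFD via its "local" feature
# source, which reads key=value files from features.d. The feature
# names and probes here are invented placeholders.
from pathlib import Path

FEATURES_DIR = Path("/etc/kubernetes/node-feature-discovery/features.d")

def detect_features() -> dict[str, str]:
    """Stand-in for real detection of HPC capabilities on this node."""
    return {
        "hpc.infiniband": str(Path("/sys/class/infiniband").exists()).lower(),
        "hpc.mpi": "true",  # placeholder for an actual probe
    }

def write_feature_file() -> None:
    FEATURES_DIR.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}={value}" for key, value in detect_features().items()]
    (FEATURES_DIR / "hpc").write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_feature_file()
```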
For those who missed the #ISC25 Flux Framework Tutorial, we just posted our slides online:
github.com/flux-framewo...
Thank you to those who attended, and see you next time! 🎉
The biggest lie I tell myself... "Just a little further..."
youtu.be/7m9mkqpSzXM?...
These running socks and leggings are channeling #FluxBird! 🦩
14.06.2025 22:07

There are different goals and incentive structures. In #HPC, my sense is that we are leveraging #AI for research and scientific discovery. Industry is interested in products that lead to profit. Instead of ownership, maybe a more interesting question is: How do we best work together? 🤔
14.06.2025 06:56

I don't mean to dampen your excitement. I think we can be excited about #AI and what it is doing for #HPC, and about the innovations & scientific models we are contributing. But I also think we should be respectful that it is a fully fledged community in its own right, and not try to squash it under HPC.
14.06.2025 06:56

We want to lead, so we try to minimize AI as "just a part of HPC," or claim it "is HPC," to ameliorate that. The building of LLMs (that we use) is being done by cloud hyperscalers and AI companies. They have the resources, and (personally knowing a lot of their engineers) the talent.
14.06.2025 06:56

One of the hosts of the HPC Podcast says something similar, and I respectfully disagree, @thoefler.bsky.social. While #HPC labs were involved in championing GPUs for scientific computing, GPUs != AI. When I hear this statement, it hints that we are sensitive about not leading the innovation space.
14.06.2025 06:56

Interesting! I think this was something I saw coming. 😄
At least this area of work has been my passion for a long time, but there doesn't seem to be an area carved out for it in most research institutions.
What's coming next? Along with continued work on all of the above, the next item of interest is automated compatibility assessment via descriptive metadata or #OCI artifacts.
bsky.app/profile/vsoc...
I hope everyone had a wonderful week, whether you attended a conference or not! 🎉
For the last taste of current work, we talk about running user-space Kubernetes alongside Flux, a project we call "The Bare Metal Bros." Although slirp4netns adds network overhead, when we use bypass mechanisms (InfiniBand and EFA) we get close to equivalent performance.
13.06.2025 18:30