Fernanda Foertter

HPC Programmer, Data Scientist, Physicist, Developer Advocate, Aspiring Humanitarian

SC23 Retrospect

TL;DR: Two buzzwords caught my attention at this year’s Supercomputing conference:

  • Foundational Models
  • Data Friction

And neither of these is a traditional HPC buzzword. So what’s happening?

Everyone is going foundational

Let’s tackle the first one and get it out of the way. I remember a time when scientists, who were mostly focused on deterministic simulations, would have scoffed at the idea that AI could ever be useful in their particular domains. But this year was different: just about every major scientific domain was embracing AI experimentation, and some of them specifically cited foundational models.

One thing I love to ask at conferences is “what exciting thing are you working on lately?” and let me tell you, a lot of people were talking about building foundational models. For materials science, for biology, for healthcare and, most surprising of all, for climate science. Climate scientists are notorious for asking for things like “bitwise reproducibility” across hardware, libraries, and even across levels of compiler optimization. And not for nothing: their field is constantly scrutinized on the world stage, not only by peers but also by members of the general public, so being able to compare models from year to year matters for policy.

I was happy to see this willingness to experiment with what is, in the context of deterministic simulations, seemingly a “black box.” And if there’s any field out there with the absolute perfect amount of data to “do AI right,” it’s the climate community. All of the conversations I had were still at the work-in-progress stage, with the exception of one poster. I expect to see a lot more next year, especially since the largest and fastest machines that can handle large climate datasets, Frontier and Aurora (neither of them NVIDIA based), have yet to gain a similarly functional AI ecosystem for their respective hardware.

Data Friction

The other new-ish buzzword taking flight this year was data friction. A crisp definition is elusive, but I think the most succinct way to put it is “data is hard.” So, so hard. There were several workshops and BoFs on the matter: data workflows, pipelines, computational steering… I mean, HPC has struggled with data-ing for a long time. This isn’t new; for ages we lived with a tiny-data-in -> simulate -> big-data-out model in HPC.

But in 2023, the folks in interesting scientific spaces are excited about a computing model that reverses that order:
big-data-in -> magic -> small-data-out. (The magic in the middle can be anything: simulation, analytics, ML.) And yes, we had to analyze big-data-out outputs in the past. But that’s not the same as “out of all these extremely detailed digital twins, give me 5 candidates that exhibit XYZ.” And then keep iterating until the digital twins match real life. In my head I call it “weighted parameter permutation hell.”
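To make that last step concrete, here’s a minimal sketch of pulling a handful of candidates out of a big pile of runs. This is not anyone’s actual pipeline: I’m assuming a hypothetical ensemble of digital-twin outputs stored as Parquet, with made-up column names (run_id, peak_strain, converged, misfit), and using PyArrow plus pandas to filter and rank.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical directory of Parquet files, one row per digital-twin run.
runs = ds.dataset("twin_runs/", format="parquet")   # big-data-in

# Keep only runs that satisfy the "XYZ" criteria, reading just the columns we need.
candidates = runs.to_table(
    filter=(pc.field("peak_strain") < 0.02) & (pc.field("converged") == True),
    columns=["run_id", "misfit"],
).to_pandas()

# Rank by misfit against the real-world measurement and keep 5 candidates.
top5 = candidates.nsmallest(5, "misfit")             # small-data-out
print(top5)
```

The iterate-until-the-twins-match-reality loop wraps around something like this, which is exactly where the permutation hell comes from.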

(Once again, a shout-out to statisticians, the real MVPs who can help us make sense of the ocean of data we find ourselves in. Make Statisticians HPC Again.)

A few events stood out that week

I want to focus on NSDF (the National Science Data Fabric) and NAIRR (the National AI Research Resource). The NSDF session appeared to be led mostly by university faculty; the goal is to map and connect data sets from lots and lots of sources. The second, NAIRR, is a bit less clear. The NAIRR BoF had representatives from DOE, NOAA, NASA, and NIH. A report outlining the pilot has been written, and the panel started off by describing what each agency would bring to the table. The responses were what you might expect:

  • DOE: “we will bring computers”
  • The rest: “we offer data”

I stood up at that point and made a “more of a comment than a question” type of remark. I told the panel that my entire career has been spent chasing data that is never offered in a computable format. When I worked at NVIDIA, customers told me they wanted to do AI, but their data was locked up in proprietary or domain-specific formats inside instruments. When I was a consultant at BioTeam, pharma companies told me their data were locked up in 23 different databases, none of them harmonized. Now, at Voltron Data, my DevRel team of engineers has tried to use NIH and census data, and those are locked up as PDF, DAT, CSV, or gzip downloads over FTP. It seems to me that if NAIRR, NSDF, and Globus combined forces, we could solve a lot of this, because “silos gonna silo.”

It’s exhausting to come up with a great idea that combines, say, weather data with census data, only to fail because the data is so hard to wrangle. That’s data friction. And that’s why I’m at Voltron Data. I’m tired of wrangling. I just want to compute, I want to filter, I want to see what’s inside without depending on some vague file description or incomplete metadata to know whether it’s worth downloading 7.5 TB of uncompressed CSV over slow FTP.
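For contrast, here’s roughly what I wish that experience looked like: a minimal sketch assuming (hypothetically) that the same data were published as Parquet on object storage instead of gzipped CSV over FTP. The bucket path and column names are invented for illustration.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical bucket; in reality this is the part that rarely exists.
census = ds.dataset("s3://example-open-data/census/", format="parquet")

# "See what's inside" from schema and footer metadata alone; no multi-TB download.
print(census.schema)
print(census.count_rows())

# Pull only the columns and rows the idea actually needs.
subset = census.to_table(
    columns=["state", "county", "median_income"],
    filter=pc.field("state") == "TN",
)
print(subset.num_rows)
```

That’s the whole wish list: inspect, filter, compute, and only then decide what’s worth moving.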

Well, this retrospective took a turn toward ranting about the state of data, so I’d better close on a positive note: over 14,000 people attended SC23. WOW. What an exciting time to be alive.

What interesting buzzwords did you hear at SC23?