Dan Stanzione On Supercomputers, Pandemic Computing, and Predicting Hardware Needs
TACC houses a host of computing resources—all with Texas-themed names. Frontera is a Dell C6420 system ranked eighth on the June 2020 Top500 list of supercomputers. Its 23.5 HPL petaflops are achieved with 448,448 Intel Xeon cores. Stampede2 is the flagship supercomputer of the Extreme Science and Engineering Discovery Environment (XSEDE) and achieves 18 petaflops of peak performance with 4,200 Knights Landing (KNL) nodes—the second generation of processors based on Intel's Many Integrated Core (MIC) architecture—and 1,736 Intel Xeon Skylake nodes. Then there's Lonestar5 (1,252 Cray XC40 compute nodes), Longhorn (108 IBM Power System AC922 nodes), Maverick2 (NVIDIA GPUs), and many more.
With so much big compute to wrangle, Stanzione and TACC are adept at provisioning systems, balancing loads, and predicting trends in scientific computing. As part of the Trends from the Trenches column, Stan Gloss sat down with Stanzione to talk about big compute, pandemics, and technology trends.
Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology.
Stan Gloss: What percentage of the research that you do at TACC is actually supporting life sciences, and how's that changed over time?
Dan Stanzione: There's been a surge in research associated with COVID-19; that's sort of skewed the numbers for the last few months. Starting in early March through about the end of June, [our workload has] gone to about 30% COVID-19 support, both technical staff and computing cycles. I think we have 45 projects that we're supporting, some of which are very large collaborations with huge teams around the world. I think just one of them involves 600 researchers.
I would say prior to COVID-19, we were probably running 15% or 20% life sciences across the center. Our computing time has been particularly [focused] on the molecular dynamics and protein structure side of life sciences. Relating to COVID, [the work] has been more on the data science, epidemiology, data integration pieces of it. [That work] doesn't use the [compute] cycles the way the protein structure stuff does, but it certainly uses the people time and the software.
During normal times, how much of your work is committed to UT researchers?
Our big machines are federally supported, by NSF. I have about 10% of the cycles that we keep at home for UT folks, and the rest go around the world. 90% of our users are not at UT Austin.
If I were working on an NSF-funded program, could I get access to TACC? How does one get time?
There are several mechanisms and programs that we use. For Stampede2 and Wrangler, and a few of our other platforms, there's a shared services group at NSF that allocates time among the various supercomputing centers; the project is called XSEDE. They take allocation requests quarterly. That's for the larger things.
If you are at a university and you're doing open research, you can just apply for a start-up allocation—we turn that around in-house in about a week—and get on the machine. But as your usage grows to many thousands of hours, you can apply quarterly to XSEDE. You write a proposal that shows you know what your science is and why you need the time—justify it. It's a competitive process. There's no cost to do it; you just apply for time.
For our very largest machine, Frontera, it's in a separate track at NSF and we have a separate allocations proposal where we also take quarterly requests for the very largest projects to go on the machine. Again, it's proposal-driven and peer-reviewed once per quarter to get onto the machine.
You don't have to be NSF-supported, although that's the bulk of our users, and you get preference if you have support from NSF. But probably 10 or 15% of the cycles we allocate go to NIH-supported researchers, and also to DOE- or USDA-funded ones. It can be any funding source that's not classified, mostly academic. We will take industry users for open, publishable research through the NSF process, or we can always do it for cash. We have a chargeback mechanism to get access if you're doing something that's not open, or if you're not getting enough time through the publicly funded ways to get time.
You mentioned that Frontera has a special allocation process that's different from Stampede and the others. Tell me about Frontera. Why is it different?
Frontera is the second in NSF's series of leadership-class systems. Frontera was number five in the world when it debuted about a year ago. It's still in the top 10 in the world, I think, on the new list that came out two weeks ago, as of the time we're recording this. It's still one of the 10 biggest machines anywhere in the world. It's the biggest university-based resource anywhere in the world, certainly in the United States. There are a few large government machines in China and the US that are larger, but we're sort of the largest truly open academic machine out there.
We followed another big NSF machine, Blue Waters, that was funded in a similar track. A lot of people want time on these machines, so there's a lot of tension between serving the sheer number of people who want access for all the different computational problems out there, and serving the problems that need a whole lot of time to make progress: the ones that have to use a third of the machine for two or three months to solve a single problem, or they can't do anything.
We separated those: the XSEDE machines deal with the capacity problem, the thousands of users that we have to support, and we reserve Frontera for the capability problem, the few users who need a whole lot of time. Stampede2, which is also still in the top 25 in the world, has 10 times the users and projects, but each individual user's share is much smaller. We have literally 3,000 projects on Stampede2, and we keep Frontera, at any given time, to around 50 or 60. As you might imagine, the average project there gets a lot more computing time. We're really reserving the largest single challenges to run on Frontera, and with Stampede2, we're trying to promote broad access to high-performance computing.
How do you manage 3,000 simultaneous projects? That sounds pretty daunting.
I'll give you the short answer, but yes. Although there are moves away from this in some segments of the enterprise, most of this is still batch-scheduled. That's the notion of the allocation process: everybody gets a fixed amount of time. We have an accounting system that deducts that time as they submit jobs to a queue, so you just sort of run them in the order they show up. [We consider] different things about priority and fairness and scheduling, and we prioritize big jobs so little jobs don't starve them out. But essentially, we have hundreds of users log in every day and submit thousands of jobs every day. And we just queue them up and run them. We keep the machine busy 365 days a year doing this stuff and crank through. Both machines do over a million jobs a year at this point.
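To make the accounting-plus-queue idea concrete, here is a minimal Python sketch of that pattern: projects hold a balance of node-hours, submitted jobs are charged against it, and a priority that mixes job size with waiting time keeps big jobs from being starved. The project name, allocation size, and aging weight are hypothetical; TACC's production systems use a full batch scheduler rather than anything this simple.

```python
import time

class AllocationScheduler:
    """Toy batch scheduler: deducts node-hours from a project's allocation
    and orders queued jobs by a priority that mixes job size with time
    spent waiting, so big jobs aren't starved by streams of small ones."""

    AGE_WEIGHT = 4.0  # hypothetical: one hour of waiting counts like four extra nodes

    def __init__(self):
        self.balances = {}   # project -> remaining node-hours
        self.queue = []      # pending job records

    def add_project(self, project, node_hours):
        self.balances[project] = node_hours

    def submit(self, project, nodes, hours):
        cost = nodes * hours
        if self.balances.get(project, 0) < cost:
            raise ValueError(f"{project} has insufficient allocation")
        job = {"project": project, "nodes": nodes, "hours": hours,
               "submitted": time.time()}
        self.queue.append(job)
        return job

    def _priority(self, job):
        waited_hours = (time.time() - job["submitted"]) / 3600.0
        return job["nodes"] + self.AGE_WEIGHT * waited_hours

    def dispatch(self):
        """Start the highest-priority queued job and charge its project."""
        job = max(self.queue, key=self._priority)
        self.queue.remove(job)
        self.balances[job["project"]] -= job["nodes"] * job["hours"]
        return job

# Hypothetical usage: a project with 50,000 node-hours submits two jobs.
sched = AllocationScheduler()
sched.add_project("astro-sim", 50_000)
sched.submit("astro-sim", nodes=128, hours=24)
sched.submit("astro-sim", nodes=4, hours=1)
print(sched.dispatch()["nodes"], sched.balances["astro-sim"])  # 128, 46928
```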
Do you have many commercial clients who want to pay for some time?
We have a fair number. People directly using time on the systems tend to be the smaller and mid-sized companies across several industries. We have a couple of aerospace companies; we have a few oil and gas companies who do production computing with us.
We also have partnerships with a lot of the large industry customers to do benchmarking. They want access to our systems to test things out, but it's more about learning from us to build their own in-house infrastructure. So they do a limited amount of computing, but they're asking us to test codes. Or they come to trainings; they come to our annual industrial partners meetings to exchange best practices. Given that we're in Texas, we have most of the large oil and gas companies participate through those meetings. Altogether we work with probably 40 or 50 companies.
So there's an educational component to what you do?
Oh, absolutely. Our job is to figure out how to use advanced computing technologies to create scientific, engineering, and societal outcomes. That means not just buying and deploying the systems, but operating them and training people to use them. I think our staff is more valuable than our machines, quite frankly. Computers are relatively easy to get. Computers run by professionals who know the science, build the software stack, and keep things running continually: that's the scarce commodity.
In addition to the 30,000 or so servers that we run across the various large computing platforms we have, we have about 170 staff who take care of these things, ranging from life science experts to astronomers and chemists, computer science experts, machine learning experts, and data curation experts. Increasingly, the workflow in science is that you bring together a bunch of data from a bunch of sources. We have to clean it and integrate it and do a fair amount of pre-processing on that. You're probably going to do both simulation and some form of AI somewhere in the workflow. Then you're going to need to understand that data with visualization or some other technique for data analysis. Finally, you're going to want to publish and reproduce those results over time. So we try to be part of that whole computational science workflow in what we do.
That's incredible. How do you forecast where supercomputing is going to be in two to three years?
Yeah, I'm already designing machines for 2024. These are procurements of tens of millions of dollars, so you don't want to buy old technologies. Computer technologies have a pretty short lifespan. Usually we're trying to make a decision about two years before the technology comes to market. For Frontera, it's a proposal-driven process; there was a competition in which proposals were submitted to the government for a decision. To some extent, we pick a technology and a vendor team to work with, and our competitors might pick different ones. And then the competition sorts out who the winner is. We submitted the proposal for Frontera two full years before the start of production. We were extrapolating performance on chips that did not yet exist, in close conjunction with our vendor partners.
Is a vendor partner like Intel?
Yes, Intel was our chip provider, although Frontera actually has an ecosystem. We have a GPU subsystem; we have a large memory subsystem. The primary compute is CPU-based Intel Cascade Lake Xeon. We were working way ahead of release with Intel. Fortunately, in that case, it was a fairly incremental change from the Skylake Xeons we had used on Stampede2 the year before, so we had some insight into what was going to happen. And it was sort of a linear extrapolation. But as you change technologies, that's not always true.
We worked with Intel and Dell on the main part of the system. We actually have an IBM and NVIDIA piece to the system as well, and then another single-precision-focused subsystem with NVIDIA that's oil-cooled with Green Revolution Cooling.
It is tough to stay abreast of these things. We can work with the chip manufacturers on what the roadmap is for technology, but we have to translate this into delivered science results.
Right now, one thing we're evaluating pretty closely is these chips that do tensor processing. They can optimize for lower precision, often down to 16 bits. We're trading off accuracy for speed in those situations, which in the case of training a neural network makes a lot of sense, because you're just weighting the connections between neurons, essentially, for most of the computation. You really just need to know, is this one important or not important?
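As a rough illustration of that tradeoff (not TACC code), the following NumPy sketch computes a neural-network-style weighted sum in 64-, 32-, and 16-bit floating point. The low-order digits diverge, but the sign and rough magnitude, which are what the "important or not important" question needs, survive the drop to 16 bits. The array size and random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "layer": 4,096 activations feeding one neuron's weighted sum.
acts = rng.standard_normal(4096)
weights = rng.standard_normal(4096)

for dtype in (np.float64, np.float32, np.float16):
    a = acts.astype(dtype)
    w = weights.astype(dtype)
    # The decision the network cares about is essentially the sign and rough
    # size of this sum; that survives 16-bit arithmetic even though the
    # low-order digits do not.
    total = np.dot(a, w)
    print(dtype.__name__, total)
```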
We can sort of understand the chip design and how it works, but can we build a software ecosystem that will let people build applications on it? How much change do our users have to go through? Because again, we're supporting several thousand academic research teams; they don't all have large staffs of programmers to go in and make changes. We want to pull our users forward with what we think the best technologies are, but we can't get too far ahead of them, or they just won't use the machines. If it's a radical change and they have one grad student who's using some code they inherited, they can't spend two years recoding it around a new technology.
We put gradual pressure on them to change as the systems change, and then we have to work with the vendors to make sure we're not making too radical a change each time. This is why you see the very incremental rollout of technologies, like GPUs, where it's taken a decade and a half to really get penetration. It wasn’t because the chips weren’t ready, but because the software wasn't ready to use them. That's a huge problem. We have thousands of applications that we support that need to migrate to these new technologies. When we're looking at very different chips, we are scared that we might build something that our users don't want.
Yes. For example, in life sciences there was a big push for Hadoop, but none of the scientists wanted to modify their codes to take advantage of it.
Yeah, and that Hadoop model is largely gone as a result. It was sort of a technology fad that came and went. Some of the codes, especially in things like weather, have been around for 20 or 30 years and they can't turn on a dime for a fad.
You have an ecosystem of high-performance computing technologies designed to do whatever the client needs. So if a problem maps very nicely to a GPU, you have GPUs available. Are you basically taking your trend lines for the types of uses that you're seeing on the systems and mapping the utilization of the different types of technologies based on that?
Yep. There are really three sources of information we use to drive those sorts of decisions. First, we actually get users together and ask about forward-looking future challenges and how they see science changing. They tell you, "It's going to be more data-intensive," or, "We're going to have more uncertainty quantification," or whatever it is looking forward. Then we get their aspirational goals and desires around that. Those aren't always assured to match reality, although they are an important source of input. Second, we look at the allocation requests that our users are actually writing and look at how those change over time. When push comes to shove, what are they really asking for? That gives you a slightly different snapshot of the present reality.
Third, we look back with workload analysis over time at how the cycles are actually used on the machines that we have, and what runs where. Often that tells a somewhat different story. When we ask users, five years from now always looks dramatically different than today. If we look back at the workloads over the last 10 years, there are some changes, but they're remarkably consistent, even though, in each of those 10 years, when we asked for a five-year vision, it was always going to be terribly different five years out. In truth, it stayed relatively constant. There's more of everything, but the mix is about the same: molecular dynamics versus astrophysics, how much finite element method versus how many FFTs are out there. We see some shifts, but they tend not to be that rapid. AI is potentially disruptive, but it's taking time to make its way into the workflows.
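A sketch of what that kind of retrospective workload analysis can look like in pandas, using placeholder accounting rows (the application families and node-hour figures are invented; real input would come from the machines' accounting logs):

```python
import pandas as pd

# Placeholder accounting records: one row per job, with the year it ran,
# the application family it belonged to, and the node-hours it consumed.
jobs = pd.DataFrame({
    "year":       [2018, 2018, 2018, 2019, 2019, 2019],
    "app_family": ["molecular_dynamics", "astrophysics", "fft_codes"] * 2,
    "node_hours": [4.0e6, 2.5e6, 1.5e6, 5.1e6, 3.2e6, 1.9e6],
})

# Share of total cycles by application family, per year: the "mix" that
# stays remarkably consistent even as the totals grow.
mix = (jobs.groupby(["year", "app_family"])["node_hours"].sum()
           .groupby(level="year")
           .transform(lambda s: s / s.sum())
           .unstack("app_family"))
print(mix.round(3))
```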
We have to balance what they tell us they want, what they're actually willing to spend their time on, and then what actually runs. Those are the three different sources we use; then we blend those together, along with the information we actually get from the vendors about future technologies, to figure out how that maps to what we're going to see out there.
Can you tell me a little bit more about how the pandemic has impacted TACC?
Changes will be wrought throughout society, but obviously we had operational changes. We've been fortunate that on our staff only a couple of people have been infected; by and large, we've been healthy. And because we're all tech workers to some extent, it was pretty smooth for us to switch to telecommuting, by and large; we do have a few people who have to go in, lay hands on the hardware, and keep the machines running.
The biggest impact is that we've had to divert a lot of our resources into actually tackling this pandemic. As I mentioned earlier, for several months, as much as 30% of all Frontera and Stampede cycles were going into COVID work.
We work with thousands of scientists. There are many we know very well, and we've had relationships with for a long time, and they know our systems and our platforms pretty well. And I've been able to say, "Yeah, let's just skip the process. We know you do good work. We know this is a priority right now. Let's just get going."
We've been able to do some pretty miraculous things in fast response, but that's not an accident. We rehearsed for this doing work on SARS, on HIV, on H1N1 and H5N1, and on MERS.
There are people who dedicate their careers to this stuff, and we've been dedicated to supporting them. We have the infrastructure and the people and the software tools in place, and that allows us to respond quickly when there actually is a disaster. But we couldn't start from zero and do as much as we've done in this short a time without the relationships and the infrastructure in place to do this.
When you do that, what happens to all the other work?
It just gets put a little bit on the back burner, right? People have to wait longer to get their stuff through and they have less time available. We're actually planning to deploy an expansion; we're going to add a few hundred nodes to offset some of the time that we've lost. We'll make it up over several years. But we'll add some capacity because of the time we've diverted away and anticipate that we'll continue to divert away to look at these COVID things.
We've diverted time, I'd say, into three big categories of work. One is at the atomic level: understanding the structure of the virion itself, understanding the structure of the cells and the drugs that we might wrap around it, and doing a whole bunch of protein folding and structural work with the light sources and cryo-EM folks to get data to confirm that stuff. That's traditional simulation. At the other extreme, we work on the whole person, which is the epidemiology, right? How does the virus spread? Doing contact tracing. Looking at cell phone data to see how interaction patterns are changing, and how social distancing actually reduced the number of people that you see when you put a regulation into effect. You can do a whole bunch of data science around mostly cell phone data to figure out things we couldn't do even 20 years ago. We can model how you position your resources, building models to look at hospitalization rates and how many ICU beds we need to have available, which affects public health policy.
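As a toy illustration of that epidemiological modeling category (not the models TACC actually ran), here is a minimal discrete-time SIR sketch in Python that turns a contact-reduction assumption into a rough peak-bed estimate; every parameter is an illustrative placeholder, not a fitted value.

```python
def sir_peak_beds(days=365, N=1_000_000, beta=0.3, gamma=0.1,
                  hosp_frac=0.05, contact_scale=1.0):
    """Toy discrete-time SIR model. Returns peak simultaneous infections
    times a hospitalization fraction, as a rough proxy for peak bed demand.
    All parameters are illustrative placeholders."""
    S, I, R = N - 1.0, 1.0, 0.0
    peak_I = I
    b = beta * contact_scale          # social distancing scales the contact rate
    for _ in range(days):
        new_inf = b * S * I / N       # new infections this day
        new_rec = gamma * I           # recoveries this day
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        peak_I = max(peak_I, I)
    return hosp_frac * peak_I

# Compare an uncontrolled epidemic with a 40% reduction in contacts.
print(f"no intervention:    ~{sir_peak_beds(contact_scale=1.0):,.0f} beds at peak")
print(f"40% fewer contacts: ~{sir_peak_beds(contact_scale=0.6):,.0f} beds at peak")
```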
Finally, in between those two is the genomic-level stuff, which is coupled to and informed by the molecular work. Can we find, in the RNA sequence for the virus, similarities to other viruses and to treatments that are effective? Can we understand its evolution? To figure out prospective treatments, can we understand the hosts that it's infecting? Can we say, "These sequences tend to mean you're more vulnerable or less vulnerable; this part of the sequence forms these proteins, which makes a certain segment of the population not as vulnerable"? Can we translate those? Both the molecular part and the genomics part actually feed into therapeutics and drug or vaccine work.
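A toy version of the sequence-similarity idea (far simpler than the real pipelines): comparing two sequences by the Jaccard similarity of their k-mer sets, an alignment-free proxy for how related two genomes are. The fragments below are made-up strings.

```python
def kmers(seq, k=5):
    """Set of all overlapping k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k=5):
    """Jaccard similarity of the two sequences' k-mer sets: a cheap,
    alignment-free proxy for how related two genomes are."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)

# Made-up fragments standing in for viral genome segments.
ref   = "AUGGCUAGCUAGGCUAACGGUUACGCUAGCAUGGC"
query = "AUGGCUAGCUAGGCUAACGGUAACGCUAGCAUGGG"
print(round(jaccard(ref, query), 3))
```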
Looking into your crystal ball, how is computing technology going to change in the next three to five years? And what are you designing for three to five years from now, in your next generation, that we should be thinking about?
Yep. There are a lot of layers to that question. We have to think about how computational science is going to change, and then how computing technology changes. And I think there are really exciting things going on at both levels.
From the science perspective, the role that AI is going to play is going to continue to augment our science in some very interesting ways. It is getting much cheaper to acquire data from everything, from massive sensors for environmental and traffic analysis to 5G, and our ability to just get bits keeps improving: we can put out very low-power, very high-data-rate, accurate sensors for very low cost. Fusing that data into the scientific workflow, and using AI methods to get statistically valid ways to put it into the workflow, is interesting.
It's unfortunate that we're not getting a free ride out of the physics and getting higher performance anymore. But it does mean that it's an opportunity to be creative in terms of architecture. How do we use the 100 million transistors per square millimeter that we can get in any processor now? We're seeing sort of a plethora of these new architectures: AI chips, more GPU types, GPU-CPU hybrids that I think are very exciting. I think the thing that's going to help us most on performance is the tighter integration of memory. We can put so many transistors on a chip that we can do plenty of operations; we just can't get data to them fast enough. They'll start integrating memory into the silicon, or at least onto the package with the chip, a lot more. That's going to give us some huge performance increases and power efficiency improvements. We're also broadly switching to liquid cooling to allow these higher densities and higher power per socket.
We're also getting better code efficiency out of that. It's raising our power per square foot, but it's raising our efficiency even more than that. Data centers are going to have a lot less air moving around and a lot more liquid moving around, and the infrastructure we have to build for them turns into these larger systems of much more tightly integrated chips, with a lot more heterogeneity, where we can be creative as architects in how to use it.
Does storage technology have to change along with this? The way you store now may matter again.
Not necessarily a different type of storage system at this point, but we're seeing different access patterns. For the traditional, big, 3-D simulations we've done, it was throughput of IO that mattered. We had these big transactions, mostly large and fairly regular, and the question was whether we could just feed enough of them. Now, especially with graph algorithms and the AI methods, we're seeing very small IOs that are more frequent: small, random access.
The good news is that while that's hard to do on a spinning disk, it's fairly easy to do with a solid-state storage device, which is what we're moving to anyway. A lot of our software for building file systems is organized around the notion of disks that are spinning around, and having to go get things off those rotating platters. There's a lot of optimization we can do for solid state that we haven't done yet.
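A quick way to see the access-pattern difference on any machine is to time one large sequential pass over a file against many small reads at random offsets. The sketch below does that with arbitrary sizes; the absolute numbers will depend entirely on the device and the OS cache.

```python
import os
import random
import time

PATH = "scratch.bin"              # hypothetical scratch file in the working dir
SIZE = 64 * 1024 * 1024           # 64 MiB
SMALL = 4096                      # 4 KiB, the flavor of small random AI/graph IO

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

# Large, regular, sequential reads: the traditional big-simulation pattern.
start = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(8 * 1024 * 1024):
        pass
seq_s = time.perf_counter() - start

# Many small reads at random offsets: the pattern spinning disks handle
# poorly and solid-state devices handle well. (The OS page cache will
# flatter both numbers here; on a real parallel file system the gap is
# far larger.)
start = time.perf_counter()
with open(PATH, "rb") as f:
    for _ in range(10_000):
        f.seek(random.randrange(0, SIZE - SMALL))
        f.read(SMALL)
rand_s = time.perf_counter() - start

print(f"sequential read: {seq_s:.3f}s   10k random 4 KiB reads: {rand_s:.3f}s")
os.remove(PATH)
```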
This is one of those situations where it would be nice if we could get users to give up the whole notion of files and opens and move to more of an object methodology, but that's not really people-friendly, so I don't think that's going to happen. It's going to happen in the system software layers, but not in the application layers, because the core leading applications aren't going to change fast enough. We used to have 100 big files and now we have 10 billion little files, so we're changing the way we manage storage systems. We're moving to a lot more per-user volumes that are dynamic, instead of having one big shared file system. We're in that transition now, but I think we can build it on the storage blocks we have with solid state, and now with these non-volatile DIMMs, right? That'll be the top rung of the hierarchy. What the exact breakdown of those is going to be is hard to forecast at this point, but faster storage that's more amenable to random access is going to really help us keep the compute part saturated.
The other piece of this is that you have a lot of collaborators who are now transferring more and more of their information to and from you. Where is networking going to go? Because if the game is going to ramp up, the networking has to ramp up.
Yeah, that's always been true. Just in and out of Frontera and Stampede and their archive systems, we move 10 or 15 terabytes a day, apiece. I think we're moving over a petabyte of data a month into TACC at this point. Again, I see big sensor networks and 5G and things like that really driving that forward. We have been relatively successful at moving people to better protocols that let us get the wires closer to full, using technologies like Globus instead of things like HTTP to move data. That's been super helpful in extending the pipelines that we have, but we have 100 gig pipes now and we'll have 400 gig pipes in the near future to support this. We'll have it within two years, would be my guess. And we'll need it by then, too.
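For scale, a back-of-the-envelope calculation using the figures above (a petabyte a month, 100-gigabit and 400-gigabit pipes), assuming the links could be kept fully saturated:

```python
def transfer_hours(payload_bytes, link_gbps, efficiency=1.0):
    """Hours needed to move a payload over a link running at a given
    fraction of line rate (efficiency=1.0 means fully saturated)."""
    bits = payload_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency) / 3600

PETABYTE = 1e15  # the rough monthly inbound volume mentioned above
for gbps in (100, 400):
    print(f"1 PB over a {gbps} Gbit/s pipe: {transfer_hours(PETABYTE, gbps):.1f} hours")
# roughly 22 hours at 100 Gbit/s and 5.6 hours at 400 Gbit/s, if fully saturated
```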