How much of the energy consumed in an AI chip is spent doing something useful? This question affects everything from software to system architecture to chip design.
Key Takeaways
Heat is a serious problem within AI chips, and it is limiting how much processing can be done. The solution is either to extract heat faster or to generate less of it. Neither approach is easy, but long-term solutions need to focus on the second option.
Every action within a chip consumes energy and generates heat, which must be removed from the chip. The amount of activity is limited by how much heat can be removed and how quickly that happens. Many advances are being made in heat extraction, but while these techniques work, they are expensive and themselves consume additional energy.
But that’s only part of the problem. The total amount of available energy is not fully elastic, and increases in energy production are not keeping up with rising demand. That raises the question, ‘Is all of this activity useful, and is it being done using the minimum amount of energy possible?’ Given that the human brain consumes about 20 watts, there is clearly enormous potential for future optimization, but every advance must make economic sense.
It is often said that to understand something, you have to follow the money. That may be very important here, because power consumption is no longer just inconvenient. It is now a major cost factor. “While power has risen in importance, it has always been considered a second-class citizen in chip design,” says Marc Swinnen, director of product marketing at Synopsys. “But it leads directly to the bottom line, and the cost of cooling is quite intense. You pay for the electricity to heat up the circuits when you run them, and pay again for more electricity to pump the heat out. Power has become a significant contributor to the cost of the system.”
That is why incremental improvements in heat extraction are the favored solution so far. “While power is important, and it’s good to know how much energy your algorithm will consume, it is not the first design criterion,” says Roland Jancke, head of department design methodology at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It does play a more important role at the system level, but you do not have enough information about power consumption of your algorithm or your component. There are so many possibilities, many of which will never even be part of the investigation. Performing architectural studies is difficult.”
Work is underway at large systems companies to address this, but it is being kept quiet for competitive reasons. “Some of this work is so new that they feel they have the edge because they started to pay attention to it a little bit earlier than others,” says Suhail Saif, director of product management and solutions engineering at Keysight EDA. “It’s all secretive and in-house, where every design house doesn’t know where the competition is. They feel they have an edge, and it is their moat. They want to keep it close to their chest for the moment. It will need more maturity in the industry before they decide it is not worth it, when everyone is doing almost the same thing and there’s no return on investment. Then they will let one of the EDA companies handle it. And while everybody will then benefit from it, it’s less effort, less headache. I don’t think we are there yet.”
Consider communications
For the past several decades, the industry has sought improvement through aggregation. More and more content was integrated into a single monolithic die, and for the most part this defined the size of the compute problem that was tackled for standard applications. That stopped being the case with AI, where massive arrays of processors, spread across racks and even data centers, became part of the mainstream.
“A lot of the power is consumed by communication between chips,” says Synopsys’ Swinnen. “Part of the penalty of disaggregation is that you have more communication costs between the blocks in your system. One of the beauties of monolithic was that the communication was low-power and high-bandwidth. Data centers are another form of disaggregation. We have multiple processors across multiple racks, many yards away from each other. They’ve looked at the communication power, and that can be reduced by going optical. The data backplane in the data centers is becoming optical.”
All aspects of communications are being investigated. “Take a look at the recent industry efforts for high-performance communication protocols,” says Badarinath Kommandur, fellow at Cadence. “There is a pretty intense focus on metrics like picojoules per bit. Moving forward, the industry wants to get to femtojoules per bit. This is becoming front and center, especially in AI-driven applications.”
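The pull of those metrics is easy to see with some back-of-the-envelope arithmetic. The sketch below uses assumed, illustrative numbers (the bandwidth figure and both energy-per-bit values are hypothetical), but it shows why a move from picojoules to femtojoules per bit matters at data-center bandwidths: link power scales linearly with energy per bit.

```python
# Illustrative arithmetic (all numbers assumed): the power drawn by a
# communication link is simply energy-per-bit times bits-per-second.

def link_power_watts(energy_per_bit_j, bandwidth_bps):
    """Power (W) = energy per bit (J) x bandwidth (bits/s)."""
    return energy_per_bit_j * bandwidth_bps

bandwidth = 10e12  # 10 Tb/s of aggregate chip-to-chip traffic (assumed)

p_pj = link_power_watts(5e-12, bandwidth)    # at 5 pJ/bit
p_fj = link_power_watts(500e-15, bandwidth)  # at 500 fJ/bit

print(f"5 pJ/bit:   {p_pj:.1f} W")   # -> 50.0 W just to move the bits
print(f"500 fJ/bit: {p_fj:.1f} W")   # -> 5.0 W for the same traffic
```

At a fixed traffic level, a 10x reduction in energy per bit is a 10x reduction in communication power, which is why the metric is front and center.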
Compute fabrics are becoming so performance-intensive that traditional communications is struggling to keep up. “Copper has been the focus for a long time, but if you look at it in terms of speed scaling, we are now contending with skin effect, which impacts how electrons flow through the medium,” says David Kuo, vice president for product marketing and business development at Point2 Technology. “There’s a limit for how copper can support the future data center workloads. With optical, there’s cost, power, and reliability issues. There’s a saying in the data center, use copper when you can, use optical when you have to.”
There is great reluctance to make that switch. “Optics is a step function in complexity,” says Swinnen. “There is a new set of physics, and a different set of tool expertise required. Then there are issues marrying optics and semiconductors together. It’s gotten a lot better. While people talk a lot about picojoules per bit, photonic systems are much more efficient in that they require less energy per bit to transmit the data. But that number is low because they have such high bandwidth, not because they’re low power.”
Point2 Technology is looking at a possible middle ground. “We have developed an eTube technology,” says Kuo. “It uses RF transmission to transmit data over a plastic waveguide. We are replacing the copper medium with a plastic material, and we define the waveguide structure. Then, using an RF transmitter and receiver, we transfer the signal over the waveguides. The antenna is very similar to patch antennas.”
On-chip communications also must be considered. “For modern multi-core and multi-die SoCs, moving data around — weights, activations, and metadata — costs far more energy than the compute that processes it,” says Guillaume Boillet, vice president of strategic marketing at Arteris. “This shifts the network-on-chip (NoC) from being an integration fabric to being one of the primary levers for power optimization. Teams that architect their NoC around workload traffic patterns can dramatically reduce data movement, localize communication, minimize congestion, and cut dynamic power across the chip. In a world increasingly limited by watts, controlling where data flows and how efficiently it moves is becoming just as important as optimizing the compute itself.”
Consider design
While many AI workloads are somewhat general-purpose, inference applications can often be tuned to directly address immediate needs. “We have to come up with hardware architectures that exploit the network architecture itself,” says Sharad Chole, chief scientist at Expedera. “Edge devices are basically limited by bandwidth. Training is done using multiple HBMs. But on the edge, there is literally one LPDDR, or not even 64 channels, maybe even a smaller-channel LPDDR that gets deployed on low-cost edge devices. That means bandwidth management becomes a critical part of how we execute things on edge inference.”
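Chole's point about bandwidth limits can be made concrete with a roofline-style estimate. The sketch below is hedged and hypothetical (the model size and both bandwidth figures are assumed, round numbers): for a memory-bound inference step that must stream the full weight set per token, tokens per second is capped by memory bandwidth divided by model size, regardless of how much compute is available.

```python
# Back-of-the-envelope bound (all numbers assumed): a memory-bound
# decode step streams the full weight set for every generated token,
# so throughput is capped by DRAM bandwidth / weight bytes.

def max_tokens_per_sec(weight_bytes, mem_bw_bytes_per_sec):
    """Upper bound on token rate when weights must be re-read per token."""
    return mem_bw_bytes_per_sec / weight_bytes

model_bytes = 3e9   # 3B-parameter model at 8-bit weights (assumed)
lpddr_bw = 60e9     # ~60 GB/s, a modest LPDDR configuration (assumed)
hbm_bw = 3e12       # ~3 TB/s, an HBM-based training part (assumed)

print(f"Edge (LPDDR): {max_tokens_per_sec(model_bytes, lpddr_bw):.0f} tok/s")
print(f"HBM class:    {max_tokens_per_sec(model_bytes, hbm_bw):.0f} tok/s")
```

The two-orders-of-magnitude gap between the LPDDR and HBM bounds is why bandwidth management, not raw compute, dominates edge inference architecture.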
Most of today’s wasted power does not come from the arithmetic itself, but from everything around it. “Unnecessary data movement, poorly matched memory hierarchies, unused speculative work, glitch power, and guardbands that assume worst-case conditions that rarely occur are just a few examples of waste,” says Arteris’ Boillet. “Meaningful improvement must therefore come from electronic productivity — maximizing useful work per Joule — throughout the entire stack, from system scheduling and workload shaping to architectural and micro-architectural efficiency.”
Consider implementation
While there may be large gains to be made at the architectural level, significant waste also may persist through implementation. “Fixed-voltage guardbands were meant to provide safety, but over time they have become an energy tax baked into every chip,” says Noam Brousard, vice president of solutions engineering at proteanTecs. “Guardbands assume that every worst case will happen at the same time. In reality, that almost never occurs. Yet the chip is forced to run at an inflated voltage all the time. The result is obvious. The chip burns energy it does not need to use. This unused margin translates into gigawatts of waste. It is a hidden cost that grows with every generation.”
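The scale of that tax follows from basic device physics. Dynamic power scales with the square of supply voltage (P_dyn ~ C·V²·f), so even a modest guardband carries an outsized energy cost. The sketch below uses assumed, illustrative voltages to show the effect; it is not a claim about any particular process or product.

```python
# Sketch of why voltage guardband is an "energy tax" (numbers assumed):
# dynamic power scales with V^2, so shaving margin at constant frequency
# yields a roughly quadratic power saving.

def dynamic_power_ratio(v_new, v_old):
    """P_dyn ~ C * V^2 * f, so at fixed C and f the ratio is (V_new/V_old)^2."""
    return (v_new / v_old) ** 2

v_guardbanded = 0.75  # volts, with worst-case margin baked in (assumed)
v_trimmed = 0.70      # volts, after margin recovered via monitoring (assumed)

saving = 1.0 - dynamic_power_ratio(v_trimmed, v_guardbanded)
print(f"Dynamic power saving: {saving:.1%}")  # ~13% from a 50 mV trim
```

A roughly 7% voltage reduction returns close to double that in dynamic power, which is why recovering unused margin across a fleet of accelerators adds up to the gigawatt-scale waste Brousard describes.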
Guardbands are also created to deal with uncertainty. “The PDK comes from the foundry processes,” says Kuo. “But is it accurate enough to describe transistor-level performance? What we find is it is not quite accurate enough. You get a surprise when you get to silicon. With analog and RF design, the reason it is so challenging is that you’re constantly pushing the boundaries beyond what the foundry can actually define.”
AI designs certainly push process boundaries. “For the more advanced PDKs, any leading foundry will take the silicon learnings and optimize them for the designs they’re targeting in high volume,” says Cadence’s Kommandur. “It is quite possible that if you’re designing something first time around with the 0.5 PDK, or something close to that, your end high-volume manufacturing silicon PDKs will be quite different. You really have to adapt to the evolution of PDKs for these advanced nodes. For mature nodes, where the silicon to PDK correlation is extremely high, you are optimizing design based on what you expect to see in silicon. The foundry certainly puts in some pessimism.”
Some techniques can adapt to this level of uncertainty. “Traditional approaches like DVFS and AVS cannot solve this,” says proteanTecs’ Brousard. “They rely on limited visibility and indirect estimates, so they still require large guard bands. While these are good indications of the stress a specific workload applies, it is a second-order indication. Without direct insight into real path delays, you cannot safely remove margin. You cannot optimize what you cannot see.”
Brousard says that real-time silicon feedback systems are required to get rid of the guard bands. “We achieve this by utilizing small-footprint IP integrated across the chip that continuously monitors the margin to timing failure of millions of real logic paths in mission mode,” he explains. “Since the timing margin itself is the ultimate indicator of performance health, monitoring it directly makes the system agnostic to the individual factors causing degradation, be it workload, temperature, aging, or voltage droops.” They measure in real time for each P-state, for each functional workload running on a specific P-state, and even throughout each workload.
Another form of wasted power that performs no useful function is glitches. “This has long been neglected,” says Swinnen. “And it is a significant portion of total power. It is difficult to analyze because it depends on precise timing. Only recently have tools existed that are able to analyze this and reduce it.”
While AI created some of these problems, it also can help solve them. “Optimizing PPA using AI is extremely challenging,” says William Wang, CEO of ChipAgents. “You are tackling major issues like balancing power and area tradeoffs and avoiding reward hacking, but it’s also incredibly promising. Human engineers can only juggle a limited number of factors in power-sensitive design, whereas AI can reason across a much broader context and surface design recommendations that deliver real efficiency gains early in the stack.”
Consider software
Designers may be looking to eke out every power improvement they can, but it can all be for nothing if software asks hardware to perform unnecessary work. “The semiconductor industry is responsible for power consumption and the power envelope, meeting power targets,” says Keysight’s Saif. “But software also needs to think about this. Software is the master in system design, but hardware is the execution engine for the commands that come from software. Software might not be paying as much attention to this pain point as they need to.”
There has to be some degree of hardware/software co-design. “Improving power efficiency is a complicated system hardware and software challenge,” says Steven Woo, fellow and distinguished inventor at Rambus. “To achieve the best application performance and power efficiency, hardware must provide the right acceleration features, and software must be designed to use these features to their fullest capabilities. This can mean re-designing algorithms, refactoring software, and having applications developers be more architecturally aware of system hardware characteristics like cache sizes, DRAM, and storage tiers. Data movement continues to be a major consumer of power, and applications developers need to be aware of potential tradeoffs in storing and retrieving intermediate results versus simply recomputing these results if power can be saved.”
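Woo's store-versus-recompute tradeoff can be framed as a simple energy comparison. The sketch below is hypothetical: the per-FLOP and per-byte energy figures are assumed placeholders (real values depend heavily on process node and memory technology), but the structure of the decision is as described — compare the energy to redo the math against the energy of a memory round trip.

```python
# Hedged sketch (all energy numbers assumed, in picojoules): decide
# whether to recompute an intermediate result or store and reload it,
# on energy grounds alone.

def cheaper_to_recompute(flops_to_recompute, bytes_to_move,
                         pj_per_flop=0.5, pj_per_byte_dram=20.0):
    """True when redoing the math costs less energy than a DRAM round trip.
    DRAM traffic is counted twice: once to write, once to read back."""
    recompute_pj = flops_to_recompute * pj_per_flop
    movement_pj = 2 * bytes_to_move * pj_per_byte_dram
    return recompute_pj < movement_pj

# A small activation tensor: 1 MFLOP to regenerate vs. 1 MB stored (assumed)
print(cheaper_to_recompute(1e6, 1e6))  # recomputing wins at these rates
```

With data movement costing tens of picojoules per byte and arithmetic a fraction of a picojoule per operation, recomputation often wins, which is exactly the architectural awareness Woo argues application developers now need.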
For many years, chipmakers have been asking for software productivity improvements. “If you look back 20 years, a lot of software was programmed using low-level programming languages,” says Andy Heinig, department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “That was time-consuming and painful, but with each level of abstraction we also lose power efficiency. If you look at how software is designed, it’s not very efficient, and we lose a lot of power. It makes it easier to program software. But a lot of efficiency is gone by this type of programming.”
There is little hardware can do to overcome bad software. “Hardware is optimizing the way it executes software commands, but software needs to be more aware of what commands are sent to hardware,” says Saif. “They need to be aware of the downstream challenge of managing the power envelope, keeping within budget. I interact with enough hardware engineers to know their frustration with the software processes.”
Conclusion
There appears to be general agreement within the industry that data movement is expensive, both in terms of performance and power consumption. The only acceptable long-term answer is to significantly reduce the need for it. Today, however, the solutions being created merely optimize the power it consumes. While this is a predictable strategy for the semiconductor industry, it leaves the door open to major disruption in coming years when someone solves the real problem, and that will involve software.