Litepaper: Exa, the next paradigm of sustainable hardware for AI.
Exa introduces a polymorphic computing chip technology, achieving up to 27.6x efficiency gains over the NVIDIA H100 GPU through dynamically reconfigurable hardware that adapts to diverse AI models via software configuration.
August 27, 2024 12:17:16 PM PDT
Tags: launch, announcement, litepaper
This litepaper introduces Exa's polymorphic chip technology, achieving up to 2.3 TFLOPS/W (FP32) at 400 W, which is 27.6x greater than that of the NVIDIA H100 GPU. Our dynamically reconfigurable hardware adapts to computational models via software configuration, addressing growing demands while reducing energy consumption and costs. Supporting diverse approaches, including Kolmogorov–Arnold networks, it represents a new paradigm of computing. Preliminary benchmarks from our latest Learnable Function Unit (LFU) revision (0.4.1) demonstrate significant potential for performance improvements.
1. Introduction: The AI Compute Crisis
The field of artificial intelligence is experiencing unprecedented growth, driving an exponential increase in computational requirements. This surge in demand is fueling rapid advancements in science and technology, pushing the boundaries of what was previously thought possible. However, this progress comes at a significant cost. Current AI infrastructure, primarily based on GPU clusters, is reaching unsustainable levels of energy consumption. High-performance GPU setups often require power in the megawatt range, leading to concerns about long-term viability and environmental impact^{[4]}. As AI continues to advance, this energy crisis threatens to impede progress and potentially lead to a stagnation in technological and scientific discovery.
1.1 Limitations of Classical Computing for AI
Traditional computing architectures, including GPUs, are influenced by the von Neumann model, though GPUs deviate from it in significant ways. This architectural approach, characterized by the separation of processing and memory, leads to performance bottlenecks in data-intensive tasks such as AI computations.
In the context of AI and machine learning, the constant need to shuttle data between memory and processing units consumes significant time and energy, often becoming the primary performance limiter. As AI models grow in size and complexity, these bottlenecks become increasingly pronounced, limiting the ability to scale computational power effectively.
GPUs, while designed for parallel processing, still suffer from memory-related constraints. Constant memory access consumes significant power, and memory bandwidth often becomes a bottleneck. The time required to fetch data from memory can introduce latency, which is particularly problematic for real-time AI applications.
Application-Specific Integrated Circuits (ASICs) represent a broad approach to addressing some of these limitations. ASICs are custom-designed chips tailored for specific computational tasks, offering superior performance and energy efficiency compared to general-purpose processors for their intended applications. However, the current trend of creating new ASICs for specific AI model architectures is problematic. This approach, while potentially yielding short-term performance gains, introduces significant drawbacks:
1. Lack of flexibility: Once manufactured, these model-specific ASICs cannot be altered, making them unable to adapt to evolving AI algorithms and architectures.
2. Rapid obsolescence: As AI research progresses rapidly, model-specific ASICs risk becoming outdated shortly after production.
3. Development costs: Designing and manufacturing new ASICs for each novel AI architecture is extremely costly and time-consuming.
4. Limited applicability: These highly specialized chips often have limited use outside their specific intended application, reducing their overall value and utility.
This inflexibility becomes a significant drawback in the fast-paced field of AI research and development, where new models and techniques emerge frequently.
Furthermore, current hardware solutions, including GPUs and model-specific ASICs, lack the ability to dynamically adapt to the diverse needs of different AI architectures, resulting in suboptimal performance and energy efficiency across varied AI workloads.
2. Exa's Polymorphic Computing Hardware
To address these challenges, Exa has developed a novel polymorphic computing architecture. This approach introduces a model-specific, reconfigurable hardware paradigm that adapts to the unique requirements of each AI model. The key principles include specialization, parallelism and asynchronous computation, dynamic reconfigurability, and simplicity at the foundational level.
2.1 The Learnable Function Unit (LFU)
At the core of Exa's polymorphic computing system is the Learnable Function Unit (LFU). Mathematically, an LFU represents any univariate function:
$$\text{LFU}: \mathbb{R} \to \mathbb{R}$$

In hardware, each LFU is a specialized component designed to approximate any univariate function with high fidelity. The function implemented by an LFU is pre-configured as a hyperparameter, with adjustable parameters for fine-tuning (or, more generally, training).
A key feature of the LFU hardware is its ability to perform its designated function with zero additional operations. Once configured, the LFU operates mostly asynchronously (depending on the configuration), immediately transforming inputs to outputs based on its pre-configured function. This design eliminates the need for traditional instruction fetching and decoding, significantly reducing latency and power consumption.
The LFU hardware operates at FP32 precision, allowing it to approximate a wide variety of functions, including but not limited to Gaussian functions, linear functions, and more complex mathematical operations.
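The hardware realization of an LFU is not described in detail here, but its behavior can be modeled in software. The sketch below (our illustration, not Exa's implementation) treats an LFU as an FP32 piecewise-linear lookup table whose entries are the adjustable parameters:

```python
import numpy as np

class LFU:
    """Illustrative software model of a Learnable Function Unit: a
    univariate FP32 function approximated by a piecewise-linear lookup
    table whose entries are the trainable parameters. (A sketch only;
    the actual hardware realization may differ.)"""

    def __init__(self, fn, lo=-4.0, hi=4.0, n=256):
        self.grid = np.linspace(lo, hi, n, dtype=np.float32)
        self.table = fn(self.grid).astype(np.float32)  # pre-configured function

    def __call__(self, x):
        # Interpolate between stored samples; inputs outside [lo, hi]
        # clamp to the endpoint values.
        return np.float32(np.interp(x, self.grid, self.table))

# Configure an LFU to approximate a Gaussian, one of the cited examples.
gauss = LFU(lambda t: np.exp(-t**2))
```

Once configured, evaluating `gauss(x)` is a single table lookup plus interpolation, with no instruction fetch or decode, mirroring the fixed-function behavior described above.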
2.2 Network Architecture and Flexibility
The network of LFUs in Exa's architecture forms a complex, interconnected system that goes beyond simple directed graphs. LFUs can connect in various ways, including loops and conditional connections, allowing for the implementation of diverse AI architectures. This flexibility enables the creation of structures such as Multi-Layer Perceptrons (MLPs), Kolmogorov–Arnold Networks (KANs)^{[3]}, and other novel architectures.
2.3 Memory and Data Flow
Exa's architecture incorporates a novel approach to memory management. Instead of constantly shuttling data between memory and processing units, the system loads input data once, allowing it to propagate through the highly parallelized AI model. This approach significantly reduces memory access operations, leading to improved energy efficiency and reduced latency.
LFUs can act as accumulators, temporarily storing intermediate results as data flows through the network. This distributed approach to memory allows for efficient handling of complex, multi-step computations without the need for frequent external memory access.
The final output is read once, completing the computation cycle. This single-load, single-read approach, combined with the massively parallel processing capabilities of the LFU network, enables Exa's architecture to achieve high throughput and energy efficiency across a wide range of AI workloads.
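The single-load, single-read cycle can be sketched in a few lines (a conceptual illustration, not Exa's dataflow engine): the input enters the network once, intermediate values stay inside the units as they propagate, and only the final result is read back.

```python
def propagate(stages, values):
    """Conceptual single-load, single-read dataflow: input is loaded
    once, then flows through successive stages of univariate units;
    intermediate results never leave the network."""
    for stage in stages:                          # stage = one layer of units
        values = [unit(v) for unit, v in zip(stage, values)]
    return values                                 # single read at the end

# Two toy stages of univariate units acting element-wise:
out = propagate([[abs, abs], [lambda v: v * 2, lambda v: v + 1]], [-3, 4])
```

Here `out` is `[6, 5]`: each value was loaded once and transformed in place, with no round trips to external memory between stages.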
3. Exa Completeness: A Theoretical Framework
We propose the concept of Exa completeness, a novel theoretical framework that describes the computational capabilities of our polymorphic computing system. Exa completeness is rooted in the ability to compose and approximate any univariate function, which can then be used to construct multivariate functions. This concept is inspired by the Kolmogorov–Arnold representation theorem^{[1]}^{[2]}.
3.1 Formal Definition
Let $A$ be a set isomorphic to $\mathbb{R}$ or a computational approximation thereof (e.g., FP32). Define a system $S = (B, T, P)$ where:
- $B$ is a finite set of basis functions, $\phi: A \to A$
- $T$ is a finite set of transformation functions, $T: B \to (A \to A)$
- $P$ is a finite set of superposition methods, $P: (A \to A)^{*} \to (A \to A)$
Let $\epsilon \geq 0$ represent the desired approximation error for the system.
The system $S$ is considered "Exa complete" if:

1. Univariate Completeness:

$$\forall f \in C(A, A).\; \exists h \in H.\; \sup_{x \in A} |f(x) - h(x)| \leq \epsilon$$

where $H$ is the set of all functions that can be constructed using finite compositions of elements from $B$, $T$, and $P$.

2. Multivariate Extension:

$$\forall m, n \in \mathbb{N}.\; \forall F \in C(A^{m}, A^{n}).\; \exists G \in \mathcal{G}.\; \sup_{x \in A^{m}} \|F(x) - G(x)\| \leq \epsilon$$

where $\mathcal{G}$ is the set of all functions $A^{m} \to A^{n}$ that can be constructed using finite combinations of functions satisfying condition 1.
Here, $C(A, A)$ is the set of all continuous functions from $A$ to $A$, and $C(A^{m}, A^{n})$ is the set of all continuous functions from $A^{m}$ to $A^{n}$. $\|\cdot\|$ denotes an appropriate norm on $A^{n}$.
Note: The system inherently has an approximation error bounded by $\epsilon$. This error can be made arbitrarily small by choosing a sufficiently small $\epsilon$, including $\epsilon = 0$ for exact representation.
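Condition 1 can be checked numerically for a concrete choice of constructible set $H$. In this sketch (our assumption, not part of the formal definition) $H$ is the family of $n$-point piecewise-linear interpolants, and we refine $n$ until the sup-norm error on $\sin$ falls below a target $\epsilon$:

```python
import numpy as np

def sup_error(f, n, lo=0.0, hi=np.pi):
    """Sup-norm error of an n-point piecewise-linear approximant of f,
    estimated on a dense sample of the interval."""
    grid = np.linspace(lo, hi, n)
    xs = np.linspace(lo, hi, 10001)
    return float(np.max(np.abs(f(xs) - np.interp(xs, grid, f(grid)))))

eps, n = 1e-3, 8
while sup_error(np.sin, n) > eps:   # refine until within the target epsilon
    n *= 2
print(n, sup_error(np.sin, n))      # 64 points suffice for eps = 1e-3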
3.2 LFUs and Exa Completeness
In the context of our polymorphic computing system, the LFUs serve as the fundamental building blocks that enable Exa completeness. The set of basis functions $B$ corresponds to the initial configurations of our LFUs, while the transformation functions $T$ represent the reconfiguration capabilities of these units. The superposition methods $P$ are realized through the interconnections and compositions of multiple LFUs.
This formal definition of Exa completeness provides a rigorous foundation for understanding the expressive power of our system. It guarantees that, given sufficient LFUs and appropriate configurations, our polymorphic computing architecture can approximate any continuous function to arbitrary precision.
4. Implementing AI Architectures with Exa
The flexibility of Exa's LFU network allows for efficient implementation of various AI architectures. We'll demonstrate how our system can realize Multi-Layer Perceptrons (MLPs), Kolmogorov–Arnold Networks (KANs)^{[3]}, and more complex structures like transformers with attention mechanisms.
4.1 Realizing MLPs with Exa Complete System
Consider a standard MLP with one hidden layer:
$$\text{MLP}(x) = \sum_{i=1}^{N} w_{i}\, \sigma(a_{i} \cdot x + b_{i})$$

where $\sigma$ is the activation function and $w_{i}$, $a_{i}$, and $b_{i}$ are weights and biases.
In an Exa complete system, we can construct this MLP as follows:
1. Each term $\sigma(a_{i} \cdot x + b_{i})$ can be represented by a composition of LFUs, one for each input dimension and one for the activation function.
2. The weighted sum can be implemented using additional LFUs configured for multiplication and addition.
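This decomposition can be sanity-checked numerically. In the sketch below (ours, not Exa's toolchain), each product $a_{ij} x_{j}$ is treated as a univariate map of $x_{j}$ alone, with $\tanh$ standing in for $\sigma$, and the result is compared against the standard matrix form:

```python
import numpy as np

def mlp_from_univariate(x, A, b, w):
    """One-hidden-layer MLP assembled only from univariate maps and sums."""
    out = 0.0
    for i in range(len(b)):
        # Each a_ij * x_j is a univariate function of x_j alone.
        pre = sum(A[i, j] * x[j] for j in range(len(x))) + b[i]
        out += w[i] * np.tanh(pre)   # activation and scaling: univariate maps
    return float(out)

rng = np.random.default_rng(0)
x, A = rng.normal(size=3), rng.normal(size=(4, 3))
b, w = rng.normal(size=4), rng.normal(size=4)
direct = float(w @ np.tanh(A @ x + b))   # standard matrix formulation
assert np.isclose(mlp_from_univariate(x, A, b, w), direct)
```

The two computations agree to floating-point precision, confirming that the MLP reduces entirely to univariate functions plus additions, exactly the primitives an LFU network provides.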
4.2 Realizing KANs with Exa Complete System
Kolmogorov–Arnold Networks (KANs)^{[3]}, which are based on the Kolmogorov–Arnold representation theorem^{[1]}^{[2]}, can be efficiently implemented in our Exa complete system. The theorem states that any multivariate continuous function can be represented as a superposition of univariate functions:

$$f(x_{1}, \ldots, x_{n}) = \sum_{q=1}^{2n+1} \Phi_{q}\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right)$$

where $\Phi_{q}$ and $\phi_{q,p}$ are continuous univariate functions.
Our LFUs are particularly well-suited for implementing KANs because:
1. Each univariate function $\phi_{q,p}$ and $\Phi_{q}$ can be directly represented by an LFU or a composition of LFUs.
2. The summation operations can be effectively implemented using additional LFUs configured for addition.
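A concrete instance of the superposition idea: multiplication of two positive numbers, which is not univariate, factors entirely into univariate maps plus one addition via $xy = \exp(\log x + \log y)$. The log/exp choice here is a classical textbook example of ours, not a statement about how LFUs are actually configured:

```python
import math

def kan_forward(x, inner, outer):
    """Kolmogorov-Arnold superposition: inner univariate maps phi[q][p]
    feed summations, whose results pass through outer univariate maps Phi[q]."""
    return sum(Phi(sum(phi(xp) for phi, xp in zip(phis, x)))
               for phis, Phi in zip(inner, outer))

def mul_via_univariate(x, y):
    # x * y = exp(log x + log y): only univariate maps and one addition.
    return math.exp(math.log(x) + math.log(y))

# Multiplication expressed in the general KAN form (n = 2, one outer term):
prod = kan_forward([3.0, 4.0], [[math.log, math.log]], [math.exp])
```

Here `prod` is 12.0 (up to float rounding): the inner $\phi$ maps are both $\log$, the outer $\Phi$ is $\exp$, and the only multivariate operation is the summation, matching points 1 and 2 above.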
4.3 Implementing Transformers and Attention Mechanisms
The flexibility of our Exa complete system extends to more complex architectures like transformers and their attention mechanisms. Key components of transformer architectures can be realized through appropriate configurations of LFUs:
1. Softmax Function: Can be implemented using a combination of LFUs to perform exponentiation and normalization.
2. Attention Mechanism: The dot-product attention can be realized using LFUs configured for multiplication, summation, and the softmax operation.
3. Feed-Forward Networks: Similar to MLPs, these can be constructed using LFUs for matrix multiplication and activation functions.
By leveraging the reconfigurability of our LFUs, we can efficiently implement the intricate operations required for transformer architectures, potentially leading to significant performance improvements and energy efficiency gains compared to traditional hardware solutions.
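To make the three components concrete, here is scaled dot-product attention written so the only nonlinear per-element operation is the univariate exponential; everything else is multiplication and summation. This is a reference sketch of the standard computation, not a description of the actual LFU mapping:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # exp: univariate per element
    return e / e.sum(axis=-1, keepdims=True)       # normalization by a sum

def attention(Q, K, V):
    """Scaled dot-product attention: multiplications, summations, softmax."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)   # shape (5, 8)
```

Every step decomposes into the LFU primitives listed above: univariate maps (exponentiation, scaling) plus summations, which is why attention fits the same construction as MLPs and KANs.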
5. Performance Benchmarks and Efficiency Analysis
To evaluate the potential performance gains of our polymorphic computing technology, we conducted a series of simulations focusing on the power consumption and computational efficiency of a single LFU core. These simulations were designed to compare the energy efficiency of our LFU-based system against traditional GPU architectures, specifically the NVIDIA H100.
It's important to note that LFUs don't perform floating-point operations in the conventional sense. As explained earlier in the litepaper, LFUs operate as asynchronous components that transform inputs to outputs based on their pre-configured functions. However, for the purpose of comparison with traditional architectures, we use an equivalent measure of floating-point operations per second (FLOPS) to quantify performance.
The efficiency metric used in our analysis is FLOPS per Watt, which quantifies the computational performance relative to power consumption.
Our simulations covered multiple revisions of the LFU, with the latest being revision 0.4.1. Each revision represents an improvement or change in the LFU design, resulting in different power efficiency and performance.
| LFU revision | Maximum LFU power | Number of LFU cores | Maximum power | Maximum performance (FP32) | Maximum energy performance (FP32) |
|---|---|---|---|---|---|
| 0.4.1 | 127 µW | 3.14 M | 400 W | 945 TFLOPS | 2,362 GFLOPS/W |
| 0.4 | 231 µW | 1.73 M | 400 W | 519 TFLOPS | 1,298 GFLOPS/W |
| 0.1 | 2.04 mW | 196 K | 400 W | 19.6 TFLOPS | 49.02 GFLOPS/W |
Our simulations indicate a significant improvement in this metric for the Exa system compared to the H100 GPU. The relative efficiency gain can be expressed as:
$$\text{Efficiency Gain} = \frac{E_{\text{Exa}}}{E_{\text{GPU}}}$$

where $E_{\text{Exa}}$ and $E_{\text{GPU}}$ represent the energy efficiency (in GFLOPS/W) of the Exa system and the H100 GPU, respectively.
Figure: Performance per Watt comparison (FP32), Exa vs flagship GPUs (2024).
The performanceperwatt ratio for the H100 GPU is documented at approximately 85.7 GFLOPS/W (FP32, 700 W TDP)^{[5]}. Our LFU simulations indicate a maximum efficiency gain of 27.6x relative to the H100 GPU.
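The headline figure follows directly from the table above: revision 0.4.1 delivers 945 TFLOPS in a 400 W envelope, and dividing by the H100's documented FP32 efficiency reproduces the quoted gain:

```python
e_exa = 945e12 / 400        # rev 0.4.1: 945 TFLOPS at 400 W -> 2362.5 GFLOPS/W
e_gpu = 85.7e9              # H100 FP32 at 700 W TDP (documented figure)
gain = e_exa / e_gpu
print(f"{gain:.1f}x")       # 27.6x

# Cross-check: the per-core budget roughly fills the 400 W envelope.
assert abs(127e-6 * 3.14e6 - 400) < 2   # 127 uW x 3.14 M cores = ~398.8 W
```

The same arithmetic applied to revisions 0.4 and 0.1 reproduces their table rows, so the reported gains are internally consistent with the per-core power figures.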
These simulations concentrate on individual LFU cores, which are expected to be the dominant power-consuming components since millions of cores may be active at runtime. This focus on LFU cores constitutes the foundation of our benchmarking methodology. The simulations do not, however, include the power consumption of the LFU interconnect network; this network is expected to remain largely idle during operation, with only active connections consuming negligible power.
We also emphasize that these results are derived from preliminary simulations and are subject to revision as our technology evolves. As our research and development progress, we anticipate further improvements in both energy efficiency and computational capability, potentially surpassing the current benchmarks.
5.1 Areas for Future Improvement
While our current simulations demonstrate significant potential advantages, we acknowledge several areas for continued refinement and optimization:

Model Upload Efficiency: Some complex models may take longer to compile and upload initially. However, unlike GPUs, our hardware executes models much faster after this one-time upload.

Quantization Flexibility: The current system operates at FP32 precision. We are working on making the quantization configurable as well.
On-chip training is possible: one can construct an equivalence training model of the model being trained, though this consumes extra LFU cores. We are working on integrating training functionality into each LFU core instead. Stay tuned for the next update!
These benchmarks mark a major milestone in the development of our new hardware. Our next crucial phase will be to manufacture prototype chips for further benchmarking and testing.
5.2 Open-source SDK & firmware
Exa is committed to providing an open-source software ecosystem to support our polymorphic computing technology. This ecosystem includes the Software Development Kit (SDK), firmware, and tools for framework integration. By making our software open-source, we aim to ensure transparency, encourage community contributions, and facilitate seamless integration with popular AI frameworks such as JAX, PyTorch, Julia's FLUX & LUX, and TinyGrad.
6. Conclusion
Exa's polymorphic computing technology represents a significant advancement in AI hardware design. By addressing the critical issues of energy efficiency and architectural flexibility, we enable the next generation of AI advancements. Our approach offers a more efficient and adaptable solution for AI computation, overcoming many of the limitations imposed by current architectures. The ability to reconfigure our hardware for any AI model through software uploads sets Exa apart from traditional fixedfunction AI accelerators and modelspecific ASICs.
Our commitment to supporting a wide range of AI architectures positions Exa at the forefront of a new era in computational intelligence. As we continue to refine and expand our technology, we invite researchers, developers, and industry partners to join us in exploring the vast potential of polymorphic computing. Together, we can push the boundaries of AI capabilities while ensuring a sustainable and accessible future for computational intelligence, all on a single, versatile hardware platform.
// <3 Exa
References
1. Kolmogorov–Arnold representation theorem (2024; fetched 2024-08-26)
2. A. N. Kolmogorov, "On the representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables" (1956)
3. Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., Tegmark, M., "KAN: Kolmogorov–Arnold Networks" (2024)
4. "How AI Is Fueling a Boom in Data Centers and Energy Demand" (2024; fetched 2024-08-26)
5. NVIDIA Corporation, "NVIDIA H100 Tensor Core GPU Architecture" (2022)
Message signature
BEGIN PGP SIGNED MESSAGE Hash: SHA512 This litepaper introduces Exa's polymorphic chip technology, achieving up to **2.3 TFLOPS/W** (FP32) at **400 W**, which is **27.6x** greater than the NVIDIA H100 GPUs. Our dynamically reconfigurable hardware adapts to computational models via software configuration, addressing growing demands while reducing energy consumption and costs. Supporting diverse approaches, including KolmogorovArnold networks, it represents a new paradigm of computing. Preliminary benchmarks from our latest Learnable Function Unit (LFU) revision (0.4.1) demonstrate significant potential for performance improvements. ## 1. Introduction: The AI Compute Crisis The field of artificial intelligence is experiencing unprecedented growth, driving an exponential increase in computational requirements. This surge in demand is fueling rapid advancements in science and technology, pushing the boundaries of what was previously thought possible. However, this progress comes at a significant cost. Current AI infrastructure, primarily based on GPU clusters, is reaching unsustainable levels of energy consumption. Highperformance GPU setups often require power in the megawatt range, leading to concerns about longterm viability and environmental impact<R i={4} r={metadata.refs} />. As AI continues to advance, this energy crisis threatens to impede progress and potentially lead to a stagnation in technological and scientific discovery. ### 1.1 Limitations of Classical Computing for AI Traditional computing architectures, including GPUs, are influenced by the von Neumann model, though GPUs deviate from it in significant ways. This architectural approach, characterized by the separation of processing and memory, leads to performance bottlenecks in dataintensive tasks such as AI computations. 
In the context of AI and machine learning, the constant need to shuttle data between memory and processing units consumes significant time and energy, often becoming the primary performance limiter. As AI models grow in size and complexity, these bottlenecks become increasingly pronounced, limiting the ability to scale computational power effectively. GPUs, while designed for parallel processing, still suffer from memoryrelated constraints. Constant memory access consumes significant power, and memory bandwidth often becomes a bottleneck. The time required to fetch data from memory can introduce latency, which is particularly problematic for realtime AI applications. ApplicationSpecific Integrated Circuits (ASICs) represent a broad approach to addressing some of these limitations. ASICs are customdesigned chips tailored for specific computational tasks, offering superior performance and energy efficiency compared to generalpurpose processors for their intended applications. However, the current trend of creating new ASICs for specific AI model architectures is problematic. This approach, while potentially yielding shortterm performance gains, introduces significant drawbacks: 1. Lack of flexibility: Once manufactured, these modelspecific ASICs cannot be altered, making them unable to adapt to evolving AI algorithms and architectures. 2. Rapid obsolescence: As AI research progresses rapidly, modelspecific ASICs risk becoming outdated shortly after production. 3. Development costs: Designing and manufacturing new ASICs for each novel AI architecture is extremely costly and timeconsuming. 4. Limited applicability: These highly specialized chips often have limited use outside their specific intended application, reducing their overall value and utility. This inflexibility becomes a significant drawback in the fastpaced field of AI research and development, where new models and techniques emerge frequently. 
Furthermore, current hardware solutions, including GPUs and modelspecific ASICs, lack the ability to dynamically adapt to the diverse needs of different AI architectures, resulting in suboptimal performance and energy efficiency across varied AI workloads. ## 2. Exa's Polymorphic Computing Hardware To address these challenges, Exa has developed a novel polymorphic computing architecture. This approach introduces a modelspecific, reconfigurable hardware paradigm that adapts to the unique requirements of each AI model. The key principles include specialization, parallelism and asynchronous computation, dynamic reconfigurability, and **simplicity at the foundational level**. ### 2.1 The Learnable Function Unit (LFU) At the core of Exa's polymorphic computing system is the Learnable Function Unit (LFU). Mathematically, an LFU represents any univariate function: $$ \text{LFU}: \mathbb{R} \rightarrow \mathbb{R} $$ In hardware, each LFU is a specialized component designed to approximate any univariate function with high fidelity. The function implemented by an LFU is preconfigured as a hyperparameter, with adjustable parameters for finetuning (or generally just training). A key feature of the LFU hardware is its ability to perform its designated function with zero additional operations. Once configured, the LFU operates mostly asynchronously (depending on the configuration), immediately transforming inputs to outputs based on its preconfigured function. This design eliminates the need for traditional instruction fetching and decoding, significantly reducing latency and power consumption. The LFU hardware operates at FP32 precision, allowing it to approximate a wide variety of functions, including but not limited to Gaussian functions, linear functions, and more complex mathematical operations. ### 2.2 Network Architecture and Flexibility The network of LFUs in Exa's architecture forms a complex, interconnected system that goes beyond simple directed graphs. 
LFUs can connect in various ways, including loops and conditional connections, allowing for the implementation of diverse AI architectures. This flexibility enables the creation of structures such as MultiLayer Perceptrons (MLPs), KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, and other novel architectures. ### 2.3 Memory and Data Flow Exa's architecture incorporates a novel approach to memory management. Instead of constantly shuttling data between memory and processing units, the system loads input data once, allowing it to propagate through the highly parallelized AI model. This approach significantly reduces memory access operations, leading to improved energy efficiency and reduced latency. LFUs can act as accumulators, temporarily storing intermediate results as data flows through the network. This distributed approach to memory allows for efficient handling of complex, multistep computations without the need for frequent external memory access. The final output is read once, completing the computation cycle. This singleload, singleread approach, combined with the massively parallel processing capabilities of the LFU network, enables Exa's architecture to achieve high throughput and energy efficiency across a wide range of AI workloads. ## 3. Exa Completeness: A Theoretical Framework We propose the concept of Exa completeness, a novel theoretical framework that describes the computational capabilities of our polymorphic computing system. Exa completeness is rooted in the ability to compose and approximate any univariate function, which can then be used to construct multivariate functions. This concept is inspired by the KolmogorovArnold representation theorem<R i={1} r={metadata.refs} /><R i={2} r={metadata.refs} /> ### 3.1 Formal Definition Let $A$ be a set isomorphic to $\mathbb{R}$ or a computational approximation thereof (e.g., FP32). 
Define a system $S = (B, T, P)$ where:   $B$ is a finite set of basis functions, $\phi : A \to A$   $T$ is a finite set of transformation functions, $T : B \to (A \to A)$   $P$ is a finite set of superposition methods, $P : (A \to A)^* \to (A \to A)$ Let $\epsilon \geq 0$, representing the desired approximation error for the system. The system $S$ is considered ***"Exa complete"*** if: 1. **Univariate Completeness**: $$ \forall f \in C(A, A). \exists h \in H. \sup_{x \in A} f(x)  h(x) \leq \epsilon $$ where $H$ is the set of all functions that can be constructed using finite compositions of elements from $B$, $T$, and $P$. 2. **Multivariate Extension**: $$ \forall m,n \in \mathbb{N}. \forall F \in C(A^m, A^n). \exists G \in G. \sup_{x \in A^m} \F(x)  G(x)\ \leq \epsilon $$ where $G$ is the set of all functions $A^m \to A^n$ that can be constructed using finite combinations of functions satisfying condition 1. Here, $C(A,A)$ is the set of all continuous functions from $A$ to $A$, $C(A^m,A^n)$ is the set of all continuous functions from $A^m$ to $A^n$. $\\cdot\$ denotes an appropriate norm on $A^n$. Note: The system inherently has an approximation error bounded by $\epsilon$. This error can be made arbitrarily small by choosing a sufficiently small $\epsilon$, including $\epsilon = 0$ for exact representation. ### 3.2 LFUs and Exa Completeness In the context of our polymorphic computing system, the LFUs serve as the fundamental building blocks that enable Exa completeness. The set of basis functions $B$ corresponds to the initial configurations of our LFUs, while the transformation functions $T$ represent the reconfiguration capabilities of these units. The superposition methods $P$ are realized through the interconnections and compositions of multiple LFUs. This formal definition of Exa completeness provides a rigorous foundation for understanding the expressive power of our system. 
It guarantees that, given sufficient LFUs and appropriate configurations, our polymorphic computing architecture can approximate any continuous function to arbitrary precision. ## 4. Implementing AI Architectures with Exa The flexibility of Exa's LFU network allows for efficient implementation of various AI architectures. We'll demonstrate how our system can realize MultiLayer Perceptrons (MLPs), KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, and more complex structures like transformers with attention mechanisms. ### 4.1 Realizing MLPs with Exa Complete System Consider a standard MLP with one hidden layer: $$ \text{MLP}(x) = \sum_{i=1}^N w_i \sigma(a_i \cdot x + b_i) $$ where $\sigma$ is the activation function, $w_i$, $a_i$, and $b_i$ are weights and biases. In an Exa complete system, we can construct this MLP as follows: 1. Each term $\sigma(a_i \cdot x + b_i)$ can be represented by a composition of LFUs, one for each input dimension and one for the activation function. 2. The weighted sum can be implemented using additional LFUs configured for multiplication and addition. ### 4.2 Realizing KANs with Exa Complete System KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, which are based on the KolmogorovArnold representation theorem<R i={1} r={metadata.refs} /><R i={2} r={metadata.refs} />, can be efficiently implemented in our Exa complete system. The theorem states that any multivariate continuous function can be represented as a superposition of univariate functions: $$ f(x_1, ..., x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^n \phi_{q,p}(x_p) \right) $$ where $\Phi_q$ and $\phi_{q,p}$ are continuous univariate functions. Our LFUs are particularly wellsuited for implementing KANs because: 1. Each univariate function $\phi_{q,p}$ and $\Phi_q$ can be directly represented by an LFU or a composition of LFUs. 2. The summation operations can be effectively implemented using additional LFUs configured for addition. 
### 4.3 Implementing Transformers and Attention Mechanisms The flexibility of our Exa complete system extends to more complex architectures like transformers and their attention mechanisms. Key components of transformer architectures can be realized through appropriate configurations of LFUs: 1. Softmax Function: Can be implemented using a combination of LFUs to perform exponentiation and normalization. 2. Attention Mechanism: The dotproduct attention can be realized using LFUs configured for multiplication, summation, and the softmax operation. 3. FeedForward Networks: Similar to MLPs, these can be constructed using LFUs for matrix multiplication and activation functions. By leveraging the reconfigurability of our LFUs, we can efficiently implement the intricate operations required for transformer architectures, potentially leading to significant performance improvements and energy efficiency gains compared to traditional hardware solutions. ## 5. Performance Benchmarks and Efficiency Analysis To evaluate the potential performance gains of our polymorphic computing technology, we conducted a series of simulations focusing on the power consumption and computational efficiency of a single LFU core. These simulations were designed to compare the energy efficiency of our LFUbased system against traditional GPU architectures, specifically the NVIDIA H100. It's important to note that LFUs don't perform traditional floatingpoint operations in the conventional sense. As explained earlier in the litepaper, LFUs operate as asynchronous components that transform inputs to outputs based on their preconfigured functions. However, for the purpose of comparison with traditional architectures, we use an equivalent measure of floatingpoint operations per second (FLOPS) to quantify performance. The efficiency metric used in our analysis is FLOPS per Watt, which quantifies the computational performance relative to power consumption. 
Our simulations covered multiple revisions of the LFU, with the latest being revision **0.4.1**. Each revision represents an improvement or change in the LFU design, resulting in different power efficiency and performance. <div className="wfull md:px20 md:flex md:justifycenter my8">  LFU revision  Maximum LFU power  Number of LFU cores  Maximum power  Maximum performance (FP32)  Maximum energy performance (FP32)    **0.4.1**  127 µW  3.14 M  400 W  945 TFLOPS  2 362 GFLOPS/W   0.4  231 µW  1.73 M  400 W  519 TFLOPS  1 298 GFLOPS/W   0.1  2.04 mW  196 K  400 W  19.6 TFLOPS  49.02 GFLOPS/W  </div> Our simulations indicate a significant improvement in this metric for the Exa system compared to the H100 GPU. The relative efficiency gain can be expressed as: $$ \text{Efficiency Gain} = \frac{E_\text{Exa}}{E_\text{GPU}} $$ where $E_\text{Exa}$ and $E_\text{GPU}$ represent the energy efficiency (in GFLOPS/Watt) of the Exa system and the H100 GPU, respectively. <div className="wfull flex justifycenter md:px20 mt12 mb24"> <LFUvsH100Chart className="wfull" /> </div> The performanceperwatt ratio for the H100 GPU is documented at approximately 85.7 GFLOPS/W (FP32, 700 W TDP)<R i={5} r={metadata.refs} />. Our LFU simulations indicate a maximum efficiency gain of **27.6x** relative to the H100 GPU. It is imperative to emphasize that these simulations concentrate primarily on individual LFU cores, which are anticipated to be the dominant powerconsuming components due to the potential activation of millions of such cores during runtime. This focus on LFU cores constitutes the foundation of our benchmarking methodology. The simulations, however, do not encompass the power consumption of the LFU interconnect network. This network is expected to remain largely in an idle state during operation, with only active connections consuming **negligible power**. 
**It is imperative to emphasize that these results are derived from preliminary simulations and are subject to revision as our technology evolves.** As our research and development progresses, we anticipate further improvements in both energy efficiency and computational capability, **potentially surpassing the current benchmarks**.

### 5.1 Areas for Future Improvement

While our current simulations demonstrate significant potential advantages, we acknowledge several areas for continued refinement and optimization:

1. **Model upload efficiency**: Some complex models may take longer to compile and upload initially. Unlike GPUs, however, our hardware executes any model much faster after this one-time upload.
2. **Quantization flexibility**: The current system operates at FP32 precision. We are working on making the quantization configurable as well.

**On-chip training is possible**, as you can create an *equivalence training model* of the model you are training; however, this consumes extra LFU cores. We are working on integrating training functionality into each LFU core instead. Stay tuned for the next update!

**These benchmarks mark a major milestone in the development of our new hardware.** Our next crucial phase will be to manufacture prototype chips for further benchmarking and testing.

### 5.2 Open-source SDK & firmware

Exa is committed to providing an open-source software ecosystem to support our polymorphic computing technology. This ecosystem includes the Software Development Kit (SDK), firmware, and tools for framework integration. By making our software open source, we aim to ensure transparency, encourage community contributions, and facilitate seamless integration with popular AI frameworks such as [JAX](https://github.com/google/jax), [PyTorch](https://github.com/pytorch/pytorch), Julia's [Flux](https://github.com/FluxML/Flux.jl) & [Lux](https://github.com/LuxDL/Lux.jl), and [tinygrad](https://github.com/tinygrad/tinygrad).

## 6. Conclusion

Exa's polymorphic computing technology represents a significant advancement in AI hardware design. By addressing the critical issues of energy efficiency and architectural flexibility, we enable the next generation of AI advancements. Our approach offers a more efficient and adaptable solution for AI computation, overcoming many of the limitations imposed by current architectures. The ability to reconfigure our hardware for any AI model through software uploads sets Exa apart from traditional fixed-function AI accelerators and model-specific ASICs. Our commitment to supporting a wide range of AI architectures positions Exa at the forefront of a new era in computational intelligence.

As we continue to refine and expand our technology, we invite researchers, developers, and industry partners to join us in exploring the vast potential of polymorphic computing. Together, we can push the boundaries of AI capabilities while ensuring a sustainable and accessible future for computational intelligence, all on a single, versatile hardware platform.

// <3 Exa

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQTqH4Le1ZB+MwLlm9jtpSZBtyZXuAUCZs4mPAAKCRDtpSZBtyZX
uKWWAQD9j2Artat4W4T8TWDTOxxlYQmOP+xC1sF5HfRNsuZd9wD/Yqiq6PMa3NaY
RiKT3/EoqD1x4mrcgYb3vAiDhF8fpQ8=
=VZs4
-----END PGP SIGNATURE-----
Sm9pbiB1cy4K
XADJL5QSZECI6H6X5B6O4HTUOKITGGUKMEJSEKVOI47K7V7DXZXBSTFTF55PRWABRWAQI6YCGCGC TX4MSHRKFYIZMZWAZE7LHGAQTY7T7MBU65SCH5UDQDRBKYBAGP6347ZPNMUXLYGENNPQ3KDSW2RD SGYPFG46NETD6LXUIK5SWH7T7LVJ36NKAIKSIAA3QKUQOATF7WVT2FHL6VXR5PZTMFYDLQ3XSFAM OHSQU4KMKKHALQK4TVHQGFJH2QYXK5TQUDPDFZMD3TMMJYY4R4JLEPM53OVNRLRMADGEHGYCLWKJ 3V3CV37XVKA5UYVRTR2U3PT2DSVT4FEA34ZJVWTFN4A4476GLPE4QH7ROXKU2NPIIO3NYTV6K7F4 7EA556QBTF3JTW3DZMP5UZFAKGHC6IGXKI2MT3TCFVLB6BL7HZWG23VFNQ7F7VTDXSQ7F6XY4Y7Z HXXA3ROJKD5IU4OEOLJTAXCPREJRXJH24NOMFEVKTLMW4VXLYQAH3TTUCFJ6VNRXGZOOKVADHMHD E7L4MP5SD743HDLMGN6CGJE52MYG5ZSNIAUPWLKEI64WJXQ6KIZFXHBWK5BKZZOAWBHICYUS52PS CJQ27NGKIAE7OZDPRWO4K3ZKLMXWNUADNLE7AUQQPFWOCQ7XAQQS4TQXDH6YKWZG2KOVBW5BXVNX NX63HANMJNESRGSZKCUCVAVL6TKZSKZE3XODY22P2NN35DOJYRKXDSYFJHSGGIZOSHMN4CJN42PO LIZW2P4YE2FFI7I7QSJ4MMDRT5QNBPQRYKPG4UENL5LAZ3F5BKI2B3EOGIHH6KQTRYSWKOV2TFPQ 6IB4V6MHUDBY5BJ3DULOYVK3B76L66B5M4WLZAJCIMV7GQAEDXXQU45ULL3LOBXFMB335BYKVZH4 ISLYPH7XIARPQ655S7RYGPWMW6TAX4LLM6MLMZFFRTVPMIRYIDJSX5R2SKAZAGVLWCUUSPJDH7HQ SB4EDMM7GH5BL22PCW3DAXKLWRHU4O233GPVWECQCWMIKKQVLSJUV43T7ZQRVELMO4XVLKFYZ2P6 FILJQBPXRE2U6ZLYRD26C2GYTQM52CIV7TPJVGVPMODKSY6YBR6I5YE7NFJTUT7ZT2HO3KMTW52L ABYL3Q52GPYXKKJ5A2LK2MC22I7MH252XPH3ZQQS7SFAJPECYQCFFZ3ZA5OEZICAGMX3C2AOXM3Z ZAGC6FTARFJPB4QD33ZDJM6W6RWEDKHJBJPGFXSHG5XX42IDMUVMTLBSDXTQ3ODOFZ3Z5O4ADJDH LU7HGT3INFVUPWCCGJ3UFHM7GAFA5ZZ24ZJNZTB24XIG2BPGS2TUJN2MWIGUMWEKP65SSAB2EITW GIH34NHFLAZWLB2KUCG73IR5KR3KJC2F3D2WVO33QVYX3UI6XTEONYGGZBJKJNACDKXNRWJJCPBS XQNAUVM5IZJIJE6KMQAKSGHOK67U442GTW5DUZ3DEUVUXEFGT2ACXD2BCVPWFFOB27VTZYSEZGE6 MMCS2RVPTXIDT77AJGSLPCZP5Y5Z5QSHDMNZTMLCKV======