Litepaper: Exa, the next paradigm of sustainable hardware for AI.
Exa introduces a polymorphic computing chip technology, achieving up to 27.6x efficiency gains over the NVIDIA H100 GPU through dynamically reconfigurable hardware that adapts to diverse AI models via software configuration.
August 27, 2024 12:17:16 PM PDT
Tags: launch, announcement, litepaper
This litepaper introduces Exa's polymorphic chip technology, achieving up to 2.3 TFLOPS/W (FP32) at 400 W, which is 27.6x greater than that of the NVIDIA H100 GPU. Our dynamically reconfigurable hardware adapts to computational models via software configuration, addressing growing demands while reducing energy consumption and costs. Supporting diverse approaches, including Kolmogorov–Arnold networks, it represents a new paradigm of computing. Preliminary benchmarks from our latest Learnable Function Unit (LFU) revision (0.4.1) demonstrate significant potential for performance improvements.
1. Introduction: The AI Compute Crisis
The field of artificial intelligence is experiencing unprecedented growth, driving an exponential increase in computational requirements. This surge in demand is fueling rapid advancements in science and technology, pushing the boundaries of what was previously thought possible. However, this progress comes at a significant cost. Current AI infrastructure, primarily based on GPU clusters, is reaching unsustainable levels of energy consumption. High-performance GPU setups often require power in the megawatt range, leading to concerns about long-term viability and environmental impact^{[4]}. As AI continues to advance, this energy crisis threatens to impede progress and potentially lead to a stagnation in technological and scientific discovery.
1.1 Limitations of Classical Computing for AI
Traditional computing architectures, including GPUs, are influenced by the von Neumann model, though GPUs deviate from it in significant ways. This architectural approach, characterized by the separation of processing and memory, leads to performance bottlenecks in data-intensive tasks such as AI computations.
In the context of AI and machine learning, the constant need to shuttle data between memory and processing units consumes significant time and energy, often becoming the primary performance limiter. As AI models grow in size and complexity, these bottlenecks become increasingly pronounced, limiting the ability to scale computational power effectively.
GPUs, while designed for parallel processing, still suffer from memory-related constraints. Constant memory access consumes significant power, and memory bandwidth often becomes a bottleneck. The time required to fetch data from memory can introduce latency, which is particularly problematic for real-time AI applications.
Application-Specific Integrated Circuits (ASICs) represent a broad approach to addressing some of these limitations. ASICs are custom-designed chips tailored for specific computational tasks, offering superior performance and energy efficiency compared to general-purpose processors for their intended applications. However, the current trend of creating new ASICs for specific AI model architectures is problematic. This approach, while potentially yielding short-term performance gains, introduces significant drawbacks:
1. Lack of flexibility: Once manufactured, these model-specific ASICs cannot be altered, making them unable to adapt to evolving AI algorithms and architectures.
2. Rapid obsolescence: As AI research progresses rapidly, model-specific ASICs risk becoming outdated shortly after production.
3. Development costs: Designing and manufacturing new ASICs for each novel AI architecture is extremely costly and time-consuming.
4. Limited applicability: These highly specialized chips often have limited use outside their specific intended application, reducing their overall value and utility.
This inflexibility becomes a significant drawback in the fast-paced field of AI research and development, where new models and techniques emerge frequently.
Furthermore, current hardware solutions, including GPUs and model-specific ASICs, lack the ability to dynamically adapt to the diverse needs of different AI architectures, resulting in suboptimal performance and energy efficiency across varied AI workloads.
2. Exa's Polymorphic Computing Hardware
To address these challenges, Exa has developed a novel polymorphic computing architecture. This approach introduces a model-specific, reconfigurable hardware paradigm that adapts to the unique requirements of each AI model. The key principles include specialization, parallelism and asynchronous computation, dynamic reconfigurability, and simplicity at the foundational level.
2.1 The Learnable Function Unit (LFU)
At the core of Exa's polymorphic computing system is the Learnable Function Unit (LFU). Mathematically, an LFU represents any univariate function:
$$\text{LFU}: \mathbb{R} \to \mathbb{R}$$

In hardware, each LFU is a specialized component designed to approximate any univariate function with high fidelity. The function implemented by an LFU is pre-configured as a hyperparameter, with adjustable parameters for fine-tuning (or, more generally, training).
A key feature of the LFU hardware is its ability to perform its designated function with zero additional operations. Once configured, the LFU operates mostly asynchronously (depending on the configuration), immediately transforming inputs to outputs based on its pre-configured function. This design eliminates the need for traditional instruction fetching and decoding, significantly reducing latency and power consumption.
The LFU hardware operates at FP32 precision, allowing it to approximate a wide variety of functions, including but not limited to Gaussian functions, linear functions, and more complex mathematical operations.
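The hardware realization of an LFU is not described in detail here, but its behavior can be modeled in software. The sketch below (our illustration, not Exa's implementation) treats an LFU as an FP32 piecewise-linear lookup table whose entries are the adjustable parameters:

```python
import numpy as np

class LFU:
    """Illustrative software model of a Learnable Function Unit: a
    univariate FP32 function approximated by a piecewise-linear lookup
    table whose entries are the trainable parameters. (A sketch only;
    the actual hardware realization may differ.)"""

    def __init__(self, fn, lo=-4.0, hi=4.0, n=256):
        self.grid = np.linspace(lo, hi, n, dtype=np.float32)
        self.table = fn(self.grid).astype(np.float32)  # pre-configured function

    def __call__(self, x):
        # Interpolate between stored samples; inputs outside [lo, hi]
        # clamp to the endpoint values.
        return np.float32(np.interp(x, self.grid, self.table))

# Configure an LFU to approximate a Gaussian, one of the cited examples.
gauss = LFU(lambda t: np.exp(-t**2))
```

Once configured, evaluating `gauss(x)` is a single table lookup plus interpolation, with no instruction fetch or decode, mirroring the fixed-function behavior described above.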
2.2 Network Architecture and Flexibility
The network of LFUs in Exa's architecture forms a complex, interconnected system that goes beyond simple directed graphs. LFUs can connect in various ways, including loops and conditional connections, allowing for the implementation of diverse AI architectures. This flexibility enables the creation of structures such as Multi-Layer Perceptrons (MLPs), Kolmogorov–Arnold Networks (KANs)^{[3]}, and other novel architectures.
2.3 Memory and Data Flow
Exa's architecture incorporates a novel approach to memory management. Instead of constantly shuttling data between memory and processing units, the system loads input data once, allowing it to propagate through the highly parallelized AI model. This approach significantly reduces memory access operations, leading to improved energy efficiency and reduced latency.
LFUs can act as accumulators, temporarily storing intermediate results as data flows through the network. This distributed approach to memory allows for efficient handling of complex, multi-step computations without the need for frequent external memory access.
The final output is read once, completing the computation cycle. This single-load, single-read approach, combined with the massively parallel processing capabilities of the LFU network, enables Exa's architecture to achieve high throughput and energy efficiency across a wide range of AI workloads.
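The single-load, single-read cycle can be sketched in a few lines (a conceptual illustration, not Exa's dataflow engine): the input enters the network once, intermediate values stay inside the units as they propagate, and only the final result is read back.

```python
def propagate(stages, values):
    """Conceptual single-load, single-read dataflow: input is loaded
    once, then flows through successive stages of univariate units;
    intermediate results never leave the network."""
    for stage in stages:                          # stage = one layer of units
        values = [unit(v) for unit, v in zip(stage, values)]
    return values                                 # single read at the end

# Two toy stages of univariate units acting element-wise:
out = propagate([[abs, abs], [lambda v: v * 2, lambda v: v + 1]], [-3, 4])
```

Here `out` is `[6, 5]`: each value was loaded once and transformed in place, with no round trips to external memory between stages.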
3. Exa Completeness: A Theoretical Framework
We propose the concept of Exa completeness, a novel theoretical framework that describes the computational capabilities of our polymorphic computing system. Exa completeness is rooted in the ability to compose and approximate any univariate function, which can then be used to construct multivariate functions. This concept is inspired by the Kolmogorov–Arnold representation theorem^{[1]}^{[2]}.
3.1 Formal Definition
Let $A$ be a set isomorphic to $\mathbb{R}$ or a computational approximation thereof (e.g., FP32). Define a system $S = (B, T, P)$ where:
- $B$ is a finite set of basis functions, $\phi: A \to A$
- $T$ is a finite set of transformation functions, $T: B \to (A \to A)$
- $P$ is a finite set of superposition methods, $P: (A \to A)^{*} \to (A \to A)$
Let $\epsilon \geq 0$ represent the desired approximation error for the system.
The system $S$ is considered "Exa complete" if:

1. Univariate Completeness:

$$\forall f \in C(A, A).\; \exists h \in H.\; \sup_{x \in A} |f(x) - h(x)| \leq \epsilon$$

where $H$ is the set of all functions that can be constructed using finite compositions of elements from $B$, $T$, and $P$.

2. Multivariate Extension:

$$\forall m, n \in \mathbb{N}.\; \forall F \in C(A^{m}, A^{n}).\; \exists G \in \mathcal{G}.\; \sup_{x \in A^{m}} \|F(x) - G(x)\| \leq \epsilon$$

where $\mathcal{G}$ is the set of all functions $A^{m} \to A^{n}$ that can be constructed using finite combinations of functions satisfying condition 1.
Here, $C(A, A)$ is the set of all continuous functions from $A$ to $A$, and $C(A^{m}, A^{n})$ is the set of all continuous functions from $A^{m}$ to $A^{n}$. $\|\cdot\|$ denotes an appropriate norm on $A^{n}$.
Note: The system inherently has an approximation error bounded by $\epsilon$. This error can be made arbitrarily small by choosing a sufficiently small $\epsilon$, including $\epsilon = 0$ for exact representation.
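Condition 1 can be checked numerically for a concrete choice of constructible set $H$. In this sketch (our assumption, not part of the formal definition) $H$ is the family of $n$-point piecewise-linear interpolants, and we refine $n$ until the sup-norm error on $\sin$ falls below a target $\epsilon$:

```python
import numpy as np

def sup_error(f, n, lo=0.0, hi=np.pi):
    """Sup-norm error of an n-point piecewise-linear approximant of f,
    estimated on a dense sample of the interval."""
    grid = np.linspace(lo, hi, n)
    xs = np.linspace(lo, hi, 10001)
    return float(np.max(np.abs(f(xs) - np.interp(xs, grid, f(grid)))))

eps, n = 1e-3, 8
while sup_error(np.sin, n) > eps:   # refine until within the target epsilon
    n *= 2
print(n, sup_error(np.sin, n))      # 64 points suffice for eps = 1e-3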
3.2 LFUs and Exa Completeness
In the context of our polymorphic computing system, the LFUs serve as the fundamental building blocks that enable Exa completeness. The set of basis functions $B$ corresponds to the initial configurations of our LFUs, while the transformation functions $T$ represent the reconfiguration capabilities of these units. The superposition methods $P$ are realized through the interconnections and compositions of multiple LFUs.
This formal definition of Exa completeness provides a rigorous foundation for understanding the expressive power of our system. It guarantees that, given sufficient LFUs and appropriate configurations, our polymorphic computing architecture can approximate any continuous function to arbitrary precision.
4. Implementing AI Architectures with Exa
The flexibility of Exa's LFU network allows for efficient implementation of various AI architectures. We'll demonstrate how our system can realize Multi-Layer Perceptrons (MLPs), Kolmogorov–Arnold Networks (KANs)^{[3]}, and more complex structures like transformers with attention mechanisms.
4.1 Realizing MLPs with Exa Complete System
Consider a standard MLP with one hidden layer:
$$\text{MLP}(x) = \sum_{i=1}^{N} w_{i}\, \sigma(a_{i} \cdot x + b_{i})$$

where $\sigma$ is the activation function and $w_{i}$, $a_{i}$, and $b_{i}$ are weights and biases.
In an Exa complete system, we can construct this MLP as follows:
1. Each term $\sigma(a_{i} \cdot x + b_{i})$ can be represented by a composition of LFUs, one for each input dimension and one for the activation function.
2. The weighted sum can be implemented using additional LFUs configured for multiplication and addition.
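This decomposition can be sanity-checked numerically. In the sketch below (ours, not Exa's toolchain), each product $a_{ij} x_{j}$ is treated as a univariate map of $x_{j}$ alone, with $\tanh$ standing in for $\sigma$, and the result is compared against the standard matrix form:

```python
import numpy as np

def mlp_from_univariate(x, A, b, w):
    """One-hidden-layer MLP assembled only from univariate maps and sums."""
    out = 0.0
    for i in range(len(b)):
        # Each a_ij * x_j is a univariate function of x_j alone.
        pre = sum(A[i, j] * x[j] for j in range(len(x))) + b[i]
        out += w[i] * np.tanh(pre)   # activation and scaling: univariate maps
    return float(out)

rng = np.random.default_rng(0)
x, A = rng.normal(size=3), rng.normal(size=(4, 3))
b, w = rng.normal(size=4), rng.normal(size=4)
direct = float(w @ np.tanh(A @ x + b))   # standard matrix formulation
assert np.isclose(mlp_from_univariate(x, A, b, w), direct)
```

The two computations agree to floating-point precision, confirming that the MLP reduces entirely to univariate functions plus additions, exactly the primitives an LFU network provides.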
4.2 Realizing KANs with Exa Complete System
Kolmogorov–Arnold Networks (KANs)^{[3]}, which are based on the Kolmogorov–Arnold representation theorem^{[1]}^{[2]}, can be efficiently implemented in our Exa complete system. The theorem states that any multivariate continuous function can be represented as a superposition of univariate functions:

$$f(x_{1}, \ldots, x_{n}) = \sum_{q=1}^{2n+1} \Phi_{q}\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right)$$

where $\Phi_{q}$ and $\phi_{q,p}$ are continuous univariate functions.
Our LFUs are particularly well-suited for implementing KANs because:
1. Each univariate function $\phi_{q,p}$ and $\Phi_{q}$ can be directly represented by an LFU or a composition of LFUs.
2. The summation operations can be effectively implemented using additional LFUs configured for addition.
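A concrete instance of the superposition idea: multiplication of two positive numbers, which is not univariate, factors entirely into univariate maps plus one addition via $xy = \exp(\log x + \log y)$. The log/exp choice here is a classical textbook example of ours, not a statement about how LFUs are actually configured:

```python
import math

def kan_forward(x, inner, outer):
    """Kolmogorov-Arnold superposition: inner univariate maps phi[q][p]
    feed summations, whose results pass through outer univariate maps Phi[q]."""
    return sum(Phi(sum(phi(xp) for phi, xp in zip(phis, x)))
               for phis, Phi in zip(inner, outer))

def mul_via_univariate(x, y):
    # x * y = exp(log x + log y): only univariate maps and one addition.
    return math.exp(math.log(x) + math.log(y))

# Multiplication expressed in the general KAN form (n = 2, one outer term):
prod = kan_forward([3.0, 4.0], [[math.log, math.log]], [math.exp])
```

Here `prod` is 12.0 (up to float rounding): the inner $\phi$ maps are both $\log$, the outer $\Phi$ is $\exp$, and the only multivariate operation is the summation, matching points 1 and 2 above.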
4.3 Implementing Transformers and Attention Mechanisms
The flexibility of our Exa complete system extends to more complex architectures like transformers and their attention mechanisms. Key components of transformer architectures can be realized through appropriate configurations of LFUs:
1. Softmax Function: Can be implemented using a combination of LFUs to perform exponentiation and normalization.
2. Attention Mechanism: The dot-product attention can be realized using LFUs configured for multiplication, summation, and the softmax operation.
3. Feed-Forward Networks: Similar to MLPs, these can be constructed using LFUs for matrix multiplication and activation functions.
By leveraging the reconfigurability of our LFUs, we can efficiently implement the intricate operations required for transformer architectures, potentially leading to significant performance improvements and energy efficiency gains compared to traditional hardware solutions.
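To make the three components concrete, here is scaled dot-product attention written so the only nonlinear per-element operation is the univariate exponential; everything else is multiplication and summation. This is a reference sketch of the standard computation, not a description of the actual LFU mapping:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # exp: univariate per element
    return e / e.sum(axis=-1, keepdims=True)       # normalization by a sum

def attention(Q, K, V):
    """Scaled dot-product attention: multiplications, summations, softmax."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)   # shape (5, 8)
```

Every step decomposes into the LFU primitives listed above: univariate maps (exponentiation, scaling) plus summations, which is why attention fits the same construction as MLPs and KANs.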
5. Performance Benchmarks and Efficiency Analysis
To evaluate the potential performance gains of our polymorphic computing technology, we conducted a series of simulations focusing on the power consumption and computational efficiency of a single LFU core. These simulations were designed to compare the energy efficiency of our LFU-based system against traditional GPU architectures, specifically the NVIDIA H100.
It's important to note that LFUs don't perform floating-point operations in the conventional sense. As explained earlier in the litepaper, LFUs operate as asynchronous components that transform inputs to outputs based on their pre-configured functions. However, for the purpose of comparison with traditional architectures, we use an equivalent measure of floating-point operations per second (FLOPS) to quantify performance.
The efficiency metric used in our analysis is FLOPS per Watt, which quantifies the computational performance relative to power consumption.
Our simulations covered multiple revisions of the LFU, with the latest being revision 0.4.1. Each revision represents an improvement or change in the LFU design, resulting in different power efficiency and performance.
| LFU revision | Maximum LFU power | Number of LFU cores | Maximum power | Maximum performance (FP32) | Maximum energy performance (FP32) |
|---|---|---|---|---|---|
| 0.4.1 | 127 µW | 3.14 M | 400 W | 945 TFLOPS | 2,362 GFLOPS/W |
| 0.4 | 231 µW | 1.73 M | 400 W | 519 TFLOPS | 1,298 GFLOPS/W |
| 0.1 | 2.04 mW | 196 K | 400 W | 19.6 TFLOPS | 49.02 GFLOPS/W |
Our simulations indicate a significant improvement in this metric for the Exa system compared to the H100 GPU. The relative efficiency gain can be expressed as:
$$\text{Efficiency Gain} = \frac{E_{\text{Exa}}}{E_{\text{GPU}}}$$

where $E_{\text{Exa}}$ and $E_{\text{GPU}}$ represent the energy efficiency (in GFLOPS/W) of the Exa system and the H100 GPU, respectively.
Figure: Performance per Watt comparison (FP32), Exa vs flagship GPUs (2024).
The performanceperwatt ratio for the H100 GPU is documented at approximately 85.7 GFLOPS/W (FP32, 700 W TDP)^{[5]}. Our LFU simulations indicate a maximum efficiency gain of 27.6x relative to the H100 GPU.
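The headline figure follows directly from the table above: revision 0.4.1 delivers 945 TFLOPS in a 400 W envelope, and dividing by the H100's documented FP32 efficiency reproduces the quoted gain:

```python
e_exa = 945e12 / 400        # rev 0.4.1: 945 TFLOPS at 400 W -> 2362.5 GFLOPS/W
e_gpu = 85.7e9              # H100 FP32 at 700 W TDP (documented figure)
gain = e_exa / e_gpu
print(f"{gain:.1f}x")       # 27.6x

# Cross-check: the per-core budget roughly fills the 400 W envelope.
assert abs(127e-6 * 3.14e6 - 400) < 2   # 127 uW x 3.14 M cores = ~398.8 W
```

The same arithmetic applied to revisions 0.4 and 0.1 reproduces their table rows, so the reported gains are internally consistent with the per-core power figures.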
These simulations concentrate on individual LFU cores, which are expected to be the dominant power-consuming components since millions of cores may be active at runtime. This focus on LFU cores constitutes the foundation of our benchmarking methodology. The simulations do not, however, include the power consumption of the LFU interconnect network; this network is expected to remain largely idle during operation, with only active connections consuming negligible power.
We also emphasize that these results are derived from preliminary simulations and are subject to revision as our technology evolves. As our research and development progress, we anticipate further improvements in both energy efficiency and computational capability, potentially surpassing the current benchmarks.
5.1 Areas for Future Improvement
While our current simulations demonstrate significant potential advantages, we acknowledge several areas for continued refinement and optimization:

Model Upload Efficiency: Some complex models may take longer to compile and upload initially. However, unlike GPUs, our hardware executes models much faster after this one-time upload.

Quantization Flexibility: The current system operates at FP32 precision. We are working on making the quantization configurable as well.
On-chip training is possible: one can construct an equivalence training model of the model being trained, though this consumes extra LFU cores. We are working on integrating training functionality into each LFU core instead. Stay tuned for the next update!
These benchmarks mark a major milestone in the development of our new hardware. Our next crucial phase will be to manufacture prototype chips for further benchmarking and testing.
5.2 Open-source SDK & firmware
Exa is committed to providing an open-source software ecosystem to support our polymorphic computing technology. This ecosystem includes the Software Development Kit (SDK), firmware, and tools for framework integration. By making our software open-source, we aim to ensure transparency, encourage community contributions, and facilitate seamless integration with popular AI frameworks such as JAX, PyTorch, Julia's FLUX & LUX, and TinyGrad.
6. Conclusion
Exa's polymorphic computing technology represents a significant advancement in AI hardware design. By addressing the critical issues of energy efficiency and architectural flexibility, we enable the next generation of AI advancements. Our approach offers a more efficient and adaptable solution for AI computation, overcoming many of the limitations imposed by current architectures. The ability to reconfigure our hardware for any AI model through software uploads sets Exa apart from traditional fixedfunction AI accelerators and modelspecific ASICs.
Our commitment to supporting a wide range of AI architectures positions Exa at the forefront of a new era in computational intelligence. As we continue to refine and expand our technology, we invite researchers, developers, and industry partners to join us in exploring the vast potential of polymorphic computing. Together, we can push the boundaries of AI capabilities while ensuring a sustainable and accessible future for computational intelligence, all on a single, versatile hardware platform.
// <3 Exa
References
1. Kolmogorov–Arnold representation theorem (2024; fetched 2024-08-26)
2. A. N. Kolmogorov, "On the representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables" (1956)
3. Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., Tegmark, M., "KAN: Kolmogorov–Arnold Networks" (2024)
4. "How AI Is Fueling a Boom in Data Centers and Energy Demand" (2024; fetched 2024-08-26)
5. NVIDIA Corporation, "NVIDIA H100 Tensor Core GPU Architecture" (2022)
Message signature
BEGIN PGP SIGNED MESSAGE Hash: SHA512 This litepaper introduces Exa's polymorphic chip technology, achieving up to **2.3 TFLOPS/W** (FP32) at **400 W**, which is **27.6x** greater than the NVIDIA H100 GPUs. Our dynamically reconfigurable hardware adapts to computational models via software configuration, addressing growing demands while reducing energy consumption and costs. Supporting diverse approaches, including KolmogorovArnold networks, it represents a new paradigm of computing. Preliminary benchmarks from our latest Learnable Function Unit (LFU) revision (0.4.1) demonstrate significant potential for performance improvements. ## 1. Introduction: The AI Compute Crisis The field of artificial intelligence is experiencing unprecedented growth, driving an exponential increase in computational requirements. This surge in demand is fueling rapid advancements in science and technology, pushing the boundaries of what was previously thought possible. However, this progress comes at a significant cost. Current AI infrastructure, primarily based on GPU clusters, is reaching unsustainable levels of energy consumption. Highperformance GPU setups often require power in the megawatt range, leading to concerns about longterm viability and environmental impact<R i={4} r={metadata.refs} />. As AI continues to advance, this energy crisis threatens to impede progress and potentially lead to a stagnation in technological and scientific discovery. ### 1.1 Limitations of Classical Computing for AI Traditional computing architectures, including GPUs, are influenced by the von Neumann model, though GPUs deviate from it in significant ways. This architectural approach, characterized by the separation of processing and memory, leads to performance bottlenecks in dataintensive tasks such as AI computations. 
In the context of AI and machine learning, the constant need to shuttle data between memory and processing units consumes significant time and energy, often becoming the primary performance limiter. As AI models grow in size and complexity, these bottlenecks become increasingly pronounced, limiting the ability to scale computational power effectively. GPUs, while designed for parallel processing, still suffer from memoryrelated constraints. Constant memory access consumes significant power, and memory bandwidth often becomes a bottleneck. The time required to fetch data from memory can introduce latency, which is particularly problematic for realtime AI applications. ApplicationSpecific Integrated Circuits (ASICs) represent a broad approach to addressing some of these limitations. ASICs are customdesigned chips tailored for specific computational tasks, offering superior performance and energy efficiency compared to generalpurpose processors for their intended applications. However, the current trend of creating new ASICs for specific AI model architectures is problematic. This approach, while potentially yielding shortterm performance gains, introduces significant drawbacks: 1. Lack of flexibility: Once manufactured, these modelspecific ASICs cannot be altered, making them unable to adapt to evolving AI algorithms and architectures. 2. Rapid obsolescence: As AI research progresses rapidly, modelspecific ASICs risk becoming outdated shortly after production. 3. Development costs: Designing and manufacturing new ASICs for each novel AI architecture is extremely costly and timeconsuming. 4. Limited applicability: These highly specialized chips often have limited use outside their specific intended application, reducing their overall value and utility. This inflexibility becomes a significant drawback in the fastpaced field of AI research and development, where new models and techniques emerge frequently. 
Furthermore, current hardware solutions, including GPUs and modelspecific ASICs, lack the ability to dynamically adapt to the diverse needs of different AI architectures, resulting in suboptimal performance and energy efficiency across varied AI workloads. ## 2. Exa's Polymorphic Computing Hardware To address these challenges, Exa has developed a novel polymorphic computing architecture. This approach introduces a modelspecific, reconfigurable hardware paradigm that adapts to the unique requirements of each AI model. The key principles include specialization, parallelism and asynchronous computation, dynamic reconfigurability, and **simplicity at the foundational level**. ### 2.1 The Learnable Function Unit (LFU) At the core of Exa's polymorphic computing system is the Learnable Function Unit (LFU). Mathematically, an LFU represents any univariate function: $$ \text{LFU}: \mathbb{R} \rightarrow \mathbb{R} $$ In hardware, each LFU is a specialized component designed to approximate any univariate function with high fidelity. The function implemented by an LFU is preconfigured as a hyperparameter, with adjustable parameters for finetuning (or generally just training). A key feature of the LFU hardware is its ability to perform its designated function with zero additional operations. Once configured, the LFU operates mostly asynchronously (depending on the configuration), immediately transforming inputs to outputs based on its preconfigured function. This design eliminates the need for traditional instruction fetching and decoding, significantly reducing latency and power consumption. The LFU hardware operates at FP32 precision, allowing it to approximate a wide variety of functions, including but not limited to Gaussian functions, linear functions, and more complex mathematical operations. ### 2.2 Network Architecture and Flexibility The network of LFUs in Exa's architecture forms a complex, interconnected system that goes beyond simple directed graphs. 
LFUs can connect in various ways, including loops and conditional connections, allowing for the implementation of diverse AI architectures. This flexibility enables the creation of structures such as MultiLayer Perceptrons (MLPs), KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, and other novel architectures. ### 2.3 Memory and Data Flow Exa's architecture incorporates a novel approach to memory management. Instead of constantly shuttling data between memory and processing units, the system loads input data once, allowing it to propagate through the highly parallelized AI model. This approach significantly reduces memory access operations, leading to improved energy efficiency and reduced latency. LFUs can act as accumulators, temporarily storing intermediate results as data flows through the network. This distributed approach to memory allows for efficient handling of complex, multistep computations without the need for frequent external memory access. The final output is read once, completing the computation cycle. This singleload, singleread approach, combined with the massively parallel processing capabilities of the LFU network, enables Exa's architecture to achieve high throughput and energy efficiency across a wide range of AI workloads. ## 3. Exa Completeness: A Theoretical Framework We propose the concept of Exa completeness, a novel theoretical framework that describes the computational capabilities of our polymorphic computing system. Exa completeness is rooted in the ability to compose and approximate any univariate function, which can then be used to construct multivariate functions. This concept is inspired by the KolmogorovArnold representation theorem<R i={1} r={metadata.refs} /><R i={2} r={metadata.refs} /> ### 3.1 Formal Definition Let $A$ be a set isomorphic to $\mathbb{R}$ or a computational approximation thereof (e.g., FP32). 
Define a system $S = (B, T, P)$ where:   $B$ is a finite set of basis functions, $\phi : A \to A$   $T$ is a finite set of transformation functions, $T : B \to (A \to A)$   $P$ is a finite set of superposition methods, $P : (A \to A)^* \to (A \to A)$ Let $\epsilon \geq 0$, representing the desired approximation error for the system. The system $S$ is considered ***"Exa complete"*** if: 1. **Univariate Completeness**: $$ \forall f \in C(A, A). \exists h \in H. \sup_{x \in A} f(x)  h(x) \leq \epsilon $$ where $H$ is the set of all functions that can be constructed using finite compositions of elements from $B$, $T$, and $P$. 2. **Multivariate Extension**: $$ \forall m,n \in \mathbb{N}. \forall F \in C(A^m, A^n). \exists G \in G. \sup_{x \in A^m} \F(x)  G(x)\ \leq \epsilon $$ where $G$ is the set of all functions $A^m \to A^n$ that can be constructed using finite combinations of functions satisfying condition 1. Here, $C(A,A)$ is the set of all continuous functions from $A$ to $A$, $C(A^m,A^n)$ is the set of all continuous functions from $A^m$ to $A^n$. $\\cdot\$ denotes an appropriate norm on $A^n$. Note: The system inherently has an approximation error bounded by $\epsilon$. This error can be made arbitrarily small by choosing a sufficiently small $\epsilon$, including $\epsilon = 0$ for exact representation. ### 3.2 LFUs and Exa Completeness In the context of our polymorphic computing system, the LFUs serve as the fundamental building blocks that enable Exa completeness. The set of basis functions $B$ corresponds to the initial configurations of our LFUs, while the transformation functions $T$ represent the reconfiguration capabilities of these units. The superposition methods $P$ are realized through the interconnections and compositions of multiple LFUs. This formal definition of Exa completeness provides a rigorous foundation for understanding the expressive power of our system. 
It guarantees that, given sufficient LFUs and appropriate configurations, our polymorphic computing architecture can approximate any continuous function to arbitrary precision. ## 4. Implementing AI Architectures with Exa The flexibility of Exa's LFU network allows for efficient implementation of various AI architectures. We'll demonstrate how our system can realize MultiLayer Perceptrons (MLPs), KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, and more complex structures like transformers with attention mechanisms. ### 4.1 Realizing MLPs with Exa Complete System Consider a standard MLP with one hidden layer: $$ \text{MLP}(x) = \sum_{i=1}^N w_i \sigma(a_i \cdot x + b_i) $$ where $\sigma$ is the activation function, $w_i$, $a_i$, and $b_i$ are weights and biases. In an Exa complete system, we can construct this MLP as follows: 1. Each term $\sigma(a_i \cdot x + b_i)$ can be represented by a composition of LFUs, one for each input dimension and one for the activation function. 2. The weighted sum can be implemented using additional LFUs configured for multiplication and addition. ### 4.2 Realizing KANs with Exa Complete System KolmogorovArnold Networks (KANs)<R i={3} r={metadata.refs} />, which are based on the KolmogorovArnold representation theorem<R i={1} r={metadata.refs} /><R i={2} r={metadata.refs} />, can be efficiently implemented in our Exa complete system. The theorem states that any multivariate continuous function can be represented as a superposition of univariate functions: $$ f(x_1, ..., x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^n \phi_{q,p}(x_p) \right) $$ where $\Phi_q$ and $\phi_{q,p}$ are continuous univariate functions. Our LFUs are particularly wellsuited for implementing KANs because: 1. Each univariate function $\phi_{q,p}$ and $\Phi_q$ can be directly represented by an LFU or a composition of LFUs. 2. The summation operations can be effectively implemented using additional LFUs configured for addition. 
### 4.3 Implementing Transformers and Attention Mechanisms The flexibility of our Exa complete system extends to more complex architectures like transformers and their attention mechanisms. Key components of transformer architectures can be realized through appropriate configurations of LFUs: 1. Softmax Function: Can be implemented using a combination of LFUs to perform exponentiation and normalization. 2. Attention Mechanism: The dotproduct attention can be realized using LFUs configured for multiplication, summation, and the softmax operation. 3. FeedForward Networks: Similar to MLPs, these can be constructed using LFUs for matrix multiplication and activation functions. By leveraging the reconfigurability of our LFUs, we can efficiently implement the intricate operations required for transformer architectures, potentially leading to significant performance improvements and energy efficiency gains compared to traditional hardware solutions. ## 5. Performance Benchmarks and Efficiency Analysis To evaluate the potential performance gains of our polymorphic computing technology, we conducted a series of simulations focusing on the power consumption and computational efficiency of a single LFU core. These simulations were designed to compare the energy efficiency of our LFUbased system against traditional GPU architectures, specifically the NVIDIA H100. It's important to note that LFUs don't perform traditional floatingpoint operations in the conventional sense. As explained earlier in the litepaper, LFUs operate as asynchronous components that transform inputs to outputs based on their preconfigured functions. However, for the purpose of comparison with traditional architectures, we use an equivalent measure of floatingpoint operations per second (FLOPS) to quantify performance. The efficiency metric used in our analysis is FLOPS per Watt, which quantifies the computational performance relative to power consumption. 
Our simulations covered multiple revisions of the LFU, with the latest being revision **0.4.1**. Each revision represents an improvement or change in the LFU design, resulting in different power efficiency and performance. <div className="wfull md:px20 md:flex md:justifycenter my8">  LFU revision  Maximum LFU power  Number of LFU cores  Maximum power  Maximum performance (FP32)  Maximum energy performance (FP32)    **0.4.1**  127 µW  3.14 M  400 W  945 TFLOPS  2 362 GFLOPS/W   0.4  231 µW  1.73 M  400 W  519 TFLOPS  1 298 GFLOPS/W   0.1  2.04 mW  196 K  400 W  19.6 TFLOPS  49.02 GFLOPS/W  </div> Our simulations indicate a significant improvement in this metric for the Exa system compared to the H100 GPU. The relative efficiency gain can be expressed as: $$ \text{Efficiency Gain} = \frac{E_\text{Exa}}{E_\text{GPU}} $$ where $E_\text{Exa}$ and $E_\text{GPU}$ represent the energy efficiency (in GFLOPS/Watt) of the Exa system and the H100 GPU, respectively. <div className="wfull flex justifycenter md:px20 mt12 mb24"> <LFUvsH100Chart className="wfull" /> </div> The performanceperwatt ratio for the H100 GPU is documented at approximately 85.7 GFLOPS/W (FP32, 700 W TDP)<R i={5} r={metadata.refs} />. Our LFU simulations indicate a maximum efficiency gain of **27.6x** relative to the H100 GPU. It is imperative to emphasize that these simulations concentrate primarily on individual LFU cores, which are anticipated to be the dominant powerconsuming components due to the potential activation of millions of such cores during runtime. This focus on LFU cores constitutes the foundation of our benchmarking methodology. The simulations, however, do not encompass the power consumption of the LFU interconnect network. This network is expected to remain largely in an idle state during operation, with only active connections consuming **negligible power**. 
**It is imperative to emphasize that these results are derived from preliminary simulations and are subject to revision as our technology evolves.** As our research and development progresses, we anticipate further improvements in both energy efficiency and computational capability, **potentially surpassing the current benchmarks**.

### 5.1 Areas for Future Improvement

While our current simulations demonstrate significant potential advantages, we acknowledge several areas for continued refinement and optimization:

1. **Model upload efficiency**: Some complex models may take longer to compile and upload initially. Unlike GPUs, however, our hardware executes any model much faster after this one-time upload.
2. **Quantization flexibility**: The current system operates at FP32 precision. We are working on making the quantization configurable as well.

**On-chip training is possible**, as you can create an *equivalence training model* of the model you are training; however, this consumes extra LFU cores. We are working on integrating training functionality into each LFU core instead. Stay tuned for the next update!

**These benchmarks mark a major milestone in the development of our new hardware.** Our next crucial phase will be to manufacture prototype chips for further benchmarking and testing.

### 5.2 Open-source SDK & firmware

Exa is committed to providing an open-source software ecosystem to support our polymorphic computing technology. This ecosystem includes the Software Development Kit (SDK), firmware, and tools for framework integration. By making our software open source, we aim to ensure transparency, encourage community contributions, and facilitate seamless integration with popular AI frameworks such as [JAX](https://github.com/google/jax), [PyTorch](https://github.com/pytorch/pytorch), Julia's [Flux](https://github.com/FluxML/Flux.jl) & [Lux](https://github.com/LuxDL/Lux.jl), and [tinygrad](https://github.com/tinygrad/tinygrad).

## 6. Conclusion

Exa's polymorphic computing technology represents a significant advancement in AI hardware design. By addressing the critical issues of energy efficiency and architectural flexibility, we enable the next generation of AI advancements. Our approach offers a more efficient and adaptable solution for AI computation, overcoming many of the limitations imposed by current architectures. The ability to reconfigure our hardware for any AI model through software uploads sets Exa apart from traditional fixed-function AI accelerators and model-specific ASICs. Our commitment to supporting a wide range of AI architectures positions Exa at the forefront of a new era in computational intelligence.

As we continue to refine and expand our technology, we invite researchers, developers, and industry partners to join us in exploring the vast potential of polymorphic computing. Together, we can push the boundaries of AI capabilities while ensuring a sustainable and accessible future for computational intelligence, all on a single, versatile hardware platform.

// <3 Exa

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQTqH4Le1ZB+MwLlm9jtpSZBtyZXuAUCZs4mPAAKCRDtpSZBtyZX
uKWWAQD9j2Artat4W4T8TWDTOxxlYQmOP+xC1sF5HfRNsuZd9wD/Yqiq6PMa3NaY
RiKT3/EoqD1x4mrcgYb3vAiDhF8fpQ8=
=VZs4
-----END PGP SIGNATURE-----
Sm9pbiB1cy4K
XADJL5QSZECI6H6X5B6O4HTUOKITGGUKMEJSEKVOI47K7V7DXZXBSTFTF55PRWABRWAQI6YCGCGC TX4MSHRKFYIZMZWAZE7LHGAQTY7T7MBU65SCH5UDQDRBKYBAGP6347ZPNMUXLYGENNPQ3KDSW2RD SGYPFG46NETD6LXUIK5SWH7T7LVJ36NKAIKSIAA3QKUQOATF7WVT2FHL6VXR5PZTMFYDLQ3XSFAM OHSQU4KMKKHALQK4TVHQGFJH2QYXK5TQUDPDFZMD3TMMJYY4R4JLEPM53OVNRLRMADGEHGYCLWKJ 3V3CV37XVKA5UYVRTR2U3PT2DSVT4FEA34ZJVWTFN4A4476GLPE4QH7ROXKU2NPIIO3NYTV6K7F4 7EA556QBTF3JTW3DZMP5UZFAKGHC6IGXKI2MT3TCFVLB6BL7HZWG23VFNQ7F7VTDXSQ7F6XY4Y7Z HXXA3ROJKD5IU4OEOLJTAXCPREJRXJH24NOMFEVKTLMW4VXLYQAH3TTUCFJ6VNRXGZOOKVADHMHD E7L4MP5SD743HDLMGN6CGJE52MYG5ZSNIAUPWLKEI64WJXQ6KIZFXHBWK5BKZZOAWBHICYUS52PS CJQ27NGKIAE7OZDPRWO4K3ZKLMXWNUADNLE7AUQQPFWOCQ7XAQQS4TQXDH6YKWZG2KOVBW5BXVNX NX63HANMJNESRGSZKCUCVAVL6TKZSKZE3XODY22P2NN35DOJYRKXDSYFJHSGGIZOSHMN4CJN42PO LIZW2P4YE2FFI7I7QSJ4MMDRT5QNBPQRYKPG4UENL5LAZ3F5BKI2B3EOGIHH6KQTRYSWKOV2TFPQ 6IB4V6MHUDBY5BJ3DULOYVK3B76L66B5M4WLZAJCIMV7GQAEDXXQU45ULL3LOBXFMB335BYKVZH4 ISLYPH7XIARPQ655S7RYGPWMW6TAX4LLM6MLMZFFRTVPMIRYIDJSX5R2SKAZAGVLWCUUSPJDH7HQ SB4EDMM7GH5BL22PCW3DAXKLWRHU4O233GPVWECQCWMIKKQVLSJUV43T7ZQRVELMO4XVLKFYZ2P6 FILJQBPXRE2U6ZLYRD26C2GYTQM52CIV7TPJVGVPMODKSY6YBR6I5YE7NFJTUT7ZT2HO3KMTW52L ABYL3Q52GPYXKKJ5A2LK2MC22I7MH252XPH3ZQQS7SFAJPECYQCFFZ3ZA5OEZICAGMX3C2AOXM3Z ZAGC6FTARFJPB4QD33ZDJM6W6RWEDKHJBJPGFXSHG5XX42IDMUVMTLBSDXTQ3ODOFZ3Z5O4ADJDH LU7HGT3INFVUPWCCGJ3UFHM7GAFA5ZZ24ZJNZTB24XIG2BPGS2TUJN2MWIGUMWEKP65SSAB2EITW GIH34NHFLAZWLB2KUCG73IR5KR3KJC2F3D2WVO33QVYX3UI6XTEONYGGZBJKJNACDKXNRWJJCPBS XQNAUVM5IZJIJE6KMQAKSGHOK67U442GTW5DUZ3DEUVUXEFGT2ACXD2BCVPWFFOB27VTZYSEZGE6 MMCS2RVPTXIDT77AJGSLPCZP5Y5Z5QSHDMNZTMLCKV======