US PA 19/190,418
US PA 19/210,515

PROGRAMMABLE STACKED PROCESSOR WITH SIMULTANEOUS
EXECUTION OF ALL PROGRAM STEPS OVER A DATA STREAM

Filed 04/25/2025

F. A. Q.

Why build a novel non‑von Neumann processor? What's wrong with what we've got?

All modern processors — CPUs, GPUs, NPUs, and TPUs, including ARM and RISC-V — use variations of the original von Neumann and Harvard architectures developed in the late 1940s and 1950s. From the IBM and UNIVAC mainframes of the 1960s to the modern multi-core microprocessors of today, each generation has retained the foundational model of instruction processing — fetch, decode, execute, and repeat — despite radical advances in scale and speed.

Early computing faced severe resource constraints. RAM, originally built from magnetic cores, stored each bit in a separate ferrite ring. Technical advances eventually brought the cost of a 32-kilobit memory unit down to about $500, but it still occupied a cubic foot of space, and the severely limited capacity forced programs to be as small as possible.

However, the landscape shifted dramatically in the 1970s with the advent of Dynamic RAM (DRAM). Memory became cheap and abundant. As available RAM ballooned, so did the size and complexity of programs. Codebases grew bloated, introducing latency and inefficiency that processors were ill-suited to handle.

Modern CPUs still execute programs sequentially, one instruction at a time. Each core must fetch instructions — through multiple levels of cache — decode them, execute them, and then use the results to inform the next step. While pipelining allows some overlap (e.g., executing one instruction while decoding the next and anticipating branches), the inherently sequential design still places a hard ceiling on throughput. Even at a single cycle per instruction, a 1000-step algorithm still takes 1000 cycles per core, per output.

This is the core limitation: in classic processor architectures, more instructions always mean more clock cycles. That’s fine for general-purpose tasks, but for workloads that demand massive throughput (like cryptography, data compression, or high-frequency trading), this architecture is fundamentally insufficient.

Enter the Many-Headed Hydra:
The Stacked Slice Processing Unit

Instead of executing one instruction at a time, over and over, the Hydra-like Stacked SPU takes a radical departure: it executes all the instructions, all at once.

Each “slice” in the stack is preloaded with a single step of the algorithm. On every clock cycle, every slice simultaneously executes its instruction against the incoming data stream like a row of Hydra heads striking in perfect synchrony. There’s no instruction fetch, no decoding bottleneck, and no sequential delay from one step to the next.

This architecture eliminates the constant round-trip to cache or RAM for instructions and variables and the serial delays of traditional designs. The entire algorithm is expressed as a pipeline of dedicated hardware steps, one per slice, executing deterministically and in parallel, cycle after cycle.

How does it work?

A stack of SPUs, in contrast to a CPU or GPU, has each SPU execute one step of an algorithm on every clock cycle; instructions are preloaded, never fetched. Instead of changing the state of a single core and executing the next instruction against the changed state, each SPU passes its state (including its register values) down to the next SPU in the stack, and subsequent instructions are executed by succeeding slices.

[Figure: a stack of four identical SPU slices between INPUT DATA at the top and OUTPUT RESULTS at the bottom. Each slice contains SKIP and TRACK registers, four OP fields, DECODE and COND logic, a REGISTER BANK, two ALUs, and a skip-decrement (DEC) unit.]

When the first slice executes the first instruction against the first data set, it passes the results to the second slice which executes the second instruction against the first data set, which passes the result to the third slice and so on. But while the slices below apply their instructions to the first data set, the first slice has gone on to apply its first instruction to the next data set. So, data values and algorithmic states flow continuously down the stack, and once the entire stack is filled, the results of the algorithm are returned with a throughput of one per clock cycle.
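As a minimal illustration of this flow, here is a Python sketch of the timing model only, not of the hardware; the three-step example program is invented for the demonstration:

    # Sketch: a stack of slices modeled as a software pipeline.
    # Each slice holds one preloaded instruction (here, a function)
    # and applies it to the state handed down from the slice above.
    def slice_stack(instructions, data_stream):
        depth = len(instructions)
        stages = [None] * depth            # state held between slices
        inputs = iter(data_stream)
        cycle = 0
        while True:
            cycle += 1
            # One clock: every slice fires at once, modeled by shifting
            # state down the stack, bottom slice first.
            out = stages[-1]               # value leaving the stack
            for i in range(depth - 1, 0, -1):
                prev = stages[i - 1]
                stages[i] = None if prev is None else instructions[i](prev)
            item = next(inputs, None)
            stages[0] = None if item is None else instructions[0](item)
            if out is not None:
                yield cycle, out
            if item is None and all(s is None for s in stages):
                return

    # A 3-step "program": y = (x + 1) * 2 - 3, one step per slice.
    prog = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    for cycle, value in slice_stack(prog, range(5)):
        print(cycle, value)   # after the 3-cycle fill: one result per cycle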

Some algorithms are not amenable to this form. In fact, most programs, which take more or less random branches, run indeterminate loops, and make varying subroutine calls to avoid repeated encoding, cannot be converted to a streaming set of processor instructions. But some of the algorithms that do fit the paradigm are applied to massive workloads, where the advantages of maximum throughput and minimum power per datum processed outweigh the difficulty of arranging the system to stream the inputs and outputs continuously.

Many algorithms that require high-throughput processing share common attributes, such as repetitious patterns of highly optimized instruction code, that make it straightforward to define the steps necessary to load them onto a stack of SPUs. These common programming tasks include encryption, decryption, video encoding and decoding, and cryptocurrency algorithms. They also tend to be the largest consumers of both power and CPU cycles in a complete processing system.

[Figure: SHA-256 as a stream: a 512-bit data block and the initial state (SKIP 00, TRACK 00) enter the top of the stack; the 256-bit hash final state emerges at the bottom.]
Bitcoin as an example

Instructions for SHA-256

Some algorithms, like the ubiquitous SHA-256 used by Bitcoin™, run without any branching dependent on state, and can even execute more than one instruction at a time when none depends on the output of another. Other algorithms need to select differing steps to execute depending on state derived from the data, which cannot be predicted and set ahead of time.

The accompanying listing, “Instructions for SHA-256,” shows a working set of instructions that compute the SHA-256 hash as a continuous stream. The full computation requires 1216 slices, each of which supports two ALU operations per instruction, resulting in the maximum possible throughput of one 512-bit block hashed per clock cycle.
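For a sense of why SHA-256 unrolls so naturally, here is one of its 64 compression rounds written as straight-line 32-bit operations (a Python sketch; k_i and w_i are the standard round constant and message-schedule word, not defined here). Every line depends only on values already computed, so each operation can be assigned to a slice and the full round sequence laid out down the stack:

    M = 0xFFFFFFFF                       # 32-bit wrap-around arithmetic

    def rotr(x, n):                      # rotate right within 32 bits
        return ((x >> n) | (x << (32 - n))) & M

    def sha256_round(a, b, c, d, e, f, g, h, k_i, w_i):
        # One SHA-256 compression round: pure straight-line code,
        # no branches or loops -- ideal for one instruction per slice.
        s1  = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch  = (e & f) ^ (~e & g) & M
        t1  = (h + s1 + ch + k_i + w_i) & M
        s0  = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2  = (s0 + maj) & M
        # The state handed to the next round, i.e. to the next slices:
        return (t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g

Unrolled 64 times, plus the message-schedule expansion, this is exactly the kind of fixed instruction sequence that can be preloaded one step per slice.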

The problem Bitcoin™ defines is to search for a value, called in the literature a "nonce," which results in a hash whose value falls within a specific range. Whichever miner first reports a nonce value that is confirmed correct wins the competition.

While it takes 1216 clock cycles for a single block to fall through the stack, nonce guesses enter and final hashes return at a rate of one per cycle.

There is no mechanism by which a clocked ASIC circuit can compute anything in less than a single clock cycle, so this is the upper bound on possible throughput.

Multiple SPU stacks may be placed on a single chip, and multiple chips placed on a single board, so by assigning each section a different range of nonces to test, the throughput may be arbitrarily expanded. Chips may be chained together to implement arbitrarily long programs with single cycle throughput.

Incidentally, this simultaneously represents the least power expense, as nothing is done which is not a direct contributor to the computation.

A Stacked SPU solution simply has no housekeeping overhead, which accounts for the majority of the power used by CPU / GPU based solutions.

What if the algorithm requires conditional branching?

As described in the Patent, the Stacked SPU design provides two distinct ways to implement conditional execution: one highly efficient way that is limited in scope, and one that makes less efficient use of resources but is not limited.

A typical conditional construct is the ‘if … then … else’ statement form, common to many languages. If the clause following the ‘if’ keyword evaluates to true at the current moment, the instructions in the ‘then’ clause are executed; otherwise the instructions in the ‘else’ clause are executed. Each clause may itself contain further conditionals, so that complex trees of execution may be implemented.

To provide for branches in the instruction sequences, each SPU slice provides two methods, and both may be used together. The simpler form is the ‘skip’ mechanism: if the conditional evaluates to false, the skip value is set to the number of instructions in the ‘then’ clause. Each slice that receives a non-zero skip value has its execution suppressed; it simply passes on the state and registers it received from the prior slice without alteration, decrementing the skip counter by one as it passes it on. When the skip count reaches zero, which corresponds to the first instruction of the ‘else’ clause, normal execution resumes and proceeds to the end of the conditional section. If instead the conditional expression evaluates to true, the skip value is left at zero and execution proceeds with the ‘then’ clause immediately following. At the end of the ‘then’ clause, the skip value is set to the number of instructions in the ‘else’ clause, which causes them to be skipped.
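A minimal Python sketch of the skip mechanism (the instruction encoding is invented for illustration: each instruction acts on the registers and returns the skip count to emit, with 0 meaning no branch):

    # Each slice: if skip > 0, pass state through untouched and decrement;
    # otherwise execute the instruction, which may set a new skip count.
    def run_stack(program, regs):
        skip = 0
        for instr in program:
            if skip > 0:
                skip -= 1                 # suppressed slice: pass-through
                continue
            skip = instr(regs)
        return regs

    # if (x < 10) { x = x + 1; } else { x = x - 1; }
    # The "then" clause is 1 instruction, so a false condition emits
    # skip = 1; the end of "then" emits skip = 1 to jump over "else".
    def cond(r):    return 0 if r["x"] < 10 else 1
    def then_op(r): r["x"] += 1; return 1
    def else_op(r): r["x"] -= 1; return 0

    print(run_stack([cond, then_op, else_op], {"x": 5}))   # {'x': 6}
    print(run_stack([cond, then_op, else_op], {"x": 42}))  # {'x': 41}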

[Figure: a nested conditional flowing down the stack, with the SKIP value (e.g. SKIP 08) and the R, G, B register contents shown at each slice for two example inputs (R 80 G 80 B 80 and R FF G FF B FF):
if (R < G) { if (G < B) { B = (R+G)/2; } else { G = (B+R)/2; } } else { R = (G+B)/2; }]

Whichever clause is selected, the skip value will be zero while it executes, so it may have nested ‘if … then … else’ conditionals that can set the skip value to select which sub-clause to execute. At the end of the nested conditional, the skip value will always return to zero, so the remaining non-conditional instructions are undisturbed.

On the path not taken, the skip value counts down as the instructions pass, and since any nested conditionals within the skipped region are not executed, the count proceeds undisturbed. This allows any number of nested conditionals without further hardware support.

While this mechanism will clearly provide the expected effect by selecting which instructions to execute based on the algorithm intended, it is not a particularly efficient way to utilize the hardware slices, since for a given input, one section or the other will always be idle.

An alternative mechanism is to provide multiple instructions for each slice on parallel ‘tracks.’ The ‘if’ conditional can be on the primary track and the ‘then’ clause instructions can be on one track while the ‘else’ clause instructions can be on another. Instead of simply skipping the ‘then’ clause, the conditional can select the alternate track to execute the ‘else’ clause or stay on the current track to execute the ‘then’ clause. At the end of the ‘else’ clause the last instruction can simply switch back to the main track. When the two parallel clauses are of differing lengths, which is common, the shorter clause simply ends with a skip to the end using the mechanism described previously.

[Figure: the same nested conditional, if (R < G) { if (G < B) { B = (R+G)/2; } else { G = (B+R)/2; } } else { R = (G+B)/2; }, laid out across tracks 0 through 3, with the per-slice TRACK values and R, G, B register contents shown for the example inputs.]
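Continuing the Python sketch, with another invented encoding: each slice holds one instruction per track, each datum carries its current track number, and each executed instruction returns the track on which the datum continues:

    def cond(r):    return 0 if r["x"] < 10 else 1   # select a track
    def then_op(r): r["x"] += 1; return 0            # rejoin track 0
    def else_op(r): r["x"] -= 1; return 0            # rejoin track 0
    def nop(r):     return r["trk"]                  # unused track slot

    # Slice 0 holds the conditional on track 0; slice 1 holds the "then"
    # clause on track 0 and the "else" clause on track 1, side by side.
    program = [(cond, nop), (then_op, else_op)]

    def run_tracked(program, regs):
        regs["trk"] = 0
        for instrs in program:        # one tuple of track slots per slice
            regs["trk"] = instrs[regs["trk"]](regs)
        return regs

    print(run_tracked(program, {"x": 5}))    # {'x': 6, 'trk': 0}
    print(run_tracked(program, {"x": 42}))   # {'x': 41, 'trk': 0}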

If there are two ALUs implemented in each slice, two tracks and two instructions may be executed at the same moment in time, which takes advantage of the common situation when the program is not in a conditional section.

More than two ALUs may be implemented, and more than two tracks, as optimized for the range of algorithms the stack is intended to run. Since the skip mechanism is indefinitely nestable, the number of tracks and parallel ALUs can be determined statistically without limiting the range of possible algorithms.

It should be noted that the path a given datum takes as it flows down the stack depends only on the state produced by the instructions applied to that datum. Some flows will be on one track while others are on an alternate track, and each slice simply executes the instruction for the track its current datum is on, without disturbing the paths that other data flows take. This is shown in the figure above, where different inputs cause different paths to be taken, as defined by the algorithm for the specific values and conditions presented.

Once data completely fills the stack, every slice executes an instruction on each clock cycle. On each cycle, one new data entry is pushed into the top, and one result is delivered at the bottom. The throughput, as measured in clock cycles per completed output, is one per cycle.

When the number of available tracks in the hardware is sufficiently large, a program need not fit into the number of slices available in the hardware stack. The output at the bottom of the stack can be optionally fed into the top of the stack, with the provision that the final instruction at the bottom selects a new track to be executed at the top. The data can then flow through the stack twice and require half as many slices to implement the algorithm.
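In the terms of the track sketch above, this loopback amounts to running a datum through the same slice array twice, with the final instruction of the first pass selecting the track that holds the second half of the program (a hypothetical encoding, continuing the example):

    def inc(r):  r["x"] += 1; return 0  # first-pass instruction, track 0
    def to1(r):  r["x"] += 1; return 1  # last first-pass step: switch track
    def dbl(r):  r["x"] *= 2; return 1  # second-pass instructions, track 1

    # A 4-step program folded onto a 2-slice array using two tracks:
    program = [(inc, dbl), (to1, dbl)]

    def run_two_pass(program, regs):
        regs["trk"] = 0
        for _pass in range(2):          # the bottom feeds back to the top
            for instrs in program:
                regs["trk"] = instrs[regs["trk"]](regs)
        return regs

    print(run_two_pass(program, {"x": 3}))  # ((3+1)+1)*2*2 = 20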

AES-256 as an example

Another example that may be relevant to consider is the AES-256 encryption and decryption algorithm.

The standard AES encryption algorithm, mandated by the US DoD, is widely used for encrypted transport and storage of sensitive data. Standard CPU offerings from Intel and AMD include special instructions (AES-NI) to efficiently encode individual steps of the algorithm for encryption and decryption.

However, these implementations still utilize a sequence of many instructions, and their performance is quoted by the manufacturers in multiple CPU clock cycles per byte.

With the addition of NCMUL, NCA2B, NCA4B, and additional data rearrangement instructions to the SPU, standard AES encryption and decryption can also process multiple data streams with single cycle throughput at minimal power.

Inversion in the Galois field used by AES is time-consuming, and the generally accepted solution is to use a 256-entry lookup table in RAM and process the bytes one at a time. But inversion can instead be computed directly, using the definition deduced from Fermat’s Little Theorem:

1/a = a^254 mod p

where p is the irreducible polynomial that defines the finite field in question, in this case x^8 + x^4 + x^3 + x + 1. Multiplication in the Galois field is accomplished by XORing shifted values only, as no carries are used: the partial products are all of the form a·b mod 2.

Using non-carry multiplies only, inversion may be directly calculated in a series of slices as an alternating series of squares and multiplies.

But since squaring a number in a mod-2 finite field is identical to inserting 0 bits between each bit of the original value, raising any number to a power of the form 2^n may be accomplished using wiring only. Including additional instructions to compute a^2·b mod p and a^4·b mod p therefore has little impact on the size of each slice, since they utilize the same reduction logic as a·b mod p.
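The arithmetic above is easy to check in software. A minimal Python sketch (the gf_* names are ours for illustration, not the patent’s instruction set) of carry-less multiplication reduced mod p, squaring by zero-insertion, and inversion as a^254 computed by alternating squares and multiplies:

    P = 0x11B   # x^8 + x^4 + x^3 + x + 1, the AES field polynomial

    def gf_mul(a, b):
        # Carry-less (XOR) multiply of two GF(2^8) elements, then reduce.
        r = 0
        for i in range(8):
            if (b >> i) & 1:
                r ^= a << i              # partial products combine by XOR
        for i in range(14, 7, -1):       # reduce degree 14..8 terms mod P
            if (r >> i) & 1:
                r ^= P << (i - 8)
        return r

    def gf_square(a):
        # Squaring mod 2 = inserting a 0 bit between each bit of the
        # value (wiring only in hardware), then the same reduction.
        s = 0
        for i in range(8):
            s |= ((a >> i) & 1) << (2 * i)
        for i in range(14, 7, -1):
            if (s >> i) & 1:
                s ^= P << (i - 8)
        return s

    def gf_inv(a):
        # 1/a = a^254 mod p, as an alternating series of squares and
        # multiplies: t <- t^2 * a six times yields a^127 (each step a
        # fused a^2*b operation, cf. NCA2B); a final square gives a^254.
        t = a
        for _ in range(6):
            t = gf_mul(gf_square(t), a)
        return gf_square(t)

    # Sanity check over the whole field:
    assert all(gf_mul(a, gf_inv(a)) == 1 for a in range(1, 256))

With the fused a^2·b and a^4·b forms executed as single slice instructions, the same chain collapses into far fewer slices, which is the saving described next.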

The same functionality can thus be expressed in fewer instructions. Both forms have the same throughput of one block processed per cycle, but the second takes fewer slices and is physically smaller. Both forms can also operate on multiple bytes in parallel, so when using 32-bit registers they can compute four GF(2^8) byte operations at once, further reducing the number of slices required.

Additional instructions support the required affine transformations and block rearrangements. These instructions can be optionally included in the slice design without impacting other features whenever the system will be required to perform high throughput standard encryption and decryption.

While Intel and other processors can accomplish the same function in a few specialized instructions, they must fetch and execute instructions sequentially from a cache and only initiate a single instruction on any given cycle. The SPU stack, in contrast, is capable of executing

ALL the instructions of a complex algorithm on each and every clock cycle

and throughput is limited only by the rate that data can be delivered to and from the computation engine.

Specialized instructions to support AES, fixed-point integer multiply and divide, as well as floating-point operations may be included as needed in the SPU design. A full set of all instructions may be used to support the widest variety of algorithms, or unused instructions may be omitted when the target has a restricted algorithmic range, e.g. crypto mining.