PROGRAMMABLE STACKED PROCESSOR WITH SIMULTANEOUS
EXECUTION OF ALL PROGRAM STEPS OVER A DATA STREAM
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is in the technical field of digital computing and pertains
more particularly to a new and unique computer structure and operation.
2. Description of Related Art
Since the earliest programmable computer systems were introduced, a typical
CPU has fetched coded instructions, usually from sequential locations in a digital
memory, and executed them one at a time, with status and values held over in registers
from one step to the next. Repeated instructions are implemented by looping back to an
earlier step and rereading the coded instructions until all computations are complete. This
von Neumann architecture (in which data and instructions are kept in a common
store) has formed the basis for modern central processing units (CPUs), graphics
processing units (GPUs), and tensor processing units (TPUs). CPUs are typically general
processors, GPUs are dedicated to graphics, such as images and video, and TPUs are
application-specific integrated circuits (ASICs) designed by Google™ for neural
networks. TPUs possess specialized features, such as the matrix multiply unit (MXU)
and proprietary interconnections.
Generally, all of these different sorts of processors have common limitations that
are a result of choices made in the relatively early days of computer evolution. This is a
natural consequence of design evolution by modifying existing mechanisms and
structures. Many computing systems have multiple cores wherein independent
processors share resources such as floating-point processing units and memory caches,
but the execution of programs is not synchronized between the cores and each core may
execute unrelated instruction sequences. This arrangement provides a very versatile
design but is not necessarily the most efficient way to execute algorithms where such
pervasive versatility is not required.
The inventor posits that a new and non-obvious computing system is needed to
maximize efficiency and increase computing power where extremely high throughput is
required, and very repetitive sequences of fixed operations may be inherent to algorithm
definition. This is a typical case for many forms of encryption, decryption, data integrity,
compression, decompression, and cryptocurrency. The normal operations of an
Operating System should be handled by a typical CPU whenever the extreme conditions
of an algorithm do not present a significant burden.
BRIEF SUMMARY OF THE INVENTION
In one embodiment of the invention a computing machine is provided, comprising
a plurality of slice processing units (SPUs) coupled in a specific order from a first SPU to
a last SPU, each SPU configured to perform a single instruction of a specific algorithm
on an incoming value and to pass a resulting output value, an input memory stream
interface coupled to the first SPU in the specific order of SPUs, configured to receive a
value from a direct memory access (DMA) controller and to pass the value to the first
SPU in the specific order of SPUs, an output memory stream interface coupled to the last
SPU in the specific order of SPUs, configured to receive the output value of the last SPU
in the specific order of SPUs, and to pass the value to the DMA controller, a set of one or
more random access memory (RAM) units coupled to the DMA controller, and a central
processing unit (CPU) coupled to the DMA controller, wherein all instruction steps of the
specific algorithm are performed in parallel by the plurality of SPUs, and a result of the
algorithm is output by the last SPU in the specific order in each clock cycle.
In one embodiment the interconnected SPUs, the input and output memory stream
interfaces, and the DMA controller comprise an application specific integrated circuit
(ASIC), wherein at least one instruction of an algorithm is preprogrammed into
individual ones of the SPUs. Also, in one embodiment one or both of the one or more
RAM units and the CPU are a part of the ASIC. In one embodiment algorithms with
conditional branches are supported by a skip count passed between SPUs in order, such
that if the skip count has a positive value the SPU does not perform its programmed
instruction and passes the skip count, decremented by one, to the next SPU, and if the skip count is
zero, the SPU performs its programmed instruction. And in one embodiment algorithms
with conditional branches are further supported by SPUs being preprogrammed with
more than one instruction and an integer track number is passed from SPU to SPU, the
track number determining which instruction of the more than one instruction is to be
performed on the incoming value by the SPU.
In one embodiment the final result passed to the output memory stream interface
is passed back to the DMA controller and written to the at least one RAM unit. In one
embodiment the final result
passed to the output memory stream interface is also passed back to the first SPU as a
new input value.
In another aspect of the invention a method for calculating an algorithm is
provided, comprising providing a plurality of slice processing units (SPUs) coupled in a
specific order from a first SPU to a last SPU, each SPU configured to perform a single
instruction of a specific algorithm expressed as a sequence of instructions on an incoming
value and to pass a resulting output value, coupling the first SPU in the specific order to
an input memory stream interface configured to receive a value from a direct memory
access (DMA) controller coupled to a random access memory (RAM) and to pass the
value to the first SPU, coupling the last SPU in the specific order to an output memory
stream interface configured to receive the output value of the last SPU, and to pass that
value to the DMA controller, programming each SPU in the specific order with a
different one of the sequence of instructions comprising the specific algorithm; passing
values sequentially to the first SPU from the input memory stream interface, and
performing all instruction steps of the specific algorithm in parallel by the plurality of
SPUs, such that a result of the algorithm is output by the last SPU in the specific order in
each clock cycle.
In one embodiment the method further comprises providing the plurality of SPUs
as an application specific integrated circuit (ASIC). Also, in one embodiment the method
comprises providing the plurality of SPUs, the input memory stream interface and the
output memory stream interface as an ASIC. In one embodiment the method comprises
providing the plurality of SPUs, the input memory stream interface, the output memory
stream interface, and the RAM as an ASIC. In one embodiment the method comprises
providing the plurality of SPUs, the input memory stream interface, the output memory
stream interface, the RAM and a central processing unit (CPU) as an ASIC. Also in
one embodiment the method comprises supporting algorithms with conditional branches
by a skip count passed between SPUs in order, such that if the skip count has a positive
value the SPU does not perform its programmed instruction and passes the skip
count, decremented by one, to the next SPU, and if the skip count is zero, the SPU performs its
programmed instruction.
In one embodiment the method comprises further supporting algorithms with
conditional branches by programming individual ones of the SPUs with more than one
instruction and passing an integer track number between SPUs, the track number
determining which instruction of the more than one instruction is to be performed on the
incoming value by the SPU. In one embodiment the method further comprises passing
the outputs of the last SPU in the specific order, as results of the algorithm, from the output
memory stream interface back to the DMA controller and writing the outputs to
the RAM. And in one embodiment the method further comprises passing the final result
that is passed to the output memory stream interface to the first SPU as a new input value.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Fig. 1 depicts an overall configuration of a system in an embodiment of the
present invention.
Fig. 2 shows a typical configuration for each slice processor unit (SPU) in the
system of Fig. 1 in an embodiment of the invention.
Fig. 3 shows the typical breakout of fields from an instruction word in an
embodiment of the invention.
Fig. 4 illustrates register select logic for determining primary, secondary, and
auxiliary values used by logical and numeric operations in an embodiment of the
invention.
Fig. 5 illustrates internal details of an arithmetic logic unit (ALU) that implements
instructions for each SPU in an embodiment of the invention.
Fig. 6 illustrates internal support for each program register contained within
register value replacement logic in an embodiment of the invention.
Fig. 7 shows a typical implementation of the propagation of algorithmic status
bits and control of conditional execution in an embodiment of the invention.
Fig. 8 shows circuitry for recirculating results on demand in an embodiment of the
invention.
DETAILED DESCRIPTION OF THE INVENTION
The following descriptions teach an integrated circuit that executes a sequence of
programming steps in parallel rather than sequentially but retains the ability to have
partial results passed from one step to the next. The design posits a
coordinated sequence, or ‘stack’, of limited processors the inventor calls Slice Processor
Units (SPUs) that do not fetch instructions from memory. One or more pre-selected
instructions are installed into individual ones of the SPUs, in some embodiments into
every SPU, and each SPU that holds instructions executes on each
clock cycle when data is available from the previous SPU or, in the case of the first SPU,
from an input data stream or from feedback from the bottom of the stack. Loops are unrolled and
each iteration of an instruction is placed in a different SPU of the stack. Each SPU has its
own set of data registers, whose values are inherited from a previous SPU and forwarded
to the next SPU after making changes according to the SPU’s instruction.
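By way of illustration and not limitation, this behavior can be modeled in a few lines of software. The following Python sketch is hypothetical: each list entry stands in for one SPU's preloaded instruction, and the three-slice "add one" program stands in for a small unrolled loop.

    # Hypothetical software model of a stack of SPUs. On every clock all SPUs
    # fire at once, each on the register set received from the slice above it
    # on the prior clock; a new value enters at the top on each clock.

    def clock(stack, pipeline, new_input):
        """Advance one clock; return the updated pipeline and any emitted result."""
        out = pipeline[-1]                      # value leaving the last SPU
        feed = [new_input] + pipeline[:-1]      # register sets move down one slice
        pipeline = [stack[i](v) if v is not None else None
                    for i, v in enumerate(feed)]
        return pipeline, out

    # A loop of three "add 1" iterations, unrolled into three slices.
    stack = [lambda r: r + 1 for _ in range(3)]
    pipeline = [None] * len(stack)
    results = []
    for v in [10, 20, 30, 40] + [None] * len(stack):   # extra clocks drain the pipe
        pipeline, out = clock(stack, pipeline, v)
        if out is not None:
            results.append(out)
    print(results)   # [13, 23, 33, 43]: one result per clock once the stack fills

Once the pipeline fills, one result emerges per clock regardless of program length, which is the throughput property discussed next.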
The effect of the instructions implemented in the stack of SPUs is the same as it
would be for a typical fetch-and-execute single core, with the distinction that,
instead of modifying the data register values of that core over time, the
instructions modify the values passed down the stack of SPUs. Instead of many clock
cycles required for a single core to produce a single result, the stack of SPUs produces
one result per clock cycle, realizing a corresponding increase in throughput.
While the latency in clock cycles, from the time a data item is taken in to the time
a corresponding result is produced, depends on the number of slices the data passes
through, the throughput is one output per clock cycle, limited only by the rate at which
data can be delivered. Some algorithms operate linearly without branches, such as the
well-known and pervasive SHA-256 used in cryptocurrencies. Other algorithms have
conditional operations, typically expressed as “if … then … else …” constructs in
computer languages, whereupon one of several branches is executed depending on
circumstances expressed in numeric or logical values of preliminary computations.
An algorithm design for a stack of SPUs supports conditional execution in two
distinct ways. The first way is by skipping operations of a contiguous set of SPUs using
a skip count. On detection of a condition that obviates the need to execute a subsequence
of instructions, the skip count is set to the number of instructions, and hence the number
of SPUs, that are to be skipped. Each SPU receives the current skip count from the
previous SPU. If the skip value is not zero, the operation of that SPU is suppressed and
the skip count that is passed to the next SPU is reduced by one.
If the skip value is zero, then the SPU operates normally. To implement a typical
“if … then … else …” conditional execution, if the conditional clause is met, no skip
count is set and the “then” clause operations are executed immediately following the
current SPU. At the end of the “then” clause an unconditional skip of the “else” clause is
initiated. If the condition clause is not met, the “then” clause is skipped, and execution is
resumed when the “else” clause operations are reached.
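A minimal software sketch of this skip-count mechanism follows, by way of illustration only. The slice functions and the four-slice program implementing "if x < 0 then x = -x else x = x + 1" are hypothetical; each slice returns its (possibly unchanged) register set together with any new skip count it initiates.

    def run_slices(program, regs):
        """Execute the unrolled slice program once, honoring skip counts."""
        skip = 0
        for op in program:
            if skip > 0:
                skip -= 1           # suppressed slice: decrement and forward
                continue
            regs, skip = op(regs)   # an active slice may initiate a new skip
        return regs

    def branch_if_nonneg(regs):
        # condition "x < 0" not met: skip the two-slice "then" clause
        return regs, (2 if regs["x"] >= 0 else 0)

    def negate_x(regs):
        return {**regs, "x": -regs["x"]}, 0

    def skip_else(regs):
        return regs, 1              # unconditional skip over the one-slice "else"

    def add_one(regs):
        return {**regs, "x": regs["x"] + 1}, 0

    program = [branch_if_nonneg, negate_x, skip_else, add_one]
    print(run_slices(program, {"x": -5})["x"])   # 5: "then" branch taken
    print(run_slices(program, {"x": 3})["x"])    # 4: "else" branch taken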
With each individual consideration of status bits tested by the operations, the
decision to proceed with the “then” clause or the “else” clause is made specifically for
that streaming program execution as it passes down through the stack of SPUs. Some
executions may be moving down one branch while simultaneously other executions are
moving down an alternative branch. This implements the algorithm in a way that exactly
corresponds to the execution of a sequentially based program in a standard CPU that does
not make use of backwards branches or loops.
This mechanism may be indefinitely recursive. If one clause of a conditional is
executed, it may have subclauses that are conditionally executed or skipped based on the
single skip count, as it is not otherwise in use for an executed clause. Conversely, in a
conditional clause that is skipped, the skip count will not be zero and the conditional
operations that would select one or another subclause are not executed, and the skip count
is decremented and passed down without disturbance.
While this method is quite versatile, it is not necessarily the most efficient
method, as SPUs are dedicated to one clause or another, and for any given streaming
execution passing down the stack, one section or the other will be quiescent. This is not
an optimal use of the limited set of SPUs extant in the system as a whole.
An alternative method of conditional execution is to implement multiple tracks of
instructions. Storage of instructions in this circumstance is expanded such that each SPU
contains more than one preloaded instruction to execute, controlled by an incoming track
number. The differing instructions that implement the multiple clauses are loaded into
different tracks of the same SPU such that the SPU will execute one instruction or
another depending on which track it is “on” for any given cycle, and fewer SPUs will be
idle. Since the number of SPUs dedicated to the conditional operation is some fixed
number, the shorter of two tracks will end with a skip to get to the common point. Nested
conditionals may be implemented by switching tracks further and the general skip
mechanism may be used if the number of tracks is insufficient for a complex set of
conditionals.
In this way, the number of tracks implemented in an ASIC can be chosen based
on statistics of the expected algorithms and extended when required by the skipping
mechanism, as opposed to having a fixed limit of possible paths beyond which an
algorithm cannot be implemented.
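The track mechanism may be sketched in software as follows, by way of illustration only. The two-track, two-slice program is hypothetical and repeats the conditional example used above so the two methods can be compared; note that the second SPU holds both clauses, one per track, and is never idle.

    def run_tracked(stack, regs, track=0):
        """One pass down a stack whose SPUs each hold one instruction per track."""
        for tracks in stack:               # tracks: {track_number: operation}
            regs, track = tracks[track](regs, track)
        return regs

    # Same conditional as above, "if x < 0 then x = -x else x = x + 1":
    def branch(regs, track):
        return regs, (0 if regs["x"] < 0 else 1)   # track 0 (then) or 1 (else)

    def negate(regs, track):
        return {**regs, "x": -regs["x"]}, 0        # rejoin on track 0

    def add_one(regs, track):
        return {**regs, "x": regs["x"] + 1}, 0     # rejoin on track 0

    stack = [
        {0: branch, 1: branch},    # the test occupies this slice on both tracks
        {0: negate, 1: add_one},   # one SPU holds both clauses; no idle slice
    ]
    print(run_tracked(stack, {"x": -5})["x"])   # 5 (then-track)
    print(run_tracked(stack, {"x": 3})["x"])    # 4 (else-track)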
Returning the output of the final slice to the first slice in lieu of new data allows
the tracking mechanism to implement programs that are larger than the implemented
stack size. Reusing the slices in this way reduces the number of SPUs required for a
given program at the cost of throughput: each pass through the stack consumes one
clock cycle per datum presented, with a corresponding multiple of latency.
It should be clear that not all algorithms can (or should) be adapted to this scheme,
and that circumstance is compatible with the intent of this invention. This new design
should be used where very high throughput is required and very repetitive sequences of
fixed operations are inherent to the definition of the algorithm. This is a typical case for
many forms of encryption, decryption, data integrity, compression, decompression, and
cryptocurrency, among possibly other forms of computing. The normal operations of an
Operating System should be handled by a typical CPU whenever the extreme conditions
of an algorithm do not present a significant burden.
Fig. 1 depicts an overall configuration of a system according to an embodiment of
the invention. A stack of Slice Processor Units (SPUs) 105 is configured such that each
SPU executes one of several preloaded instructions on each clock cycle whenever input
data is available. Instructions are never fetched from Random Access Memory (RAM),
rather instructions are preloaded before the program is executed and are not altered
during execution. The entire contents of the data registers are passed from one SPU to
the next, along with program status bits and other internal states. As the results of each
program step are completed, they are passed down to the next program step. The net
result is that each step of the algorithm is executed simultaneously with all others, but
each operates on data received at a distinct and different point in time. The
throughput is thus synchronized with the input data rate. While the stack illustrated in
Fig. 1 shows only 10 SPUs, the number of SPUs in a robust implementation would
typically be in the thousands. The standard SHA-256 algorithm can be implemented in a
stack of fewer than 2,000 SPUs.
An input memory stream interface 104 and an output memory stream interface
106 are connected to a Direct Memory Access (DMA) controller 102 which manages
data to and from a set of RAMs 103 which may be internal or external to the ASIC.
DMA controller 102 also provides access to a system controller CPU 101 which also may
be internal or external to the ASIC. The system controller CPU provides general
housekeeping functions as well as executing algorithms that are not suited to the
constraints of the stacked SPU form.
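By way of illustration, the following hypothetical Python sketch models the Fig. 1 dataflow as a whole: a DMA-like loop streams words from one region of a RAM model through a software stand-in for the SPU stack and writes results back to another region. All names are illustrative, not the claimed hardware.

    def dma_stream(ram, src, dst, count, stack):
        """Move `count` words from ram[src:] through the SPU stack into ram[dst:]."""
        depth = len(stack)
        pipeline = [None] * depth                 # one register set held per slice
        written = 0
        for i in range(count + depth):            # extra clocks drain the pipeline
            out = pipeline[-1]
            feed = [ram[src + i] if i < count else None] + pipeline[:-1]
            pipeline = [stack[j](v) if v is not None else None
                        for j, v in enumerate(feed)]
            if out is not None:
                ram[dst + written] = out          # output stream back to RAM via DMA
                written += 1

    ram = [10, 20, 30, 40] + [0] * 4
    dma_stream(ram, src=0, dst=4, count=4, stack=[lambda v: v + 1] * 3)
    print(ram[4:])   # [13, 23, 33, 43]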
Fig. 2 shows a configuration for each SPU in stack 105 in one embodiment of the
invention. A master operation execution enable (EXECEN) 201 signal indicates data
from the previous SPU is available and enables computations to proceed (when not
suppressed by slice skipping) and output registers to accept new results. Track number
202 signal from the previous SPU selects which track this SPU is on, and is updated each
clock cycle, as described above. Skip count 203 signal denotes how many SPUs should
be skipped until execution resumes and is used to extend the conditional execution
mechanism when there are insufficient distinct tracks available to accommodate an
algorithm. Algorithmic status bits 204 control the conditional execution path. Register
values 205 provide the current algorithmic register values. A multiplexor 207 uses the
track number to dynamically select one of the preloaded instructions 206 and forwards
the instruction currently selected to an instruction decoder 209, which demultiplexes the
instruction word into control signals for the various function blocks, which include at
least a single cycle Arithmetic Logic Unit (ALU) 213. Control signals from the
instruction decoder are used by register select logic 210 to determine which register value
205 from the previous SPU, or arbitrary constant from the instruction, or limited constant
such as zero or one, to pass forward as primary value to the ALU. Likewise, register
select logic 211 determines the secondary value and register select logic 212 determines
the auxiliary value.
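The per-cycle behavior of a single SPU described above may be sketched as follows, by way of illustration only. The sketch is hypothetical and simplified: it models track selection, skip suppression, operand selection with negation, a single-cycle ALU, and targeted register replacement, while the conditional branch and status propagation paths are modeled separately with Fig. 7 below.

    def spu_step(instructions, track, skip, status, regs):
        """One clock of one SPU on the state received from the previous SPU."""
        if skip > 0:                                  # circuit 208: suppressed slice
            return track, skip - 1, status, regs
        ins = instructions[track]                     # multiplexor 207 selects by track

        def operand(index, negate):                   # register select logic 210-212
            v = ins["const"] if index == "K" else regs[index]
            return -v if negate else v

        a = operand(ins["pri"], ins["neg_pri"])
        b = operand(ins["sec"], ins["neg_sec"])
        value = {"ADD": a + b, "AND": a & b,
                 "IOR": a | b, "XOR": a ^ b}[ins["op"]]   # single cycle ALU 213
        status = {"zero": (value & 0xFFFFFFFF) == 0,
                  "carry": value > 0xFFFFFFFF}            # bits toward logic 215
        out = dict(regs)
        if ins["target"] is not None:                 # replacement logic 216
            out[ins["target"]] = value & 0xFFFFFFFF
        return track, 0, status, out

    ins = {"op": "ADD", "pri": "r0", "neg_pri": False,
           "sec": "K", "neg_sec": False, "const": 7, "target": "r1"}
    print(spu_step([ins], 0, 0, {}, {"r0": 5, "r1": 0}))
    # (0, 0, {'zero': False, 'carry': False}, {'r0': 5, 'r1': 12})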
Arbitrary constants are extracted from the instruction code by instruction decoder
209 and limited constants such as zero are determined by using invalid register indices to
indicate which of several limited specific constants are to be used. This allows both one
arbitrary constant and multiple limited constants to be used by the same instruction.
ALU 213 executes single cycle operations such as addition, logical AND, OR,
and XOR along with post alignment and numeric shifting of the operation results. Both
the operation and the output register index are received from the instruction decoder and
the results of the operation are passed along with the output register index to operational
register storage controller 221. Subtraction is not included in this typical implementation
because register select logic elements 210, 211, and 212 may be instructed to negate their
input value by the instruction decoder 209. Carry and zero status bits reflecting the
results of the operation are passed from ALU 213 to algorithmic status propagation logic
215 which is enabled by decoder 209 to override or retain algorithmic status bits 204
from the previous SPU and forward them to output algorithmic status register 220 and
conditional branch controller 214 which determines if the track number 202 from the
previous SPU should be retained or replaced with a new track number and skip count
from instruction decoder 209.
Conditional branch controller 214 accepts the branch control signal from
algorithmic status propagation logic 215 and uses the result to determine if the track
number is altered and skip count is altered. The altered or retained track number is
output to track number register 218 for transmission to the next SPU.
Skip count 203 suppresses execution of the SPU when it is non-zero, allowing
only register and status forwarding and multi-stage operations (not shown in this
embodiment) to proceed. Conditional decrement and zero detect circuit 208 reduces the
incoming skip count by one when the count is greater than zero and suppresses all
functionality when the incoming skip count is not zero. Decremented skip value from
conditional decrement and zero detect circuit 208 is output to skip count register 219 for
transmission to the next SPU unless replaced by a new skip count value from instruction
decoder 209.
Status bits output from algorithmic status propagation logic 215 are passed to the
next SPU via output algorithmic status register 220.
The output numeric value of ALU 213 is written to the target register indicated by
instruction decoder 209 by register value replacement logic 216, which promotes prior
values from register values 205, for each algorithmic register not specifically targeted for
alteration, to register storage controller 221.
EXECEN 201 signal is propagated to operation execution enable output register
217 and enables output clocked registers for track number register 218, skip count
register 219, algorithmic status register 220, and register storage controller 221.
Fig. 3 shows the typical breakout of bit fields in the instruction decoder 209. The
selected instruction word 301 from multiplexor 207 is decomposed to provide various bit
fields to other components. The number of bits in each field will vary to conform to the
indexing and numeric requirements of the SPU.
Operation code 302 selects the base operation of ALU 213. Status mask bits 306
are combined with the output status bits of ALU 213 and then compared to status test bits
307 to determine if conditional track number 303 and conditional skip count 304 should
replace the values from the previous SPU. Status propagation bits 305 indicate which
status bits from ALU 213 are propagated to the next SPU and which are retained from the
previous SPU.
Shift mode 308 determines how the ALU will post process the computed result.
Primary register index 310 indicates which register value or constant is to be used as
primary value of ALU 213 and primary negation flag 309 indicates if that value should
be negated. Secondary register index 312 indicates which register value or constant is to
be used as secondary value of ALU 213 and secondary negation flag 311 indicates if that
value should be negated. Auxiliary register index 314 indicates which register value or
constant is to be used as auxiliary value of ALU 213 and auxiliary negation flag 313
indicates if that value should be negated. Target register index 315 indicates which
register is to receive the result from ALU 213 and may be an invalid or constant index to
indicate the result is discarded and only affects status bits.
Primary constant value 317 is a full width constant sufficient to set the value of an
algorithmic register and auxiliary constant value 316 is a limited width constant sufficient
to indicate the range of shifts and rotates. Primary constant value 317 provides the
constant value to register select logic 210 and 211 and auxiliary constant value 316
provides the constant value to register select logic 212.
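By way of illustration only, the mechanical breakout of Fig. 3 may be modeled as below. The field widths are hypothetical, since as noted above they vary with the indexing and numeric requirements of the SPU; only the least-significant-first unpacking scheme is meant to be conveyed.

    FIELDS = [                      # (name, bit width); widths illustrative only
        ("opcode", 4),              # operation code 302
        ("status_mask", 2),         # status mask bits 306
        ("status_test", 2),         # status test bits 307
        ("status_prop", 2),         # status propagation bits 305
        ("cond_track", 2),          # conditional track number 303
        ("cond_skip", 8),           # conditional skip count 304
        ("shift_mode", 2),          # shift mode 308
        ("neg_pri", 1), ("pri_idx", 4),   # 309, 310
        ("neg_sec", 1), ("sec_idx", 4),   # 311, 312
        ("neg_aux", 1), ("aux_idx", 4),   # 313, 314
        ("target_idx", 4),          # target register index 315
        ("aux_const", 5),           # 316: enough for a 0-31 shift range
        ("pri_const", 32),          # 317: full register width
    ]

    def decode(word):
        """Split a packed instruction word into its named bit fields."""
        out = {}
        for name, width in FIELDS:
            out[name] = word & ((1 << width) - 1)
            word >>= width
        return out

    # An all-zero word decodes to all-zero fields, which Figs. 6 and 7 treat
    # as a no-operation with no branch taken, as after system reset.
    assert all(v == 0 for v in decode(0).values())
    fields = decode(0b1010 | (5 << 12))
    print(fields["opcode"], fields["cond_skip"])   # 10 5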
Fig. 4 shows the logic used to determine the values of primary 210, secondary
211, and auxiliary 212 register select logic elements respectively used by ALU 213.
Multiplexor 405 selects one of the register values 403 passed forward by the previous
SPU, or for the first SPU, the input memory stream interface 104, or alternatively one of
the fixed constants 401, such as the values for 1 or 0, or the singular constant 402
extracted from the instruction word by instruction decoder 209. Choice of value is
indicated to multiplexor 405 by a register index 404 from instruction decoder 209.
Selected value from multiplexor 405 is negated by logic 407 and either the original or
negated value is selected by multiplexor 408 as indicated by a negation flag 406 from
instruction decoder 209. Resulting value 409 is then used as the selected value for each
of the primary 210, secondary 211, and auxiliary 212 register select logic elements.
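A hypothetical software model of one register select element follows, by way of illustration; the convention of using out-of-range register indices to denote the limited constants is illustrative.

    def register_select(regs, index, negation, instr_const):
        """Model of one of the select elements 210-212 (Fig. 4)."""
        n = len(regs)
        if index < n:
            value = regs[index]          # multiplexor 405: a forwarded register
        elif index == n:
            value = 0                    # invalid index used for fixed constant 0
        elif index == n + 1:
            value = 1                    # invalid index used for fixed constant 1
        else:
            value = instr_const          # constant 402 from the instruction word
        return -value if negation else value   # logic 407 and multiplexor 408

    regs = [10, 20, 30]
    print(register_select(regs, 1, False, 99))   # 20: register value
    print(register_select(regs, 3, False, 99))   # 0: limited constant
    print(register_select(regs, 5, True, 99))    # -99: negated instruction constant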
Fig. 5 shows internal details of ALU 213 that implements instructions for each
SPU in stack 105. The instruction selected by multiplexor 207 is broken into distinct
fields and presented as various interfaces. Primary value is provided by register select
logic 210 according to primary register index 310 and primary negation flag 309.
Secondary value is provided by register select logic 211 according to secondary register
index 312 and secondary negation flag 311. Auxiliary value is provided by register select
logic 212 according to auxiliary register index 314 and auxiliary negation flag 313.
Target register index 315 indicates which register receives the ALU 213 result
value to register value replacement logic 216 and register values 205 provide the values
for algorithmic registers that are not so targeted. Bitwise inclusive-or logic 501 provides
the IOR function of the primary and secondary values, bitwise exclusive-or logic 502
provides the XOR function, bitwise conjunction logic 503 provides the AND function,
and adder 504 provides the sum of the values and the carry status bit 513.
Operation code 302 indicates which computation function is selected by
multiplexor 505 which provides the value to logical shift left logic 506, roll left logic
507, logical shift right logic 508, and arithmetic shift right logic 509. Shift mode 308
indicates which shifted or rolled value is selected by multiplexor 511 to be passed on to
register value replacement logic 216. Skip zero detected 716 (See Fig. 7) is used by
register value replacement logic 216 to force values from register values 205 to be passed
unchanged and EXECEN 201 signal indicates when values are to be clocked into register
value replacement logic 216.
Zero test logic 510 takes the selected value from multiplexor 511 and outputs a
corresponding signal to zero status bit 512.
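The ALU datapath of Fig. 5 may be sketched as follows, assuming a hypothetical 32-bit register width and illustrative operation and shift-mode encodings.

    MASK = 0xFFFFFFFF   # hypothetical 32-bit register width

    def asr(v, n):
        """Arithmetic shift right: replicate the sign bit (logic 509)."""
        fill = (MASK << (32 - n)) & MASK if (v >> 31) and n else 0
        return (v >> n) | fill

    def alu(op, shift_mode, a, b, aux):
        a, b = a & MASK, b & MASK
        raw = {"IOR": a | b,        # bitwise inclusive-or logic 501
               "XOR": a ^ b,        # bitwise exclusive-or logic 502
               "AND": a & b,        # bitwise conjunction logic 503
               "ADD": a + b}[op]    # adder 504
        carry = raw > MASK          # carry status bit 513
        v, n = raw & MASK, aux & 31    # shift distance from the auxiliary value
        shifted = {"LSL": (v << n) & MASK,                         # logic 506
                   "ROL": ((v << n) | (v >> ((32 - n) % 32))) & MASK,  # logic 507
                   "LSR": v >> n,                                  # logic 508
                   "ASR": asr(v, n)}[shift_mode]                   # multiplexor 511
        return shifted, carry, shifted == 0   # zero test 510 sets status bit 512

    print(alu("ADD", "LSL", 0x80000000, 0x80000000, 0))  # (0, True, True)
    print(alu("XOR", "LSR", 0xF0, 0x0F, 4))              # (15, False, False)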
Fig. 6 shows internal support for each program register contained within register
value replacement logic 216. A single constant register index 601, the same index as is
used to access source registers in the instruction word, is compared to target register
index 315 by comparator 604. AND gate 605 combines the comparator output with
skip zero detected 716 (See Fig. 7) and selects the value 602 from ALU 213 via
multiplexor 606 whenever the skip count is zero and the indices are identical; otherwise
the value 603 of that algorithmic register from the previous SPU is retained through to
the next SPU. The selected value is clocked into output register 607 when enabled by EXECEN 201 signal.
Any algorithmic register not selected by target register index 315 always has the
corresponding value 603 from the previous SPU retained by multiplexor 606 through to
the next SPU via output register 607. When target register index 315 is all zeros, as is the
case after system reset, no replacement register is selected and the instruction word is
interpreted as a no-operation code if status propagation bits 305 are also zeros.
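By way of illustration, the per-register replacement rule of Fig. 6 reduces to the following hypothetical sketch, with index 0 standing in for the reserved all-zeros target index.

    def replace_registers(prev_regs, target_idx, alu_value, skip_is_zero):
        """Per-register retain-or-replace choice (Fig. 6); index 0 is 'no target'."""
        out = list(prev_regs)
        if skip_is_zero and target_idx != 0:    # comparator 604 and AND gate 605
            out[target_idx] = alu_value         # multiplexor 606 selects ALU value
        return out                              # clocked via output register 607

    regs = [None, 11, 22, 33]        # index 0 reserved (all-zero index = no-op)
    print(replace_registers(regs, 2, 99, True))    # [None, 11, 99, 33]
    print(replace_registers(regs, 2, 99, False))   # unchanged: skip in progress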
Fig. 7 shows a typical implementation of conditional decrement and zero detect
circuit 208, conditional branch controller 214, and algorithmic status propagation logic
215 for a system supporting two status bits, zero status bit 512 and carry status bit 513
from ALU 213.
Skip count 203 is passed to decrement logic 710 and zero detect logic 711. When
zero detect logic 711 indicates the incoming skip count value is non-zero, it forces
multiplexor 714 to pass the reduced value from decrement logic 710 to the skip count
register 219 enabled by EXECEN 201 signal. When zero detect logic 711 indicates the
incoming skip count value is zero, it forces multiplexor 714 to pass the value from
multiplexor 712 instead.
Status propagation bits 305 cause multiplexors 706 and 707 to select algorithmic
status bits 204 when each individual control bit is low or zero status bit 512 and carry
status bit 513 respectively from ALU 213 when each corresponding control bit is high.
Multiplexors 708 and 709 are controlled by output of zero detect logic 711 to select
algorithmic status bits 204 from previous SPU when incoming skip count 203 is non-zero
and the outputs of multiplexors 706 and 707 respectively when skip count 203 is zero.
The outputs of multiplexors 708 and 709 are clocked into output algorithmic status
register 220 when enabled by EXECEN 201 signal.
Status mask bits 306 provide individual bit selects to AND gates 701 and 702
which are combined with the outputs of multiplexors 706 and 707 respectively and
are forwarded to XOR gates 703 and 704 to determine if they differ from status test bits 307.
AND gate 705 asserts a logical high when all bits selected by status mask bits 306 differ
from status test bits 307 to effectuate a branch. This choice of polarity provides for no
branch taken when all bits of the instruction word are zeros, as is the case after system
reset.
Output of AND gate 705 causes multiplexor 712 to pass conditional skip count
304 to multiplexor 714 when output of AND gate 705 is asserted, and an all zero skip
count when output of AND gate 705 is de-asserted.
Output of AND gate 705 is also combined with output of zero detect logic 711 by
AND gate 713, which controls multiplexor 715. Multiplexor 715 will select track number 202 from
previous SPU when output of AND gate 705 is de-asserted or incoming skip count 203 is
non-zero. The output of multiplexor 715 is clocked into track number register 218 when
enabled by EXECEN 201 signal to provide the outgoing track value for the next SPU.
Output of zero detect logic 711 is provided via skip zero detected 716 signal to
register value replacement logic 216 (See Fig. 5) and AND gate 605 (See Fig. 6).
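The skip, track, and status handling of Fig. 7 may be modeled for the two status bits as below. The sketch is hypothetical; bit vectors are ordered (zero, carry), and the clocking by the EXECEN 201 signal is omitted.

    def condition_logic(skip_in, track_in, status_in, alu_status,
                        prop, mask, test, cond_track, cond_skip):
        """One SPU's skip/track/status handling; vectors ordered (zero, carry)."""
        if skip_in != 0:                       # zero detect 711: suppress this slice
            return skip_in - 1, track_in, status_in   # decrement 710, muxes 708/709
        # muxes 706/707: per-bit choice between prior status and ALU status
        status = tuple(alu_status[i] if prop[i] else status_in[i] for i in range(2))
        # gates 701-705: branch only when every masked status bit differs from
        # its test bit; an all-zero instruction word therefore never branches
        branch = all(((mask[i] & status[i]) ^ test[i]) for i in range(2))
        skip_out = cond_skip if branch else 0           # multiplexors 712 and 714
        track_out = cond_track if branch else track_in  # AND gate 713, mux 715
        return skip_out, track_out, status

    # Branch onto track 1, skipping 3 slices, when the ALU zero bit is clear:
    print(condition_logic(0, 0, (0, 0), (0, 1), prop=(1, 1),
                          mask=(1, 0), test=(1, 1), cond_track=1, cond_skip=3))
    # -> (3, 1, (0, 1))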
Fig. 8 shows an arrangement of additional logic that allows for the output of the
last slice to continuously be passed in as input to the first slice for additional processing.
Input memory stream interface 104 accepts data from the DMA controller 102
and presents it to multiplexors 802, 803, 804, 805, and 806 where it becomes the initial
input to the first SPU in the stack 105 under control of logic 801. When desired, original
data from input memory stream interface 104 may be supplemented by the final outputs
of the last SPU in the stack 105, under control of logic 801, in place of, or in addition to,
directing said outputs to output memory stream interface 106 for recording into RAM.
Logic 801 may use counters, data states, or any other means to determine when
data should be written to memory and when it should be sent back to the stack of SPUs
for further processing.
Data processed in this way could be passed through the stack starting with one
initial track, and then reprocessed through the stack as many times as desired on differing
tracks, limited only by the number of tracks in the implementation. This would allow the
reuse of the stack by expanding the tracks and result in a net reduction in the physical
size of the circuit required to execute a given algorithm.
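A hypothetical sketch of this recirculation follows, by way of illustration. It models only the functional effect of logic 801, feeding each final output back as input for a further pass on the next track; the throughput cost of occupying the stack for extra passes is not modeled.

    def run_with_recirculation(stack, data, passes):
        """Push each datum through the modeled stack `passes` times (Fig. 8)."""
        results = []
        for value in data:
            track = 0
            for _ in range(passes):       # logic 801: feed output back as input
                for tracks in stack:      # one full pass down the slices
                    value, track = tracks[track](value, track)
            results.append(value)         # final pass: out to the memory stream
        return results

    inc    = lambda v, t: (v + 1, t)
    double = lambda v, t: (v * 2, t + 1)   # switch to the next track at pass end
    stack  = [{0: inc, 1: inc}, {0: double, 1: lambda v, t: (v, t)}]
    print(run_with_recirculation(stack, [3, 10], passes=2))   # [9, 23]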
The instructions may be placed in the instruction word registers by any method
convenient to the implementation, such as shifting in instructions from one SPU to the
next or directly accessing each individual SPU. The former has the advantage that the
physical tile of each SPU is identical and the latter the advantage of being able to make
small adjustments in the program without loading all SPUs in the stack. The
implementation may use any of a variety of methods that best suits the design constraints.
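By way of illustration, the shift-in loading method reduces to the following hypothetical sketch, in which every SPU forwards its instruction word to the next slice as a new word enters at the top, so that identical SPU tiles need no per-slice addressing.

    def shift_load(spu_instr_regs, program):
        """Shift the program in, last instruction first, until SPU 0 holds program[0]."""
        for word in reversed(program):
            # every SPU passes its word down one slice, then the new word enters
            spu_instr_regs[1:] = spu_instr_regs[:-1]
            spu_instr_regs[0] = word
        return spu_instr_regs

    regs = [0, 0, 0]
    print(shift_load(regs, ["I0", "I1", "I2"]))   # ['I0', 'I1', 'I2']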
Numerous extensions to this basic design are envisioned including, but not limited
to, multistage processors for fixed and floating point multiplication, division, and
addition; multiple simultaneous logical operations into distinct operational register
targets; matrix support operations; and transcendental functions.
It will be apparent to the skilled person that the embodiments described in the
disclosure are exemplary only and are not limiting to the scope of the invention. The
skilled person will be aware that many alterations may be made within the scope of the
invention, which is limited only by the claims.
CLAIMS
1. A computing machine, comprising:
a plurality of slice processing units (SPUs) coupled in a specific order from a first
SPU to a last SPU, each SPU configured to perform a single instruction of a specific
algorithm on an incoming value and to pass a resulting output value;
an input memory stream interface coupled to the first SPU in the specific order of
SPUs, configured to receive a value from a direct memory access (DMA) controller and
to pass the value to the first SPU in the specific order of SPUs;
an output memory stream interface coupled to the last SPU in the specific order of
SPUs, configured to receive the output value of the last SPU in the specific order of
SPUs, and to pass the value to the DMA controller;
a set of one or more random access memory (RAM) units coupled to the DMA
controller; and
a central processing unit (CPU) coupled to the DMA controller;
wherein all instruction steps of the specific algorithm are performed in parallel by
the plurality of SPUs, and a result of the algorithm is output by the last SPU in the
specific order in each clock cycle.
2. The computing machine of claim 1 wherein the interconnected SPUs, the input and
output memory stream interfaces and the DMA controller comprise an application
specific integrated circuit (ASIC), wherein at least one instruction of an algorithm is
preprogrammed into individual ones of the SPUs.
3. The computing machine of claim 2 wherein one or both of the one or more RAM units
and the CPU are a part of the ASIC.
4. The computing machine of claim 1 wherein algorithms with conditional branches are
supported by a skip count passed between SPUs in order, such that if the skip count has a
positive value the SPU does not perform its programmed instruction and passes the
skip count, decremented by one, to the next SPU, and if the skip count is zero, the SPU performs its
programmed instruction.
5. The computing machine of claim 4 wherein algorithms with conditional branches are
further supported by SPUs being preprogrammed with more than one instruction and an
integer track number is passed from SPU to SPU, the track number determining which
instruction of the more than one instruction is to be performed on the incoming value by
the SPU.
6. The computing machine of claim 1 wherein the final result passed to the
output memory stream interface is passed back to the DMA controller and written to the
at least one RAM unit.
7. The computing machine of claim 1 wherein the final result passed to the
output memory stream interface is passed back to the first SPU as a new input value.
8. A method for calculating an algorithm, comprising:
providing a plurality of slice processing units (SPUs) coupled in a specific order
from a first SPU to a last SPU, each SPU configured to perform a single instruction of a
specific algorithm expressed as a sequence of instructions on an incoming value and to
pass a resulting output value;
coupling the first SPU in the specific order to an input memory stream interface
configured to receive a value from a direct memory access (DMA) controller coupled to a
random access memory (RAM) and to pass the value to the first SPU;
coupling the last SPU in the specific order to an output memory stream interface
configured to receive the output value of the last SPU, and to pass that value to the DMA
controller;
programming each SPU in the specific order with a different one of the sequence
of instructions comprising the specific algorithm;
passing values sequentially to the first SPU from the input memory stream
interface; and
performing all instruction steps of the specific algorithm in parallel by the
plurality of SPUs, such that a result of the algorithm is output by the last SPU in the
specific order in each clock cycle.
9. The method of claim 8 comprising providing the plurality of SPUs as an application
specific integrated circuit (ASIC).
10. The method of claim 8 comprising providing the plurality of SPUs, the input
memory stream interface and the output memory stream interface as an ASIC.
11. The method of claim 8 comprising providing the plurality of SPUs, the input
memory stream interface, the output memory stream interface, and the RAM as an
ASIC.
12. The method of claim 8 comprising providing the plurality of SPUs, the input
memory stream interface, the output memory stream interface, the RAM and a central
processing unit (CPU) as an ASIC.
13. The method of claim 8 comprising supporting algorithms with conditional branches
by a skip count passed between SPUs in order, such that if the skip count has a positive
value the SPU does not perform its programmed instruction and passes the skip
count, decremented by one, to the next SPU, and if the skip count is zero, the SPU performs its
programmed instruction.
14. The method of claim 13 comprising further supporting algorithms with conditional
branches by programming individual ones of the SPUs with more than one instruction
and passing an integer track number between SPUs, the track number determining which
instruction of the more than one instruction is to be performed on the incoming value by
the SPU.
15. The method of claim 8 further comprising passing the outputs of the last SPU in the
specific order, as results of the algorithm, to the output memory stream interface, passing
the outputs back to the DMA controller, and writing the outputs to the RAM.
16. The method of claim 8 further comprising passing the final result that is passed to
the output memory stream interface to the first SPU as a new input value.
ABSTRACT
A computing machine has a plurality of slice processing units (SPUs) coupled in a
specific order from a first SPU to a last SPU, each SPU configured to perform a single
instruction of a specific algorithm on an incoming value and to pass a resulting output
value to a next SPU in the specific order. Values are retrieved from a memory unit by a
direct memory access (DMA) controller and passed as input to the first SPU in the
specific order, and each value passed begins a new performance of the algorithm
executed in parallel with previously submitted values. Each SPU in each clock cycle
performs its instruction and passes its result to the next SPU. A result for the algorithm is
thus produced in each clock cycle. The SPUs are part of an application specific
integrated circuit (ASIC) and may be programmed with instructions.