aka programmable hardware, (re)configurable logic, etc
Subject: Repost: Performance Benchmarks: Emulating FPGAs Using General Purpose Processors Date: 09 Feb 1996 00:00:00 GMT newsgroups: comp.arch.fpga Reposted after Netcom corrupted the first copy. In <1996Feb9.082016.20892@super.org> sc-@vcc.com (Steve Casselman) writes: >>Another thread that would be good is just for some benchmark >>data on what people are doing now. For example if someone >>is doing neural nets you might post: >> >>I did a 12 neuron neural net that did 12 Billion connections/sec >>in 3000 gates and a ALPHA 330MegHz can do them @ 300 Million >>connections/sec. (no I did not do this it is just an example >>of the detail of the proposed post) A while back I was wondering about this in general: "how much better are FPGAs at executing, in hardware, an arbitrary computation, than are modern general purpose processors at executing, in software, the same arbitrary computation?". And so I wrote the following loose analysis which might be interesting to discuss: Emulating FPGAs using general purpose processors Jan Gray, 1995 It is well known that FPGAs can implement general purpose computers, and general purpose computers can emulate FPGAs. In this message I consider how efficiently a general purpose processor can emulate an arbitrary FPGA circuit. It arises from my curiosity regarding what performance advantage (if any) the custom reconfigurable computing crowd has over general purpose processors, to quantify, best case, how much faster an FPGA is at arbitrary FPGA problems. Setting aside such modern features as embedded SRAM, 3-state buses, asynchronous flip-flop set/reset, and even flip-flop clock enables, we model a typical lookup-table-based FPGA logic element as an arbitrary 4-input logic function driving either another stage of logic or a flip-flop. Without fudging too much we can then consider that all multilevel logic functions are pipelined with a register between each logic function. This simplifies my equivalence construction below and allows us to compare against an FPGA clock rate based upon the sum of flip-flop clock-to-output delay, delay through some interconnect, one logic function delay, and the flip-flop setup time. Thus, we model an FPGA as a vector of n 4-input function generators F[i] driving n flip-flops R[i] which are in turn fed back into the function generator inputs. In practice most FPGAs are incapable of modeling arbitrary interconnect schemes between logic elements, instead greatly favouring local interconnect between nearby logic elements. However, we’ll be overly generous and permit any function generator F[i] to compute any function of any 4 flip-flop outputs: R[i]’ = F[i](R[a[i]], R[b[i]], R[c[i]], R[d[i]]); for i, a[i], b[i], c[i], d[i] all in [0..n). Note that a, b, c, d, are simple mappings which describe which flip-flop outputs are inputs to each given function generator. For instance, if R[0]’ = R[2] xor R[3] xor R[5] xor R[7], we would have a[0]=2, b[0]=3, c[0]=5, d[0]=7. On a general purpose processor with word width w, we can implement the flip-flop vector by partitioning it into w bit registers. To simplify the presentation, assume that n == w. Then a simulation of one clocking of the FPGA is to form A, B, C, D, each a mapping of R such that A[i] == R[a[i]] B[i] == R[b[i]] C[i] == R[c[i]] D[i] == R[d[i]] and finally compute R’[i] = F[i](A[i], B[i], C[i], D[i]) word-wise, over all the bit positions i simultaneously. The first sub-problem is to compute 4 mappings of bits of R into A, B, C, D efficiently. One approach is to perform wordwise bit shifting, xor, and mask operations upon R. For instance, R can be bit-reversed in 5 lg w simple RISC instructions, arbitrarily permuted in 10 lg w instructions, and arbitrary mapped (any bit input to any bit output, including arbitrary bit replication) in roughly 20 lg w instructions employing 4 lg w constants. (Knuth developed this by analogy from exchange networks). Another approach is to build w/8 256-entry lookup tables, for each byte of bits in R, and "or" the contributions together: word R, LUA[w/8][256], A = 0; for (i = 0; i < w/8; i++) A |= LUA[i][(R>>(i*8)) % 256]; Written in-line this would require about w/8 shifts, masks, loads and "or"s. Let us assume the tables fit into d-cache and charge overall 4*w/8 == w/2 instructions to arbitrarily map R into A. For example, for w==64, the wordwise bit mapping approach would require approximately 120 instructions, whereas the table lookup approach would require approximately 32 instructions (including 8 loads). Thus the four mappings A, B, C, D require about 4*32 == 128 instructions, including 32 loads and about 8 KB of lookup tables. The next sub-problem is to compute F(A,B,C,D). To perform this in parallel across all n bits, it suffices to generate the 16 minterms M[0]=~A&~B&~C&~D, M[1]=~A&~B&~C&D, ..., M[15]=A&B&C&D and then combine them under 16 masks N[0..15]: R = M[0]&N[0] | M[1]&N[1] | ... | M[15]&N[15]; By reusing common subexpressions this can be done in about 60 instructions. All totaled, on a w-bit processor, we need about 4*max(w/2, 20 lg w) + 60 instructions to simulate one clocking of an arbitrary w-bit wide FPGA, as modeled. Also of interest, the state required to describe the arbitrary w-bit FPGA of 4-input logic elements, is only about 16 + 4 lg w words. Let’s try out this model against some real hardware! A lowly XC4002A-5 has 8x8 CLBs each with 2 function generators and two flip-flops, and with a clock to flip-flop output delay of 3 ns, an "interconnect delay" which I'll empirically and arbitrarily state is 17 ns, and a combined function generator plus flip-flop setup time of 5 ns, all totaled, let’s say 25 ns. In summary, we won’t be far wrong (?) to say an ‘02A is a "128-bit FPGA" which you can clock at 40 MHz. In comparison a 32x32 CLB XC4025-5 would be a "2048-bit FPGA". In the general purpose processor corner, let’s employ an Alpha 21164A. I recall this machine can issue 4 instructions per clock at 300 MHz. Since w==64, it would take an Alpha about 200 instructions or about 50 issue slots (167 ns) to emulate one clocking of an arbitrary 64-bit FPGA. Thus our Alpha might emulate half an ‘02A at a rate of about 6 MHz. For another example, the BiCMOS MicroUnity media processor implementation can apparently perform 128-bit operations at 1000 MHz. Here with w==128, it could emulate an arbitrary 128-bit FPGA at about 4 MHz, perhaps faster if we can take advantage of some of its special bit shuffle instructions. Well! These results surprised me. Even executing a billion operations per second on 64- or 128-bit registers, these general purpose processors couldn’t achieve 10% of the speed*gates figure of a small FPGA. Even if you consider my "12 ns" FPGA interconnect delay assumption far too generous given the arbitrary FPGA’s arbitrary interconnection topology, and derate it to 50 ns or more, the little FPGA is still several times faster at FPGA type work than a general purpose microprocessor. (I also wonder if this presentation doesn’t suggest a slow, yet compact, physical FPGA implementation. Attach a vector of flip-flops to the input and output sides of a "4-context" n-stage shuffle/copy/exchange interconnect, set up to route its input values according to 4 different routing configurations. The interconnect can be pipelined, and run at high speed, to map the flip-flop outputs into 4 registered function generator inputs and every 8 or 9 clocks the flip-flop itself could be clocked. Overall the pipeline could run at 200+ MHz and the overall FPGA at 25 MHz.) Jan Gray