━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Wishbone B4: Standard or Pipelined?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
While writing {HDL} to teach a chip new tricks, it is best to avoid drowning in
the complexity.
The famous {divide and rule} helps: splitting the design into modules that, like
functions in a programming language, reduce the scope of what is being worked
on and hide the complexity from the parent module that uses them.
But it quickly ends up as a sea of modules communicating in many different
ways.
Organising communication with a bus | |
───────────────────────────────────
Adding another layer of organisation becomes necessary: using a {bus} that
acts as a central spine for communication across the whole design.
Multiple bus protocols exist, with {Wishbone} among the simplest and most
widely used for open source cores.
What flavor? | |
──────────── | |
The Wishbone bus comes in multiple variants: | |
• Whether or not an extra `CTI` signal is used: *Classic* or *Registered Feedback*;
• Different timing constraints for `ACK`: *Synchronous* or *Asynchronous*; | |
• Different meanings for `STB` and `CYC`: *Standard* or *Pipelined*; | |
• Some extra optional signals. | |
I suppose the aim was to cover as many use-cases as possible, so that Wishbone
can be used in a standard way in most situations.
This large range of options also makes it harder to support every combination,
some being incompatible with each other, and it seems common to use the most
basic Wishbone in every case.
What remains is to decide which combination is the simplest.
Standard and Pipelined | |
────────────────────── | |
At first, I wanted to avoid the Pipelined mode, to keep things as simple as
possible. But my opinion changed after having a look at how both work:
In **Standard mode**, a master issues a request by taking `STB_O` high and, as
long as the slave is not ready, keeps `STB_O` high until it sees `ACK_I`
driven high by the slave. `CYC_O` and `STB_O` both stay high during the clock
where `ACK_I` is received, and it is only on the next clock that a new request
can be issued.
┊              ___     ___     ___     ___     ___
┊ CLK_I     __/   \___/   \___/   \___/   \___/   \__
┊              _______________________
┊ CYC_O     __/                       \______________
┊              _______________________
┊ STB_O     __/                       \______________
┊                              _______
┊ ACK_I     __________________/       \______________
┊ | |
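The Standard handshake in the diagram above can be replayed with a small
clock-by-clock model. This is only a sketch in Python, not HDL: the function
name and the `clocks_per_request` parameter are made up for illustration, and
the slave is assumed to need the same number of clocks for every request.

```python
def simulate_standard(n_requests, clocks_per_request=3):
    """Model a Standard-mode master talking to a slow slave.

    Returns one (cyc_o, stb_o, ack_i) tuple per clock.
    clocks_per_request is a hypothetical slave latency; it counts the
    clock on which ACK_I is high (3 in the diagram above).
    """
    if clocks_per_request < 1:
        raise ValueError("the slave needs at least one clock per request")
    trace = []
    remaining = n_requests
    waited = 0  # clocks spent so far on the current request
    while remaining > 0:
        waited += 1
        ack_i = 1 if waited == clocks_per_request else 0
        # CYC_O and STB_O stay high on every clock of the request,
        # including the one where ACK_I is seen.
        trace.append((1, 1, ack_i))
        if ack_i:
            # The request is done; a new one may start on the next clock.
            remaining -= 1
            waited = 0
    return trace


for cyc_o, stb_o, ack_i in simulate_standard(1):
    print(f"CYC_O={cyc_o} STB_O={stb_o} ACK_I={ack_i}")
```

With one request, the trace has three clocks with `STB_O` high and `ACK_I`
high only on the last one, like the waveform.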
In **Pipelined mode**, a master issues a request by taking `STB_O` high, and
instead of waiting for `ACK_I` before taking it back low, it checks `STALL_I`:
if high, it waits; if low, it considers the request queued by the slave, and
may submit another one right away. In that case, `ACK_I` only tells the
master that a queued request has finished.
┊              ___     ___     ___     ___     ___
┊ CLK_I     __/   \___/   \___/   \___/   \___/   \__
┊              _______________________________
┊ CYC_O     __/                               \______
┊              _______________
┊ STB_O     __/               \______________________
┊                                      _______
┊ ACK_I     __________________________/       \______
┊              _______
┊ STALL_I   __/       \______________________________
┊ | |
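The Pipelined handshake can be modelled the same way. Again this is a Python
sketch rather than HDL, under stated assumptions: the slave stalls only during
the first `stall_clocks` clocks, then accepts one request per clock and
acknowledges each one `ack_delay` clocks after accepting it. Both parameter
names and the function name are hypothetical.

```python
from collections import deque


def simulate_pipelined(n_requests, stall_clocks=1, ack_delay=2):
    """Return one (cyc_o, stb_o, stall_i, ack_i) tuple per clock."""
    if ack_delay < 1:
        raise ValueError("this model needs ACK_I at least one clock after acceptance")
    to_issue = n_requests  # requests the master still has to send
    in_flight = deque()    # clock number at which each queued request is acked
    acked = 0
    clock = 0
    trace = []
    while acked < n_requests:
        clock += 1
        stb_o = 1 if to_issue > 0 else 0
        stall_i = 1 if clock <= stall_clocks else 0
        ack_i = 1 if in_flight and in_flight[0] == clock else 0
        # CYC_O stays high until the last ACK_I has been seen.
        trace.append((1, stb_o, stall_i, ack_i))
        if ack_i:
            in_flight.popleft()
            acked += 1
        if stb_o and not stall_i:
            # Request queued by the slave: the master may present the
            # next one on the following clock without waiting for ACK_I.
            in_flight.append(clock + ack_delay)
            to_issue -= 1
    return trace


for row in simulate_pipelined(1):
    print("CYC_O=%d STB_O=%d STALL_I=%d ACK_I=%d" % row)
```

With the default values and a single request, the four clocks of the trace
follow the waveform: stalled, accepted, waiting, acknowledged.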
In both cases, `CYC_O` stays high through the whole transaction, and `ACK_I`
announces that the request is done.
Other signals, such as data, read/write or address have been omitted for | |
clarity. | |
Standard uses one less signal | |
─────────────────────────────
Implementing a Pipelined slave does not turn out to be more complex in practice:
• If the slave is simple and gives single-clock answers, the extra `STALL_I`
can be tied low (`STALL_I = 0`) and ignored.
• If the slave needs multiple cycles before taking a request, the equivalent of
`STALL_I` would have been needed in Standard mode anyway, in the form of an
internal `busy` register.
A Standard master, however, is a bit simpler to implement: it does not have to
wait for the request to be queued and then wait again for the slave to provide
an answer. It only has to wait for `ACK_I`.
Pipelined for better throughput | |
───────────────────────────────
In the timing examples above, the slave takes 3 cycles to work on the request,
and then sets the `ACK_I` signal.
The Pipelined mode seems to take one more clock cycle to complete, but it still
has a higher throughput: it is not necessary to wait until the result is
available before submitting a new request.
This only works if the slave has a buffer, such as a FIFO, to queue the
incoming requests and work on them later.
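As a rough back-of-the-envelope check of the throughput argument, the number
of clocks for a burst of requests can be counted. The sketch below reuses the
numbers from the timing examples (3 clocks per Standard request; 1 stalled
clock, then 2 clocks from acceptance to `ACK_I` in Pipelined mode) and assumes
the Pipelined slave queues one request per clock once it stops stalling. The
function names and parameters are hypothetical.

```python
def standard_clocks(n_requests, clocks_per_request=3):
    # Each request must be acknowledged before the next one can start.
    return n_requests * clocks_per_request


def pipelined_clocks(n_requests, stall_clocks=1, ack_delay=2):
    # One initial stall, one clock to queue each request,
    # then the delay until the last ACK_I.
    return stall_clocks + n_requests + ack_delay


for n in (1, 2, 8):
    print(n, standard_clocks(n), pipelined_clocks(n))
# prints:
# 1 3 4
# 2 6 5
# 8 24 11
```

Under these assumptions, Standard mode wins for a single request, and
Pipelined mode pulls ahead as soon as two requests are issued back to back.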
Pipelined as easy to implement as Standard | |
──────────────────────────────────────────
The Pipelined mode may seem more difficult to implement, since it suggests
that a complex queuing mechanism has to be written for it, but a pipeline is
entirely optional even in Pipelined mode.
Only `ACK_I` needs to be shifted by one clock, which is done by using a
register instead of a wire for it. This adds the needed delay, since
registers apply changes on the next clock.
That way, it is still possible to write very simple modules that do everything
in a single clock.
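The difference between a wire and a register for the acknowledge can be
illustrated with a tiny model. This is a Python sketch of the behaviour, not
HDL, assuming a slave that is always ready (`STALL_I` tied low) and that
acknowledges every strobe; the function names are made up.

```python
def wired_ack(stb_samples):
    """ACK as a wire: it follows STB within the same clock."""
    return list(stb_samples)


def registered_ack(stb_samples):
    """ACK as a register: it shows the STB value of the previous clock."""
    ack_reg = 0
    acks = []
    for stb in stb_samples:
        acks.append(ack_reg)  # value visible during this clock
        ack_reg = stb         # value latched at the clock edge
    return acks


stb = [1, 1, 0, 0]
print(wired_ack(stb))       # [1, 1, 0, 0]
print(registered_ack(stb))  # [0, 1, 1, 0]
```

Each strobe still gets exactly one acknowledge; it simply arrives one clock
later.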
Standard has a 1-clock better latency
─────────────────────────────────────
A single clock cycle is indeed consumed by Wishbone in its Pipelined mode. This
could lead to an overall higher latency, in particular if multiple Wishbone
buses are chained together.
Pipelined may help with timing | |
──────────────────────────────
If overly complex operations are done in a single clock cycle, it may take too
long for all the signals to settle down and stabilise before the next clock
tick.
With too long a chain of logic, the timing constraint (the clock rate) might be
missed.
A long chain of logic can be broken down into two steps with a register, so
that half of the steps are done before the register and half after it, leaving
roughly half of the work to be done in a single clock tick.
If Wishbone is used in Standard mode, the signals would have to propagate
inside the master, then to the slave, then inside the slave, then back to the
master, all of that probably in a single clock tick.
Placing a register in the bus, by making `ACK_O` a register, makes it possible
to break the long chain from master to slave and back to master by introducing
an intermediate step (the register) where the signal pauses before going back
to the master, making sure it had time to settle down in the slave.
That way, if the timings of the slave are fine with one master, it has better
chances of being fine with any other master, since the timings of the slave
and master do not add up anymore.
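To put rough numbers on this, here is a small Python sketch comparing the
longest combinational path with and without a registered acknowledge. The
delays are invented for illustration, and the model is deliberately crude: it
ignores setup time, routing and clock skew, which a real timing report would
include.

```python
def longest_path_ns(master_ns, slave_ns, back_ns, ack_registered):
    """Longest chain of logic that must settle within one clock period."""
    if ack_registered:
        # The register splits the loop in two shorter paths:
        # master -> slave -> register, and register -> master.
        return max(master_ns + slave_ns, back_ns)
    # Without it, the whole round trip is a single chain.
    return master_ns + slave_ns + back_ns


def max_clock_mhz(path_ns):
    return 1000.0 / path_ns


# Hypothetical delays: logic in the master, logic in the slave,
# and logic on the way back into the master.
master_ns, slave_ns, back_ns = 4.0, 5.0, 3.0

for registered in (False, True):
    path = longest_path_ns(master_ns, slave_ns, back_ns, registered)
    print(f"registered={registered}: {path} ns, {max_clock_mhz(path):.1f} MHz")
```

With these made-up figures, the path shrinks from 12 ns to 9 ns, and the
slave's share of it no longer depends on what the master does with the reply.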
Conclusion | |
────────── | |
While Standard Wishbone seems more frequently used, the Pipelined mode seems to
be a bit friendlier to timing, and most of its drawbacks, like the extra clock
for `ACK` or the extra signal, would likely also appear in Standard mode.
I am still new to Wishbone, and very curious about what you think about it:
Which variant do you use? Anything that I missed about the Standard
mode? `[email protected]`
Among notable Pipelined mode users is {ZipCPU}. | |
Update | |
────── | |
While looking at {this} ZipCPU article, it seems that its motivation for using
Pipelined mode is expressed in these sentences.
First, a reminder of how logic gates may "solve maths":
┊ One solution to sequencing operations is to create a giant state machine. | |
┊ The reality, though, is that an FPGA tends to create all the logic for | |
┊ every state at once, and then only select the correct answer at the end of | |
┊ each clock tick. In this fashion, a state machine can be very much like the | |
┊ simple ALU we've discussed. | |
And the conclusion of what makes more sense: | |
┊ On the other hand, if the FPGA is going to implement all of the logic for | |
┊ the operation anyway, why not arrange each of those operations into a | |
┊ sequence, where each stage does something useful? This approach rearranges | |
┊ the algorithm into a pipeline. | |
And its use of Wishbone is extensively explained in {https://raw.githubusercon… | |
Links | |
───── | |
• {http://cdn.opencores.org/downloads/wbspec_b4.pdf#page=91} | |
• {http://zipcpu.com/zipcpu/2017/05/29/simple-wishbone.html} | |
• {https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html} | |
• {https://raw.githubusercontent.com/ZipCPU/zipcpu/master/doc/orconf.pdf} |