FPGA Design and Implementation
ASIC & VLSI
• Time-to-market: Some large ASICs can take a year or
more to design.
• Design issues: a lot of time is needed to handle mapping,
placement, routing, and timing.
• The FPGA design flow removes much of the complex and
time-consuming manual work of floorplanning, place and route,
and timing analysis that an ASIC flow requires.
CONCEPTUAL FPGA
[Figure: conceptual FPGA made up of logic blocks, I/O cells, and interconnect resources]
FPGA
• Speed (memory: block RAM and distributed RAM).
• RAM is volatile (data is lost); size and cost.
• Floating-point versus fixed-point issues.
• Flexibility.
08/28/09
Design Flow Process Diagram
[Figure: Design Entry -> Technology Mapping -> Placement -> Routing -> Programming Unit -> Configured FPGA]
Why HDL?
• An HDL lets the designer implement and verify complex
hardware functionality at a high level, without having to
know the details of the low-level design implementation.
• Advantages:
• FPGAs have lower prototyping costs
• FPGAs have shorter production times
• Synthesis: The process that translates VHDL code
into a complete circuit built from logic elements (gates,
flip-flops, etc.).
Maximum Throughput Designs
• Dataflow
• Unrolling
• Pipelining
• Merging
Loop Unrolling
• Arrays a[i], b[i], and c[i] are mapped to RAMs.
• Rolled loop: This implementation takes four clock cycles and one multiplier, and each RAM can be
single-port.
• Unrolled loop: The entire loop operation can be performed in a single clock cycle. This requires four
multipliers and the ability to perform four reads and four writes in the same clock cycle; it may
require the arrays to be implemented as register arrays rather than RAMs.
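The two forms can be sketched in C as follows. The loop body a[i] = b[i] * c[i] and the function names are assumptions made for illustration, since the slide's figure is not reproduced here; the trip count of four is taken from the bullets above.

```c
#define N 4

/* Rolled form: one multiplication per iteration, so a single multiplier
 * can be reused over four iterations and each array needs only one
 * access per cycle. */
void mult_rolled(int a[N], const int b[N], const int c[N]) {
    for (int i = N - 1; i >= 0; i--) {
        a[i] = b[i] * c[i];
    }
}

/* Fully unrolled form: four independent multiplications with no loop
 * control. In hardware they can all be scheduled together, which needs
 * four multipliers and four reads and four writes at once. */
void mult_unrolled(int a[N], const int b[N], const int c[N]) {
    a[3] = b[3] * c[3];
    a[2] = b[2] * c[2];
    a[1] = b[1] * c[1];
    a[0] = b[0] * c[0];
}
```

Both functions compute the same result; only the amount of hardware and the number of simultaneous memory accesses differ.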
Loop Merging
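As a minimal sketch of the idea, assuming two consecutive loops with the same bounds and no dependency between them, merging combines their bodies into a single loop so that the loop control runs only once. The function names and the operations inside the loops are hypothetical.

```c
#define LEN 8

/* Before merging: two loops over the same index range. */
void two_loops(const int in[LEN], int plus_one[LEN], int doubled[LEN]) {
    for (int i = 0; i < LEN; i++) {
        plus_one[i] = in[i] + 1;
    }
    for (int i = 0; i < LEN; i++) {
        doubled[i] = in[i] * 2;
    }
}

/* After merging: one loop that performs both bodies per iteration. */
void merged_loop(const int in[LEN], int plus_one[LEN], int doubled[LEN]) {
    for (int i = 0; i < LEN; i++) {
        plus_one[i] = in[i] + 1;
        doubled[i] = in[i] * 2;
    }
}
```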
Pipelining
• Pipelining allows operations to happen concurrently: a new
set of inputs can start before the previous set has finished.
• Function pipelining is only possible if there is no resource contention or data dependency that
prevents it. In this example the input array "m[2]" is implemented with a single-port RAM. The function
cannot be pipelined because the two read operations on input "m[2]" ("op_Read_m[0]" and
"op_Read_m[1]") cannot be performed in the same clock cycle.
• Solution: The resource contention problem can be solved by using a dual-port RAM for array
"m[2]", allowing both reads to be performed in the same clock cycle, or by increasing the initiation
interval of the pipeline.
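The port limit can be illustrated with a small C sketch. The function `sum_pair` stands in for the slide's example (the addition is an assumption; only the two reads of m come from the text), and `min_interval` is a hypothetical helper that models only memory-port contention: each port serves one access per clock cycle, so the interval cannot be smaller than the number of accesses divided by the number of ports, rounded up.

```c
/* Stand-in for the example function: two reads of the same array. */
int sum_pair(const int m[2]) {
    int x = m[0]; /* op_Read_m[0] */
    int y = m[1]; /* op_Read_m[1] */
    return x + y; /* assumed computation, for illustration only */
}

/* Smallest pipeline interval allowed by memory ports alone.
 * ports must be at least 1. Other limits (data dependencies,
 * other shared resources) are not modelled here. */
unsigned min_interval(unsigned accesses, unsigned ports) {
    return (accesses + ports - 1) / ports; /* ceiling division */
}
```

Under this simplified model, two reads through a single-port RAM give an interval of 2, while a dual-port RAM allows an interval of 1, matching the two solutions listed above.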
Array Optimizations
• Mapping: When there are many small arrays, mapping them into a single
larger array reduces the storage overhead.
• Partitioning: If each small array gets a separate memory, a lot of memory
space is potentially wasted, and the design becomes larger and consequently
consumes more power.
• Horizontal mapping: this corresponds to creating a new array by
concatenating the original arrays. Physically, this gets implemented as a
single array with more elements.
• Vertical mapping: this corresponds to creating a new array by
concatenating the corresponding words of the original arrays. Physically,
this gets implemented as a single array with a larger bit-width.
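The two mapping styles can be sketched in C as index and bit arithmetic. The array sizes, the 8-bit word width, and the function names are assumptions chosen for illustration.

```c
#include <stdint.h>

enum { N1 = 4, N2 = 6, NV = 4 };

/* Horizontal mapping: array1 occupies indices 0..N1-1 of the merged
 * array and array2 follows at offset N1, giving more elements of the
 * same width. */
void map_horizontal(const uint8_t array1[N1], const uint8_t array2[N2],
                    uint8_t merged[N1 + N2]) {
    for (int i = 0; i < N1; i++) {
        merged[i] = array1[i];
    }
    for (int i = 0; i < N2; i++) {
        merged[N1 + i] = array2[i];
    }
}

/* Vertical mapping: word i of each array is packed into one wider word,
 * giving the same number of elements with a larger bit-width. */
void map_vertical(const uint8_t low[NV], const uint8_t high[NV],
                  uint16_t merged[NV]) {
    for (int i = 0; i < NV; i++) {
        merged[i] = (uint16_t)((high[i] << 8) | low[i]);
    }
}
```

With vertical mapping a single read returns word i of both original arrays, whereas with horizontal mapping the two arrays share one address space and are read at different addresses.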
Horizontal mapping
• Although horizontal mapping can result in using fewer RAM
components and hence improve area, it can reduce
throughput and performance.
• In the previous example both the accesses to "array1" and "array2"
can be performed in the same clock cycle.
• If both arrays are mapped to the same RAM, each read operation
now requires a separate access and its own clock cycle.
Vertical mapping
Array Partitioning
• Arrays can also be partitioned into smaller arrays, because a memory has a
limited number of read ports and write ports, which can limit the throughput
of a load/store-intensive algorithm.
• The bandwidth can sometimes be improved by splitting up the original array
(a single memory resource) into multiple smaller arrays (multiple memories),
effectively increasing the number of ports.
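One possible partitioning scheme, sketched in C, is a cyclic split into two banks: element i goes to bank i % 2 at index i / 2, so neighbouring elements always sit in different memories. The sizes and function names are hypothetical, and other schemes (for example, splitting into contiguous blocks) are equally valid.

```c
enum { SIZE = 8, BANKS = 2 };

/* Distribute one array over BANKS smaller arrays in round-robin order. */
void partition_cyclic(const int src[SIZE], int bank[BANKS][SIZE / BANKS]) {
    for (int i = 0; i < SIZE; i++) {
        bank[i % BANKS][i / BANKS] = src[i];
    }
}

/* Sum of original elements 2*k and 2*k+1. Each operand comes from a
 * different bank, so in hardware both reads could be issued together
 * even if every bank has only one port. */
int sum_adjacent(int bank[BANKS][SIZE / BANKS], int k) {
    return bank[0][k] + bank[1][k];
}
```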
• If the elements of an array are accessed one at a time, an efficient
implementation in hardware is to keep them grouped together and
mapped into a RAM.
• If multiple elements of an array are required simultaneously, it may
be more advantageous for performance to implement them as
individual registers: allowing parallel access to the data.
• Implementing an array of storage elements as individual registers
may help performance, but this consumes a large area and increases
power consumption.
Device: xa7a100tfgg484-2i
2-D for size N = 128*128

Input Array    Dual port    Independent Registers
LUT            1642         10778
FF             835          9548
Power          246          2031