# **BluSP-B3 processor**

BlueICe has developed a novel processor architecture and instruction set, optimized for markets and applications that require a mix of very low power consumption and significant processing power. Examples of these applications include OFDM-based wireless transceiver standards [e.g. Wireless LAN, 802.15.4g], GPS receivers, RADAR receivers and image recognition.

The architecture combines DSP-class processing power with the versatility of a microcontroller architecture.

Its programming model is straightforward: the core can be fully programmed in standard C. To use the full [complex] digital signal processing capability, one specific C data type has been introduced: "Complex Fract Short Int".
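
To give a feel for what a complex fractional short type represents, the sketch below emulates it in portable C. It is an illustration only: the struct `cfract16_t` and the helpers `sat_q15` and `cfract16_mul` are hypothetical names, the Q15 layout [16-bit real and imaginary parts in the range -1 to just under 1] is an assumption, and the actual compiler type and its rounding behavior may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for a complex fractional short value:
   real and imaginary parts stored as Q15 fixed-point numbers. */
typedef struct {
    int16_t re;
    int16_t im;
} cfract16_t;

/* Clamp a wide intermediate result to the Q15 range (saturation). */
static int16_t sat_q15(int64_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Complex multiply in Q15: (a.re + j*a.im) * (b.re + j*b.im).
   Products are formed in 64 bits so the sums cannot overflow.
   Assumption: >> on negative values is an arithmetic shift, as on
   mainstream compilers; the result is truncated, not rounded. */
static cfract16_t cfract16_mul(cfract16_t a, cfract16_t b)
{
    int64_t re = (int64_t)a.re * b.re - (int64_t)a.im * b.im;
    int64_t im = (int64_t)a.re * b.im + (int64_t)a.im * b.re;
    cfract16_t r = { sat_q15(re >> 15), sat_q15(im >> 15) };
    return r;
}
```

In this emulation, multiplying 0.5 by 0.5 [both stored as 16384] yields 8192, i.e. 0.25, and squaring -1.0 saturates to the largest Q15 value instead of wrapping around.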

A number of power consumption figures and performance data are provided in this datasheet. BlueICe claims they are best in class for this type of processing power. Power consumption and speed performance go hand in hand, especially in advanced technologies such as 40nm CMOS. These results have been achieved through handcrafted design, a relatively small ultra-RISC instruction set and a number of other architectural choices.

# Key Features and Benefits

### Standard C based straightforward programming model.

The core can be fully programmed in standard C. Only one additional data type has been defined, which allows full utilization of the core's complex DSP capability. A software development kit [SDK] based on Eclipse is available. This SDK contains a debugger, profiler, instruction set simulator and compiler.

### Ultra low power consumption.

In a reference 40nmLP technology, optimized for low leakage, power consumption on an average software pattern can be as low as 13 µA/MHz. A software pattern using the core at full load [all slots fully busy all the time, which can be achieved for specific DSP algorithms] consumes around 21 µA/MHz. All figures are nominal, at a nominal supply voltage of 1.1 V.
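
As a rough illustration of how a µA/MHz figure translates into current and power, the sketch below multiplies it by clock frequency and supply voltage. The function names are hypothetical, and the model assumes that current scales linearly with clock frequency and ignores leakage and memory consumption.

```c
/* Core supply current in microamperes at a given clock rate,
   from a dissipation figure expressed in uA/MHz. */
static double core_current_ua(double ua_per_mhz, double f_mhz)
{
    return ua_per_mhz * f_mhz;
}

/* Core power in milliwatts: P = I * V, with I in uA and V in volts,
   so the product is in microwatts and is divided by 1000. */
static double core_power_mw(double ua_per_mhz, double f_mhz, double vdd_v)
{
    return core_current_ua(ua_per_mhz, f_mhz) * vdd_v / 1000.0;
}
```

Under these assumptions, `core_power_mw(13.0, 600.0, 1.1)` evaluates to about 8.6 mW for the average pattern at 600 MHz, and `core_power_mw(21.0, 600.0, 1.1)` to about 13.9 mW at full load.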

### High speed.

In a reference 40nmLP technology a speed of up to 600 MHz can be achieved. Moving to a 40nmGP technology, the speed is estimated to improve to up to 1 GHz. This is achieved with a 5-stage pipeline architecture.

This speed also allows the core to be synthesized in technologies such as 0.18 µm CMOS while still delivering adequate performance.

### Low gate count.

The core has been synthesized with a gate count of 73 kgates, which allows multiple instances on one chip.

### DSP capability.

The core's DSP instruction set allows it to run DSP functions very efficiently. DSP instructions operate on 16-bit or 32-bit wide data.

Some examples:

In 32-bit mode, an MP3 decoding algorithm requires the core to run at ~4 MHz; in other words, on a 40nm 600 MHz version it uses about 0.7% of the core's processing bandwidth.

A 256-point FFT requires 1350 cycles in total. Here the FFT is calculated on 16-bit wide data.
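
The two figures above follow from simple cycle arithmetic, which can be sketched as below. The helper names are hypothetical, and the sketch assumes the cycle counts are independent of the clock rate [no wait states].

```c
/* Execution time in microseconds for a cycle count at f_mhz.
   A clock of f MHz delivers f cycles per microsecond. */
static double cycles_to_us(double cycles, double f_mhz)
{
    return cycles / f_mhz;
}

/* Share of the core's cycles, in percent, used by a task that
   needs required_mhz on a core clocked at core_mhz. */
static double utilization_pct(double required_mhz, double core_mhz)
{
    return 100.0 * required_mhz / core_mhz;
}
```

For example, `cycles_to_us(1350, 600)` gives 2.25 µs for the 256-point FFT at 600 MHz, and `utilization_pct(4, 600)` gives roughly 0.67%, which matches the rounded 0.7% quoted for MP3 decoding.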

### Complete offer.

Finally, a complete software development environment is delivered together with the core: an Eclipse-based framework containing a compiler, debugger, instruction set simulator and profiler.

# Typical applications

The core addresses applications which require a mix of very low power consumption and significant processing power.

Examples of these applications include OFDM-based wireless transceiver standards [e.g. Wireless LAN, 802.15.4g], GPS receivers, RADAR receivers and image recognition.

Furthermore, the core can address a wide variety of audio applications efficiently, operating on 32-bit wide data.

One example to demonstrate its calculation/power efficiency capability:

Speed and power consumption have been analyzed for a Wireless LAN 11n demodulator. The heart of this demodulator is an FFT.

To do the full demodulation, the core has to run at 180 MHz. It then consumes on average 4 mW in receive mode, in the same 40nmLP technology described above. Taking memory power consumption into account, the overall power consumption for an 11n receiver is estimated at 6 mW.

The core power consumption can be further reduced by lowering its supply voltage, which in the Wi-Fi 11n example allows another 20% reduction. This assumes that adequate memories can be found or developed.

# <u>Processor Details</u>

- Ultra-RISC: small instruction set, about 90 instructions.
- C friendly design:
  - Register file of 32 registers of 32 bits each, for easy scheduling by the C compiler.
  - Orthogonal, simple instruction set.
- DSP capabilities in core
  - Supports fractional fixed-point data types.
  - Supports saturated add/multiply.
  - Native instructions operating on complex fractional data types.
- Multiplication/accumulation unit part of the core.
  - Supports single-cycle 16x16 bit i+q complex multiplication and addition.
- 3-issue machine.
  - Capability to execute 3 instructions in a single cycle.
  - Slot1: Multiply/ALU instruction capability.
  - Slot2: ALU instruction capability.
  - Slot3: Load/store or branch instruction capability.
- Small instruction width for efficient execution of control code, and reducing power dissipation in instruction RAM.
- Wide, 64-bit data memory. 8-bit, 16-bit, 32-bit and 64-bit load/store instructions are part of the instruction set.
- All instructions execute in a single cycle. Note that the core can execute 3 instructions per cycle.
- Instructions can be combined as follows: one unit can handle a load/store or a branch instruction; the second unit can handle a complex MAC or any ALU instruction; the third unit can handle any ALU instruction.

- Debug interrupt and application interrupt part of the core.
- Hardware debug unit.
- Precise exception handling part of the core.
  - Precise exception handling means the core stops *before* the instruction that caused the exception [integrated wind-back technology in the core].
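
The slot rules listed above can be sketched as a small check that tells whether a group of instructions fits into one issue cycle. This is a simplified model for illustration: the enum `op_class_t` and the function `can_issue_together` are hypothetical names, and register dependencies and encoding constraints, which a real scheduler must also respect, are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Coarse instruction classes, following the slot description:
   slot 1 takes multiply/MAC or ALU, slot 2 takes ALU,
   slot 3 takes load/store or branch. */
typedef enum { OP_ALU, OP_MAC, OP_LOAD_STORE, OP_BRANCH } op_class_t;

/* Returns true if the n instructions (n <= 3) fit into one cycle:
   at most one load/store or branch, at most one multiply/MAC,
   and multiply/MAC plus ALU operations together fill at most two slots. */
static bool can_issue_together(const op_class_t *ops, size_t n)
{
    size_t mac = 0, alu = 0, mem_or_branch = 0;

    if (n > 3) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case OP_MAC:        mac++;           break;
        case OP_ALU:        alu++;           break;
        case OP_LOAD_STORE: /* fall through: same slot as branch */
        case OP_BRANCH:     mem_or_branch++; break;
        }
    }
    return mem_or_branch <= 1 && mac <= 1 && (mac + alu) <= 2;
}
```

In this model a MAC, an ALU operation and a load can share a cycle, while a load paired with a branch, or three ALU operations, cannot.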

## Tools and libraries

- Eclipse IDE environment
- C-Compiler
- Debugger
- Assembly development environment
- Instruction set simulator
- Profiler
- Communications library

### Detailed Implementation data

- Synthesized into 73 kgates.
- 4000 flops, > 95% on gated clocks.
- Speed in CMOS90GP: 500 MHz.
- 600 MHz in C40LP.
- Estimated > 1 GHz clock rate in C40GP technology.
- Estimated 200 MHz in C180.
- Dissipation in CMOS90LP: 62 µA/MHz for a full load pattern.
- Dissipation in C40LP:
  - 15 µA/MHz for an average C program, utilizing 1 to 1.5 of the 3 slots, at 1.1 V supply voltage.
  - 23 µA/MHz for a full load pattern, with all 3 slots occupied all the time, at 1.1 V supply voltage.
  - Limiting the speed of the core to 250 MHz and reducing the supply voltage to 0.9 V reduces consumption to 12 µA/MHz for the average pattern and to 19 µA/MHz for the full load pattern.
- These results have been obtained through advanced synthesis, using floorplanning information at synthesis. In this synthesis run the core was not optimized for highest speed. More floorplanning optimization is possible, which would reduce the power by 10-20%.
- The power consumption figures given in the first sections of this datasheet assume that this further layout optimization is achieved.

### <u>Detailed Performance data</u>

- DSP performance metric through FFT:
  - Efficient execution of the FFT: 1350 clocks for a 256-point complex FFT [radix-4 transform].
  - This is part of the communication library.
- Detailed analysis for a 2k-point FFT:
  - 14000 clocks
  - Required memory accesses:
    - 5632 64-bit reads
    - 5632 64-bit writes
    - 1408 64-bit reads [twiddle factors]
    - 14080 64-bit reads [instructions]
- A 32-bit MP3 coder/decoder is estimated to require a 3-5 MHz core speed.
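
The 2k-point FFT figures above can be turned into memory accesses per clock with a short calculation, sketched below. The struct and function names are hypothetical. Reading the data-access rate as load/store slot occupancy assumes one 64-bit data access per load/store instruction, which the figures suggest but the list does not state explicitly.

```c
/* Access counts and clock count for one FFT run, all in units of
   64-bit memory accesses and core clocks. */
typedef struct {
    unsigned data_reads;
    unsigned data_writes;
    unsigned twiddle_reads;
    unsigned instr_reads;
    unsigned clocks;
} fft_profile_t;

/* Total data-side accesses: sample reads, sample writes, twiddle reads. */
static unsigned data_accesses(const fft_profile_t *p)
{
    return p->data_reads + p->data_writes + p->twiddle_reads;
}

/* Average data-memory accesses per clock. */
static double data_accesses_per_clock(const fft_profile_t *p)
{
    return (double)data_accesses(p) / (double)p->clocks;
}

/* Average instruction-memory reads per clock. */
static double instr_reads_per_clock(const fft_profile_t *p)
{
    return (double)p->instr_reads / (double)p->clocks;
}
```

With the listed values [5632, 5632, 1408, 14080 and 14000 clocks] this gives 12672 data accesses, or about 0.9 per clock, and about one 64-bit instruction read per clock.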

## <u>Block diagram</u>



### Processor pipeline

5-stage pipeline (6 stages through multiplier and data RAM).

1. Instruction RAM address.
2. Fetch.
3. Decode.
4. Execute.
5. Write Back.