# WELCOME TO MY DEFENSE

Saraju P. Mohanty



# Energy and Transient Power Minimization During Behavioral Synthesis

#### Saraju P. Mohanty

Dept. of Computer Science and Engineering
University of South Florida
smohanty@csee.usf.edu

For more details, visit:



http://www.csee.usf.edu/~smohanty/research



#### **Outline of the Talk**

- Introduction
- Related Works
- Target Architecture
- Proposed Datapath Scheduling Schemes
- Image Watermarking Chip Design
- Conclusions





- Simultaneous minimization of various powers and energy considered.
- Secure JPEG Encoder and Secure Digital Still Camera

#### What is High-Level Synthesis??

McFarland (1990)

"HLS is conversion or translation from an algorithmic level specification of the behavior of a digital system to a RT level structure that implements that behavior."

[Analogous to "compiler" that translates high-level language like C/Pascal to assembly language.]

NOTE: also known as Behavioral Synthesis.



#### Various Phases of Behavioral Synthesis



#### Why Power Reduction?

- To reduce energy costs
- To increase battery life time
- To increase battery efficiency
- To maintain supply voltage levels
- To reduce power supply noise
- To reduce cross-talk and electromagnetic noise
- To use smaller heat sinks
- To make packaging cheaper
- To increase reliability
- To reduce use of natural resources



#### Why Dynamic Power Minimization ??

- Veendrick Observation: In a well designed circuit, short-circuit power dissipation is less than 20% of the dynamic power dissipation.
- Sylvester and Kaul: At larger switching activity the static power is negligible compared to the dynamic power.



Figure 1.10. Static Vs Dynamic Power Dissipation for different Switching Activity [3, 4]



#### Dynamic Power: Major one

$$P_{dynamic} = \frac{1}{2} C_L V_{dd}^2 N f$$

 $C_L$  = load capacitor,  $V_{dd}$  = supply voltage,

N = average number of transitions/clock cycle

= E(sw) = 2  $a_{0->1}$  = switching activity

f = clock frequency

#### Note:

- 1. N\*f is transition density
- 2.  $C_L*N (= C_{sw} = C_{eff})$  is the effective switching capacitance

#### **Dynamic Power Reduction: How??**

- Reduce Supply Voltage (V<sub>dd</sub>): delay increases; performance degradation
- Reduce Clock Frequency (f): only power saving no energy, performance degradation
- Reduce Switching Activity (N or E(sw)): no switching no power loss !!! Not in fully under designers control. Switching activity depends on the logic function. Temporal/and spatial correlations difficult to handle.
- Reduce Physical Capacitance: done by reducing device size reduces the current drive of the transistor making the circuit slow

#### What is our approach?

Adjust the frequency and reduce the supply voltage in a co-coordinated manner to reduce various forms dynamic power while maintaining performance, through datapath scheduling during behavioral synthesis.



#### **Dynamic Frequency ??**



Single Frequency

Dynamic Frequency



DCU uses clock divider strategy

More details:

- •Ranganathan, et.al.
- Byrnjolfson and Zilic

#### Digital Watermarking?



Digital watermarking is defined as a process of embedding data (watermark) into a multimedia object to help to protect the owner's right to that object.

#### **Types**

- Visible and Invisible
- Spatial, DCT and Wavelet domain
- Robust and Fragile

USF UNIVERSITY OF

SARAJUPRASAD MOHANTY

#### Digital Watermarking: Examples



#### Watermarking: General Framework

- > Encoder: Inserts the watermark into the host image
- Decoder: Decodes or extracts the watermark from image
- Comparator: Verifies if extracted watermark matches with the inserted one

#### The Watermarking Encoders Designed are:

- Spatial domain invisible-robust and invisible-fragile watermarking encoder
- Spatial domain visible watermarking encoder
- DCT domain invisible and visible watermarking encoder (only architecture proposed)

#### **Related Works**

#### Low Power Synthesis / Watermarking

- Scheduling for Energy Minimization
- Switching Activity Reduction at Behavioral Level
- Datapath Scheduling for Peak Power Reduction
- Scheduling for Variable Voltage Processor
- Design and Synthesis of Variable Frequency/Latency and Multiple Voltage based Systems
- Hardware-based Watermarking Systems



#### **Scheduling Schemes using Multiple Voltages**

| Table 2.1. Datapath | Scheduling Sche | mes Using Multiple | Supply Voltages |
|---------------------|-----------------|--------------------|-----------------|
| 1                   |                 | - I                | 11 / 0          |

| Proposed             | Optimization | Constraints | Operating Voltage         | Time                           |
|----------------------|--------------|-------------|---------------------------|--------------------------------|
| Scheme               | Method Used  | Assumed     | Levels                    | Complexity                     |
| Johnson and          | ILP          | Time        | $(5.0V \rightarrow 2.0V)$ | Expoential                     |
| Roy [89, 90]         |              |             |                           |                                |
| Johnson and          | ILP          | Time        | (5.0V, 3.3V, 2.4V)        | Expoential                     |
| Roy [6]              |              |             |                           |                                |
| Chang and            | Dynamic      | Time        | (5.0V, 3.3V, 2.4V)        | Pseudo-                        |
| Pedram [63, 91]      | Programming  |             |                           | Polynomial                     |
| Lin, Hwang           | ILP and      | Time and    | (5.0V, 3.3V)              | Expoential                     |
| and Wu [92]          | Heuristic    | Resource    |                           | $O\left(n^3logn\right)$        |
| Sarrafzadeh          | Dynamic Prog | Time and    | (5.0V, 3.3V)              | $O\left(n^2k\beta R ^2\right)$ |
| and Raje [93]        | Geometric    | Resource    |                           | O(nClognC)                     |
| Kumar and            | Stochastic   | Resource    | (5.0V, 3.3V, 2.4V)        | $O(n^2)$                       |
| Bayoumi [94, 95, 96] | Evolution    |             |                           |                                |
| Elgamel and          | Genetic      | Time and    | (5.0V, 3.3V, 2.4V)        | NA                             |
| Bayoumi [97]         | Algorithms   | Area        |                           |                                |
| Shiue and            | List-Based   | Time and    | (5.0V, 3.3V) or           | Polynomial                     |
| Chakrabarti [98, 99] |              | Resource    | (5.0V, 3.3V, 2.4V)        |                                |
| Manzak and           | Lagrangian   | Time and    | (5.0V, 3.3V,              | $O\left(n^2\right)$ and        |
| Chakrabarti [100]    | Multiplier   | Resource    | 2.4V, 1.5V)               | $O(n^2 log L)$                 |
| Manzak and           | List-Based   | Time and    | (5.0V, 3.3V,              | $O\left(r^2L^2\right)$         |
| Chakrabarti [101]    |              | Resource    | 2.4V, 1.5V)               |                                |

None of these works:

- •Handle variable frequency
- •Minimize other forms of power

And

Most of the cases, the time penalty and area penalty are high.

#### Switching Reduction during Behavioral Synthesis

Table 2.2. High-Level Synthesis Schemes using Switching Activity Reduction

| D 1                       | e dimi                | 26.4.4         | π.          | 0/ D      |
|---------------------------|-----------------------|----------------|-------------|-----------|
| Proposed                  | Synthesis Tasks       | Methods        | Time        | % Power   |
| Work                      | Performed             | Used           | Complexity  | Reduction |
| Kumar, Katkoori, Rader    | Scheduling, Register  | Simulation     | NA          | NA        |
| and Vemuri [102, 103]     | Optimization, etc.    | of DFG         |             |           |
| Raghunathan               | Tranformation, Sche-  | Iterative      | Polynomial  | 4.6       |
| and Jha [104]             | duling and Allocation | Improvement    |             |           |
| Musoll and Cortadella     | Scheduling and        | List-Based     | $O(n^2m)$   | 6.67      |
| [50]                      | Resource Binding      | Algorithm      |             |           |
| Lundberg, Muhammad,       | NA                    | Hierarchical   | NA          | 14.93     |
| Roy and Wilson [112, 113] |                       |                |             |           |
| Shin and Lin              | Resource              | Heuristic      | Polynomial  | 7.84      |
| [114]                     | Allocation            |                |             |           |
| Monteiro, Devadas,        | Scheduling            | HYPER [115]    | NA          | 22.43     |
| Ashar and Mauskar [116]   |                       |                |             |           |
| Gupta and                 | Scheduling            | Force-Directed | $O(n^4t)$   | 16.4      |
| Katkoori [119]            | _                     | Heuristic      | , ,         |           |
| Murugavel and             | Scheduling            | Game Theory    | Exponential | 13.9      |
| Ranganathan [120]         | Binding               |                |             |           |

- •These synthesis works neither handle multiple supply voltages nor variable frequency.
- •Minimize average power only.
- •Often accompanied by high time penalty.

#### **Peak Power Reduction at Behavioral Level**

Table 2.3. Relative Performance of Various Schemes Proposed for Peak Power Minimization

| Proposed             | Synthesis Tasks | Methods        | Time        | % Power     |
|----------------------|-----------------|----------------|-------------|-------------|
| Work                 | Performed       | Used           | Complexity  | Reduction   |
| Martin and Knight    | Scheduling      | Genetic        | NA          | 40.3-60.0   |
| [53, 56]             | Assignment      | Algorithms     |             |             |
| Shiue and et. al.    | Scheduling      | ILP            | Exponential | 50.0 - 75.0 |
| [122, 123, 124, 111] |                 | Force Directed | $O(cn^3)$   |             |
| Raghunathan,         | Scheduling      | Data Monitor   | NA          | 17.42-32.46 |
| and et. al. [59]     |                 | Operations     |             |             |

- Do not handle MV or DFC
- High time penalty
- Do not minimize other forms of power

#### Scheduling for Variable Frequency Processor

| Table 2.4. | Scheduling | Algorithms i | for Variable | Voltage Processor | ſ |
|------------|------------|--------------|--------------|-------------------|---|
|            |            | -            |              |                   |   |

| 20020 2. 1.            | 5 cm c c c c c c c c c c c c c c c c c c | .50111111111111111111111111111111111111 |            | mage rrocessor                                          |         |
|------------------------|------------------------------------------|-----------------------------------------|------------|---------------------------------------------------------|---------|
| Proposed               | Working                                  | Static or                               | Method     | Running                                                 | % Power |
| Work                   | Level                                    | Dynamic                                 | Used       | Time                                                    | Savings |
| Ishihara and           | OS                                       | Static                                  | ILP        | Exponential                                             | 70      |
| Yasuura [125]          |                                          |                                         |            |                                                         |         |
| Okuma, Ishihara,       | OS                                       | Static                                  | ILP        | Exponential                                             | 56      |
| and Yasuura [126, 127] |                                          | Dynamic                                 | Heuristic  | NA                                                      | 58      |
| Hong, Potkonjak,       | OS                                       | Dynamic                                 | Heuristic  | O(N+m)                                                  | 20      |
| and Srivastava [128]   |                                          |                                         |            |                                                         |         |
| Hong, Kirovski,        | System                                   | Static                                  | Heuristic  | $O(n^3)$                                                | 25      |
| and et. al. [129]      |                                          |                                         |            |                                                         |         |
| Mansour, Mansour,      | Circuit and                              | Static                                  | List-based | $O(n^4)$                                                | 56      |
| and et. al. [130]      | Behavioral                               |                                         | Heuristic  |                                                         |         |
| Azevedo, Issenin,      | Compiler                                 | Static                                  | Heuristic  | NA                                                      | 82      |
| and Comea [131, 132]   |                                          |                                         |            |                                                         |         |
| Hsu, Kremer,           | Compiler                                 | Static                                  | Heuristic  | NA                                                      | 70      |
| and Hsiao [135, 136]   |                                          |                                         |            |                                                         |         |
| Pering, Burd           | OS                                       | Static                                  | Heuristic  | O(n)                                                    | 80      |
| and Brodersen [69]     |                                          |                                         |            |                                                         |         |
| Lee and [137]          | OS                                       | Static                                  | Heuristic  | $O\left(n^2\left(\frac{T_{max}}{T_{min}}\right)\right)$ | 54.5    |
| Krishna [137]          |                                          | Dynamic                                 | Heuristic  | NA                                                      | 65.6    |
| Pouwelse, Langen-      | OS                                       | Dynamic                                 | Heuristic  | $O(n^3)$                                                | 50      |
| doen, and Sips [64]    |                                          |                                         |            |                                                         |         |
| Yao, Demers,           | OS and                                   | Static                                  | Heuristic  | $O(nlog^2n)$                                            | NA      |
| and Shenker [138]      | Circuit                                  | Dynamic                                 | NA         | NA                                                      | NA      |

- Handle variable frequency at OS or compiler level.
- Minimize
   average power
   or energy only.

#### Design and Synthesis using Variable Frequency

Table 2.5. Design and Synthesis Works on Variable Frequency or Multiple Frequency

| Proposed               | Design or    | Power or    | Operation | Voltage or       | Result   |
|------------------------|--------------|-------------|-----------|------------------|----------|
| Work                   | Synthesis    | Performance | Mode      | Frequency        |          |
| Usami, Igarashi,       | Design       | Low-Power   | Multiple  | (3.3, 1.9)V      | 47%      |
| and et. al. [7, 75]    | Synthesis    |             | Voltage   |                  | (max)    |
| Usami, Igarashi,       | Design       | Low-Power   | Variable  | NA               | 55%      |
| and et. al. [74]       |              |             | Voltage   |                  | (max)    |
| Ranganathan,           | Design       | High        | Dynamic   | 50 - 400MHz      | 1.79-3.0 |
| and et. al. [8, 70]    |              | Performance | Frequency |                  | (times)  |
| Krishna, and           | Synthesis    | Low-Power   | Dynamic   | (5.0, 3.3, 2.4)V | 2 - 54%  |
| et. al. [144, 145]     | (Scheduling) |             | Frequency |                  |          |
| Papachristou,          | Synthesis    | Low-Power   | Multiple  | NA               | 50%      |
| and et. al. [146]      | (Allocation) |             | Frequency |                  | (max)    |
| Burd, Brodersen,       | Design       | Low-Power   | Variable  | 1.2 - 3.8V       | 11%      |
| and et. al. [147, 148] |              |             | Voltage   |                  | (avg)    |
| Kim and                | Design       | Low-Power   | Frequency | NA               | NA       |
| Chae [72]              |              |             | Scaling   |                  |          |
| Pouwelse,              | Design       | Low-power   | Variable  | 0.8 - 2.0V       | NA       |
| and, et. al. [11]      |              |             | Frequency | 59-251MHz        |          |
| Acquaviva, Benini,     | Design       | Low-power   | Variable  | NA               | 40%      |
| and Riccò [149]        |              |             | Frequency |                  | (max)    |
| Benini, and et. al.    | Design       | High        | Variable  | NA               | 27%      |
| [150, 151]             | Synthesis    | Performance | Latency   |                  |          |
| Raghunathan,           | Synthesis    | High        | Variable  | NA               | 1.6×     |
| and et. al. [152]      |              | Performance | Latency   |                  |          |
| Nowka and              | Design       | Low-power   | Frequency | 1.0 - 1.8V       | NA       |
| [153, 154]             |              |             | Scaling   |                  |          |
| Lu, Benini,            | Design       | Low-power   | Frequency | 103 - 206MHz     | 46%      |
| and Michelli [155]     |              |             | Scaling   |                  | (max)    |

- •Low-power or High-performance synthesis or design works using variable frequency.
- •Minimize only average power.

#### Hardware Systems for Digital Watermarking

Table 2.6. Watermarking Chips Proposed in Current Literature

| Proposed      | Type of   | Target | Working | Techno-   | Chip                 | Chip Power  |
|---------------|-----------|--------|---------|-----------|----------------------|-------------|
| Work          | Watermark | Object | Domain  | logy      | Area                 | Consumption |
| Mathai and    | Invisible | Video  | Wavelet | $0.18\mu$ | NA                   | NA          |
| et. al. [161] | Robust    |        |         |           |                      |             |
| Tsai and Lu   | Invisible | Image  | DCT     | $0.35\mu$ | $3.064 \times 3.064$ | 62.78mW     |
| [162]         | Robust    |        |         |           | $mm^2$               | 3.3V, 50MHz |
| Garimella and | Invisible | Image  | Spatial | $0.13\mu$ | $3453 \times 3453$   | $37.6\mu W$ |
| et. al. [163] | Fragile   |        |         |           | $\mu m^2$            | 1.2V        |

A lot needs to be done ......



#### In this Dissertation ......

#### Two design options explored

- Multiple Supply Voltages and Dynamic Frequency Clocking (MVDFC)
- Multiple Supply Voltages and Multicycling (MVMC)
   Minimization during Behavioral Synthesis
- Energy or Energy-delay-product
- Peak power
- Simultaneous peak and average power
- Transient power
- Power fluctuation
- Framework for simultaneous minimization

Designing various watermarking chips



#### **Target Architecture**



- ☐ Level converters are used when a low-voltage functional unit is driving a high-voltage functional unit.
- ☐ Each functional unit has one register and one multiplexer.
- ☐ The register and the multiplexor operate at the same voltage level as that of the functional units.
- $\Box$  Operational delay of a FU :  $(d_{FU} + d_{Mux} + d_{Reg} + d_{Conv})$ .
- ☐ Time for voltage conversion equals to time for frequency change.
- ☐ Controller has a storage unit to store the cycle frequency index (cfi<sub>c</sub>).
- □ Datapath is represented as a sequencing DFG.
- ☐ Operating frequencies are calculated from the delays.

## A Framework for Simultaneous Minimization

#### **CPF** Minimization

(Different Power and Energy Parameters)

#### Aim at simultaneous minimization of:

- Average Power
- Peak power
- Cycle difference power
- Peak power differential
- •Total Energy

NOTE: The peak power, the cycle difference power, and the peak power differential drive the transient characteristic of a CMOS circuit.

#### **CPF Minimization: Power Definitions**

- Cycle Power (P<sub>c</sub>): power consumption of any control step.
- Peak Power (P<sub>peak</sub>): maximum power consumption of any control step i.e. maximum (P<sub>c</sub>).
- Mean Cycle Power (P): mean of the cycle powers (an estimate for the average power consumption of a DFG).
- Cycle Difference Power (DP<sub>c</sub>): quantifies variation of power consumption of a cycle c from the mean /average power consumption. This determines the power profile of a DFG over all the control steps.
- Peak power differential (DP<sub>peak</sub>): the maximum of the cycle difference power for any control step.
- Mean Cycle Difference Power (DP): mean of the cycle difference powers (a measure of overall power fluctuation)



#### **CPF Minimization: Cycle Power Function**

- We Define: A new parameter called "cycle power function" (CPF) as an equally weighted sum of the normalized mean cycle power and the normalized mean cycle difference power.
- We claim: The minimization of CPF using multiple supply voltages and dynamic frequency clocking (MVDFC), and multiple supply voltages and multicycling (MVMC) under resource constraints will lead to the reduction of energy and all different forms of power.

### CPF Minimization: Power Models (Notations Needed)

| Table 6.1. List | of notataions | and terminol | logy used: | in CPF model | ing |
|-----------------|---------------|--------------|------------|--------------|-----|
|                 |               |              | 00         |              | _   |

| N              | : total number of control steps in the DFG                                              |
|----------------|-----------------------------------------------------------------------------------------|
| 0              | : total number of operations in the DFG                                                 |
| c              | : a control step or a clock cycle in the DFG                                            |
| $o_i$          | : any operation i, where $1 \le i \le O$ ,                                              |
| $P_c$          | : the total power consumption of all functional units active in control step $c$        |
|                | (cycle power consumption)                                                               |
| $P_{peak}$     | : peak power consumption for the DFG equal to $max(P_c)_{\forall c}$                    |
| $P_{peak}$ $P$ | : mean power consumption of the DFG (average $P_c$ over all control steps)              |
| $P_{norm}$     | : normalised mean power consumption of the DFG                                          |
| $DP_c$         | : cycle difference power (for cycle $c$ ; a measure of cycle power fluctuation)         |
| $DP_{peak}$    | : peak differential power consumption for the DFG equal to $max(DP_c)_{\forall c}$      |
| $DP^{r}$       | : mean of the cycle difference powers for all control steps in DFG                      |
| $DP_{norm}$    | : normalised mean of the mean difference powers for all steps in DFG                    |
| CPF            | : cycle power function                                                                  |
| $FU_{k,v}$     | : any functional unit of type $k$ operating at voltage level $v$                        |
| $FU_i$         | : any functional unit $FU_{k,v}$ needed by $o_i$ for its execution $(o_i \in FU_{k,v})$ |
| $FU_{i,c}$     | : any functional unit $FU_i$ active in control step $c$                                 |
| $R_c$          | : total number of functional units active in step $c$                                   |
|                | (same as the number of operations scheduled in $c$ )                                    |
| $\alpha_{i,c}$ | : switching activity of resource $FU_{i,c}$                                             |
| $V_{i,c}$      | : operating voltage of resource $FU_{i,c}$                                              |
| $C_{i,c}$      | : load capacitance of resource $FU_{i,c}$                                               |
| $f_c$          | : frequency of control step $c$                                                         |
|                |                                                                                         |

#### **CPF Minimization: Power Model...**

☐ The power consumption for any control step c is given by,

$$P_c = \sum_{i=\{1 \to Rc\}} \alpha_{i,c} C_{i,c} V^2, f_c$$

The peak power consumption of the DFG is the maximum power consumption over all the control steps,

$$P_{\text{peak}} = \max (P_c)_{c=\{1 \to N\}} = \max (\sum_{i=\{1 \to Rc\}} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c)_{c=\{1 \to N\}}$$

 $\square$  Average power is characterized as mean cycle power ( $\mathbf{P_c}$ ):

$$P = 1/N \left( \sum_{c=\{1\to N\}} P_c \right) = 1/N \left( \sum_{c=\{1\to N\}} \sum_{i=\{1\to Rc\}} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c \right)$$

NOTE: The true average power is the energy consumption per cycle/second. The above **P** is an estimate of it.

#### **CPF Minimization: Power Models...**

#### **Background Material**

- For a set of n observations,  $x_1, x_2, x_3, ....x_n$ , from a given distribution, the sample mean (which is an unbiased estimator for the population mean,  $\mu$ ) is  $m = 1/n \sum_i x_i$ .
- The absolute deviation of these observations is defined as  $\Delta x_i = |x_i-m|$ .
- The mean deviation of the observations is given by MD = 1/n $\sum_{i} |x_{i}-m|$ .
- ❖ We model the cycle difference power DP<sub>c</sub> as the absolute deviation of cycle power Pc from the mean cycle power P.
- Similarly, the mean difference power DP is modeled as mean deviation of the cycle power P<sub>c</sub>.

#### **CPF Minimization: Power Models...**

- •Normalized mean cycle power (P<sub>norm</sub>) is defined as :
  - = mean cycle power consumption over all control steps / maximum power consumption in any control step
  - = Mean  $(P_c)$  / Maximum  $(P_c)$
  - $= P/P_{peak}$
- Normalized mean cycle difference power (DP<sub>norm</sub>) is defined as:
  - = mean cycle difference power over all control steps / maximum cycle difference power for any control step
  - = Mean (DP<sub>c</sub>) / Maximum (DP<sub>c</sub>)
  - $= DP / DP_{peak}$



#### **CPF Minimization: Power Models...**

□Cycle power function is defined as:

$$CPF = P_{norm} + DP_{norm}$$
 (1)

□In terms of peak cycle power and peak cycle difference power,

$$CPF = \frac{P}{P_{peak}} + \frac{DP}{DP_{peak}} = \frac{\frac{1}{N} \sum_{c=1}^{N} P_c}{P_{peak}} + \frac{\frac{1}{N} \sum_{c=1}^{N} |P - P_c|}{DP_{peak}}$$
 (2)

Using the switching capacitance, voltage and frequency,

$$CPF = \frac{\frac{1}{N} \sum_{c=1}^{N} \sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}}{max \left(\sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}\right)_{\forall c}} + \frac{\frac{1}{N} \sum_{c=1}^{N} \left(\left|\frac{1}{N} \sum_{c=1}^{N} \left(\sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}\right) - \sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}\right|\right)_{\forall c}}{max \left(\left|\frac{1}{N} \sum_{c=1}^{N} \left(\sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}\right) - \sum_{i=1}^{R_{c}} \alpha_{i,c} C_{i,c} V_{i,c}^{2} f_{c}\right|\right)_{\forall c}}$$

#### **CPF Minimization: Scheduling Algorithm**

Input: Unscheduled data flow graph,

resource constraint,

allowable voltage levels,

number of allowable frequencies,

load capacitance of each resource,

delay of each functional units

Output: Scheduled data flow graph, base frequency, cycle frequency index, operating voltage for each operation

#### **CPF Minimization: Scheduling Algorithm ...**

- Step 1: Calculate the switching activity at the each node through behavioral simulation of the DFG.
- Step 2 : Construct a LUT of effective switching capacitance.
- Step 3: Find ASAP and ALAP schedules of the UDFG.
- Step 4: Determine the number of multipliers and ALUs at different operating voltages.
- Step 5: Modify both ASAP and ALAP schedules obtained in Step 1 using the number of resources found in Step 2.
- Step 6 : No. of control steps = Max (ASAP steps, ALAP steps).
- Step 7: Find the vertices having non-zero mobility and vertices with zero mobility.
- Step 8: Use the CPF-Scheduler-Heuristics to assign the time stamp and operating voltage for the vertices, and the cycle frequencies such that CPF and time penalty are minimum (measures as  $T_{\rm D}/T_{\rm S}$ )
- Step 10: Calculate power, energy and frequency details.



## **CPF Minimization: CPF-Scheduler Heuristic Explanations**

- The heuristic is used to find proper time stamp, operating voltage for mobile vertices such that the CPF+R<sub>T</sub> is minimum for whole DFG.
- Initially assumes the modified ASAP schedule (with relaxed voltage resource constrained) as the current schedule.
- The CurrentCPF+R<sub>T</sub> value for the current schedule is calculated.
- The heuristic finds CPF values (TempCPF+R<sub>T</sub>) for each allowable control step of each mobile vertices and for each available operating voltages.
- The heuristic fixes the time step, operating voltage and hence cycle frequency for which CPF+R<sub>T</sub> is minimum.

NOTE: The worst case running time of the heuristic is  $\Theta(t_m|V|^3)$ .

## **CPF Minimization: Experimental Results**(Benchmarks and Resource Constraints used)

- 1. Auto-Regressive filter (ARF) (28 nodes, 16\*, 12+, 40 edges).
- 2. Band-Pass filter (BPF) (29 nodes, 10\*, 10+, 9-, 40 edges).
- 3. DCT filter (42 nodes, 13\*, 29+, 68 edges).
- 4. Elliptic-Wave filter (EWF) (34 nodes, 8\*, 26+, 53 edges).
- 5. FIR filter (23 nodes, 8\*, 15+, 32 edges).
- 6. HAL diff. eqn. solver (11 nodes, 6\*, 2+, 2-, 1<, 16 edges).
- 1. Number of multipliers: 1 at 2.4V; Number of ALUs: 1 at 3.3V
- 2. Number of multipliers: 2 at 2.4V; Number of ALUs: 1 at 3.3V
- 3. Number of multipliers: 2 at 2.4V; Number of ALUs: 1 at 2.4V and 1 at 3.3V
- 4. Number of multipliers: 1 at 2.4V and 1 at 3.3V; Number of ALUs: 1 at 2.4V and 1 at 3.3V

#### **CPF Minimization: Experimental Results** (Notations used)

#### Table 6.2. Notations used to Express the Results

: total energy consumption assuming single frequency and single supply voltage  $E_S$ 

 $E_D$ : total energy consumption for dynamic clocking and multiple supply voltage

 $P_{p_S}$ : peak power consumption for single frequency and single supply voltage

: peak power consumption for dynamic clocking and multiple supply voltage  $P_{p_D}$ 

 $P_{mS}$ : minimum power consumption for single frequency and single supply voltage

 $P_{mD}$ : minimum power consumption for dynamic clocking and multiple supply voltage

 $T_S$ : execution time assuming single frequency

 $T_D$ : execution time assuming dynamic frequency

: total energy reduction =  $\frac{E_S - E_D}{E_S}$  $\Delta E$ 

 $\Delta P$  : average power reduction =  $\frac{(E_S/T_S) - (E_D/T_D)}{(E_S/T_S)}$   $\Delta P_p$  : peak power reduction =  $\frac{P_{p_S} - P_{p_D}}{P_{p_S}}$   $\Delta DP$  : differential power reduction =  $\frac{(P_{p_S} - P_{m_S}) - (P_{p_D} - P_{m_D})}{(P_{p_S} - P_{m_S})}$ 

: time ratio =  $\frac{T_D}{T_C}$  $R_T$ 



#### **CPF Minimization: Experimental Results**

|     |                |                |           |                        |          |          |             |            |            |    | 1     |   |
|-----|----------------|----------------|-----------|------------------------|----------|----------|-------------|------------|------------|----|-------|---|
| K   | R              | $P_{p_S}$      | $P_{p_D}$ | $\Delta P_p$           | $P_{mS}$ | $P_{mD}$ | $\Delta DP$ | $\Delta P$ | $\Delta E$ | N  | $r_T$ |   |
| T   | С              | $(m\tilde{W})$ | (mW)      | (%)                    | (mW)     | (mW)     | (%)         | (%)        | (%)        |    |       |   |
|     | 1              | 9.30           | 2.83      | 69.60                  | 0.26     | 0.52     | 74.50       | 71.40      | 47.57      | 18 | 1.6   |   |
| Α   | 2              | 18.33          | 4.77      | 73.96                  | 0.26     | 0.52     | 76.47       | 68.30      | 47.57      | 13 | 1.4   |   |
| R   | 3              | 18.59          | 4.84      | 73.96                  | 0.26     | 0.52     | 76.44       | 71.72      | 49.87      | 11 | 1.5   | 1 |
| F   | 4              | 18.59          | 7.26      | 60.96                  | 0.26     | 0.52     | 63.25       | 59.10      | 29.49      | 11 | 1.5   |   |
|     | 1              | 9.30           | 2.45      | 73.62                  | 0.26     | 0.52     | 78.64       | 65.80      | 46.69      | 17 | 1.3   |   |
| В   | 2              | 18.33          | 4.20      | 77.10                  | 0.26     | 1.67     | 86.03       | 58.81      | 46.69      | 17 | 1.2   |   |
| P   | 3              | 18.59          | 4.84      | 73.96                  | 0.52     | 0.97     | 78.59       | 71.09      | 48.61      | 9  | 1.4   |   |
| F   | 4              | 18.59          | 7.33      | 60.60                  | 0.52     | 0.97     | 64.84       | 64.01      | 32.02      | 9  | 1.4   |   |
|     | 1              | 9.30           | 2.83      | 69.60                  | 0.26     | 0.52     | 74.50       | 50.90      | 42.44      | 29 | 1.1   |   |
| D   | 2              | 9.30           | 2.83      | 69.60                  | 0.26     | 0.52     | 74.50       | 50.90      | 42.44      | 29 | 1.1   |   |
| C   | 3              | 18.59          | 4.84      | 73.96                  | 0.26     | 0.40     | 75.75       | 67.70      | 42.93      | 15 | 1.4   |   |
| T   | 4              | 18.59          | 7.61      | 59.05                  | 0.26     | 0.40     | 60.63       | 65.19      | 38.49      | 15 | 1.4   |   |
|     | 1              | 9.30           | 2.45      | 73.62                  | 0.26     | 0.52     | 78.64       | 41.17      | 44.43      | 27 | 0.9   |   |
| Ε   | 2              | 18.07          | 4.07      | 77.49                  | 0.26     | 0.52     | 80.09       | 37.49      | 44.43      | 27 | 0.9   |   |
| W   | 3              | 18.07          | 4.07      | 77.49                  | 0.26     | 0.40     | 79.38       | 57.89      | 44.73      | 16 | 1.2   |   |
| F   | 4              | 18.07          | 6.55      | 63.75                  | 0.26     | 0.40     | 65.49       | 53.10      | 38.45      | 16 | 1.2   |   |
|     | 1              | 9.30           | 2.74      | 70.52                  | 0.26     | 0.52     | 75.45       | 58.54      | 46.11      | 15 | 1.3   |   |
| F [ | 2              | 9.30           | 2.74      | 70.52                  | 0.26     | 0.52     | 75.45       | 58.54      | 46.11      | 15 | 1.3   |   |
| Ι   | 3              | 18.59          | 4.77      | 74.32                  | 0.26     | 0.40     | 76.12       | 51.21      | 46.77      | 11 | 1.0   |   |
| R   | 4              | 18.59          | 7.04      | 62.15<br>7 <b>0.52</b> | 0.24     | 0.40     | 63.77       | 40.69      | 27.21      | 11 | 1.2   |   |
|     | Average values |                |           |                        |          |          | 75.04       | 59.59      | 43.29      |    | 1.3   |   |

#### **CPF Minimization: Power Profiles for RC2**





#### **CPF Minimization: Power Profiles for RC3**



Figure 6.5. Cycle power consumptions for resource constraint RC3



# CPF Scheduler Vs Proposed Scheduling Algorithms Available in the Literature

| Works                      | Energy savings | Time penalty   | Transient power, etc.        |  |  |
|----------------------------|----------------|----------------|------------------------------|--|--|
| Change and Pedram [15]     | 40% on average | 50% on average | Not addressed                |  |  |
| Shiue and Chakrabarti [20] | 56% on average | 50% on average | Not addressed                |  |  |
| Johnson and Roy [14]       | 46 - 58%       | 50% on average | Not addressed                |  |  |
| Johnson and Roy [13]       | 0 - 50%        | Not available  | Not addressed                |  |  |
| This work                  | 43% in average | 30% on average | 70% reduction in peak        |  |  |
|                            |                |                | 75% reduction in differntial |  |  |

From the above table it is evident that our scheme has less time penalty compared to other popular energy minimization works. Additionally, we have appreciable reductions in transient powers, which the above mentioned works do not address.



# ILP-based Framework for Simultaneous Minimization

- Aim: to provide ILP-based minimization for the CPF defined in the previous chapter.
- Two different design options: MVDFC and MVMC
- Observations about CPF:
  - CPF is a *non-linear* function.
  - A function of four parameters, such as, P, P<sub>peak</sub>, DP and DP<sub>peak</sub>.
  - The absolute function in the numerator contributes to the nonlinearity.
  - The complex behavior of the function is also contributed by the two different denominator parameters,  $P_{peak}$  and  $DP_{peak}$ .
- Non-linear programming may be more suitable, but will be large space and time complexity. We are addressing linear programming of the non-linear function.



(Linear Modeling of Nonlinearity)

#### General LP Formulations involving Absolute

General form of this type of programming:

Minimize:  $\sum_{i} |y_i|$  Subject to:  $y_i + \sum_{j} a_{ij} * x_j \le b_i, \ \forall i \text{ and } x_j \ge 0, \ \forall (j)$ 

- Let  $y_i$  be expressed as,  $y_i = y_i^1 y_i^2$ , difference of two non-negative variables.
- After algebraic manipulations using these new variables we have the following model.

Minimize: 
$$\sum_i y_i^1 + y_i^2$$
 Subject to: 
$$y_i^1 - y_i^2 + \sum_j a_{ij} * x_j \le b_i, \ \forall i$$
 
$$x_j \ge 0, \ \forall j \text{ and } y_i^1, y_i^2 \ge 0, \ \forall i$$

#### (Linear Modeling of Nonlinearity ...)

#### General LP Formulations involving Fraction

General form of this type of programming:

Minimize: 
$$\frac{\sum_{j} c_{j} * x_{j}}{\sum_{j} d_{j} * x_{j}}$$
 Subject to: 
$$\sum_{j} a_{ij} * x_{j} \leq b_{i}, \ \forall i, \ x_{j} \geq 0, \ \forall j$$
 (1)

- Assume two new variables,  $z_0 = 1/(d_0 + \Sigma_i d_j x_j)$  and  $x_j = z_j/z_0$ .
- Using the new variables the formulation becomes.

Minimize : 
$$c_0*z_0+\sum_j c_j*z_j$$
 Subject to : 
$$\sum_j a_{ij}*z_j-b_i*z_0\leq b_i, \ \forall i$$
 
$$\sum_j d_j*z_j+d_0*z_0=1, \ z_0,z_j\geq 0, \ \forall j$$

Once the new formulation is solved substitute  $z_j = x_j^* z_0$  to get the result for  $x_i$ .

(Linear Modeling of Nonlinearity ...)

#### What we learnt from the previous slides ??

- The objective function CPF has both types of nonlinearities.
- In case of a fraction: remove the denominator and introduce as constraints.
- In case of absolute: change difference in objective function to sum and introduce the difference as constraints.



## CPF\* Minimization (Modified Cycle Power Function)

- The CPF has two different denominators which may lead to increase in number of constraints and hence the overall solution space.
- We assume that  $|P-P_c|$  is upper bounded by  $P_c$  for all c, since  $|P-P_c|$  is a measure of the mean difference error of  $P_c$ . So, instead of normalizing DP with  $DP_{peak}$ , we will normalize it with  $P_{peak}$ . This reduces the number of denominator to one.
- We have the following Modified Cycle Power Function which is the objective function for the ILP formulation.

$$\begin{split} CPF^* &= \frac{P}{P_{peak}} + \frac{DP}{P_{peak}} = \frac{P + DP}{P_{peak}} = \frac{\frac{1}{N} \sum_{c=1}^{N} P_c + \frac{1}{N} \sum_{c=1}^{N} |P - P_c|}{P_{peak}} \\ &= \frac{\frac{1}{N} \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c}{max \left( \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c \right)_{\forall c}} + \frac{\frac{1}{N} \sum_{c=1}^{N} \left( \left| \frac{1}{N} \sum_{c=1}^{N} \left( \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c \right) - \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c \right) \right)}{max \left( \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c \right)_{\forall c}} \end{split}$$

#### **CPF**\* Minimization: ILP Formulation (Notations)

- ${}^{\bullet}M_{k,v}$ : maximum number of functional units of type  $F_{k,v}$
- ${}^{\circ}S_{i}$ : as soon as possible time stamp for the operation  $o_{i}$
- E<sub>i</sub>: as late as possible time stamp for the operation o<sub>i</sub>
- $^{\circ}P(C_{swi}, v, f)$ : power consumption of any  $F_{k,v}$  used by operation  $o_i$
- $\mathbf{x}_{i,c,v,f}$ : decision variable, which takes the value of 1 if operation  $o_i$  is scheduled in control step c using  $F_{k,v}$  and c has frequency f
- $y_{i,v,l,m}$ : decision variable which takes the value of 1 if operation of is using the functional unit  $F_{k,v}$  and scheduled in control steps  $l \rightarrow m$
- •L<sub>i,v</sub>: latency for operation o<sub>i</sub> using resource operating at voltage v (in terms of number of clock cycles)

NOTE: The effective switching capacitance is a function of the average switching activity at the input operands of a functional unit and  $C_{swi}$  is a measure of effective switching capacitance  $FU_i$ .  $\alpha_i C_i = C_{swi}(\alpha_i^{\ 1}, \alpha_i^{\ 2})$ 



#### **CPF**\* Minimization: ILP Formulation

#### **MVDFC** Design Scenario

•Objective Function: Minimize the CPF\* for the whole DFG over all the control steps. Using the previous expressions we have,

Minimize:  $\frac{\frac{1}{N} \sum_{c=1}^{N} P_c + \frac{1}{N} \sum_{c=1}^{N} |P - P_c|}{P_{peak}}$  (1)

The denominator is removed and introduced as a constraint.

 $\label{eq:minimize} \text{Minimize}: \quad \frac{1}{N} \sum_{c=1}^{N} P_c + \frac{1}{N} \sum_{c=1}^{N} |P - P_c|$ 

Subject to: Peak power constraints

The absolute is replaced with sum and the appropriate constraints.

Minimize:  $\frac{1}{N} \sum_{c=1}^{N} P_c + \frac{1}{N} \sum_{c=1}^{N} (P + P_c)$ 

Subject to: Modified peak power constraints

After simplification,

Minimize:  $\left(\frac{3}{N}\right)\sum_{c=1}^{N}P_{c}$ 

Subject to: Modified peak power constraints

Using decision variables,

Minimize:  $\sum_{c} \sum_{i \in F_{k,v}} \sum_{v} \sum_{f} x_{i,c,v,f} * \left(\frac{3}{N}\right) * P(C_{swi}, v, f)$ 

Subject to: Modified peak power constraints

(2)

(3)

(4)

(5)

#### **CPF**\* Minimization: ILP Formulation (MVDFC)

- **Uniqueness** Constraints: ensure that every operation  $o_i$  is scheduled to one unique control step and represented as,  $\forall i, 1 \le i \le O, \Sigma_c \Sigma_v \Sigma_f x_{i,c,v,f} = 1$
- \*Precedence Constraints: guarantee that for an operation  $o_i$ , all its predecessors are scheduled in an earlier control step and its successors are scheduled in an later control step and are;  $\forall i, j, o_i$  belong to  $\text{Pred}(o_j)$ ,  $\sum_{v} \sum_{f} \sum_{\{d=S_i \to E_i\}} dx_{i,c,v,f} \sum_{v} \sum_{f} \sum_{\{d=S_j \to E_j\}} ex_{j,c,v,f} \leq -1$
- \*Resource Constraints: make sure that no control step contains more than  $F_{k,v}$  operations of type k operating at voltage v and are enforced as,  $\forall c, 1 \le c \le N$  and  $\forall v, \sum_{\{i \in F_{k,v}\}} \sum_{f} x_{i,c,v,f} \le M_{k,v}$
- \*Frequency Constraints: lower operating voltage functional unit can not be scheduled in a higher frequency control step; these constraints are expressed as,

 $\forall i, 1 \le i \le O, \forall c, 1 \le c \le N, \text{ if } f < v, \text{ then } x_{i,c,v,f} = 0.$ 



#### **CPF**\* Minimization: ILP Formulation (MVDFC)

• Peak Power Constraints: introduced to eliminate the fractional non-linearity of the objective function and are enforced as, for all  $c, 1 \le c \le N$ ,

$$\sum_{i \in F_{k,v}} \sum_{v} \sum_{f} x_{i,c,v,f} * P(C_{sw\,i},v,f) \leq P_{peak}$$

• Modified Peak Power Constraints: To eliminate the non-linearity introduced due to the absolute function introduced as, for all c,  $1 \le c \le N$ ,

$$\frac{1}{N} \sum_{c} \sum_{i \in F_{k,v}} \sum_{v} \sum_{f} x_{i,c,v,f} * P(C_{swi}, v, f) - \sum_{i \in F_{k,v}} \sum_{v} \sum_{f} x_{i,c,v,f} * P(C_{swi}, v, f) \le P_{peak}^*$$

NOTE: The unknowns  $P_{peak}$  and  $P^*_{peak}$  is added to the objective function and minimized along with it.

#### **CPF**\* Minimization: ILP Formulation

#### MVMC Design Scenario

\*Objective Function: Following the same steps as in the MVDFC case in terms of decision variables we write,

Minimize: 
$$\sum_{l} \sum_{i \in F_{k,v}} \sum_{v} y_{i,v,l,(l+L_{i,v}-1)} * \left(\frac{3}{N}\right) P(C_{swi}, v, f_{clk})$$

Subject to: Modified peak power constraints

\*Uniqueness Constraints: ensure that every operation  $o_i$  is scheduled to appropriate control steps within the range  $(S_i, E_i)$  and represented as,  $\forall i, 1 \le i \le O$ ,

$$\sum_{v} \sum_{\{l=S_i \to (S_i + E_i + 1 - L_{i,v})\}} y_{i,v,l,(l+L_{i,v} - 1)} = 1$$

\*Precedence Constraints: guarantee that for an operation  $o_i$ , all its predecessors are scheduled in an earlier control step and its successors are scheduled in an later control step;  $\forall$  i,j,  $o_i$  belong to  $Pred(o_i)$ ,

 $\sum_{v}^{\sum_{\{l=S_{i} \to E_{i}\}}^{j/i}} (l+L_{i,v}-1)y_{i,v,l,(l+L_{i},v-1)} - \sum_{v}^{\sum_{\{l=S_{i} \to E_{i}\}}} ly_{j,v,l,(l+L_{i},v-1)} \le -1$ 



#### **CPF**\* Minimization: ILP Formulation (MVMC)

Resource Constraints: make sure that no control step contains more than  $F_{k,v}$  operations of type k operating at voltage v and are enforced as,

$$\sum_{\{i \in F_{k,v}\}} \sum_{l} y_{i,v,l,(l+L_{i,v}-1)} \le M_{k,v}$$

Peak Power Constraints: introduced to eliminate the fractional non-linearity of the objective function and are enforced as, for all c, 1<= 1 <= N,

$$\sum_{i \in F_{k,v}} \sum_{v} y_{i,v,l,(l+L_{i,v}-1)} * P(C_{sw\,i},v,f_{clk}) \leq P_{peak}$$

Modified Peak Power Constraints: To eliminate the non-linearity introduced due to the absolute function introduced as, for all c,  $1 \le 1 \le N$ ,  $\frac{1}{N} \sum_{l} \sum_{i \in F_{k,v}} \sum_{v} y_{i,v,l,(l+L_{i,v}-1)} * P(C_{swi}, v, f_{clk})$ 

$$-\sum_{i \in F_{k,v}} \sum_{v} y_{i,v,l,(l+L_{i,v}-1)} * P(C_{swi},v,f_{clk}) \leq P^*_{peak}$$

#### **CPF**\* Minimization: Scheduling Algorithm

- Step 1: Construct a look up table for (effective switching capacitance, average switching activity) pairs.
- Step 2: Calculate the switching activities at the inputs of each node through behavioral simulation of the DFO.
- Step 3: Find ASAP schedule for the UDFG.
- Step 4: Find ALAP schedule for the UDFG.
- Step 5: Determine the mobility graph of each node.
- Step 6: Modify the mobility graph for MVMC.
- Step 7: Model the ILP formulations of the DFG for MVDFC or MVMC scheme using AMPL.
- Step 8: Solve the ILP formulations using LP-Solve.
- Step 9: Find the scheduled DFG.
- Step 10: Determine the cycle frequencies, cycle frequency index and base frequency for MVDFC scheme.
- Step 11: Estimate power and energy consumptions of the scheduled DFG.



# **CPF\* Minimization: Experimental Results**(Benchmarks and Resource Constraints used)

- 1. Example circuit (EXP) (8 nodes, 3\*, 3+, 9 edges)
- 2. FIR filter (11 nodes, 5\*, 4+, 19 edges)
- 3. IIR filter (11 nodes, 5\*, 4+, 19 edges)
- 4. HAL differential eqn. solver (13 nodes, 6\*, 2+, 2-, 1 <, 16 edges)
- 5. Auto-Regressive filter (ARF) (15 nodes, 5\*, 8+, 19 edges)

| Multi | pliers | AL   | Serial No |     |  |
|-------|--------|------|-----------|-----|--|
| 2.4V  | 3.3V   | 2.4V | 3.3V      |     |  |
| 2     | 1      | 1    | 1         | RC1 |  |
| 3     | 0      | 1    | 1         | RC2 |  |
| 2     | 0      | 0    | 2         | RC3 |  |
| 1     | 1      | 0    | 1         | RC4 |  |



#### CPF\* Minimization: Experimental Results ...

|   | R | $P_{p_S}$ | $P_{p_D}$ | $\Delta P_p$ | $P_{mS}$ | $P_{mD}$ | $\Delta DP$ | $P_S$ | $P_D$ | $\Delta P$ | $E_S$ | $E_D$ | $\Delta E$ |
|---|---|-----------|-----------|--------------|----------|----------|-------------|-------|-------|------------|-------|-------|------------|
|   | C | mW        | mW        | %            | mW       | mW       | %           | mW    | mW    | %          | nJ    | nJ    | %          |
| E | 1 | 17.28     | 4.56      | 73.61        | 0.46     | 0.35     | 74.97       | 8.87  | 2.42  | 72.72      | 2.96  | 1.57  | 46.8       |
| X | 2 | 17.28     | 4.56      | 73.61        | 0.46     | 0.35     | 74.97       | 8.87  | 2.42  | 72.72      | 2.96  | 1.57  | 46.8       |
| P | 3 | 17.28     | 4.56      | 73.61        | 0.46     | 0.9      | 78.24       | 8.87  | 2.61  | 70.57      | 2.96  | 1.6   | 46.0       |
| F | 1 | 17.51     | 4.62      | 73.62        | 0.23     | 0.12     | 73.96       | 8.82  | 2.35  | 73.36      | 4.9   | 2.6   | 47.20      |
| I | 2 | 25.92     | 6.84      | 73.61        | 0.23     | 0.12     | 73.84       | 8.82  | 2.36  | 73.24      | 4.9   | 2.6   | 47.20      |
| R | 3 | 17.51     | 4.67      | 73.33        | 0.23     | 0.45     | 75.58       | 8.82  | 2.5   | 71.66      | 4.9   | 2.6   | 46.22      |
| Η | 1 | 17.51     | 4.62      | 73.62        | 0.46     | 0.35     | 74.96       | 13.25 | 3.55  | 73.21      | 5.9   | 3.12  | 47.0       |
| A | 2 | 26.15     | 6.90      | 73.61        | 0.46     | 0.35     | 74.50       | 13.25 | 3.55  | 73.21      | 5.9   | 3.12  | 47.0       |
| L | 3 | 17.74     | 4.78      | 73.05        | 0.46     | 0.9      | 76.97       | 13.25 | 3.73  | 71.85      | 5.9   | 3.17  | 46.2       |
| I | 1 | 25.92     | 8.88      | 65.74        | 0.23     | 0.12     | 65.9        | 11.03 | 3.5   | 68.36      | 4.9   | 3.05  | 37.7       |
| Ι | 2 | 25.92     | 6.84      | 73.61        | 0.23     | 0.12     | 73.84       | 11.03 | 2.98  | 72.98      | 4.9   | 2.6   | 47.96      |
| R | 3 | 17.51     | 4.67      | 73.34        | 0.23     | 0.45     | 75.58       | 8.82  | 2.57  | 70.86      | 4.9   | 2.64  | 46.22      |
| A | 1 | 8.87      | 2.34      | 73.62        | 0.23     | 0.12     | 74.1        | 4.5   | 1.22  | 72.9       | 5.0   | 2.64  | 47.2       |
| R | 2 | 8.87      | 2.34      | 73.62        | 0.23     | 0.12     | 74.1        | 4.5   | 1.22  | 72.9       | 5.0   | 2.64  | 47.2       |
| F | 3 | 8.87      | 2.39      | 73.05        | 0.23     | 0.45     | 77.6        | 4.5   | 1.4   | 68.9       | 5.0   | 2.74  | 45.3       |

Dept. of CSE



#### CPF\* Minimization: Experimental Results ...

#### **MVDFC Vs MVMC % Reduction**

| Power                      | MVDFC | MVMC  |
|----------------------------|-------|-------|
| Peak Power                 | 71.70 | 26.44 |
| Peak Power<br>Differential | 74.0  | 26.73 |
| Average Power              | 70.82 | 22.52 |
| Energy                     | 44.36 | 39.05 |
| Energy Delay<br>Product    | 17.31 | 17.99 |

Dept. of CSE



#### **CPF**\* Minimization: Power Profile for RC2



USF UNIVERSITY OF SOUTH FLORIDA

#### **CPF**\* Minimization: Power Profile for RC3



Figure 7.9. Power profile for benchmark for resource constraint RC3



### Watermarking Chip Design

- 1. Architecture and implementation of spatial invisible
- 2. Architecture and implementation of spatial visible
- 3. Architecture for DCT invisible and visible (dual voltage and dual frequency operation)

NOTE: Detailed implementation of the DCT domain watermarking chip is being carried out by Karthik, a masters student, as a part of his thesis.





#### **Digital Still Camera**



#### Secure Digital Still Camera



#### **Spatial Invisible: Algorithm (Robust)**



Table 9.1. Notations used to Explain Spatial Domain Watermarking Algorithms

: Original image (gray image) W: Watermark image (binary or ternary image) : A pixel location (i,j): Watermarked image  $I_W$  $N_I \times N_I$ : Image dimension  $N_W \times N_W$ : Watermark dimension  $E, E_{1}, E_{2}$ : Watermark embedding functions D: Watermark detection function : Neighborhood radius  $I_N$ : Neighborhood image (gray image) K: Digital (watermark) key

 $\alpha_1, \alpha_2$ 

: Scaling constants (watermark strength)

#### Spatial Invisible: Algorithm (Robust) ...

- The watermark is a ternary image having pixel values  $\{0,1,2\}$ .
- Insertion: Alter the original image pixels as,

$$I_W(i,j) = \left\{ egin{array}{ll} I(i,j) & ext{if } W(i,j) = 0 \ \\ E_1ig(I(i,j),I_N(i,j)ig) & ext{if } W(i,j) = 1 \ \\ E_2ig(I(i,j),I_N(i,j)ig) & ext{if } W(i,j) = 2 \end{array} 
ight.$$

Encoding function:

$$E_1(I, I_N) = (1 - \alpha_1)I_N(i, j) + \alpha_1 I(i, j)$$
  
 $E_2(I, I_N) = (1 - \alpha_1)I_N(i, j) - \alpha_2 I(i, j)$ 

Neighborhood pixel gray value: Calculated as,

$$I_N(i,j) = \frac{\frac{I(i+1,j)+I(i+1,j+1)}{2} + I(i,j+1)}{2}$$



#### **Spatial Invisible: Algorithm (Fragile)**



Watermark insertion is performed in the k-th image bit plane using the following function.

$$I_W[0 \to k-1](i,j) = I[0 \to k-1](i,j)$$
 $I_W[k](i,j) = I[k](i,j) \text{XOR } W(i,j)$ 
 $I_W[k+1 \to 7](i,j) = I[k+1 \to 7](i,j)$ 

#### **Spatial Invisible: Overall Datapath Architecture**



#### **Spatial Invisible: Overall Controller**



#### Spatial Invisible: Datapath Layout



#### **Spatial Invisible: Controller Layout**



## **Spatial Invisible: Overall Chip**



## **Spatial Invisible: Overall Chip...**

| Table 9.4. Overall Chip Statistics |             |                             |              |  |
|------------------------------------|-------------|-----------------------------|--------------|--|
|                                    | 1           |                             |              |  |
| Area (with RAM)                    |             | $15.012 \times 14.225 mm^2$ |              |  |
| Number of gates (with RAM)         |             | 1188K                       |              |  |
| Number of gates (without RAM)      |             | 4820                        |              |  |
| Clock frequency (with RAM)         |             | 151MHz                      |              |  |
| Clock frequency (without RAM)      |             | 545MHz                      |              |  |
| Number of I/O pins                 |             | 25                          |              |  |
| Power (with RAM)                   |             | 24m                         | $\cdot W$    |  |
| Power (without RAM)                |             | 2.05                        | 47mW         |  |
|                                    |             |                             |              |  |
| IM_DATA_IN ———                     |             |                             |              |  |
| WM_DATA_IN ——►                     | SPATIAL DOM | AIN                         | → DATA_OUT   |  |
|                                    |             |                             |              |  |
| WM_DATA_SELECT                     | INVISIBLE   |                             |              |  |
| ROBUST/FRAGILE                     |             | ł                           | → BUSY       |  |
| START                              | WATERMARK   | ING                         |              |  |
| RESET                              | ENCODER     |                             | → DATA_READY |  |
| CLOCK ── <del>-</del>              |             |                             |              |  |
|                                    |             |                             |              |  |

## **Spatial Invisible: Results**



(a) Original Shuttle



(a) Original Bird



(b) Robust Watermarked



(b) Robust Watermarked



(c) Fragile Watermarked



(c) Fragile Watermarked

## Spatial Visible: Notations used in Algorithms

#### Table 9.5. List of Variables used in Algorithm Explanation

: Original (or host) image (a grayscale image)

W: Watermark image (a grayscale image)

(m,n): A pixel location

: Watermarked image  $I_W$ 

 $N_I \times N_I$  : Original image dimension  $N_W \times N_W$ : Watermark image dimension

 $i_k$ : The  $k^{th}$  block of the original image I  $w_k$ : The  $k^{th}$  block of the waterwark image

: The  $k^{th}$  block of the watermark image W $w_k$ 

: The  $k^{th}$  block of the watermarked image  $I_W$  $i_{Wn}$ 

: Scaling factor for  $k^{th}$  block (used for host image scaling)  $\alpha_k$ 

: Embedding factor for  $k^{th}$  block (used for watermark image scaling)  $\beta_k$ 

: Mean gray value of the original image I  $\mu_I$ 

: Mean gray value of the original image block  $i_k$  $\mu_{Ik}$ 

: Variance of the original image block ik  $\sigma_{Ik}$ 

: The maximum value of  $\alpha_k$  $\alpha_{max}$ : The minimum value of  $\alpha_k$  $\alpha_{min}$ : The maximum value of  $\beta_k$  $\beta_{max}$ : The minimum value of  $\beta_k$  $\beta_{min}$ 

 $I_{white}$ : Gray value corresponding to pure white pixel

: A global scaling factor  $\alpha_I$ 

 $C_1, C_2, C_3, C_4$ : Linear regression co-efficients

ISF UNIVERSITY OF

## Spatial Visible: Algorithm 1

The original algorithm proposed by Braudaway, et. al.

$$I_{W}(m,n) = \begin{cases} I(m,n) + W(m,n) \left(\frac{I_{white}}{38.667}\right) \left(\frac{I(m,n)}{I_{white}}\right)^{\frac{2}{3}} \alpha_{I} & \text{for } \frac{I(m,n)}{I_{white}} > 0.008856 \\ I(m,n) + W(m,n) \left(\frac{I(m,n)}{903.3}\right) \alpha_{I} & \text{for } \frac{I(m,n)}{I_{white}} \leq 0.008856 \end{cases}$$
ming  $I_{white} = 256$ , simplified to:

• Assuming  $I_{\text{white}} = 256$ , simplified to:

$$I_W(m,n) = \begin{cases} I(m,n) + \left(\frac{\alpha_I}{6.0976}\right) W(m,n) \left(I(m,n)\right)^{\frac{2}{3}} & \text{for } I(m,n) > 2.2583 \\ I(m,n) + \left(\frac{\alpha_I}{903.3}\right) W(m,n) I(m,n) & \text{for } I(m,n) \leq 2.2583 \end{cases}$$

Fitting piecewise linear model and regression co-efficients :

$$I_{W}(m,n) = \begin{cases} I(m,n) + \left(\frac{\alpha_{I}}{903.3}\right) W(m,n) I(m,n) & \text{for } I(m,n) \leq 2\\ I(m,n) + \left(\frac{\alpha_{I}C_{1}}{6.0976}\right) W(m,n) I(m,n) & \text{for } 2 < I(m,n) \leq 64\\ I(m,n) + \left(\frac{\alpha_{I}C_{2}}{6.0976}\right) W(m,n) I(m,n) & \text{for } 64 < I(m,n) \leq 128\\ I(m,n) + \left(\frac{\alpha_{I}C_{3}}{6.0976}\right) W(m,n) I(m,n) & \text{for } 128 < I(m,n) \leq 192\\ I(m,n) + \left(\frac{\alpha_{I}C_{4}}{6.0976}\right) W(m,n) I(m,n) & \text{for } 192 < I(m,n) < 256 \end{cases}$$

## **Spatial Visible: Algorithm 2**

Watermark insertion is carried out block-by-block using:

$$i_{Wk} = \alpha_k i_k + \beta_k w_k \qquad k = 1, 2...$$

The scaling and embedding factors are found out as,

$$\alpha_k = \frac{1}{\hat{\sigma_{Ik}}} \exp\left(-(\hat{\mu_{Ik}} - \hat{\mu}_I)^2\right)$$

$$\beta_k = \hat{\sigma_{Ik}} \left(1 - \exp\left(-(\hat{\mu_{Ik}} - \hat{\mu}_I)^2\right)\right)$$

Values are scaled to proper range :

$$\alpha_k = \alpha_{min} + (\alpha_{max} - \alpha_{min}) \frac{1}{\hat{\sigma}_{I_k}} exp\left(-(\hat{\mu}_{I_k} - \hat{\mu}_I)^2\right)$$

$$\beta_k = \beta_{min} + (\beta_{max} - \beta_{min}) \hat{\sigma}_{I_k} \left(1 - exp\left(-(\hat{\mu}_{I_k} - \hat{\mu}_I)^2\right)\right)$$

## Spatial Visible: Proposed Datapath Architecture



## **Spatial Visible: Proposed Controller**



## Spatial Visible: Overall Chip Layout



## **Spatial Visible: Overall Chip**



Figure 9.21. Pin Diagram for the Proposed Watermarking Chip Table 9.7. Overall Statistics of the Watermarking Chip

| Area               | $3.34 \times 2.89 mm^2$ |
|--------------------|-------------------------|
| Number of gates    | 28469                   |
| Clock frequency    | 292.27MHz               |
| Number of I/O pins | 72                      |
| Power              | 6.9286mW                |

## **Spatial Visible: Results**







(b) Bird



(c) Nuts and Bolts



(d) Watermark

#### Original Images and Watermark



(a) Lena



(b) Bird



(c) Nuts and Bolts

Watermarked Images using Algorithm 1

NOTE: Similar watermarked images are obtained using algorithm2. The difference lies in the SNR.

## **DCT Domain: Algorithms**

• The invisible watermark insertion involves addition of random numbers to relatively perceptual significant co-efficients of the host image.

$$c_{I_{Wk}}(m, n) = c_{Ik}(m, n) + \alpha r_k(m, n)$$

• The visible watermark is inserted in the host images block-by-block and watermarked image block is obtained.

$$c_{I_{W\,k}} = \alpha_k \, c_{I\,k} + \beta_k \, c_{W\,k}$$

Current scaling and embedding factors are obtained as,

$$\alpha_k^c = \sigma_{ACI_k} exp \left( -(\mu^*_{DCI_k} - \mu^*_{DCI})^2 \right)$$
  
 $\beta_k^c = \frac{1}{\sigma_{ACI_k}} \left( 1 - exp \left( -(\mu^*_{DCI_k} - \mu^*_{DCI})^2 \right) \right)$ 

• The current values are then linearly scaled to user defined ranges.

## **DCT Domain: Proposed Architecture**



## DCT Domain: Dual Voltage and Freq.



Figure 9.28. Dual Voltage and Dual Frequency Operation of the Datapath

## DCT Domain: Overall Chip Layout (borrowed from masters thesis of Karthik)



## **Conclusions**

| ☐ The reduction of peak power, peak power differential, average power and energy are equally important.                                                                                                                                                        |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ☐ The polynomial time-complexity resource and time constrained energy minimization scheduling algorithms could reduce energy consumption significantly with reasonable or no time penalty. ILP-based EDP minimization is an alternative to achieve same thing. |
| ☐ The function CPF could capture all the different forms of power and its minimization using heuristic or ILP could yield significant reductions in all the different forms of power.                                                                          |
| ☐ The MPG function used as an alterative results comparable reductions, except energy reduction.                                                                                                                                                               |
| ☐ The comparison of peak and average power minimization and only peak power minimization shows that there is 5% increase in peak power reduction.                                                                                                              |
| □ The MVDFC approach foundout to be better alternative. For the circuits having almost equal number of addition and multiplier operations in the critical path the savings are maximum with no time penalty for MVDFC case.                                    |
| ☐ The scheduling schemes are useful for data intensive applications.                                                                                                                                                                                           |
| ☐ It is observed that the results of hardware based watermarking schemes are comparable to that of software.                                                                                                                                                   |
|                                                                                                                                                                                                                                                                |

### **Impact of this Dissertation**

- None of the datapath scheduling algorithms available in current literature minimize transient power. There are few works available that handle peak power minimization. There are no research works handling both voltage and frequency parameters. Thus, we conclude any of the low power datapath scheduling algorithms proposed in this dissertation can create strong impact low power behavioral synthesis research.
- •All the watermarking chip designed are the first implementations in the respective category. At this digital age, when the copyright and piracy are threat to industrial growths, the secure digital devices integrated with watermarking chips can produce copyrighted multimedia data in real-time.

#### **Future Works**

- The applicability of the scheduling schemes for pipelining is to be investigated.
- The effect of switching activity is to be taken into account.
- The detail design of controller is to be done.
- The effect on clocking network is to be studied.
- Different nonlinear optimization techniques and new linear techniques can be investigated to minimize CPF / MPG.
- Similarly, the design works can be extended to develop pipelined and / or SIMD based designs.
- Implementation of video and audio watermarking algorithms can also be considered.



#### **Publications from this Dissertation**

- S. P. Mohanty, N. Ranganathan and R. K. Namballa, "VLSI Implementation of Visible Watermarking for a Secure Digital Still Camera Design", to appear in Proc. of the 17th IEEE Intl. Conf. on VLSI Design, 2004.
- 2. S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "ILP Models for Energy and Transient Power Minimization During Behavioral Synthesis", to appear in Proc. of the 17th IEEE Intl. Conf. on VLSI Design, 2004.
- S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "Power Fluctuation Minimization During Behavioral Synthesis using ILP-Based Datapath Scheduling", to appear in Proc. ICCD 2003.
- 4. S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "Transient Power Minimization Through Datapath Scheduling in Multiple Supply Voltage Environment", to appear in Proc. of the 10th IEEE International Conference on Electronics, Circuits and Systems, 2003.
- 5. S. P. Mohanty, N. Ranganathan and R. K. Namballa, "VLSI Implementation of Invisible Digital Watermarking Algorithms Towards the Development of a Secure JPEG Encoder", in Proc. of the IEEE Workshop on Signal Processing Systems, pp. 183-188, 2003.
- 6. S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "Simultaneous Peak and Average Power Minimization during Datapath Scheduling for DSP Processors", in Proc. of GLSVLSI pp. 215-220, 2003.
- 7. S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "An ILP-Based Scheduling Scheme for Energy Efficient High Performance Datapath Synthesis", in Proc. of ISCAS, Vol. 5, pp. 313-316, 2003.
- 8. S. P. Mohanty, N. Ranganathan and S. K. Chappidi, "Peak Power Minimization Through Datapath Scheduling", in Proc. of ISVLSI, pp. 121-126, 2003.
- 9. S. P. Mohanty and N. Ranganathan, "A Framework for Energy and Transient Power Reduction during Behavioral Synthesis", in Proc. of the 16th IEEE Intl. Conf. on VLSI Design 2003, pp. 539-545, 2003, (Nominated for best paper award; ranked within top 5 papers out of 210 submissions).
- 10. S. P. Mohanty and N. Ranganathan, "Energy Efficient Scheduling for Datapath Synthesis", in Proc. of the 16th IEEE Intel. Conf. on VLSI Design, pp. 446-451, 2003.
- 11. S. P. Mohanty, N. Ranganathan and V. Krishna, "Datapath Scheduling using Dynamic Frequency Clocking", in Proc. of ISVLSI, pp. 65-70, 2002.



# Thank you

