# Circuit Implementation of Switchable Pins in Chip Multiprocessor

Zhou Zhao, Ashok Srivastava, Lu Peng and Shaoming Chen, Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, U.S.A. {zzhao13, eesriv, lpeng, schen26}@lsu.edu

Saraju Mohanty, Department of Computer Science and Engineering, University of North Texas, Denton, TX 76207, U.S.A. {saraju.mohanty@unt.edu}

Abstract: Transistor scaling and voltage scaling have driven the development of VLSI chip design and modern portable devices. However, transistor scaling brings dark silicon to chips. Lowering the supply voltage also forces many transistors to operate in the near-threshold voltage region, where they are very sensitive to voltage variation. This paper analyzes the influence of IR drop in a chip multiprocessor (CMP) architecture. Building on our previous work, we focus on the circuit implementation of the proposed switchable pin in a CMP, using I/O pads with a low-cost on-chip control circuit that lets the microprocessor work in data mode or power mode according to its requirements. Using six of the eight traditional data pins in an 8-bit RISC microprocessor as switchable pins, the total current in power mode is boosted by 72% compared to normal data mode. For data transmission, the relative signal-to-noise ratio (rSNR) is tested and the performance of the DRAM is also discussed.

#### Keywords: Dark silicon; IR drop; Switchable pins

## I. INTRODUCTION

Advanced fabrication technology and novel computer architecture allow future chip multiprocessors (CMPs) to contain more and more high-performance cores. However, the very large leakage current of submicron processes and the limits of current cooling techniques block the beneficial effects of transistor scaling. Especially in emerging portable devices, usable space is so precious that integrating a heat-dissipation device on the board becomes very difficult. Dark silicon, which means reducing the operating frequency of some modules in a chip to avoid overheating, will occupy a larger area in future VLSI chips [1]. Another challenge is that, with supply voltage scaling, numerous transistors in future chips will operate in the near-threshold voltage region, where even a slight voltage variation might cause bit errors in data transmission and so reduce reliability [2]. Previous work [3] has shown that a microprocessor can be designed for a supply range from 280 mV to 1.2 V. It is anticipated that future chips will run at ultra-low voltage with numerous cores. Complicated placement and routing, together with a unique power distribution, might lead to serious IR drop, which would amplify the risk of operating in the near-threshold voltage region.

Efficiently managing and distributing power on a chip, proposing novel topologies for the chip framework, and implementing transistors doped with new materials are three feasible approaches to these problems in future VLSI chips. Our current research focuses mainly on the first two. In previous work by our group [4, 5], the proposed "switchable pins", which cost far less than an on-chip voltage regulator, were shown through system-level simulation to be feasible for mitigating dark silicon and the negative impact of voltage scaling. This paper analyzes the switchable pins from the circuit point of view for implementation in CMOS.

Figure 1: Two ways of placement for a CMP with a shared cache. (a) ROW placement and (b) SURROUND placement.

Section II analyzes IR drop and frequency drop for a CMP and proposes a circuit-level chip design with switchable pins. Section III presents the test results, including power boost and data transmission. Section IV summarizes the results.

## II. ANALYSIS REGARDING SWITCHABLE PIN

Besides the frequency drop that dark silicon brings to future VLSI chips, IR drop is another serious problem in future VLSI design because of supply voltage scaling and potential operation in the near-threshold voltage region. The IR drop we focus on here is the on-chip voltage variation caused by complicated routing and placement. The voltage drop across the pad can be ignored because the pad resistance is extremely low.

Figure 2: Ratio of IR drop to supply voltage under PTM.

Figure 3: Structure of on-chip voltage regulator in a CMP.

We use a CMP with one shared cache to analyze IR drop and frequency drop. We consider two typical placements of a CMP, shown in Figure 1: ROW placement and SURROUND placement. In ROW placement, a CMP may have a single row of cores, but for future high-performance CMPs with numerous cores we assume a double row. To estimate IR drop, we approximate the voltage drop from the lengths of the wires connecting the external $V_{DD}$ pin to each core's on-chip power pin. The routing strongly influences this estimate. Here we consider only the worst case, in which power wires run around the edge of the chip and no power wire passes through a core. This worst case is also what current EDA tools [6] normally produce in automatic layout generation for digital integrated circuits when the designer does not perform full back-end optimization. We assume that each module, whether a core or a unit-cache, is a square. For an n-core CMP in ROW placement, the total length of the wires contributing to IR drop can then be expressed as follows:

$$\sum L_{ROW} = \sum_{i=1}^{\frac{n}{2}} \sqrt{\left[x - \frac{\alpha}{2}(2i-1)\right]^2} + \sum_{i=1}^{\frac{n}{2}} \left[\frac{\alpha}{2}(2i-1)\right] + x + \alpha + \beta \tag{1}$$

In Eq. (1), the external $V_{DD}$ pin is at $(x, 2\alpha+\beta)$, where $\alpha$ represents the side length of one core and $\beta$ represents the side length of one unit-cache. Note that the unit-caches together constitute the shared cache in terms of area only, not function, based on the assumption that the cache area is proportional to the number of cores. Taking the derivative of Eq. (1) with respect to x gives:

$$\frac{\partial \sum L_{ROW}}{\partial x} = \sum_{i=1}^{\frac{n}{2}} \frac{\sqrt{\left[x - \frac{\alpha}{2}(2i-1)\right]^2}}{x - \frac{\alpha}{2}(2i-1)} + 1 = 1 + 1 + \dots + (-1) + (-1) \tag{2}$$

Each term of the sum is +1 for a core to the left of the pin and -1 for a core to the right, so the derivative changes sign near the middle of the row. It can be concluded that, to obtain the shortest power wires and hence the minimum IR drop, the $V_{DD}$ pin should be placed at the center of the edge; if the pin is placed off-center, near a corner, IR drop increases.
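As a numerical check of Eq. (1), the following minimal sketch evaluates the total ROW wire length for three pin positions. The function name `total_wire_length_row` and the sample values (n = 8, α = β = 1 in arbitrary units) are illustrative assumptions, not values from our design.

```python
def total_wire_length_row(x, n, alpha, beta):
    """Total power-wire length of Eq. (1) for ROW placement.

    x     : horizontal position of the external V_DD pin
    n     : number of cores (assumed even, two rows of n/2 cores)
    alpha : side length of one core
    beta  : side length of one unit-cache
    """
    half = n // 2
    # Horizontal core centers along one row: alpha/2 * (2i - 1), i = 1..n/2
    centers = [alpha / 2 * (2 * i - 1) for i in range(1, half + 1)]
    # sqrt([x - c]^2) in Eq. (1) is simply |x - c|
    horizontal = sum(abs(x - c) for c in centers)
    return horizontal + sum(centers) + x + alpha + beta


if __name__ == "__main__":
    n, alpha, beta = 8, 1.0, 1.0        # hypothetical example values
    width = alpha * n / 2               # length of the chip edge
    for x in (0.0, width / 2, width):   # corner, center, opposite corner
        length = total_wire_length_row(x, n, alpha, beta)
        print(f"x = {x:.1f}: total length = {length:.1f}")
```

With these sample values, the center position gives a shorter total length than either corner, in line with the conclusion drawn from Eq. (2).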

As in ROW placement, for an n-core CMP in SURROUND placement, the total length of the wires contributing to IR drop is:

$$\sum L_{SURROUND} = \sum_{i=1}^{\frac{n+4}{4}} \sqrt{\left[x - \frac{\alpha}{2}(2i-1)\right]^2} + 3\sum_{i=1}^{\frac{n+4}{4}} \frac{\alpha}{2}(2i-1) + \frac{n+4}{2}\alpha + x \tag{3}$$

Taking the derivative of Eq. (3) with respect to x, we obtain:

$$\frac{\partial \sum L_{SURROUND}}{\partial x} = \sum_{i=1}^{\frac{n+4}{4}} \frac{\sqrt{\left[x - \frac{\alpha}{2}(2i-1)\right]^2}}{x - \frac{\alpha}{2}(2i-1)} + 1 = 1 + 1 + \dots + (-1) + (-1) \tag{4}$$

As in ROW placement, the power pin in SURROUND placement should be placed as close to the center as possible.

For both placements, the worst case, in which the IR drop on a single core's voltage supply is maximum, is given by:

$$\max\{L_{ROW}\} = x + \alpha + \beta + \alpha \left(\frac{n-1}{2}\right) \tag{5}$$

$$\max\{L_{SURROUND}\} = x + \frac{n+3}{2}\alpha \tag{6}$$

Clearly, locating the V<sub>DD</sub> pin at the center of the edge is the best way to avoid a very large IR drop. The analysis above also shows that the location of the external power pin influences only the IR drop of cores close to the pin. The IR drop of cores far from the power pin is insensitive to its location and can be treated as a constant. Using PTM and ITRS [7, 8], we can approximately predict the ratio of IR drop to supply voltage, as shown in Figure 2. For future CMPs with many cores, a single external power pin can produce a large IR drop and might cause serious data transmission errors under voltage scaling.
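The worst-case lengths of Eqs. (5) and (6) can be turned into a rough IR-drop ratio with a uniform-wire model. The sketch below is only illustrative: the wire resistance per unit length, the current, and the supply voltage are hypothetical inputs, and the linear model V = I·r·L is an assumption rather than the PTM-based prediction behind Figure 2.

```python
def max_length_row(x, n, alpha, beta):
    """Worst-case wire length for ROW placement, Eq. (5)."""
    return x + alpha + beta + alpha * (n - 1) / 2


def max_length_surround(x, n, alpha):
    """Worst-case wire length for SURROUND placement, Eq. (6)."""
    return x + (n + 3) / 2 * alpha


def ir_drop_ratio(length, r_per_unit, current, vdd):
    """Ratio of IR drop to supply voltage for a uniform wire.

    r_per_unit (resistance per unit length) and current are
    hypothetical inputs chosen by the user, not values from the paper.
    """
    return current * r_per_unit * length / vdd


if __name__ == "__main__":
    # Hypothetical example: 8 cores, unit-sized cores and unit-caches
    length = max_length_row(x=2.0, n=8, alpha=1.0, beta=1.0)
    ratio = ir_drop_ratio(length, r_per_unit=0.1, current=0.1, vdd=1.0)
    print(f"worst-case length = {length}, IR drop / VDD = {ratio:.3f}")
```

Comparing the returned ratio against a tolerance, such as a fixed percentage of the supply voltage, indicates whether a single external power pin would be sufficient under the assumed parameters.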



Figure 4: Proposed core with switchable pins.



Figure 5: Control clock used for the core with switchable pins.



Figure 6: Pad used as switchable pin.

The other issue concerns frequency. It is well accepted that the supply voltage determines how fast a core can run. Voltage loss caused by IR drop therefore indirectly reduces operating speed and forces some parts of a chip into dark silicon. Mitigating IR drop thus not only avoids potential errors during data transmission but also prevents more of the chip from becoming dark silicon.

To mitigate IR drop, adding more external power pins is one possible solution, but it reduces the number of pins available for data transmission. To improve the reliability of power delivery at low cost, various on-chip voltage regulators have been proposed [9]. For CMPs, an abstract view of on-chip voltage regulation is shown in Figure 3. Each regulator supplies one core, and all regulators share a single external power pin. The regulator compensates for voltage variation caused by IR drop or temperature, which ensures the precision of the supply voltage of each core. However, this additional module adds power consumption and introduces delay, especially in chips with DVFS (Dynamic Voltage Frequency Scaling). The response time of mode conversion is a problem that might result in a mismatch between supply voltage and clock frequency. In addition, current voltage regulators include a large number of capacitors and inductors, both of which cost chip area.

Therefore, a low-cost structure that reduces the IR drop caused by complicated routing is necessary. The on-chip voltage regulator approach suggests that providing more power pins is a good way to avoid excessive IR drop and frequency drop. If we directly used most pins to deliver power instead of data, voltage drop and frequency drop would be mitigated. However, a chip with more power pins and fewer data pins would obviously transmit data less efficiently. One observation is that a microprocessor does not always need to handle a large amount of data at high bandwidth [4, 10]. We therefore propose a switchable pin that is used for either power delivery or data transmission according to actual requirements. Its most attractive feature is that it consumes only a little extra power and adds only a little extra area compared to on-chip voltage regulators. To implement this novel, low-cost structure in CMOS, the most important questions are how to set the on-chip switches properly and how to set the control clock so that the proposed pin works in the correct mode.

Figure 7: Simulation of the core with switchable pins.

Figure 8: Frequency versus R<sub>MIN</sub> of supported cores.

Our design is based on an 8-bit RISC microprocessor. With this core, we use some of the traditional data pins as switchable pins to regulate power and transfer data efficiently. In normal data mode, the core works as usual: all data pins are used for data transmission and only the one traditional power pin delivers power. When the core enters power mode, the switchable pins start to work and some of the data pins become power pins. The question is then how to ensure accurate data transmission over the limited number of remaining data pins in a given time. This design uses frequency division during power mode: all pins other than the switchable pins carry the data to RAM by dividing the duration of power mode, so that the data from all data outputs is transferred over the remaining pins in turn. The more data pins are used as switchable pins, the more complex the clock generator must be to divide the frequency, and the shorter the time available for each data transfer. This core has eight outputs connected to external RAM. In this design, we initially do not run the core at an extremely high frequency but focus on how much bonus power the switchable pins can deliver. Therefore, as shown in Figure 4, we use six of the eight data pins as switchable pins and keep two data pins to transfer data. Each data pin serves four data outputs, so the duration of power mode is divided by four to transfer the data one by one. Figure 5 shows the clock diagram used for our design. In data mode, data transmission and power delivery are performed as in the normal core. In power mode, the six switchable pins act as power pins to deliver more power to support other cores, and the remaining two data pins transfer the data of the eight outputs one by one via frequency division. The control clock is implemented by a 4-bit shift register. Note that the six pads used as switchable pins are bidirectional pads, as shown in Figure 6. Data\_IO\_EN is controlled by S<sub>data</sub>. In data mode, Data\_IO\_EN is 1, so the pad lets data transfer from the chip to external storage. In power mode, Data\_IO\_EN is 0, which makes the pad act as an input pad instead of an output pad, and more power enters the chip through the pad.
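The power-mode data path described above, eight outputs shared by two data pins over four time slots, can be modeled behaviorally as follows. This is a sketch under stated assumptions: the 4-bit shift register is treated as a one-hot ring counter, and the mapping of outputs to pins (pin 0 serves outputs 0 to 3, pin 1 serves outputs 4 to 7) is hypothetical, since the actual assignment is defined by Figures 4 and 5. All function names are illustrative.

```python
def ring_counter(width=4):
    """Yield the one-hot states of a `width`-bit circular shift register."""
    state = [1] + [0] * (width - 1)
    for _ in range(width):
        yield tuple(state)
        state = [state[-1]] + state[:-1]   # shift by one position


def multiplex(outputs, data_pins=2):
    """Time-multiplex the core outputs onto the remaining data pins.

    Assumed mapping: pin p serves outputs p*slots .. p*slots + slots - 1.
    Returns one tuple of pin values per time slot.
    """
    slots = len(outputs) // data_pins
    frames = []
    for state in ring_counter(slots):
        slot = state.index(1)              # slot selected by the one-hot bit
        frames.append(tuple(outputs[p * slots + slot] for p in range(data_pins)))
    return frames


def demultiplex(frames):
    """Rebuild the original output word from the per-slot pin values."""
    data_pins = len(frames[0])
    return [frames[s][p] for p in range(data_pins) for s in range(len(frames))]


if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0]        # hypothetical 8-bit output word
    frames = multiplex(word)
    print(frames)
    print(demultiplex(frames) == word)
```

Using more switchable pins in this model means fewer data pins and therefore more slots per pin, which matches the observation that the clock generator grows more complex and each transfer gets less time.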

## III. TEST RESULTS

The test first focuses on how much bonus power the switchable pins can provide. In power mode, extra supply voltage is applied through the switchable pins to support other cores, so the current delivered by the switchable pins depends largely on the internal load (the additionally supported cores). A small core with low resistance and capacitance draws excessive current and causes a large IR drop in the switchable pins. We therefore set a safe line: the maximum IR drop the system can tolerate is 5% of V<sub>DD</sub>. Under this constraint, using HP 250nm with a 5V supply and a 5MHz clock frequency, the simulation result is shown in Figure 7. In normal data mode (50ns to 100ns), the average current is around 25 mA, and in power mode (0 to 50ns), the average current is around 43 mA. Sub-module simulation also shows that each switchable pin provides an additional 3 mA to the supported core in power mode. These results show that the total bonus current is 18 mA, so the total current, and hence the power delivered at a fixed supply voltage, is boosted by 72%. Using this bonus current, the chip can support other cores. Through sub-module simulation, the minimum resistance of the supported cores is 80.2 $\Omega$ for the given safe line that IR drop cannot exceed 5% of V<sub>DD</sub>. For increasing frequency, Figure 8 shows the relationship between frequency and the feasible minimum resistance of the supported cores. If the resistance of the supported cores falls below this bottom line, IR drop will exceed 5% of V<sub>DD</sub> and affect the operation of the system.

Figure 9: An 1-bit storage unit of the DRAM.

Figure 10:  $rSNR_{READ/WRITE}$  in the microprocessor with and without switchable pins.

Figure 11: rSNR<sub>PIN/ORIGINAL</sub> in WRITE stage and READ stage.
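The bonus current and the 72% boost reported in this section follow directly from the per-pin and average currents. As a worked check, with $I_{bonus}$, $I_{data}$ and $I_{power}$ used here as shorthand for the bonus current and the average currents in data mode and power mode:

$$I_{bonus} = 6 \times 3\,\mathrm{mA} = 18\,\mathrm{mA} = I_{power} - I_{data} = 43\,\mathrm{mA} - 25\,\mathrm{mA}, \qquad \frac{I_{bonus}}{I_{data}} = \frac{18}{25} = 72\%$$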

For the data test, we mainly examine whether data can be sent correctly to the external storage during power mode. In this design, we use a low-cost DRAM as the external storage [11]. The DRAM capacitance is 0.4pF per storage unit. The 1-bit storage unit, the most important part of a DRAM, is shown in Figure 9. Our test method is guided mainly by the concept of the eye diagram [12]. The signal-to-noise ratio (SNR) for a given sample rate can be calculated as follows:

$$SNR = \frac{V_{HIGHAVE} - V_{LOWAVE}}{\sigma_{HIGH} + \sigma_{LOW}} \tag{7}$$

where $V_{HIGHAVE}$ and $V_{LOWAVE}$ represent the average voltages of high logic and low logic, respectively, and $\sigma_{HIGH}$ and $\sigma_{LOW}$ are the standard deviations of high logic and low logic, respectively. We can therefore obtain a real SNR for a given frequency in the WRITE stage or READ stage. Once the real SNR is obtained, another term, rSNR (relative SNR), is introduced into our test [13]. The rSNR shows how performance trends with factors that clearly influence data transmission, such as clock frequency, temperature, and direction of data transmission.
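Eq. (7) can be evaluated from sampled logic levels as in the minimal sketch below. The sample values are hypothetical, and the use of the population standard deviation is an assumption, since the estimator is not specified here.

```python
from statistics import mean, pstdev


def snr(high_samples, low_samples):
    """Signal-to-noise ratio of Eq. (7) from sampled logic levels.

    Uses the population standard deviation (an assumption; the sample
    standard deviation gives nearly the same result for many samples).
    """
    v_high_ave = mean(high_samples)
    v_low_ave = mean(low_samples)
    sigma_high = pstdev(high_samples)
    sigma_low = pstdev(low_samples)
    return (v_high_ave - v_low_ave) / (sigma_high + sigma_low)


if __name__ == "__main__":
    # Hypothetical voltages sampled at the eye-diagram sampling instant
    highs = [4.75, 5.25, 4.75, 5.25]
    lows = [-0.25, 0.25, -0.25, 0.25]
    print(snr(highs, lows))
```

A wider spread of either logic level increases the denominator and lowers the SNR, which corresponds to a more closed eye in the eye diagram.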

To compare the performance of the READ stage and the WRITE stage at the same frequency, we use the real SNRs to obtain $rSNR_{READ/WRITE}$, defined as:

$$rSNR_{READ/WRITE} = 20\log_{10}\frac{SNR_{READ}}{SNR_{WRITE}} \tag{8}$$
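A relative SNR of this form is a ratio of two real SNRs expressed in decibels. The helper below is a minimal sketch; the name `rsnr_db` and the input check are illustrative choices.

```python
import math


def rsnr_db(snr_a, snr_b):
    """Relative SNR in decibels: 20 * log10(snr_a / snr_b)."""
    if snr_a <= 0 or snr_b <= 0:
        raise ValueError("SNR values must be positive")
    return 20 * math.log10(snr_a / snr_b)


if __name__ == "__main__":
    # Hypothetical real SNRs of the READ and WRITE stages
    print(rsnr_db(9.0, 10.0))
```

A negative result means the first SNR is lower than the second, so a negative $rSNR_{READ/WRITE}$ indicates that the READ stage performs worse than the WRITE stage at that frequency.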

Using the method described above, the rSNR<sub>READ/WRITE</sub> for clock frequencies below 1GHz is shown in Figure 10. The figure shows that, for the microprocessor both with and without switchable pins, performance in the WRITE stage is slightly better than in the READ stage. This can be explained from two aspects: a) From the buffer point of view, in the WRITE stage data comes from off-chip to on-chip and therefore passes through the two buffers in the pad shown in Figure 6. Even if the off-chip data is not as clean as expected, the two buffers reshape the distorted waveforms to ensure correct data transmission. In the READ stage, however, only one buffer in the pad performs wave shaping, so more incorrect data is generated. b) From the charge-movement point of view, the DRAM capacitor might pre-discharge to some degree before the WRITE stage, so the actual discharge time during the WRITE stage is shorter than the ideal value. The READ stage, in contrast, is a charging process, and our design has no pre-charge phase for the DRAM, so the actual charge time in the READ stage is longer than the actual discharge time in the WRITE stage. Data transmission in the WRITE stage is therefore more efficient and has less delay than in the READ stage. Regarding the value of the DRAM capacitor, a small capacitor means a low RC delay, which benefits high-speed data transmission, but also a higher likelihood of charge leakage, which harms the accuracy of data transmission. With low capacitance, the frequency of the refresh circuit has to be increased to ensure correct data transmission, and increasing this frequency also increases the power consumption of the DRAM. Setting the DRAM capacitance is therefore a trade-off between performance and power consumption.

Another calculation is needed to analyze the performance deterioration caused by switchable pins in both the WRITE stage and READ stage. Here we define  $rSNR_{PIN/ORIGINAL}$  as follows:

$$rSNR_{PIN/ORIGINAL} = 20\log_{10}\frac{SNR_{PIN}}{SNR_{ORIGINAL}} \tag{9}$$

In Eq. (9), SNR<sub>PIN</sub> is the SNR of the microprocessor with switchable pins and SNR<sub>ORIGINAL</sub> is the SNR of the original microprocessor without switchable pins. Note that this equation can be used in both the WRITE stage and the READ stage. The results are shown in Figure 11. The figure shows that the performance deterioration caused by switchable pins is very small. The design shows the best performance at a frequency of 800MHz. We analyze the figure in two parts: a) Below 800MHz, the impedance of the capacitance present in the switchable pins decreases as frequency increases, so the microprocessor with switchable pins gradually behaves like the original microprocessor and rSNR<sub>PIN/ORIGINAL</sub> increases. b) Above 800MHz, the impedance of the inductance increases and performance degrades. The charging process determines whether READ is accurate. In this case, the likelihood of incomplete charging increases because the charging time is short, so the performance of READ drops greatly.

## IV. CONCLUSIONS AND FUTURE RESEARCH

In this paper, we analyzed the negative influence of IR drop on future CMPs and other VLSI chips, including data transmission errors and more dark silicon on chips. We qualitatively compared on-chip voltage regulators and cores with switchable pins. We used an 8-bit RISC microprocessor to demonstrate the feasibility of the proposed switchable pin structure. The results show that this novel, low-cost circuit does boost power for the original chip, and the rSNR test showed that the added switchable pins do not greatly affect data transmission when the operating frequency is below 1GHz. Future work will focus on optimizing the switchable pin to reduce the influence of parasitics, deliver more power into chips, and improve the accuracy of high-speed data transmission. It is also important to design switchable pins using more advanced cores and more advanced fabrication processes. Since our layout-level data test cannot be as accurate as a test on a real chip, data transmission should be tested with an oscilloscope that supports eye diagram tracking in order to obtain the real SNR, error rate and clock jitter. These are significant for evaluating the performance of data transmission between the core and the DRAM.

## ACKNOWLEDGMENT

Part of the work is supported under NSF grant 1422408.

## REFERENCES

- [1] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. "Dark silicon and the end of multicore scaling," in *Proceedings of International Symposium on Computer Architecture (ISCA)*, 2011, pp. 365-376.
- [2] M. Srivastav, M. B. Henry, and L. Nazhandali. "Design of low-power, scalable-throughput systems at near/sub threshold voltage," in *Proceedings of International Symposium on Quality Electronic Design (ISQED)*, 2012, pp. 609-616.
- [3] S. Jain, S. Khare, S. Yada, et al. "A 280mV-to-1.2V wide-operating-range IA-32 processor in 32nm CMOS," in *Proceedings of International Solid-State Circuit Conference Digest of Technical Papers (ISSCC)*, 2012, pp. 66-68.
- [4] S. Chen, Y. Hu, Y. Zhang, L. Peng, J. Ardonne, S. Irving, and A. Srivastava. "Increasing off-chip bandwidth in multi-core processors with switchable pins," in *Proceedings of International Symposium on Computer Architecture (ISCA)*, 2012, pp. 385-396.
- [5] S. Chen, L. Peng, Y. Hu, Z. Zhao, A. Srivastava, Y. Zhang, J. W. Choi, B. Li, and E. Song. "Powering Up Dark Silicon: Mitigating the Limitation of Power Delivery via Dynamic Pin Switching," accepted by *IEEE Transactions on Emerging Topics in Computing (TETC)*, 2015.
- [6] R. H. Otten, R. Camposano, and P. R. Groeneveld. "Design Automation for Deepsubmicron: present and future," in *Proceedings of Design, Automation and Test in Europe Conference and Exhibition*, 2002, pp. 650-657.
- [7] Predictive Technology Model (PTM). http://ptm.asu.edu/. Accessed: Jun, 2015
- [8] International Technology Roadmap for Semiconductors. http://www.itrs.net/. Accessed: Jun, 2015
- [9] R. G. Raghavendra, and P. Mandal. "An on-chip voltage regulator with improved load regulation and light load power efficiency," in *Proceedings of International Conference on VLSI Design (VLSID)*, 2006.
- [10] A. Raghavan, Y. Luo, A. Chandawalla, M. Papaefthymiou, K. P. Pipe, T. F. Wenisch, and M. M. K. Martin. "Computational sprinting," in *Proceedings of High Performance Computer Architecture (HPCA)*, 2012, pp. 1-12.
- [11] I. Kiyoo, VLSI Memory Chip Design. NY: Springer Science & Business Media, 2013.
- [12] Eye Diagram Measurements in Advanced Design System. http://cp.literature.agilent.com/litweb/pdf/5989-9453EN.pdf. Accessed: Oct, 2015
- [13] B. Özbek, and D. L. Ruyet, *Feedback Strategies for Wireless Communication*. NY: Springer, 2014.