HU XU, VASILIS F. PAVLIDIS, and GIOVANNI DE MICHELI, Integrated Systems Lab - EPFL

In three-dimensional (3D) integrated circuits, the effect of process variations on clock skew differs from 2D circuits. The combined effect of inter-die and intra-die process variations on the design of 3D clock distribution networks is considered in this article. A statistical clock skew model incorporating both the systematic and random components of process variations is employed to describe this effect. Two regular 3D clock tree topologies are investigated and compared in terms of clock skew variation. The statistical skew model used to describe clock skew variations is verified through Monte-Carlo simulations. The clock skew is shown to change in different ways with the number of planes forming the 3D IC and the clock network architecture. Simulations based on a 45-nm CMOS technology show that the maximum standard deviation of clock skew can vary from 15 ps to 77 ps. Results indicate that simply increasing the number of planes of a 3D IC does not necessarily lead to lower skew variation and higher operating frequencies. A multigroup 3D clock tree topology is proposed to effectively mitigate the variability of clock skew. Tradeoffs between the investigated 3D clock distribution networks and the number of planes comprising a 3D circuit are discussed and related design guidelines are offered. The skew variation in 3D clock trees is also compared with the skew variation of clock grids.

Categories and Subject Descriptors: B.7.1 [Hardware]: Integrated Circuits—VLSI (very large scale integration)

General Terms: Performance, Reliability

Additional Key Words and Phrases: Clock distribution network, clock skew, process variations, 3D ICs

#### **ACM Reference Format:**

Xu, H., Pavlidis, V. F., and De Micheli, G. 2012. Effect of process variations in 3D global clock distribution networks. ACM J. Emerg. Technol. Comput. Syst. 8, 3, Article 20 (August 2012), 25 pages. DOI = 10.1145/2287696.2287703 http://doi.acm.org/10.1145/2287696.2287703

#### **1. INTRODUCTION**

Clock skew is, typically, defined as the difference between the propagation delays of the clock signal from the source to the sinks of the clock distribution network. High clock frequencies severely constrain the clock skew budget, so that a sufficient fraction of the clock period can be dedicated to data processing.

3D integration emerges as a potent solution to alleviate the increasing interconnect delay in modern ICs [Pavlidis and Friedman 2009a, 2009b]. Multiplane circuits considerably reduce the interconnect length because of the vertical interconnections [Joyner et al. 2001]. Considering the important synchronization issue, the reduced interconnect latency can be exploited to either relax the clock skew constraints or further increase the speed of a circuit. A careful compromise between the two approaches can, nevertheless, be a better solution to avoid timing failures while delivering higher (yet not

© 2012 ACM 1550-4832/2012/08-ART20 \$15.00

DOI 10.1145/2287696.2287703 http://doi.acm.org/10.1145/2287696.2287703

<sup>20</sup> 

This work is funded in part by the Swiss National Science Foundation (no. 260021\_126517/1), European Research Council grant (no. 246810 NANOSYS), and Intel Braunschweig Labs, Germany.

Authors' addresses: H. Xu (corresponding author), V. F. Pavlidis, and G. De Micheli, Integrated Systems Lab, EPFL, Switzerland; email: hughesxuh@gmail.com.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

the highest) performance as compared to a 2D circuit. The operating frequency of a 3D circuit is considered to increase proportionally with the number of planes forming the circuit [Joyner et al. 2001, 2004]. The behavior of the clock skew in such diverse systems, however, is not considered in these analyses.

Clock skew is introduced at the design and fabrication stages and during the operation of ICs. There is a plethora of methods to manage the excessive clock skew in the design phase [Bakoglu et al. 1986; Cong et al. 1998; Friedman 2001]. Note that these techniques are developed for 2D circuits and are not directly applicable to 3D ICs. Careful physical design, however, does not guarantee the elimination of undesirable skew since the unwanted skew can be introduced in the fabrication phase. The primary sources of process variations include fluctuations of the gate length, doping concentrations, oxide thickness, and *Inter-Layer Dielectric* (ILD) thickness [Nassif 2000; Zarkesh-Ha et al. 1999].

The resulting process variations are generally divided into inter-die and intradie variations. Inter-die variations affect the characteristics of devices independently among dice, but the devices within one die are uniformly affected. Intra-die variations affect the characteristics of devices unequally within one die. Intra-die variations include systematic and random components [Orshansky et al. 2000; Bowman et al. 2002, 2009]. Since both of intra- and inter-die process variations are present in a 3D IC, the inclusion of these process variations is required in the analysis and design of 3D clock distribution networks.

The focus of this article is on the effect of process variations on the clock skew of potential 3D clock distribution architectures. The case studies include clock distribution networks that globally distribute the clock signal in a 3D stack [Mondal et al. 2007; Pavlidis et al. 2008; Arunachalam and Burleson 2008] and a new 3D topology. The proposed model can also be used to analyze synthesized 3D clock trees [Zhao and Lim 2010; Kim and Kim 2010; Yang et al. 2011]. The resulting skew variations in synthesized clock trees implicitly depend on the efficiency of the synthesis technique. Since the intention is to investigate the effect of process variations rather than the efficiency of a 3D clock tree synthesizer, such as in Zhao and Lim [2010], Kim and Kim [2010] and Yang et al. [2011], regular structures such as H-trees are explored.

The considered 3D technologies include the manufacturing processes where multiple physical planes bonded with different approaches are electrically connected by *Through Silicon Vias* (TSVs) [Katti et al. 2010]. In these 3D ICs, the clock distribution networks can span more than one plane, where each plane is fabricated separately. The variation of TSVs is negligible due to the small dimensions of TSVs as compared with the wave length used by the manufacturing equipement [Reda et al. 2009]. The impact of TSV stress on clock skew is discussed in Yang et al. [2011]. This impact on global clock distribution networks is assumed to be mitigated by proper keep out zones around each TSV in this article.

Simulation results indicate that 3D integration can decrease skew variation as compared to planar circuits. Nevertheless, this reduction depends on the employed 3D clock tree topology and the number of planes. Consequently, the objectives of this article are: (1) to accurately model skew variation for 3D clock trees, (2) to determine the behavior of the skew variation within 3D clock H-trees, (3) to propose an enhanced 3D clock tree topology, (4) and to provide a set of design guidelines to limit the variability of clock skew within 3D clock distribution networks.

The remainder of the article is organized as follows. Existing techniques for skew analysis in 2D circuits are briefly reviewed in the following section. A statistical model of the delay of critical paths in 3D ICs is also discussed. The problem of skew analysis for 3D clock distribution networks is formulated in Section 3. An accurate clock skew model for 3D clock trees considering the impact of process variations is described in

Section 4. This model can be extended to include the variation of interconnects, which is presented as an appendix. Simulation results and a comparison of two regular 3D H-tree topologies are presented in Section 5. A new 3D topology is described in Section 6. A comparison of skew variations between the clock trees and a typical 2D clock grid is presented in Section 7. Some conclusions are drawn in Section 8.

## 2. RELATED WORK

Several techniques for analyzing the effect of process variation on clock skew have been developed for 2D circuits emphasizing intra-die variations. Some of these techniques are briefly summarized in this section.

For 2D ICs, clock skew variations are estimated either by corner or statistical analysis [Jiang and Horiguchi 2001]. Corner analysis is useful for estimating process variations for 2D ICs but this method often introduces pessimism in the timing of a circuit. A method for statistical clock skew analysis based on Monte Carlo simulations is introduced in Malavasi et al. [2002]. The computational time of this method is, however, prohibitively high for large-scale ICs. Based on statistical timing analysis [Liou et al. 2001], other statistical skew modeling methods considering intra-die variations are presented in Jiang and Horiguchi [2001], Harris and Naffziger [2001], Agarwal et al. [2003, 2004] and Sundareswaran et al. [2008] to efficiently analyze skew variations. Although statistical skew analysis has been studied in 2D ICs, the resulting methods cannot be directly applied to 3D systems. In 2D ICs, since the inter-die process variations uniformly affect the devices within a circuit, the majority of the skew analysis methods emphasize the intra-die variations.

Recent work analyzing the effect of process variations on the performance of 3D ICs is presented in Garg and Marculescu [2009a, 2009b] and Reda et al. [2009], where the impact of process variations on the delay of critical paths is studied. The distribution of the maximum delay of datapaths is determined by considering that the datapaths have no common segments. Since the clock paths to different sinks can (actually are designed to) have common segments, these models are not directly applicable to determine the skew of clock distribution networks. The effect of process variations in 3D clock distribution networks is discussed in Xu et al. [2010], where the intra-die variation is considered independent among the devices within the same plane. Nevertheless, the spatial correlation of intra-die variations is shown to be nonnegligible when analyzing the process-induced skew in clock trees [Orshansky et al. 2000; Agarwal et al. 2003; Hashimoto et al. 2005].

#### **3. PROBLEM DEFINITION**

The problem of skew analysis for 3D clock distribution networks considering process variations is formulated in this section. As discussed in Friedman [2001], only the clock skew between the sequential elements which transfer data between each other (data-related or "sequentially-adjacent" registers) affects the performance of a circuit. Consequently, in addition to global skew, appropriate pairwise skew distributions are adopted to evaluate the performance of clock distribution networks [Sundareswaran et al. 2008].

H-tree is a common topology used to globally distribute the clock signal within a circuit [Bakoglu et al. 1986; Friedman 2001]. A typical buffered 3D H-tree is illustrated in Figure 1 [Pavlidis et al. 2008; Mondal et al. 2007]. The pairwise clock skew is defined as the skew between every pair of sinks in 3D clock distribution networks,  $S_{\text{skew}} = \{s_{i,j} | s_{i,j} = D_i - D_j, 1 \le i, j \le n_{\text{sink}}\}$ . Sinks *i* and *j* can be located in any plane of the 3D circuit and  $s_{i,j}$  denotes the skew between sinks *i* and *j*. The clock delay to sinks *i* and *j* is denoted by  $D_i$  and  $D_j$ , respectively. The number of clock sinks is  $n_{\text{sink}}$ .



Fig. 1. 3D H-trees spanning four planes, where (a) is the topology of a 3D H-tree and (b) is the 3D view of a 3D H-tree.

As discussed in Section 5.2, the number of buffers and the length of the interconnects in clock trees significantly affect the distribution of clock skew. The area and number of the physical planes comprising a 3D IC, consequently, affect the highest clock frequency that can be supported by a 3D IC. By investigating the effect of process variations on  $S_{\rm skew}$ , several guidelines for the design of 3D global clock trees with a low  $\Delta S_{\rm skew}$  are offered.

# 4. CLOCK SKEW MODELING FOR 3D CLOCK TREES

The distribution of clock skew considering inter- and intra-die process variations in 3D clock trees is modeled in this section. The presented model used to obtain the distribution of buffer delay is presented in Section 4.1. Utilizing this model, the distribution of the delay of a 3D clock path is analytically derived in Sections 4.2 and 4.3. In Section 4.4, the distribution of the clock skew between each pair of clock sinks in 3D clock trees is discussed.

#### 4.1. Buffer Delay Distribution

The variation of buffer delay originates from the device parameters deviating from their nominal values, as statistically modeled in Zarkesh-Ha et al. [1999] and Azuma et al. [1998]. The fluctuation of the buffer delay is typically approximated as being linear in the device parameter variations [Sundareswaran et al. 2008; Devgan and Kashyap 2003]. Alternatively, the variation in delay can be determined through the variations of the input capacitance and output resistance [Jiang and Horiguchi 2001; Xu et al. 2010]. The second method is enhanced by considering the input slew rate to more accurately model the distribution of the buffer delay. The interconnects constituting the 3D circuit are modeled as distributed RC wires. The circuit illustrated in Figure 2 is utilized to obtain the variation of buffer delay for different slew rates of the input signal.

Let  $R_{\rm in}$  denote the output resistance of a buffer driving the buffer under consideration. The load capacitance of the buffer under consideration is denoted by  $C_1$ . Interconnects with diverse impedance characteristics are modeled by employing different  $R_{\rm int}$ and  $C_{\rm int}$ , where  $R_{\rm int}$  and  $C_{\rm int}$  denote the resistance and capacitance of interconnects, respectively. The interconnect  $R_{\rm int}$  and  $C_{\rm int}$  can also be adjusted to produce different slew rates for the input signal of the buffer in Figure 2.



Fig. 2. An elemental circuit used to measure the variations in the buffer characteristics.

For a step input signal, the Elmore delay [Elmore 1948] from source S to nodes I and O in Figure 2, respectively, is

$$D_{SI} = 0.69R_{\rm in}C_{\rm int} + 0.38R_{\rm int}C_{\rm int} + 0.69(R_{\rm in} + R_{\rm int})C_{\rm b},\tag{1}$$

$$\Delta D_{SI} = 0.69(R_{\rm in} + R_{\rm int})\Delta C_{\rm b},\tag{2}$$

$$D_{SO} = D_{SI} + D_{\rm b} + 0.69R_{\rm b}C_{\rm l},\tag{3}$$

$$\Delta D_{SO} = \Delta D_{SI} + \Delta D_{\rm b} + 0.69C_1 \Delta R_{\rm b},\tag{4}$$

where  $C_{\rm b}$ ,  $R_{\rm b}$ , and  $D_{\rm b}$  are the input capacitance, the output resistance, and the intrinsic delay of the buffer, respectively. The variations of  $C_{\rm b}$ ,  $R_{\rm b}$ , and  $D_{\rm b}$  are denoted by  $\Delta C_{\rm b}$ ,  $\Delta R_{\rm b}$ , and  $\Delta D_{\rm b}$ , respectively. While investigating the buffer as shown in Figure 2, the presumed  $R_{\rm in}$  is considered constant (for the moment).

By measuring the delay variation at nodes I and O and setting  $C_1$  to zero (corresponding to  $\Delta D_{SO_0}$ ) and another value (e.g., 200 fF, corresponding to  $\Delta D_{SO_1}$ ) in Monte Carlo simulations, the mean value and standard deviation of  $\Delta C_b$ ,  $\Delta R_b$ , and  $\Delta D_b$  are obtained by (2) and (4) [Xu et al. 2010]. Assuming the sources of the process variations can be described by Gaussian distribution, the characteristics of a buffer can also be approximated as Gaussian distribution [Chang and Sapatnekar 2005]

$$\Delta C_{\rm b} \sim \mathcal{N}(0, \sigma_{C_{\rm b}}^2), \quad \Delta R_{\rm b} \sim \mathcal{N}(0, \sigma_{R_{\rm b}}^2), \quad \Delta D_{\rm b} \sim \mathcal{N}(0, \sigma_{D_{\rm b}}^2). \tag{5}$$

 $\sigma_{C_b}$  is directly obtained through (2). According to (2) and (4),  $\sigma_{D_{SO}}$  is determined by  $\sigma_{C_b}$ ,  $\sigma_{D_b}$ ,  $\sigma_{R_b}$ , and the covariance among these variables

$$\begin{aligned} \sigma_{D_{\rm SO}}^2 &= (0.69(R_{\rm in} + R_{\rm int})\sigma_{C_{\rm b}})^2 + \sigma_{D_{\rm b}}^2 + (0.69\sigma_{R_{\rm b}}C_{\rm l})^2 + 1.38(R_{\rm in} + R_{\rm int})\operatorname{cov}(D_{\rm b}, C_{\rm b}) \\ &+ 1.38C_{\rm l}\operatorname{cov}(D_{\rm b}, R_{\rm b}) + 0.952C_{\rm l}(R_{\rm in} + R_{\rm int})\operatorname{cov}(C_{\rm b}, R_{\rm b}), \end{aligned}$$
(6)

$$\sigma_{D_{\rm b}}^2 = \sigma_{D_{SI}}^2 - \sigma_{D_{SI}}^2 - 1.38(R_{\rm in} + R_{\rm int}) \text{cov}(D_{\rm b}, C_{\rm b}),\tag{7}$$

$$T_{C_{\rm b}} = \frac{\delta D_{SI}}{0.69(R_{\rm in} + R_{\rm int})}.$$
 (8)

 $\sigma_{R_b}$  is obtained from (6) by substituting  $\sigma_{D_b}$  and  $\sigma_{C_b}$  from (7) and (8), respectively.

Consider that  $\Delta C_{\rm b}$ ,  $\Delta R_{\rm b}$ , and  $\Delta D_{\rm b}$  are used to obtain the delay variation of each buffer stage  $\Delta d_i$ , which is similar to  $\Delta D_{SO}$ . When calculating  $\sigma_{d_i}$  (similar to recalculating  $\sigma_{D_{SO_1}}$  through (6)), the  $\sigma_{C_{\rm b}}$ ,  $\sigma_{R_{\rm b}}$ , and  $\sigma_{D_{\rm b}}$  are substituted into (6) again. In this procedure, the covariances  $\operatorname{cov}(D_{\rm b}, R_{\rm b})$ ,  $\operatorname{cov}(D_{\rm b}, C_{\rm b})$  and  $\operatorname{cov}(C_{\rm b}, R_{\rm b})$  are considerably canceled out. Consequently, the correlation among  $\Delta C_{\rm b}$ ,  $\Delta R_{\rm b}$ , and  $\Delta D_{\rm b}$  does not affect  $\Delta d_i$  significantly, as long as  $\Delta d_i$  is calculated based on this same correlation. Since  $\Delta C_{\rm b}$ ,  $\Delta R_{\rm b}$ , and  $\Delta D_{\rm b}$  are due to the same process variation sources, these variables are assumed to be fully correlated herein.

#### 4.2. Delay Distribution of a 3D Clock Path

σ

The buffer delay model described in Section 4.1 is used to produce the delay distribution of clock paths in 3D ICs. An example of a 3D clock path is illustrated in Figure 3. Note that this path is general and can be applied to any 3D clock tree in addition to the 3D

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 3. The electrical model of a segment of a clock path.

topologies investigated herein. The devices in different physical planes are connected by TSVs [Katti et al. 2010], which, in turn, are modeled as RC wires of different resistance and capacitance as compared to the horizontal wires (e.g.,  $R_{\text{TSV}}$  and  $C_{\text{TSV}}$  in Figure 3).  $R_{\text{TSV}}$  and  $C_{\text{TSV}}$  are considered fixed.

Consider the clock path consisting of buffers i - 1, i, and i + 1. From (2) and (4), the delay variation  $\Delta d_i$  attributed to the variation of buffer i along the investigated path is

$$\Delta d_{i} = 0.69(R'_{\text{in}(i)} + \Delta R_{\text{b}(i-1)})\Delta C_{\text{b}(i)} + 0.69\Delta R_{\text{b}(i)}(C'_{\text{l}(i)} + \Delta C_{\text{b}(i+1)} + \Delta C_{\text{b}(j)}) + 0.69R'_{\text{b}(i)}\Delta C_{\text{b}(j)} + \Delta D_{\text{b}(i)},$$
(9)

$$R_{\text{in}(i)} = R_{\text{b}(i-1)} + R_{\text{TSV}},$$
(10)

$$C_{l(i)} = 2C_{int} + C_{b(i+1)} + C_{b(j)},$$
(11)

where the prime (') denotes the nominal value. For buffer *i*, the  $\Delta R_{b(i-1)}$  of the upstream buffer and  $\Delta C_{b(i+1)}$  of the downstream buffer are both included in (9). To determine the delay of a clock path,  $\Delta d_i$  for all the buffers along this path is summed up. In this case,  $\Delta R_{b(i-1)}\Delta C_{b(i)}$  and  $\Delta R_{b(i)}\Delta C_{b(i+1)}$  are duplicated. Therefore, one of these two terms needs to be removed. Consequently,  $\Delta d_i$  is rewritten as

$$\Delta d_{i} = 0.69 \left( R'_{\text{in}(i)} \Delta C_{\text{b}(i)} + \Delta R_{\text{b}(i)} (C'_{\text{l}(i)} + \Delta C_{\text{b}(i+1)} + \Delta C_{\text{b}(j)}) + R'_{\text{b}(i)} \Delta C_{\text{b}(j)} \right) + \Delta D_{\text{b}(i)}$$
  
= 0.69  $\left( R'_{\text{in}(i)} \Delta C_{\text{b}(i)} + R'_{\text{b}(i)} \Delta C_{\text{b}(j)} \right) + \Delta D_{\text{b}(i)} + \delta_{i},$  (12)

where  $\delta_i = 0.69 \Delta R_{b(i)} (C'_{l(i)} + \Delta C_{b(i+1)} + \Delta C_{b(j)}).$ 

The variation of  $\Delta C_{\rm b}$  is relatively low as compared with the nominal  $C_{\rm b}$  ( $\sigma/\mu < 3\%$  for both D2D and WID variations as reported in Table II). The observed delay variation of buffers in other works is also typically much lower than the nominal value (e.g.,  $\sigma/\mu \leq 5\%$  for both D2D and WID variations as reported in Bowman et al. [2009] ).  $\delta_i$ can, therefore, be approximated linearly using a first-order Taylor expansion around zero [Chang and Sapatnekar 2005].

$$\delta_{i} \approx \left[\frac{\partial \delta_{i}}{\partial \Delta R_{b(i)}}\right]_{0} \Delta R_{b(i)} + \left[\frac{\partial \delta_{i}}{\partial \Delta C_{b(i+1)}}\right]_{0} \Delta C_{b(i+1)} + \left[\frac{\partial \delta_{i}}{\partial \Delta C_{b(j)}}\right]_{0} \Delta C_{b(j)}$$

$$= 0.69C'_{l(i)} \Delta R_{b(i)}$$
(13)

As reported in Table II and discussed in Bowman et al. [2009] and Garg and Marculescu [2009a, 2009b], the  $\sigma/\mu$  of the transistor characteristics is typically considered  $\leq 5\%$ . The  $3\sigma$  variation is smaller than 15% of the nominal value for  $R_{\rm b}$  and 10% for  $\Delta C_{\rm b}$ . Since  $\Delta C_{\rm b}$  and  $\Delta R_{\rm b}$  are modeled by Gaussian distributions, for more than

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

99.7% buffers,  $\Delta C_b \Delta R_b$  is lower than 1.5%  $C_b R_b$ . Moreover, from the nominal value and standard deviation of  $C_b$ ,  $R_b$ , and  $D_b$  reported in Table II,  $0.69 \Delta C_b \Delta R_b$  and  $0.69 C_b R_b$  are much lower than  $\Delta D_b$  and  $D_b$ , respectively. Consequently, approximating  $\delta_i$  with (13) does not introduce significant loss of accuracy.

As mentioned previously,  $\Delta R_{b(i)}$ ,  $\Delta C_{b(i)}$ , and  $\Delta D_{b(i)}$  are approximated as Gaussian distributions and can be assumed to be fully correlated. According to (12) and (13),  $\Delta d_i$  can be approximated as a Gaussian distribution.

$$\Delta d_i \sim \mathcal{N} \left( 0, \sigma_{d_i^{\text{D2D}}}^2 + \sigma_{d_i^{\text{WID}}}^2 \right) \tag{14}$$

$$\sigma_{d_{i}^{\text{D2D}}}^{2} = \begin{cases} (\sigma_{1} + \sigma_{2} + \sigma_{3})^{2} + \sigma_{4}^{2} & \text{if buffers } i \text{ and } j \text{ are in different planes} \\ (\sigma_{1} + \sigma_{2} + \sigma_{3} + \sigma_{4})^{2} & \text{if buffers } i \text{ and } j \text{ are in the same plane} \end{cases}$$
(15)

$$\sigma_{d_i^{\text{WID}}}^2 = (\sigma_5 + \sigma_6 + \sigma_7)^2 + \sigma_8^2 + 2\text{corr}(i, j)(\sigma_5 + \sigma_6 + \sigma_7)\sigma_8$$
(16)

$$\sigma_1 = 0.69 R'_{\mathrm{in}(i)} \sigma_{C_{\mathrm{b}(i)}^{\mathrm{D2D}}}, \quad \sigma_2 = 0.69 C'_{\mathrm{l}(i)} \sigma_{R_{\mathrm{b}(i)}^{\mathrm{D2D}}}, \quad \sigma_3 = \sigma_{D_{\mathrm{b}(i)}^{\mathrm{D2D}}}, \quad \sigma_4 = 0.69 R'_{\mathrm{b}(i)} \sigma_{C_{\mathrm{b}(j)}^{\mathrm{D2D}}}, \quad \sigma_5 = 0.69 R'_{\mathrm{in}(i)} \sigma_{C_{\mathrm{b}(i)}^{\mathrm{wD}}}, \quad \sigma_6 = 0.69 C'_{\mathrm{l}(i)} \sigma_{R_{\mathrm{b}(i)}^{\mathrm{wD}}}, \quad \sigma_7 = \sigma_{D_{\mathrm{b}(i)}^{\mathrm{wD}}}, \quad \sigma_8 = 0.69 R'_{\mathrm{b}(i)} \sigma_{C_{\mathrm{b}(i)}^{\mathrm{wD}}}$$

The correlation between buffers i and j is denoted by corr(i, j), the model of which is discussed in Section 4.3.

Consequently, for a 3D clock path to a sink u which includes  $n_u$  clock buffers, the variation of the delay is expressed as the summation of (12) applied to each buffer along the path. The variance of the distribution of a 3D clock path is a Gaussian distribution consisting of the WID and D2D variations of the buffers.

$$\Delta D_u = \sum_{i=1}^{n_u} \Delta d_i \tag{17}$$

$$\Delta D_u \sim \mathcal{N} \left( 0, \sigma_{D_u^{\text{DDD}}}^2 + \sigma_{D_u^{\text{WD}}}^2 \right)$$
(18)

The D2D and WID sources of delay variation along a 3D clock path are, respectively, discussed in the following subsections.

4.2.1. D2D Variation Model for the Delay of 3D Clock Paths. The variation of the delay of 3D clock paths due to the D2D process variations is the sum of the D2D variations of the buffer delay in all the planes

$$\Delta D_u^{\text{D2D}} = \sum_{i=1}^{N_p} \Delta D_{u(j)}^{\text{D2D}},\tag{19}$$

$$\Delta D_{u(j)}^{\text{D2D}} = \sum_{i=1}^{n_{u(j)}} \Delta D_{u(j,i)}^{\text{D2D}},$$
(20)

where  $N_p$  is the number of the planes that the clock tree spans.  $\Delta D_{u(j)}^{\text{D2D}}$  is the variation of the delay of the clock path from the clock source to sink u in plane j. The number of buffers located in plane j along this clock path is denoted by  $n_{u(j)}$ . The variation of the delay related to the  $i^{th}$  buffer in plane j is denoted by  $\Delta D_{u(j,i)}$ .

Since the D2D variations affect the buffers in the same plane equally, according to (14), (15), and (19), the distribution of  $\Delta D_{u(j)}^{\text{D2D}}$  is a Gaussian distribution. The D2D variations affect the buffers in different planes independently and, therefore,  $\Delta D_{u(j)}^{\text{D2D}}$  is independent from  $\Delta D_{u(k)}^{\text{D2D}}$  for any  $j \neq k$ . Consequently, according to (19), the distribution

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

of  $\Delta D_{\mu}^{\text{D2D}}$  is also a Gaussian distribution.

$$\Delta D_u^{\text{D2D}} \sim \mathcal{N}\left(0, \sigma_{\Delta D_u^{\text{D2D}}}^2\right) \tag{21}$$

$$\sigma_{D_{u}^{\text{DDD}}}^{2} = \sum_{j=1}^{N_{p}} \sigma_{D_{u(j)}^{\text{DDD}}}^{2} = \sum_{j=1}^{N_{p}} \left( \sum_{i=1}^{n_{u(j)}} \sigma_{D_{u(j,i)}^{\text{DDD}}} \right)^{2}$$
(22)

4.2.2. WID Variation Model for the Delay of 3D Clock Paths. The delay of a 3D clock path affected by WID variations is the sum of WID variations of all the buffers along this path. Consequently, according to (16), the distribution of  $\Delta D_u^{\text{WID}}$  is also a Gaussian distribution. The resulting variance of the delay of sink u due to WID variations is

$$\Delta D_u^{\text{WID}} \sim \mathcal{N}(0, \sigma_{D^{\text{WID}}}^2), \tag{23}$$

$$\sigma_{D_{u}^{\text{WID}}}^{2} = \sum_{i=1}^{n_{u}} \sigma_{d_{i}^{\text{WID}}}^{2} + 2 \sum_{1 \le i < j \le n_{u}} \operatorname{corr}(i, j) \sigma_{d_{i}^{\text{WID}}} \sigma_{d_{j}^{\text{WID}}},$$
(24)

where  $\operatorname{corr}(i, j)$  is the correlation between the WID variations of buffers *i* and *j*. If buffers *i* and *j* are located in different planes,  $\operatorname{corr}(i, j) = 0$ . The correlation of the impact of WID variations on different buffers within the same plane can be classified as systematic and random. The systematic WID variations typically exhibit a spatial correlation [Bowman et al. 2009; Agarwal et al. 2003; Orshansky et al. 2000; Hashimoto et al. 2005]. For the buffers located within the same plane, two types of correlations for the WID variation are investigated, as presented in Section 4.3.

## 4.3. Correlations between the WID Variations of Buffers within the Same Plane

The correlation between the WID variations of buffers i and j within one plane is described by corr(i, j). Two types of correlations are considered in this article: uncorrelated and multilevel correlation.

4.3.1. Uncorrelated WID Variations. In this case, the WID variation of the buffers within one plane are considered independent from each other. For any pair of buffers i and j, corr(i, j) = 0. Consequently, from (24), the standard deviation of the delay of sink u due to WID variations is

$$\sigma_{D_u^{\text{WID}}}^2 = \sum_{i=1}^{n_u} \sigma_{d_i^{\text{WID}}}^2.$$
(25)

4.3.2. Multilevel Correlation. The adopted spatial correlation model (multilevel correlation) is based on the statistical timing analysis method proposed in Agarwal et al. [2003]. A multilevel quad-tree partitioning is used and the intra-die variations of a device are divided into l levels, as illustrated in Figure 4 [Agarwal et al. 2003]. At the  $l^{\text{th}}$  level, there are  $4^{l-1}$  regions.

An independent variable is assigned to each region to represent a component of the WID variation of a device. The overall WID variation of a buffer k is composed by the sum of these independent components at different levels:  $\Delta L_{\text{WID},k} = \sum_{\text{region } r \text{ intersects } k} \Delta L_{i,r}$ , where  $\Delta L_{i,r}$  is the random variable associated with the quadtree at level i, region (i, r). The distribution of  $\Delta L_{\text{WID},k}$  is captured by the elementary circuit in Figure 2. This distribution is obtained by assigning the same probability distribution to all the random variables associated with a particular level and by dividing the total intra-die variability among the different levels. Consequently, the spatial correlation between devices in the same plane can be modeled. Devices located close to

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 4. Modeling spatial correlations using quad-tree partitioning [Agarwal et al. 2003].

each other are highly correlated, while the devices located in large horizontal distance exhibit low correlation.

The correlation between the WID variations of buffers i and j are described by the sum of the correlations at all the levels

$$\operatorname{corr}(i, j) = \frac{1}{l} \sum_{m=1}^{l} \operatorname{corr}_{m}(i, j),$$
(26)

where  $\operatorname{corr}_m(i, j)$  is the correlation between buffers *i* and *j* at the *m*<sup>th</sup> level. As illustrated in Figure 4, assuming buffers *i* and *j* are located in the zones  $(m, \operatorname{region}_i)$  and  $(m, \operatorname{region}_i)$ , respectively,

$$\operatorname{corr}_{m}(i, j) = \begin{cases} 1, & \text{if } (m, \operatorname{zone}_{i}) = (m, \operatorname{zone}_{j}) \\ 0, & \text{if } (m, \operatorname{zone}_{i}) \neq (m, \operatorname{zone}_{j}) \end{cases}.$$
(27)

The uncorrelated WID variations and the multilevel correlation model are implemented for several 3D clock trees. The resulting model used to describe the skew variation is presented in the following section.

## 4.4. Clock Skew Distribution in 3D Clock Trees

The clock skew between any pair of sinks in a 3D clock tree is the difference of the clock delay between these sinks. For a 3D clock tree with  $n_{\text{sink}}$  sinks distributed in  $N_{\text{p}}$  planes, the nominal value and the variation of clock skew  $s_{u,v}$  between sinks u and v, respectively, are

$$s'_{u,v} = D'_u - D'_v, (28)$$

$$\Delta s_{u,v} = \Delta s_{u,v}^{\text{WID}} + \Delta s_{u,v}^{\text{D2D}} = \Delta D_u^{\text{WID}} - \Delta D_v^{\text{WID}} + \Delta D_u^{\text{D2D}} - \Delta D_v^{\text{D2D}}.$$
 (29)

The mean value of  $\Delta s_{u,v}$  is  $E(\Delta s_{u,v}) = E(\Delta s_{u,v}^{\text{WID}}) = E(\Delta s_{u,v}^{\text{D2D}}) = 0$ .  $\Delta D_u^{\text{WID}} - \Delta D_v^{\text{WID}}$  and  $\Delta D_u^{\text{D2D}} - \Delta D_v^{\text{D2D}}$  are independent from each other. Consequently,  $\Delta s_{u,v}^{\text{D2D}}$  and  $\Delta s_{u,v}^{\text{WID}}$  are discussed separately in the following subsections.

4.4.1. Skew Model of 3D Clock Trees with D2D Variations. The correlation between every two terms in the expression of  $\Delta s_{u,v}^{\text{D2D}}$  can be one or zero (i.e., fully correlated or uncorrelated, respectively). According to (19),  $\Delta s_{u,v}^{\text{D2D}}$  can be written as the sum of the terms in different planes

$$\Delta s_{u,v}^{\text{D2D}} = \sum_{j=1}^{N_{\text{p}}} \Delta s_{(u,v)_{j}}^{\text{D2D}}, \ \Delta s_{(u,v)_{j}}^{\text{D2D}} = \sum_{i=1}^{n_{u(j)}} \Delta D_{u(j,i)}^{\text{D2D}} - \sum_{i=1}^{n_{v(j)}} \Delta D_{v(j,i)}^{\text{D2D}}, \tag{30}$$

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 5. The clock paths to sinks u and v where the paths share  $n_{u,v}$  buffers.

where  $\Delta D_{u(j,i)}^{\text{D2D}}$  is the D2D delay variation related to the *i*<sup>th</sup> buffer in the *j*<sup>th</sup> plane along the clock path ending at sink *u*. The number of buffers in the *j*<sup>th</sup> plane along this path is denoted as  $n_{u(j)}$ .

All the buffers in the same plane are equally affected by the D2D variations, which means that the correlation between each pair of variables in (30) is one. Since  $\Delta D_{u(j,i)}^{D2D}$  and  $\Delta D_{v(j,i)}^{D2D}$  are both modeled as Gaussian distributions,  $\Delta s_{(u,v)_j}^{D2D}$  is also a Gaussian distribution. In (30),  $\forall j_1 \neq j_2 (1 \leq j_1, j_2 \leq N_p)$ ,  $\Delta s_{(u,v)_j}^{D2D}$  is independent from  $\Delta s_{(u,v)_j_2}^{D2D}$ . Consequently,  $\Delta s_{u,v}^{D2D}$  is also described by a Gaussian distribution.

$$\Delta s_{u,v}^{\text{D2D}} \sim \mathcal{N}\left(0, \sigma_{s_{u,v}}^{2}\right) \tag{31}$$

$$\sigma_{s_{u,v}^{\text{D2D}}}^2 = \sum_{j=1}^{N_p} \sigma_{s_{u,v(j)}^{\text{D2D}}}^2 = \sum_{j=1}^{N_p} \left( \sum_{i=1}^{n_{u(j)}} \sigma_{D_{u(j,i)}^{\text{D2D}}} - \sum_{i=1}^{n_{v(j)}} \sigma_{D_{v(j,i)}^{\text{D2D}}} \right)^2$$
(32)

4.4.2. Skew Model of 3D Clock Trees with WID Variations. According to (24), the distribution of  $\Delta s_{u,v}^{\text{WID}}$  is also a Gaussian distribution

$$\Delta s_{u,v}^{\text{WID}} \sim \mathcal{N}(0, \sigma_{s_{u,v}}^{2}), \tag{33}$$

$$\sigma_{s_{u,v}}^{2} = \sum_{i=n_{u,v}+1}^{n_{u}} \sigma_{D_{u(i)}}^{2} + \sum_{j=n_{u,v}+1}^{n_{v}} \sigma_{D_{v(j)}}^{2} + 2 \sum_{\substack{i,j=n_{u,v}+1\\i< j}}^{n_{u}} \operatorname{corr}(i,j) \sigma_{D_{u(i)}}^{\text{WID}} \sigma_{D_{u(j)}}^{\text{WID}} + 2 \sum_{\substack{i,j=n_{u,v}+1\\i< j}}^{n_{v}} \operatorname{corr}(i,j) \sigma_{D_{u(i)}}^{\text{WID}} \sigma_{D_{u(j)}}^{\text{WID}} \sigma_{$$

where  $n_{u,v}$  is the number of the buffers shared by the clock paths ending at sinks u and v, as depicted in Figure 5. Downstream buffer  $n_{u,v}$ , the subpaths to u and v do not share any buffer. The correlation between the variation of buffers is presented in Section 4.3.

According to (29) through (34), the variation of the clock skew  $\Delta s_{u,v}$  between sinks u and v in a 3D clock tree is modeled as a Gaussian distribution.

$$\Delta s_{u,v} \sim \mathcal{N}\left(0, \sigma_{s_{u,v}^{\text{MID}}}^2 + \sigma_{s_{u,v}^{\text{D2D}}}^2\right) \tag{35}$$

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

Table I. Device and Interconnect Parameters of the Investigated Circuit

| Parameter | $W_{ m n}/L_{ m n}$                        | $W_{ m p}/L_{ m p}$  | $V_{\rm dd}$ [V]      | $R_{ m b}$ [ $\Omega$ ]  | $C_{\rm b}$ [fF]                       | $D_{\rm b} ~[{\rm ps}]$ |
|-----------|--------------------------------------------|----------------------|-----------------------|--------------------------|----------------------------------------|-------------------------|
| Value     | 30                                         | 60                   | 1.0                   | 349.0                    | 5.7                                    | 24.8                    |
| Parameter | $r_{\rm int} \left[\Omega/{\rm mm}\right]$ | $c_{ m int}$ [fF/mm] | ø <sub>TSV</sub> [µm] | $l_{\rm TSV}$ [ $\mu$ m] | $R_{\mathrm{TSV}}  [\mathrm{m}\Omega]$ | $C_{\rm TSV}$ [fF]      |
| Value     | 51.2                                       | 230.2                | 2                     | 20                       | 133                                    | 52                      |

| Input Slow    |              | $R_{\rm b}$ [ $\Omega$ ] |                   |              | $C_{\rm b}$ [fF]  |                   |              | D <sub>b</sub> [ps] |                   |
|---------------|--------------|--------------------------|-------------------|--------------|-------------------|-------------------|--------------|---------------------|-------------------|
| Input Siew    | $\mu$        | $\sigma_{ m WID}$        | $\sigma_{ m D2D}$ | $\mu$        | $\sigma_{ m WID}$ | $\sigma_{ m D2D}$ | $\mu$        | $\sigma_{ m WID}$   | $\sigma_{ m D2D}$ |
| 47 [mV/ps]    | 371          | 18.8                     | 15.3              | 4.9          | 0.04              | 0.03              | 19.9         | 1.04                | 0.85              |
|               | $\sigma/\mu$ | 5.1%                     | 4.1%              | $\sigma/\mu$ | 0.8%              | 0.7%              | $\sigma/\mu$ | 5.2%                | 4.3%              |
| 16 [mV/ns]    | 349          | 17.8                     | 14.7              | 5.7          | 0.31              | 0.16              | 24.8         | 1.49                | 1.21              |
| 10 [11 (//p5] | $\sigma/\mu$ | 5.1%                     | 4.2%              | $\sigma/\mu$ | 2.3%              | 2.1%              | $\sigma/\mu$ | 6.0%                | 4.9%              |
| 6 [mV/ps]     | 345          | 16.7                     | 13.7              | 7.2          | 0.08              | 0.06              | 30.1         | 2.19                | 1.79              |
|               | $\sigma/\mu$ | 4.8%                     | 4.0%              | $\sigma/\mu$ | 1.1%              | 0.9%              | $\sigma/\mu$ | 7.3%                | 5.9%              |

Table II. Variations of the Electrical Characteristics of the Buffers

If the maximum tolerant skew variation is  $\Delta S \ge 0$ , the probability that a 3D clock tree satisfies this constraint is

$$P(|s_{u,v}| \le \Delta S) = \int_{-\Delta S - s'_{u,v}}^{\Delta S - s_{u,v}} f_{\Delta s_{u,v}}(t) dt, \qquad (36)$$

$$f_{\Delta s_{u,v}}(t) = \frac{1}{\sqrt{2\pi\sigma_{s_{u,v}}^2}} e^{-t^2/(2\sigma_{s_{u,v}}^2)}.$$
(37)

The model of skew variations is used to analyze the effect of process variations in various 3D clock trees. This model can be extended to include the variations of horizontal interconnects, as analyzed in the appendix. The investigated 3D clock distribution networks and simulation results are presented in the following section.

# 5. SIMULATION RESULTS

Based on the skew variation model presented in Section 4.4, two H-tree-based 3D topologies are investigated highlighting the impact of process variations on the clock skew in 3D circuits. The accuracy of the variation model is discussed in Section 5.1. The investigated global 3D H-tree topologies are described in Section 5.2. Simulation results and a comparison between these 3D clock trees are presented in Sections 5.3 and 5.4.

#### 5.1. Accuracy of the Clock Skew Variation Model for 3D Clock Trees

The skew variation model is compared with Monte Carlo simulations in this section. The structure used for this purpose is an H-tree clock distribution network. This H-tree is placed in a circuit with total area  $10 \text{ mm} \times 10 \text{ mm}$ .

The circuit is assumed to be implemented at a 45nm CMOS technology. The parameters of the transistors and the interconnects are extracted from the PTM 45 nm CMOS and global interconnect models [NIMO ASU 2008] and the ITRS reports [ITRS 2009]. The clock buffers consist of two inverters connected in series. The circuit parameters used in the following sections are listed in Table I. The ratio of the width to the channel length is denoted by  $W_n/L_n$  and  $W_p/L_p$  for NMOS and PMOS, respectively. The interconnect resistance and capacitance per unit length are denoted by  $r_{int}$  and  $c_{int}$ , respectively. The physical and electrical characteristics of TSVs are also listed in Table I and are based on the data reported in Katti et al. [2010] and Savidis and Friedman [2008]. The diameter and length of the TSVs are notated as  $\phi_{TSV}$  and  $l_{TSV}$ , respectively.

The variation of the effective channel length of transistors,  $L_{\text{eff}}$ , is considered in this article, which has been identified as the most significant component of device variations

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 6. A single-via 3D clock H-tree where different views (a) 2D and (b) 3D are illustrated.

[Nassif 2000; Bowman et al. 2002, 2009; Agarwal et al. 2003]. Note that the effect of other sources of process variations can also be determined by the circuit illustrated in Figure 2 and described with the proposed model. The corresponding nominal  $L_{\rm eff}$ , D2D variation  $(3\sigma_{L_{\rm eff}}^{\rm D2D})$ , and WID variation  $(3\sigma_{L_{\rm eff}}^{\rm WID})$  are 27 nm, 2.2 nm, and 2.7 nm, respectively [ITRS 2009]. Cadence Spectre is used for the Monte Carlo simulations [Cadence Design Systems, Inc. 2008]. The resulting variations of  $R_{\rm b}$ ,  $C_{\rm b}$ , and  $D_{\rm b}$  are listed in Table II, which are obtained as discussed in Section 4.1 with the input slew rates being 47 mV/ps, 16 mV/ps, and 6 mV/ps. The mean value and standard deviation are denoted by  $\mu$  and  $\sigma$ , respectively. The ratio  $\sigma/\mu$  usually indicates the importance of variations [Bowman et al. 2002]. The Monte Carlo simulation is repeated 1000 times. As reported in Table II, the  $\sigma$  of  $R_{\rm b}$ ,  $C_{\rm b}$ , and  $D_{\rm b}$  all depend on the input slew rate. To consider the slew rate and the load is, therefore, necessary while evaluating the variations of the buffer delay. The method presented in Section 4.1 is applicable to accurately consider this dependence.

Two H-tree topologies are used to verify the accuracy of the skew variation model. The first topology (multi-via) is illustrated in Figure 1(b). The second topology (single-via) is illustrated in Figure 6. Both these topologies are introduced in the following section. The H-tree spans four planes. The clock source is located at the center of the first plane. There are 128 clock sinks in total, 32 in each plane. Clock buffers, which are marked with  $\triangle$ , are inserted following the technique described in Téllez and Sarrafzadeh [1997]. The clock frequency is 1 GHz and the constraint on the input slew rate is 16 mV/ps (the transition time is 5% of the clock period). The numbers of the inserted buffers in multi-via and single-via topologies are 168 and 540, respectively. Only few buffers are illustrated in Figures 1 and 6 for improved readability. The wire segments between two buffers are simulated using a standard  $\pi$  model.

As shown in Figure 1(b), sinks 1, 2, and 3 are located in the first plane. Sinks 4 and 5 are located in the topmost plane. Skews  $s_{1,2}$ ,  $s_{1,3}$ ,  $s_{1,4}$ , and  $s_{1,5}$  are considered to demonstrate the accuracy of the adopted model. The difference between the resulting standard deviation  $\sigma$  of Spectre simulation and the skew variation model is reported in Table III. The skew variations with uncorrelated (independent) WID variations are reported below "Indep". The variations modeled by the multilevel spatial correlation are reported below "Multi-Level", where five levels are assumed (l = 5). The error of the skew variation model is below 6% between any pair of sinks in the investigated clock tree. As listed in Table III, the distribution of the clock skew determined by the skew variation model exhibits reasonable accuracy as compared with Monte Carlo simulations.

|                         |              |                    |                    |                    |                    |                    | -                  |                    |                    |          |
|-------------------------|--------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|----------|
| Cor                     | relation     |                    | In                 | dep.               |                    |                    | Mult               | i-Level            |                    | CPU time |
| Skew                    | variation    | $\sigma_{s_{1,2}}$ | $\sigma_{s_{1,3}}$ | $\sigma_{s_{1,4}}$ | $\sigma_{s_{1,5}}$ | $\sigma_{s_{1,2}}$ | $\sigma_{s_{1,3}}$ | $\sigma_{s_{1,4}}$ | $\sigma_{s_{1,5}}$ |          |
|                         | Model [ps]   | 6.96               | 14.76              | 7.41               | 14.97              | 7.00               | 33.55              | 7.11               | 32.92              | 26 sec.  |
| Multi-via<br>Single-via | Spectre [ps] | 7.27               | 15.71              | 7.56               | 15.97              | 7.40               | 34.90              | 7.31               | 35.10              | 48 min.  |
|                         | Error [%]    | -4                 | -6                 | -2                 | -6                 | -5                 | -4                 | -3                 | -6                 | -        |
|                         | Model [ps]   | 3.91               | 13.59              | 51.33              | 51.33              | 2.43               | 29.01              | 71.14              | 69.63              | 39 sec.  |
| Single-via              | Spectre [ps] | 3.84               | 13.13              | 50.13              | 50.09              | 2.34               | 28.1               | 68.19              | 67.9               | 55 min.  |
|                         | Error [%]    | 2                  | 3                  | $^{2}$             | $^{2}$             | 4                  | 3                  | 4                  | 3                  | -        |

Table III.  $\sigma$  of Skew Variation of the 3D Circuits Shown in Figures 1 and 6

The computational time for the proposed model and the Spice-based Monte Carlo simulations is also listed in Table III. The runtime is reported as the average time for independent and spatially correlated WID variations. To decrease the runtime, only the clock paths related to the reported  $\sigma$  are simulated. Although only a part of the 3D clock trees is simulated, using the proposed model, the runtime is reduced by  $85 \times$ . The efficiency of the variability-aware design of 3D clock distribution networks can be significantly improved by the proposed model, especially when different topologies of clock trees are compared and iterative design modifications are required. The comparison of the runtime for the entire 3D clock distribution networks is discussed in Section 6.

## 5.2. Typical 3D Clock H-Tree Topologies

The skew variation for two types of 3D global H-trees is investigated in the following subsections. Both of these networks have been utilized in the design of a prototype 3D circuit [Pavlidis et al. 2008] and other case studies of multiplane circuits [Mondal et al. 2007]. H-trees are typically used to globally deliver the clock signal to large-scale circuits [Friedman 2001]. The regularity of these topologies facilitates the investigation of WID and D2D variations as compared to synthesized clock trees which exhibit significantly different wire length and TSV density characteristics. In other words, the main objective is to demonstrate the physical behavior of a 3D system under these variations and the related trade-offs, rather than the decrease in the wire length of a clock tree which is produced by an efficient 3D clock trees [Zhao and Lim 2010], and rings can be used to distribute the clock signal in the vicinity of each leaf of the H-tree. Although H-trees are considered, the analysis also applies to other global clock architectures, such as X-trees.

The first topology (multi-via topology) is shown in Figure 1, where the clock source and buffers (except for the buffers at the last level) of a 3D H-tree are located in a single physical plane (e.g., the first plane). In this topology, the clock signal is propagated to the sinks in other planes by multiple TSVs. The vertical lines at each leaf correspond to a cluster of TSVs.

The second clock tree topology (single-via topology) is illustrated in Figure 6, where a 2D H-tree is replicated in each plane. The clock signal is propagated by a single via (or a group of TSVs to prevent TSV failures and to lower the resistance of this vertical path) connecting the clock source to each H-tree replica.

## 5.3. Skew Variations between the Clock Sinks in the Same Plane

In 3D H-trees and for intra-plane paths, the number, size, and location of the buffers along these paths are equal for a single plane, since the multi-via and single-via topologies are both symmetric topologies (at least within the x and y directions). The D2D variations in each plane, therefore, affect these clock paths equally. Consequently, according to (30), for both the multi-via and single-via topologies, only WID variations affect the variation of skew between sinks located in the same plane. For both the

single-via and multi-via tree topologies, the variation of skew between the buffers located in the same plane exhibits the same behavior as in 2D circuits.

For the considered topologies, the clock buffers are inserted by uniform buffer insertion techniques under the same constraints of skew and slew rate [Téllez and Sarrafzadeh 1997].  $R_{\rm in}$  and  $C_{\rm l}$  of each buffer, therefore, are approximately equal. For a 3D clock tree, as described by (34), if  $R_{\rm in}$  and  $C_{\rm l}$  of each buffer remain unchanged, the  $\sigma_{S_{(u,v)}^{\rm WD}}$ between two sinks decreases as the number of the nonshared clock buffers (e.g., the buffers after the  $n_{u,v}$  buffers in Figure 5) decreases. For a 3D IC with total area A, the side length of each plane is  $L \propto \sqrt{\frac{A}{N_{\rm p}}}$ . Consequently, the number of buffers in one plane decreases as L decreases for an increasing number of planes forming the 3D circuit.

For the single-via topology, all the clock sinks within a plane are connected to the clock source by the same TSV. The length of this TSV and the increasing number of buffers vertically connected to this TSV do not affect the intra-plane skew. Consequently, based on the proposed model and the preceding analysis, it is concluded that the following holds.

*Conjecture* 5.1. For the single-via topology, the distribution of the skew between the clock sinks in the same plane becomes narrower as the number of planes increases.

For the multi-via topology, however, the clock sinks in the same plane connect to different TSVs. As the number of planes increases, both the number of buffers connecting to a TSV and the length of the TSVs increase. The input slew rate decreases since an increasing load is driven. As reported in Table II, the resulting delay variation of the buffers after the TSVs increases. Moreover, the load of the buffers driving the TSVs increases. These changes of the topological characteristics result in the increase of  $\sigma_{d(i)}$ , as described by (16). This increase, consequently, counteracts and can surmount the decrease in variations due to the decreasing number of clock buffers along the clock paths.

*Conjecture* 5.2. For the multi-via topology, the distribution of the skew between the clock sinks in the same plane changes nonmonotonically as the number of planes increases.

*Example 1.* Simulation results exhibiting the different behavior of the single-via and multi-via topologies are shown in Figure 7. In this example, a global clock tree with 256 sinks is placed in a 3D IC with increasing number of planes. A sink can be either a subtree, a local clock mesh, or a cluster of registers (or a buffer driving any of these structures). The total area of the circuit is 100 mm<sup>2</sup>. The impact of process variations on the skew between pairs of sinks within the same plane is demonstrated by skews  $s_{1,2}$  and  $s_{1,3}$  as illustrated in Figure 1(b). The results of the simulations with the independent and multilevel correlated WID variations (l = 5) are illustrated in Figures 7(a) and 7(b), respectively.

The buffers inserted into the 3D clock trees are reported in Table IV. The number of buffers inserted within one plane in the single-via topology is lower than the multi-via topology, which introduces a lower skew variation than the multi-via topology. The total number of buffers in a single-via topology is, however, much higher than the multi-via topology.

*Conjecture* 5.1. The variance of the skew between two intra-plane sinks of the singlevia 3D H-tree is smaller than the corresponding variance of the skew in a multi-via 3D H-tree.

The numbers of buffers in each plane for both topologies decrease as the number of planes increases. The increasing number of buffers connected to TSVs (due to the

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 7.  $\sigma$  of skew within the first plane for increasing number of planes, where the WID variations are considered (a) independent and (b) multilevel correlated.

Table IV. The Number of Buffers Inserted into the 3D Clock Trees

| # of planes            | 1   | 2   | 3    | 4   | 5   | 6    | 7    | 8   | 9   | 10   |
|------------------------|-----|-----|------|-----|-----|------|------|-----|-----|------|
| multi-via              | 981 | 588 | 558  | 296 | 264 | 242  | 234  | 138 | 134 | 134  |
| single-via (per plane) | 981 | 460 | 430  | 231 | 199 | 177  | 169  | 105 | 101 | 101  |
| single-via (total)     | 981 | 920 | 1290 | 924 | 995 | 1062 | 1183 | 840 | 909 | 1010 |

greater number of planes) increases  $\sigma_{s_{u,v}}$  in the multi-via topology but does not affect  $\sigma_{s_{u,v}}$  in the single-via topology. Consequently, the decrease in the number of buffers leads to a reduction in skew variation within the same plane for the single-via topology, as shown by the  $\blacklozenge$  and  $\blacksquare$  curves in Figure 7. Nevertheless, for the multi-via topology, as shown by the  $\triangle$  and  $\times$  curves,  $\sigma_{s_{u,v}}$  within the same plane changes nonmonotonically with the number of planes. For the sinks with short distance, the  $\sigma_{s_{1,2}}$  even increases with the number of planes. As a result, for multi-via 3D H-trees, simply increasing the number of planes does not necessarily improve the clock skew. By employing the proposed skew variation model, the number of planes that produces the lowest skew variation is determined.

The maximum supported clock frequency  $f_{\text{max}}$  of a circuit is constrained by skew [Friedman 2001]. Although  $f_{\text{max}}$  is typically determined by the critical path delay, the skew criterion is used here to offer a tangible explanation of the effect of process variations on the performance of circuits. The maximum allowed skew is assumed to be  $10\% \frac{1}{f_{\text{max}}}$  for the simulated 3D clock trees. To achieve a timing yield higher than 99%,  $f_{\text{max}}$  should be smaller than  $10\% \frac{1}{3\sigma_{s_{u,v}}}$ , where  $3\sigma_{s_{u,v}}$  is the skew at the  $3\sigma$  point from the mean value. Assuming that the clock frequency is limited by the largest skew variation, the  $f_{\text{max}}$  corresponding to  $\sigma_{s_{1,3}}$  shown in Figure 7, is illustrated in Figure 8. The results

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 8. The maximum supported clock frequency determined by the skew variation within one plane.

with independent WID variations are illustrated by "multi-via (I)" and "single-via (I)". The results with multilevel WID correlations are illustrated by "multi-via (II)" and "single-via (II)".

As illustrated in Figure 8, the single-via topology can produce an up to 53% and 64% higher clock frequency for independent and multilevel correlated WID variations, respectively, as compared to the multi-via topology. This improvement increases as the number of planes increases. The  $f_{\rm max}$  in the multi-via topology changes nonmonotonically with the number of planes.

*Guideline* 1. In a 3D circuit, if the data-related sinks are located mostly within the same plane, the single-via topology is more efficient in reducing the skew variations and can support a higher clock frequency.

The  $f_{\text{max}}$  produced by a 3D tree with independent WID variations, not surprisingly, is higher than a 3D tree with multilevel correlated WID variations. According to (34), this situation is due to the larger spatial correlation between devices which introduces higher skew variations into a 3D clock tree.

#### 5.4. Skew Variations between the Clock Sinks in Different Planes

As described by (30) through (32), when the investigated clock sinks are located in different planes, the corresponding clock skew is also affected by D2D variations. As a result, the skew variations between inter-plane sinks vary from the intra-plane skew variation. To demonstrate this difference, the  $\sigma_{s_{u,v}}$  between each pair of sinks of a multivia 3D clock tree is illustrated in Figure 9. Since  $\sigma_{s_{u,v}} = \sigma_{s_{v,u}}$ , only half of the skew array is shown in Figure 9 for clarity.

*Example 2.* In this example, a 3D clock tree spanning eight planes is implemented using the single- and multi-via topologies. The resulting skew of the multi-via topology is illustrated in Figure 9. The electrical parameters of the wires are given in Section 5.1. There are 32 sinks in each plane and 256 sinks in total. Sinks 1 to 32 are located in the first plane and sinks 33 to 64 are located in the second plane, etc. As an example, consider the  $\sigma$  of the skew between sinks 3 and 4. This standard deviation is determined by the z value of the point x = 3 and y = 4. From these figures, the  $\sigma_s$  of inter-plane skew is larger than the  $\sigma_s$  of intra-plane skew. The change of skew variation between inter-plane sinks with the number of planes is illustrated in Figure 10.

For the single-via topology, the skew variation for the interplane pairs of sinks remains approximately the same irrespective of the planes to which the sinks belong. This behavior is because the paths to sinks located in different planes do not share any common segments (see Figure 6).

When the number of planes is greater than two, the skew variation decreases as the number of planes increases, as also shown in Figure 10. Since the paths lay in different

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 9. The  $\sigma$  of skew between each pair of clock sinks under the multi-via topology where (a) is the 3D view and (b) is the top view.



Fig. 10.  $\sigma$  of skew between the sinks in the first and the topmost plane for the single-via and multi-via topologies. The locations of the pairs of sinks defining  $s_{1,4}$  and  $s_{1,5}$  are shown in Figure 1(b). (a) is based on independent WID variations and (b) is based on multilevel correlated WID variations.

|    | # of planes      | 1 (2-D) | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   |
|----|------------------|---------|------|------|------|------|------|------|------|------|------|
|    | multi-via [GHz]  | 1.62    | 1.93 | 2.07 | 2.16 | 2.15 | 2.13 | 2.11 | 2.09 | 2.00 | 1.92 |
| I  | single-via [GHz] | 1.62    | 0.44 | 0.52 | 0.60 | 0.64 | 0.69 | 0.75 | 0.82 | 0.84 | 0.86 |
|    | multi / single   | 1       | 4.4  | 4.0  | 3.6  | 3.4  | 3.1  | 2.8  | 2.5  | 2.4  | 2.2  |
|    | multi-via [GHz]  | 0.49    | 0.74 | 0.98 | 0.96 | 1.05 | 1.14 | 1.19 | 1.33 | 1.50 | 1.45 |
| II | single-via [GHz] | 0.49    | 0.34 | 0.40 | 0.44 | 0.47 | 0.51 | 0.56 | 0.61 | 0.62 | 0.63 |
|    | multi / single   | 1.0     | 2.2  | 2.5  | 2.2  | 2.2  | 2.2  | 2.1  | 2.2  | 2.4  | 2.3  |

Table V. The Maximum Clock Frequency Supported by Multi-via and Single-Via Topologies

planes, according to (32), the effect of D2D variations on the 3D single-via topology is much larger than in planar H-trees.

The skew variation under the multi-via topology varies significantly from the singlevia topology, as illustrated in Figure 10. The skew variation between planes significantly depends on the location of the related sinks. According to (30)–(32), the impact of D2D variations increases as the number of buffers located in different planes increases. For the multi-via topology, all clock paths preceding the TSVs are in the first plane. The effect of D2D variations on the multi-via topology, therefore, is much smaller than the single-via topology, as shown in Figure 10. Nevertheless, as shown in Figure 10, the skew variation of the multi-via topology changes nonmonotonically with the number of planes. The reason is similar to the skew variation within the same plane as discussed previously.

*Guideline* 2. In a 3D circuit, if the data-related sinks are widely distributed in several planes, the multi-via topology is more efficient in reducing the skew variation and supports a higher clock frequency.

Assuming the  $f_{\text{max}}$  of a 3D IC is limited by the inter-plane skew variation, the maximum operating frequency supported by the single-via and multi-via topologies is reported in Table V. The results based on independent and multilevel correlated WID variations are reported after "I" and "II", respectively. As listed in this table, for a circuit with a different number of planes and different clock tree topology, the  $f_{\text{max}}$  can vary from 440 MHz to 2.16 GHz. The corresponding largest  $\sigma_{s_{u,v}}$  varies from 77 ps to 15 ps. For the same number of planes, the  $f_{\text{max}}$  of the multi-via topology is up to 4.4 times higher than the single-via topology. As reported by "I", the single-via topology produces lower  $f_{\text{max}}$  than a planar tree when the WID variation is completely random. The single-via topology, however, can produce a higher  $f_{\text{max}}$  as compared to a 2D tree when the systematic WID variation is considered as reported by "II".

3D integration is considered to significantly reduce the interconnect delay and enhance the clock frequency of circuits [Pavlidis and Friedman 2009b; Joyner et al. 2004]. This enhancement, however, is shown not to grow directly proportionally with the number of planes where process variations (both WID and D2D) are considered in the design process.

Results indicate that the performance improvement in a 3D clock network depends significantly on the distribution of the sinks (and consequently the clock paths) among the planes. As reported in Table V, when the data-related sinks are distributed at different planes, the skew of single-via 3D clock trees is affected more by process variations than the corresponding 2D clock trees. This behavior is consistent with the conclusions made in Garg and Marculescu [2009a] and Akopyan et al. [2008]. The effect of process variations on 3D clock distribution networks can be mitigated by employing a multi-via topology in this case. This topology can better exploit the traits of vertical integration (i.e., shorter wires) to significantly increase the operating frequency.



Fig. 11. An example of the multigroup 3D clock H-tree topology.

## 6. MULTIGROUP 3D CLOCK TREE TOPOLOGY

As stated in Guidelines 5.3 and 5.4, the single-via 3D clock H-tree topology is more efficient in reducing the skew variation within a single plane, while the multi-via topology is more efficient in reducing the skew variation between planes. To exploit these advantages, a hybrid H-tree topology (multigroup topology) combining the features of these topologies is proposed in this section.

The new multigroup topology is illustrated in Figure 11. The key idea is that the  $N_{\rm p}$  planes forming a 3D circuit are divided in *G* groups of "data-related planes". The data-related planes are the physical planes containing data-related registers. The *i*<sup>th</sup> group of data-related planes consists of  $h_i (\leq N_{\rm p})$  physical planes. The clock signal is distributed within these  $h_i$  planes by a multi-via topology.

An example of this H-tree topology is illustrated in Figure 11. This H-tree includes two groups of data-related planes (G = 2). Each group spans three ( $h_1 = 3$ ) and two physical planes ( $h_2 = 2$ ), respectively. The buffers contained in each group of datarelated planes are denoted by  $\triangle$  and  $\circ$ . The TSVs connecting these buffers are called "sink-TSVs". The roots of the multi-via topologies are connected with a "root-TSV" (or a cluster of TSVs) as illustrated by the segment at the center of the planes.

For a 3D IC, if all the data-related clock sinks cannot be located within the same plane but in adjacent planes, the multigroup topology is more efficient in reducing the skew variation than the aforementioned topologies. Compared with the single-via topology, using *G* instead of  $N_p$  H-trees, the multigroup topology significantly reduces the skew variation between data-related planes. Compared with the multi-via topology, the buffers connected to the sink-TSVs for the multigroup topology are fewer than the buffers connected to the TSVs of the multi-via topology. Therefore, both the skew variation within a single plane and the skew variation between data-related planes are reduced.

*Example 3.* A 3D circuit with eight planes is simulated for the three topologies to investigate the efficiency of the multigroup 3D clock tree. The physical and electrical characteristics of the circuit are reported in Section 5.1. Two variants of the multigroup topology are simulated, including two groups (hybrid 2, G = 2,  $h_i = 4$ ) and four groups (hybrid 4, G = 4,  $h_i = 2$ ) of data-related planes, respectively. Simulation results are shown in Figure 12.

In Figure 12, skews  $s_{1,2}$ ,  $s_{1,3}$  (defined in Figure 1(b)),  $s_{1,6}$ , and  $s_{1,7}$  (defined in Figure 11) are depicted showing the skew variation between the nearest and the farthest sinks. The results based on independent and multilevel correlated WID variations are denoted by (I) and (II), respectively. The  $\sigma_{s_{1,2}}$  and  $\sigma_{s_{1,3}}$  produced by the multigroup topology are lower than the multi-via topology and decrease as the number of sub-H-trees increases. For the topology with four sub-H-trees (hybrid\_4),  $s_{1,2}(I)$ ,  $s_{1,3}(I)$ ,  $s_{1,2}(II)$ , and  $s_{1,3}(II)$  are

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 12.  $\sigma$  of skew for three 3D clock tree topologies. (a) Intra-plane skews  $s_{1,2}$  and  $s_{1,3}$ . (b) Inter-plane skews  $s_{1,6}$  and  $s_{1,7}$  within a group of data-related planes.

|                | 1,1          |            |           |          |          |
|----------------|--------------|------------|-----------|----------|----------|
| To             | opology      | single-via | multi-via | hybrid_2 | hybrid_4 |
|                | Model [ps]   | 40.6       | 15.9      | 13.0     | 11.9     |
| $\sigma_{1.7}$ | Spectre [ps] | 41.6       | 16.8      | 13.7     | 12.3     |
|                | Error [%]    | -2         | -5        | -5       | -3       |
|                | Model [min.] | 29         | 28        | 25       | 27       |
| CPU time       | Spectre [h.] | 500        | 173       | 221      | 265      |
|                | Spec./Model  | 1034       | 364       | 535      | 582      |

Table VI.  $\sigma_{s_{1,7}}$  and Computational Time of Three 3D Clock Tree Topologies

reduced by 55%, 23%, 44%, and 10% respectively, as compared with the multi-via topology.

Although the  $\sigma_{s_{1,3}}$  within the same plane of the multigroup topology is still greater (4% for hybrid\_4) than the single-via topology, the inter-plane skews  $\sigma_{s_{1,6}}$  and  $\sigma_{s_{1,7}}$  within a group of data-related planes of the multigroup topology are significantly reduced as shown in Figure 12(b). This reduction is also greater than the multi-via topology. The number of sub-H-trees within a multigroup 3D topology is determined by the distribution of the data-related sinks.

*Guideline* 3. When the data-related sinks are located in adjacent planes of a 3D circuit, the multigroup 3D clock tree topology is more efficient in reducing the skew variation than both the single- and multi-via topologies.

Furthermore, as illustrated in Figure 12, for the sinks with a short horizontal distance in a multigroup topology, the systematic WID variations (denoted by (II)) introduce lower skew variation than the random WID variations (denoted by (I)), for example,  $s_{1,2}$  and  $s_{1,6}$ . For the sinks with a large horizontal distance (e.g.,  $s_{1,3}$  and  $s_{1,7}$ ), the skew variation produced by the systematic WID variations is higher.

The results illustrated in Figure 12 are compared with Monte Carlo simulations. The setup of the Monte Carlo simulation environment is listed in Section 5.1. The  $\sigma$  of  $s_{1,6}$  and  $s_{1,7}$  within a group of data-related planes is reported for the independent WID variations in Table VI. As reported in this table, the preceding analysis on the multigroup 3D H-trees is consistent with the results of Monte Carlo simulations. The error of the skew variation model is typically smaller than 5% as compared with the Monte Carlo simulations.

The computational time is also listed for different topologies in Table VI. Since this runtime is for the entire 3D clock trees, the computational time is significantly higher than that reported in Table III for both the proposed model and Monte Carlo simulations. As the complexity of the 3D clock tree increases, the time savings by the proposed model significantly increases, up to  $1000 \times$ . Consequently, the efficiency of the variability-aware design of 3D clock distribution networks can be considerably improved by estimating the skew variation with the proposed model.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.



Fig. 13. An example of combining clock trees and grids, where (a) is the topology of a tree-grid structure [Restle et al. 2002] and (b) is the investigated global grid.

Table VII. Monte Carlo Results of Different Clock Distribution Networks

| # of planes  | $\mu_{\rm max} \ [{\rm ps}]$                           |        |      |       | $\sigma_{\rm max}$ [ps] |       | [mW]   |       |        |
|--------------|--------------------------------------------------------|--------|------|-------|-------------------------|-------|--------|-------|--------|
| # of plattes | $\begin{array}{c c c c c c c c c c c c c c c c c c c $ | single | grid | multi | single                  |       |        |       |        |
| 1            | 9.77                                                   | 0.     | 00   | 12.00 | 21                      | .53   | 189.60 | 125   | 5.90   |
| 2            | 8.83                                                   | 0.04   | 0.04 | 11.25 | 18.30                   | 76.10 | 105.10 | 72.01 | 110.00 |
| 4            | 4.15                                                   | 0.15   | 0.18 | 6.15  | 16.22                   | 56.65 | 67.86  | 51.70 | 103.10 |
| 8            | 3.10                                                   | 0.34   | 0.38 | 6.39  | 16.82                   | 41.58 | 47.68  | 39.84 | 89.71  |

# 7. SKEW VARIATION IN CLOCK GRIDS

To compare the skew variation in 3D clock trees with other clock distribution topologies, the skew of clock grids is discussed in this section. A typical hybrid structure of clock trees and grids is simulated and compared with the 3D clock trees in terms of process-induced clock skew.

A pregrid clock distribution network is required to drive a grid structure. A combination of clock trees and grids (tree-grid structure) can meet this requirement [Restle et al. 2002], as illustrated in Figure 13(a). The sinks of the clock tree are connected to a global clock grid. Buffers are inserted into the clock tree to meet the constraint on the slew rate.

A clock tree-grid structure with 256 sinks is compared with the previous 3D clock trees in terms of skew variations. The 2D global clock grid has 256 nodes and the area of each cell is 0.58 mm × 0.58 mm, as illustrated in Figure 13(b). When the grid is embedded to a 3D IC, the area of the grid is shrunk proportionally to the area of each plane. The grid is located in the first plane and the clock signal is propagated to other planes through TSVs at each node similar to the multi-via 3D tree. The electrical and physical characteristics of the circuit are reported in Section 5.3. The Monte Carlo simulation results of the tree-grid are reported in Table VII, where the independent WID variations are considered. The results correspond to the circuits with one, two, four, and eight planes, respectively. The maximum mean skew and the standard deviation within each clock distribution network are denoted by  $\mu_{max}$  and  $\sigma_{max}$ , respectively. The average power for each clock distribution network for the nominal device parameters is also reported at the clock frequency of 1 GHz.

As shown in Table VII, the tree-grid structure produces the lowest skew variation  $(\sigma_{\text{max}})$  compared with the other topologies. Nevertheless, the tree-grid produces the largest mean skew which is significantly higher than the 3D H-trees. This situation is due to the considerable delay and capacitance of the wires in the global clock grid. Furthermore, the average power consumed by the tree-grid is the highest among all the three topologies. Extending the grid to multiple planes can, however, improve the power

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

consumption and mean skew. In conclusion, clock grids reduce the skew variations compared with clock trees but increase the mean skew and the power consumption. The multi-via 3D H-trees can significantly reduce the power consumption and maintain a sufficiently low mean skew while reducing the skew variation. The single-via 3D H-trees produce the highest power consumption and skew variations due to the large number of buffers, as reported in Table IV.

# 8. CONCLUSIONS

The effect of process variations of devices on the clock skew of 3D ICs is analyzed. An accurate method to estimate the skew variation within 3D clock trees considering spatial intra-die correlations is presented. By applying this method, the skew variation for single- and multi-via 3D clock tree topologies is investigated. Both the process variations of transistors and wires are included in this model.

Results based on a 45nm CMOS technology demonstrate that the proper clock tree topology should be determined according to the distribution of data-related sinks. When data-related sinks are mostly located in the same plane, a single-via 3D H-tree is more efficient in reducing the skew variation, where the skew variation decreases as the number of planes increases. If the data-related sinks are widely distributed in different planes, multi-via 3D H-trees produce considerably smaller skew variation. Based on the features of these topologies, a multigroup 3D clock tree topology is proposed. When the data-related sinks are distributed in adjacent planes, the skew variation is further reduced by the proposed multigroup 3D clock tree.

The clock trees are also compared with a typical clock grid. The simulation results show that a global clock grid can decrease the maximum skew variations but significantly increase the power and the mean skew. Considering that stricter power budgets can apply in 3D systems due to the accentuated thermal issues, carefully designed 3D H-trees can be a preferred solution. Following the proposed design guidelines results in those 3D H-trees with improved skew variations and low power dissipation. The proposed multigroup 3D clock tree exhibits a superior performance combining the advantages of both the multi- and single-via topologies.

#### APPENDIX: Extension of the Proposed Model to Include Variations of Wires

The proposed model can be extended to include the variations of horizontal wires. The extended model is presented in this section. Consider the 3D clock tree shown in Figure 3, where the delay variation of a buffer stage  $\Delta d_{\text{stage}(i)}$  includes the variation of the capacitance  $\Delta C_{\text{int}}$  and resistance  $\Delta R_{\text{int}}$  of the wires.

$$\Delta d_{\text{stage}(i)} = \Delta d_i + 0.69(R'_{b(i)} + \Delta R_{b(i)})\Delta C_{\text{int}} + 0.38(R'_{\text{int}}\Delta C_{\text{int}} + \Delta R_{\text{int}}C'_{\text{int}} + \Delta R_{\text{int}}\Delta C_{\text{int}}) + 0.69(R'_{\text{int}}\Delta C_{b(i+1)} + \Delta R_{\text{int}}C'_{b(i+1)} + \Delta R_{\text{int}}\Delta C_{b(i+1)})$$
(38)

According to the definition of  $\Delta d_i$  in (12), the term  $0.69R'_{\text{int}}\Delta C_{b(i+1)}$  is included in  $\Delta d_{i+1}$ . Consequently,  $\Delta d_{\text{stage}(i)}$  is rewritten as

$$\Delta d_{\text{stage}(i)} = \Delta d_i + 0.69(R'_{\text{b}(i)} + \Delta R_{\text{b}(i)})\Delta C_{\text{int}} + 0.38(R'_{\text{int}}\Delta C_{\text{int}} + \Delta R_{\text{int}}C'_{\text{int}} + \Delta R_{\text{int}}\Delta C_{\text{int}}) + 0.69(\Delta R_{\text{int}}C'_{\text{b}(i+1)} + \Delta R_{\text{int}}\Delta C_{\text{b}(i+1)}) = \Delta d_i + \Delta d_{\text{int}(i)},$$
(39)

where the delay variation due to the wires is denoted by  $\Delta d_{int(i)}$ .

As discussed in Chang and Sapatnekar [2005], since the variation of characteristics of metal wires is relatively low as compared with the nominal value, the variation of the wire delay can be approximated by the first-order Taylor expansion without significant

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

Table VIII. Parameters of Horizontal Interconnects

| Parameters         | W <sub>m</sub> [nm] | t <sub>m</sub> [nm] | t <sub>ILD</sub> [nm] |
|--------------------|---------------------|---------------------|-----------------------|
| Nominal            | 430                 | 1000                | 160                   |
| $3\sigma_{ m D2D}$ | 43                  | 50                  | 12                    |
| $3\sigma_{ m WID}$ | 21.5                | 25                  | 6                     |

Table IX. Skew Variation of the 3D Circuits Considering Wire Variations

| Topology       |                    | mult               | ci-via             |                    |                    | sing               | le-via             |                    |
|----------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
| Skew variation | $\sigma_{s_{1,2}}$ | $\sigma_{s_{1,3}}$ | $\sigma_{s_{1,4}}$ | $\sigma_{s_{1,5}}$ | $\sigma_{s_{1,2}}$ | $\sigma_{s_{1,3}}$ | $\sigma_{s_{1,4}}$ | $\sigma_{s_{1,5}}$ |
| Model [ps]     | 7.01               | 15.09              | 7.45               | 15.3               | 3.99               | 13.94              | 56.46              | 56.46              |
| Spectre [ps]   | 7.19               | 16.44              | 7.64               | 16.55              | 4.00               | 13.77              | 56.38              | 56.3               |
| Error [%]      | -3                 | -8                 | -2                 | -8                 | 0                  | 1                  | 0                  | 0                  |

loss of accuracy. Similar to expression (13),  $\Delta d_{int(i)}$  can be approximated as

$$\Delta d_{\text{int}(i)} \approx \sum_{p_j \in \vec{P}} \left( \left[ \frac{\partial \Delta d_{\text{int}(i)}}{\partial \Delta R_{\text{int}}} \frac{\partial \Delta R_{\text{int}}}{\partial \Delta p_j} \right]_0 \Delta p_j + \left[ \frac{\partial \Delta d_{\text{int}(i)}}{\partial \Delta C_{\text{int}}} \frac{\partial \Delta C_{\text{int}}}{\partial \Delta p_j} \right]_0 \Delta p_j \right), \tag{40}$$

where  $p_j$  is the  $j^{\text{th}}$  parameter of the wire and  $\vec{P}$  is the vector of parameters of wires affected by process variations. For example, consider the variation of the width and the thickness of the metal and the thickness of ILD [Chang and Sapatnekar 2005; Bowman et al. 2009],  $\vec{P} = (W_{\text{m}}, t_{\text{m}}, t_{\text{ILD}})$ . Assuming these parameters are modeled by Gaussian distributions and independent from each other [Chang and Sapatnekar 2005], the distribution of  $\Delta d_{\text{int}(i)}$  can be approximated by a Gaussian distribution. We have

$$\Delta d_{\text{int}(i)} \sim \mathcal{N}(0, \sigma_{d_{\text{int}(i)}}^2), \tag{41}$$

$$\sigma_{d_{\text{int}(i)}}^{2} = \sum_{p_{j}\in\vec{P}} \left( \left[ \frac{\partial \Delta d_{\text{int}(i)}}{\partial \Delta R_{\text{int}}} \frac{\partial \Delta R_{\text{int}}}{\partial \Delta p_{j}} \right]_{0} + \left[ \frac{\partial \Delta d_{\text{int}(i)}}{\partial \Delta C_{\text{int}}} \frac{\partial \Delta C_{\text{int}}}{\partial \Delta p_{j}} \right]_{0} \right)^{2} \sigma_{p_{j}}^{2}, \tag{42}$$

$$R_{\rm int} = \frac{\rho l}{t_{\rm m} W_{\rm m}},\tag{43}$$

$$C_{\rm int} = 2(C_{\rm g} + C_{\rm c})l,\tag{44}$$

where  $C_{\rm g}$  is the ground and fringe capacitance and  $C_{\rm c}$  is the coupling capacitance. The expressions of  $C_{\rm g}$  and  $C_{\rm c}$  are obtained from Wong et al. [2000] and NIMO ASU [2008].

Considering the delay variation caused by both the clock buffers and wires in (39), the skew variation  $\Delta s_{u,v}$  includes two terms,  $\Delta s_{u,v} = \Delta s_{b(u,v)} + \Delta s_{int(u,v)}$ . The distribution of  $\Delta s_{b(u,v)}$  is obtained through (12) to (35). The distribution of  $\Delta s_{int(u,v)}$  can be obtained through (17) to (35) by substituting  $\Delta d_{int(i)}$  for  $\Delta d_i$ . Consequently,  $\Delta s_{u,v}$  can be described by Gaussian distribution.

$$\Delta s_{u,v} \sim \mathcal{N}(0, \sigma_{s_{\mathrm{b}(u,v)}}^2 + \sigma_{s_{\mathrm{int}(u,v)}}^2) \tag{45}$$

The extended model is compared with Monte Carlo simulations including the variations of  $r_{int}$  and  $c_{int}$  in the  $\pi$  model of interconnects. Based on the parameters used in Chang and Sapatnekar [2005], the nominal value and standard deviation of the parameters of wires are listed in Table VIII. The multi-via and single-via trees simulated in Section 5.1 are used to verify the accuracy of the extended model. The results for the independent WID variations are reported in Table IX. As reported in Table IX, the accuracy of the model including the variation of wires is reasonably high.

#### REFERENCES

- AGARWAL, A., BLAAUW, D., AND ZOLOTOV, V. 2003. Statistical timing analysis for intra-die process variations with spatial correlations. In Proceedings of the IEEE / ACM International Conference on Computer-Aided Design. 900–907.
- AGARWAL, A., ZOLOTOV, V., AND BLAAUW, D. T. 2004. Statistical clock skew analysis considering intradie-process variations. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 23, 8, 1231–1242.
- AKOPYAN, F., OTERO, C., FANG, D., JACKSON, S. J., AND MANOHAR, R. 2008. Variability in 3-D integrated circuits. In Proceedings of the IEEE Custom Integrated Circuits Conference. 659–662.
- ARUNACHALAM, V. AND BURLESON, W. 2008. Low-Power clock distribution in a multilayer core 3D microprocessor. In Proceedings of the Great Lakes Symposium on VLSI. 429–434.
- AZUMA, A., OISHI, A., OKAYAMA, Y., KASAI, K., AND TOYOSHIMA, Y. 1998. Methodology of mosfet characteristics fluctuation description using bsim3v3 spice model for statistical circuit simulations. In Proceedings of the International Workshop on Statistical Metrology. 14–17.
- BAKOGLU, H. B., WALKER, T. J., AND MEINDL, J. D. 1986. A symmetric clock-distribution tree and optimized high-speed interconnections for reduced clock skew in ulsi and wsi circuits. In Proceedings of the IEEE International Conference on Computer Design. 118–122.
- BOWMAN, K. A., ALAMELDEEN, A. R., SRINIVASAN, S. T., AND WILKERSON, C. B. 2009. Impact of die-to-die and within-die parameter variations on the clock frequency and throughput of multi-core processors. *IEEE Trans. VLSI. Syst. 17*, 12, 1679–1690.
- BOMAN, K. A., DUVALL, S. G., AND MEINDL, J. D. 2002. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. *IEEE J. Solid-State Circ.* 37, 2, 183–190.
- CADENCE DESIGN SYSTEMS. 2008. Virtuoso Spectre Circuit Simulator User Guide 7.0.1 Ed. Cadence Design Systems, Inc.
- CHANG, H. AND SAPATNEKAR, S. 2005. Statistical timing analysis under spatial correlations. *IEEE Trans.* Comput. Aid. Des. Integr. Circ. Syst. 24, 9, 1467–1482.
- CONG, J., KAHNG, A. B., KOH, C.-K., AND TSAO, C.-W. A. 1998. Bounded-Skew clock and steiner routing. ACM Trans. Des. Autom. Electron. Syst. 3, 3, 341–388.
- DEVGAN, A. AND KASHYAP, C. 2003. Block-Based static timing analysis with uncertainty. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. 607–614.
- ELMORE, W. 1948. The transient response of damped linear networks with particular regard to wide-band amplifiers. J. Appl. Phys. 19, 1, 55-63.
- FRIEDMAN, E. 2001. Clock distribution networks in synchronous digital integrated circuits. *Proc. IEEE 89*, 5, 665–692.
- GARG, S. AND MARCULESCU, D. 2009a. 3D-GCP: An analytical model for the impact of process variations on the critical path delay distribution of 3D ICs. In Proceedings of the International Symposium on Quality of Electronic Design. 147–155.
- GARG, S. AND MARCULESCU, D. 2009b. System-Level process variability analysis and mitigation for 3D MPSoCs. In Proceedings of the Design, Automation and Test in Europe Conference. 604–609.
- HARRIS, D. AND NAFFZIGER, S. 2001. Statistical clock skew modeling with data delay variations. *IEEE Trans.* VLSI Syst. 9, 6, 888–898.
- HASHIMOTO, M., YAMAMOTO, T., AND ONODERA, H. 2005. Statistical analysis of clock skew variation in H-tree structure. In Proceedings of the International Symposium on Quality of Electronic Design. Vol. 88. 402– 407.
- ITRS. 2009. International technology roadmap for semiconductors. http://www.itrs.net/
- JIANG, X. AND HORIGUCHI, S. 2001. Statistical skew modeling for general clock distribution networks in presence of process variations. *IEEE Trans. VLSI. Syst.* 9, 5, 704–717.
- JOYNER, J. W., VENKATESAN, R., ZARKESH-HA, P., DAVIS, J. A., AND MEINDL, J. D. 2001. Impact of three-dimensional architectures on interconnects in gigascale integration. *IEEE Trans. VLSI. Syst. 9*, 6, 922–928.
- JOYNER, J. W., ZARKESH-HA, P., AND MEINDL, J. D. 2004. Global interconnect design in a three-dimensional system-on-a-chip. *IEEE Trans. VLSI. Syst. 12*, 4, 367–372.
- KATTI, G., STUCCHI, M., DE MEYER, K., AND DEHAENE, W. 2010. Electrical modeling and characterization of through silicon via for three-dimensional ICs. *IEEE Trans. Electron. Dev.* 57, 1, 256–262.
- KIM, T.-Y. AND KIM, T. 2010. Clock tree embedding for 3D ICs. In Proceedings of the Asia and South Pacific Design Automation Conference. 486–491.
- LIOU, J., CHENG, K.-T., KUNDU, S., AND KRSTIĆ, A. 2001. Fast statistical timing analysis by probabilistic event propagation. In Proceedings of the IEEE/ACM Design Automation Conference. 661–666.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 20, Pub. date: August 2012.

- MALAVASI, E., ZANELLA, S., CAO, M., USCHERSOHN, J., MISHELOFF, M., AND GUARDIANI, C. 2002. Impact analysis of process variability on clock skew. In Proceedings of the International Symposium on Quality Electronic Design. 129–132.
- MONDAL, M., RICKETTS, A. J., KIROLOS, S., RAGHEB, T., LINK, G., VIJAYKRISHNAN, N., AND MASSOUD, Y. 2007. Thermally robust clocking schemes for 3D integrated circuits. In Proceedings of the Design, Automation and Test in Europe Conference. 1206–1211.
- NASSIF, S. 2000. Delay variability: Sources, impacts and trends. In Proceedings of the IEEE International Solid-State Circuits Conference. 368–369.
- NEWMAN, M., MUTHUKUMAR, S., SCHUELEIN, M., DAMBRAUSKAS, T., DUNAWAY, P.A., JORDAN, J.M., KULKARNI, S., LINDE, C.D., OPHEIM, T.A., STEINGEL, R.A., WORWAG, W., TOPIC, L.A., AND SWAN, J.M. 2006. Fabrication and electrical characterization of 3D vertical interconnects. In *Proceedings of Electronic Components and Technology Conference*. 394–398.
- NIMO ASU. 2008. ASU predictive technology model. http://ptm.asu.edu/
- ORSHANSKY, M., MILOR, L., CHEN, P., KEUTZER, K., AND HU, C. 2000. Impact of systematic spatial intra-chip gate length variability on performance of high-speed digital circuits. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. 62–67.

PAVLIDIS, V. AND FRIEDMAN, E. 2009a. Three-Dimensional Integrated Circuit Design. Morgan Kaufmann.

- PAVLIDIS, V. AND FRIEDMAN, E. 2009b. Interconnect-Based design methodologies for three-dimensional integrated circuits. Proc. IEEE 97, 1, 123–140.
- PAVLIDIS, V., SAVIDIS, I., AND FRIEDMAN, E. 2008. Clock distribution networks for 3-D integrated circuits. In Proceedings of the IEEE Custom Integrated Circuits Conference. 651–654.
- REDA, S., SI, A., AND BAHAR, R. 2009. Reducing the leakage and timing variability of 2D ICs using 3D ICs. In Proceedings of the IEEE / ACM International Symposium on Low Power Electronics and Design. 283–286.
- RESTLE, P., CARTER, C. A., ECKHARDT, J. P., KRAUTER, B. L., MCCREDIE, B. D., JENKINS, K. A., WEGER, A. J., AND MULE, A. V. 2002. The clock distribution of the power4 microprocessor. In *Proceedings of the IEEE International Solid-State Circuits Conference*. Vol. 88, 144–145.
- SAVIDIS, I. AND FRIEDMAN, E. 2008. Electrical modeling and characterization of 3-D vias. In Proceedings of the IEEE International Symposium on Circuits and Systems. 784–787.
- SUNDARESWARAN, S., NECHANICKA, L., PANDA, R., GAVRILOV, S., SOLOVYEV, R., AND ABRAHAM, J. A. 2008. A timing methodology considering within-die clock skew variations. In *Proceedings of the IEEE International* SOC Conference. 351–356.
- TÉLLEZ, G. E. AND SARRAFZADEH, M. 1997. Minimal buffer insertion in clock trees with skew and slew rate constraints. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 16, 4, 333–342.
- WONG, S.-C., LEE, G.-Y., AND MA, D.-J. 2000. Modeling of interconnect capacitance, delay, and crosstalk in VLSI. IEEE Trans. Semicond. Manufact. 13, 1, 108–111.
- XU, H., PAVLIDIS, V., AND MICHELI, G. D. 2010. Process-Induced skew variation for scaled 2-D and 3-D ICs. In Proceedings of the IEEE/ACM System Level Interconnect Prediction Workshop. 17–24.
- YANG, J., PAK, J., ZHAO, X., AND LIM, S. 2011. Robust clock tree synthesis with timing yield optimization for 3D-ICs. In Proceedings of the IEEE/ACM Asia South Pacific Design Automation Conference. 621–626.
- ZARKESH-HA, P., MULE, T., AND MEINDL, J. D. 1999. Characterization and modeling of clock skew with process variations. In Proceedings of the IEEE Custom Integrated Circuits Conference. 441–444.
- ZHAO, X. AND LIM, S. K. 2010. Power and slew-aware clock network design for through-silicon-via (TSV) based 3D ICs. In Proceedings of the Asia and South Pacific Design Automation Conference. 175–180.

Received February 2011; revised June 2011; accepted September 2011