# Towards Multiphase Clocking in Single-Flux Quantum Systems

Rassul Bairamkulov and Giovanni De Micheli Integrated Systems Laboratory, EPFL Lausanne, Switzerland

rassul.bairamkulov@epfl.ch, giovanni.demicheli@epfl.ch

Abstract—Rapid single-flux quantum (RSFQ) is one of the most advanced superconductive electronics technologies. SFQ systems operate at tens of gigahertz with up to three orders of magnitude smaller power as compared to CMOS. In conventional SFQ systems, most gates require clock signal. Each gate should have the fanins with equal logic depth, necessitating insertion of path-balancing (PB) DFFs, incurring prohibitive area penalty.

Multiphase clocking is the effective method for reducing the path-balancing overhead at the cost of reduced throughput. However, existing tools are not directly applicable for technology mapping of multiphase systems. To overcome this limitation, in this work, we propose a technology mapping tool for multiphase systems. Our contribution is threefold. First, we formulate a phase assignment as a Constraint Programming with Satisfiability (CP-SAT) problem, to determine the phase of each element within the network. Second, we formulate the path balancing problem as a CP-SAT to optimize the number of DFFs within an asynchronous datapath. Finally, we integrate these methods into a technology mapping flow to convert a logic network into a multiphase SFO circuit. In our case studies, by using seven phases, the size of the circuit (expressed as the number of Josephson junctions) is reduced, on average, by 59.94 % as compared to the dual (fastslow) clocking method, while outperforming the state-of-the-art single-phase SFQ mapping tools.

#### I. Introduction

Superconductive electronics is one of the most promising beyond-CMOS technologies, with Rapid Single-Flux Quantum (RSFQ) [1] as the most prominent superconductive technology. RSFQ systems consistently achieve operating frequencies on the order of tens of gigahertz [2], [3], with particular structures accelerating to hundreds of gigahertz [4], [5]. Furthermore, the operating power of the RSFQ systems is two to three orders of magnitude smaller than CMOS, even considering the refrigeration power [6]. Due to the low power, low noise, and high speed RSFQ systems are utilized in high-resolution sensors [7] and high-performance wireless communication systems [8]. In addition, these advantages make the RSFQ a promising candidate for large-scale stationary computing systems (e.g., data centers) [6], as well as low-power computing in space [9].

This work was supported by the SNF grant "Supercool: Design methods and tools for superconducting electronics" under Grant 200021-1920981.

The authors thank Professor Peter Beerel from the University of Southern California for the insightful discussions.

Elevating complexity of RSFQ systems is however a challenging task. Due to the major differences between RSFQ and CMOS technologies, standard design methodologies and tools are not applicable to RSFQ. In CMOS, transistors encode information by voltage levels. In contrast, in RSFQ, Josephson Junctions (JJ) encode information using single flux quantum (SFQ) pulses [10]. <sup>1</sup> The absence or presence of a SFQ pulse encode, respectively, a logic 0 or 1. The flow of pulses within a logic network requires synchronization. While a few structures, such as splitter and 0R, can operate asynchronously, most logic gates in the original RSFQ cell library, such as AND, NOT and XOR, require clock signal [1]. This feature requires the data to be pipelined at the gate level.

Two features further complicate scaling the RSFQ systems. Due to the quantized nature of SFQ pulses, most RSFQ logic gates are limited to fanout of one. A special gate called *splitter* is necessary to duplicate a signal [9], [10], as illustrated in Fig. 1b. In addition, path balancing is required to ensure the correct order of data propagation within the network, as indicated by four path balancing D-flip-flops (DFF) in Fig. 1b. The number of path balancing DFFs and splitters can be prohibitively large, degrading the area and yield. The complexity of the clock tree synthesis is also increased due to the large number of clocked elements. Despite the advances in technology mapping [11], [12], the overhead of path balancing and clock distribution remains large, motivating the investigation of alternative RSFQ architectures.

Alternative clocking methods have been explored in the literature to alleviate the issue of path balancing. In dual clock methodology [13], a logic circuit is partitioned into separate clocking domains. The gates within a single domain are clocked at high frequency  $f_{hi}$ , while cross-domain communication is performed via the nondestructive readout (NDRO) registers operating at the low frequency  $f_{lo} = f_{hi}/N$ . In [13], the number of path balancing DFFs is reduced by 35% with N=5 [13]. The throughput of the system is however reduced by a factor of 5, establishing a tradeoff between the area and the throughput of the system. The primary limitation of the dual clock method is the use of the large registers to repeat the data signals and additional AND gates to suppress the spurious

<sup>1</sup>For an in-depth discussion of the SFQ technology we refer an interested reader to [9], [10].



Fig. 1. a) An example of a CMOS circuit. b) Equivalent SFQ circuit with a splitter and two path balancing DFFs. c) A four-phase SFQ realization with no path balancing DFFs. The numbers indicate the phase corresponding to each clocked gate.

pulses within the system [14]. In addition, similar to full path balancing, distributing the high frequency clock signal makes the timing closure more challenging to achieve.

Multiphase clocking has recently been proposed for designing area-efficient SFQ systems [15]. Using several phase-shifted clock signals allows data to propagate to the subsequent stages without DFFs. For example, by using four-phase clocking, the path balancing DFFs can be completely removed, as shown in Fig. 1c. Using multiple phases was demonstrated effective in reducing the SFQ circuit area, thanks to the reduction in DFF count (i.e., smaller logic network) and fewer clocked elements yielding smaller clock network size. With only two phases, the number of DFFs and area are reduced by, respectively, 55% and 41% [15], despite the overhead of generating and distributing separate clock signals. Additional phases further reduce the system area. Furthermore, in an n-phase system, the clock signals have n times lower frequency, simplifying the clock distribution process. Similar to dual clock method, the area reduction is achieved by sacrificing the throughput. An n-phase system has n times smaller throughput as compared to a single phase system.

Although the multiphase clocking has been shown to reduce the path balancing overhead, existing technology mapping tools offer limited support for the multiphase systems. For example, path balancing DFFs cannot be efficiently distributed within the complex datapaths utilizing asynchronous SFQ gates. To bridge this gap, in this paper, we propose a novel methodology to apply the multiphase clocking to SFQ circuits. Our contribution is threefold,

- we extend the DFF insertion methodology to assign the phase to each gate within the circuit, including unclocked elements.
- 2) we formulate PB DFF insertion as a Constraint Programming with SAT (CP-SAT) problem to satisfy the timing constraints with minimum number of PB DFFs. We improve the scalability by identifying independent paths allowing the asynchronous paths to be processed separately.
- 3) we integrate phase assignment and DFF insertion into the technology mapping flow to realize an arbitrary logic network with multiple clock phases.

With only two phases, our flow outperforms the state-of-theart single-phase methods. Compared with the dual clocking method, our technology mapping flow achieves up to 82% smaller size, while offering equal throughput.

The rest of the paper is organized as follows. The principles of SFQ technology and multiphase clocking are reviewed in Section II. The CP-SAT-based method for phase assignment is presented in Section III. PB DFF insertion are described in Section IV. Experimental results and comparison with the prior works are provided in Section V, followed by conclusions in Section VI.

# II. BACKGROUND

A logic network  $\mathcal{N} = (\mathcal{V}, \mathcal{E})$  is a directed acyclic graph (DAG) where  $\mathcal{V}$  is a set of nodes and  $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$  is a set of edges. The set of nodes  $\mathcal{V} = \mathcal{I} \cup \mathcal{O} \cup \mathcal{G}$  consists of three disjoint subsets representing, respectively, the primary inputs (PI), primary outputs (PO) and gates within the network. A

set of fanins FI(g) (fanouts FO(g)) of gate  $g \in \mathcal{V}$  denotes the nodes connected to g via and incoming (outgoing) edge.

Technology mapping refers to the process of transforming an arbitrary logic network  $\mathcal{N}$  into a technology-specific representation  $\mathcal{N}'$  using a particular set of primitives. The technology mapping aims to optimize the target characteristics of the mapped network  $\mathcal{N}'$ , such as area, delay, or yield The nature of a technology greatly influences the technology mapping process. In the next subsection, the properties of the SFQ technology are reviewed.

### A. Single-Flux Quantum technology

Rapid Single-flux Quantum (RSFQ) is a cryogenic superconductive computing logic family based on Josephson junctions (JJ). RSFQ gates consist of superconductive loops storing quantized magnetic flux. The information is transferred between the superconductive loops in form of SFQ pulses with the area of  $\Phi_0 = \hbar/2e \approx 2.07$  mV·ps [9]. Based on synchronization mechanisms, the SFQ logic gates can be divided into three major categories [16].

- Asynchronous input, Asynchronous output (AA) components process the input information immediately upon arrival.
   The most common gates in this category are splitter and merger. A *splitter* produces two pulses for each incoming pulse. A *merger* directs signals from two input branches into one output branch, effectively performing an OR function.
- Asynchronous input, Synchronous output (AS) elements
  process the input information immediately upon arrival and
  release the output synchronously after the arrival of the clock
  signal. The components of this type include *D-flip-flop (DFF)*,
  inverter (NOT) and exclusive-or (XOR).
- Synchronous input, Asynchronous output (SA) elements require the input signals to arrive simultaneously to operate correctly. The result of the computation is released immediately after processing. Assuming inputs arrive simultaneously, a CB can be tuned to produce at most a single output pulse, producing an OR element [17]. Furthermore, by adjusting the JJ size and bias current, the OR structure can be transformed into AND element, requiring both signals to arrive simultaneously to produce an output pulse.

In conventional RSFQ [1], the OR and AND gates incorporate the DFFs at the inputs to ensure the simultaneous release of the data pulses. However, the simultaneity can be achieved by other AS gates, such as NOT and XOR [16], allowing richer functionality to be realized in fewer clock cycles.

To correctly consider the type of each element, we divide the set of logic gates into three disjoint subsets  $\mathcal{G} = \mathcal{G}_{AA} \cup \mathcal{G}_{AS} \cup \mathcal{G}_{SA}$ , where each subset represents the elements of the corresponding category. Note that only gates in the subset  $\mathcal{G}_{AS}$  require clock signal.

## B. Multiphase clocking

Multiphase clocking (MPC) is a clocking technique based on several clock signals with equal clock period. The seminal work by Li et al. [15] demonstrated MPC as an effective remedy for path balancing in SFQ systems. A n-phase system utilizes n periodic signals  $\{t_0, \cdots, t_{n-1}\}$  operating at the same frequency and different phases. Each clocked element  $g \in \mathcal{G}_{AS}$  within the network is synchronized by only one clock signal at phase  $\varphi(g)$ . The epoch S(g) of a gate g is defined as the

number of clock cycles separating the gate g from the PIs, as illustrated in Fig. 2. The clock signals are ordered by phase  $\varphi \in \{0, \cdots, n-1\}$ , i.e., during any epoch, the clock signal  $t_i$  arrives before clock signal  $t_j$  if i < j. For consistency in I/O timing, the PIs are placed at the same epoch 0 and the POs are placed at the same epoch. For example, the POs are Fig. 2, the fanouts x and y are both placed at epoch 2. Observe that a path balancing DFF is inserted before x to equalize the epochs of both POs. For convenience, we define a stage  $\sigma(g)$  of a gate g as

$$\sigma(g) = nS(g) + \varphi(g). \tag{1}$$

Multiphase clocking relaxes the path balancing requirements of a clocked system. With n phases, the stage difference  $\Delta\sigma(i,j)$  between any pair of sequentially adjacent elements (i,j) should be no greater than n. Compared with single-phase systems where a q-stage difference is balanced by q DFFs, the same difference within an n-phase system can be balanced with only  $\left\lfloor \frac{q}{n} \right\rfloor$  DFFs, greatly reducing the path balancing overhead in imbalanced paths.

To realize the potential of multiphase clocking in SFQ systems, we propose the two-step DFF insertion methodology based on CP-SAT. During the phase assignment step, the SFQ gates are assigned a phase to reduce the number of DFFs within the system, as described in Section III. During the DFF insertion step, optimal DFF insertion is determined for each datapath, as described in Section IV.

#### III. PHASE ASSIGNMENT

The phase assignment is the procedure of determining the stage (i.e., epoch and phase) of each gate within the network. An ILP-based methodology for phase assignment to reduce the number of path balancing DFFs is proposed in [15],

$$\min_{\sigma(g) \ \forall g \in \mathcal{G}} \sum_{(i,j) \in \mathcal{E}} \left[ \frac{\sigma(j) - \sigma(i)}{n} \right], \tag{2}$$

Subject to:

$$\sigma(i) < \sigma(j) \quad \forall (i,j) \in \mathcal{E},$$
 (3)

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = 0 \quad \forall i \in \mathcal{I}, \tag{4}$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = \left\lfloor \frac{\sigma(j)}{n} \right\rfloor \quad \forall i, j \in \mathcal{O}. \tag{5}$$

This formulation is highly effective when considering the systems consisting of only clocked gates, i.e.  $\mathcal{G}=\mathcal{G}_{AS}$ ,  $\mathcal{G}_{AA}=\mathcal{G}_{SA}=\varnothing$  This formulation however offers limited



Fig. 2. An example two-phase network. Each PI is placed at stage 0 or 1, at epoch 0. The POs are placed at stages 4 and 5, corresponding to epoch 2.

support for the unclocked gates, such as mergers and SA AND gate. Unclocked elements can greatly improve the quality of SFQ technology mapping by enriching the functionality available within a single clock cycle. In [18], for example, the area of the circuit was reduced by up to 54% by utilizing asynchronous elements. However, the unclocked gates impose timing constraints different from the clocked gates. To extend the support of this ILP model to unclocked gates, additional constraints are introduced in this section.

While the notions of phase, epoch, and stage are primarily relevant for the clocked gates, it is however important to also assign a stage to the AA and SA elements, since a stage of an unclocked element guides the DFF insertion. Consider two networks containing an asynchronous datapath illustrated in Fig. 3. Both networks realize the same function, however, the possible locations of path balancing DFFs are different. Assigning a stage to an unclocked gate is important to correctly place the path balancing DFFs.

### A. SA element constraints

Recall that the inputs of the SA gates require simultaneous arrival of data pulses to the gate. To ensure the simultaneous arrival, the AS gate should be placed immediately before the SA gate, at the same clock stage, as shown in Fig. 4a. If the stage of AS gate is earlier than the SA gate, a path balancing DFF should be inserted. In addition, the SA gate cannot be placed at the same stage as the fanin, if the fanin is not an AS gate. For example, if the fanin of the SA gate has fanout greater than one, a splitter is inserted after the AS gate. Thus, the fanin type of the SA gate is a splitter (AA gate). The propagation delay incurred by the AA gates, such as splitters, may desynchronize the inputs arriving to the SA gate [18], producing a data hazard, as illustrated in Fig. 4b. To avoid this condition, the minimum stage of gate g is increased by one, as shown in Fig. 4c.

# B. Asynchronous paths

The sharing of DFFs by a complex unclocked path has been highlighed as early as 1991 in [19], where sharing of DFFs is maximized to minimize the circuit area. While estimating the minimum number of DFFs in special cases, such as splitter tree, is relatively simple, more complex asynchronous paths, however, require more sophisticated processing and, hence, prohibitive runtime. During the phase assignment stage we



Fig. 3. The effect of phase assigned to splitters and mergers. Datapaths (a) and (b) execute the same function. The phases assigned to the splitters and mergers (denoted as S and M, respectively) are however different. The potential DFF locations (depicted as yellow rectangles) are therefore different in the two datapaths.







Fig. 4. Timing constraints of an SA gate. a) The gate should be preceded by an AS gate at the same stage. b) Example of an invalid assignment. The AS gate is preceded by an AA gate at the same stage. c) The issue is resolved by shifting the SA gate to the next stage and inserting two path balancing DFFs.

therefore approximate the number of DFFs as

$$C\left(\sigma\left(g\right)\right) = \left\lfloor \frac{\max\limits_{a \in FO(g)} \left(\sigma\left(a\right)\right) - \sigma\left(g\right)}{n} \right\rfloor. \tag{6}$$

The exact number of DFFs is determined during the DFF insertion stage described in Section IV. The combined optimization problem is formulated as

$$\min_{\sigma(g)} \sum_{\forall g \in \mathcal{G}} \left[ \frac{\sigma(j) - \sigma(i) + (j \in \mathcal{G}_{SA})}{n} \right], \quad (7)$$

# Subject to:

$$\sigma(i) < \sigma(j) \quad \forall (i,j) \in \mathcal{E}, j \in \mathcal{G}_{AS},$$
 (8)

$$\sigma(i) \le \sigma(j) \quad \forall (i,j) \in \mathcal{E}, j \notin \mathcal{G}_{AS},$$
 (9)

$$\sigma(g) \ge \sigma(a) + (a \notin \mathcal{G}_{AS}) \quad \forall g \in \mathcal{G}_{SA}, a \in FI(g), \quad (10)$$

$$\left| \frac{\sigma(i)}{n} \right| = 0 \quad \forall i \in \mathcal{I}, \tag{11}$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = \left\lfloor \frac{\sigma(j)}{n} \right\rfloor \quad \forall i, j \in \mathcal{O}. \tag{12}$$

Each gate g with #FO(g)>1 requires a splitter tree to copy the signal to each of the fanouts. After completing the phase assignment, a splitter tree that maximizes sharing of path balancing DFFs can be produced. First, the fanouts FO(g) of a gate g are sorted by phase in the ascending order, producing a sequence  $A=[a_0,a_1,\cdots,a_k],$   $\sigma(a_i)\leq\sigma(a_j)\iff i< j.$  Next, the splitters  $[s_0,...,s_{k-1}]$  are inserted at phases  $a_0,\cdots,a_{k-1}$ . Each splitter  $s_i$  has a splitter  $s_{i-1}$  as a fanin, except  $s_0$  whose fanin is the gate g. Each splitter  $s_i$  has the  $a_i$  and  $a_k$ . Using this method, the splitters are placed as late as possible within the network, maximizing sharing of the datapath.

## IV. DFF PLACEMENT

The complex asynchronous datapaths greatly complicate the DFF insertion process. Consider for example the datapath shown in Fig. 3. The optimal placement of the DFFs cannot be easily inferred from the datapath topology. In this section, we propose a CP-SAT based placement methodology for determining the optimal location of the DFFs. Our methodology consists of two stages. First, we identify *independent datapaths* within an unbalanced logic network, as described in Subsection IV-A. Next, for each independent datapaths, we formulate a CP-SAT problem where the timing constraints are satisfied using the minimum number of DFFs, as described in Subsection IV-B.

## A. Independent datapaths

Recall that only a single clocked gate can be placed at the same stage along a datapath. The AS and SA elements therefore prevent a DFF from being placed at the same stage. Furthermore, any timing constraint imposed by a DFF placed before the AS/SA element, will be dominated by the timing constraint imposed by the AS/SA element itself at any location after the AS/SA element. This feature allows us to represent a network-wide DFF insertion problem as multiple smaller problems. This partitioning capability brings two major advantages to the DFF insertion process. First, while the network-wide DFF insertion problem can be solved in  $O(e^N)$  time, the complexity of a partitioned problem is  $O(e^k N_{path})$ , where N is the network size,  $N_{path} \propto N$  is the number of independent datapaths and k is the size of the largest partition (typically,  $k \ll N$ ). Second, partitioning facilitates multithreaded processing, allowing faster processing on a multicore system.

The proposed DFF insertion method is applied to a single independent datapath constrained by the synchronous gates at the input and the output, as illustrated in Fig. 5a. The datapath P can be described as a portion of the logic network P=(I,A,O), where set  $A\subseteq \mathcal{G}_{\mathtt{AA}}$  contains the AA gates within the datapath and sets  $I,O\subseteq \mathcal{G}_{\mathtt{AS}}\cup \mathcal{G}_{\mathtt{SA}}$  describe, respectively, the AS and SA gates at the input and output of the datapath. In Fig. 5a, the three independent paths are

$$\begin{split} P_1 &= (I = \{\mathtt{A},\mathtt{B},\mathtt{D}\}, A = \{\mathtt{1},\mathtt{2},\mathtt{3},\mathtt{4},\mathtt{5}\}, O = \{\mathtt{W},\mathtt{X},\mathtt{Y},\mathtt{Z}\})\,, \\ P_2 &= (I = \{\mathtt{E}\}, A = \varnothing, O = \{\mathtt{W}\})\,, \\ P_3 &= (I = \{\mathtt{C},\mathtt{F}\}, A = \{\mathtt{6}\}, O = \{\mathtt{Z}\})\,. \end{split}$$

## B. DFF insertion

After identifying each independent path within the network, we determine the potential DFF sites for subsequent DFF insertion. To uniquely identify each potential DFF site, we define  $d=(g_d^i,g_d^o,\sigma(d))$ , where  $g_d^i\in I\cup A$  and  $g_d^o\in A\cup O$  are the fanin and fanout elements of the DFFs and  $\sigma(d)$  is the stage of the DFF site. We define a chain  $Q=(d_{min},\cdots,d_{max})$  as a sequence of sequentially adjacent DFF locations situated between  $d_{min}$  and  $d_{max}$ . The examples of DFF sites and a chain are shown in Fig 5b. For convenience, we define the length of chain  $\Delta\sigma(Q)$  as the stage difference between  $d_{min}$  and  $d_{max}$ .

For each DFF site, we introduce a binary variable  $\delta(d)$  equal to 1 if the DFF is placed at d and 0 otherwise. The problem of minimizing the number of path balancing DFFs can therefore be formulated as a CP-SAT problem minimizing

$$\sum_{d \in P} \delta(d). \tag{13}$$

Several constraints describe the valid placement of the DFFs within an independent path. First, for each stage, only a single DFF can be placed along a chain. In Fig. 5b, for example, the stage  $\sigma=5$  has five potential DFF locations, a,b,c,d, and e. The DFFs here should however satisfy the constraint

$$\sum_{d \in Q, \sigma_d = i} \delta(d) \le 1. \tag{14}$$

The combinations of DFF sites satisfying this constraint are



Fig. 5. Example of an asynchronous datapath within a four-phase network (n=4). The black triangles, and black curved shapes represent the AS and SA elements, respectively. Red circles denote the AA elements. a) Three independent paths are drawn with red solid, black solid, and black dotted arrows. b) DFF sites shown as rectangles within the red independent path. The grey DFF sites represent the chain Q of length n. Diagonally shaded DFF sites precede the SA gates. DFFs should therefore be placed at these sites.

 $\{a,c,e\}$ ,  $\{a,d\}$ ,  $\{b,e\}$ , since only these sets do not place multiple DFFs along any chain. Other conflicting DFF sites can be found at stages 3 and 6.

Second, the distance between any pair of clocked elements along a chain should not exceed n in an n-phase system. This requirement can be formulated as a condition

$$\bigvee_{d \in Q, \Delta\sigma(Q) = n} \delta(d) = 1 \tag{15}$$

stating that at least one DFF should be placed along any chain of length n.

Finally, recall that the SA gates should be preceded by an AS gate to function correctly. Thus, if an SA element is not directly preceded by an AS gate, a DFF should be placed before the SA element.

$$\delta(d) = 1 \quad \forall g_d^o \in \mathcal{G}_{SA}, \sigma(g_d) = \sigma_d. \tag{16}$$

The DFF sites before the SA gates are shaded with diagonal hatch pattern in Fig. 5b.

Equations (13)-(16) constitute a CP-SAT problem where the conditions (14)-(16) are satisfied with the minimum number of DFFs (13).

# V. EXPERIMENTAL RESULTS

We integrate the DFF placement methodology into the technology mapping flow for SFQ compound gate circuits. The SFQ circuits are synthesized with mockturtle using a database of precomputed compound gate structures [18]. The circuits are next decomposed into the functional primitives while removing all DFFs within the circuit. We next apply our CP-SAT-based phase assignment and DFF insertion procedures using Google OR-Tools [20].

We apply our mapping flow to synthesize a subset of EPFL [21] and ISCAS [22] benchmark circuits. For each benchmark circuit, the total runtime does not exceed 60.64 seconds. Phase assignment is a major component of the total runtime. Optimal phase assignment has been successfully determined in all circuits. In most cases, the phase assignment is completed in under 5 seconds, with the longest runtime (42 seconds) required by the voter circuit. Naturally, larger

circuits, such as s13207, voter and priority require longer runtime for phase assignment. Applying our tool to larger circuits would require additional processing time. Note that although finding the optimal phase assignment may require prohibitive runtime, a feasible solution is often available much earlier. Therefore, by relaxing the optimality requirement, the our framework can support larger circuits. <sup>2</sup>

We first compare our results with dual clocking method [14]. Two experiments are reported where the frequency of the slow clock is 7 and 12 times smaller than the frequency of the fast clock. For fair comparison, we use 7 and 12 phases to equalize the throughput. The results are shown in Table I. Note that unlike [14], the DFFs within the AND and OR gate are explicitly counted in our work due to the use of the unclocked (SA) AND and OR elements. Furthermore, additional elements, such as pulse repeaters and NDRO flip-flops are needed in DCM. We therefore focus on the number of JJs, commonly used in SFQ technology as an area metric. The number of JJs is reduced, on average, by 59.9% when using 7-phase clocking. The largest reduction is observed in the priority benchmark circuit containing many imbalanced paths that require long chains of path balancing DFFs to be inserted. With multiphase clocking, these chains can be balanced by approximately 7 times fewer DFFs, contributing to the area savings.

For 12-phase clocking, the area is reduced by 47.5%. Observe that the smallest reduction in area (and the increase in DFF count) are observed in circuits with relatively small depth. These circuits therefore have fewer imbalanced paths that could benefit from multiphase clocking. Despite relatively large number of DFFs, we still manage to reduce the total JJ count due to the use of unclocked elements (AA and SA) supported by our tool.

The Fig. 6 (left) describes the relationship between the number of DFFs and the number of phases. The numbers of DFFs are normalized with respect to a single phase system mapped with our flow. Consistent with [15] the number of DFFs reduces with multiple phases. The area savings due to an additional phase gradually diminishes as more phases are added. We compare our results against PBMap [11] and gate compounding [18], the state-of-the-art single-phase SFQ mapping algorithms. With our mapping flow, two phases are sufficient, on average, to outperform gate compounding. With three phases our flow yields fewer DFFs than PBMap. The Fig. 6 (right) describes the relationship between the circuit area (expressed as the number of JJs) and the number of phases. Our flow outperforms PBMap with a single phase, primarily due to the use of asynchronous gates (e.g., merger instead of an OR gate). A two-phase network synthesized with our flow requires fewer JJs than a single-phase compound-gate network.

#### VI. CONCLUSIONS

RSFQ technology presents a remarkable opportunity to achieve unprecedented performance and energy efficiency of computing systems. Realizing its full potential however requires overcoming major technological issues, including

<sup>2</sup>Mapping of hyp (the largest circuit in EPFL suite) has been completed within one hour with only 15 and 20 minutes allocated to phase assignment and DFF insertion, respectively (not presented due to space constraints).

TABLE I COMPARISON OF MULTIPHASE CLOCKING WITH DUAL CLOCKING METHOD [13] FOR DIFFERENT THROUGHPUTS

|           | 1/7 throughput |           |                       |           |                    |        |         | 1/12 throughput |           |                        |           |                    |        |         |
|-----------|----------------|-----------|-----------------------|-----------|--------------------|--------|---------|-----------------|-----------|------------------------|-----------|--------------------|--------|---------|
|           | DCM HDFF #JJ   |           | Multiphase   #DFF #JJ |           | Change<br>#DFF #JJ |        | Runtime | DCM<br>#DFF #JJ |           | Multiphase<br>#DFF #JJ |           | Change<br>#DFF #JJ |        | Runtime |
|           | #DIT           | #33       | #DIT                  | #33       | #DIT               | #JJ    | (s)     | #DIT            | #33       | #DIT                   | #33       | #DIT               | #33    | (s)     |
| int2float | 117            | 7,770     | 217                   | 5'136     | +85%               | -34%   | 4.42    | 39              | 5'140     | 46                     | 3'939     | +18%               | -23%   | 4.13    |
| priority  | 8'562          | 257'252   | 3'285                 | 45'094    | -62%               | -82%   | 54.01   | 4'225           | 158'568   | 1'775                  | 34'524    | -58%               | -78%   | 30.33   |
| voter     | 7'204          | 447'044   | 2'180                 | 162'804   | -70%               | -64%   | 60.64   | 3'732           | 355'144   | 1'568                  | 158'520   | -58%               | -55%   | 48.94   |
| c432      | 224            | 10'734    | 342                   | 5'116     | +53%               | -52%   | 14.92   | 118             | 7'124     | 240                    |           | +103%              | -38%   | 10.14   |
| c880      | 362            | 14'658    | 254                   | 6'190     | -30%               | -58%   | 7.56    | 187             | 9'483     | 119                    | 5'245     | -36%               | -45%   | 7.25    |
| c1908     | 282            | 13'169    | 125                   | 3'529     | -56%               | -73%   | 10.22   | 144             | 8'739     | 69                     | 3'137     | -52%               | -64%   | 8.78    |
| c3540     | 776            | 43'437    | 589                   | 17'016    | -24%               | -61%   | 11.83   | 282             | 26'897    | 440                    | 15'973    | +56%               | -41%   | 9.29    |
| c1355     | 193            | 8'739     | 46                    | 4'515     | -76%               | -48%   | 4.18    | 119             | 6'149     | 44                     | 4'501     | -63%               | -27%   | 3.89    |
| s13207    | 1'795          | 106'346   | 1'837                 | 42'382    | +2%                | -60%   | 27.87   | 571             | 60'766    | 1'082                  | 37'097    | +89%               | -39%   | 25.90   |
| s5378     | 645            | 50'766    | 808                   | 21'761    | +25%               | -57%   | 7.87    | 255             | 34'053    | 368                    | 18'681    | +44%               | -45%   | 7.07    |
| s382      | 56             | 4'448     | 89                    | 2'411     | +59%               | -46%   | 4.17    | 9               | 2'750     | 48                     | 2'124     | +433%              | -23%   | 3.93    |
| Geomean   | 557.04         | 29'868.20 | 413.48                | 11'965.54 | -25.8%             | -59.9% | 12.02   | 227.85          | 19'565.38 | 229.57                 | 10'467.10 | +0.8%              | -47.5% | 10.03   |





Fig. 6. Normalized average number of DFFs (left) and JJs (right) with different number of phases. The lines in the background represent normalized number of DFFs and JJs for individual benchmarks.

path balancing. Multiphase clocking can substantially reduce the path balancing overhead in SFQ systems. Existing path balancing tools however offer limited support of multiphase clocking in systems with asynchronous SFQ gates. In this work, we presented a technology mapping and path balancing methodology for multiphase SFQ systems based on constraint programming with satisfiability (CP-SAT). An SFO logic network is initially synthesized with mockturtle logic synthesis library. Next, the circuit is decomposed into primitives and the DFFs are removed from the network. Each primitive gate is assigned a phase using the CP-SAT while satisfying the special timing constraints imposed by the unclocked SFQ elements. Finally, using the novel CP-SAT formulation, the minimum number of path balancing DFFs satisfying the timing constraints is determined. In the experimental results, we showed an average of 59% reduction in the number of JJs when compared to dual clocking method. Furthermore, with only two phases, our methodology yields smaller networks than the state-of-theart single-phase SFQ mapping techniques.

## REFERENCES

- [1] K. Likharev, O. Mukhanov, and V. Semenov, "Resistive Single Flux Quantum Logic for the Josephson-Junction Digital Technology," Proc. SQUID, Vol. 85, June 1985.
- [2] T. Kawaguchi, M. Tanaka, K. Takagi, and N. Takagi, "Demonstration of an 8-Bit SFQ Carry Look-Ahead Adder Using Clockless Logic Cells," Proc. ISEC, July 2015.
- [3] Z. J. Deng, N. Yoshikawa, S. R. Whiteley, and T. Van Duzer, "Data-Driven Self-Timed RSFQ High-Speed Test System," IEEE TASC, Vol. 7. No. 4, 1997.
- W. Chen et al., "Rapid Single Flux Quantum T-Flip Flop Operating up to 770 GHz," IEEE TASC, Vol. 9, No. 2, 1999.
- Q. P. Herr, A. D. Smith, and M. S. Wire, "High Speed Data Link between Digital Superconductor Chips," Appl. Phys. Lett., Vol. 80, No. 17, 2002.

- [6] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, "Energy-Efficient Superconducting computing—Power Budgets and Requirements," IEEE TASC, Vol. 23, No. 3, 2013.
- [7] K. Sternickel and A. I. Braginski, "Biomagnetism Using SQUIDs: Status and Perspectives," Supercond. Sci. Technol., Vol. 19, No. 3, 2006.
- W. H. Tuttlebee, Software Defined Radio: Enabling Technologies, John Wiley and Sons, 2002.
- [9] G. Krylov and E. G. Friedman, Single Flux Quantum Integrated Circuit Design, Springer, 2022.
- [10] P. Bunyk, K. Likharev, and D. Zinoviev, "RSFQ Technology: Physics and Devices," Int. J. High Speed Electron. Syst., Vol. 11, No. 01, 2001.
- [11] G. Pasandi and M. Pedram, "PBMap: A Path Balancing Technology Mapping Algorithm for Single Flux Quantum Logic Circuits," IEEE TASC, Vol. 29, No. 4, June 2019.
- [12] N. Kito, K. Takagi, and N. Takagi, "Logic-Depth-Aware Technology Mapping Method for RSFQ Logic Circuits With Special RSFQ Gates, IEEE TASC, Vol. 32, No. 4, 2021.
- [13] G. Pasandi and M. Pedram, "An Efficient Pipelined Architecture for Superconducting Single Flux Quantum Logic Circuits Utilizing Dual Clocks," IEEE TASC, Vol. 30, No. 2, 2020.
- [14] G. Pasandi and M. Pedram, "Depth-Bounded Graph Partitioning Algorithm and Dual Clocking Method for Realization of Superconducting SFQ Circuits," ACM JETC, Vol. 17, No. 1, October 2020.
- [15] X. Li, M. Pan, T. Liu, and P. A. Beerel, "Multi-Phase Clocking for Multi-Threaded Gate-Level-Pipelined Superconductive Logic," Proc. ISVLSI, pp. 62-67, 2022.
- [16] R. Bairamkulov and G. De Micheli, "Compound Logic Gates for Pipeline Depth Minimization in SFQ Integrated Systems," Proc. GLSVLSI, 2023.
- [17] O. Mukhanov, V. Semenov, and K. Likharev, "Ultimate Performance of the RSFQ Logic Circuits," *IEEE Trans. Magn.*, Vol. 23, No. 2, 1987.
- [18] R. Bairamkulov, A. Tempia Calvino, and G. De Micheli, "Synthesis of SFQ Circuits with Compound Gates," *Proc. VLSI-SoC*, 2023. C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry,"
- Algorithmica, Vol. 6, No. 1-6, June 1991.
- [20] L. Perron, "Operations Research and Constraint Programming at Google," Proc. CP, 2011.
- [21] M. Soeken et al., "The EPFL Logic Synthesis Libraries," arXiv Preprint arXiv:1805.05121v3, 2018.
- M. C. Hansen, H. Yalcin, and J. P. Hayes, "Unveiling the ISCAS-85 Benchmarks: A Case Study in Reverse Engineering," IEEE Des. Test. Comput., Vol. 16, No. 3, 1999.