# Program Implementation Schemes for Hardware-Software Systems Rajesh K. Gupta Claudionor N. Coelho, Jr. Giovanni De Micheli Computer Systems Laboratory Departments of Electrical Engineering and Computer Science Stanford University, Stanford, CA 94305-4055. rgupta@sirius.stanford.edu, (415) 725-1776 #### Abstract Mixed system designs consist of interacting hardware and software components. The hardware component consists of a re-programmable component, like an off-the-shelf processor, and application-specific chips. The software component consists of a set of concurrently executing program fragments, called threads. We have considered the problem of generating hardware and software components from a given system model under input/output data rate constraints in [1]. In this paper we present schemes for implementation of the software component of mixed system designs. The primary motivation for software implementation schemes is to provide a low overhead scheme for concurrent software implementation on a single processor which makes the overall hardware-software system design feasible. As an illustration of the program implementation schemes and feasibility of mixed system designs, we present design of a graphics controller and its simulation results. #### 1 Introduction Prompted by the success in recent years in synthesis of large-scale integrated circuits either as a single chip or as a collection of chips [2] [3] [4] [5], attention has now been focused on synthesis of systems where the target architecture contains application-specific as well as general purpose re-programmable components. The re-programmable components of a system architecture refers to a microprocessor like the Mips R3000 which executes part of the system functionality implemented as programs assisted by dedicated ASIC hardware. Motivation to such a target architecture comes from the realization that most complex system designs consist of mixed components that blend ASIC chips with processors, memory and other special purpose modules. The ASIC components are chosen to complement performance or add functionality not achievable by pure program implementations. Synthesis of systems containing re-programmable components can be thought of as extension of high-level synthesis techniques to systems containing 'generalized resources'. However, due to differences in the computation model of the operations implemented in re-programmable and application-specific components, the overall problem of system synthesis is much more complex. The re-programmable components implement system functionality in an *instruction-driven* software component as an ordered set of machine instructions with a statically allocated memory space, whereas the application-specific hardware components essentially operate as *data-driven* reactive computational elements. Because of Figure 1: System Synthesis Procedure this difference in the hardware and software components, the problem of system synthesis is formulated as a hardware-software co-design problem. Figure 1 illustrates the overall approach to synthesis of systems containing both hardware and software components. We model system behavior using the *HardwareC* [6] language that has a C-like syntax and supports timing and resource constraints. *HardwareC* supports specification of *unbounded and unknown delay operations* that can arise from data dependent decisions and external synchronization operations. The *HardwareC* description is compiled into a system graph model based on data-flow graphs [6]. The system graph model consists of vertices representing operations, and edges which represent serialization among operations. Overall the system graph model is composed of concurrent data-flow sections which are ordered by the system control flow. Associated with input/output statements, the user can specify corresponding constraints on input/output data rates. The input (output) rate constraints refer to bounds on the rates at which data is required to be consumed (produced). The system graph model is partitioned based on feasibility of the overall system implementation and satisfaction of applicable data-rate constraints. One such scheme relies on identifying and partitioning unbounded delay operations[1]. As a result of system partitioning, we have a set of concurrently executing 'hardware' and 'software' models. These models consist of hierarchical acyclic system graph models. We consider input/output operations related to message-passing and data-dependent loops to be unbounded delay operations. Since data-dependent operations may offer unbounded delays it becomes necessary to schedule these operations dynamically. Therefore, we refer to data-dependent delay Figure 2: Target System Architecture operations as points of synchronization in the system model. Our approach to synthesis of hardware under relative scheduling formulation has been described in detail elsewhere [6]. Briefly, the relative scheduling formulation makes it possible to achieve a data-driven dynamic schedule of operations with respect to synchronization points. Due to inherently different model and rate of computation between hardware and software modules, it is necessary to allow multiple executions of individual hardware and software models in order to achieve high system throughput. Differences in rates of computation causes variation in the rates of communication across different models. In order to facilitate this form of distributed computation, appropriate buffering and handshake mechanisms between hardware and software components are needed [7]. The software graph models are then serialized to minimize temporary register storage requirements[8]. From the serialized graph models, we generate a corresponding C-code description. The C-code is then compiled into assembly code for the target processor using existing software compilers. For simulation purposes, we use an event-driven simulator, named *Poseidon*, that performs concurrent execution of multiple functional models implemented either as a program or as application-specific hardware [7]. *Poseidon* currently supports simulation of assembly code for the DLX microprocessor [9], a RISC oriented load/store processor that supports instruction-set architecture of the commercially available Mips R3000 processor. The hardware component of system design can be simulated either before or after the structural synthesis phase. Input to *Poseidon* consists of model declarations, interconnections and corresponding interface protocols. Each model has an associated clock signal and clock cycle-time used for its simulation. Interface protocol for data-transfer between models is specified via *guarded* commands [10]. Section 5 presents an example of interface protocol specification in *Poseidon*. In this paper, we focus on the problem of generation of the software components from the graph model that is output of system partitioning as shown by the shaded area in Figure 1. ### Target Architecture and Assumptions We choose a target architecture that contains the essential elements of hardware-software systems. As shown in Figure 2, the target architecture consists of a general-purpose processor assisted by application-specific hardware components. The following lists the relevant assumptions relating to target architecture. - We restrict ourselves to use of a single re-programmable component because presence of multiple re-programmable components requires additional software synchronization and memory protection considerations to facilitate safe multiprocessing. Multiprocessor implementations also increase the system cost due to requirements for additional system bus bandwidth to facilitate inter-processor communications. We make this simplifying assumption in order to make the symmetric esis tasks manageable. - The memory used for program and data-storage may be on-board the process. However, the interface buffer memory needs to be accessible to the hardware modules directly. Because of the complexities associated with modeling hierarchical memory design, so far we have considered the case where all memory accesses are to a single-level memory, i.e., outside the re-programmable component. The hardware modules are connected to the system address and data busses. Thus all the communication between the processor and different hardware modules takes place over a shared medium. - The re-programmable component is always the bus master. Almost all processors come with facilities for bus control. On the other hand, inclusion of such functionality on the applicationspecific component would greatly increase the total hardware cost. - All the communication between the re-programmable component and the ASIC components is done over named channels whose width (i.e. number of bits) is same as the corresponding port widths used by read and write instructions in the software component. The physical communication takes place over a shared bus. The problem of encoding and sharing multiple virtual channels over a physical bus is a subject of continuing research at Stanford. - The re-programmable component contains a 'sufficient' number of maskable interrupt input signals. For purposes of simplicity, we assume that these interrupts are unvectored and there exists a predefined unique destination address associated with each interrupt signal. # 2 Implementation of Software Components As mentioned earlier, the system 'software component' is described as a set of hierarchical acyclic graph models where the vertices represent operations and edges represent dependencies. Through the use of hierarchy when describing conditionals and loops, the graph model pushes the uncertainty of conditional execution paths into the uncertainty of delay of execution of various vertices. In other words, for a given graph model all its operations at any level of hierarchy are eventually executed and the model does not contain any conditional (or cyclic) execution paths. The uncertainty of conditional executions manifests itself as the uncertainty of delay of conditional operation nodes. Example 1 shows Figure 3: Example of a graph model containing unbounded and unknown delay operations a Hardware C process description containing 3 unbounded/unknown delay operations: message-passing receive operation, conditional and loop. **Example 1:** Example of a *HardwareC* process with unbounded delay operations ``` process example (a, b, c) in port a[8]; in channel b[8]; out port c; boolean x[8], y[8], z[8] x = read(a); y = receive(b); if (x > y) z = x - y; else z = x * y; while (z >= 0) { write c = y; z = z - 1; } } ``` Notes on execution semantics: (1). A process in *HardwareC* executes concurrently with other processes in the system specification. A process restarts itself upon completion of the last operation in the process body. Thus there exists an implied outer-most loop that contains the body of the process model. In other languages, this loop can be specified by an explicit outer loop statement. (2). Operations within a process body need not be executed sequentially (as is the case in a process specification in VHDL, for example). A process body can be specified with varying degrees of parallelism such as parallel, data-parallel or sequential. Figure 3 shows the corresponding hierarchical graph model for the process example that consists of four graphs, labeled $G_0$ , $G_{c0}$ , $G_{c1}$ and $G_{loop}$ . The double-circles indicate operations with unbounded or unknown execution delays. Depending upon the points of synchronization in a model, the graph can be implemented as a single or multiple routines. In absence of multiple points of synchronization, a simple graph model can be implemented as a single routine. A hierarchical system model is implemented as a set of routines where each routine corresponds to a graph in the model hierarchy. A program implementation of a graph is referred to as a program thread due to the fact that the operations of the graph model are serialized in software. Thus, the software component consists of a set of program threads. The program threads may be hierarchically related. In addition, some program threads may need to be executed concurrently based on the concurrency among the corresponding graph models. Concurrency between program threads can be achieved by using an inter-leaved execution model as explained later in this section. A program thread may be initiated by a synchronization operation, such as a blocking receive operation (rcv\_synch). However, within each thread all operations have fixed delay. The (unknown) delay in executing the synchronization operation appears as a delay in scheduling the program thread and it is not considered a part of the thread latency. Therefore, for a given re-programmable device the latency of each thread is known statically. Referring to the example in Figure 3, there are four program threads $T_0, T_{loop}, T_{c0}$ and $T_{c1}$ . Thread, $T_0$ consists of the receive operation followed by the port read operation while the other threads consist of serialized operations in the corresponding graph bodies. | T <sub>0</sub> | $T_{c0}$ | T <sub>c1</sub> | T <sub>loop</sub> | |----------------|----------|-----------------|-------------------| | rcv_sync | h | | loop_synch | | read | ор | op | write | | detach | detach | detach | op | | | | | detach | Though only a feature of representation, this use of hierarchy is well suited to eventual implementation of the software component as a set of program routines. Since all the operations in a given graph model are always executed, the corresponding routines can be constructed with known and fixed latencies as explained earlier. As with the graph model, the uncertainty due to data-dependent delay operations is related to invocations of the individual routines. A software implementation consisting of dynamic invocations of fixed latency program threads simplifies the task of software characterization for satisfaction of data rate constraints. Satisfaction of imposed data rate constraints depends upon the performance of the software component. Even in presence of unbounded delay operations bounds on software performance can be determined based on its implementation of program threads. In the following sections, we describe a code-level transformation of the data-dependent loop operations that makes it possible to observe imposed input/output rate constraints. In cases, where such transformations are not possible, we use processor interrupts along with bounds on number of interrupts and interrupt latencies to ensure satisfaction of rate constraints. ### Rate constraints and software performance The data rate constraints on the inputs and outputs of the software component are derived based on the corresponding constraints on system inputs and outputs. A data rate constraint on an input (output) specifies a lower bound (in terms of samples/sec) on the rate at which the particular I/O data should be consumed (produced). In case of a deterministic software component, that is, software component with known and bounded execution delays, precise data rates can be computed and checked against corresponding data rate constraints. However, the presence of an unbounded-delay operation between consecutive read (write) operations requires computation of statistical measures (such as distribution of input data value and inter-arrival time) in order to determine the rate of data production and consumption. A major contribution to the variability of data rates is due to the data-dependent loop operations since the delay due to these operations consists of active execution times rather than 'busy-wait'-type delays encountered by other synchronization operations. In some cases, the need for statistical measures can be avoided by transforming the *dynamic* loop execution model into a corresponding *cyclo-static* loop execution model as follows. Consider, for example, a software component that consists of reading a value followed by a data-dependent delay operation shown in Example 2. **Example 2:** Consider a mixed implementation shown by the figure below. The ASIC component sends to the processor some data on port x at an input rate constraint of $\rho$ samples/sec. The function to be implemented by the processor is modeled by the following HardwareC process fragment. x is a boolean array that represents an integer. In its software implementation, this behavior is translated into a set of two program threads shown on the right, where one thread performs the reading operations, and the other thread consists of operations in the body of the loop. For each execution of thread T1 there are x execution of thread T2. $\Box$ For the HardwareC process in Example 2, the interval between successive executions of the read operation is determined by the overall execution time of the while statement. Due to this variable-delay loop operation, the input data rate at port x is variable and is dependent upon value of x as a function of time. For each invocation of thread T1 there are x invocations of thread T2. In other words, thread T1 can be resumed after x invocations of thread T2. In absence of any other data-dependency to operations in the loop body, thread T1 can be restarted before completing all invocations of thread T2 by buffering the data transfer from thread T1 to T2. Further, if variable x is used only for indexing the loop iterations, the need for inter-thread buffering can be obviated by accumulating value of x into a separate loop counter as shown in example below. We call such an implementation of a loop construct in software a cyclo-static loop based on the fact that an upper bound on the number of iterations of the loop body is statically determined by the data rate constraints on inputs and outputs that are affected by the data-dependent loop operation. A cyclo-static loop implementation assumes that there exists a *repeat-count* counter associated with every loop and a loop body is required to be executed as long as its repeat-count is a non-zero number. Additionally, the repeat-count is not used by the corresponding loop body for any purposes other than keeping a count of number of iterations remaining. Under such conditions, the above component can be transformed into two program threads where one thread reads port x and increments the *repeat-count* for the loop body contained in the other thread. Example 3: Transformation of data-dependent loop in Example 2 into a cyclo-static loop ``` process test(x, ...) Thread T1 Thread T2 in port x [SIZE] read loop_synch integer repeat-count = 0 add op <loop-body> detach repeat-count-- read x ; detach repeat-count = repeat-count + x while (repeat-count >= 0) <loop-body> repeat-count = repeat-count-1 ``` - (1). For each execution of thread T1 there are $\max(x,m)$ execution of thread T2 where constant m is determined by input data rate constraint, $\rho$ , on the read operation in T1 given by the relation: $\frac{1}{\rho} = (\lambda_{T1} + m \cdot \lambda_{T2}) \cdot t_{cy}$ where thread latencies $\lambda_{T1}$ and $\lambda_{T2}$ include synchronization overheads. $t_{cy}$ denotes cycle time of the processor. - (2). Initialization of variables is performed during system RESET state. In this case, we can provide a bound on the rate at which port is read by ensuring that the read thread, T1, is scheduled, say after utmost m iterations of the loop body. Due to accumulation of repeat-count additional care must be taken to avoid any potential overflow of this counter. [Generally, overflow can be avoided if m is greater than or equal to the average value of x. In the extreme, it can be guaranteed not to overflow if m is at least maximum of x which is equivalent to assigning worst-case delay to the loop operation]. ### Decoupling output data rate from software non-determinism Due to unbounded delay operations in the software component that is translated into a data-dependent number of invocations of some threads of execution, use of cyclo-static loops may not always be possible or it may lead to implementations that under-utilize the system bus bandwidths, for example, by reserving worst case data-transfer rates for some I/O operations. With concurrent threads, to a certain extent, we can insulate the input/output data rates from variable delays due to other threads by buffering the data transfers between threads. Thus, the inter-thread buffers hold the data in order to facilitate multiple executions among program threads. Threads containing specific input/output operations are scheduled at fixed intervals via processor interrupt routines as shown in the Example 4 below. In this scheme, finite-sized buffers are allocated for each data-transfer between program threads. In order to ensure the input/output data rates for each thread, we associate a timer process with every I/O operation that interrupts the processor once the timer expires. The associated interrupt service routine performs the respective I/O operation and restarts the timer. In case a data is not ready the processor can send the previous output and (optionally) raise an error flag. **Example 4:** Implementation of program threads in Example 2 with inter-thread buffer. | Thread T2 | Timer Process | T1 (interrupt service routine) | |----------------------------------------------------------------|---------------------------------------------------------------------|---------------------------------------------------------| | <pre>loop_synch <loop_body> x = x - 1 detach</loop_body></pre> | <pre>timer per clock tick if (timer == 0) interrupt</pre> | read x<br>load timer = CONSTANT<br>enqueue (x) on dFIFO | Thread T1 is now implemented into an interrupt service routine that is invoked at each expiration of the timer process. Timer process represents a processor timer (or an external hardware timer) that is used to generate interrupts at regular intervals. The interruption interval constant is determined by the rate constraint and latencies of interrupt service routines. dFIFO in the interrupt service routine refers to the buffer between threads T1 and T2. This scheme is particularly helpful in case of widely non-uniform rates of production and consumption. In this case, data transfer from processor to ASICs is handled by the interrupt routines thereby leading to a relatively smaller program size for the cost of increased latencies of the interrupt service routines. Section 6 presents implementation costs and performance of this scheme. Next we consider the problem of software synchronization and scheduling mechanisms to make a hardware-software system design feasible. # 3 Control Flow in the Software Component Our model of software component relies on the sequential execution of each thread of execution. Concurrency between threads is achieved through interleaved execution of the threads. Since multiple program threads may be created out of a graph model each starting with an unbounded-delay operation, therefore, software synchronization is needed to ensure correct ordering of operations within the program threads and between different threads. In presence of multiple threads of executions due branching and loop operations, the control flow between threads is represented by a directed flow graph the nodes of which are individual program threads and edges indicate control dependencies. Since the total number program threads and their dependencies are known statically, the programs threads are constructed to observe these dependencies. The threads are identified by unique tags. A run-time FIFO, called control FIFO, maintains the identifier tags of the threads that are ready to run based on control flow (while they may still be waiting for data). Before detaching, each thread performs one or more *enqueue* operations to the FIFO for its successor threads as shown in Example 5 below. **Example 5:** Inter-thread control dependencies. <br/><body> refers to the (linearized) set of operations from the corresponding graph models. Control dependency from thread T1 to T2 is built into the code of T1 by the enqueue operation on the control FIFO. <a href="https://doi.org/10.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-t0.1001/j.j.gov/press-refered-to-t0.1001/j.j.gov/press-refered-t0 A thread dependency on more than one predecessor thread (that is a multiple indegree (fanin) node in the flow graph) is observed by ensuring multiple enqueue operations for the thread by means of a counter. For example, a thread node with a indegree of 2 would contain a synchronization preamble code as indicated by the while statement shown in Example 6 below. **Example 6:** Thread with multiple input control dependencies. 0 Control transfer for multiple fanin nodes entails program overheads that add to the latency of the corresponding threads. For this reason, an attempt should be made to reduce multiple dependencies for a program thread through a careful dependency analysis. In case of multiple outdegree nodes in the flow graph, a necessary serialization among enabling of successor threads occurs. However, this serialization is of little significance since there exists only a single re-programmable component. # 4 Concurrency in Software Through Interleaving The problem of achieving concurrency through interleaved execution in the software component was considered in [7]. The different program threads can be implemented as program subroutines that operate under a global task scheduler (or the *main* program). It was shown that the overheads in terms of number of cycles for each inter-thread control transfer operation can be reduced significantly by placing program subroutines in a co-operative, rather than hierarchical, relationship to each other by implementing them as *Coroutines* [11]. For the DLX microprocessor a coroutine implementation of program threads constitutes an overhead of 19 cycles. Another scheme for implementation of the software component is to construct a program routine using the method of *Description by cases*. In this method, we attempt to construct a single program with a unique case assignment for each thread (in a rather large conditional in the final program). A set of global state registers is used to store the state of execution of each thread. The overheads due to this scheme depends strongly upon the instruction set architecture of processor. For the DLX microprocessor, the overhead amounts to 35 cycles for each inter-thread transfer operation. In case of so-called 'CISC' processors this scheme reduces the overhead by reducing amount of ALU operations in favor of a slight increase in memory input-output operations. ## 5 Hardware-Software Synchronization Due to the serial execution of the software component a data transfer from hardware to software must be explicitly synchronized. Using a polling strategy, the software component can be designed to perform pre-meditated transfers from the hardware components based on its data requirements. This requires static scheduling of the hardware component. In cases where the software functionality is communication limited, that is, the processor is busy-waiting for an input-output operation most of the time, such a scheme would be sufficient. Further, in absence of any unbounded-delay operations, the software component in this scheme can be simplified to a single program thread and a single data channel since all data transfers are serialized. However, this would not support any branching, no reordering of data arrivals since dynamic scheduling of operations in hardware would not be supported. In order to accommodate different rates of execution of the hardware and software components, and due to unbounded delay operations, we look for a *dynamic* scheduling of different threads of execution. Such a scheduling is done based on availability of data. One mechanism perform such scheduling is by means of a control FIFO mentioned in the previous section which attempts to enforce the policy that data items are consumed in the order in which they are produced. The hardware-software interface consists of data queues on each channel and a (control) FIFO that holds the identifiers for the enabled program threads in the order in which their input data arrives. The control FIFO depth is sized with the number of threads of execution, since a thread execution is stalled pending availability of the requested data. Example 7 below describes the hardware-software interface. #### Example 7: Hardware-Software Interface. Figure 4 shows schematic connection of the FIFO control signals for a single data queue. In this example, the data queue is memory mapped at address 0xee000 while the data queue request signal is identified by bit 0 of address 0xee004 and enable from the microprocessor (up\_en) is generated from bit 0 of address 0xee008. The control logic needed for generation of the enqueue is described by a simple state transition diagram shown in Figure 5. The control FIPO is ready to enqueue (indicated by gn = 1) a process id if the corresponding data request (q rq) is high and the process has enabled the thread for execution (up en). Signal up ab indicates completion of a control FIPO read operation by the processor. Figure 4: Control FIFO schematic In case of multiple indegree queues, the enqueue\_rq is generated by OR-ing the requests of all inputs to the queues. In case of multiple-outdegree queues, the signal dequeue\_rq is generated also by OR-ing all dequeue requests from the queue. The control FIFO and associated control logic can be implemented either in hardware as a part of the ASIC component or in software. In case the control FIFO is implemented in software the FIFO control logic is no longer needed since the control flow is already in software. In this case, the q\_rq lines from data queues are connected to processor unvectored interrupt lines, where the respective interrupt service routines are used to enqueue the thread identifier tags into the control FIFO. During the enqueue operations the interrupts are disabled in order to preserve integrity of the software control flow. The hardware-software interface protocol is described using guarded commands as shown in the Example 8 below. **Example 8:** Specification of the control FIFO based on two threads of execution. ``` queue [2] controlFIFO [1]; queue [16] line_queue [1], circle_queue [1] when ((line_queue.dequeue_rq+ & !line_queue.empty) & !controlFIFO.full) do controlFIFO enqueue $2; when ((circle_queue.dequeue_rq+ & !circle_dequeue.empty) & !controlFIFO.full] do controlFIFO enqueue $1; when (controlFIFO.dequeue_rq+ & !controlFIFO.empty) do controlFIFO dequeue dlx.0xff000([:0]; ``` In this example, two data queues with 16 bits of width and 1 bit of depth, line\_queue and circle\_queue, and one queue with 2 bits of width and 1 bit of depth controlFIFO are declared. The guarded commands specify the conditions on which the number 1 or the number 2 are enqueued – here, a '+' after a signal name means a positive edge and a '-' after the signal means a negative edge. The first when condition states that when a dequeue request for the queue line\_queue comes and this queue is not empty and the queue controlFIFO is not full, then enqueue the value 2 (representing identifier for a corresponding program thread that consumes data from the line queue) into the controlFIFO. Figure 5: FIFO control state transition diagram ### 6 Results In order to illustrate the effect of software and hardware-software interface implementation, we present design of a graphics controller that outputs pixel coordinates for lines and circles given the end coordinates (and radius in case of circle). The final implementation of the system design consists of line and circle drawing routines in the software component while the ASIC hardware performs initial coordinate generation and coordinate transfer to the video ram. The software component consists of two threads of execution corresponding to the line and circle drawing routines. Both program threads generate coordinates that are used by the dedicated hardware. The data-driven dynamic scheduling of program threads is achieved through the use of a 3-deep control FIFO. The circle and line drawing program threads are identified by id numbers 1 and 2 respectively. The program threads are implemented using the coroutine scheme described in Section 4. Example 9 shows the main program in case of a hardware control FIFO implementation. Like the line and circle drawing routines, this program is compiled using existing C-compiler. Example 9: Main program of the graphics controller software component ``` #include 'transfer_to.h' int lastPC(MAXCOROUTINES) = (scheduler, circle, line, main); int current = MAIN; int *controlFIFO_out = (int *) 0xaa0000; int *controlFIFO = (int *) 0xab0000; int *controlFIFO_outak = (int *) 0xac0000; #include *line.c* #include *circle.c* void main() ( resume (SCHEDULER); int nextCoroutine: void scheduler() { resume(LINE); resume (CIRCLE); while(!RESET) ( do { nextCoroutine = *controlFIFO; ) while ((nextCoroutine & 0x4) != 0x4); resume(nextCoroutine & 0x3); } } ``` | Scheme | Program | Synchronization | Input data rate-1 | Output data rate-1 | | | | |---------------------|---------|-----------------|---------------------|---------------------|------|-------|------| | | size | overhead | (cycles/coordinate) | (cycles/coordinate) | | e) | | | | | delay | | line circle | | cle | | | | (bytes) | (% cycles) | | ave. | peak | ave. | peak | | Hardware CFIFO | 5972 | 0 | 81 | 535.2 | 502 | 76.4 | 30 | | Software CFIFO | 6588 | 50 | 95 | 749.5 | 407 | 106.8 | 31 | | Opt. Software CFIFO | 6360 | 29.4 | 95 | 651 | 330 | 94 | 31 | Table 1: A comparison of control FIFO implementation schemes Table 1 compares the performance of different program implementations using control FIFO either in hardware or in software component. The hardware implementation of a control FIFO with a fanin of 3. when synthesized by program Hebe and mapped to LSI 10K library of gates using program Ceres, costs 228 gates. An equivalent software implementation adds 388 bytes to the overall program size of the software component. Note that the cost of hardware control FIFO increases as the number of data queues increases. On the other hand, software implementation of control FIFO using interrupt routines to perform the control FIFO enqueue operations offers lower implementation cost for a 50% increase in the thread latencies. In case of software implementation of control FIFO, the enqueue and dequeue operations are described in C which are then compiled for DLX assembly. The overhead due to enqueue and dequeue operations is reduced further by manually optimizing the assembly code for enqueue and dequeue operations as indicated by the entry 'Opt. Software CFIFO'. This one-time optimization of enqueue and dequeue routines, which does not affect the C-code description of the program threads, leads to a reduction in the program size and program thread overhead to 29.4% thereby improving the rate at which the data is output. Note that data input and output rates have been expressed in terms of number of cycles it takes to input or output a coordinate. Due to the data-dependent behavior of program threads, the actual data input and output rates would vary with respect to value of the input data. In this example simulation, the input rate has been expressed for a simultaneous drawing of a line and 5 pixel radius with width of 1 pixel each and the results are accurate to one pixel. An input rate of 81 cycles/coordinate corresponds to approximately 0.25 million samples/sec for a processor running at 20 MHz. Similarly, a peak circle output rate of 30 cycles/coordinate corresponds to a rate of 0.67 million samples/sec. An implementation of line and circle drawing program threads for the graphics controller example using inter-thread buffering and I/O timers mentioned in Section 2, comes to a total program size of 5788 bytes for a 62% overhead delay per program thread. Though instructive, the line and circle drawing algorithms are simple enough that their software implementation do not fully exploit the potential of a mixed implementation. However, a more computationally intensive operation like spline generation or operations involving floating point arithmetic would greatly benefit by their program implementations. As an example of complex system design, we present design of an ethernet-based network co-processor. This processor is modeled as a set of 13 concurrently executing processes which interact with each other by means of 24 send and 40 receive operations. The | Example | Program implementation | Program size bytes | Max delay cycles | |---------------------|--------------------------------|--------------------|------------------| | Graphics controller | Coroutines, Hardware CFIFO | 5972 | 806, 859 | | Network coprocessor | Desc. by cases, Hardware CFIFO | 8572 | 56 | Table 2: Software component for system examples total HardwareC description consists of 1036 lines of code. A mixed implementation using a single program containing 17 cases using the method of description by cases for the software component takes 8572 bytes of program and data storage but it reduces the overall cost of the application-specific component by 23% from 10882 gates to 8394 gates using LSI logic 10K library of gates. The mixed implementation delegates much of execution unit and operations relating to frame assembly and dis-assembly to a software component. The mixed implementation meets the imposed performance requirements like maximum propagation delay of 46.4 $\mu$ s, maximum jam time of 4.8 $\mu$ s, minimum interframe spacing of 67.2 $\mu$ s and input bit-rate for a 10 Mb/sec operation using ethernet protocol. Table 2 lists characteristics of the software component for the graphics controller and the network co-processor. ## 7 Summary and Conclusions Presence of re-programmable processors in target architectures promises a practical approach towards realizing complex system designs without associated increase in the cost of application-specific components required to implement the system functionality. Existing software compilers and presence of hardware special-purpose units like floating point, vector and graphics processing on-board the re-programmable component makes it possible specify system functionality using operations like real and floating-point operations. Where possible, such operations can be delegated to the software component without incurring the cost of implementing these functions in the application-specific hardware components. Synthesis of systems containing both general-purpose re-programmable as well as application-specific components can be formulated as a hardware-software co-design problem due to two predominantly different computation models used by the system components. Software component design for such systems poses interesting problems due to inherently serial nature of program execution that must interact with concurrently operating hardware components. In our approach to system synthesis, the software component is implemented as a set of program routines, called threads. Concurrency between program threads, achieved by inter-leaved execution of threads, preserves concurrency inherent in the system model. Further, dynamic scheduling of fixed latency threads avoids unnecessary serialization of operations in a graph model for generation of the software component. The program routines corresponding to threads can be implemented as subroutines or coroutines. A coroutine implementation reduces overheads by treating all routines symmetrically, therefore, the context information needed to be saved/restored is reduced. However, the necessity to embed control flow into the individual coroutines reduces this gain somewhat. At the same time, the ability to do intelligent dependency analysis on the system graph model can reduce this overhead. There is a tradeoff between when graph serialization is done and when program threads are generated. We construct program flow graphs that consist of potential threads as basic blocks based on points of synchronization in the system graph model. A completely serialized graph model would loose the advantage of coroutine implementation and may benefit by the subroutine implementation instead. However, the loss of all concurrency may make eventual software component in feasible with respect to the imposed data-rate constraints. We have proposed a scheme to achieve hardware-software synchronization. We have demonstrated feasibility of control FIFO-based hardware-software synchronization schemes where the FIFO control can be implemented either as a dedicated hardware or as a program. The software implementation of control FIFO reduces the size of hardware component of system design, but it increases program size and adds to the latencies of program threads. This makes the input data rate about 15% slower in case of the graphics controller example. Depending upon the objective of system synthesis either of the hardware and software alternatives can be selected and simulated using program *Poseidon*. Generally, an implementation that aims to rapidly prototype the system design would favor software component of the system design for a small loss of performance. Using the synchronization scheme proposed, we are able to synthesize and simulate mixed system designs. As example, we have presented designs of a graphics controller and a network coprocessor that employ software components to achieve part of the system functionality. Even with the simplifying assumptions relating to the target architecture, the problems of accurately characterizing software component and its synthesis are challenging problems. This work takes a first step in solving the problem of system software synthesis. Work is in progress to extend the model to include the effects of hierarchical memory schemes and reduction of communication overheads by using channel sharing and data encoding schemes. ### 8 Acknowledgments This research was sponsored by NSF-ARPA, under grant No. MIP 8719546 and, by DEC jointly with NSF, under a PYI Award program, and by a fellowship provided by Philips/Signetics. We acknowledge also support from ARPA, under contract No. J-FBI-89-101. The second author was partially supported by CNPq-Brazil under contract 200212/90.7. ### References - 1] R. K. Gupta and G. D. Micheli, "System-level Synthesis Using Re-programmable Components," in *Proceedings of the European Design Automation Conference*, pp. 2-7, Mar. 1992. - [2] J. Rabaey, H. D. Man, and et. al., Silicon Compilation, D. Gajski, editor, ch. Cathedral II: A Synthesis System for Multiprocessor DSP Systems, pp. 311-360. Addison Wesley, 1988. - [3] R. Camposano and W. Rosenstiel, "Synthesizing Circuits from Behavioral Descriptions," *IEEE Transactions on CAD/ICAS*, vol. 8, no. 2, pp. 171–180, Feb. 1989. - [4] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level: The System Architect's Workbench. Kluwer Academic Publishers, 1990. - [5] G. D. Micheli, D. C. Ku, F. Mailhot, and T. Truong, "The Olympus Synthesis System for Digital Design," *IEEE Design and Test Magazine*, pp. 37-53, Oct. 1990. - [6] D. Ku and G. D. Micheli, Constrained Synthesis and Optimizations of Digital Circuits from Behavioral Specifications. Kluwer Academic Publishers, 1992. - [7] R. K. Gupta, C. C. Jr., and G. D. Micheli, "Synthesis and Simulation of Digital Systems Containing Interacting Hardware and Software Components," in *Proceedings of the* 29<sup>th</sup> Design Automation Conference, June 1992. - [8] A. V. Aho, R. Sethi, and J. D. Ullman, *Compilers: Principles, Techniques and Tools*, ch. Code Generation, pp. 557-565. Addison Wesley, 1986. - [9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, ch. 3. Morgan-Kaufmann, 1990. - [10] E. W. Dijkstra, "Guarded Commands, Nondeterminacy, and Formal Derivation of Programs," CACM, vol. 18, no. 8, pp. 453–457, Aug. 1975. - [11] M. E. Conway, "Design of a Separate Transition-Diagram Compiler," Comm. of the ACM, vol. 6, pp. 396-408, 1963.