# Transmitter and Channel Design for Multi-Chip Communication Interfaces

DISSERTATION submitted for the degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften)

submitted by MSc.EE. Muhammad Waqas Chaudhary

submitted to the Naturwissenschaftlich-Technische Fakultät (Faculty of Science and Technology) of the Universität Siegen, Siegen 2020. Printed on age-resistant, wood-free and acid-free paper.

Supervisor and first reviewer: Prof. Dr. Bhaskar Choubey, Universität Siegen

Second reviewer: Prof. Dr.-Ing. Rainer Kokozinski, Universität Duisburg-Essen

Date of the oral examination: 25 February 2021

## Abstract

Integrated circuit (IC) die sizes have increased continuously over the last few decades, driven by Moore's law. A large system on chip (SOC) contains many complex analog and digital blocks that must run at high clock rates to support the needs of today's applications. These large SOCs suffer from the global interconnect delay bottleneck and from increased design complexity. To deal with this problem, the SOC can be divided into smaller chips that are placed together in a multi-chip module (MCM) or in a 2.5D interposer system. The chips must communicate with each other, which poses the challenges of transmitter and channel design along with system optimization.

This work addresses three challenges of multi-chip system design: (i) transmitter design for moderate speed unterminated signalling and for high speed multi-Gb/s terminated signalling, (ii) channel analysis and design for minimum area usage while meeting the bandwidth and energy requirements of memory and high speed interfaces, and (iii) design methodologies for transmitter and channel co-design, together with a design flow for an optimum memory interface in multi-chip systems.

This work tackles the transmitter design challenge for multi-chip systems by offering two types of transmitters: an unterminated low swing driver for moderate data rates, and a high speed terminated transmitter for multi-Gb/s communication interfaces. Both transmitters are designed in 22 nm FDSOI technology and taped out. Channel analysis is carried out for various widths, spacings and lengths of interconnects in 2.5D silicon interposer technology. The signal integrity analysis of memory and serial interfaces (SERDES) guides the designer towards the channel width and spacing that give optimum energy or area metrics. Two design methodologies are presented in this work: the first is a co-design of a current mode logic (CML) differential driver and the interposer for a minimum energy and area performance metric; the second is a design flow for optimum memory interface design that chooses the right memory and integration technology based on given cost and bandwidth constraints. The proposed transmitters, channel analysis and suggested methodologies can be used by industry and the research community to design energy and area efficient multi-chip interfaces.

# Kurzfassung

The size of integrated circuits (ICs) has increased continuously over the last decades owing to Moore's law. Large system on chip (SOC) solutions contain many complex analog and digital blocks that must run at high clock rates to meet the demands of today's applications. Among the greatest challenges of such SOCs are the delays caused by very long interconnects and the increased design complexity. To solve these problems, the SOC can be divided into smaller chips that are placed together in a multi-chip module (MCM) or on a 2.5D interposer based system. The required communication between the chips brings new challenges in transmitter and channel design and in system optimization.

This work addresses the three challenges of multi-chip system design: (i) the design of transmitters for unterminated signalling at moderate speed and for terminated signalling at high speed with multiple Gb/s; (ii) channel analysis and design with minimum area usage while respecting the bandwidth and energy requirements that memory and high speed interfaces impose on the system; (iii) design methods for the co-design of transmitter and channel, and a design flow for an optimum memory interface in multi-chip systems.

This work addresses the transmitter design challenges of multi-chip systems with two types of transmitters: an unterminated low swing driver for moderate data rates, and a terminated high speed transmitter for a multi-Gb/s communication interface. Both transmitters were designed and fabricated in 22 nm FDSOI technology. The channel analysis covers various widths, spacings and lengths of interconnects on a 2.5D silicon interposer. The signal integrity analysis of the memory and serial interfaces (SERDES) yields the optimum channel width and spacing with respect to energy or area metrics. Two design flows are presented in this work: first, the co-design of a current mode logic (CML) differential driver and the interposer for a minimum energy and area performance metric; second, a design flow for optimum memory interfaces that selects the right memory and integration technology based on given cost and bandwidth constraints. The proposed transmitters, the channel analysis and the proposed methods can be used by industry and the research community to design energy and area efficient multi-chip interfaces.

To my Ammi and Abu

## Acknowledgements

I would like to say many thanks to my supervisor Dr. Bhaskar Choubey for his extensive support and guidance in research and thesis writing. He always had time for me and answered all my questions. This thesis would never have been possible without him and his efforts to help me succeed. I would also like to thank Dr. Rainer Brück for his supervision and guidance during the initial years.

The complete work in this thesis was performed at Fraunhofer Institute IIS/EAS Dresden under the leadership of Mr. Andy Heinig. I would like to thank Andy for his support during these years. He gave me the time and environment needed for research, and he was always there to accommodate my requests for any material and equipment. Fraunhofer IIS/EAS has been a great place to do research. I would also like to thank my colleagues Alexander Steinhardt and Robert Trieb for their help with the interconnect models. They always responded quickly to my questions about the simulations in 3D solvers. Alexander has also been very helpful with translations between English and German.

I must say that I could not have done any of this work without the support of my late mother (may her soul rest in peace), who supported me throughout my career from early school until she departed this world. I am forever indebted to her and to my father for their extraordinary love and support, which has been the force that enabled me to reach this stage. During my stay in Germany, my wife has supported me through thick and thin, whatever the hardship. She has been an anchor for me during these years and has helped me to stay strong throughout. I am extremely grateful to her for her support, and this thesis would not have been possible without her. My sisters and my brother have also supported me with their best wishes throughout my educational career, and I would like to thank them for their love, support and encouragement.

# Contents

- Abstract
- Kurzfassung
- Dedication
- Acknowledgements
- 1 Introduction
  - 1.1 Future of Scaling
  - 1.2 Why Multi-chip systems (MCS)?
    - 1.2.1 Challenges in PCB
    - 1.2.2 More than Moore
    - 1.2.3 Challenges in SOC
    - 1.2.4 Global Interconnect Delay
  - 1.3 MCS Challenges and Constraints
    - 1.3.1 Multi-chip Module (MCM)
    - 1.3.2 System in Package (SiP)
    - 1.3.3 Multiple dies on Interposer in Package (2.5D)
    - 1.3.4 Three dimensional Integration (3D)
  - 1.4 Multi-chip Communication Interfaces
    - 1.4.1 Transmitter ($T_x$)
    - 1.4.2 Channel
    - 1.4.3 Receiver ($R_x$)
  - 1.5 Conclusion
- 2 Literature Survey
  - 2.1 Transmitter design
    - 2.1.1 State of the Art
    - 2.1.2 Weaknesses in State of the Art
    - 2.1.3 Transmitter Design Problem Update
  - 2.2 Channel and Interconnect
    - 2.2.1 State of the Art
    - 2.2.2 Weaknesses in State of the Art
    - 2.2.3 Channel Design Problem Update
  - 2.3 Co-design Methodologies
    - 2.3.1 State of the Art
    - 2.3.2 Weaknesses in State of the Art
    - 2.3.3 Design Methodology Problem Update
  - 2.4 Conclusion
- 3 Transmitter
  - 3.1 Problem 1: BOW interface transmitter
    - 3.1.1 Estimation of Load Capacitance ($C_L$)
    - 3.1.2 Capacitive load Slew rate and Bandwidth
    - 3.1.3 Targeted Channels
    - 3.1.4 Termination required?
    - 3.1.5 Topology or Architecture Choice
    - 3.1.6 Transmitter Design
    - 3.1.7 SSTL-LCM Driver & Pre-driver
    - 3.1.8 HSUL Driver and Pre-driver
    - 3.1.9 High Speed Digital Blocks
    - 3.1.10 Clock Distribution and Buffer Sizing
    - 3.1.11 Clock Generator
    - 3.1.12 Simulation Results and Analysis
    - 3.1.13 Comparison with State of the Art
  - 3.2 Problem 2: Driver Optimization Example
    - 3.2.1 Analysis and Results
  - 3.3 Conclusion
- 4 Measurement Results
  - 4.1 Bunch of Wires Transmitter
    - 4.1.1 Test Setup
    - 4.1.2 Results
  - 4.2 Source Follower Driver Measurements
    - 4.2.1 Test Setup
    - 4.2.2 Results
    - 4.2.3 400 Mbps Test Limitation
  - 4.3 Conclusion
- 5 Channel
  - 5.1 Channel design for DDR3 2.5D Interface
    - 5.1.1 Silicon Interposer Channel Characterization
  - 5.2 Eye-diagram Mask based DDR3 Signal Analysis
    - 5.2.1 Minimum Channel Area and DDR3 Energy Values
    - 5.2.2 PCB channel and DDR3 Eye-diagram Analysis
    - 5.2.3 PCB vs 2.5D DDR3 Interface Comparison
    - 5.2.4 DDR4 Memory Standard Discussion
  - 5.3 Channel Design for High Speed Interfaces
    - 5.3.1 Channel versus Transmitter Variation Comparison
  - 5.4 Conclusion
- 6 Design Methodologies
  - 6.1 Holistic Multi-Chip Interface Design
    - 6.1.1 Design Flow Description
    - 6.1.2 Example of low resistivity silicon substrate interface
    - 6.1.3 CML Front-End and Silicon substrate Example
  - 6.2 Memory-CPU MCS Design Methodology
    - 6.2.1 Introduction to Memory standards
    - 6.2.2 Design Algorithm
    - 6.2.3 Memory Interface 400 Gb/s Design Example
    - 6.2.4 Final Remarks
  - 6.3 Conclusion
- 7 Conclusion
  - 7.1 Challenges
  - 7.2 Research Summary and Conclusions
  - 7.3 Research Outlook
    - 7.3.1 General work
    - 7.3.2 BOW Transceiver
    - 7.3.3 Channel Design
    - 7.3.4 MCS Holistic Design Methodology
    - 7.3.5 MCS CPU-Memory Interface
- A IEEE © MWSCAS 2020 Paper
- B IEEE © EPEPS 2020 Paper
- Bibliography

## Chapter 1

## Introduction

Electronic products are an essential part of human life today. They are used in daily life for almost every purpose, e.g. communication, office work, household tasks, travel, information, entertainment and leisure. Professionals in the service sector and in industry, e.g. medicine, military, construction and manufacturing, rely on electronic devices. All of this was made possible by the invention of the transistor in 1947 [1].

In its early stages, the transistor was mainly used as a switch, i.e. its output changed from one state to another in response to a change at its input. The transistor became mainstream with the introduction of integrated circuit technology (ICT) [2] and very large scale integration (VLSI) [3]. Although other transistor technologies such as bipolar junction transistors (BJT) are still used in specific applications, complementary metal oxide semiconductor (CMOS) is the most widely used technology. The reason for the success of CMOS was the shrinking of the channel length according to Moore's law [4]. The minimum transistor channel length in CMOS has reduced dramatically, from $3\,\mu\text{m}$ in 1977 to 10 nm in 2019 [5].

Figure 1.1 shows a cross-section of an NMOS transistor in CMOS technology. B, S, G and D denote the bulk, source, gate and drain terminals, respectively, and L is the channel length between the n-doped drain and source regions. The bulk is generally a p-doped silicon substrate connected to a low voltage through the bulk terminal (B). During transistor operation, a voltage applied at the gate forms a channel between drain and source. A potential difference between the drain and source terminals then causes charge carriers (electrons/holes) to flow through this channel. Intuitively, as the channel length decreases, the time the carriers need to travel between source and drain decreases as well. This leads to higher frequency performance and eventually faster circuits [3].

The rest of this chapter is divided into five sections. Section 1.1 describes the speed bottleneck caused by continuous technology scaling. Section 1.2 explains the need for multi-chip systems (MCS) and why a transition from PCB or SOC solutions to multi-die solutions is required. Section 1.3 explains the new challenges in the design of multi-chip systems and states the design problems along with the common theme of the thesis. Section 1.4 describes the typical communication interfaces in PCB based systems and their generic MCS counterparts. Section 1.5 concludes the chapter.



Figure 1.1: Single NMOS structure on P-Substrate in CMOS technology

## 1.1 Future of Scaling

Transistor scaling has led to integrated circuits that contain millions of transistors and operate at high frequencies with small power supply voltages. During the last few decades, industry has continuously decreased the transistor channel length by $0.7\times$ every 2 years [4] to make circuits faster at lower supply voltages. For a long time, this reduction in channel length translated directly into faster circuits, because the other factors in chip design were not critical enough to disturb the trend. The scaling trend of transistor length in Intel chips [5] over the years is shown in Figure 1.2.
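The scaling figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the two endpoint values from [5] (3 µm in 1977, 10 nm in 2019) can be treated as a single smooth exponential trend, which real technology nodes do not follow exactly.

```python
def area_factor(linear_shrink: float) -> float:
    """Area scales with the square of a linear dimension."""
    return linear_shrink ** 2


def implied_shrink_per_period(l_start_nm: float, l_end_nm: float,
                              years: float, period_years: float = 2.0) -> float:
    """Average linear shrink per period implied by two endpoint channel lengths."""
    periods = years / period_years
    return (l_end_nm / l_start_nm) ** (1.0 / periods)


# A 0.7x linear shrink roughly halves the area of a device.
print(f"area factor per generation: {area_factor(0.7):.2f}")

# Average shrink per 2 years implied by the 1977 and 2019 endpoints.
avg = implied_shrink_per_period(3000.0, 10.0, years=2019 - 1977)
print(f"average shrink per 2 years: {avg:.2f}")
```

The first number shows why a $0.7\times$ linear shrink is associated with a doubling of transistor density. The second shows that the long-run average between the two endpoints is somewhat gentler than the nominal $0.7\times$ per generation, which is expected when the cadence between nodes is not always two years.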

As the channel length shrank and the number of transistors per IC increased, the interconnect between transistors was scaled down as well. The reduced height and width of the wires increased the interconnect resistance, which eventually made the wires slower because of their higher RC delay. Transistors therefore switch faster individually but are eventually bottlenecked by the slower interconnect [6]. In digital circuits, the travel time of a signal from one gate to another is called "propagation delay" or "gate delay" [3]. The signal transfer time from the input to the output of a gate is called "intrinsic delay", and it has decreased with scaling. Interconnect delay, however, has increased by an average of $1.26\times$ with each generation [7], as shown in Figure 1.3.
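The effect of a shrinking wire cross-section on resistance can be illustrated with the textbook relation $R = \rho L / (W H)$. The sketch below is an idealized model, not the data behind Figure 1.3: the wire dimensions are illustrative assumptions, the resistivity is the bulk value for copper, and effects such as surface scattering and barrier layers are ignored.

```python
RHO_COPPER = 1.68e-8  # ohm*m, bulk copper at room temperature


def wire_resistance(rho: float, length: float, width: float, height: float) -> float:
    """Resistance of a rectangular wire, R = rho * L / (W * H), in ohms."""
    return rho * length / (width * height)


# Illustrative (assumed) wire: 1 mm long, 100 nm wide, 200 nm high.
length, width, height = 1e-3, 100e-9, 200e-9
shrink = 0.7

r_before = wire_resistance(RHO_COPPER, length, width, height)
# Global wire: cross-section shrinks, length stays the same.
r_global = wire_resistance(RHO_COPPER, length, shrink * width, shrink * height)
# Local wire: cross-section and length both shrink.
r_local = wire_resistance(RHO_COPPER, shrink * length, shrink * width, shrink * height)

print(f"global wire resistance ratio: {r_global / r_before:.2f}")
print(f"local wire resistance ratio:  {r_local / r_before:.2f}")
```

Under these assumptions a wire of fixed length roughly doubles its resistance per generation ($1/0.7^2 \approx 2.04$), while a wire that also shrinks in length grows by about $1/0.7 \approx 1.43$. This is why long global wires are hit hardest by scaling; the model is not meant to reproduce the $1.26\times$ average reported in [7].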

Figure 1.2: Intel transistor length scaling trend [5]

Figure 1.3: Intel interconnect scaling delay trend [7]

For large chips with millions of transistors, the critical factor is the global interconnect delay, which defines the maximum working frequency of the chip. Large chips such as processors have been bottlenecked by this global interconnect delay. To increase processor performance further, IBM proposed a new architecture in 2001 [8]. Its POWER4 processor was a new IC with multiple small processing cores, so that several tasks could run in parallel on multiple cores of the same chip. Furthermore, on-chip memories were placed next to and directly connected to their respective cores in a single system on chip (SOC). Over time, several analog and digital blocks were added to large SOCs, which tremendously increased their area, complexity and cost.



Figure 1.4: Why we need multi-chip systems (MCS)

## 1.2 Why Multi-chip systems (MCS)?

Traditionally, systems are designed with packaged integrated circuits (ICs) placed on a printed circuit board (PCB). The individual packages communicate through their input-output (IO) pins over metal interconnect on the PCB substrate. As the number of packages on the PCB and the required system performance grew, the challenges in PCB design kept increasing. At the same time, the race for integration and the progress in lithography paved the way for integrating all system blocks on a single chip, i.e. a system-on-chip (SOC). However, the growing performance requirements made SOCs very large, which increased the design complexity and the time to market. These challenges in the design of PCBs and SOCs for a given performance need to be discussed in detail in order to understand clearly the need for, and the design constraints of, multi-chip systems (MCS). A pictorial representation of the movement towards MCS is shown in Figure 1.4, and the challenges are described in detail below.

### 1.2.1 Challenges in PCB

PCB based systems contain many packages on a single board. These packages include the microprocessor, dynamic random access memory (DRAM), power regulators, frequency generators and phase locked loops (PLL). The size of the PCB depends on the system memory, power, memory bandwidth, maximum energy consumption, maximum total area allowed and the total cost of running the system over a certain time period. With the rise of mobile and handheld systems, it has become necessary to keep the system board as small as possible so that it fits inside the body of the product.

Consider a PCB with a microprocessor in deep sub-micron technology and external dynamic random access memory (DRAM) packages. For a typical application in today's flagship mobile systems, the minimum external RAM needed is about 8-16 gigabyte (GB) [9]. Modern processors have reached billions of transistors per chip [10], and their size has also increased, with typical values around $250\,\text{mm}^2$ [11]. A low power double data rate (LPDDR3) memory package with a capacity of 16 Gb (2 GB) measures $15 \times 15\,\text{mm}^2$ [12]. For 16 GB of memory, a PCB based solution would therefore require at least 8 RAM packages around the CPU. Although there are ways around this problem, such as increasing the number of ranks in memory access, ideal parallel read/write access to all memories at the same time would require $32 \times 64$ data channels, i.e. 2048 interconnects for the data lines alone. Such a PCB based system would need a minimum area of $8 \times 225 + 250 = 2050\,\text{mm}^2$.
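The arithmetic of this example can be reproduced in a short script. The sketch below uses only the numbers quoted in the paragraph; the function and parameter names are hypothetical, and the estimate counts package and CPU footprints only, ignoring routing area, passives and keep-out zones.

```python
import math


def pcb_estimate(total_gb: int = 16, package_gb: int = 2,
                 package_side_mm: float = 15.0, cpu_area_mm2: float = 250.0,
                 channels: int = 32, bits_per_channel: int = 64):
    """Return (number of DRAM packages, data lines, minimum footprint area in mm^2)."""
    packages = math.ceil(total_gb / package_gb)          # 16 GB / 2 GB per package
    package_area = package_side_mm ** 2                  # 15 mm x 15 mm = 225 mm^2
    data_lines = channels * bits_per_channel             # 32 x 64 data channels
    min_area = packages * package_area + cpu_area_mm2    # footprints only
    return packages, data_lines, min_area


packages, data_lines, min_area = pcb_estimate()
print(f"{packages} packages, {data_lines} data lines, {min_area:.0f} mm^2 minimum area")
```

Changing `total_gb` or `package_gb` shows how quickly the package count, and with it the board area, grows with the memory requirement.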

This example illustrates the need for miniaturization: placing many chips in a single package reduces the number of interconnects on the system board on which the MCS is placed. The simple PCB solution is shown in Figure 1.5, where a minimum of 2048 data lines ($32 \times 64$ channels) has to be routed on the PCB. In an MCS solution, by contrast, the 2048 lines are inside the MCS substrate, and the size of the MCS is also reduced because the dies are placed directly inside the package. Furthermore, the interconnect inside the MCS is drastically shorter than on a PCB, which reduces the power spent on IO signalling between the memory and processor dies. One of the challenges in the transition to multi-chip systems is to find the optimized MCS solution for a given memory-cpu interface bandwidth and area/size requirement. This PCB-to-MCS path finding problem for a given memory-cpu interface is one of the challenges dealt with in this thesis.

#### MCS memory-cpu solution path finding problem statement

The cpu-memory interface MCS solution path finding problem can be stated as: Find the integration solution with minimum energy and area cost ($\psi$) under given processor-memory interface constraints on maximum area, minimum bandwidth and maximum energy consumption. The problem statement is depicted in Figure 1.6, which shows the constraints for path finding problem solvers and the desired flow.
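The problem statement above has the shape of a constrained selection: discard every candidate that violates a constraint, then pick the one with the lowest cost. The sketch below is only an illustration of that shape and not the design flow developed in this thesis. The cost $\psi$ is assumed here to be the product of energy per bit and area, and the candidate names and all numbers are hypothetical placeholders, not measured or published data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    area_mm2: float
    bandwidth_gbps: float
    energy_pj_per_bit: float


def cost(c: Candidate) -> float:
    """Assumed cost metric psi: energy-area product (an illustrative choice)."""
    return c.energy_pj_per_bit * c.area_mm2


def find_solution(candidates: Iterable[Candidate], max_area_mm2: float,
                  min_bandwidth_gbps: float,
                  max_energy_pj_per_bit: float) -> Optional[Candidate]:
    """Return the feasible candidate with minimum cost, or None if none is feasible."""
    feasible = [
        c for c in candidates
        if c.area_mm2 <= max_area_mm2
        and c.bandwidth_gbps >= min_bandwidth_gbps
        and c.energy_pj_per_bit <= max_energy_pj_per_bit
    ]
    return min(feasible, key=cost, default=None)


# Hypothetical placeholder candidates; the values carry no physical meaning.
options = [
    Candidate("option_a", area_mm2=2050.0, bandwidth_gbps=300.0, energy_pj_per_bit=10.0),
    Candidate("option_b", area_mm2=900.0, bandwidth_gbps=450.0, energy_pj_per_bit=4.0),
    Candidate("option_c", area_mm2=600.0, bandwidth_gbps=420.0, energy_pj_per_bit=5.0),
]

best = find_solution(options, max_area_mm2=1500.0,
                     min_bandwidth_gbps=400.0, max_energy_pj_per_bit=6.0)
print(best.name if best else "no feasible solution")
```

Returning `None` when no candidate is feasible keeps the case "constraints cannot be met" explicit instead of silently relaxing a constraint.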

### 1.2.2 More than Moore

For the memory-cpu system path finding problem described above, the solver must minimize the system PPAC (power, performance, area and cost) within the available solutions. The term PPAC was historically used in the context of Moore's law scaling, i.e. the shrinking of the technology node over the years. The International Technology Roadmap for Semiconductors (ITRS), however, introduced the term "More than Moore" in its 2005 report and extended it in the 2009 report to cover further PPAC system improvements [13].

ITRS named functional diversification "More than Moore" and defined it as the incorporation of new functionalities into devices that do not necessarily scale according to Moore's law, e.g. analog and mixed signal blocks, radio frequency (RF) circuits, sensors and actuators. One of the More than Moore challenges specified in the ITRS report was the design and development of co-design methodologies and tools for multi-chip systems, and the partitioning of large SOCs into a heterogeneous system within a package, similar to the MCS described in this work. The industry and ITRS members understood that going below 20 nm and beyond CMOS to 5 nm channel lengths in multi-fin transistors (FinFET) or lateral gate-all-around transistors (LGAA) alone will not be enough to achieve the same PPAC gains as before [14]. Therefore, a combination of the effort towards shorter channel transistors and the effort towards heterogeneous system integration is required to achieve higher value systems, as shown in Figure 1.7.

Figure 1.5: Traditional 16 GB LPDDR3 2 GB package based CPU-memory interface example for PCB

Figure 1.6: Problem statement and flow for memory-cpu system design

Figure 1.7: Moore scaling and More than Moore (ITRS) combined approach [13]

The combination of scaling and More than Moore was discussed by Kahng [15]. He pointed out that this was the first time an ITRS report regarded co-design tools and simulation methods as one of the techniques for further improvements in system PPAC. New methods for system level design and for trade-off analysis between different integration technologies are deemed critical for the future. The memory-cpu system optimization problem stated previously also falls under the ITRS system co-design challenges for future system performance improvement.

Loke et al. demonstrated another issue with beyond CMOS scaling [14]. They showed that analog and mixed signal blocks do not necessarily benefit greatly from Moore scaling. Rather, special circuit blocks such as high voltage IO cells and analog blocks are even more difficult to design with the same performance in technology nodes below 20 nm than in nodes with longer channel lengths. This leads to another design problem for multi-chip systems, which concerns the input-output (IO) or transmitter/receiver blocks for heterogeneous integrated systems.

#### MCS IO design problem statement

The IO blocks for MCS require special design considerations in the context of beyond CMOS scaling and the More than Moore perspective. The IO design problem for MCS can be stated as: Design the IO cells for fast and low power data transfer between multiple chips, in a co-design approach for different types of interconnect in heterogeneous systems for More than Moore scaling. The energy-area performance metric should generally hold the highest priority in design flow development for the highly miniaturized systems of the future.

### 1.2.3 Challenges in SOC

To reduce the size of systems, one idea that has been pursued extensively in recent years is to integrate different functionalities on a single chip. For example, a typical block distribution of a large system on chip (SOC) is shown in Figure 1.8, where radio frequency (RF), analog/mixed signal, digital signal processing (DSP), sensor and actuator micro-electromechanical systems (MEMS), memory (DRAM) and high speed serialized data IO front end (SERDES) blocks are placed alongside the central microprocessor (CPU) block on a single chip. Although integrating so many functionalities on a single chip seems very attractive, it poses immense challenges for design, testing, verification and system level optimization.



Figure 1.8: Typical large SOC block diagram

The main challenges in the design and development of large SOCs are depicted in Figure 1.9. Noise coupling between digital and analog blocks is a serious problem in mixed signal SOCs [16]. Noise generated on the power supply and substrate due to digital switching can couple to sensitive analog lines, resulting in degradation of analog block performance. In today's applications, the need for large amounts of memory is a critical design parameter for SOC design.

Figure 1.9: SOC design challenges

It would be ideal to have the dynamic memory (DRAM) placed in the SOC just as the cache (static RAM or SRAM) is placed inside the SOC. Apart from the obvious huge increase in the size of the SOC, there are cost and technology optimization hurdles which block the integration of CPU and DRAM in a single SOC [17]. A process optimized for logic power minimization and speed enhancement is expensive and not always suited for dynamic memory development, which has special capacitor leakage requirements for power minimization. Also, the costs of a DRAM process are typically much lower than the costs of a state of the art CPU process. Therefore, these technology and cost metrics present a continued hurdle to the inclusion of DRAM on the SOC. Hence, one way or another, a CPU must be accompanied by separate DRAM devices as shown previously in Figure 1.6, and fast, energy efficient data transfer techniques directly impact the memory access.

Another problem which arises with the increase in the size of the SOC is the global interconnect and the delay associated with it. The large number of long wires in the SOC leads to a higher capacitive load presented to the transistors and increases the power consumption. Furthermore, the propagation time from one end of the chip to the other on long global wires is much higher than on short wires, which degrades the speed performance of the chip. This topic needs special attention in the context of the relationship between SOC and multi-chip systems and is discussed in detail below.

#### 1.2.4 Global Interconnect Delay

In order to understand the impact of SOC size on the global interconnect delay, consider the metal line and rectangular SOC shown in Figure 1.10. The longest interconnect length l in a SOC with area A is given as  $\sqrt{A}$  [18]. The height and width of the metal line are labeled as h and w respectively. The area of the SOC could be reduced by partitioning it into multiple dies, for example,  $\alpha$  equal dies of area  $A/\alpha$ . In Figure 1.10, partitioning into 4 equal dies is shown. The maximum length of interconnect in each die is now reduced to  $\sqrt{A/4}$ , i.e. half of the original length.

Figure 1.10: SOC partitioning and interconnect scaling

The delay of signal transmission from one end of interconnect to other end is defined by the RC characteristics of the interconnect. The resistance R of a metal line of length l is given by [19]

$$R = \rho l / A_c = r l \tag{1.1}$$

where  $\rho$  is the resistivity of the interconnect and  $A_c$  is the cross-sectional area of the metal line. The resistance per unit length r is given by  $\rho/A_c$ . The capacitance of the interconnect is defined by the coupling to neighbouring interconnects at the sides and the coupling to metal lines beneath and above. Consider the metal line structure shown in Figure 1.11, where the signal line is surrounded by two lines on the right and left side. Furthermore, there are metal lines above and beneath the interconnect. The total capacitance and the capacitance per unit length are then given as

$$C = C_{side} + C_{vertical} = c_{total}l \tag{1.2}$$

The side capacitance and vertical capacitances are given as

$$C_{side} = hl\epsilon/s_h = c_{side}l \tag{1.3}$$

$$C_{vertical} = w l \epsilon / s_v = c_{vertical} l \tag{1.4}$$

where w is the line width, h is the height,  $\epsilon$  is the inter metal dielectric permittivity,  $s_h$  and  $s_v$  are the horizontal and vertical metal spacings respectively. Hence, the 50% rise propagation delay  $t_d$  from one end of the line to the other, assuming the distributed resistance and capacitance approach, is given as [19]

$$t_d = 0.38RC = 0.38rc_{total}l^2 = 0.38\frac{\rho}{A_c}\left(c_{side} + c_{vertical}\right)l^2 = 0.38rc_{total}A \tag{1.5}$$

where the last step uses  $l = \sqrt{A}$  for the longest interconnect. Thus, the propagation delay depends quadratically on the length of the interconnect, meaning that doubling the length increases the delay by a factor of four.
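As an illustration of Equations 1.1 to 1.5, the following minimal Python sketch evaluates the distributed RC delay from the wire geometry. The function name `wire_delay`, the copper resistivity, the relative permittivity and the 1 µm dimensions are assumptions chosen only for illustration and are not values taken from this work.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def wire_delay(length, width, height, s_h, s_v, rho, eps_r):
    """50% delay of a distributed RC line, t_d = 0.38 * r * c_total * l^2.

    All lengths are in metres, rho in ohm*m; the result is in seconds.
    """
    area_c = width * height              # metal cross-section A_c
    r = rho / area_c                     # resistance per unit length (Eq. 1.1)
    eps = eps_r * EPS0
    c_side = height * eps / s_h          # side capacitance per unit length (Eq. 1.3)
    c_vertical = width * eps / s_v       # vertical capacitance per unit length (Eq. 1.4)
    c_total = c_side + c_vertical        # total per unit length (Eq. 1.2)
    return 0.38 * r * c_total * length ** 2


# Assumed example: copper-like wire (rho = 1.7e-8 ohm*m) in an oxide-like
# dielectric (eps_r = 3.9) with 1 um width, height and spacings.
geometry = dict(width=1e-6, height=1e-6, s_h=1e-6, s_v=1e-6, rho=1.7e-8, eps_r=3.9)
t_10mm = wire_delay(10e-3, **geometry)
t_20mm = wire_delay(20e-3, **geometry)
print(f"10 mm: {t_10mm * 1e12:.1f} ps, 20 mm: {t_20mm * 1e12:.1f} ps")
```

Doubling the length from 10 mm to 20 mm in this sketch quadruples the computed delay, which is the quadratic dependence stated above.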

By partitioning the SOC into multiple dies, the interconnect length is reduced. The longest interconnect in each of the  $\alpha$  dies has length  $\sqrt{A/\alpha}$ , so the maximum propagation delay after partitioning by a factor  $\alpha$  is given as

$$t_{d\alpha} = 0.38rc_{total}(\sqrt{A/\alpha})^2 = 0.38rc_{total}A/\alpha = t_d/\alpha \tag{1.6}$$

Therefore, the maximum interconnect length scales by  $1/\sqrt{\alpha}$  and, because the delay is quadratic in length, the propagation delay scales by  $1/\alpha$  when the SOC is partitioned by a factor  $\alpha$ . For example, by partitioning the SOC into four dies, the longest global interconnect is halved and its maximum propagation delay is reduced to one quarter.

In order to understand the impact of interconnect delay reduction on system performance, consider the dependency of the system clock on the interconnect and gate delay as shown in Figure 1.12. The system clock is denoted as Clk with time period  $t_{Ck}$ . The digital system on a SOC consisting of flip-flops running at this clock rate must fulfill the following timing constraint:

$$t_{Ck} \ge t_{c-q} + t_{CL} + t_d + t_{SU} \tag{1.7}$$

where  $t_{c-q}$  is the clock-to-output delay of the flip-flop (FF), i.e. the time from the clock rising edge to the change at the flip-flop output,  $t_{CL}$  is the delay due to the combinational logic consisting of digital gates,  $t_d$  is the interconnect delay due to the longest global interconnect and  $t_{SU}$  is the setup time of the flip-flop. As is clear from this constraint, the minimum possible clock period is limited by the gate delay and the interconnect delay. With the advancement in technology nodes and shortening of channel length, the delay due to digital gates has reduced dramatically as shown in Figure 1.2, while the interconnect delay has not scaled as fast as shown in Figure 1.3, leading to interconnect delay being the dominant factor in the clock period constraint. To increase the frequency performance of the SOC, the global interconnect delay must be reduced. One way to do this is to partition the SOC into multiple dies and enhance the bandwidth of the individual dies.

Figure 1.11: Capacitance of interconnect to sides and vertical

Figure 1.12: System clock dependency on interconnect delay

With partitioning by a factor  $\alpha$ , the constraint for  $t_{Ck}$  can be written as

$$t_{Ck\alpha} \ge t_{c-q} + t_{CL} + t_d / \alpha + t_{SU} \tag{1.8}$$

This results in a reduction of the minimum clock period by

$$\Delta t_{Ck} = t_{Ck} - t_{Ck\alpha} = t_d \left( 1 - 1/\alpha \right) \tag{1.9}$$

and the system frequency increases by

$$\Delta f = f_{\alpha} - f = \frac{1}{t_{Ck\alpha}} - \frac{1}{t_{Ck}} = \frac{\Delta t_{Ck}}{t_{Ck}\,t_{Ck\alpha}} \tag{1.10}$$

For example, partitioning the large SOC into four dies reduces the interconnect delay term to one quarter. In the limiting case where the interconnect delay dominates the clock period and the gate delays are negligible, the system clock rate can therefore increase by up to a factor of four.
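The effect of partitioning on the clock period can be sketched numerically. The Python sketch below assumes the model stated in this section: the longest wire in each of the α dies has length √(A/α), and the RC delay is quadratic in length. The function names and all timing values (in picoseconds) are hypothetical and serve only as an illustration.

```python
import math


def partitioned_wire_delay(t_d, alpha):
    """Delay of the longest wire after splitting the die into alpha equal dies."""
    length_ratio = 1.0 / math.sqrt(alpha)   # sqrt(A/alpha) / sqrt(A)
    return t_d * length_ratio ** 2          # RC delay is quadratic in length


def min_clock_period(t_cq, t_cl, t_d, t_su):
    """Smallest clock period that satisfies the flip-flop timing constraint (Eq. 1.7)."""
    return t_cq + t_cl + t_d + t_su


# Hypothetical timing budget in ps, with the global wire as the dominant term.
T_CQ, T_CL, T_D, T_SU = 50.0, 200.0, 1000.0, 50.0

t_ck = min_clock_period(T_CQ, T_CL, T_D, T_SU)
t_ck_4 = min_clock_period(T_CQ, T_CL, partitioned_wire_delay(T_D, 4), T_SU)
print(f"clock period: {t_ck:.0f} ps -> {t_ck_4:.0f} ps with 4 dies")
print(f"clock rate gain: {t_ck / t_ck_4:.2f}x")
```

With these assumed numbers the gain in clock rate is smaller than the reduction in wire delay, because the flip-flop and combinational logic delays do not scale with partitioning.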

### 1.3 MCS Challenges and Constraints

The new design options of multi-chip systems and the reasons for needing them were discussed in the previous section. Placing multiple dies together in a system offers interconnect delay reduction, system miniaturization (from PCB to MCS), design complexity reduction and noise reduction. But this technology also introduces other problems and design challenges, which must be dealt with efficiently to make the use of MCS possible. By analyzing the challenges in widely used SOC and PCB based systems, three interrelated design problems have been identified for MCS. Consider Figure 1.13, where two dies are placed side-by-side in a typical multi-chip system architecture. The two dies could be, for example, a memory die and a processor die placed together. The MCS can be derived either from an application previously designed on PCB or from a large SOC which could be divided into multiple dies to reduce the global interconnect delay and enhance the system performance. In both cases of MCS derivation, the constraints for the multi-chip system design should depend upon the corresponding application previously implemented on PCB or in a large SOC. As visible from Figure 1.13, the design effort for MCS shall focus on the area and energy usage of the communication circuits between the dies and the area used by the channel.

Figure 1.13: Depiction of typical 2-die MCS architecture

For a multi-chip system, one main design block is the transmitter and receiver circuits for chip to chip communication. As shown in Figure 1.13, both dies need transceiver (Txvr) physical front end (PHY) to send and receive data between each other. This opens a new branch of communication circuit design for extra short reach interfaces (XSR). The main design metrics for chip to chip IO design are energy consumption (pJ/bit) and area consumption (mm<sup>2</sup>). Both of these metrics must be reduced as much as possible especially in comparison to the PCB based system to ensure that moving towards an MCS solution would benefit the designer in energy and area costs.

For an MCS solution replacing a previously PCB based system, such as a memory-cpu or high speed serial interface (SERDES), the following constraint for PHY design must be met:

$$\psi_{MCS} < \psi_{PCB}$$

where  $\psi$  denotes the combined energy-area metric, expressed in  $pJ/bit\,mm^2$ . The MCS solution should be used only if it reduces the energy and area costs in this sense.

For an MCS solution derived from SOC division into multiple dies for system clock enhancement and global interconnect delay reduction, the extra power used by the communication between partitioned dies should be less than the power consumed due to global interconnect based communication in large SOC. Also, the area of communication circuits must be minimized and the energy-area performance of communication between dies must be less than that within a large SOC.

$$\psi_{MCS-IO} < \psi_{Global-interconnect-IO-SOC}$$
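The two viability constraints can be expressed as a simple comparison. The following Python sketch assumes that  $\psi$  is the product of energy per bit and circuit area, which is one possible reading of the unit given above. The function names and the example numbers are hypothetical.

```python
def psi(energy_pj_per_bit, area_mm2):
    """Energy-area metric, assumed here to be the product (pJ/bit) * mm^2."""
    return energy_pj_per_bit * area_mm2


def mcs_is_viable(psi_mcs, psi_reference):
    """True if the MCS interface beats the reference (PCB or large-SOC) solution."""
    return psi_mcs < psi_reference


# Hypothetical numbers, for illustration only.
psi_pcb = psi(energy_pj_per_bit=5.0, area_mm2=0.10)
psi_mcs = psi(energy_pj_per_bit=0.5, area_mm2=0.02)
print("MCS viable against PCB:", mcs_is_viable(psi_mcs, psi_pcb))
```

The same comparison applies to a partitioned SOC by passing the energy-area cost of the global interconnect based communication as the reference.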

#### Problem Statements and Constraints Derivation Flow

The "More than Moore" discussion earlier in this chapter also dictates the need for co-design methodologies of IO and channel, in order to reduce or optimize the overall system energy-area cost under given design constraints of bandwidth and maximum area. Hence, the three main challenges or design problems in the transition from PCB or SOC to multi-chip systems can be stated as:

- Design the MCS channel for widely used PCB based IO transceivers
- Design the chip-to-chip IO transmitter front end (IO driver) for given MCS channels
- Determine the co-design MCS channel and IO methodologies for improved  $\psi$  performance  $(pJ/bit\,mm^2)$



Figure 1.14: Research methodology and work flow

The common theme or thread of the above problem statements can be described as:

This thesis deals with the design of energy-area efficient multi-chip systems for transition from PCB or large SOC solutions to multi-die solutions, which needs the development of energy-area efficient chip-to-chip transmitters, efficient channel design and exploration of co-design methodologies to make sure that MCS solutions outperform the PCB or large SOC solutions.

The above problem statements and common theme description, although they define a general direction, do not specify which PCB transceivers are targeted and why. Similarly, the design constraints for transmitter design need to be specified. These constraints are derived from the weaknesses found in the state of the art analysis. This constraints derivation flow and research methodology is shown in Figure 1.14.

Next, typical multi-chip system architectures are introduced along with their basic properties and advantages. This makes the discussion later in the thesis easier to follow. Typical transmitter and receiver architectures for PCB based systems are also shown, to provide a basic understanding and a foundation for the deeper discussion in later chapters.

#### 1.3.1 Multi-chip Module (MCM)

A multi-chip module (MCM) is the most basic integration of multiple ICs and discrete components in the horizontal dimension on a substrate [20][21]. The substrate is generally made of ceramic or organic material.

Along with ICs, discrete components such as capacitors, resistors and inductors can be placed on the substrate as shown in Figure 1.15. MCMs have historically been used for high speed applications in networking servers [22]. Multi-chip modules have been well known since the 1990s but could not enter mass production at that time due to high costs and the limitations of line spacing and width in the substrate. Now that these costs are much lower than the costs of large SOCs, MCM and similar technologies have become popular again.

An early application of MCM technology was the Intel Pentium Pro processor die placed along with two cache memory dies in a dual cavity ceramic module [23]. A recent application is the AMD Zen-2 micro-architecture for high end processors, where computing dies are separated from input/output (I/O) dies and placed together in a multi-chip module [24]. This separation of I/O and computing dies allows the use of the 7 nm technology node for computing dies and 12 nm for I/O dies. It also provides easy performance scaling for different applications by simply adjusting the number of identical computing dies in the module, while I/O dies can be made smaller or larger.

#### 1.3.2 System in Package (SiP)

The term system in package (SiP) is also used for packages with multiple chips and discrete components, but with smaller area and vertical bonding [21]. Historically, the term emerged with the rise of fanout wafer level packaging (FOWLP) for mobile and hand held devices [25]. A system in package has multiple dies bonded together in the vertical dimension and connected to the package using bond wires. The term SiP is also used for package on package (PoP) bonded systems. The most common example of a system in package is a stack of memory dies placed on top of each other in a single package. In some cases, the memory packages are also vertically connected together to provide a single outside interface to the processor. A model of SiP is shown in Figure 1.16, where multiple dies are vertically bonded and connected to the substrate at the bottom using bond wires.

#### 1.3.3 Multiple dies on Interposer in Package (2.5D)

Multiple dies can be placed together on a passive silicon die (interposer) and connected together using the highly dense metal layer routing of the interposer [26],[27]. Package C4 bumps can be connected to the dies using through silicon vias (TSVs) or back end of line (BEOL) routing of the interposer. This integration of multiple chips is called a "2.5D integrated system" as shown in Figure 1.17, where a SOC die along with a memory die is placed on a silicon interposer. TSMC pioneered the development and refinement of 2.5D integration under the name of chip on wafer on substrate (CoWoS) technology.

Figure 1.15: Multi-chip module technology

Figure 1.16: System in package technology with vertically placed dies

#### 1.3.4 Three dimensional Integration (3D)

The idea of three dimensional (3D) IC surfaced with the idea of using the TSVs for connecting multiple dies together in vertical direction. A 3D IC model is shown in Figure 1.18, where 4 silicon dies are connected through TSVs and then to a package using micro bumps. 3D integration is currently used for non-volatile memories by industry [28]. Heterogeneous integration of different dies is still a challenge for this technology due to thermal and manufacturing cost issues [29].

### 1.4 Multi-chip Communication Interfaces

ICs in a multi-chip system need to communicate and exchange data at specific rates for different applications. For example, if memory and SOC dies are placed in a multi-chip module (MCM) or on a silicon interposer in a 2.5D system, the data rates specified by the memory standard must be supported by the communication interface between the dies. In general, any communication link can be represented by Figure 1.19, where  $T_x$  represents the transmitter sending random binary data in the form of a pulse train. The transmitted pulses are distorted by the channel behavior and sensed at the receiver  $R_x$ . Communication interfaces, as we know them, are mostly between packages placed on a printed circuit board (PCB) [30]. But in multi-chip integration technologies, the signalling environment and requirements are different in many aspects [31]. Hence, the three components of die-to-die interfaces, i.e. transmitter, receiver and channel, have to be designed specifically for these systems in order to enable energy and area efficient communication.

Figure 1.17: 2.5D integrated system on interposer technology

Figure 1.18: 3D integrated system technology

#### 1.4.1 Transmitter $(T_x)$

A standard package to package communication transmitter circuit for high speed communication [32] is shown in Figure 1.20. Data from the digital part of the chip is sent to the physical interface (PHY) block. The PHY consists of a data encoder (e.g. 8b/10b encoding), a multiplexer for conversion of multiple low speed channels to a single high speed channel, a phase locked loop (PLL) for clock generation, clock dividers for the multiplexer, and a pre-driver and driver with several calibration/control blocks to meet the signal integrity requirements. Drivers are mostly differential and have programmability in current, output impedance termination, and feed forward equalization (FFE). For multi-chip interfaces, the constraints of signal equalization, data speeds, channel loss and driver programmability are significantly different from PCB [33]. The transmitter blocks which can be optimized for the multi-chip signalling environment include the driver, pre-driver, current control, output impedance control (if it is required for the specific channel), and voltage swing control. The optimization of these blocks can provide much better energy efficiency [34]. Hence, optimized transmitter design is one main part of this thesis.

Figure 1.19: Basic communication link

Figure 1.20: Standard transmitter architecture for PCB package to package communication

#### 1.4.2 Channel

In multi-chip packaging and integration technologies, the signal channel from transmitter to receiver is very different from a standard PCB channel. A conventional PCB channel has a number of impedance discontinuities as shown in Figure 1.21. The signal must travel from  $T_x$  through pad capacitance  $(C_{Pad})$ , package bondwire/C4 ball inductance  $(L_{C4})$  and capacitance  $(C_{C4})$ , package routing, PCB solder ball inductance  $(L_{Solder})$  and capacitance  $(C_{Solder})$ , PCB routing and again through the package to  $R_x$ . The discrete discontinuities and interconnect losses in the package and PCB connections can degrade the signal quality before it reaches the receiver.



Figure 1.21: Standard channel for PCB package to package communication

In multi-chip communication interfaces (MCCI), there are fewer signal channel elements than in PCB. A typical channel for an MCM system is shown in Figure 1.22a. The signal travels from the transmitter node through the package interconnect and connections only. It can be assumed that signal distortion and losses would be much less than in PCB, but this depends upon the substrate used in the package, the signal spacing and the lengths.

(b) Channel model for 2.5D silicon interposer based interfaces

Figure 1.22: Multi-chip communication interface channel models

In 2.5D chip-to-chip interfaces on interposer, the substrate is silicon and the routing is denser than in packages. A typical 2.5D channel is shown in Figure 1.22b, where the signal travels through small copper pillars or micro bumps ( $\mu bump$ ) and short interposer routing in miniaturized 2.5D systems. Signal distortion and losses in 2.5D interfaces are typically assumed to be higher than in packages due to the higher dielectric constant of the silicon substrate. All these factors must be considered together for an energy and area efficient multi-chip communication interface.

#### 1.4.3 Receiver $(R_x)$

The transmitted data, after distortion and signal quality losses in the channel, arrives at the receiver  $(R_x)$  node with capacitance  $(C_{Pad})$ . The receiver front end circuit must be sensitive enough to sense the input voltage or current and transform it to digital binary format at the baud rate. Generally, the receiver front end for package to package communication on PCB consists of an input amplifier with reference voltage, programmable input termination circuitry, high frequency bandwidth enhancement circuits (e.g. continuous time linear equalization (CTLE) and decision feedback equalization (DFE)), clock and data recovery (CDR) with a phase locked loop (PLL) or delay locked loop (DLL), and a slicer based on a 1-bit analog to digital converter [35], as shown in Figure 1.23. The high speed received binary data is then de-multiplexed into multiple slower data streams which are decoded by a data decoder (e.g. 10b/8b decoder). The decoded data is then sent to the digital part of the chip at the system clock rate, which is much lower than the interface baud rate.



Figure 1.23: Standard receiver architecture for PCB package to package communication

For multi-chip interfaces in MCM, SiP and 2.5D technologies, the front end of the receiver can be greatly optimized. Since the channel in these technologies presents different distortion properties than PCB and lengths are usually shorter, the receiver input circuitry blocks need to be co-designed. Efficiency in terms of energy per bit (pJ/bit) and area can be enhanced by  $R_x$  optimization.

### 1.5 Conclusion

The transition from a large and complex system on chip (SOC) to a multi-chip system in MCM, SiP and 2.5D technologies requires energy and area efficient design of chip-to-chip communication interfaces. The transmitter, channel and receiver parts of the interface should be co-designed in a holistic approach to achieve better overall energy and area efficiency. This thesis focuses on the transmitter  $(T_x)$  and channel design of multi-chip communication interfaces. It also presents design methodologies for co-design of multi-chip systems.

CMOS technology scaling has reached a bottleneck due to interconnect bandwidth, large size and design complexity of systems on chip. Multi-chip integration technologies such as multi-chip module, system in package and 2.5D integration of dies on silicon interposer offer a solution to this problem. These technologies exploit the short interconnect between chips and reduced design complexity to provide high bandwidth systems. The communication interfaces for these technologies need to be optimized for their specific signalling environment. This thesis addresses the transmitter and channel design of multi-chip interfaces. This work also presents design methodologies for high speed and moderate speed interfaces. The research community can benefit from this work to further enhance the energy and area efficiency of multi-chip communication interfaces.

Three chips for MCS IO circuits were manufactured during the course of this work to test the designed transmitter circuits. Measurements at wafer level were performed to correlate with the simulation results. Channel or interconnect design was performed for PCB based interfaces and the related MCS solution. The design was carried out from the perspective of minimum possible energy consumption while meeting the signal integrity specifications. A design methodology was derived for combined energy-area efficiency enhancement of the IO driver and channel. A path finding or design exploration based MCS derivation methodology was presented.

Chapter 2 will detail the literature survey of the main focus areas of thesis and derivation of the design problem specific constraints. Chapter 3 will describe the work on transmitter design for multi-chip interfaces. Chapter 4 will present the measured results on wafer level. Chapter 5 will detail the channel design for MCS solutions. Chapter 6 will demonstrate the co-design methodologies and system level path finding approach for transition to MCS from PCB based systems. Chapter 7 will conclude the thesis with key scientific gains, open questions and future work.

## Chapter 2

## Literature Survey

This thesis targets the design of energy-area efficient multi-chip interfaces. To achieve that, it contains three main focus areas as depicted in Figure 2.1: the transmitter for multi-chip interfaces, channel or interconnect design, and optimization and system level design methodologies. This chapter describes the main challenges of the thesis analytically and then analyzes the state of the art. From the state of the art, the weaknesses and missing design aspects or trade-offs are detailed. From this analysis, the specifications for the design challenges of the thesis are further narrowed down and described. Later chapters describe the solutions to the challenges and compare the offered solutions with the state of the art.

Section 2.1 describes the transmitter design problem analytically and discusses the state of the art. It details the prior art in different kinds of commonly used driver topologies. Section 2.2 presents the channel design problem analytically and discusses the previous work on interconnect design in different substrate materials for various data rates. Section 2.3 demonstrates the problem of co-design of channel and transmitter, and the system level path finding problem. Then it reviews the literature on communication interface optimization methodologies, co-design approaches and system level previous path finding work. Each section has a subsection which describes the shortcomings in state of art for the problem statements and based on them another subsection lists out the design constraints and narrows down the research direction for the thesis target problems for MCS systems. Finally, section 2.4 concludes this chapter with remarks on prior art and objectives of this thesis.



Figure 2.1: Thesis theme and three main challenges



Figure 2.2: Main blocks of transmitter design

### 2.1 Transmitter design

The transmitter front end (driver) for short reach interfaces in MCM, SiP and 2.5D systems needs to be optimized for the required data rate and channel characteristics; meanwhile, several performance targets in terms of power, bandwidth and silicon area should also be met by the overall transmitter. The basic blocks of a transmitter are shown in Figure 2.2, where the driver, pre-driver and serializer blocks represent the main design effort required for an energy-efficient transmitter. The serializer block converts the slow data streams arriving from the left into a single fast data stream, which is then sent to the pad through the pre-driver and driver. The pre-driver and driver generally have separate power supplies, termed VDD/VSS and VDDA/VSSA respectively.
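The serializer function can be illustrated with a behavioural sketch. The Python code below models an N:1 serializer as plain time-multiplexing of equal-length parallel bit streams. The function name `serialize` and the lane ordering are assumptions for illustration and do not describe the circuit implemented in this work.

```python
def serialize(lanes):
    """Behavioural N:1 serializer.

    `lanes` is a list of N equal-length bit lists (the slow parallel streams).
    The output takes one bit from each lane in turn, so it runs N times faster.
    """
    if len({len(lane) for lane in lanes}) != 1:
        raise ValueError("all lanes must carry the same number of bits")
    return [bit for word in zip(*lanes) for bit in word]


# Four slow lanes carrying two bits each become one stream of eight bits.
fast_stream = serialize([[1, 0], [0, 0], [1, 1], [0, 1]])
print(fast_stream)  # [1, 0, 1, 0, 0, 0, 1, 1]
```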

The transmitted data sent on the pad is received at the receiver die  $(R_x)$  or chip pad after going through the channel or interconnect, and can be represented by an eye diagram as shown in Figure 2.3.  $R_x$  compares the voltage to a reference voltage  $(v_{ref})$  and amplifies the signal accordingly. Due to the non-ideal nature of the receiver, a minimum input voltage swing  $(V_{sw})$  above and below the reference voltage  $(v_{ref})$  is necessary for correct interpretation of the received signal. The minimum necessary high and low voltage signals are termed  $v_{oh}$  and  $v_{ol}$  respectively as shown in Figure 2.3.

The red curve is the received signal at the input of  $R_x$ .  $V_{DDH}$  and  $V_{SSL}$  are the power supply and ground nodes of the receiver. The purple rectangular box in the middle defines the eye mask, which ideally no signal should cross.  $t_{setup}$  and  $t_{hold}$  are the timing requirements of the  $R_x$  input sampler, defined relative to the clock rising or falling edge position depending upon positive or negative edge sampling. Eye height and width define the voltage swing and horizontal opening of the signal eye diagram at the receiver. These eye parameters are necessary to understand, as they are used extensively in the problem descriptions below. Some receivers also have maximum signal overshoot and undershoot constraints which must be fulfilled.
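The eye requirements described above can be written as a simple pass or fail check. The Python sketch below assumes that  $v_{oh} = v_{ref} + V_{sw}$  and  $v_{ol} = v_{ref} - V_{sw}$ , and that the eye width must cover at least  $t_{setup} + t_{hold}$ . These relations, the function name `eye_ok` and the example values (in mV and ps) are illustrative assumptions, not specifications from this work.

```python
def eye_ok(high_samples, low_samples, v_ref, v_sw, eye_width, t_setup, t_hold):
    """Check received levels and eye width against the receiver requirements.

    high_samples / low_samples: received voltages for 1 and 0 bits at the
    sampling instant. Voltages share one unit (e.g. mV), times another (e.g. ps).
    """
    v_oh = v_ref + v_sw   # assumed minimum level for a received 1
    v_ol = v_ref - v_sw   # assumed maximum level for a received 0
    voltage_ok = min(high_samples) >= v_oh and max(low_samples) <= v_ol
    timing_ok = eye_width >= t_setup + t_hold
    return voltage_ok and timing_ok


# Hypothetical receiver: v_ref = 600 mV, V_sw = 100 mV, setup/hold of 20 ps each.
print(eye_ok([720, 750], [450, 480], v_ref=600, v_sw=100,
             eye_width=60, t_setup=20, t_hold=20))
```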

The problem statement for transmitter for chip to chip interfaces can be further detailed as below.

**Design Inputs:** Frequency or bandwidth f (Gb/s), length range of interconnect l (mm), minimum received eye height or voltage swing  $V_{sw}$  (mV), minimum received eye width in terms of unit interval or single bit time  $t_{min}$  (UI), related PCB based system energy-area cost  $\psi_{PCB} \ (pJ/bit\,mm^2)$  if transitioning from a PCB to an MCS solution, related SOC energy-area cost  $\psi_{largeSOC} \ (pJ/bit\,mm^2)$  if transitioning from a large SOC to a partitioned MCS solution.

**Design Cost metric:** The cost metric to be minimized in order to reach a final design is the energy-area cost  $\psi_{MCS}$   $(pJ/bit\,mm^2)$  of the transmitter, which must be less than the related  $\psi_{PCB}$  or  $\psi_{largeSOC}$ , whichever applies, for the solution to be viable.

**Design effort result:** Type of driver circuit  $\alpha$ 

$$\alpha \in A = \{HSUL, CML, LVDS, SSTL\}$$

driver power supply VDDA (V), pre-driver and serializer power supply VDD (V), width of transistors in driver  $W$  $(\mu m)$ , channel length of transistors  $L$  $(\mu m)$ , type of driver termination if required  $T \in \{Series, Parallel\}$ .
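The design inputs and results listed above can be captured in a small data structure, which is convenient when sweeping candidate designs in a script. The Python sketch below is one possible encoding. All class, field and function names are hypothetical, and treating "no termination" as a third option is an assumption based on the phrase "if required".

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Optional, Tuple


class Driver(Enum):
    HSUL = "HSUL"
    CML = "CML"
    LVDS = "LVDS"
    SSTL = "SSTL"


class Termination(Enum):
    SERIES = "Series"
    PARALLEL = "Parallel"


@dataclass(frozen=True)
class DesignInputs:
    f_gbps: float                           # data rate per wire
    length_mm: Tuple[float, float]          # (min, max) interconnect length
    v_sw_mv: float                          # minimum received eye height
    t_min_ui: float                         # minimum received eye width
    psi_pcb: Optional[float] = None         # reference cost if coming from PCB
    psi_large_soc: Optional[float] = None   # reference cost if coming from a large SOC


@dataclass(frozen=True)
class DesignResult:
    driver: Driver
    vdda_v: float                           # driver supply
    vdd_v: float                            # pre-driver and serializer supply
    width_um: float                         # driver transistor width W
    length_um: float                        # driver transistor channel length L
    termination: Optional[Termination]      # None means unterminated


def topology_candidates():
    """All (driver, termination) pairs, including the unterminated case."""
    return list(product(Driver, [None, *Termination]))


print(len(topology_candidates()), "driver/termination combinations")
```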

The data rate f requirements or design inputs are clear when the design is being transitioned from a PCB based solution, e.g. a memory-processor interface on PCB. A good example is a memory-cpu interface bandwidth requirement of 400 Gb/s, which clearly dictates the bandwidth per wire requirement. Where no such reference system exists, a thorough state of the art analysis helps to get a feel for the design target of bit rate for MCS. Communication standards can also be explored to obtain a direct frequency or bandwidth per wire requirement. Similarly, the length of wire is not clear in advance, and standards along with the state of the art analysis help with the understanding and derivation of the MCS channel length range in millimeters. So, the two design inputs f and l, unless coming directly from a PCB or large SOC transition to MCS, can be derived from the state of the art analysis and the standards initiated in the industry for multi-chip systems. Hence, the state of the art described below shall be used to further narrow down the design input constraints and make the problem clearer, as shown in Figure 2.4. This method shall be used for all three focus areas of the thesis to make the design problem clear by identifying the research direction from the state of the art and identifying the weaknesses.

Figure 2.3: Receiver input signal eye diagram parameters

Figure 2.4: Problem clarity and design targets derivation from state of art and standards

A detailed analysis of driver circuits is not given here, since it is part of the design process based on the design targets, which are derived after the state of the art analysis. The detailed driver analysis is therefore presented in the next chapter, to describe the design flow for the design problem derived at the end of this chapter. Still, to make the discussion clearer to the reader, a short discussion of common driver architectures is presented along with the state of the art.

There are several challenges in the design of driver circuits, i.e. impedance matching, signal rise and fall times  $(t_{rise}, t_{fall})$ , and signal swing  $(v_{sw})$ . Generally, driver circuits are categorized in terms of signalling mode, i.e. voltage and current [36]. The type of signalling constrains the driver applicability for a certain channel, and defines the level of complexity required to meet the performance requirements. Voltage mode drivers can be either single ended or differential, whereas current mode drivers are almost exclusively differential. There are several signalling schemes, e.g. non-return-to-zero pulse amplitude modulation (NRZ PAM-2), multi level signalling (MLS) using four levels (PAM-4), and duo-binary coding.

The most commonly used scheme is NRZ PAM-2, where the bits 1 and 0 are modulated to two different voltages or currents by the driver. A voltage mode driver modulates the incoming binary data [1 -1] (0 changed to -1 as it is NRZ) into two different voltages $[v_{high}\ v_{low}]$. For a typical high swing voltage mode driver, $v_{high}$ is "vdd" and $v_{low}$ is ground, denoted as "vss", as shown in Figure 2.5. During a low to high transition, the PMOS transistor pulls the output node connected to the channel up to vdd. During a high to low transition, the NMOS transistor pulls the output down to vss. This kind of high swing unterminated driver is generally referred to as an HSUL (high swing unterminated logic) or push-pull driver.



Figure 2.5: High swing unterminated logic (HSUL) PMOS-over-NMOS (P-N) topology

#### 2.1.1 State of the Art

Jeong et al. demonstrated 20 Gb/s signalling HSUL driver with resistive feedback which converts it into a transimpedance amplifier (R) [37]. Small size HSUL blocks were used as pre-driver current sources to the driver. This topology has the advantage of impedance matching by controlling the value of driver transconductance (gm) through a feedback loop. This architecture saves power using small size of driver NMOS and PMOS transistors.

Dehlaghi et al. demonstrated the usage of a push-pull HSUL driver (Figure 2.5) to transmit 20 Gb/s on 3.5 mm long silicon interposer aluminum interconnect [33]. Even for such small channels, they used a passive high pass filter (equalizer) at the driver side to speed up the rising and falling edges. On the receiver side, a series RC termination was used, where the resistance is much higher  $(4\times)$  to increase the voltage swing. It was shown that for I/O in silicon interposer, 0.32 pJ/bit energy consumption can realize 20 Gb/s signalling using simple driver with passive RC equalizer.

Lin et al. showed 1.1 Gb/s silicon interposer communication using an NMOS-over-NMOS low voltage swing driver with a 0.3 V power supply [34]. Only channels of length up to 1 mm were supported by this topology. There was no termination at either end of the communication interface, which was one of the main constraints hindering the data rate. Secondly, the 0.3 V power supply was very low in comparison to what is required for higher data rates. This was due to triode region operation of the pull-up NMOS transistor, which supports only very small currents.

For terminated voltage mode signalling, source series terminated logic (SSTL) is one of the most commonly used topologies as shown in Figure 2.6. This kind of driver is mainly used in memory to processor interfaces. Series resistor  $R_s$  is connected to pull-up P-type transistor whose source is connected to vdd. Similarly, pull-down N-type transistor is connected to series resistor  $R_s$  and its source node is connected to ground vss. This source series terminated driver is connected to



Figure 2.6: Source series terminated logic (SSTL) PMOS-over-NMOS (P-N) topology

channel and then parallel terminated through resistor  $R_T$  at the receiver end to voltage supply (vtt) which is typically equal to vdd/2.

During pull-up mode, the PMOS transistor is turned on while the NMOS is turned off. This results in current flowing from vdd to vtt, and the input to the $R_x$ buffer goes to the high level $v_{oh}$, ideally equal to 3/4·vdd. During pull-down mode, the current flows from vtt through the NMOS to ground, making the output equal to $v_{ol}$, ideally 1/4·vdd (for $R_s = R_T$ and vtt = vdd/2, by symmetry with the high level). DC power is dissipated in the SSTL topology during both high and low states. Also, there can be some switching leakage current from vdd to ground at the transmitter side.
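The ideal levels can be checked with a simple resistive divider. The sketch below is illustrative only: it assumes the transistor on-resistance is lumped into $R_s$, a channel that is lossless at DC, and vtt = vdd/2 by default. The function name and the numeric values are not taken from the cited works.

```python
def sstl_levels(vdd, r_s, r_t, vtt=None):
    """Ideal SSTL receiver-input levels from a resistive divider.

    Assumptions (illustrative): transistor on-resistance is lumped into
    r_s, and the channel has no DC loss.
    """
    if vtt is None:
        vtt = vdd / 2.0
    # Pull-up: current flows vdd -> r_s -> channel -> r_t -> vtt
    v_oh = vtt + (vdd - vtt) * r_t / (r_s + r_t)
    # Pull-down: current flows vtt -> r_t -> channel -> r_s -> vss
    v_ol = vtt * r_s / (r_s + r_t)
    # Static current drawn in the high state
    i_dc = (vdd - vtt) / (r_s + r_t)
    return v_oh, v_ol, i_dc


# Matched case R_s = R_T = 50 ohm with a 1 V supply
v_oh, v_ol, i_dc = sstl_levels(1.0, 50.0, 50.0)
print(v_oh, v_ol, i_dc)
```

With matched resistors the levels sit symmetrically around vtt, and the static current in either state is what makes SSTL dissipate DC power.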

In order to save power in SSTL topology for communication interfaces on silicon interposer, Kim et al. proposed to use high receiver termination and very low transmitter termination [38]. They used 1 k $\Omega$  resistor at R<sub>x</sub> instead of 50 $\Omega$ , reducing the current consumption from 4 mA to 0.2 mA for 800 mV swing. Due to channel losses in silicon interposer interconnect, a receiver continuous time infinite impulse response decision feedback equalizer (IIR-DFE) was proposed. Kim et al. demonstrated 8.9 Gb/s data rate transmission on 40 mm silicon interposer channel with 1.9 pJ/bit energy efficiency.

Wong et al. showed voltage mode signalling for chip-to-chip applications up to 3.6 Gb/s on 80 mm FR4 substrate interconnect [39]. They used low common mode NMOS-over-NMOS topology instead of PMOS-over-NMOS topology. This resulted in significant power saving on the driver side by using voltage supply of 0.5 V.

Dickson et al. [40] demonstrated a PMOS-over-PMOS low swing driver to achieve a 10 Gb/s data rate on silicon carrier interconnect, using an IIR-DFE similar to the one presented in [38]. They used low transmitter driver impedance and high receiver input impedance to achieve large voltage swing with minimum power. Dickson et al. argued, based on signal integrity simulations, that a link with unmatched termination impedance on silicon interconnect is not much affected by reflections, due to the losses in the channel. They used this result to save the power that would otherwise be consumed by low value termination resistors at both ends of the link.

Poulton et al. showed 0.5 pJ/bit energy efficiency signalling for short reach 4.5 mm interfaces in multi chip modules or packages with organic substrate [41]. They proposed ground referenced single ended signalling with a charge pump driver that eliminates the simultaneous switching noise problem in large communication interfaces. The link was terminated on both transmitter and receiver ends, giving a voltage swing equal to $I_C \cdot R_T/2$.

Another work by Poulton et al. showed 25 Gb/s ground referenced charge pump driver based signalling for MCM and SiP systems [42]. This work added an edge detector based equalizer to the driver to speed up the edges during the transition. This idea was similar to the edge detection based equalization used in [34]. Signalling at 25 Gb/s was shown on 10 mm organic package substrate with -4 dB channel attenuation.

Differential signalling is mostly in the form of current mode [37]. There are two main topologies for current mode signalling, i.e. current mode logic (CML) and low voltage differential signalling (LVDS). In order to explain the current mode signalling, a typical schematic for LVDS driver is shown in Figure 2.7. Driver consists of two HSUL drivers in parallel with differential inputs  $(v_{in}^+, v_{in}^-)$ . They are connected at top and bottom to vdd and vss through current source I. When  $v_{in}^+$ is high and  $v_{in}^-$  is low, left PMOS and right NMOS turn fully on. They carry the whole current provided by current source, and this current travels from PMOS  $\rightarrow$ channel  $\rightarrow R_T \rightarrow$  channel  $\rightarrow$  NMOS  $\rightarrow$  vss. This results in positive voltage swing  $v_{out} = I \cdot R_T$  across  $R_T$ . Alternatively, when  $v_{in}^-$  is high and  $v_{in}^+$  is low, right PMOS and left NMOS turn on, resulting in current flow in opposite direction across  $R_T$ . This opposite flow of current produces a negative swing  $v_{out} = -I \cdot R_T$  across  $R_T$ .



Figure 2.7: Low voltage swing differential current mode signalling (LVDS)

comparator (slicer or sampler) in the receiver. It is important to note that termination is only applied at the receiver end. At high data rates, reflections caused by imperfect receiver termination and channel discontinuities can be reflected again at the transmitter side. In order to avoid this problem, a termination resistor can be connected between the output nodes at the transmitter side (not shown in Figure 2.7). The switched current flowing through $R_T$ is then halved, since half of the current now flows through the transmitter termination. This results in an available positive or negative voltage swing of $v_{out} = \pm I \cdot R_T/2$.
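The two swing expressions can be compared numerically. The sketch below assumes an ideal current source and a transmitter termination equal to $R_T$, so that the switched current splits equally as described in the text. The function name and the example values (3.5 mA, 100 Ω) are illustrative assumptions.

```python
def lvds_swing(i_tail, r_t, tx_terminated=False):
    """Magnitude of the differential swing across the receiver termination.

    Assumptions (illustrative): ideal current source, and a transmitter
    termination equal to r_t so the switched current splits in half.
    """
    i_rx = i_tail / 2.0 if tx_terminated else i_tail
    return i_rx * r_t


# Example values chosen for illustration only
print(lvds_swing(3.5e-3, 100.0))                      # receiver termination only
print(lvds_swing(3.5e-3, 100.0, tx_terminated=True))  # termination at both ends
```

Adding the transmitter termination trades half of the swing for better absorption of reflections, so the current must be doubled to keep the same receiver swing.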

Current mode differential transmitters, including current mode logic (CML) and low voltage differential signalling (LVDS) topologies, are the most widely used driver architectures. They have been widely used in PCB based and backplane based communications. Many works on LVDS drivers have been reported [43][44][45][46][47]. Similarly, CML architecture based high speed drivers for electrical PAM-2 and PAM-4 have been reported [48][49][50][51].

Both voltage and current mode transmitters have their own advantages and disadvantages. Current mode transmitters consume more static current independent of the data rate, while voltage mode transmitters consume power dependent on the data rate. Jeong et al. showed that, under certain design conditions, voltage mode transmitters are better in terms of power efficiency up to a maximum data rate of 25 Gb/s [37]. For $\geq 25$ Gb/s, current mode transmission is more energy efficient, but a discussion of driver area usage and of optimization based on interconnect length is missing.

Wang et al. proposed the usage of both CML and LVDS circuits for 2.5D die-to-die interfaces [52]. They used on chip passive inductors in the current mode driver for signal bandwidth enhancement up to 12.8 Gb/s at the cost of large silicon area. However, the length of the interposer line was only considered to be 3 mm, which is extremely short for real 2.5D transmission links. LVDS was also shown for 2.5D interconnect, capable of operating up to 10 Gb/s.

Lee et al. presented a transceiver for 10 mm long interconnect either on chip or on an interposer [53]. Differential current mode signalling was used, with a high impedance load at the transmitter connected to a receiver with low input impedance. The receiver was designed as a transimpedance amplifier (TIA) which converted the current input to a voltage swing. In order to enhance the bandwidth, pre-emphasis at the driver and active inductor peaking at the TIA were used. They demonstrated an energy efficiency of 29.4 fJ/b/mm. However, the test was performed on a single chip, so it did not include the effect of the die to interposer link or of crosstalk on a silicon substrate interposer, and the work did not discuss the area to bandwidth efficiency when used in a 2.5D application.

[54] showed 19 mm MCM interface at 12 Gb/s using CML driver with 2 mA maximum current based on PMOS transistors. Driver current was adjustable allowing output voltage swing range of 100 mV peak to peak differential (mVppd) to 400 mVppd. This adjustable current allowed the power efficiency control for different interconnect lengths. Maximum energy efficiency of 1.4 pJ/bit was achieved with this CML driver, which scaled with the channel properties or distortions.

[55] showed a 1.02 pJ/bit energy efficient 20 Gb/s interface for MCM systems with -6 dB channel loss. A chord signalling technique with low simultaneous switching noise (SSN), inter symbol interference (ISI), crosstalk (XTALK), and common mode noise (CMN) was presented. In this technique, each bit is distributed over multiple wires, which consequently limits the maximum possible skew between these wires to $\leq 4$ ps. It is shown that chord signalling, although it has similar sensitivity to ISI, provides better pin efficiency, as it uses $(n+1)$ wires for n bits while PAM2-NRZ differential signalling uses $(2n)$ wires for n bits.
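The pin efficiency argument reduces to a wire count per transmitted bit. The following sketch simply evaluates the two expressions quoted above; the function names are illustrative and not taken from [55].

```python
def wires_needed(n_bits, scheme):
    """Number of wires used to carry n_bits in parallel."""
    if scheme == "chord":
        return n_bits + 1   # (n+1) wires for n bits
    if scheme == "differential":
        return 2 * n_bits   # one wire pair per bit
    raise ValueError(f"unknown scheme: {scheme}")


def pin_efficiency(n_bits, scheme):
    """Bits carried per wire."""
    return n_bits / wires_needed(n_bits, scheme)


for n in (1, 5, 8):
    print(n, pin_efficiency(n, "chord"), pin_efficiency(n, "differential"))
```

For a single bit both schemes need two wires; the advantage of the $(n+1)$ wire scheme grows with n, while differential signalling stays at 0.5 bits per wire.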

For the PCB to MCS solution comparison discussed in chapter 1, the state of the art discussed here makes it quite clear that MCS solutions will be more miniaturized and consume less area and energy, because highly specialized chip-to-chip transmitter designs already exploit the short interconnect lengths to reduce energy consumption. Hence, the PCB-to-MCS comparison makes more sense in the next sections on channel design and on system design or path finding methodologies, where the transmitter circuits are fixed and the only optimization space left is the interconnect, or, at a higher abstraction level, the choice of the correct transmitters. One good example is the memory-cpu system design path finding problem, where the right memory and interconnects are chosen based on the bandwidth requirements for an overall energy-area efficient memory-cpu system, which could be either an MCS or a PCB based solution depending upon the requirements.

Most standards in industry, e.g. HDMI, PCI Express, USB, Ethernet use differential current mode signalling architectures for long wire transmission. Recently, due to large interest in the multi-chip systems, open domain specific architecture (ODSA) society introduced a draft for a chip-to-chip communication interface [56]. This interface, for the first time, reduced the effort needed for finding the right data rate per wire and the optimum target channel length. ODSA bunch of wires (BOW) interface standard showed the channel length targets to be up to 10 mm and data rates up to 16 Gb/s/wire.

## 2.1.2 Weaknesses in State of the Art

From the state of the art analysis of previously reported transmitters for chip to chip interfaces, there were several weaknesses identified. The main weaknesses identified are:

- Large area usage due to large driver sizes, e.g.  $1323 \,\mu\text{m}^2$  in [53] and  $1500 \,\mu\text{m}^2$  in [33]
- Extremely short length of interconnect supported in some works, e.g. only 1 mm in [34], 4.5 mm in [41] and 3.5 mm in [33]
- Lack of diverse driver topologies within a single PHY for power optimization based on interconnect routing length variations, e.g. [54] does show the power reduction by changing the current in the CML driver and scaling the voltage swing with interconnect length but does not show if a PHY could implement a multi-driver topology architecture
- Usage of hybrid serializer architectures, using single ended CMOS style for low data rates up to 4-5 Gb/s and differential style (CML, transmission gates etc.) for higher data rates up to full rate, e.g. [54] uses 3-stage serializer with first stage being single ended and last two serialization stages being differential, similarly [57] uses charge steering CML style multiplexers for higher data rates and only uses CMOS style architecture up to 5 Gb/s

- Lack of any reported publication or work specific to the ODSA BOW standard [56] with a dual driver topology and support for various channel lengths and data rates

## 2.1.3 Transmitter Design Problem update

From the weaknesses identified in the state of the art and the trends regarding data rates and channel lengths, it is now easier to define the design targets for the chip-to-chip transmitter design in this work. The BOW interface standard [56] also makes it more straightforward to clarify the design targets.

#### Final Design problems and inputs:

- Based on the bunch of wires interface standard, for a frequency range f of 2-16 Gb/s and low loss organic interconnect with lengths l in the range of 1-10 mm, design a transmitter with: combined energy-area efficiency ($\text{pJ/bit}\cdot\text{mm}^2$) per unit millimeter of interconnect length below the state of the art of around 5.5 pJ/bitnm; energy consumption below 0.4 pJ/bit; as the main priority, an extremely low transmitter area below the state of the art designs, i.e. < 1323 $\mu$m<sup>2</sup>; $V_{sw,min} \geq 250$ mV; and minimum eye width $t_{min} \geq 0.6$ UI
- Show if a transmitter design for multi-chip systems can be optimized based on the data rate and interconnect length with an example driver for short interconnects.

## 2.2 Channel and Interconnect

The second part of this thesis focuses on the interconnect and channel design for multi-chip interfaces. The problem is to identify the channel physical parameters for a specific widely used PCB based chip-to-chip communication system in a 2.5D silicon substrate interconnect environment and determine if it makes sense to transition from PCB to 2.5D MCS solution in these systems. This work effort shall add knowledge in this area and shall help the designers to make better design tradeoff decisions when choosing the integration platform for their applications. Special attention is given to 2.5D interface channels because they are still relatively new in comparison to organic substrate based MCM interfaces. The idea is to test and develop understanding of performance of 2.5D silicon interconnect for different interfaces to determine if the channel with a certain width, length and spacing can support a target communication interface.

The problem statement for 2.5D silicon substrate based multi-chip system channel design for widely used PCB based systems can be further detailed as below.

**Design Inputs:** Type of interface I, frequency or bandwidth specific to the interface f (Gb/s), interface specific receiver minimum eye height $V_{sw}$ (mV) and minimum eye width $t_{min}$ (UI), energy-area cost of the related PCB based system $\psi_{PCB}$ ($\text{pJ/bit}\cdot\text{mm}^2$)

**Design Cost metric:** In order to find the channel design parameters, the cost metric to be followed is the minimization of the energy-area cost $\psi_{MCS}$ ($\text{pJ/bit}\cdot\text{mm}^2$) of the interface, which must be less than the related $\psi_{PCB}$.

**Design effort result:** Width of channel interconnect  $W(\mu m)$ , supported length range of interconnect l(mm), and spacing between lines  $S(\mu m)$ , optimal settings of interface transceiver circuits for minimum energy-area cost  $\psi$ .
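The acceptance condition for a candidate channel can be written down compactly. The sketch below assumes that the energy-area cost $\psi$ is the product of energy per bit and area, which is one reading of the stated unit; the class name, function name and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InterfaceCost:
    energy_pj_per_bit: float
    area_mm2: float

    @property
    def psi(self):
        # Assumed definition: energy-area product in pJ/bit * mm^2
        return self.energy_pj_per_bit * self.area_mm2


def meets_targets(mcs, pcb, eye_height_mv, eye_width_ui, v_sw_min_mv, t_min_ui):
    """True if the MCS channel meets the eye constraints and beats the PCB cost."""
    eye_ok = eye_height_mv >= v_sw_min_mv and eye_width_ui >= t_min_ui
    return eye_ok and mcs.psi < pcb.psi


# Hypothetical numbers, for illustration only
mcs = InterfaceCost(energy_pj_per_bit=0.5, area_mm2=2.0)
pcb = InterfaceCost(energy_pj_per_bit=5.0, area_mm2=20.0)
print(meets_targets(mcs, pcb, eye_height_mv=300, eye_width_ui=0.7,
                    v_sw_min_mv=250, t_min_ui=0.6))
```

A candidate (W, S, l) is only counted as a valid result if both conditions hold: the receiver eye constraints of the interface and $\psi_{MCS} < \psi_{PCB}$.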

The main missing input in the channel design problem is the type of interface I. To determine it, a state of the art analysis is required, which helps to identify the most widely used PCB interfaces and the areas in which knowledge about optimal 2.5D channel design is still missing.

## 2.2.1 State of the Art

Channels are typically characterized by their insertion loss and reflection loss scattering parameters (S<sub>12</sub>, S<sub>11</sub>). An example is shown in Figure 2.8 for a 10 mm long silicon substrate based 2.5D interconnect with a width of 1 $\mu$m. Previous work in this area of channel test, design and performance evaluation for different interfaces is discussed below.

Wang et al. discussed the signal and power integrity of wide-I/O memory [58] on 2.5D integrated system [59]. Wide-I/O is specifically designed for miniaturized systems like 2.5D interfaces. Wang et al. studied the system containing a SOC die and a wide-I/O memory die on silicon interposer. The interconnect on interposer was modeled and used for signal performance analysis on both data write to memory and read from memory modes at 266 MHz data rate. The through silicon via (TSV) scattering parameters (s-parameters) were extracted from simulation and compared with the measured s-parameters.

Another type of memory designed for high speed 2.5D and 3D interfaces is high bandwidth memory (HBM) [60]. Cho et al. modeled a 2.5D interface using a 6 metal layer interposer for 4 HBM dies [61]. 3 out of 6 layers were used for signal transmission on the interposer, while the remaining 3 were used for power, ground and control signals. The thickness of the metal interconnect used is 1 $\mu$m and the critical length is chosen as 5 mm. The spacing between two signal lines was 3 $\mu$m. For signal integrity (SI) simulation, a simple RC network was used to model the SOC I/O and a simple capacitive load was used to represent the HBM input, due to the lack of an HBM I/O model. Eye diagrams at 2 Gb/s data rate showed significant eye width and eye height margins. Optimum results were obtained with an interconnect width of 3 $\mu$m and 3.6 $\mu$m spacing.

Lee et al. discussed the performance of silicon, glass and organic substrate based interposers for HBM memory [62]. They simulated 1 GHz HBM signals on interconnect models extracted using 3D electromagnetic solvers and determined the signal eye height and width variations, along with jitter trends, on interconnect with different substrate materials. It was concluded that the metal layers farthest from the substrate perform better than the lower layers closer to the substrate. Also, the glass interposer was shown to perform better due to its lower loss tangent $\tan\delta$.

Wang et al. show the signal and power integrity simulation results of wide-I/O memory on 2.5D silicon interposer [63]. Crosstalk analysis is performed for long and short parallel lines on interposer. It is shown that the crosstalk problem becomes worse for longer parallel interconnects, which is understandable due to larger coupling. It is also shown that on die decoupling capacitor is extremely important for reducing the power supply noise below 5% voltage swing.



Figure 2.8: Silicon substrate interconnect S-parameters for L = 10 mm,  $W = 1 \mu \text{m}$ 

Lee et al. demonstrated a 6 layer silicon interposer for terabyte/s bandwidth graphics applications [64]. They showed that HBM can support such extremely high bandwidth application requirements on an interposer. The signal integrity was tested by simulating the extracted interconnect models with the HBM signal driver model. The results showed that the demonstrated interposer met the signal integrity requirements of HBM signalling.

Choi et al. presented an eye diagram estimation method using worst case crosstalk and statistical eye diagram methods [65]. The estimation time was shown to be reduced with the proposed method, and it was found that both worst case and statistical methods should be used together for quick estimation. This method was then used to analyze the HBM memory signal integrity, and the proposed method could be used to design the interposer for reduced crosstalk and better eye quality.

Chandrasekar et al. [66] discussed the timing performance oriented analysis of 0.5-1 GHz wide-I/O memory die with an FPGA die on a silicon interposer of 3-6 mm interconnect length. It was shown that for double data rate operation of wide-I/O interface (1-2 Gb/s), the timing budget allocation for an FPGA interface must be in the range of 150 to 250 ps. This budget was proposed to account for the timing jitter values reported in the eye diagrams. It was also shown that on 3 mm interconnect, even low drive strengths outperform the high drive strength drivers on 6 mm interconnect. Chandrasekar et al. did not incorporate the influence of different termination values at the receiver side on the signal integrity and overall performance in terms of energy and eye margins.

Egawa et al. tried to improve the power efficiency of vector supercomputers using 2.5D interposer integration [67]. For the PCB system, they used the SX9 processor and its related memory module. HBM was selected as memory for the 2.5D interface and a single SX9 processor die as SOC. Power consumption of the silicon interposer interface and the PCB interface was evaluated under specific memory access benchmarks. It was reported that the power consumption difference between PCB and silicon interposer interconnect increased with wire length from 1 to 30 mm. The paper estimated almost 83% power reduction with the interposer interface. This huge power reduction was due to the high power consuming analog blocks in the PCB interface model, which were not required in the 2.5D interface.

Dehlaghi et al. [68] investigated the performance of interposer interconnect up to 20 Gb/s data rates on a low cost 0.35 $\mu$m silicon interposer. They used direct connections from measurement equipment to send and receive data. The data was driven on 4.2 mm and 6 mm interposer interconnects. Insertion loss was measured and simulated. It was shown that a 6 mm interconnect with 0.64 $\mu$m height and 2 $\mu$m width had -22 dB insertion loss at 10 GHz. It was also shown that silicon interposer has better aggregate bandwidth as compared to organic substrate based interposer. However, they did not evaluate the effect of equalization, current strength, and power supply on energy efficiency. Also, the impact of different trace widths and spacing with different transmitter parameters was not studied.

For data rates higher than the speeds supported by the processor and cpu, n data lines are multiplexed n:1 onto a single line and sent to the other chip. These circuits are termed SERDES transceivers, and they support extremely high data rates in the range of tens of gigabits per second (Gb/s). Several works report the analysis of the channel and its performance for SERDES data transmission.

Karim et al. demonstrated the capacitive coupled signalling idea for 2.5D interposer multi-chip system [69]. On chip metal insulator metal capacitor was used to implement the 2-tap feed forward equalization for interconnect insertion loss. Very thin wires of width  $1 \,\mu$ m with  $10 \,\mu$ m pitch were used. The signalling was shown to work for 10 mm long 2.5D interconnects at data rates of 30 Gb/s which resulted in the high bandwidth density of  $3 \,\text{Gb/s}/\mu$ m.
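The quoted bandwidth density follows directly from the per-wire data rate and the wire pitch. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
def bandwidth_density(data_rate_gbps, pitch_um):
    """Bandwidth per unit of interposer edge width, in Gb/s per micrometer."""
    return data_rate_gbps / pitch_um


# Values from the text: 30 Gb/s per wire at 10 um pitch
print(bandwidth_density(30.0, 10.0))
```

The pitch, not the wire width alone, sets the density, since the spacing between wires also occupies routing area.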

Kim et al. showed that 2.5D interconnect can also be used for signal bandwidth peaking or enhancement by making passive inductors [70]. The 2-turn and 4-turn inductors for differential signalling were demonstrated to flatten the frequency response of interconnect without any kind of power consumption. The technique was shown to work at 10 Gb/s test vehicle and eye diagrams were shown to open with the usage of passive on interposer metal interconnect equalization.

Xue et al. discussed the problem of resonance cavities in co-planar waveguide 2.5D signal interconnects [71]. The routing distribution layer (RDL) on the interposer passes through a through silicon via (TSV) and then back to the RDL. This RDL-TSV-RDL structure is shown to suffer from resonant dips in insertion loss, which lead to poor signal integrity. The authors proposed the usage of multiple ground TSVs on both sides of the signal line around the signal TSVs. This method was used to demonstrate a reduction of the insertion loss dip by about 2 dB.

Sawyer et al. demonstrated the usage of glass as a substitute for the silicon substrate in 2.5D multi-chip systems [72]. The lower dielectric constant and low loss tangent of glass enable high speed data transmission. Test structures with different spacing rules and lengths, along with calibration structures, were included on the panel. The measurement results showed an insertion loss of 0.05 dB/mm at a frequency of 14 GHz. Furthermore, crosstalk of only -30 dB was reported for a differential pair with 200 $\mu$m spacing at 40 GHz.

Kim et al. measured and simulated the complete 15 cm die-interposer-package-PCB interconnect for a 28 Gb/s FPGA SERDES [73]. The interconnect was mainly PCB, and the aim of the work was to determine the impact of multiple types of materials in the channel, which could cause impedance discontinuities and signal reflection problems resulting in degradation of eye width and height.

## 2.2.2 Weaknesses in State of the Art

The state of the art discussed above shows that 2.5D interposer interconnect has been characterized and evaluated for low speed memory interfaces and for high speed differential signalling SERDES interfaces. From the analysis above, the following weaknesses or shortcomings are identified.

- Only HBM and wide-I/O memories are used for 2.5D interconnect performance evaluation, while other very common memories such as double data rate (DDR) memories were not considered
- HBM and wide-I/O memories were not evaluated for different driver and receiver settings and how they could impact the performance of channel and signal integrity
- SERDES 2.5D characterization was performed in published work but the discussion on SERDES IP settings optimization for 2.5D interconnect is lacking, which could really help the system designer in reducing the system power or channel area usage depending upon the design requirements

## 2.2.3 Channel Design Problem update

Based on the shortcomings identified above, the interconnect or channel design and characterization problem for 2.5D multi-chip systems can be narrowed down as below.

#### Final Design problems for MCS 2.5D Channel:

- Characterize and design a memory-cpu 2.5D MCS system channel for DDR3 memory-cpu interface which should reduce the energy and area cost metrics for 2.5D DDR3 system versus the PCB system and make a comparison
- Characterize the 2.5D MCS channel interconnect for high speed serial interfaces used in industry to reduce the energy and area costs and also determine the energy reduction possibilities in SERDES PHY for 2.5D MCS channels

## 2.3 Co-design Methodologies

The third section of this thesis focuses on the co-design methodologies for communication interfaces, especially in the context of short reach multi-chip modules and 2.5D systems. In the previous two sections, the transmitter is designed based on the channel and the channel is designed based on the transmitter, but a co-design of both is also important to achieve an energy-area minimum for a given application. This problem is extended to a higher abstraction level for memory-cpu path finding and design exploration problems, which could give the memory choice and integration technology choice with minimum energy and area consumption for a given bandwidth application. The problems can be stated as follows.

**Design Inputs:** For a driver $\alpha \in A = [HSUL, LSUL, CML, LVDS, SSTL]$, length l of channel, and data rate requirement f in Gb/s, determine the optimum channel width W and spacing S along with the optimum driver impedance value Z in $\Omega$ and the equalization tap number n in driver or receiver based on the channel pulse response. For a given bandwidth f, choose the minimum cost memory and integration technology.

**Design Cost metric:** In order to find the co-designed optimum channel and driver parameters, the cost metric to be followed is the minimization of the energy-area cost $\psi$ ($\text{pJ/bit}\cdot\text{mm}^2$). The same metric is used for the memory-cpu system design problem.

**Design effort result:** Width of channel interconnect $W$ ($\mu$m) and spacing between lines $S$ ($\mu$m); energy optimal impedance and equalization settings, i.e. $Z$ in $\Omega$ of the driver and the number of equalization taps n in driver or receiver; optimum choice of memory and substrate material or integration platform for the memory-cpu interface.
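Stated this way, the co-design problem is a constrained search over (W, S, Z, n) for the minimum $\psi$. The sketch below shows one possible form, an exhaustive grid search; it is not the methodology of this thesis. The `evaluate` callback stands in for a link simulation, and `toy_evaluate` is a placeholder with no physical meaning, included only so the example runs.

```python
from itertools import product


def codesign_search(widths_um, spacings_um, impedances_ohm, taps, evaluate):
    """Exhaustive search for the minimum energy-area cost.

    `evaluate(w, s, z, n)` is a user-supplied (hypothetical) callback that
    returns (psi, eye_ok), e.g. from a channel and driver simulation.
    """
    best = None
    for w, s, z, n in product(widths_um, spacings_um, impedances_ohm, taps):
        psi, eye_ok = evaluate(w, s, z, n)
        # Keep only candidates that satisfy the eye constraints
        if eye_ok and (best is None or psi < best[0]):
            best = (psi, {"W": w, "S": s, "Z": z, "n": n})
    return best


def toy_evaluate(w, s, z, n):
    # Placeholder model, NOT physical: wider pitch costs area, more taps
    # cost energy, and wires narrower than 2 um fail the eye constraint.
    psi = (w + s) * (1.0 + 0.1 * n) * (50.0 / z)
    return psi, w >= 2.0


print(codesign_search([1.0, 2.0, 3.0], [1.0, 2.0], [50.0, 100.0],
                      [0, 1, 2], toy_evaluate))
```

An exhaustive search is only practical for small grids; when each evaluation is a full signal integrity simulation, the grid would typically be pruned first.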

#### 2.3.1 State of the Art

High speed multi Gb/s I/O data rates are essential for chip to chip communication on PCB and backplane systems. The methodologies developed for optimization of these interfaces can be adjusted for short reach interfaces in multi-chip modules and 2.5D systems. Several papers have been published by research community in this area as discussed below.

Hatamkhani et al. published one of the first works on modeling the deterministic jitter at the high speed transmitter output for different kinds of drivers [74]. HSUL and low swing NMOS-over-NMOS topologies were studied in single ended and differential topologies. Equation models for deterministic jitter were derived using mean square error (MSE) fitting of simulated drivers. They provided an approach to optimize the size of inverter chains for minimum power while meeting the 11% deterministic jitter constraint set in the standards. These buffers are critical to maximize the overall energy efficiency of high speed transmitters.

Balamurugan and coworkers demonstrated a statistical analysis of the transmitter jitter and its relationship with the channel [75]. They showed jitter estimation due to several noise contributions in the transmitter and also the channel impulse response. Channel and transmitter were both modeled as a linear time invariant (LTI) system. The jitter distribution was modeled as dual-Dirac normal distribution (Gaussian+bi-modal) for the analysis. They presented a method to estimate the distribution parameters by estimating the noise voltages at the transmitter and receiver.
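For reference, the dual-Dirac model is commonly used to extrapolate total jitter at a target bit error rate as $TJ = DJ_{\delta\delta} + 2 \cdot Q_{BER} \cdot RJ_{rms}$. The sketch below implements that common convention; it is not the estimation procedure of [75], and the function names and example values are illustrative.

```python
from statistics import NormalDist


def q_ber(ber):
    """Gaussian Q-scale factor for a target bit error rate."""
    return -NormalDist().inv_cdf(ber)


def total_jitter(dj_dd, rj_rms, ber=1e-12):
    """Total jitter at the given BER: TJ = DJ(dd) + 2 * Q(BER) * RJ(rms).

    dj_dd is the separation of the two Dirac impulses and rj_rms the
    standard deviation of the Gaussian part, both in the same unit.
    """
    return dj_dd + 2.0 * q_ber(ber) * rj_rms


# Illustrative values in unit intervals (UI)
print(q_ber(1e-12))
print(total_jitter(dj_dd=0.10, rj_rms=0.01))
```

Because the Gaussian term is multiplied by roughly 14 at a BER of 1e-12, even a small random jitter contribution can dominate the timing budget.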

A work on power optimization of transmitter using ideal equalization was presented by Hatamkhani et al. [76]. They tried to minimize the power consumption using the optimum data rate for given driver topology in a channel. It was shown that for same channel, low common mode signalling optimum data rate was less than high common mode optimum data rate. Optimum energy per bit scaling was shown from 180 nm to 90 nm technology node and it would scale with even smaller technologies. When using very small technologies, it was shown that the optimum energy per bit was bounded by the channel rather than the technology.

Palaniappan et al. [77] presented a serial link optimization methodology for different equalization schemes using the current mode logic (CML) driver topology. They related the capacitance load with the required voltage swing and the data rate. From capacitance, they calculated the power consumption for different equalization topologies and optimized the power for given application. It was concluded that low loss of channel, minimally complex transmitter design and low voltage swing requirements could greatly enhance the energy efficiency of the communication interface.

Another work by Hatamkhani et al. demonstrated the deterministic jitter estimation for different driver topologies [78]. Deterministic jitter was modeled as difference in delay and its probability density functions were estimated for the transmitter. The work calculated the jitter due to transmitter, channel inter symbol interference, and receiver. Total jitter at receiver was combined with receiver offset voltage and decision time noise, in order to estimate the energy costs per bit for certain voltage noise margins at the receiver. This would help in finding the optimum data rate and energy per bit for a link under given receiver and channel characteristics.

One of the most important applications of 2.5D integration is memory-processor interface. In order to achieve the system level minimum power and maximum bandwidth in minimum possible area, a holistic design methodology is required. Xu et al. presented the idea of a data pattern aware memory controller for optimum routing and handling of the incoming data from SOC to the memory stack on the interposer [79]. The idea was to build a dynamic reconfigurable controller for adjusting the utilization of 2.5D signal channels using crossbar switches for congestion control and workload balancing. This work was unique to introduce a co-design approach for memory to SOC interface optimization under specific 2.5D system constraints.

Yazdani et al. showed a system integration and optimization methodology [80]. The I/O buffer, bump, and package ball placement was optimized for 2.5D integrated system with multiple dies. A hierarchical technique was used for optimization, where first the logic die I/O placement was optimized, and then the package ball bumps were optimized. DDR4 memory package was used for demonstrating this work and the optimization was only performed for logic and package I/O points.

## 2.3.2 Weaknesses in State of Art

- Only transmitter data rate and energy per bit was optimized in [78][76], while the energy efficiency dependency on the channel was shown in [77] but no channel-Tx-Rx co-design was demonstrated.
- There was no work demonstrating the system integration optimization with various choices of memories and integration technologies. Yazdani et al. did show some integration optimization but for only I/O buffer locations and package ball placements for single type of memory, i.e. DDR4 [80].

## 2.3.3 Design Methodology Problem Update

After going through the state of the art regarding the design methodologies for transceiver and channel interconnect, several weaknesses were identified. The main weakness is the lack of a wider co-design discussion for interconnect and drivers/receivers. The last topic of this thesis, co-design, shall therefore address the following narrowed-down, more specific problems.

#### Final Co-Design problems for MCS:

- Present a co-design methodology for energy-area  $\psi$  minimized channel and transceiver for MCS channel. Demonstrate the methodology with a widely used high speed SERDES driver such as CML for 2.5D silicon substrate based interconnect.
- Determine a memory-cpu MCS path finding / design exploration methodology which could list out the optimum choice of memory, and integration platform for given bandwidth, memory size and maximum energy-area constraints.

## 2.4 Conclusion

For energy efficient communication interfaces, effort is required in transmitter design, channel design and their co-design. When multiple dies are placed together in a package or 2.5D system, high data rates for multi-chip communication with minimum possible energy and area usage are necessary. This thesis focuses on the transmitter, channel and co-design areas to address these energy efficiency and minimum area requirements in multi-chip interfaces. This chapter presented a literature survey with some background of these areas.

Transmitter circuits can be broadly divided into voltage and current mode signalling circuits. Transmitter voltage mode topologies for MCM and 2.5D systems initially focused on high swing unterminated logic (HSUL) topology. Due to short channel lengths and low signal losses, simple topologies suddenly became popular for multi-chip interfaces. Some works demonstrated low swing architectures to achieve lower power consumption. Other works showed the usage of mismatched impedance topologies for higher voltage swing with less current. This was possible due to voltage line alike behavior of channels when their length is very short and losses in the channel also dampen the reflected high frequency parts of the signal. Some works even showed current mode transmitter schemes using transimpedance amplifier at the receiver end with high resistive gain. There were many weaknesses identified in the state of the art for transmitters in MCS, e.g. large area of some transmitters, low interconnect length support, missing discussion on low swing simple driver optimization for MCS channel length and lack of bunch of wires (BOW) standard PHY published work.

Research community has tried to study the behavior of channels in multi-chip systems and their signal integrity at different data rates. Memory interfaces were explored by some researchers, especially for newer memory technologies, e.g. HBM, and wide-I/O. Some works tried to explore the high speed multi Gb/s communication interfaces on silicon interposer and MCM channels. They concluded that channel losses impact the design of high speed systems greatly in short interfaces, thus making channel design a significant aspect of high speed interface design. However, there were several shortcomings such as no discussion on widely used DDR3 memories for 2.5D interconnect and their signal quality characterization versus the PCB. Also, the high speed SERDES characterization was performed for fixed channels but no energy minimization discussion was shown.

Few works tried to optimize the transmitter design for minimum power at specific data rates, mostly in context of PCB based communication links. Different driver topologies were studied for minimizing the power for typical receiver minimum voltage swing and maximum jitter constraints. It was shown that energy consumption does scale with technology nodes but reaches a lower bound due to channel constraints. Low common mode driver topologies were shown to outperform high common mode topologies for similar channel by supporting lower optimum energy per bit performance metric. It was also shown that total deterministic jitter at receiver end can be estimated by combining the transmitter, and channel jitter impact. This jitter can then be further added with the receiver offset voltage and timing characteristics to finally obtain the minimum voltage and timing margins for a communication interface. However, the discussion on the co-design of channel with the transmitter and receiver is missing in the state of the art, which could benefit the system designer in making energy and area trade-offs. Some path finding work for design and placement of I/O of dies and packages in multi-chip systems was shown but a holistic memory-cpu interface design flow for given bandwidth requirements was missing, which could help designer choose the type of memory and integration platform (MCM, PCB, 2.5D etc).

Chapters 3 and 4 will describe in detail the proposed transmitter designs and wafer level measurements. Chapter 5 will focus on channel or interconnect design and signal integrity characterization for memory and high speed interfaces in multi-chip systems. Chapter 6 will detail the co-design methodology for channel and transceiver. It shall also demonstrate the memory-cpu design flow. Chapter 7 will conclude the thesis with a summary, the main contributions and future work.

# Chapter 3

## Transmitter

This chapter describes the problem and solution for design of signal transmitter in 22 nm fully depleted silicon on insulator (FDSOI) technology node for communication through multi-chip module or silicon interposer interconnect. Transmitter design aspects are discussed and analyzed in terms of different transmitter topologies for different applications. Comparison with the state of the art is performed based on the results.

The first section 3.1 of this chapter deals with the first problem regarding transmitter design for bunch of wires interface standard. Second section deals with the driver optimization example for data rate and interconnect. Third section concludes the chapter.

**Design Problems and Constraints** The transmitter design is one important aspect of energy area efficient multi chip interfaces in order to realize the next generation of multi chip systems. The problems narrowed down in the last chapter for transmitter design are listed as below.

- 1. For frequency range f of 2-16 Gb/s, organic interconnect with low loss with lengths l in range of 1-10 mm, design a transmitter with energy-area  $pJ/bitmm^2$ combined efficiency per unit interconnect millimeter length less than the state of the art around 5.5 pJ/bitnm, energy consumption less than 0.4 pJ/bit and the main priority of extremely low transmitter area less than the state of the art designs, i.e. < 1323  $\mu$ m<sup>2</sup>,  $V_{swmin} \ge 250$  mV and eye width minimum  $t_{min} \ge 0.6$  UI
- 2. Show if a transmitter design for multi-chip systems can be optimized based on the data rate and interconnect length with an example driver for short interconnects.

## 3.1 Problem 1: BOW interface transmitter

For the first problem, a transmitter with a wide range of data transmission is designed. The target is to reach as high as possible frequency or bandwidth with minimal size and energy costs. The ideal target is 16 Gb/s with a minimum voltage swing of 250 mV at the receiver end. The power supply range is 0.8-1 V for the 22 nm FDSOI technology node used in this work. The minimum power supply will be used to save energy where possible. The target interconnect length is up to 10 mm. But this is the maximum length given in the standard, while shorter interconnects should also be supported by the transmitter. The main requirement is the dual driver topology for low and high frequencies. The design flow for this transmitter design is shown in Figure 3.1.

The first step in the design process is the determination of termination and impedance matching requirements. This shall help in the choice of topologies or driver architectures. The choice of topology is then made based upon the minimum energy consumption metrics. After the driver topology is chosen, then the sizing of transistors is performed based upon the capacitive load and frequency of transmission. Finally, the analysis is performed and results are compared with the state of the art.

The signal from the transmitter must reach the receiver input in such a shape that it is easily convertible to the logical 1 or 0 level. The critical factors deciding the driver design are the frequency or data rate of the signal  $(f_{UI})$  with time interval  $(t_{UI})$ , the capacitive loads  $(C_L)$ , the resistive load  $(R_T)$ , output voltage high  $(V_{oh})$ , output voltage low  $(V_{ol})$ , and target channel metrics as shown in Figure 3.2. In order to keep the signal reflections within low values for given data rate and channel length, terminated and unterminated topology can be chosen. Generally for low data rates and short channels, unterminated topology makes more sense due to reduced design, energy  $E_b$ , and area costs. While for longer channels and higher data rates, terminated topologies are a necessity. This work discusses both of these driver types and designs them for respective constraints for BOW interface.

Before we start with each step in the design flow shown in Figure 3.1, one missing constraint regarding pad capacitance must be determined. Also, the discussion regarding the slew rate dependency upon the current and load capacitance along with its typical relationship with resistance and capacitance (RC) is important. The load capacitance estimation and the slew rate calculation for capacitive load are shown below which shall help the design methodology become more clear and shall be discussed later.



Figure 3.1: Transmitter design flow



Figure 3.2: Driver design aspects for terminated and unterminated topologies

## 3.1.1 Estimation of Load Capacitance $(C_L)$

For typical PCB based systems, the transmitter cells require large electrostatic discharge (ESD) protection blocks at the output node to meet the industrial standards. Also, these cells must be sized large enough to drive a certain defined capacitive load, normally in the range of picofarads, mainly caused by the packaging of the IC and the input loading of the receiver IC. But these conditions are very different in MCM and 2.5D signalling systems. The total load capacitance at the output node of the transmitter is defined by four capacitance sources, i.e. pad capacitance ( $C_{PAD}$ ), ESD protection block capacitance ( $C_{ESD}$ ), self capacitance of driver transistors ( $C_{self}$ ) and wiring capacitance ( $C_{wire}$ ) as given below.

$$C_L = C_{PAD} + C_{ESD} + C_{wire} + C_{self} \tag{3.1}$$

 $C_{wire}$  is directly proportional to the size of driver transistors. This is because of the division of large transistor into multiple small transistors connected by wiring. Also, the wire width from driver to the pad of chip is large in order to reduce the resistive loss in wire. Then, there is also the wiring in ESD blocks. Typically, the wire capacitance at output node is calculated together with the pad capacitance  $C_{PAD}$ . The self loading of driver  $C_{self}$  is generally small in low data rate unterminated topologies as compared to the total load capacitance  $C_L$ .

**Pad Capacitance $(C_{PAD})$** The pad capacitance was large in wire bonded chips with pad sizes of a few hundred micrometers. But with the scaling of technology, the size of pads and the pitch between pads have reduced tremendously. The capacitance in traditional wire bonded chips was about 0.1 pF. In the given 22 nm technology, the pad pitch is 100 µm with octagonal pads at a spacing of 40 µm. An octagonal pad of 60 µm is shown in Figure 3.3. The simulated pad capacitance with typical wiring is only 30 fF. This is much less than for typical chips in a package.



Figure 3.3: 60 µm octagonal pad with 100 µm pitch

**ESD Capacitance $(C_{ESD})$** The JEDEC organization defines the ESD protection for ICs in packages. It defines the minimum ESD protection requirements in the standards for IC design companies. There are two main types of ESD protection required for input and output I/O cells in ICs, i.e. human body model (HBM) and charged device model (CDM). The JEDEC recommended HBM protection standard for SOC is 1 kV according to the JEP-155 document [81], while the recommended JEDEC standard for CDM ESD according to the JEP-157 document [82] is 250 V. These values are quite high because the signal is continuously exposed to the outside world through package connections to the PCB. But in MCM and 2.5D designs, the signal has to go only from one die to the other die through the package or interposer interconnect. The drivers and receivers are not exposed to a similar level of ESD as in typical PCB based designs. The driver designs in this thesis are specifically for inter-die communication, so the ESD standards for typical HBM and CDM protection do not apply.

The Global Semiconductor Alliance ESD Association (GSA-ESDA) has a special working group called the 3D IC packaging working group. This group published version 1.0 of the 3D and 2.5D IC ESD recommendations in January 2015 [83]. According to these special recommendations for inter-die communication I/Os, the HBM ESD protection should be a maximum of 100 V only, while CDM should be 20 V only. This means that multi-chip communication solutions offer a great advantage to I/O designers in terms of very small ESD protection standards for HBM and CDM ESD. Smaller ESD block capacitance can enable the I/Os to reach multi Gb/s data rates, as shown in the ESD capacitance loading requirements tables in the JEP-155 and JEP-157 documents [81][82]. For this work, ESD protection at the driver output is estimated to contribute about 20 fF, which adds to the pad capacitance to make the total load capacitance at the driver (excluding the self loading) 50 fF.

**Wire Capacitance $(C_{wire})$** The wire routing from transistor drain to the pad also impacts the total capacitive load seen by the transistor. The simulation performed with the pad along with generic non-ideal routing leads to a wire capacitance of about 20 fF. Thus, the total capacitance at the output pad including the wire, ESD and pad capacitance is 70 fF.

**Micro-bumps Capacitance and Inductance** Dehlaghi et al. estimated the capacitance of C4 $\mu$bumps of size 70 µm to be only 5 fF including the via and the bump [68]. The inductance of the $\mu$bump for die connection to silicon interposer is calculated to be 9 pH. Similarly for package or MCM connection of the die pad, the capacitance and inductance of a 100 µm bump are 9 fF and 15 pH, respectively. As expected for multi-chip systems, these values are also quite small and do not impact the driver design significantly, especially in the case of unterminated short channel links.
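The capacitance budget of Eq. 3.1 can be tallied with the values estimated in this section. The following is a minimal sketch: the function name and the `c_self_ff` parameter are illustrative assumptions, and the self loading is left as an input because the text only states that it is small.

```python
# Load-capacitance budget at the transmitter output (Eq. 3.1).
# The three constants are the estimates quoted in this section.
C_PAD_FF = 30.0   # simulated pad capacitance, fF
C_ESD_FF = 20.0   # estimated ESD block capacitance, fF
C_WIRE_FF = 20.0  # simulated drain-to-pad wiring capacitance, fF


def load_capacitance_ff(c_self_ff=0.0):
    """Return C_L = C_PAD + C_ESD + C_wire + C_self in femtofarads.

    c_self_ff is a hypothetical input; the text gives no number for it.
    """
    return C_PAD_FF + C_ESD_FF + C_WIRE_FF + c_self_ff


print(f"C_L without self loading: {load_capacitance_ff():.0f} fF")
```

Keeping $C_{self}$ as a separate argument makes it easy to check how much driver self loading can be tolerated before the assumption of a fixed external load no longer holds.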

#### 3.1.2 Capacitive load Slew rate and Bandwidth

For unterminated driver topology, the load is capacitive. As shown in Figure 3.2, the power required to charge the capacitive load $C_L$ is given as $f_{UI} \cdot C_L \cdot V_{oh}^2$, i.e. an energy of $C_L \cdot V_{oh}^2$ per charging event, where $f_{UI}$ defines the data bit rate or unit interval frequency and $V_{oh}$ defines the voltage to be stored in the capacitive load. In order to reduce the energy consumption per bit (pJ/bit), the output voltage swing $V_{oh}$ should be reduced along with $C_L$. But the voltage swing directly impacts the slew rate (dV/dt), which directly defines the maximum possible bit frequency $(f_{UI})$. Hence, it is necessary to understand the relationship between driving current, capacitor loading and slew rate with the maximum data frequency of the driver.

Rabaey characterized the switching speed of CMOS inverters with a capacitive load by the propagation delay, measured from the 50% point of the input to the 50% point of the output [84]. He defined the propagation delay of the push-pull PMOS-NMOS circuit as:

$$t_{pd} = 0.69 \cdot C_{PAD} \left(\frac{R_{NMOS} + R_{PMOS}}{2}\right) \tag{3.2}$$

where $R_{NMOS}$ and $R_{PMOS}$ are the average resistances of the NMOS and PMOS during the transient operation of the inverter. According to Eric Bogatin, from the perspective of drivers and data transmission, the 10-90% output rise time $t_{rise}$ carries much more significance in terms of maximum data rate and bandwidth [85]. Bogatin related the $-3 \,\mathrm{dB}$ bandwidth of the signal to the 10-90% rise time $t_{rise}$ using the relationship:

$$-3\,\mathrm{dB}\ \mathrm{Bandwidth} = \frac{0.35}{t_{rise}}\tag{3.3}$$

where  $t_{rise}$  is in nano-seconds while  $-3 \,\mathrm{dB}$  Bandwidth is in GHz. Rearranging the Eq. 3.3 and using the similarity between  $-3 \,\mathrm{dB}$  bandwidth and Nyquist frequency  $(1/2 \cdot f_{UI})$ , the maximum achievable data rate for slew rate dependent rise time can be given as:

$$f_{UImax} = \frac{0.7}{t_{rise}} \tag{3.4}$$
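Eqs. 3.3 and 3.4 can be turned into a small helper for quick sizing checks. This is a minimal sketch with rise time in nanoseconds, bandwidth in GHz and data rate in Gb/s; the function names and the 0.1 ns example value are illustrative assumptions, not values from this work.

```python
def bandwidth_ghz(t_rise_ns):
    """-3 dB bandwidth in GHz for a 10-90% rise time in ns (Eq. 3.3)."""
    return 0.35 / t_rise_ns


def max_data_rate_gbps(t_rise_ns):
    """Maximum data rate in Gb/s, taking the bandwidth as Nyquist (Eq. 3.4)."""
    return 0.7 / t_rise_ns


def required_rise_time_ns(f_ui_gbps):
    """Eq. 3.4 rearranged: slowest rise time that still supports f_ui."""
    return 0.7 / f_ui_gbps


# Hypothetical example: a driver with a 0.1 ns (100 ps) rise time.
t_rise = 0.1
print(f"bandwidth     : {bandwidth_ghz(t_rise):.2f} GHz")
print(f"max data rate : {max_data_rate_gbps(t_rise):.2f} Gb/s")
print(f"rise time needed for 16 Gb/s: {required_rise_time_ns(16) * 1e3:.1f} ps")
```

The last helper is the form used most often in practice, because the data rate is usually the given constraint and the rise time is what the driver sizing must achieve.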

Rabaey calculated the rise time of a capacitor charged through a PMOS-NMOS inverter using the $V_{oh}$ to $V_{oh}/2$ transition [84], and the resistance-capacitance time constant technique was used to estimate the propagation delay. Due to the significance of the 10-90% rise time for drivers, this work calculates the load capacitor charge time using current equations for the PMOS that are as accurate as possible. Figure 3.4 shows the schematic used to analyze the rise time, where V is the instantaneous voltage across $C_L$, i(V) is the current entering $C_L$, $i_{ds}(V)$ is the drain to source current of the PMOS, $V_G$ is the input at the gate of the PMOS (generally = 0 for the rise transition), $V_{oh}$ is the supply voltage to which V eventually rises and vss is the low voltage of the circuit, generally referred to as ground.

Figure 3.4: Capacitor load charging through PMOS

As the voltage V across $C_L$ rises from $0.1 V_{oh}$ to $0.9 V_{oh}$, the current i(V) keeps changing because the PMOS current depends on the drain-source voltage. The instantaneous voltage change dV across $C_L$ due to the instantaneous current i(V) for time dt is written as:

$$dV = \frac{1}{C_L} \cdot i(V) \cdot dt \tag{3.5}$$

Voltage increase across  $C_L$  due to current i(V) can be written as:

$$\int dV = \frac{1}{C_L} \int i(V) \cdot dt \tag{3.6}$$

Rearranging and integrating for rise time  $t_{rise}$  from  $0.1 V_{oh}$  to  $0.9 V_{oh}$ :

$$C_L \int_{0.1 \,\mathrm{V_{oh}}}^{0.9 \,\mathrm{V_{oh}}} \frac{1}{i(V)} dV = \int_{t_{rise}} dt = t_{rise} \tag{3.7}$$

Please note here that  $C_L$  also includes the self loading capacitance of driver  $C_{self}$  which changes with the voltage V across the capacitive load. It is assumed here that load capacitance due to pad, ESD, and wiring is much larger than the self capacitance of driver, which is especially true for short reach low data rate unterminated links. Hence,  $C_L$  is kept out of the integral in equations above. As voltage V across capacitor increases, the operating region of transistor will change. Hence, the current i(V) can be given as:

$$i(V) = \begin{cases} i_{sat}(V) = -\frac{1}{2}\mu C_{ox} \frac{W}{L} V_{eff}^{2} (1+\lambda|V-V_{oh}-V_{eff}|) & \text{if } |V-V_{oh}| \ge |V_{eff}| \\ i_{tri}(V) = -\mu C_{ox} \frac{W}{L} \left[ V_{eff} (V-V_{oh}) - \frac{(V-V_{oh})^{2}}{2} \right] & \text{if } |V-V_{oh}| < |V_{eff}| \end{cases} \tag{3.8}$$

where

$$V_{eff} = V_{GS} - V_T = V_G - V_{oh} - V_T$$
$$V - V_{oh} - V_{eff} = V - V_{oh} - V_G + V_{oh} + V_T = V - V_G + V_T$$
$$I_{DSAT} = \frac{1}{2} \mu C_{ox} \frac{W}{L} V_{eff}^2$$



Figure 3.5: Capacitor load charging through 20 nm PMOS with  $0.25 \text{ V} V_T$ 

Hence, the rise time integral can be re-written as:

$$t_{rise} = t_{rise_{sat}} + t_{rise_{tri}} = C_L \int_{0.1\,\mathrm{V_{oh}}}^{(\mathrm{V_G}\,-\,\mathrm{V_T})} \frac{1}{i_{sat}(V)}dV + C_L \int_{(\mathrm{V_G}\,-\,\mathrm{V_T})}^{0.9\,\mathrm{V_{oh}}} \frac{1}{i_{tri}(V)}dV \tag{3.9}$$

First we start with the amount of time taken to charge the load capacitance during the saturation region of transistor.

$$t_{rise_{sat}} = -\frac{C_L}{I_{DSAT}} \int_{0.1 \,\mathrm{V_{oh}}}^{(\mathrm{V_G} - \mathrm{V_T})} \frac{1}{[1 + \lambda |V - V_G + V_T|]} dV \tag{3.10}$$

Please note that  $V_T$  is a negative value since it is PMOS used for charging the load capacitance in this example. Integrating the equation above gives:

$$t_{rise_{sat}} = -\frac{C_L}{\lambda I_{DSAT}} \left[ \ln \left( 1 + \lambda \left| V - V_G + V_T \right| \right) \right] \Big|_{0.1 \, \mathrm{V_{oh}}}^{V_G - V_T}$$

The rise time  $t_{rise}$  during saturation region of transistor can be written after putting in the integration limits as:

$$t_{rise_{sat}} = \frac{C_L}{\lambda I_{DSAT}} \left[ \ln \left( 1 + \lambda \left| 0.1 V_{oh} - V_G + V_T \right| \right) \right] \tag{3.11}$$

Figure 3.6: Capacitor load discharging through NMOS

The current capability of the transistor is quite high in the saturation region, while the current carried in the triode region is generally low, so a full mathematical analysis is overkill. For the total rise time, an assumption based on simulations can be made, i.e. the saturation region current based on only the effective gate to source voltage can be used to estimate the rise time for a given capacitive load. This relationship is expressed as:

$$t_{rise} = 0.8V_{oh}\frac{C_L}{I_{DSAT}} \tag{3.12}$$

These methods of calculating the saturation region rise time and the total rise time are compared with simulation results, which confirm their accuracy. A PMOS is simulated with a 50 fF load, and the saturation region and total rise times are calculated as shown in Figure 3.5.

The saturation region rise time is only 250 ps while the total rise time is 726 ps. The value calculated for $t_{rise_{sat}}$ from Eq. 3.11 is 204 ps, which is quite close to the simulated value, especially considering that the simulation is performed for an extremely short channel transistor in FDSOI technology with a complex transistor model. The total rise time calculated using Eq. 3.12 is 676 ps, which is also very close to the simulated value of 726 ps. These results provide the designer with an extremely easy way to size the transistors for a given capacitive load including pad, ESD, wiring and self loading.
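The rise time estimates can be cross-checked numerically. The sketch below evaluates the closed forms of Eq. 3.11 and Eq. 3.12 and compares them with a direct numerical integration of Eq. 3.7 using the current magnitudes of the square-law model in Eq. 3.8. The device parameters `K` (standing for $\mu C_{ox} W/L$) and `LAM` are hypothetical placeholders, the 0.8 V supply and grounded gate are assumptions, and only the 50 fF load and the 0.25 V threshold magnitude are taken from the text, so the printed numbers are not expected to reproduce the 22 nm FDSOI simulation results.

```python
import math


def i_dsat(k, v_eff):
    """Saturation current without channel-length modulation."""
    return 0.5 * k * v_eff ** 2


def pmos_current(v, v_oh, v_g, v_t, k, lam):
    """Magnitude of the PMOS charging current following Eq. 3.8."""
    v_eff = v_g - v_oh - v_t          # negative for a conducting PMOS
    v_ds = v - v_oh                   # negative while the load charges
    if abs(v_ds) >= abs(v_eff):       # saturation region
        return i_dsat(k, v_eff) * (1 + lam * abs(v - v_g + v_t))
    return k * abs(v_eff * v_ds - 0.5 * v_ds ** 2)   # triode region


def t_rise_sat_closed(c_l, v_oh, v_g, v_t, k, lam):
    """Saturation-region part of the rise time, Eq. 3.11."""
    v_eff = v_g - v_oh - v_t
    arg = 1 + lam * abs(0.1 * v_oh - v_g + v_t)
    return c_l / (lam * i_dsat(k, v_eff)) * math.log(arg)


def t_rise_total_approx(c_l, v_oh, v_g, v_t, k):
    """Total 10-90% rise time estimate, Eq. 3.12."""
    return 0.8 * v_oh * c_l / i_dsat(k, v_g - v_oh - v_t)


def t_rise_numeric(c_l, v_oh, v_g, v_t, k, lam, v_start, v_stop, steps=20000):
    """Midpoint-rule evaluation of C_L * integral of dV / |i(V)| (Eq. 3.7)."""
    dv = (v_stop - v_start) / steps
    total = 0.0
    for n in range(steps):
        v_mid = v_start + (n + 0.5) * dv
        total += dv / pmos_current(v_mid, v_oh, v_g, v_t, k, lam)
    return c_l * total


# Illustrative parameters: K and LAM are assumptions, not extracted values.
C_L, V_OH, V_G, V_T = 50e-15, 0.8, 0.0, -0.25
K, LAM = 3e-4, 0.1

sat_closed = t_rise_sat_closed(C_L, V_OH, V_G, V_T, K, LAM)
sat_numeric = t_rise_numeric(C_L, V_OH, V_G, V_T, K, LAM, 0.1 * V_OH, V_G - V_T)
total_numeric = t_rise_numeric(C_L, V_OH, V_G, V_T, K, LAM, 0.1 * V_OH, 0.9 * V_OH)
total_approx = t_rise_total_approx(C_L, V_OH, V_G, V_T, K)

print(f"t_rise_sat  closed form : {sat_closed * 1e12:.1f} ps")
print(f"t_rise_sat  numeric     : {sat_numeric * 1e12:.1f} ps")
print(f"t_rise      numeric     : {total_numeric * 1e12:.1f} ps")
print(f"t_rise      Eq. 3.12    : {total_approx * 1e12:.1f} ps")
```

With extracted values for `K` and `LAM`, the same script can be used to judge how far the simple estimate of Eq. 3.12 deviates from the full integral for a given supply and threshold voltage.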

For falling edge, the NMOS transistor draws current from the load capacitor and discharges it based on the saturation current available to transistor. The schematic of such transition is shown in Figure 3.6 where the current i(V) flows out from  $C_L$  through NMOS to vss. Similar to PMOS, the fall times can be calculated using equations similar to what are described above.

#### 3.1.3 Targeted Channels

The drivers are designed in this work for MCM and 2.5D systems. These systems have different kinds of interconnect and channel properties. In order to evaluate the driver performance and ensure its usage for these interconnects, s-parameters or other interconnect models are necessary.

**Organic substrate interconnect** The first model used in this work is for an organic glass substrate based high density interconnect. The stackup for the low loss interconnect is designed and manufactured. Ground-signal-ground (GSG) lines of length 3.8 mm are fabricated and s-parameters are measured using a vector network analyzer (VNA). The stackup is shown in Figure 3.7, where $\mu$pillars are used for die to substrate connection. The width of the signal line is 10 µm and the spacing between signal and ground line is also $10\,\mu\text{m}$. The width of the ground line is $16\,\mu\text{m}$. The manufactured interconnect is shown in Figure 3.8. The pads for connection to the VNA can be seen in this figure. The measured and simulated $S_{11}$ and $S_{21}$ parameters are shown in Figure 3.9. The $S_{11}$ parameter signifies the reflected signal energy received at the same port when a signal with $50 \Omega$ port impedance is applied at one end of the interconnect, while $S_{21}$ shows the signal energy at the other end of the interconnect (port 2) due to a signal driven at port 1. In other words, $S_{11}$ is a measure of the impedance mismatch of the interconnect from the reference port impedance and $S_{21}$ is a measure of signal loss in the interconnect. As shown in Figure 3.9, simulation to measurement correlation is quite good. The $S_{11}$ value for this interconnect is around $-10 \,\mathrm{dB}$ and becomes better for large frequencies. The insertion loss is less than 1 dB ($S_{21}$ above $-1 \,\mathrm{dB}$), which makes this interconnect quite useful for low power data transmission without requiring channel equalization techniques.

Figure 3.7: Organic glass substrate based stackup
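To put the quoted s-parameter levels in perspective, decibel values can be converted to linear power and amplitude ratios. This is a minimal sketch using the standard dB definitions; the function names are illustrative.

```python
def db_to_power_ratio(db):
    """Convert a dB value to a linear power ratio."""
    return 10 ** (db / 10)


def db_to_amplitude_ratio(db):
    """Convert a dB value to a linear voltage (amplitude) ratio."""
    return 10 ** (db / 20)


# S11 of about -10 dB: fraction of incident power that is reflected.
print(f"reflected power fraction at -10 dB: {db_to_power_ratio(-10):.3f}")
# S21 of -1 dB: fraction of power and amplitude arriving at port 2.
print(f"transmitted power fraction at -1 dB: {db_to_power_ratio(-1):.3f}")
print(f"transmitted amplitude ratio at -1 dB: {db_to_amplitude_ratio(-1):.3f}")
```

A return loss of 10 dB therefore corresponds to one tenth of the incident power being reflected, and 1 dB of insertion loss leaves roughly 89% of the signal amplitude at the far end.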

**Silicon substrate** The second application for multi-chip systems is communication between dies on a silicon interposer. The BEOL routing of the interposer provides very dense interconnects. Typical interconnect widths on silicon interposer can range from less than $1 \,\mu m$ up to $10 \,\mu m$. The spacing between the interconnects can also range from $0.5 \times$ to $3 \times$. A silicon substrate interconnect stackup is shown in Figure 3.10, where the signal lines are shown in differential mode with ground lines on each side and below. All the interconnects have the same width and spacing for the densest routing possible at a certain data rate. Although the substrate is silicon, the dielectric in which the metal routing is done is made of silicon dioxide $(SiO_2)$. Silicon has a dielectric constant $(\epsilon_r)$ of 11.9, which is quite high and can result in large signal insertion loss. But the ground lines around the signal lines in $SiO_2$ material with a low dielectric constant of 3.9 can reduce these losses. More details on the relationship of width, spacing and length with signal quality are given in the channel design chapter 5. An insertion loss graph of different width interconnects on silicon interposer is shown in Figure 3.11. The greater the width, the smaller the signal loss. Insertion loss increases with frequency due to skin effect and substrate loss of the signal. It should be noted that increasing the width does not linearly decrease the insertion loss. Increasing the width up to a certain value reduces the signal loss noticeably, but further increments in width do not lead to a significant improvement of $S_{21}$. These kinds of trade-offs play an important role in the optimization of overall multi-chip systems, especially in terms of energy and area reduction for a given application.



Figure 3.8: Organic glass substrate based GSG interconnect



Figure 3.9: Organic glass interconnect s-parameters

Figure 3.10: Silicon substrate based interconnect

Figure 3.11: Silicon substrate interconnect insertion loss

## 3.1.4 Termination Required?

Now that the capacitive load is clear, i.e. 70 fF, and the resistive load is restricted to the typical 50 $\Omega$ value, the decision on the termination requirement is important. The idea of critical length is used to determine the termination requirement of an interconnect [86][85]. Any interconnect of length l delays the incoming signal from one end until it reaches the other end of the interconnect. This delay is called the propagation delay $t_{pd}$ of the interconnect. The interconnect must be terminated if the propagation delay is greater than or equal to one third of the rise time of the signal [86]. The rise time of the signal, as described earlier, is defined by the bandwidth or data rate and is calculated as

$$t_{rise} = 0.7/f$$

where f is the bit rate or unit interval frequency in Gb/s. Hence, any interconnect with propagation delay $t_{pd}$ must be used for signalling with termination if

$$t_{pd} \ge \frac{t_{rise}}{3}$$

For the organic glass substrate interconnect in this problem for the BOW interface transmitter, the insulator around the interconnect is polyimide with a dielectric constant of 3.4 [87]. The thick substrate beneath the polyimide is glass with a dielectric constant range of 5-10. In this work, we use the worst case dielectric constant ( $\epsilon$ ) of glass, i.e. 10. Since the interconnect is in a non-homogeneous medium due to the different characteristics of glass and polyimide, the average of the two dielectric constants is used. Hence,

$$\epsilon_{avg} = \frac{\epsilon_{polyimide} + \epsilon_{glass}}{2} = \frac{3.4 + 10}{2} = 6.7$$

In order to use the termination requirement relationship between delay and rise time, the interconnect delay needs to be calculated. The delay is dependent upon the velocity v of signal through a medium, which is given as [85]

$$v = \frac{c}{\sqrt{\epsilon_{avg}}}$$

where c is the speed of light in vacuum, i.e. $3 \times 10^8$ m/s. Hence, the propagation delay of an interconnect with length l is given as

$$t_{pd} = \frac{l \cdot \sqrt{\epsilon_{avg}}}{c}$$

A graph showing the propagation delay  $t_{pd}$  for lengths from 1 to 20 mm and one third of the signal rise times for data rates from 1 to 5 Gb/s is shown in Figure 3.12. As expected,  $t_{pd}$  increases linearly with the length. 1 Gb/s rise time remains



Figure 3.12: Termination requirement based on critical length of interconnect versus one third of signal rise time for various data rates

larger than the propagation delay even at 20 mm interconnect length. Hence, there is no need for termination at 1 Gb/s even at 20 mm length. The 2 Gb/s line, however, crosses $t_{pd}$ at about 14 mm. Therefore, interconnects of 14 mm or longer must use termination at data rates of 2 Gb/s or higher. Similarly, at 5 Gb/s, lengths longer than only about 5 mm must be terminated. Thus, the higher the data rate, the shorter the interconnect length that can be supported without termination, while even long interconnects of 20 mm can be run without termination at low data rates, e.g. 1 Gb/s or lower.

The above discussion leads to the conclusion that, for the given problem of 10 mm signalling over organic substrate, data rates higher than 3 Gb/s over the full 10 mm length should be supported through terminated circuits. If the interconnect length is shorter than 10 mm, then higher data rates could be supported without termination as shown in the BOW interface standard [56]. Therefore, the unterminated circuits designed in this work target a maximum data rate of 3 Gb/s for 10 mm.
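The critical-length check used above can be sketched numerically. The following Python snippet is an illustrative sketch rather than part of the original design flow; the function names are hypothetical, and the constants are the ones quoted in the text ($\epsilon_{avg} = 6.7$, $c = 3 \times 10^8$ m/s).

```python
import math

C = 3.0e8                    # speed of light in vacuum, m/s
EPS_AVG = (3.4 + 10.0) / 2   # average of the polyimide and glass dielectric constants


def t_rise(bit_rate_hz):
    """Signal rise time from the bit rate, t_rise = 0.7 / f."""
    return 0.7 / bit_rate_hz


def t_pd(length_m, eps=EPS_AVG):
    """Propagation delay of an interconnect of the given length."""
    return length_m * math.sqrt(eps) / C


def needs_termination(length_m, bit_rate_hz):
    """Critical-length rule: terminate when t_pd >= t_rise / 3."""
    return t_pd(length_m) >= t_rise(bit_rate_hz) / 3


def critical_length_m(bit_rate_hz, eps=EPS_AVG):
    """Length at which t_pd equals one third of the rise time."""
    return t_rise(bit_rate_hz) * C / (3 * math.sqrt(eps))


if __name__ == "__main__":
    for gbps in (1, 2, 5):
        l_mm = critical_length_m(gbps * 1e9) * 1e3
        print(f"{gbps} Gb/s: critical length = {l_mm:.1f} mm")
```

With these values the 2 Gb/s crossover evaluates to roughly 13.5 mm and the 5 Gb/s crossover to roughly 5.4 mm, in line with the approximate readings taken from Figure 3.12.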

#### 3.1.5 Topology or Architecture Choice

The next step after the termination choice is the methodology behind the choice of topology for the transmitter design problem. The set of available topologies is $A = \{HSUL, LVDS, CML, SSTL\}$. The only unterminated topology available in the set A is high swing unterminated logic (HSUL), based upon the PMOS-over-NMOS push pull architecture. Hence, for high speed terminated signalling, the set reduces to LVDS, CML, and SSTL. In order to choose between the three topologies, the energy efficiency per bit $E_b$ in pJ/bit is the most important metric as it is a part of the energy-area metric $\psi$.

For the given minimum voltage swing $V_{sw}$ requirement of $250 \,\mathrm{mV}$ in the problem statement, the energy efficiency $E_b$ for each topology is estimated for the data rate range of 3-16 Gb/s. The topology with the minimum estimated energy consumption is chosen and then designed in detail in later sections.

**LVDS** The topology for low voltage differential signalling is shown in Figure 3.13. It has a differential architecture and works on the current mode approach. The current is driven from the driver through the receiver termination, which leads to a certain voltage swing $V_{sw}$ across the receiver termination resistor $R_T$. The driver side transistors must be matched to the channel impedance in order to avoid reflections. To achieve this, the driver side is also terminated with a resistance $R_T$ matching the channel impedance, which absorbs any reflections. For high speed signalling, this double side impedance matching is important. The current I in the LVDS driver is split in half due to the double side termination, and only I/2 goes through the receiver termination, resulting in a voltage swing $V_{sw}$ of $(I/2) \cdot R_T$. For example, for the specified channel impedance of 50 $\Omega$ in the problem statement for terminated transmitters, the minimum current I for 250 mV swing is 10 mA. The relationship between I and $V_{sw}$ is then

$$I = \frac{2V_{sw}}{R_T}$$

This current is static current and is driven from the power supply  $V_{DD}$  at all times irrespective of the switching speed or the frequency or data rate of transmission. The energy consumption per bit  $E_b$  can be written as power consumption P multiplied



Figure 3.13: LVDS topology

by the time interval per unit bit  $t_{UI}$  which is the inverse of bit rate frequency f.

$$E_b = P \cdot t_{UI} = \frac{P}{f}$$

Hence, the  $E_b$  for LVDS can be written as

$$E_{bLVDS} = \frac{V_{DD} \cdot I}{f} = \frac{V_{DD} \cdot 2V_{sw}}{f \cdot R_T} \tag{3.13}$$

This means that the energy efficiency improves with increasing bit rate or frequency of transmission, because the power consumption is constant and independent of the bit rate f. Thus, at very high bit rates, the energy per bit of LVDS drivers is expected to go down.
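As a small numerical illustration of Eq. 3.13, the sketch below evaluates the LVDS current and energy per bit. The function names are hypothetical, and the supply value of 0.6 V is an assumption borrowed from the comparison made later in this section; only the 250 mV swing and the 50 $\Omega$ termination come from the problem statement.

```python
def lvds_current(v_sw, r_t):
    """Minimum driver current for a swing v_sw with double side termination."""
    return 2 * v_sw / r_t


def lvds_energy_per_bit(v_dd, v_sw, r_t, bit_rate_hz):
    """Eq. 3.13: static power V_DD * I divided by the bit rate."""
    return v_dd * lvds_current(v_sw, r_t) / bit_rate_hz


V_DD = 0.6   # assumed supply for illustration, in volts
V_SW = 0.25  # minimum swing from the problem statement, in volts
R_T = 50.0   # termination resistance, in ohms

print(f"I = {lvds_current(V_SW, R_T) * 1e3:.1f} mA")
for gbps in (4, 8, 16):
    e_pj = lvds_energy_per_bit(V_DD, V_SW, R_T, gbps * 1e9) * 1e12
    print(f"{gbps} Gb/s: E_b = {e_pj:.3f} pJ/bit")
```

Doubling the bit rate halves the energy per bit because the static power $V_{DD} \cdot I$ stays fixed.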

**SSTL-HCM** The source series terminated topology can be divided into two types based upon the receiver side termination voltage $V_{tt}$. If the receiver termination voltage is not zero, then the output common mode is high, leading to the name source series terminated logic (SSTL) with high common mode (HCM) output. This topology is shown in Figure 3.14, where a PMOS-over-NMOS HSUL type driver with source series resistance $R_s$ provides the transmitter side termination, and the receiver side is terminated through $R_T$ to the voltage $V_{tt}$. This topology is commonly used in double data rate (DDR) memory interfaces [88]. The voltage swing is defined by the difference between $v_{oh}$ and $v_{ol}$, i.e. the output values for high signal and low signal respectively. Assuming the ideal case of the driver side termination being exactly equal to $R_T$, the output high and low values along with the voltage swing in the SSTL-HCM topology can be written as

$$v_{oh} = V_{DD} - \frac{V_{DD} - V_{tt}}{2R_T} \cdot R_T = 0.5V_{DD} + 0.5V_{tt} \tag{3.14}$$

$$v_{ol} = V_{tt} - \frac{V_{tt}}{2R_T} \cdot R_T = 0.5V_{tt} \tag{3.15}$$



Figure 3.14: SSTL-HCM topology

$$V_{sw} = v_{oh} - v_{ol} = 0.5V_{DD} + 0.5V_{tt} - 0.5V_{tt} = 0.5V_{DD} \tag{3.16}$$

For 250 mV swing at the output, a $V_{DD}$ of only 0.5 V is required. However, because the threshold voltage is around 0.45 V for the available regular threshold voltage transistors and the minimum supply voltage is 0.8 V, the power supply is chosen as the minimum available value, i.e. 0.8 V. This results in an output swing of 400 mV at the receiver, which is acceptable because it is higher than the minimum allowed in the problem statement.

During high bit transmission, current flows from $V_{DD}$ through the channel and through $R_T$ to $V_{tt}$. This current $I_{hVdd}$ is written as

$$I_{hVdd} = \frac{V_{DD} - V_{tt}}{2R_T}$$

The current during low bit transmission from  $V_{DD}$  is zero as the top PMOS is completely turned off, while the NMOS is completely turned on. Hence, the rms current from  $V_{DD}$  assuming equal high and low bits is

$$I_{rmsVdd} = \frac{I_{hVdd}}{\sqrt{2}} = \frac{\frac{V_{DD} - V_{tt}}{2R_T}}{\sqrt{2}}$$

During the low bit transmission, there is however current being driven from  $V_{tt}$  through the channel and through driver NMOS to ground. This current can be given as

$$I_{lVtt} = \frac{V_{tt}}{2R_T}$$

The power supply  $V_{tt}$  is however sinking current during the high bit transmission. Hence, the high bit current from  $V_{tt}$  is

$$I_{hVtt} = -I_{hVdd} = \frac{V_{tt} - V_{DD}}{2R_T}$$

The current profile of supply  $V_{tt}$  can be estimated as DC shifted square wave as shown in Figure 3.15. The rms current of supply  $V_{tt}$  can then be given as

$$I_{rmsVtt} = \sqrt{\left(\frac{2V_{tt} - V_{DD}}{4R_T}\right)^2 + \left(\frac{V_{DD}}{4R_T}\right)^2} = \frac{\sqrt{V_{DD}^2 + 2V_{tt}^2 - 2V_{tt}V_{DD}}}{2\sqrt{2}R_T} \tag{3.17}$$



Figure 3.15: SSTL-HCM  $V_{tt}$  current profile

The total rms power in this topology can be given as the sum of rms power of both supplies.

$$P = P_{V_{tt}} + P_{V_{DD}} = V_{tt}I_{rmsV_{tt}} + V_{DD}I_{rmsV_{DD}}$$
$$E_{bssthcm} = P \cdot t_{UI} = \frac{P}{f} = V_{DD}\frac{V_{DD} - V_{tt}}{2\sqrt{2}fR_T} + V_{tt}\frac{\sqrt{V_{DD}^2 + 2V_{tt}^2 - 2V_{tt}V_{DD}}}{2\sqrt{2}fR_T} \tag{3.18}$$

Assuming the generally used  $V_{tt} = V_{DD}/2$ , and the relationship  $V_{sw} = V_{DD}/2$ , the above equation can be simplified as

$$E_{bssthcm} = \frac{1.2V_{sw}^2}{fR_T} = \frac{V_{DD}^2}{3.3fR_T} \tag{3.19}$$
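The simplification from Eq. 3.18 to Eq. 3.19 can be followed term by term. The worked substitution below is added for clarity and uses only $V_{tt} = V_{DD}/2$ and $V_{DD} = 2V_{sw}$ from the text.

$$\begin{aligned}
V_{DD}\frac{V_{DD} - V_{tt}}{2\sqrt{2}fR_T} &= \frac{V_{DD}^2}{4\sqrt{2}fR_T} = \frac{V_{sw}^2}{\sqrt{2}fR_T} \\
V_{tt}\frac{\sqrt{V_{DD}^2 + 2V_{tt}^2 - 2V_{tt}V_{DD}}}{2\sqrt{2}fR_T} &= \frac{V_{DD}}{2} \cdot \frac{V_{DD}/\sqrt{2}}{2\sqrt{2}fR_T} = \frac{V_{DD}^2}{8fR_T} = \frac{V_{sw}^2}{2fR_T} \\
E_{bssthcm} &= \left(\frac{1}{\sqrt{2}} + \frac{1}{2}\right)\frac{V_{sw}^2}{fR_T} \approx \frac{1.21V_{sw}^2}{fR_T} \approx \frac{V_{DD}^2}{3.3fR_T}
\end{aligned}$$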

The above energy consumption is for 50% duty cycle, i.e. an equal number of high and low bits in the data pattern. In order to see the performance of this topology when there is a large number of low bits, the energy consumption for a data pattern with 10% high bits must be calculated and compared. The rms current for supply $V_{tt}$ for 90% low bits and 10% high bits for SSTL-HCM can be written as

$$I_{rmsvtt10\%} = \sqrt{\frac{1}{T} \left[ 0.9T \left( \frac{V_{tt}}{2R_T} \right)^2 + 0.1T \left( \frac{V_{tt} - V_{DD}}{2R_T} \right)^2 \right]}$$

Assuming the general case of  $V_{sw} = V_{DD}/2$  and  $V_{tt} = V_{DD}/2$ ,

$$I_{rmsvtt10\%} = \sqrt{\frac{V_{tt}^2 + 0.1V_{DD}^2 - 0.2V_{tt}V_{DD}}{4R_T^2}} = \frac{V_{sw}}{2R_T} \tag{3.20}$$

The rms current for supply $V_{DD}$ for 90% low bits and 10% high bits for SSTL-HCM can be written as

$$I_{rmsvdd10\%} = \sqrt{\frac{1}{T} \left[ 0.1T \left( \frac{V_{DD} - V_{tt}}{2R_T} \right)^2 + 0 \right]}$$

Assuming the general case of  $V_{sw} = V_{DD}/2$  and  $V_{tt} = V_{DD}/2$ ,

$$I_{rmsvdd10\%} = \sqrt{0.1 \frac{V_{DD}^2 + V_{tt}^2 - 2V_{tt}V_{DD}}{4R_T^2}} = \frac{V_{sw}}{2\sqrt{10}R_T} \tag{3.21}$$

The energy per bit for 90% low bits and 10% high bits for SSTL-HCM can be written as

$$E_{bsshcm10\%} = \frac{1}{f} \left[ V_{tt} \frac{V_{sw}}{2R_T} + \frac{V_{DD} V_{sw}}{2\sqrt{10}R_T} \right] = \frac{0.82V_{sw}^2}{fR_T} \tag{3.22}$$
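The 50% and 10% results for SSTL-HCM follow from the same duty-weighted rms of a two-level current waveform. The sketch below is an illustrative check of Eqs. 3.19 and 3.22; the helper names and the example values of $V_{DD}$, $R_T$ and $f$ are assumptions chosen only to evaluate the coefficients.

```python
import math


def two_level_rms(i_high, i_low, high_fraction):
    """RMS of a current that equals i_high for high_fraction of the time
    and i_low for the rest of the time."""
    return math.sqrt(high_fraction * i_high**2 + (1 - high_fraction) * i_low**2)


def sstl_hcm_energy_per_bit(v_dd, v_tt, r_t, bit_rate_hz, high_fraction):
    """Energy per bit as the sum of V * I_rms over both supplies."""
    # V_DD sources current only during high bits.
    i_vdd = two_level_rms((v_dd - v_tt) / (2 * r_t), 0.0, high_fraction)
    # V_tt sinks current during high bits and sources it during low bits.
    i_vtt = two_level_rms((v_tt - v_dd) / (2 * r_t), v_tt / (2 * r_t), high_fraction)
    return (v_dd * i_vdd + v_tt * i_vtt) / bit_rate_hz


def coefficient(high_fraction, v_dd=0.8, r_t=50.0, f=8e9):
    """Return k in E_b = k * V_sw^2 / (f * R_T) for V_tt = V_sw = V_DD / 2."""
    v_sw = v_dd / 2
    e_b = sstl_hcm_energy_per_bit(v_dd, v_dd / 2, r_t, f, high_fraction)
    return e_b * f * r_t / v_sw**2


print(f"k(50% high bits) = {coefficient(0.5):.3f}")
print(f"k(10% high bits) = {coefficient(0.1):.3f}")
```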

**SSTL-LCM** The next type of SSTL transmission topology terminates the receiver end to ground. This leads to a lower output common mode, thus the name source series terminated logic with low common mode (LCM). The topology is shown in Figure 3.16, where the driver is based on the PMOS-over-NMOS topology with series termination $R_s$ and receiver end termination $R_T$ to ground. The high and low bit output voltages can be given as

$$v_{oh} = V_{DD} - \frac{V_{DD}}{2R_T} \cdot R_T = 0.5 V_{DD} \tag{3.23}$$

$$v_{ol} = 0 \tag{3.24}$$

$$V_{sw} = v_{oh} - v_{ol} = 0.5 V_{DD} \tag{3.25}$$

The voltage swing is the same as in the high common mode SSTL. However, the current during the low bit transmission is zero. The current during high bit transmission is

$$I_h = \frac{V_{DD}}{2R_T}$$

Hence, the rms current for supply  $V_{DD}$  is

$$I_{rms} = \frac{I_h}{\sqrt{2}} = \frac{V_{DD}}{2\sqrt{2}R_T} = \frac{V_{sw}}{\sqrt{2}R_T}$$

The energy per bit can be written as

$$E_{bsslcm} = \frac{P}{f} = \frac{V_{DD}I_{rms}}{f} = \frac{V_{DD}^2}{2\sqrt{2}fR_T} = \frac{\sqrt{2}V_{sw}^2}{fR_T} \tag{3.26}$$

This energy per bit depends upon the pattern of high and low bits, in other words on the duty cycle of the current profile of the power supply. If there is a large number of high bits, the energy consumption is large, while a smaller number of high bits, i.e. transmitting many zero bits, reduces the energy consumption. Assuming $V_{sw} = V_{DD}/2$, the rms current for 90% low bits and 10% high bits for SSTL-LCM can be written as

$$I_{rmssslcm10\%} = \sqrt{\frac{1}{T} \left[ 0.1T \left( \frac{V_{DD}}{2R_T} \right)^2 + 0 \right]} = \sqrt{0.1 \left( \frac{V_{DD}}{2R_T} \right)^2} = \frac{V_{DD}}{2\sqrt{10}R_T} = \frac{V_{sw}}{\sqrt{10}R_T} \tag{3.27}$$

The energy per bit for 90% low bits and 10% high bits for SSTL-LCM can be written as

$$E_{bsslcm10\%} = \frac{P}{f} = \frac{V_{DD}I_{rms}}{f} = \frac{V_{DD}^2}{2\sqrt{10}fR_T} = \frac{2V_{sw}^2}{\sqrt{10}fR_T} = \frac{0.63V_{sw}^2}{fR_T} \tag{3.28}$$
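The pattern dependence of the SSTL-LCM energy per bit can be made explicit by sweeping the fraction of high bits. The snippet below is a sketch under the same idealised assumptions as Eqs. 3.26 and 3.28 (current $V_{DD}/2R_T$ during high bits, zero during low bits); the function name and the sweep values are illustrative.

```python
import math


def sstl_lcm_coefficient(high_fraction):
    """Return k in E_b = k * V_sw^2 / (f * R_T) for SSTL-LCM.

    With I_h = V_DD / (2 R_T) = V_sw / R_T and zero current during low bits,
    I_rms = sqrt(high_fraction) * V_sw / R_T and P = V_DD * I_rms = 2 V_sw * I_rms.
    """
    return 2 * math.sqrt(high_fraction)


for high_fraction in (0.1, 0.25, 0.5, 0.75, 0.9):
    k = sstl_lcm_coefficient(high_fraction)
    print(f"{high_fraction:4.0%} high bits: E_b = {k:.2f} * V_sw^2 / (f * R_T)")
```

The sweep reproduces the $\sqrt{2}$ coefficient at 50% high bits and the 0.63 coefficient at 10% high bits.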



Figure 3.16: SSTL-LCM topology



Figure 3.17: CML topology

**CML** Another common topology used for high speed signalling is the current mode logic (CML) differential architecture as shown in Figure 3.17. The driver contains a bias tail transistor carrying the static current I at all times, irrespective of the switching of the transistors based upon the input data pattern. For good termination and impedance matching, the resistors  $R_D$  on driver side are matched to the odd mode impedance  $Z_{odd}$  of differential pair of interconnects. For a given voltage swing, the minimum bias current in the driver is

$$I = \frac{2V_{sw}}{R_T}$$

The energy per bit of CML driver can be written as

$$E_{bCML} = \frac{V_{DD} \cdot I}{f} = \frac{V_{DD} \cdot 2V_{sw}}{f \cdot R_T} \tag{3.29}$$

This is the same as for LVDS due to the similar current mode signalling architecture. Also, the power consumption is again independent of the data pattern.

**Comparison and Choice Decision** In order to compare the topologies and make a choice for the design problem, the energy per bit functions derived above are plotted in Figures 3.18 and 3.19 for 50% and 10% high bit patterns, respectively. The figures are plotted using a voltage swing $V_{sw}$ of 0.3 V and a $V_{DD}$ of 0.6 V. The minimum required voltage swing in the problem statement is only 250 mV, to which a small tolerance of 50 mV is added for reliability. Another reason is that a $V_{DD}$ of 0.5 V is extremely low for LVDS and CML topologies; hence, a 0.6 V supply and half of it, 0.3 V, as swing are used for the comparison. The CML/LVDS topologies only make sense after around 14 Gb/s data rate due to their higher energy usage, while both SSTL topologies reach the required 0.4 pJ/bit efficiency at 4 Gb/s and higher for the 50% high and low bit pattern. The high common mode topology uses a little less power than the low common mode SSTL topology, but this holds only for the ideal 50% bit pattern.

For the case of long runs of low bits, which can occur frequently in the real world, the energy efficiency of CML/LVDS is even worse due to the static power consumption, and these topologies only make sense after 19 Gb/s, while the SSTL topologies reach the required efficiency at less than 4 Gb/s. For this pattern with a large number of low bits, the low common mode SSTL-LCM topology shows better energy efficiency due to the static power consumption of SSTL-HCM during low bit or zero transmission.

Hence, for the problem at hand with data rates of up to $16 \,\mathrm{Gb/s}$ only, the CML/LVDS topologies are not very power efficient, even though they are widely used for these data rates in industry. This result is due to the decrease in power supply values and the reduction in the voltage swing requirements. Hence, we can say that CML/LVDS topologies only make sense when the target data rate is around $20 \,\mathrm{Gb/s}$ or higher. The choice between the HCM and LCM SSTL topologies is made based on two negative points regarding SSTL-HCM:

- SSTL-HCM needs an extra power supply  $V_{tt}$  at receiver end
- SSTL-HCM has static power consumption during the low bits transmission

Therefore, for the design problem, SSTL-LCM topology is chosen for high data rates in this work and HSUL is used for low speed unterminated extremely short interconnects.
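The comparison can also be evaluated directly from the closed-form expressions. The snippet below is a sketch that uses the comparison values $V_{sw} = 0.3$ V, $V_{DD} = 0.6$ V and $R_T = 50\ \Omega$ with the coefficients of Eqs. 3.19, 3.22, 3.26, 3.28 and 3.29; the function and dictionary names are hypothetical, and the idealised formulas ignore pre-driver and parasitic power.

```python
import math

V_SW, V_DD, R_T = 0.3, 0.6, 50.0

# Coefficients k in E_b = k * V_sw^2 / (f * R_T), keyed by (topology, high-bit fraction).
SSTL_COEFF = {
    ("SSTL-HCM", 0.5): 1 / math.sqrt(2) + 0.5,   # Eq. 3.19, about 1.2
    ("SSTL-HCM", 0.1): 0.5 + 1 / math.sqrt(10),  # Eq. 3.22, about 0.82
    ("SSTL-LCM", 0.5): math.sqrt(2),             # Eq. 3.26
    ("SSTL-LCM", 0.1): 2 / math.sqrt(10),        # Eq. 3.28, about 0.63
}


def energy_per_bit_pj(topology, high_fraction, bit_rate_hz):
    """Idealised energy per bit in pJ for the given topology and bit pattern."""
    if topology in ("CML", "LVDS"):
        # Static bias current: independent of the data pattern (Eqs. 3.13, 3.29).
        joules = V_DD * 2 * V_SW / (bit_rate_hz * R_T)
    else:
        joules = SSTL_COEFF[(topology, high_fraction)] * V_SW**2 / (bit_rate_hz * R_T)
    return joules * 1e12


for high_fraction in (0.5, 0.1):
    print(f"--- {high_fraction:.0%} high bits ---")
    for topology in ("CML", "SSTL-HCM", "SSTL-LCM"):
        row = ", ".join(
            f"{gbps} Gb/s: {energy_per_bit_pj(topology, high_fraction, gbps * 1e9):.3f}"
            for gbps in (4, 8, 16)
        )
        print(f"{topology:9s} {row} pJ/bit")
```

Under these formulas SSTL-HCM is slightly more efficient than SSTL-LCM for the 50% pattern, SSTL-LCM is more efficient for the 10% pattern, and CML/LVDS consumes the most energy per bit over the whole range up to 16 Gb/s.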



Figure 3.18: Energy per bit comparison of topologies for 50% high and low bits



Figure 3.19: Energy per bit comparison of topologies for 10% high and 90% low bits

#### 3.1.6 Transmitter Design

As stated in the previous chapter, this is the first reported bunch of wires interface transmitter based on a dual driver topology [56]. The design constraints require a topology choice for two types of drivers, which are selected as HSUL and SSTL-LCM based upon the energy efficiency comparison in the previous section. According to Figure 3.1, since the topology and termination choices are made, this section now details the transmitter transistor sizes and their selection procedure.

A typical transmitter includes the clock generation and distribution circuitry, pre-driver, driver, data multiplexer to convert slow data streams into high speed output stream, and digital blocks for serialization and data generation. The architecture is shown in Figure 3.20, where  $clk_{hr}$  denotes half rate clock with time period equal to two times the output data unit interval, i.e. 2 UI. Half rate clock  $clk_{hr}$  drives the pseudo random bit stream (PRBS-7) data generator, and two to one (2:1) multiplexer which converts two half rate data streams into one single full rate stream. Pre-driver with 4-bit calibration control enables and disables the driver transistors for required drive strength and output impedance. Depending upon the channel properties, calibration bits in pre-driver can change the power consumption and output swing of driver.

Two types of front end drivers, along with their pre-drivers, are shown in Figure 3.20: the unterminated low speed driver, based on HSUL according to the standard, and the high speed terminated SSTL-LCM based driver. The multiplexed data or serialized double data rate (DDR) data goes through the high speed SSTL driver while



Figure 3.20: Transmitter and system architecture

single data rate (SDR) data is sent out through the HSUL driver represented as DDR terminated Tx and unterminated Tx in Figure 3.20. Following sections detail the size of each type of driver transistors along with supporting digital and clock blocks.

#### 3.1.7 SSTL-LCM Driver & Pre-driver

For high speed data transmission higher than 10 Gb/s, impedance matching is necessary for good transmitted signal quality. The terminated source series low voltage topology used for this transmitter is shown in Figure 3.16. The pull up and pull down transistors are PMOS and NMOS, respectively. The pre-driver, which produces the 'pu' and 'pd' signals, and the driver share the power supply 'vdd'. Different power supply voltages for the pre-driver and the driver can also be used to change the region of operation of the transistors. In this work, we use the same power supply in order to decrease the number of power supplies required for the system while trying to meet the impedance matching requirements.

During the 0-1 transition, the top PMOS turns on while bottom NMOS is completely turned off. The current starts flowing from PMOS through series resistor  $R_s$ to channel and then through receiver termination resistor  $R_T$ . If  $V_{oh}$  is the output voltage for '1' bit, then the current during this bit transmission is  $V_{oh}/R_T$ . During the 1-0 transition, the bottom NMOS turns on while PMOS is completely turned off. The current flows from charge stored at receiver input capacitance and pad capacitance of transmitter through bottom NMOS to ground 'vss'. As compared to the source follower topology shown in previous section, there is no current drawn from 'vdd' during '0' bit transmission. Also, there is no waste of current during '1' bit transmission, as bottom NMOS is turned off and whole current flows from transmitter to receiver termination. This technique along with low power supply voltage and low voltage swing help to reduce the power consumption of this transmitter as was shown in comparison to other topologies in Figures 3.18 and 3.19 respectively.

If the number of 1's and 0's in the data stream is assumed same, then the average energy consumption of this driver per bit as shown before can be written as:

$$E_{b50\%} = \sqrt{2} \frac{V_{sw}^2}{fR_T}$$



Figure 3.21: Dependency of  $R_{ds}$  on  $V_{DS}$  and  $V_{GS}$ 

This equation clearly dictates the need for lower voltage swing, higher termination resistor and lower 'vdd' to decrease the energy consumption per bit. For the targeted voltage swing  $V_{sw}$  of 250 mV given in the problem statement, power supply of  $V_{DD}$  of 0.5 V would have been enough. But due to threshold voltage limitation of transistors which is around 0.42 V for given 22 nm technology, and voltage range of 0.8-1 V, the minimum supply chosen is 0.8 V. This shall lead to the voltage swing of 400 mV in ideal case, which would increase the tolerance to even higher interconnect losses than required. Hence, the problem statement is modified according to the technology limitations and transistor limitations to:  $V_{sw}$  requirement of 400 mV and  $V_{DD}$  of 0.8 V.

A current of 8 mA is needed for 400 mV swing assuming termination resistance of 50  $\Omega$ . For a given current and voltage swing requirement, the size of PMOS should be such that the resistance of PMOS is very small in comparison to  $R_s$ . This is required to fulfill the output impedance  $Zo_{pu}$  matching requirement given as:

$$R_T = Zo_{pu} = R_{ds} + R_s = \frac{1}{\mu C_{ox} \frac{W}{L} (V_{GS} - V_T)} + R_s \tag{3.30}$$

This requirement is due to the fact that during pull up, the top PMOS always stays in the triode region. Hence, the impedance $R_{ds}$ looking into the drain of the PMOS should be small. This resistance $R_{ds}$ together with the series resistor $R_s$ must equal the channel and termination impedance $R_T$. Impedance linearity is controlled by the choice of the series resistor $R_s$: the higher the value of $R_s$, the lower the nonlinearity in the output impedance of the driver. Impedance nonlinearity can be expressed as:

$$\text{Impedance nonlinearity factor} = \frac{R_T - R_s}{R_T} = \frac{R_{ds}}{R_T} \tag{3.31}$$

Since  $R_{ds}$  is not entirely dependent upon the gate source overdrive voltage  $V_{GS} - V_T$  due to short channel length nonlinearities, it will change during the output transition with changing  $V_{ds}$ . In order to keep this impedance change small, the value of resistor  $R_s$  should be increased. The dependency of the drain-source  $R_{ds}$  resistance upon  $V_{ds}$  and  $V_{GS} - V_T$  is shown in Figure 3.21, where the resistance decreases with the increase in gate to source voltage as expected. But there is also increase in the  $R_{ds}$  with the drain-source voltage increment. Thus, the drain-source voltage must be kept as small as possible for high linearity and more dependency of resistance on the poly resistor rather than the PMOS or NMOS devices.

The voltage swing requirement and termination resistance give the required current  $I_D$ . The value of  $R_s$  dictates the value of  $R_{ds}$ . The size of pull up PMOS transistor can be calculated as:

$$\left(\frac{W}{L}\right)_{pu} = \frac{1}{\left(R_T - R_s\right)\left[\mu C_{ox}\left(vdd - V_T\right)\right]} \tag{3.32}$$

During pull down operation, the voltage swing and power supply are chosen such that the bottom NMOS never goes into the saturation region. This is critical because if the bottom NMOS ever enters the saturation region, the impedance looking into the drain is very high and can drastically impact the signal integrity. In this work, with a $40 \Omega$ series resistor and correct sizing control of the driver through the pre-driver calibration bits, the transistor is ensured to be in the triode region or at its edge at the start of the pull down operation. As the pull down operation proceeds, the drain source voltage of the bottom NMOS decreases and the device stays in the triode region. In the triode region, the bottom NMOS resistance remains almost constant and depends only on the gate source voltage. For pull down, the total output impedance of the driver $Zo_{pd}$ can be written as:

$$Zo_{pd} = \frac{1}{\left[\mu C_{ox} \frac{W}{L} \left(V_{GS} - V_T\right)\right]} + R_s \tag{3.33}$$

If 'vdd' is the supply of pre-driver, then the size of pull down NMOS can be calculated as:

$$\left(\frac{W}{L}\right)_{pd} = \frac{1}{\left(R_T - R_s\right)\left[\mu C_{ox}\left(vdd - V_T\right)\right]} \tag{3.34}$$

Using $40 \Omega$ as the source series resistance in this driver topology, only $10 \Omega$ of resistance and 0.08 V of drain-source voltage are allowed for the transistors. As shown in Figure 3.21, at a $V_{DS}$ of 0.08 V, the resistance-width product is about $320 \Omega \,\mu\text{m}$. Therefore, to achieve the $10 \Omega$ device resistance, the width of the devices is chosen as $320/10 = 32 \,\mu\text{m}$. The total size is divided into 5 slices with widths of 2, 2, 4, 8, and 16 µm. The driver schematic is shown in Figure 3.22.
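The sizing arithmetic of this subsection can be collected in a few lines. The following is a sketch that only restates the numbers given in the text (0.8 V supply, 50 $\Omega$ termination, 40 $\Omega$ series resistor, 320 $\Omega\,\mu$m resistance-width product read from Figure 3.21); the variable names are illustrative.

```python
V_DD = 0.8          # supply voltage, V
R_T = 50.0          # channel and termination impedance, ohm
R_S = 40.0          # source series resistor, ohm
RW_PRODUCT = 320.0  # resistance-width product at V_DS = 0.08 V, ohm*um (Figure 3.21)

v_sw = V_DD / 2                # ideal output swing, V
i_drive = v_sw / R_T           # current through the receiver termination, A
r_ds = R_T - R_S               # resistance budget left for the transistor, ohm
v_ds = i_drive * r_ds          # drain-source voltage across the transistor, V
width_um = RW_PRODUCT / r_ds   # required device width, um
nonlinearity = r_ds / R_T      # impedance nonlinearity factor (Eq. 3.31)

slices_um = [2, 2, 4, 8, 16]   # slice widths used in the driver, um

print(f"V_sw = {v_sw * 1e3:.0f} mV, I = {i_drive * 1e3:.1f} mA")
print(f"R_ds = {r_ds:.0f} ohm, V_DS = {v_ds * 1e3:.0f} mV, W = {width_um:.0f} um")
print(f"nonlinearity factor = {nonlinearity:.2f}, slice total = {sum(slices_um)} um")
```

With $R_s = 40\ \Omega$ the nonlinearity factor of Eq. 3.31 evaluates to 0.2, and the slice widths add up to the full 32 µm device.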

Pull up and pull down enable control bits pu/pd<3:0> are generated by the pre-driver. There is a direct connection from the data input 'in' to the smallest driver slice of size 2 µm through the signal 'fixpre', as shown in Figure 3.26. The data, in the form of 4 pull up and 4 pull down signals along with the minimum size signal 'fixpre', are given to the driver, which is constructed from binary sized transistor slices.



Figure 3.22: Slice architecture of low swing source series terminated driver

**Pre-driver design and sizing** In order to control the size of driver, the voltage swing and the output impedance, the calibration bits are provided at the predriver. 4-bit calibration bits can change the driver size in binary format from minimum of  $1 \times$  to  $16 \times (2-32 \,\mu\text{m})$  by turning on and off the respective devices. The pre-driver must be fast enough to transmit the incoming full rate data stream while providing the control for driver output impedance.
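The effect of the calibration bits on the enabled driver width can be sketched as follows. This is an illustration only: it assumes that the fixed 2 µm slice driven by 'fixpre' is always on and that bits 0 to 3 of the calibration code enable the 2, 4, 8 and 16 µm slices in that order; the function name is hypothetical.

```python
FIXED_SLICE_UM = 2               # slice driven directly through 'fixpre'
CAL_SLICES_UM = (2, 4, 8, 16)    # assumed mapping: bit 0 ... bit 3


def enabled_width_um(code):
    """Total enabled driver width in um for a 4-bit calibration code."""
    if not 0 <= code <= 0b1111:
        raise ValueError("calibration code must fit in 4 bits")
    width = FIXED_SLICE_UM
    for bit, slice_um in enumerate(CAL_SLICES_UM):
        if (code >> bit) & 1:
            width += slice_um
    return width


for code in (0b0000, 0b0001, 0b1000, 0b1111):
    print(f"code {code:04b}: {enabled_width_um(code)} um")
```

Under this assumed mapping the enabled width steps from 2 µm to 32 µm in 2 µm increments, which corresponds to the $1\times$ to $16\times$ range stated above.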

The requirement is that the pre-driver must be able to let the signal go through to the PMOS and NMOS slices or block them and disable the slices. The requirement for PMOS control pre-driver input is shown in Figure 3.23. The pull up enable predriver signals must combine with the data signal D such that the resulting output should either disable the slice when enable signal is zero or generate the data inverted signal  $\overline{D}$  which shall pull the output of driver high as desired. The table shown in Figure 3.23 is similar to a NAND gate table and can be written as

$$pu{<}3{:}0{>} = \overline{D \cdot pen{<}3{:}0{>}}$$

For NMOS slices, the requirement is shown in Figure 3.24, where the enable signals for pull-down NMOS in driver should let the signal through when the enable signal is high, but the control signal should go to zero if the enable signal is zero. This disable condition is opposite to the PMOS condition because the NMOS must be turned off if it is disabled which requires a zero control signal. By searching for different gate tables, it was found that the easiest method is to invert the NMOS enable signals, which results in the normal NOR table as shown in Figure 3.24 and can be written as:

$$pd{<}3{:}0{>} = \overline{D + \overline{nen{<}3{:}0{>}}}$$

| Data signal | Pre-driver enable signal (pen<3:0>) | Final signal to driver PMOS slices (pu<3:0>) |
|-------------|-------------------------------------|----------------------------------------------|
| D           | 0                                   | 1                                            |
| D           | 1                                   | $\overline{D}$                               |

Figure 3.23: Pull up enable pre-driver requirement table

| Data signal | Pre-driver enable signal (nen<3:0>) | Final signal to driver NMOS slices (pd<3:0>) |
|-------------|-------------------------------------|----------------------------------------------|
| D           | 1                                   | $\overline{D}$                               |
| D           | 0                                   | 0                                            |

| Data signal | Inverted pre-driver enable signal (nen_b<3:0>) | Final signal to driver NMOS slices (pd<3:0>) |
|-------------|------------------------------------------------|----------------------------------------------|
| D           | 0                                              | $\overline{D}$                               |
| D           | 1                                              | 0                                            |

Figure 3.24: Pull down enable pre-driver requirement table
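The two requirement tables can be checked with a short truth-table sketch. The function names are hypothetical; the logic is the NAND of data and pull up enable for the PMOS slices, and the NOR of data and inverted pull down enable for the NMOS slices, as described in the text.

```python
def pull_up_signal(d, pen):
    """NAND gate: 1 (PMOS off) when the slice is disabled, inverted data otherwise."""
    return int(not (d and pen))


def pull_down_signal(d, nen):
    """NOR of data and inverted enable: 0 (NMOS off) when disabled, inverted data otherwise."""
    nen_b = int(not nen)
    return int(not (d or nen_b))


print(" D en | pu pd")
for d in (0, 1):
    for en in (0, 1):
        print(f" {d}  {en} |  {pull_up_signal(d, en)}  {pull_down_signal(d, en)}")
```

With both slices enabled, a data value of 1 gives pu = 0 and pd = 0, so the PMOS conducts and the NMOS is off; a data value of 0 gives pu = 1 and pd = 1, so only the NMOS conducts.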

The NAND and NOR implementation is shown in Figure 3.25. The sizing of the transistors in these NAND and NOR gates is based upon the size of the driver PMOS and NMOS slices they are driving. For example, the size of the pre-driver NOR delivering the signal pd<3> to the 16 µm NMOS slice is 4 µm, based on the minimum delay optimum fan-out of 4 (FO4, precisely 3.6 with self loading) generally used in digital logic [84]. In contrast, the size of the transistor controlled by the data signal D in the pre-driver NAND delivering the signal pu<3> to the largest PMOS slice of 16 µm is 8 µm, based on a fan-out of 2 (FO2). The NMOS and PMOS are sized the same in the driver, which results in a rise and fall time mismatch.

This problem is relieved in this design by decreasing the fanout ratio of the PMOS pre-driver to FO2 from the FO4 used for the NMOS. In other words, the rise and fall time mismatch, which is normally solved by increasing the size of the PMOS transistors in the driver, is solved in this work by a lower fanout ratio of the previous stage, i.e. the pre-driver. Eye-diagram results at the end of this section shall show that this methodology works and is used to reduce the area of the PMOS. Otherwise, by the standard methodology of just doubling the PMOS size, the driver would have required $64 \,\mu\text{m}$ of PMOS size. Instead, the PMOS and NMOS are both sized $32 \,\mu\text{m}$ and the rise/fall time mismatch is solved by changing the fanout ratio of the previous stage.

Figure 3.25: Pre-driver NAND NOR implementation

Figure 3.26: Pre-driver for calibration and data transmission control of driver

The schematic of the pre-driver is shown in Figure 3.26. The four pull up and pull down control bits are decoded from two bits <1:0> supplied from outside using wafer DC probe pads. The change in output impedance of the driver with the pre-driver control bits is shown in Figure 3.27.

As expected from Eq. 3.33, the increase in size of driver results in decrease in output impedance. Since this driver is designed to match to  $50 \Omega$  channel and



Figure 3.27: Output impedance of driver under different calibration bits controlled driver size


Figure 3.28: $Z_{11}$ and $S_{11}$ of driver under nen<01> and pen<00>

receiver termination, it can be seen from the plot that the minimum size of the driver is not enough. Simply turning on the next driver slice produces much better impedance matching. Using the minimum size of PMOS and $2 \times$ of NMOS gives the AC, instantaneous input voltage dependent $S_{11}$ (dB) and $Z_{11}$ parameters shown in Figure 3.28. This graph is critical to ensure the linearity of the output impedance during the output transition with respect to the input data transition. The graph shows that, with the chosen calibration values, the output impedance remains constant throughout the data transition range.

The layout of the driver and pre-driver is shown in Figure 3.29. The pre-driver takes the input data stream from the left side and uses the calibration bits from the bottom to generate the pull up and pull down signals for the driver on the right side. The resistor is placed to the right of the driver and is connected to the pad using a wide metal line. The driver is built from many small $1 \,\mu m$ wide transistors placed in an array structure. This method reduces the resistance of the wires along with their current density, because many wires are placed in parallel to connect the array transistors. The lower current density in the metal wires is useful from the perspective of electromigration. The total area consumed by the driver is only $1200 \,\mu\text{m}^2$, which is less than the target of $1323\,\mu\text{m}^2$ specified in the design problem statement based on the state of the art analysis. The idea of sizing the PMOS the same as the NMOS played a crucial role in saving area. Also, as can be seen, the calibration and sizing control through the pre-driver takes the larger part of the front end design. The driver itself takes only $400\,\mu\text{m}^2$ of area. This means that if the calibration is not required and the data lines can be sent directly to the PMOS and NMOS with some FO3-4 buffering, a large amount of area in the pre-driver could be saved. Nonetheless, the result is quite good as the area is less than that of the state of the art designs. Furthermore, the energy consumption of



Figure 3.29: Pre-driver and driver layout

only the driver itself is around 0.25 pJ/bit, which leads to an energy area efficiency of the driver of $100 \text{ pJ/bit}\mu\text{m}^2$ (0.25 pJ/bit multiplied by the $400\,\mu\text{m}^2$ driver area). When used over an interconnect of 12 mm length, the energy area efficiency per unit length for the driver is 8.3 pJ/bitmm, much less than the target of 40 pJ/bitmm in the problem statement, which is based on the state of the art analysis for drivers at such data rates and lengths. A detailed state of the art comparison shall be presented in the results section later in the chapter.
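The energy-area figures quoted above follow from a simple product and ratio, sketched below. The variable names are illustrative, and the units are kept as loose labels in the comments because the text states the metric in its own notation.

```python
energy_pj_per_bit = 0.25     # driver energy per bit, pJ/bit
driver_area_um2 = 400.0      # driver-only area, um^2
total_area_um2 = 1200.0      # driver plus pre-driver area, um^2
target_area_um2 = 1323.0     # area target from the problem statement, um^2
length_mm = 12.0             # interconnect length, mm
target_per_mm = 40.0         # energy-area per unit length target from the problem statement

energy_area = energy_pj_per_bit * driver_area_um2   # energy-area metric of the driver
energy_area_per_mm = energy_area / length_mm        # metric per unit interconnect length

print(f"energy-area metric      = {energy_area:.0f}")
print(f"metric per mm           = {energy_area_per_mm:.1f} (target {target_per_mm:.0f})")
print(f"area within target      = {total_area_um2 <= target_area_um2}")
```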

#### 3.1.8 HSUL Driver and Pre-driver

The low speed single data rate transmitter required in the bunch of wires standard is implemented in this work using the HSUL topology chosen in the previous section. This is a standard inverter based PMOS-over-NMOS driver without any termination, which does not consume any static power and only consumes power while switching. The sizing of the driver is based on the estimated capacitive load constraint in the problem statement, i.e. 70 fF. Based on the rise time calculation methodology discussed previously in the chapter, equation 6.6 says that the current capability of the transistors in the driver determines the slew rate and hence the rise time: the higher the slew rate, the lower the rise time. For the target data rate of 6-8 Gb/s single data rate (SDR) signalling without multiplexer, the driver must be sized to have a rise time less than or equal to $0.35/f_{BW}$, where $f_{BW}$ is half of the target data rate, i.e. the Nyquist frequency. For 8 Gb/s, the rise time must be less than or equal to 88 ps under the target capacitive load of 70 fF. The size of the transistors shall be kept as small as possible in order to keep the energy and area consumption comparable to or less than that of the state of the art designs.

The first step is to find the size of the last stage driver inverter buffer whose output is connected to the estimated load capacitance. In order to find the size, the *Idsat* is required which is calculated as:

$$I_{dsat} = \frac{0.8V_{oh}C_L}{t_{rise}}$$

For 88 ps rise time for 8 Gb/s or 4 GHz Nyquist frequency and  $V_{oh}$  equal to the  $V_{DD}$  of 0.8 V, the  $I_{dsat}$  is derived as 0.5 mA. The width can be easily derived using the current density profile of 20 nm transistors as shown in the Figure 3.30. From



Figure 3.30: Current profile of 20 nm PMOS transistor for various $V_{DS}$ and $V_{GS}$

the graph, the width of the final inverter buffer driver stage is derived as  $1.4 \,\mu\text{m}$ . Estimating the similar load capacitance at the receiver end, double the width is required. After adding a little tolerance,  $3 \,\mu\text{m}$  size devices are used.
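The rise time budget and the resulting saturation current can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only; the function names are illustrative, and the step from current to transistor width still relies on the measured current profile of Figure 3.30, which is not modelled here.

```python
def max_rise_time(data_rate_bps: float) -> float:
    """Rise-time budget t_rise <= 0.35 / f_BW, with f_BW the Nyquist frequency."""
    f_nyquist = data_rate_bps / 2.0
    return 0.35 / f_nyquist


def required_idsat(v_oh: float, c_load: float, t_rise: float) -> float:
    """Saturation current from I_dsat = 0.8 * V_oh * C_L / t_rise."""
    return 0.8 * v_oh * c_load / t_rise


t_rise = max_rise_time(8e9)                   # 8 Gb/s SDR -> 4 GHz Nyquist
i_dsat = required_idsat(0.8, 70e-15, t_rise)  # V_oh = 0.8 V, C_L = 70 fF

print(f"t_rise budget: {t_rise * 1e12:.1f} ps")  # 87.5 ps, quoted as 88 ps
print(f"I_dsat:        {i_dsat * 1e3:.2f} mA")   # 0.51 mA, quoted as 0.5 mA
```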

This unterminated HSUL driver is designed using the buffer chain methodology shown in Figure 3.31, where a minimum-size buffer has to drive a certain capacitive load, in this case 70 fF, through a buffer chain. A mixture of fanout of 4 and fanout of 3 is used in the chain; this is a design choice made to ensure fast rise times, which is why a fanout higher than 4 is avoided. It is a tolerance based fanout reduction: FO5 could also work, but a lower fanout ratio is used to deal with unwanted extra capacitances in the layout and with process variations. One may ask where the pre-driver is in this unterminated transmitter: the first 3 stages of the buffer chain are effectively the pre-driver stages and the last stage is the driver stage. Because there is no control implemented in this design,



Figure 3.31: Buffer chain sizing with fanout 2 and 3 ratio



Figure 3.32: Driver buffer chain layout

therefore, the pre-driver and driver stages get mixed together.

The layout is shown in Figure 3.32 where the input comes in from the left side and goes out on the right side to the output pad.

# 3.1.9 High Speed Digital Blocks

In order to test the transmitter, a high speed data input is required. Getting multi-Gb/s data input streams is very difficult and costly. Instead, an on-chip data generator is a more efficient and better method to test transmitter systems. Most research papers use low speed data generators working in the range of a few hundred MHz. This work instead produces random data at half clock rate. A parallel PRBS-7 generator is implemented using two phase clock  $C^2MOS$  logic, which is one of the fastest digital logic techniques along with true single phase clock logic (TSPC) [89].  $C^2MOS$  logic is chosen to make use of complementary clocks and to reduce the coupling to other interconnects in dense routing by using differential CMOS clock lines tightly coupled to each other.

A typical PRBS is implemented using a serial shift register as shown in Figure 3.33. It implements the polynomial

$$PRBS\text{-}7\ \text{polynomial} = x^7 + x^6 + 1$$

which requires seven flip-flops and an XOR gate. It generates a pattern of  $2^7 - 1$  or 127 bits which then repeats itself, hence the name pseudo random. The bits on



Figure 3.33: Series shift register  $2^7 - 1$  random generator



Figure 3.34: 2:1 Mux based PRBS requirement

each flip-flop are in incremental order, which means that at any given time flip-flop n carries bit-(x+n) of the pattern, where x is the current position in the pattern and n is the flip-flop number. For example, if flip-flop FF1 outputs bit-1 of the 127-bit PRBS-7 pattern, then flip-flop FF7 outputs bit-7 of the pattern.
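The serial generator can be modelled in software as a seven-stage shift register whose input is the XOR of stages 7 and 6. The following is a behavioural sketch only, not the $C^2MOS$ implementation of this work; the seed value and the bit ordering inside `state` are assumptions made for illustration.

```python
def prbs7(seed: int = 0x7F, nbits: int = 127) -> list[int]:
    """Serial PRBS-7 from a 7-stage shift register with feedback x^7 + x^6 + 1."""
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed  # bit 0 models flip-flop FF1, bit 6 models flip-flop FF7
    out = []
    for _ in range(nbits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of stages 7 and 6
        out.append((state >> 6) & 1)                 # output taken from FF7
        state = ((state << 1) | new_bit) & 0x7F      # shift by one stage
    return out


seq = prbs7(nbits=254)
print("repeats after 127 bits:", seq[:127] == seq[127:])
print("ones per period:", sum(seq[:127]))
```

Because the polynomial is primitive, any non-zero seed produces the same 127-bit pattern, only starting at a different position.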

The problem with the shift register architecture of PRBS is that the bits on the seven flip-flops are in incremental order and hence, due to the serial shifting architecture, the outputs of these flip-flops cannot be multiplexed together directly. For example, in order to serialize two outputs from flip-flops running at a data rate of f Gb/s, the two bit streams must have a phase difference of 180° as shown in Figure 3.34. This problem of re-timing the data and parallelization has been



Figure 3.35: Parallel 8 outputs  $2^7 - 1$  PRBS topology based on [90]

explained and dealt with in detail by Laskin et al. [90]. They offered a solution to the series PRBS architecture and instead demonstrated a parallel  $2^7 - 1$  PRBS with 8 parallel outputs. Similar architecture is used for PRBS implementation in this work for testing the driver and signal transmission as shown in Figure 3.35. For a double data rate signal to be sent to the SSTL-LCM driver, the 180-degree phase difference PRBS outputs are multiplexed together for double data rate PRBS.

Parallel PRBS-7 has the advantage of generating multiple data phases at the same time. These time shifted data streams at half rate can be multiplexed by a serializer to produce higher frequency data streams which follow the same PRBS pattern as the lower rate data streams. In order to realize such a data generator, high speed flip-flop and exclusive or (XOR) gates are required which are discussed below and their sizing issues are detailed.

**Flip-Flop Sizing** There are several different topologies possible for flip-flop design, such as sense-amplifier differential flip-flop,  $C^2MOS$ , true single phase clock (TSPC) flip-flop etc. [91]. The choice is based on simplicity for digital design, power consumption and speed. Nicolic et al. presented a comparison of these metrics for several flip-flop topologies. Based on the comparison,  $C^2MOS$  was chosen for its low power in a pure digital design architecture as compared to the traditional sense-amplifier differential flip-flop. Another reason is the ease of sizing these flip-flops, which is similar to CMOS inverter sizing based on the capacitive load and fanout.

The flip-flop is designed using master slave architecture where the first latch is a negative clock edge triggered latch while the second latch is a positive clock edge triggered latch. Thus, the two latches combine to make a positive edge triggered



Figure 3.36: High speed flip-flop and XOR gates for transmitter

flip-flop. The schematic of the flip-flop is shown in Figure 3.36a. Each latch consists of two PMOS and two NMOS transistors. Complementary clocks with the correct phase relationship of 180° are supplied at the middle PMOS and NMOS transistors. For the negative edge latch, when clk is high and  $\overline{clk}$  is low, the data D is not latched. When clk goes low and  $\overline{clk}$  goes high, data D is latched. This  $C^2MOS$  logic can avoid race conditions easily due to its large timing margin. This is one of the reasons why it is the go-to technique when multi-GHz digital logic is required.

The sizing of flip-flop is very important in this design, in order to keep the signal rise and fall times low enough. The sizing is done based on the capacitive load presented. From the Figure 3.35, each flip-flop has to drive at maximum three different blocks, i.e. two XOR, and an input of multiplexer if these outputs are to be serialized for higher data rate output. The sizing is performed for the desired data rate of flip-flop to be  $\leq 8$  GHz and chosen to be half of the size chosen for HSUL driver for 70 fF load, i.e. 0.75 µm is chosen as the width of each flip-flop and 0.02 µm is chosen as the transistor channel length which shall ensure that even with interconnect based resistance and capacitance losses, the output rise and fall times of the flip-flop shall remain low enough.

**XOR** Once the topology for the flip-flop is chosen, the topology and design of the other gates are adjusted to that topology as well, for a streamlined and straightforward design. Hence, the exclusive OR (XOR) gate is designed using similar  $C^2MOS$  logic. The design is based on the truth table of the XOR gate and the XOR equation

$$A \oplus B = A \cdot \overline{B} + \overline{A} \cdot B$$

which means that XOR also requires the inverted inputs. The data input A and B are inverted locally using CMOS inverters. Inverted inputs are then driven to the truth table CMOS implementation of XOR gate where the output  $out = A \oplus B$ . The schematic of XOR is shown in Figure 3.36b.
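The sum-of-products form can be checked exhaustively against the XOR truth table. The snippet below is a logic-level sketch with an illustrative function name; it says nothing about the transistor-level $C^2MOS$ implementation.

```python
from itertools import product


def xor_sop(a: int, b: int) -> int:
    """Sum-of-products form A*~B + ~A*B using locally inverted inputs."""
    a_n, b_n = 1 - a, 1 - b  # models the local CMOS inverters
    return (a & b_n) | (a_n & b)


for a, b in product((0, 1), repeat=2):
    print(a, b, xor_sop(a, b))
```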

Sizing of XOR gate is based on the speed requirement similar to the flip-flop and the output load. Since, the load of XOR is maximum one flip-flop and another XOR gate, the size is chosen to ensure fast output before the next clock edge and to keep the design simple. Thus, the size of the XOR devices is chosen similar to the flip-flop size equal to 0.75 µm width and 0.02 µm channel length of transistors.

**2:1 Mux and Sizing** Typically a serializer or large ratio multiplexer is designed as a tree structure consisting of many 2:1 multiplexers. For example, an 8:1 serializer requires seven 2:1 multiplexers (four in the first stage, followed by two and one). When the data generator is working at high speed, energy savings are achieved in the serializer by reducing the tree structure to a smaller tree or, in this case, to just a single 2:1 multiplexer. A single 2:1 mux is generally implemented using 5 latches [57]. But with proper maintenance of the timing relationship between data streams and clocks, the number of latches can be reduced.

The design of this 2:1 mux is based on equation

$$out_{fr} = A_{hr} \cdot \overline{clk} + B_{hr} \cdot clk \tag{3.35}$$

which can be described as

• when clock goes low, the half rate input A should be selected



Figure 3.37: 2:1 Mux timing

• when clock goes high, the half rate input B must be selected

Selecting input A at the clock falling edge is simple, because input A is produced by a flip-flop at a previous stage in the PRBS which is clocked on the rising edge. Hence, the clock falling edge falls in the middle of input A as shown in Figure 3.37. But the rising edge of the clock and input B have the same timing relationship and hence there is no timing margin for input B. This is resolved by using a single latch in the path of input B, which delays input B by half a clock cycle so that the clock rising edge falls in the middle of the delayed input B as shown in Figure 3.37.

Since the latch for input B inverts the data, an inverter is also placed in the path of data A as shown in Figure 3.38b. The final output of the mux is inverted again and thus  $out_{fr}$  has the same polarity as the inputs  $A_{hr}$  and  $B_{hr}$. This is another advantage of the  $C^2MOS$  mux topology, because latch and selector combine to give the same output as before the mux but at double the data rate.

Therefore, this work uses only 1 latch to generate the  $180^{\circ}$  clock phase between the two data streams. These even and odd data streams are then sent to a single selector cell which generates the full rate multiplexed data. The multiplexer latch based topology and the schematic is shown in Figures 3.38a and 3.38b, respectively. Half rate data  $A_{hr}$  and  $B_{hr}$  are converted to full rate output  $out_{fr}$  using half rate complementary clk and  $\overline{clk}$  clocks.
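The behaviour described by equation (3.35), including the half-cycle delay of input B, can be sketched at the bit level. The model below is an idealized illustration with assumed names; it works in unit intervals (half clock periods), assumes both inputs change on the rising clock edge, and ignores the data inversions, which cancel at the output.

```python
def serialize_2to1(a_hr: list[int], b_hr: list[int]) -> list[int]:
    """Idealized model of out_fr = A_hr * ~clk + B_hr * clk.

    Cycle k spans unit intervals 2k (clk high) and 2k+1 (clk low).
    B is delayed by half a cycle, so B[k] is selected in interval 2k+2.
    """
    if len(a_hr) != len(b_hr):
        raise ValueError("both half-rate streams must have the same length")
    out = []
    for ui in range(1, 2 * len(a_hr) + 1):
        clk_high = ui % 2 == 0
        if clk_high:
            out.append(b_hr[(ui - 1) // 2])  # latched (delayed) input B
        else:
            out.append(a_hr[ui // 2])        # input A, sampled mid-bit
    return out


print(serialize_2to1([1, 0, 1], [0, 0, 1]))  # [1, 0, 0, 0, 1, 1]
```

The output interleaves the two streams as A0, B0, A1, B1, ..., i.e. at twice the input data rate.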

# 3.1.10 Clock Distribution and Buffer Sizing

The clock in complementary format, clk and  $\overline{clk}$, must be delivered to all the flip-flops, latches and the selector in the multiplexer. The clock distribution follows the clock buffer chain methodology based on the total load [84]. The total load consists of

- 8 flip-flops in the parallel PRBS-7 with 2 latches per flip-flop:  $2 \times 8 = 16$  devices, i.e.  $16 \times 0.75 = 12\,\mu\mathrm{m}$
- 2 latches and 2 selectors in the multiplexer:  $2 + 2 \times 2 = 6$  devices, i.e.  $6 \times 0.75 = 4.5\,\mu\mathrm{m}$



Figure 3.38: 2:1 multiplexer schematic for transmitter using half rate clk and  $\overline{clk}$: (a) latch based topology, (b) selector cell





Hence, the total device width which needs to be driven by the clock is  $16.5 \,\mu\text{m}$. Using again the fanout of 4 (FO4) rule for minimum propagation delay, the size of the last stage clock inverter which drives the whole load is  $4\,\mu\text{m}$. It is in turn buffered through three more CMOS inverters with widths of  $1\,\mu\text{m}$,  $0.25\,\mu\text{m}$  and  $0.08\,\mu\text{m}$  respectively, as shown in Figure 3.39.
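The clock load and the FO4 chain sizing can be reproduced numerically. This is a sketch with illustrative function names that simply divides the driven width by the fanout at every stage; the ideal values come out as 4.125, 1.03, 0.26 and 0.064 µm, which the text rounds to 4, 1, 0.25 and 0.08 µm. Reading the final 0.08 µm as a minimum device width is an assumption.

```python
def clock_load_width(n_ff=8, latches_per_ff=2, mux_devices=6, unit_width_um=0.75):
    """Total clocked device width: PRBS latches plus multiplexer devices."""
    return (n_ff * latches_per_ff + mux_devices) * unit_width_um


def buffer_chain(load_width_um, fanout=4.0, n_stages=4):
    """Inverter widths from the last stage (next to the load) back to the first."""
    widths = []
    width = load_width_um
    for _ in range(n_stages):
        width = width / fanout  # each stage drives `fanout` times its own width
        widths.append(width)
    return widths


load = clock_load_width()
chain = buffer_chain(load)
print(f"clock load: {load} um")
print("ideal FO4 widths (um):", [round(w, 3) for w in chain])
```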

## 3.1.11 Clock Generator

The maximum target frequency for the clock generator, based on the ideal target data rate of 16 Gb/s, is 8 GHz. However, there is one restriction: the frequency can be modulated only through the power supply of the clock generator, which has a range of 0.8-1 V. Hence, the clock generator must be designed in such a manner that modulation of the supply voltage places the clock frequency in the middle of the primary target range of 2-8 GHz. The lower and higher end frequencies of the primary range may not be met due to the small power supply modulation range. However, the middle range of 4-6 GHz should be met and is the prime target.



Figure 3.40: Clock generator 3-stage ring oscillator

In order to generate the clock using ring oscillator topology, there are several design choices which need to be made, i.e. sizes of transistors, channel length of transistors, capacitance loads at each stage, and number of stages.

Number of stages: The number of stages must be odd for the oscillator to work. The minimum number, i.e. 3, is chosen in this work to reduce the size and power of the oscillator.

Size of devices: The size of stage is chosen based on the output frequency and simulations are performed to select the width and length which can give the output clock with required frequency. After sweep simulations with length  $5 \times$  the minimum channel length, width of devices chosen is equal to  $1 \,\mu\text{m}$ .

Therefore, the clock is generated using a 3-stage CMOS inverter based ring oscillator as shown in Figure 3.40. In order to reduce the phase difference between the complementary clocks, cross coupled inverters are placed at the three nodes of the ring oscillator. Since the cross coupled inverters are only required for reducing the phase difference between the complementary clocks, a fanout of 8 (FO8) is used to reduce the power. The size of the cross coupled inverters is about  $1/4 \times$  that of the inverters in the oscillator. Instead of placing discrete capacitors at the nodes of the oscillator, longer channel lengths are used to reduce the phase noise of the oscillator. This saves the area which would otherwise be taken by capacitive loads at the oscillator stages.

The simulation of ring oscillator after layout RC extraction is shown in Figure 3.41. The clocks are purely complementary and 4.64 GHz is generated using power supply of 0.8 V. In order to reduce the phase noise, the ring oscillator is supplied with a separate power supply which is only used to power this clock generator block. The simulated phase noise of the oscillator is shown in Figure 3.42, which shows the low phase noise performance of the oscillator. This is very important to achieve low



Figure 3.41: Clock  $4.64\,\mathrm{GHz}$  at  $0.8\,\mathrm{V}$ 

jitter at the output.
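For orientation, the simulated frequency can be translated into an average delay per stage using the standard ring oscillator relation $f = 1/(2 N t_d)$. This relation and the function name below are not taken from this work; the sketch only shows the order of magnitude implied by 4.64 GHz with three stages.

```python
def stage_delay(f_osc_hz: float, n_stages: int) -> float:
    """Average per-stage delay from f = 1 / (2 * N * t_d)."""
    if n_stages < 3 or n_stages % 2 == 0:
        raise ValueError("a single-ended ring needs an odd number of stages >= 3")
    return 1.0 / (2.0 * n_stages * f_osc_hz)


t_d = stage_delay(4.64e9, 3)
print(f"per-stage delay: {t_d * 1e12:.1f} ps")  # 35.9 ps
```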

# 3.1.12 Simulation Results and Analysis

The top level layout of the transmitter along with all the test blocks is shown in Figure 3.43. A left-to-right block placement is used, where clock and data generation are on the left side while the multiplexer and the driver with pre-driver are on the right side. The total size of the transmitter, including only the driver, pre-driver and multiplexer, is  $70 \times 18 \,\mu\text{m}$. The connections of the transmitter block to the GSG pads and the placement methodology of the transmitter are shown in Figure 3.44. The complete transmitter is small enough to be placed under three 60 µm pads for GSG output. The size of the block shown here is  $260 \times 260 \,\mu\text{m}$. The total capacitance of the 60 µm output pad and the wiring connecting it to the driver is extracted to be 70 fF.

Figure 3.42: Phase noise of ring oscillator

Figure 3.43: SST top level layout

Figure 3.44: SST top level with connections to GSG pads

Figure 3.45: Simulation setup of the Transmitter

The simulation setup for this transmitter is shown in Figure 3.45. Two interconnects are used for simulation. The first is a  $50 \Omega$  lossless transmission line, used to test the impedance matching qualities of the driver. The second is a 3.8 mm long organic substrate channel with the s-parameters shown in Figure 3.9. For longer lengths, multiple s-parameter blocks are cascaded together to form 7.6 mm and 11.4 mm channels.

For the  $50 \Omega$  transmission line, the output is tested with 20% impedance mismatch, i.e. a 40  $\Omega$  termination resistance is used at the far end of the transmission line. This results in reflections coming back to the driver and tests its reflection suppressing qualities. The results for the transmission line with impedance mismatched termination are shown in Figure 3.46. With minimum drive strength, the reflections are quite large but still not very damaging if the receiver is not very sensitive. Also, due to the minimum driver size, the output voltage swing is only 260 mV, which stresses the receiver. With maximum driver size, the large reflections coming back from the 40  $\Omega$  receiver termination are suppressed completely by the driver and the voltage swing is increased to 350 mV.
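The size of the far-end reflection in this test follows from the standard reflection coefficient $\Gamma = (Z_L - Z_0)/(Z_L + Z_0)$. The sketch below evaluates it for the 40 Ω termination; the second call treats the driver as the load seen by the returning wave, and the driver impedance values of 80 Ω and 50 Ω used there are hypothetical, chosen only to show that a matched driver absorbs the returning wave.

```python
def reflection_coefficient(z_load: float, z0: float = 50.0) -> float:
    """Voltage reflection coefficient of a load on a line of impedance z0."""
    return (z_load - z0) / (z_load + z0)


gamma_far = reflection_coefficient(40.0)       # 40-ohm far-end termination
print(f"far-end reflection: {gamma_far:.3f}")  # -0.111

# Hypothetical driver output impedances seen by the returning wave.
for z_drv in (80.0, 50.0):
    print(f"driver {z_drv:.0f} ohm -> re-reflection {reflection_coefficient(z_drv):+.3f}")
```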

In order to test the transmitter performance on the 11.4 mm organic substrate interconnect, three s-parameter blocks of the 3.8 mm interconnect model are cascaded together. The receiver end is terminated with 50  $\Omega$  and simulations are performed with various clock rates to generate 13.3 Gb/s and 9.16 Gb/s outputs. The resulting eye diagrams are shown in Figure 3.47 and demonstrate the rise and fall time (equivalent bandwidth) of the transmitter over the targeted channels. Some non-monotonic behavior is seen at 13.3 Gb/s, which is due to the limited performance of the multiplexer. This limits the maximum data rate of the transmitter to 13.3 Gb/s instead of the ideal maximum of 16 Gb/s.

For the low speed unterminated driver, the maximum data rate achieved over the 3.8 mm interconnect is 4.58 Gb/s. The eye diagrams without and with the interconnect are shown in Figures 3.48a and 3.48b respectively. The eye has already reached its limit and hence higher data rates are not possible with this small unterminated driver.





Figure 3.47: Output at  $50 \Omega$  termination after 11.4 mm organic interconnect

### 3.1.13 Comparison with State of the Art

There are several works with which this transmitter is compared. As discussed previously, there is currently no published work implementing the bunch of wires standard. Hence, the transmitter is compared to similar chip to chip interfaces, although some support only very short lengths while others have a very low maximum data rate. The comparison regarding length and data rate is shown in Figure 3.49, where most works do not support the required length and data rate at the same time, while this work achieves the high data rate and also supports long lengths.

The comparison of the energy-area per unit length metric, restricted to works with comparable high data rate or long length, is shown in Figure 3.50. This work achieves a lower energy-area cost per unit length in comparison. Other works could achieve a better metric, but they either support only extremely short lengths or their maximum data rate is too low. Hence, they cannot be compared to this work, as they do not fulfill the basic channel length and data rate requirements together. This work is also better than the CML or LVDS driver based transmitter works because



Figure 3.48: Low speed HSUL unterminated driver output at 70 fF load and 3.8 mm organic interconnect



Figure 3.49: Length and data rate comparison of SSTL driver

their energy costs are much higher than this work, as it was also shown at the start of this section.

For low speed HSUL driver, the state of art comparison is shown in Figure 3.51.



Figure 3.50: Energy-area per unit channel length cost comparison of terminated SSTL driver



Figure 3.51: Energy-area per unit channel length cost comparison of unterminated HSUL driver



Figure 3.52: Problem visualization feedback based driver optimization

# 3.2 Problem 2: Driver Optimization Example

Two kinds of optimization are possible in transmitter design, i.e. during the design stage and during operation. The optimization during the design stage is discussed in the previous section, where the channel properties are used to set the impedance of the driver and the drive strength or current requirements. This section discusses the other type of transmitter optimization, i.e. energy efficiency based on the data rate requirements during operation or at runtime. This problem can be visualized as shown in Figure 3.52. The requirement is to keep the implementation of the feedback and tuning as close to digital as possible. This again follows from the highly digital implementation style of chip to chip interfaces discussed in the previous section. There are several questions which need to be answered.

**Choice of Topology**: The choice of topology has one requirement, i.e. an easy tuning mechanism based on the feedback signal. There are several possible topologies as discussed previously, i.e. SSTL, HSUL, LVDS, CML and so on. None of these topologies offers a device which can be controlled separately to adapt to the interconnect length, because the pull-up and pull-down devices are controlled directly by the data inputs as shown in Figure 3.53. NMOS-over-NMOS topologies, i.e. SSTL (N-N) and low swing unterminated logic (LSUL), also need two devices which are both controlled directly by the incoming data. Stacking a device that is not controlled by the data could be an option, but it would increase the area cost, which needs to be minimized as well. The only remaining topology with direct device control of the driver is the source follower (SF) topology, which is less used due to its DC current path to ground. It is nevertheless a good case study to show the optimization of a driver based on data rate and interconnect feedback. Hence, due to the direct control option of the driver, the source follower topology is chosen for this design.

**Design parameters:** For this driver runtime optimization example, based on the state of the art extremely small area designs [53][34], an extremely small driver area and low data rates of up to 1 Gb/s are targeted. Such interfaces can use the unterminated driver topology and hence achieve low energy consumption. In order to reduce the power consumption, a low signal swing at the output is better, as given by Figure 3.2. A 1 Gb/s driver for short MCM interconnect is designed with a low signal swing of 0.25 V, similar to the idea used in the latest memories such as LPDDR4X [92][93] and to the voltage swing requirement of the previous source series terminated design. The difference is that memories like LPDDR4 have termination at the receiver end, whereas this circuit is suited to unterminated applications. The driver sends the data to a receiver circuit which does not expect a large signal swing.

Instead of using PMOS-NMOS push pull topology, this driver uses NMOS-NMOS source follower topology to achieve low signal swing and fast rise times due to saturation region operation of the pull up NMOS [94]. Pull down NMOS is the control device for tuning and driven from a bias voltage which can be configured to control the fall time slew rate depending upon the interconnect and required data rate. The schematic of the driver is shown in Figure 3.54 below.

The data is sent at the gate of the transistor and output is probed at the source of the transistor. As the source output follows the input at the gate, it is generally referred to as a source follower buffer. The transistor operates in saturation region ideally. Source follower output driver is current biased where current defines the  $g_m$ of the driver transistor. Then output impedance  $R_{out}$  is equal to  $1/(g_m + g_{mb})$  where  $g_m$  is the transconductance due to voltage at the gate and  $g_{mb}$  is the transconductance due to voltage at the bulk or substrate.  $R_{out}$  is generally quite low.

Low  $R_{out}$  can be used to match to standard 50  $\Omega$  line impedance or higher. Since this work uses unmatched impedance design to enable low power data transmission



Figure 3.53: Topologies available for driver optimization example design



Figure 3.54: Source follower driver with 2:1 input Multiplexer

with voltage mode signalling, very low  $R_{out}$  is necessary to increase the swing at the output with high impedance. As shown in Figure 3.54, transistors N0, P0, N1, P1 form the 2:1 multiplexer in combination with latches (not shown here), while N2,N3 make the output driver. Driver receives input from a 2:1 input multiplexing transmission gate driven by complementary clock signals at half the data rate.

Sizing: The transistors in the pre-driver transmission gate (N0, P0, N1, P1) are designed with extremely small widths of only  $0.16 \,\mu\text{m}$. The transistors in the source follower driver, N2 and N3, have a width of  $1.6 \,\mu\text{m}$  and a length of  $0.07 \,\mu\text{m}$, which is derived from the saturation current required to drive the 70 fF load. The channel length is larger because medium thick oxide transistors are used for N2 and N3 at the output stage.

The transmission gate itself is not a multiplexer since it cannot shift the input data sent to the gate. The data shifting of even and odd data signals at half data rate (equal to clock rate) is achieved by opposite clock edge triggered latches. These latches followed by the transmission gate form a 2:1 multiplexer.

The timing diagram for odd and even signals, relationship with output data and clock is shown in Figure 3.55. Data inputs dE and dO are clocked at positive and negative clock edges of clk previous to the transmission gate, respectively. In driver transmission gate, clk negative edge latches the even data dE while positive edge latches the odd data dO as shown in Figure 3.55. This timing ensures that both dE and dO have timing margin of half the clock cycle or equal to 1 unit interval (1 UI).



Figure 3.55: Timing of driver

The layout of the driver along with top level test setup pads is shown in Figure 3.56a. The size of driver is only  $6 \times 4 \,\mu\text{m}$ . The test system contains the pads for data and clock inputs from outside equipment as shown in Figure 3.56b. The output is connected to GSG pads which are connected to wafer GSG probe to read the data output. The DC power supply and ground is also provided from wafer probe card. Next section will describe the analysis of slew rate of driver and edge phase noise, which are critical to its performance.



(a) Source follower driver with 2:1 input multiplexer layout

#### 3.2.1 Analysis and Results

During the 0-1 transition at output, the top transistor N2 turns on by entering into saturation region and starts supplying current to the output capacitance and the bias transistor N3. The current charges the load capacitance from 0 to  $VDD - V_T$ if there is no current flowing through N3. Some current provided by N2 will always flow through N3 if its bias voltage is greater than  $V_T$  of N3. As bias voltage of N3 is



Figure 3.57: Simulation model for organic substrate interconnect signalling

increased, the current drawn through N3 increases which results in the output DC voltage to decrease in order to allow N2 to output the same current by enhancing its overdrive voltage  $V_{GS} - V_T$ . For a given low swing output, the bias voltage can be changed to control the output swing voltage.

**Bias voltage tuning of driver** For extremely low bias voltages, the current capability of N3 is very low during the 1-0 transition. The drain current in N3 is then too small to pull the output capacitance load down to zero within 1 unit interval (1 UI). This leads to DC wander, i.e. a change in the output common mode voltage, which can result in false reception of data at the receiver end. It directly impacts the falling edge slew rate, defined as  $I_{N3}/C_{out}$. Hence, the feedback signal must increase the bias voltage until it is high enough to pull down the output within 1 UI. This minimum bias voltage is the optimum value, which leads to the minimum energy consumption of this driver under the given data rate and interconnect conditions. The tuning to find the minimum  $V_b$  is shown in Figure 3.58, where three identical drivers are used to calibrate the interface. Two drivers send complementary 50% duty cycle periodic steady state (PSS) data, which is used to extract the output common mode voltage or, in other terms, the reference voltage for the sampler. This  $V_{ref}$  is used by the slicer or sampler to detect the high or low bit on the line of the driver sending the test pattern. The error signal is generated based on whether the received data pattern is false or correct, and is used on the driver side to increment the bias voltage until the test pattern is matched, i.e. the error signal goes low. In this manner, the minimum required bias voltage is always set based on the required data rate.
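The tuning procedure amounts to a simple search loop: raise $V_b$ in steps until the error signal goes low. The sketch below illustrates only that control flow. The link is replaced by a toy stand-in (`make_toy_link`) that passes when the output can be discharged within 1 UI, and every number in it (linear current model, 100 µA/V slope, 0.42 V threshold, 140 fF load, 50 mV step) is a hypothetical value chosen for illustration, not data from this work.

```python
def tune_bias(pattern_ok, v_start=0.5, v_max=1.0, v_step=0.05):
    """Raise V_b step by step until the test pattern is received correctly.

    Returns the first (lowest) passing bias voltage, or None if none passes.
    """
    n_steps = int(round((v_max - v_start) / v_step))
    for k in range(n_steps + 1):
        v_b = round(v_start + k * v_step, 6)
        if pattern_ok(v_b):  # error signal low -> stop incrementing
            return v_b
    return None


def make_toy_link(swing_v=0.25, c_out_f=140e-15, ui_s=1e-9,
                  k_a_per_v=100e-6, v_t=0.42):
    """Hypothetical link model: pass if the falling edge completes within 1 UI."""
    def pattern_ok(v_b):
        i_n3 = max(0.0, k_a_per_v * (v_b - v_t))  # toy pull-down current
        if i_n3 == 0.0:
            return False
        t_fall = swing_v * c_out_f / i_n3         # swing / (I_N3 / C_out)
        return t_fall <= ui_s
    return pattern_ok


v_b_min = tune_bias(make_toy_link())
print(f"minimum passing bias in the toy model: {v_b_min} V")
```

Because the search starts from the low end and stops at the first passing value, it returns the minimum bias voltage on the chosen step grid, which corresponds to the minimum-energy setting described above.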

In order to further analyze the impact of the bias voltage, the simulation setup shown in Figure 3.57 is used to analyze 1 Gb/s signalling on the 3.8 mm interconnect with 70 fF pad capacitance  $C_{PAD}$  on both ends. Periodic steady state analysis was used to evaluate the slew rate and the results are shown in Figure 3.59. At 0.6 V  $V_{bias}$, the slew rate is extremely small and the signal swing is also very low. For 0.8 V  $V_{bias}$, the slew rate is sufficient to support a 1 Gb/s data rate on this channel.

Reduced slew rate directly impacts the phase noise of the falling and rising edges,



Figure 3.58: Proposed  $V_b$  tuning mechanism of driver



Figure 3.59: Complementary data (PSS) analysis and common mode  $V_{ref}$  variation



Figure 3.60: Random data slew rate problem and common mode  $V_{ref}$  variation

which results in higher jitter. In order to analyze this impact, the edge phase noise of both rising and falling edges is plotted for different bias voltages on a logarithmic frequency scale in Figure 3.61. The falling edge phase noise is always higher than the rising edge phase noise due to the triode region operation of N3 and the low drive current. The highest phase noise of  $-110 \, \text{dBc/Hz}$  is reported for the lowest bias voltage, while the lowest phase noise of  $-132 \, \text{dBc/Hz}$  is achieved using a higher bias voltage.

Several performance metrics of the driver are simulated and shown in Table 3.1. The table also shows the rms jitter values of the rising and falling edges caused by the slew rate dependent edge phase noise, extracted in the offset frequency range of 100-250 MHz.

For random half baud data inputs dE and dO with correct half clock timing



Figure 3.61: Edge phase noise analysis



Figure 3.62: 1 Gb/s output at receiver with 70 fF pad capacitance after 3.8 mm interconnect and 1 V bias voltage.

relationship at 0.5 Gb/s, the resulting baud rate 1 Gb/s output eye diagram with 1 V  $V_{bias}$  using the simulation setup of Figure 3.57 is shown in Figure 3.62. The transient simulation for different interconnect lengths and difference in falling edge

Table 3.1: Performance analysis of driver output at receiver end with 70 fF pad capacitance after 3.8 mm package interconnect and different bias voltages  $(V_b)$ 

| Bias voltage $V_b$ (V) | Rising edge slew rate (MV/s) | Falling edge slew rate (MV/s) | Voltage swing $V_{pkpk}$ (mV) | Common mode voltage (mV) | Rising edge $Jitter_{rms}$ (ps) | Falling edge $Jitter_{rms}$ (ps) |
|------------------------|------------------------------|-------------------------------|-------------------------------|--------------------------|---------------------------------|----------------------------------|
| 0.9                    | 410                          | 350                           | 275                           | 147                      | 0.38                            | 0.44                             |
| 0.8                    | 401                          | 314                           | 304                           | 190                      | 0.89                            | 0.95                             |
| 0.7                    | 317                          | 239                           | 254                           | 280                      | 1.12                            | 1.60                             |



Figure 3.63: 1 Gb/s output at receiver with 70 fF pad capacitance after 3.8, 7.6 and 11.4 mm interconnect and 1 V bias voltage.

slew rate is shown in Figure 3.63.

**State of the Art Comparison** Although this driver was used as an example to show the tuning mechanism based on feedback and its usage in short interconnect based communication in multi-chip systems, this driver can be compared to recent

| Reference                                                                            | [41]                | [34]        | [53]                               | [33]             | This Work          |
|--------------------------------------------------------------------------------------|---------------------|-------------|------------------------------------|------------------|--------------------|
| Process<br>(nm)                                                                      | 28                  | 40          | 65                                 | 28               | 22                 |
| Driver<br>topology                                                                   | Capacitor<br>driven | SST (N-N)   | Current<br>mode<br>(Open<br>Drain) | SST (P-N)        | source<br>follower |
| Equalizer<br>required                                                                | No                  | Yes (FFE)   | Yes (FFE)                          | Yes<br>(Passive) | No                 |
| Driver area $(\mu m^2)$                                                              | $140 \times .03$    | 9.4×1.1     | $21 \times 63$                     | 1500             | $6 \times 4$       |
| Max data<br>rate (Gb/s)                                                              | 20                  | 1.1         | 3                                  | 20               | 1                  |
| Max length (mm)                                                                      | 4.5<br>(organic)    | 1 (silicon) | 20 (silicon)                       | 3.5<br>(silicon) | 3.8<br>(organic)   |
| Energy<br>Metric<br>(pJ/bit)                                                         | 0.125               | 0.105       | .095                               | 0.125            | 0.15               |

Table 3.2: Feedback loop optimized driver comparison with prior art



Figure 3.64: State of the art comparison of example source follower bias tunable driver

published works in terms of energy, area and combined energy-area metrics per unit length. Since this driver consumes DC current, its energy metric is relatively high, but owing to its low area usage and suitably chosen bias voltages it still compares well with prior art. The comparison in terms of energy-area cost per unit interconnect length (pJ/bit\*mm) is shown in Figure 3.64, where even with the DC current, the energy consumption is quite close to that of recent works.

# 3.3 Conclusion

Transceivers can use one of two topologies, terminated or unterminated. Unterminated signalling is used for low data rates on short interconnects, while terminated signalling is used for multi-Gb/s data rates on longer channels. The recently introduced bunch of wires (BOW) interface standard requires both drivers in a single interface with the same digital back end. This work presents the design of both drivers and their comparison with the state of the art, and shows that a lower energy-area cost per unit length of interconnect can be achieved with these designs.

An example source follower driver was shown to adapt to the data rate requirement and the interconnect by tuning its bias voltage based on a feedback error signal. Although a state of the art comparison was not the primary target for this driver, it was still compared to recent works and achieved energy-area costs close to the state of the art, even though it consumes DC current through the bias transistor, which is used as the control device for the slew rate.

The next chapter describes the measurement results of the manufactured chips and their comparison to simulation results.

# Chapter 4

# Measurement Results

In order to test the SSTL high speed and HSUL drivers along with the source follower driver with tunable bias voltage, two chips are manufactured in 22 nm FDSOI technology. These chips also contain the back end digital blocks, clock oscillator, multiplexers, a PRBS generator based on flip-flops and XORs, and a transmission gate based 2:1 multiplexer for pre-timed even and odd data streams. Both chips are tested at wafer level with a wafer prober, an RF Ground-Signal-Ground (GSG) probe and a $50 \Omega$ oscilloscope input. The control signals, power supply and ground are delivered to the chip using a probe card. The test chips' layout, test setups, measurement results and comparison to the simulations are presented in this chapter.

# 4.1 Bunch of Wires Transmitter

This section describes the test setup and the measurement results of the BOW transmitter with dual drivers SSTL and HSUL for high and low speed respectively.

#### 4.1.1 Test Setup

The layout of the BOW transmitter is done in such a manner that the power supply, ground and control signals come from the wafer probe card and the output is measured using an RF GSG probe. The vertical distance between the probe card pads and the GSG pads of the RF probe must be large enough for both probes to fit side by side. That is why the distance between the probe card pads and the design is large, as seen in Figure 4.1, which shows the design along with the pads and the manufactured chip micrograph. The bottom pads can be accessed with an 8-pin probe card with 100 µm pitch.

The top right pad is for the HSUL unterminated driver output, while the bottom pads are for the SST driver outputs. The power supply and the control signals are provided by test equipment from National Instruments (NI), as shown in Figure 4.2. The NI-6739 card with analog outputs is used to send the control signals and turn them on or off, while the NI-4143 card is used to provide the power supplies for the clock oscillator (vddc) and the rest of the design (vdd).



Figure 4.1: Complete Pad level layout and chip micrograph of BOW transmitter

# 4.1.2 Results

The oscillator frequency is changed by changing its supply voltage. The comparison between the measured values and the values simulated with the extracted layout is shown in Table 4.1. The frequency at 0.9 V has a lot of jitter due to the lack of a phase locked loop (PLL). The measured frequencies at 0.7 V and 0.8 V match the simulated values quite well.



Figure 4.2: Test setup of BOW transmitter

| Voltage (V) | Freq simulated (GHz) | Freq measured (GHz) |
|-------------|----------------------|---------------------|
| 0.7         | 3.35                 | 3.145               |
| 0.8         | 4.64                 | 4.5                 |
| 0.9         | 5.76                 | 6-8 (high jitter)   |

Table 4.1: Measured versus simulated clock oscillator output frequency
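To put a number on how well the two usable rows of Table 4.1 match, the relative deviation of the measured from the simulated frequency can be computed. The following is a minimal sketch; the helper name `relative_deviation_percent` is hypothetical, and the 0.9 V row is omitted because its measurement is only given as a range.

```python
# Values copied from Table 4.1: (supply voltage in V, simulated GHz, measured GHz).
# The 0.9 V row is left out because its measured value is a 6-8 GHz range.
table_4_1 = [
    (0.7, 3.35, 3.145),
    (0.8, 4.64, 4.5),
]


def relative_deviation_percent(simulated, measured):
    """Deviation of the measured value from the simulated one, in percent."""
    return 100.0 * (measured - simulated) / simulated


deviations = {
    vddc: relative_deviation_percent(sim, meas)
    for vddc, sim, meas in table_4_1
}

for vddc, dev in deviations.items():
    print(f"vddc = {vddc:.1f} V: measured deviates by {dev:+.1f} % from simulation")
```

For these two rows the measured frequency is about 6 % and 3 % below the simulated value, respectively.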

The resulting PRBS outputs of the SSTL drivers, measured at their output GSG pads using the RF probe with a maximum allowed oscilloscope input of 300 mV, are shown in Figures 4.4, 4.5 and 4.6. As can be seen in these waveforms, the output swing is clipped at 300 mV due to the input limit of the sensitive oscilloscope input. The expected simulated swing is around 400 mV, and the shape of the measured waveforms indicates a swing higher than 300 mV, which matches our expectation of more than 300 mV swing even after the losses in the probe and the cables.



Figure 4.3: Measured waveform of SST output at 0.7 V clock supply and 6.29 Gb/s

The eye diagram at 9 Gb/s is shown in Figure 4.7a, where the clock jitter is still manageable. If the clock were provided from an external high quality source, the jitter could be made considerably smaller. The driver behavior, which is independent of the clock quality, can still be judged from the eye diagram: the rise and fall times are well matched, and the absence of reflections shows that the impedance termination is working well. The measured eye also matches the simulated eye diagram shown in Figure 4.7b well.

For the HSUL driver, the waveform at 3.65 Gb/s is shown in Figure 4.8a, where the voltage swing is small due to the $50 \Omega$ termination of the oscilloscope. The voltage swing of 70 mV corresponds to a current of 1.4 mA being driven into the oscilloscope termination. This measured swing is less than the expected swing of 90 mV shown in Figure 4.8b, which could be due to probe and cable losses in the measurement setup. Another reason could be variation in the oscilloscope termination, which, if lower than $50 \Omega$, would produce a lower voltage swing as seen in the measurement.
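The relation between the swing and the drive current quoted above is Ohm's law across the oscilloscope termination. A short sketch, assuming an ideal $50 \Omega$ termination (the function name is hypothetical):

```python
R_TERMINATION_OHM = 50.0  # oscilloscope input termination stated in the text


def termination_current_ma(swing_mv, r_ohm=R_TERMINATION_OHM):
    """Current in mA that produces a swing in mV across the termination (mV / Ohm = mA)."""
    return swing_mv / r_ohm


measured_ma = termination_current_ma(70.0)  # measured swing of 70 mV
expected_ma = termination_current_ma(90.0)  # simulated swing of 90 mV
print(f"measured: {measured_ma:.1f} mA, expected from simulation: {expected_ma:.1f} mA")
```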



Figure 4.4: Measured waveform of SST output at 0.7 V clock supply and 6.29 Gb/s



Figure 4.5: Measured waveform of SST output at 0.8 V clock supply and 9 Gb/s



Figure 4.6: Measured waveform of SST output at  $0.9\,{\rm V}$  clock supply and 13-16 Gb/s with high jitter of clock



(b) Simulated eye diagram of SST output at 9 Gb/s output data rate

Figure 4.7: Measured and simulated eye diagram comparison at 9 Gb/s



(b) Simulated eye diagram of HSUL output at 3.65 Gb/s output data rate

Figure 4.8: Measured and simulated HSUL waveforms at 3.65 Gb/s



Figure 4.9: Complete Pad level layout of low swing transmitter

# 4.2 Source Follower Driver Measurements

This section shows the measured results along with test setup of the example bias tunable source follower driver and the comparison to simulation results for validation purposes.

# 4.2.1 Test Setup

The pad level structure of the low swing source follower transmitter with 2:1 multiplexer is shown in Figure 4.9. The control, supply, ground and biasing pads are at the bottom, while the output is routed to the GSG RF pad at the right. In the GSG pad, the top and bottom pads are connected to the VSS input from the DC probe card pad at the bottom. The signal pad of the GSG configuration is connected to the output of the driver. This output is then connected to a $50 \Omega$ termination in the oscilloscope.

Input signals are received through the probe card at the bottom pads along with the multiplexing $clk/\overline{clkb}$ signals. The routing of the synchronous signals dE, dO, clk and $\overline{clkb}$ is length matched as closely as possible from the pads to the input gates of the transistors. The VDD and VSS pads are kept close together so that the parasitic capacitance between the VDD and VSS interconnects provides some decoupling capacitance.

These test signals are automated using test equipment. A software application is made in LabVIEW to send the $clk/\overline{clkb}$, dE, and dO signals along with setting up

| Signal | Type            | Voltage(V) | Time Period     | Phase                       |
|--------|-----------------|------------|-----------------|-----------------------------|
| VDD    | Power Supply    | 1.2        | -               | -                           |
| VSS    | Ground          | 0          | -               | -                           |
| Out    | output          | -          | 0.5*TCk per bit | -                           |
| dO     | input data odd  | 0 - 1.2    | TCk per bit     | 0.75TCk out of phase of clk |
| dE     | input data even | 0 - 1.2    | TCk per bit     | 0.25TCk out of phase of clk |
| clk    | clock           | 0 - 1.2    | TCk             | 0                           |
| clkb   | clock inverted  | 0 - 1.2    | TCk             | 0                           |
| vbias  | bias voltage    | 0.6        | -               | -                           |

Table 4.2: Test circuit signal specifications

the power and ground supplies. The test schematic is shown in Figure 4.10. SMU stands for source measure unit, which supplies VDD, Vbias and VSS (ground) to the test structure on the wafer. The HSDIO (high speed digital input output) card sends the dE, dO, clk and clkb signals to the wafer.

A probe card is used to connect all the signals from PXI equipment to the wafer as shown in Figure 4.11. All these signals are first transmitted from the PXI cards to the connector board CB-2162 as shown in Figure 4.12. The CB-2162 card then connects to the probe card connector.

The requirements for the signal inputs to the test circuit, which consists of the multiplexer and the source follower driver, are listed in Table 4.2, where TCk denotes the clock period. The LabVIEW application for data transmission to the wafer is shown in Figure 4.13. The maximum rate of the clk or dE/dO signals from the HSDIO card in the PXI system is 200 MHz at the 1.2 V setting. The dE and dO signals are delayed using the delay control settings of the card. As shown in Figure 4.13, delay settings of 0.25 and 0.75 are given to channels 4 and 8, which are used for dE and dO, respectively.
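The timing entries of Table 4.2 all follow from the clock period. The sketch below derives them for a given clock frequency; the function name and dictionary keys are hypothetical, and the 0.25 and 0.75 fractions are the delay settings quoted above.

```python
def mux_test_timing(clock_hz, de_delay_frac=0.25, do_delay_frac=0.75):
    """Derive the Table 4.2 timing values from the clock frequency."""
    t_ck = 1.0 / clock_hz  # clock period TCk in seconds
    return {
        "t_ck_s": t_ck,                        # also the dE/dO bit interval
        "de_delay_s": de_delay_frac * t_ck,    # dE phase offset relative to clk
        "do_delay_s": do_delay_frac * t_ck,    # dO phase offset relative to clk
        "output_bit_interval_s": 0.5 * t_ck,   # 2:1 multiplexed output
        "output_rate_bps": 2.0 * clock_hz,
    }


timing = mux_test_timing(200e6)  # 200 MHz limit of the HSDIO card
for name, value in timing.items():
    print(f"{name}: {value:.4g}")
```

At the 200 MHz limit of the card this gives a 5 ns clock period, delays of 1.25 ns and 3.75 ns for dE and dO, and a 2.5 ns output bit interval, i.e. a 400 Mb/s output stream.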

In order to verify the data and clock generated by the software application, the generated signals are measured with an oscilloscope. The resulting eye diagrams at 100 MHz clock rate with double data rate dE and dO are shown in Figures 4.14 and 4.15, which confirm that the delay settings, i.e. the phases of dE and dO, are as intended.

Figure 4.14 shows two eye diagrams, upper is for the clock signal 200 MHz



Figure 4.10: Test setup schematic of low swing transmitter



Figure 4.11: Probe card connector in touch with wafer using thin needles

and the lower eye is for dE with $0.25\,{\rm Tck}$ delay. The phase difference between the two eyes confirms that dE is delayed by $25\,\%$ of the clock period. Similarly Figure 4.15



Figure 4.12: CB-2162 card for data transmission from HSDIO card to probe card

confirms the  $75\,\%$  delay in dO as compared to clock.



Figure 4.13: LabVIEW application for power supplies and data transmission to wafer



Figure 4.14: Double data rate dE with 0.25 TCk delay as compared to Clock

# 4.2.2 Results

The probe card and the RF GSG probe are set onto the pads on the wafer as shown in Figure 4.16. Figure 4.17 shows the scratches on the pads of the test circuit after the probes are placed. This is typical in wafer probing because the probes must slide onto the pads a little to make good contact with the pad surface. The RF probe is connected to an oscilloscope input with 33 GHz bandwidth that is terminated with 50 $\Omega$ to ground.

Random data is streamed in the form of the dE and dO data signals with the correct phase delay as described before. The received data for 200 MHz dE and dO data streams is shown in Figure 4.18. Since dE and dO are 200 Mb/s data streams, the output is expected to be a 400 Mb/s random stream. For 200 Mb/s, the eye diagram is shown in Figure 4.19 with an eye width of 5 ns and an eye height of 36 mV.

For the low swing driver, the simulation is set up to replicate the measurement environment as closely as possible. The simulation environment is shown in Figure 4.20. The eye of the 400 Mb/s random data stream at the oscilloscope input is shown in Figure 4.21. The simulated voltage swing is 36 mV, which is very close to the average measured swing of 30 mV. Reflections in simulation are not very large but in measurements the



Figure 4.15: Double data rate dO with 0.75 TCk delay as compared to Clock



Figure 4.16: Setup of probe card and RF probe needles on wafer pads

reflections are large due to non-ideal data and clock input paths from DC probe card. The reflections can be linked to the difference in the channel model at the inputs of dE/dO and  $clk/\overline{clkb}$ .

In simulation, an ideal $50 \Omega$ line impedance is chosen, while the actual channel in measurement is non-ideal and consists of the following path items:

- $50\,\Omega$, 1 meter (4 ns) long cable
- Unmatched connection at the CB-2162 board from the HSDIO card in the PXI equipment
- Unmatched connection at the probe card connector from the CB-2162 board



Figure 4.17: Scratches on wafer pads after probing


Figure 4.18: Multiplexed 400 Mb/s random output stream at  $50\,\Omega$  oscilloscope



Figure 4.19: Multiplexed 200 Mb/s eye diagram at 50  $\Omega$  oscilloscope

- Non-ideal impedance of the probe card needles, which is generally not as good as that of $50\,\Omega$ RF needles

These mismatches in the input channels dE/dO and $clk/\overline{clkb}$ cause the reflections seen in the measurement result. Nevertheless, the measurement verifies the voltage swing, which matches the simulation result closely. The measurement also verifies



Figure 4.20: Simulation Test bench of test circuit



Figure 4.21: Simulated 400 Mb/s random output stream 50  $\Omega$  terminated to ground

the performance of the 2:1 multiplexer, whose minimum output bit interval is equal to half of the dE/dO bit interval.

#### 4.2.3 400Mbps Test Limitation

The maximum clock rate at which the source follower driver was tested is 200 MHz. This limitation is due to the 1.2 V CMOS logic signal requirements of the driver. The equipment available in the lab, i.e. the PXIE-6548 HSDIO card, can only send signals at data rates up to 200 Mbps, which means that the maximum output data rate of the driver under these test conditions is 400 Mbps. Hence, the 400 Mbps test limit is due to the equipment, not the driver itself. In order to test the driver at a higher clock rate, either a higher speed CMOS IO card is required, or the high speed signals must be provided in the form of current mode logic (CML) or low voltage differential signalling (LVDS). In the latter case, CML/LVDS-to-CMOS (1.2 V) converters are also required, which then deliver the high speed clock and data signals in 1.2 V logic format to the driver.

# 4.3 Conclusion

This chapter validates the simulation results of the previous chapter with measurement results. Simulated and measured waveforms and eye diagrams are compared to validate the design. Very good matching with simulation was observed up to 9 Gb/s, even with the non-ideal clock management unit. High jitter was observed in the SST driver data outputs above 9-10 Gb/s, which was due to the lack of a phase locked loop in the clock generator block. Since the clock generation design is not a part of this thesis, the actual targets of the comparison are the voltage swing and the rise and fall times of the front end driver. These were compared to the simulation results. The measured voltage swing of the SSTL driver was about 50 mV less than simulated, which can be attributed to the measurement setup, probe and cable losses, and mismatch in the oscilloscope termination. In the HSUL driver, the voltage swing was only 20 mV less. The clock generator frequency was also compared with simulation, and good matching was observed up to 5 GHz. Higher frequencies had too much jitter, again due to the lack of a phase locked loop.

Both transmitters are designed in 22 nm technology and taped out. The chips are manufactured and then tested at wafer level using a wafer prober, probe card, RF probe, and automated test equipment. The complete test setup, including the test application in LabVIEW, is shown in this chapter.

The source follower validation measurement is done for 200 and 400 Mb/s random data streams coming out of the low swing transmitter. The resulting data shows that the multiplexer performs as expected and that the minimum bit period at the output is equal to half of the bit period of the input even and odd data streams. The driver sends the multiplexed data with the expected power, and the voltage swing at the oscilloscope is very close to the simulated 36 mV.

The reflections in the received data can be attributed to the mismatched impedances along the signal path of the input data streams from the PXI card to the input transistor gates. The channel mismatches are detailed, including the contributions from the CB-2162 card, the probe card connector and the needles. It has been shown that, apart from the reflections, the measurement and simulation results match closely, which confirms the specifications of the designed driver for low data rate unterminated applications.

# Chapter 5

# Channel

There are two parts to any multi-chip communication interface, i.e. the transceiver and the interconnect channel. The previous two chapters described the transmitter and circuit design for a given organic substrate channel. The problem can also be approached from the other side: if the transmitter is given, how should the channel be designed for the given transmitter circuits, and are there any settings or choices in the transmitter that can be changed to reduce the energy and area usage of the communication interface? The state of the art analysis regarding this question led to the following problem statements:

1. Characterize and design a 2.5D MCS channel for a DDR3 memory-CPU interface that reduces the energy and area cost metrics of the 2.5D DDR3 system relative to the PCB system, and compare the two.
2. Characterize the performance of high speed serial interfaces used in industry (SERDES) on 2.5D silicon interposer interconnect, and compare the energy-area cost of changing the channel design versus changing the transmitter design to increase the eye width and height at 10 Gb/s SERDES data rate. This discussion forms the basis for the next chapter on the co-design methodology for channel and transmitter.

The first section deals with the first problem and the second section deals with the second problem specified above.

# 5.1 Channel design for DDR3 2.5D Interface

Two widely used types of data interface were found in the state of the art analysis, i.e. memory interfaces with many data lines and high speed serial (SERDES) interfaces with few data lines. The prior art on memory interfaces only covered HBM and wide-I/O memories and their usage on the 2.5D interposer interconnect. Knowledge was missing on the usage of double data rate (DDR) memory dies in a 2.5D memory-CPU interface. The first problem in this chapter tries to add knowledge in this area and specifically addresses the following tasks:

- Determine the minimum possible width and spacing of a 10 mm long silicon interposer interconnect for the DDR3 memory interface using encrypted industry circuits



Figure 5.1: Problem 1 of 2.5D DDR3 interface minimum energy-area design flow

- Determine the optimum, i.e. minimum possible, energy consumption settings for the DDR3 driver and receiver circuits when used with the silicon substrate based 2.5D interconnect
- Compare the minimum possible energy-area DDR3 2.5D interface with a typical DDR3 PCB interface

The flow describing how the above problem shall be dealt with is shown in Figure 5.1. Consider a typical DDR3 interface with n data lines generally called DQ lines, which carry the data through the interposer channel between memory and cpu. The area A consumed by the memory interface on the interposer between the memory and cpu in a side by side placement is equal to

$$A = [n \cdot W + (n-1) \cdot S] \cdot L$$

where W is the width of the single interconnect, S is the spacing between two interconnects and L is the length of the interconnects. The energy  $\phi$  consumed by the DDR3 interface is given as

$$\phi = \phi_i \cdot n$$

where $\phi_i$ is the energy consumed by the data (DQ) transmission over a single interconnect at the specified frequency. The widely used version of DDR3 with a DQ data rate of 1600 Mbps is used in this work. The DDR3 signalling topology is based on the SSTL-HCM topology analyzed in chapter 3. That analysis helps determine the minimum energy consumption settings for the DDR3 2.5D interface.
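The two cost expressions above are simple enough to evaluate directly. The sketch below does so for one point of the later width and spacing sweep. The function names are hypothetical, and the line count of 8 and the per-line energy of 1.0 are illustrative assumptions, not values from this work.

```python
def interface_area(n_lines, width, spacing, length):
    """A = [n*W + (n-1)*S] * L, with all lengths in the same unit."""
    return (n_lines * width + (n_lines - 1) * spacing) * length


def interface_energy(n_lines, energy_per_line):
    """phi = phi_i * n, in the unit of the per-line energy."""
    return energy_per_line * n_lines


# Assumed example: 8 DQ lines, W = 2 um, S = 5 um, L = 10 mm = 10000 um.
area_um2 = interface_area(n_lines=8, width=2.0, spacing=5.0, length=10_000.0)
energy = interface_energy(n_lines=8, energy_per_line=1.0)  # placeholder phi_i

print(f"area = {area_um2:.0f} um^2 = {area_um2 * 1e-6:.2f} mm^2")
print(f"energy = {energy:.1f} (in units of phi_i)")
```

Because the area grows linearly in both W and S, while the energy depends on them only through $\phi_i$, the minimum width and spacing that still meet the signal integrity requirements directly minimize A.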

#### 5.1.1 Silicon Interposer Channel Characterization

For the interposer, an example stackup is shown in Figure 5.2, which shows two layers of interconnects on each side of a double sided, silicon based, 4-BEOL-layer interposer. The width W and the spacing S between interconnects are not shown because they are swept over a range of values. In order to understand the impact of the material and interconnect design on the channel characteristics, it is necessary to describe the basic signal transmission phenomenon of these interconnects.

From basic electromagnetism and transmission line theory [85], any metal interconnect can be described as a chain of infinitesimally small elements, where each element has resistance, inductance, dielectric conductance and capacitance as shown in Figure 5.3. This is commonly known as the *RLGC* model. The total length of the interconnect can be written as $\sum \Delta x$. $R_{\Delta x}$ denotes the resistance per unit length of the interconnect, $L_{\Delta x}$ the inductance per unit length, $G_{\Delta x}$ the dielectric conductance per unit length and $C_{\Delta x}$ the capacitance per unit length. These values determine the characteristic impedance $Z_0$ as shown below.

$$Z_0 = \sqrt{\frac{R_{\Delta x} + j\omega L_{\Delta x}}{G_{\Delta x} + j\omega C_{\Delta x}}} \tag{5.1}$$

The RLGC values also describe the loss of signal amplitude at different frequencies while traveling along the interconnect due to metal resistance, dielectric loss and dielectric conductance. These losses play a significant role in the wave propagation, as the wave propagation constant $\gamma$ is written as:

$$\gamma = \alpha + j\beta \tag{5.2}$$

where $\alpha$ is the attenuation factor, mainly due to metal conduction and dielectric losses, and $\beta$ is the lossless wave propagation metric which describes the shape of



Figure 5.2: Silicon interposer stackup (not to scale)

wave at a certain time along the interconnect. For signal integrity analysis, metal widths and spacings values will be swept over a range of values. *RLGC* values will be calculated using HSPICE Field Solver.

For the RLGC models derived for the interposer shown in Figure 5.2, the two interconnects on the top metal layer are considered to actively carry the signal (S), while the rest are, for simplification, considered as reference (R) connected to ground. Two active signal interconnects are used in the simulation instead of a single one because this helps to determine the effect of coupling on the impedance characteristics and gives an idea of the multi-conductor effect on the signal propagation through the interposer dielectric. The dielectric constant of the silicon substrate is 11.9 and its loss tangent $\tan\delta_D$, based on a silicon resistivity of $10\,000\,\Omega\,\mathrm{cm}$, is 0.0015 [95]. The dielectric constant of $\mathrm{SiO_2}$ is 3.9 and its loss tangent $\tan\delta_P$ is 0.001. The width of the interposer interconnects is swept over $\{2,5,10,15\}\,\mu m$, while the spacing between the interconnects is swept over $\{5,10,20,30,40\}\,\mu m$.

For the simulation in the HSPICE field solver, the high accuracy option is used with a grid factor of 3. The simulation results in several matrices of order $n \times n$, where n is the number of signal conductors in the simulation. In our case, all matrices are of order $2 \times 2$. An example output file of the simulation is shown in Figure 5.4, where $L_o$, $C_o$, $R_o$ and $G_o$ are the frequency independent inductance, capacitance, resistance and dielectric conductance matrices, respectively. $R_s$ is the skin effect factor of the conductor resistance and $G_d$ is the dielectric dissipation factor of the dielectric conductance; they describe the frequency dependent variation of conductor resistance and dielectric conductance. Here, the capacitance matrix $C_o$ is a Maxwell capacitance matrix, in which each diagonal term is the sum of all capacitances from that conductor. For example, in Figure 5.4, $C_{11}$ is 131 pF/m, which is the sum of $C_{12} = 46.1$ pF/m and the conductor's capacitance to the reference ground, which can be calculated to be 85 pF/m.
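The decomposition of the Maxwell capacitance matrix described above can be checked with a few lines. This is a minimal sketch using the $C_o$ entries of Figure 5.4; the function names are hypothetical.

```python
# Maxwell capacitance matrix from Figure 5.4 in F/m (lower triangle mirrored).
c_maxwell = [
    [1.312641e-10, -4.615596e-11],
    [-4.615596e-11, 1.312641e-10],
]


def capacitance_to_reference(c_matrix, i):
    """Capacitance of conductor i to the reference: row sum of the Maxwell matrix."""
    return sum(c_matrix[i])


def mutual_capacitance(c_matrix, i, j):
    """Mutual capacitance between conductors i and j: negated off-diagonal entry."""
    return -c_matrix[i][j]


c_ground = capacitance_to_reference(c_maxwell, 0)
c_mutual = mutual_capacitance(c_maxwell, 0, 1)
print(f"C to reference = {c_ground * 1e12:.1f} pF/m")
print(f"C mutual       = {c_mutual * 1e12:.1f} pF/m")
```

The row sum reproduces the roughly 85 pF/m quoted in the text, and the two parts add back up to the diagonal term $C_{11}$.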

The RLGC parameters for the interposer stackup are plotted in Figure 5.5. Both the mutual inductance and the mutual capacitance decrease with increasing spacing. The mutual inductance also varies with width, i.e. the larger the width, the smaller the mutual inductance. The mutual capacitance between the two conductors remains almost constant with width variation. This is expected because change of width will not change



Figure 5.3: RLGC model representation of interconnect

```
*SYSTEM_NAME : se_rlgc
*     Half Space, AIR
*
*     ----- Z = 1.220000e-04
*     poly H = 2.200000e-05
*     ----- Z = 1.000000e-04
*
*
*     silicon H = 1.000000e-04
*     ----- Z = 0
*
*     Half Space, AIR
* L(H/m), C(F/m), Ro(Ohm/m), Go(S/m), Rs(Ohm/(m*sqrt(Hz)), Gd(S/(m*Hz))
.MODEL se_rlgc W MODELTYPE=RLGC, N=2
+ Lo = 3.468480e-07
+      1.085375e-07 3.468480e-07
+ Co = 1.312641e-10
+      -4.615596e-11 1.312641e-10
+
+ Ro = 2.531829e+03
+      3.616898e+02 2.531829e+03
+
+ Go = 3.838637e-04
+      -1.349768e-04 3.838637e-04
+ Rs = 3.361608e-02
+      8.546231e-03 3.361604e-02
+
+ Gd = 6.872970e-13
+      -2.416721e-13 6.872970e-13
```

Figure 5.4: RLGC output file example

the amount of parallel area between the parallel conductors, which keeps the mutual capacitance almost constant. The spacing between the conductors, in contrast, directly affects the coupling between them, following the inverse relationship of the basic parallel plate capacitance formula $C = A\epsilon_o/d$, where A is the parallel area and d is the spacing between the conductors.

The RLGC parameters play a critical role in defining the signal propagation through the interconnect, most visibly through the loss that they introduce to the signal. While the impedance seen by the signal on a lossy conductor is defined by Equation 5.1, the propagation of the signal is described by Equation 5.2, which can be further detailed as

$$\gamma = \sqrt{(R_{\Delta x} + j\omega L_{\Delta x})(G_{\Delta x} + j\omega C_{\Delta x})} \tag{5.3}$$

where  $R_{\Delta x}$  is actually a function of frequency due to skin resistance and  $G_{\Delta x}$  is also a function of frequency. These frequency relationships are described as

$$R_{\Delta x} = R_{o\Delta x} + R_{s\Delta x}\sqrt{f} \tag{5.4}$$

$$G_{\Delta x} = G_{o\Delta x} + G_{d\Delta x} f \tag{5.5}$$

where $R_{s\Delta x}$ represents the skin effect frequency factor per meter length and $G_{d\Delta x}$ represents the dielectric loss frequency factor. The effects of frequency on the conductor metal resistance and the dielectric loss are shown in Figures 5.6a and 5.6b, respectively. The resistance plot shows that even for 10 mm long realistic



Figure 5.5: RLGC parameters of interposer interconnect

interposer line of $2 \mu m$ width, the resistance at 2 GHz would be $40 \Omega$. Increasing the width from $2 \mu m$ to $5 \mu m$ halves the resistance at 2 GHz, i.e. to $20 \Omega$. But



(a) Resistance vs frequency  $(\Omega/m)$  (b) Dielectric conductance vs frequency

Figure 5.6: Frequency dependent conductances

further increasing the width does not have as strong an effect as the increment from $2 \,\mu m$ to $5 \,\mu m$. This means that for narrow interposer lines, the resistance plays an important role, whereas for wider lines this effect is much reduced.
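Equations 5.4 and 5.5 can be evaluated directly from the field solver coefficients. The sketch below uses the diagonal entries of the example file in Figure 5.4; the function names are hypothetical, and which width and spacing that example file was generated for is not stated, so treating it as representative of the narrowest line is an assumption.

```python
import math

# Diagonal entries of the example RLGC file in Figure 5.4 (per meter).
RO = 2.531829e+03  # Ohm/m, frequency independent resistance
RS = 3.361608e-02  # Ohm/(m*sqrt(Hz)), skin effect factor
GO = 3.838637e-04  # S/m, frequency independent conductance
GD = 6.872970e-13  # S/(m*Hz), dielectric loss factor


def resistance_per_m(f_hz, ro=RO, rs=RS):
    """Equation 5.4: R = Ro + Rs * sqrt(f)."""
    return ro + rs * math.sqrt(f_hz)


def conductance_per_m(f_hz, go=GO, gd=GD):
    """Equation 5.5: G = Go + Gd * f."""
    return go + gd * f_hz


LENGTH_M = 10e-3  # 10 mm interposer line
r_10mm = resistance_per_m(2e9) * LENGTH_M
g_2ghz = conductance_per_m(2e9)
print(f"R(2 GHz) over 10 mm = {r_10mm:.1f} Ohm")
print(f"G(2 GHz) = {g_2ghz:.3e} S/m")
```

With these coefficients the 10 mm resistance at 2 GHz evaluates to about $40 \Omega$, which agrees with the value quoted above for the $2 \mu m$ line.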

In the dielectric conductance plot in Figure 5.6b, the conductance increases proportionally with frequency. At high frequencies it becomes the dominant loss mechanism, because the skin effect induced resistance increase flattens out at high frequency, as can be seen in Figure 5.6a. It is now necessary to see how the different width and spacing combinations perform in terms of loss and of the impedance seen by the incoming signal.

Loss is defined by the real part $\alpha$ in Equation 5.2, which can be computed by solving Equation 5.3 for its real part. The real part $\alpha$ of the propagation constant has units of Nepers/m, while the imaginary part $\beta$ has units of radians/m. The real part determines the decay of the wave along the line, while the imaginary part determines the shape of the wave along the line at a given time. However, the coupling between the two conductors is not zero. This means that the general formulas for impedance and propagation constant are no longer valid, since these values depend on the type of signal traveling through the two conductors. In microwave theory, this is generally known as odd and even mode propagation [85]. In odd mode, two opposite signals travel through the two conductors, while in even mode, two identical signals are incident on the two conductors. So, $\gamma$ and Z both need to be calculated for both types of propagation, as given by the equations below.

$$Z_{even} = \sqrt{\frac{R_{o\Delta x} + j\omega L_{o\Delta x} + j\omega L_{m\Delta x}}{G_{o\Delta x} + j\omega C_{o\Delta x}}} \tag{5.6}$$

$$Z_{odd} = \sqrt{\frac{R_{o\Delta x} + j\omega L_{o\Delta x} - j\omega L_{m\Delta x}}{G_{o\Delta x} + j\omega C_{o\Delta x} + 2j\omega C_{m\Delta x}}} \tag{5.7}$$

where $C_o$ represents the self capacitance, not the Maxwell capacitance, and $\omega = 2\pi f$ with $f$ the frequency in hertz. Similarly, the even and odd mode propagation constants for



(a) Odd mode impedance vs frequency (b) Even mode impedance vs frequency

Figure 5.7: Impedance vs frequency

the two modes can be calculated as:

$$\gamma_{even} = \sqrt{\left(R_{o\Delta x} + j\omega L_{o\Delta x} + j\omega L_{m\Delta x}\right)\left(G_{o\Delta x} + j\omega C_{o\Delta x}\right)} \tag{5.8}$$

$$\gamma_{odd} = \sqrt{(R_{o\Delta x} + j\omega L_{o\Delta x} - j\omega L_{m\Delta x})(G_{o\Delta x} + j\omega C_{o\Delta x} + 2j\omega C_{m\Delta x})} \tag{5.9}$$

By calculating the attenuation part from the equations above, the loss in decibels per 10 mm length is plotted using the formula:

$$\alpha_{dB/10mm} = -20 \log_{10}(e) \cdot (\alpha/100) \tag{5.10}$$
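As a rough illustration of how Equations 5.6 to 5.10 can be evaluated numerically, the following Python sketch computes the even and odd mode impedances, the propagation constants and the loss per 10 mm from per-unit-length RLGC values. The function names and the RLGC numbers are placeholders chosen for illustration only; they are not the field-solver values extracted in this work.

```python
import cmath
import math


def coupled_line_modes(freq_hz, R, L_self, L_mut, G, C_self, C_mut):
    """Return (Z_even, Z_odd, gamma_even, gamma_odd) per Equations 5.6 to 5.9.

    All RLGC inputs are per-unit-length values (ohm/m, H/m, S/m, F/m).
    """
    w = 2 * math.pi * freq_hz
    z_even = R + 1j * w * (L_self + L_mut)   # series impedance, even mode
    z_odd = R + 1j * w * (L_self - L_mut)    # series impedance, odd mode
    y_even = G + 1j * w * C_self             # shunt admittance, even mode
    y_odd = G + 1j * w * (C_self + 2 * C_mut)  # shunt admittance, odd mode
    return (cmath.sqrt(z_even / y_even), cmath.sqrt(z_odd / y_odd),
            cmath.sqrt(z_even * y_even), cmath.sqrt(z_odd * y_odd))


def loss_db_per_10mm(gamma):
    """Equation 5.10: convert alpha (Np/m) into dB per 10 mm (negative = loss)."""
    alpha = gamma.real
    return -20 * math.log10(math.e) * (alpha / 100)


# Placeholder RLGC values (assumption), NOT the extracted interposer data.
params = dict(R=4000.0, L_self=5e-7, L_mut=1e-7, G=1e-3,
              C_self=1e-10, C_mut=3e-11)
z_e, z_o, g_e, g_o = coupled_line_modes(2e9, **params)
print(f"|Z_even| = {abs(z_e):.1f} ohm, |Z_odd| = {abs(z_o):.1f} ohm")
print(f"even loss = {loss_db_per_10mm(g_e):.2f} dB/10mm, "
      f"odd loss = {loss_db_per_10mm(g_o):.2f} dB/10mm")
```

Sweeping `freq_hz` over the band of interest with the extracted RLGC matrices in place of the placeholders would give curves of the kind shown in Figures 5.7 and 5.8.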

The impedances plotted in Figures 5.7a and 5.7b show that the resistive portion of the impedance in Equations 5.6 and 5.7 plays the dominant role at lower frequencies, up to 500 MHz, for widths larger than 5 µm. For the 2 µm width, which has a much higher resistance, the impedance keeps decreasing until 1500 MHz. Also, the greater the width, the lower the impedance: for example, the 15 µm line has an odd mode impedance of about 38 $\Omega$ at 2 GHz, while the 5 µm line has an odd mode impedance of about 45 $\Omega$ at 2 GHz. This is expected because the spacing is increased together with the width, as the width to spacing relationship is kept the same. With the spacing kept at 2× the width for all lines, the larger width and spacing still push the impedance values down.

Another notable point is that the odd mode impedance is smaller than the even mode impedance for all lines. Having distinct odd and even mode impedances makes such coupled lines difficult to terminate, because both propagation modes have to be terminated separately. In Figures 5.8a and 5.8b, the attenuation or loss per 10 mm is plotted for the odd and even propagation modes. Clearly, the odd mode loss is higher than the even mode loss at the same frequencies. Odd mode loss reaches $-3\,dB$ at 2 GHz, while even mode loss is only about $-1.5\,dB$. It is important to notice that the losses decrease dramatically when the width is increased only slightly, from 2 µm to 5 µm. This can be seen in the odd mode plot, where the 5 µm line has only $-1\,dB$ loss while the 2 µm line has $-3\,dB$ loss.

This gives a useful design guideline: if the losses need to be decreased, the width does not need to be increased by a large amount. Even small amounts



(a) Odd mode attenuation vs frequency (b) Even mode attenuation vs frequency

Figure 5.8: Attenuation vs frequency (dB/10mm)

of width increase give a substantial reduction in the loss. It can also be noted that increasing the width from $10\,\mu\text{m}$ to $15\,\mu\text{m}$ does not give much advantage in terms of losses. So, for a given loss and impedance target, the designer can use the plots shown and adjust the widths and spacing of the lines to optimize the design performance. The designer can also use the metal resistance plots to decide whether the design should operate in the lossy regime or in the purely LC, low loss regime, where the metal resistance does not play a big role in defining the impedance of the structure. Finally, if it is known that the signals on the two lines will be purely odd mode or purely even mode, then the terminations, if required, will be straightforward.

# 5.2 Eye-diagram Mask based DDR3 Signal Analysis

After characterizing the structural effects of the interconnect on the channel properties, the next step is an eye diagram analysis using the channel properties and the actual encrypted transistor-level DDR3 driver and receiver models from industry [96]. This gives insight into 2.5D interface memory channel design. The DDR3 memory is defined in detail by the JEDEC standard [88]. The RLGC models derived in the previous section are used in the simulation schematic shown in Figure 5.10, where two DDR3 data (DQ) transmitters in output mode send data through the RLGC HSPICE W-element to two DDR3 data (DQ) receivers. The receiver cell operates either in termination mode with $60 \Omega$ ODT (on die termination) or in unterminated mode with no ODT. Since DDR3 signalling is based on the SSTL-HCM topology detailed in chapter 3, the receiver is terminated to the supply $V_{TT}$, which means that power is dissipated during the transmission of a zero. The output impedance of the DQ transmitters is chosen to be $34 \Omega$. The driver and receiver topology of DDR3 is shown in Figure 5.9, where the $V_{DD}$, $V_{ref}$ and resistance values are shown. The possible ODT resistance values are 240, 120, 80, 60, 48, 40 and $34 \Omega$, out of which $60 \Omega$ is chosen for the analysis in this work. The frequency of operation is chosen to be 800 MHz, which means that the data rate on the DQ lines is 1.6 Gb/s. Simulating the two signal lines at the same time gives a realistic result, close to the real environment in which neighbouring signal lines can affect each other.

As odd and even propagation modes greatly affect the impedance and propagation constant of the interposer lines, the data patterns on the two lines are kept as independent as possible. For this purpose, both lines carry pseudo-random bit streams (PRBS). This gives a realistic averaged performance analysis of DDR3 on these interconnects.

Power consumption values are also calculated for the simulated schematic. This is important because the power values differ between the cases with and without $60 \Omega$ ODT. The length of the interposer interconnect for these simulations is 10 mm.

This length of 10 mm is the routing length constraint on the interposer for this analysis and is a reasonable choice based on the bunch of wires interface standard [56]. The eye diagram for the signal received at the DQ receiver is shown in Figure 5.11, where the eye height and width are indicated with the help of a diamond shaped mask. The eye height is 0.7 V while the eye width is 590 ps. These values are sufficient for the data receiver to sample the signal correctly, based on the requirements specified in the standard [88].

Similarly, the results for the $5\,\mu m$ wide interposer line are plotted in Figure 5.12, which shows an eye height of 831 mV and an eye width of 590 ps. The eye height is considerably larger than that at the receiver input for the $2\,\mu m$ wide interposer line. The reason is that the $2\,\mu m$ line has higher loss due to its higher resistance, which reduces the eye height compared to the $5\,\mu m$ line.

For the case in which on die termination (ODT) is not used, the result for the $2\,\mu m$ wide line is plotted in Figure 5.13. Without ODT, the eye height increases by about 400 mV and the eye width remains the same. A surprising but welcome result from this eye diagram is that there is no overshoot or undershoot even without any on die termination. The JEDEC electrical specifications for DDR3 signalling [88] allow $\pm 0.4\,V$ of over/undershoot.

According to the simulation result for the $2\,\mu m$ interposer line, there is no need for on die termination, as there is no undershoot or overshoot even without termination. There are two possible explanations for this phenomenon: one is the behavior of the channel as



Figure 5.9: DDR3 driver and receiver topology and values



Figure 5.10: HSPICE DDR3 DQ lines simulation over the Interconnect Models



Figure 5.11: [60  $\Omega$  ODT] DQ receiver input at 1.6 Gb/s for W=2  $\mu m,$  S=5  $\mu m,$  L=10 mm interposer line

a voltage line due to the very short 10 mm length, and the second is the damping of possible reflections by the losses in the channel. For the $5\,\mu m$ wide interposer line under the no ODT condition, the result is shown in Figure 5.14. The eye height and width are increased, and the reflections, although present, are still very small. This is a strong incentive for memory system designers to move towards 2.5D memory interfaces.

# 5.2.1 Minimum Channel Area and DDR3 Energy Values

For n = 64 data lines in the memory channel, the eye diagram analysis shows that the minimum width in the given range is usable for 10 mm 2.5D memory interfaces. The resulting minimum area, based on the minimum width and spacing from the eye diagram analysis, is given by

$$A_{min} = \left[64 \cdot 2 \times 10^{-3} + (64 - 1) \cdot 5 \times 10^{-3}\right] \cdot 10 = 4.43 \,\mathrm{mm}^2$$

The simulated power consumption of the transistor-level HSPICE DDR3 model without termination (ODT) is only 9.64 mW per DQ line, compared to 32 mW per DQ line with termination.

From the eye diagram analysis, it has been confirmed that the unterminated case can be used for a typical 10 mm 2.5D interface with small widths of $2\,\mu m$, and that the eye passes all the receiver eye mask requirements. Therefore, the minimum energy consumption per DQ line can be written as

$$\phi_{imin} = \frac{9.64 \,\mathrm{mW}}{1.6 \,\mathrm{Gb/s}} = 6.025 \,\mathrm{pJ/bit} \tag{5.11}$$

Hence, the total minimum energy consumption per bit for 64 DQ lines in ODT OFF case is

$$\phi_{min} = 64 \cdot \phi_{imin} = 385.6 \,\mathrm{pJ/bit}$$
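The arithmetic behind the minimum area and the energy values above can be reproduced with a short Python sketch. The helper names are hypothetical; the inputs (64 lines, W = 2 µm, S = 5 µm, L = 10 mm, 9.64 mW and 32 mW at 1.6 Gb/s) are the values quoted in this subsection.

```python
def channel_area_mm2(n_lines, width_um, spacing_um, length_mm):
    """Routing area of n parallel lines: n widths plus (n - 1) gaps, times length."""
    width_mm = width_um * 1e-3
    spacing_mm = spacing_um * 1e-3
    return (n_lines * width_mm + (n_lines - 1) * spacing_mm) * length_mm


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW / (Gb/s) = 1e-3 W / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / data_rate_gbps


area = channel_area_mm2(n_lines=64, width_um=2, spacing_um=5, length_mm=10)
e_no_odt = energy_per_bit_pj(9.64, 1.6)   # unterminated, per DQ line
e_odt = energy_per_bit_pj(32.0, 1.6)      # with 60 ohm ODT, per DQ line
e_total_no_odt = 64 * e_no_odt            # summed over the 64 DQ lines

print(f"A_min = {area:.2f} mm^2")
print(f"per-line energy: {e_no_odt:.3f} pJ/bit (no ODT), {e_odt:.1f} pJ/bit (ODT)")
print(f"64-line total (no ODT): {e_total_no_odt:.1f} pJ/bit")
```

The same helpers also give the terminated case for comparison, which makes the energy saving of the unterminated setting explicit.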

#### 5.2.2 PCB channel and DDR3 Eye-diagram Analysis

In order to compare the DDR3 2.5D interface with a typical PCB, the PCB interconnect is characterized for widely used widths and spacings and for lengths of 10 and 30 mm, which are not very large for PCB systems.

Lengths beyond 10 mm, in the range of 30 mm, are very common in point-to-point DDR3 PCB designs. Hence, their characterization and comparison with



Figure 5.12: [60  $\Omega$  ODT] DQ receiver input at 1.6 Gb/s for W=5  $\mu m,$  S=10  $\mu m,$  L=10 mm interposer line



Figure 5.13: [No ODT] DQ receiver input at 1.6 Gb/s for W=2  $\mu m,$  S=5  $\mu m,$  L=10 mm interposer line

similar length 2.5D interposer DDR3 interfaces shall give good insight for low energy consumption DDR3 designs. This shall give an understanding on how the 2.5D



Figure 5.14: [No ODT] DQ receiver input at 1.6 Gb/s for W=5  $\mu m,$  S=10  $\mu m,$  L=10 mm interposer line



Figure 5.15: Resulting minimum area and energy consumption settings for 2.5D DDR3 10 mm long silicon interposer interface

DDR3 interfaces could compare to the popular PCB DDR3 solutions. Just as for the interposer structure characterization, the self and mutual inductances, self and mutual capacitances, conductor resistance, and dielectric conductance are plotted in Figure 5.16 from the RLGC $2 \times 2$ matrices extracted by simulating the structure in a 2D field solver.

One significant difference with respect to the silicon interposer is the resistance per unit length. The interposer line has a resistance of $2500\,\Omega/m$, while a PCB line of standard width has a resistance of only $9\,\Omega/m$. This is due to the difference in widths, and it plays an important role in determining the propagation losses and in shaping the eye diagram, especially the voltage swing.

The plots in Figures 5.17a and 5.17b show that both resistance and dielectric conductance increase with frequency, just as in the silicon interposer. Looking at the odd mode and even mode impedances in Figures 5.17c and 5.17d, the odd mode impedances are lower than the even mode impedances and they stay constant over the frequency range shown. In contrast, as can be seen in Figures 5.7a and 5.7b, the interposer impedances decrease with increasing frequency as the lines shift from lossy behavior towards low loss behavior. On the PCB, the loss is so small that the lines behave as true LC lines and the impedance is constant over the frequency range.

The interposer eye diagram analysis for DDR3 signals has made clear that turning off the termination is essential for saving power. To save power in PCB based DDR3 designs with 10 mm and longer 30 mm channels, the eye diagrams are therefore evaluated only for the No-ODT case. The eye for the 10 mm PCB line without ODT is shown in Figure 5.18, where some reflections are seen, but overall the eye quality is very good. The JEDEC electrical specifications for DDR3 signalling allow $\pm 0.4$ V of over/undershoot. The eye for the 10 mm PCB line in the low power No-ODT case is well within the undershoot and overshoot limits of DDR3.

For the 30 mm PCB line without termination (ODT), the eye diagram is shown in Figure 5.19. The eye crosses the red undershoot and overshoot limits of -0.4 V and 1.9 V. Although the eye width and height are very good, the overshoots and undershoots make this PCB case unusable. Hence, for low power DDR3 operation without termination on PCB, only around 20 mm of length is allowed. For comparison with the unterminated 30 mm silicon interposer case, the eye diagram for the 2.5D channel is shown in Figure 5.20, where the eye width and height are well within the DDR3 specifications and there are no over/undershoots, due to the resistive nature of the silicon interposer interconnect. Hence, 2.5D DDR3 interfaces are



Figure 5.16: RLGC parameters of PCB interconnect

usable even for channels longer than 30 mm with the low power no ODT setting, while PCB channels are not usable without ODT beyond about 20 mm.
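The pass/fail criterion used above for overshoot and undershoot can be expressed as a simple check on sampled receiver voltages. The sketch below is only an illustration: the function name and the sample lists are invented, while the supply of 1.5 V and the $\pm 0.4$ V allowance correspond to the -0.4 V and 1.9 V limits quoted in the text.

```python
def within_overshoot_limits(samples_v, vdd=1.5, margin=0.4):
    """True if every sample lies between -margin and vdd + margin (in volts)."""
    lower, upper = -margin, vdd + margin
    return all(lower <= v <= upper for v in samples_v)


# Invented waveform samples (assumption), not simulation output from this work.
damped_trace = [0.0, 0.4, 1.2, 1.6, 1.5, 1.45, 0.2, -0.1, 0.0]
ringing_trace = [0.0, 1.1, 2.05, 1.3, 1.6, 0.3, -0.55, 0.1, 0.0]

print(within_overshoot_limits(damped_trace))
print(within_overshoot_limits(ringing_trace))
```

In practice the samples would come from the simulated DQ receiver waveform over many PRBS bits rather than from a hand-written list.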



Figure 5.17: Frequency dependent conductances and impedances

#### 5.2.3 PCB vs 2.5D DDR3 Interface Comparison

From the above eye diagram analysis of DDR3 2.5D and PCB interfaces of different lengths, it is concluded that 2.5D interfaces are usable for DDR3 without termination (ODT) up to 30 mm, while PCB interfaces support this low power DDR3 setting only up to 20 mm. Hence, for general long PCB DDR3 interfaces, low power operation is not possible, while 2.5D DDR3 interfaces support the low power option for short lengths and for longer lengths up to 30 mm. Table 5.1 shows the impact of changing the ODT setting on the eye passing or failing the specifications, along with the power consumption for PCB and 2.5D silicon interposer DDR3 interfaces.

It is concluded from the eye diagram analysis that at 1.6 Gb/s the unterminated scheme saves a large amount of power. Therefore, for moderate memory data rates in 2.5D systems, short channels with small widths and high attenuation should be used together with the unterminated signalling scheme. This saves energy and also allows the system designer to use memories with less complex physical interface circuitry.



Figure 5.18: [No ODT] DQ receiver input at 1.6 Gb/s for W=100  $\mu m,$  S=200  $\mu m,$  L=10 mm PCB line



Figure 5.19: [No ODT] DQ receiver input at 1.6 Gb/s for W=100  $\mu m,$  S=200  $\mu m,$  L=30 mm PCB line



Figure 5.20: [No ODT] DQ receiver input at 1.6 Gb/s for W=2  $\mu m,$  S=5  $\mu m,$  L=30 mm interposer line

| Result                        | Interposer ($2\,\mu m$ width) | PCB ($100\,\mu m$ width) |
|-------------------------------|-------------------------------|--------------------------|
| Max length                    | 30 mm                         | 20 mm                    |
| Power with ODT per DQ (mW)    | 32                            | 33                       |
| Power without ODT per DQ (mW) | 9.68                          | 10.51                    |

Table 5.1: Comparison result DDR3 2.5D and PCB interface

## 5.2.4 DDR4 Memory Standard Discussion

The more recent DDR4 standard [97] supports a higher data rate of 3200 Mbps per DQ line, compared to 1600 Mbps in DDR3. The main changes introduced by the DDR4 standard are:

- The supply voltage of the data line receivers and transmit drivers is reduced from $1.5\,\mathrm{V}$ to $1.2\,\mathrm{V}$.
- The driver topology is changed from source series terminated logic (SSTL) to pseudo open drain (POD).
- The reference voltage $(V_{ref})$ of the receiver in the DDR4 memory channel is not set from an external pin, but is calibrated by a calibration routine at system start.
- The receiver is not terminated to half of the supply voltage $(V_{tt})$ as in DDR3, but to the supply voltage VDDQ of 1.2 V.
- Data bus inversion (DBI) is implemented to save power when a large number of zero bits are transmitted.

The pseudo open drain (POD) topology of DDR4 is shown in Figure 5.21, where the ODT termination on the receiver is connected to VDDQ. During pull-up, i.e. high symbol transmission, there is no DC current flow and hence no DC power consumption. DC power is consumed only during zero or low symbol transmission. The $V_{oh}$ in this topology is always 1.2 V, while the received low symbol voltage $V_{ol}$ depends on the $R_{on}$ and ODT resistances. By changing these resistances in the POD DDR4 topology, the voltage swing and the reference or middle voltage $V_{ref}$ are controlled.



Figure 5.21: DDR4 pseudo open drain (POD) signalling topology

The minimum power consumption in DDR4 is achieved by choosing the highest possible $R_{on}$ and ODT. The possible $R_{on}$ values are 34, 40 and 48 $\Omega$, while the ODT values range from 34 to 240 $\Omega$. As in the DDR3 channel design discussed in the previous subsections, the trace impedance on PCB or interposer is important for longer channel lengths, while for extremely short interconnects on the interposer, no ODT may be needed. Extensive simulations similar to those shown for DDR3 should be performed to select the optimum ODT and $R_{on}$ for a given PCB or interposer channel. The data bus inversion technique and the $V_{ref}$ calibration in DDR4 help improve the energy performance and the receiver eye margin, respectively. Therefore, a typical DDR4 channel design should follow the same steps as in DDR3 to achieve the minimum power consumption.
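The dependence of $V_{ol}$, $V_{ref}$ and DC power on $R_{on}$ and ODT can be sketched with a static resistive divider model. This is a simplified first-order illustration under stated assumptions (ideal resistors, fully settled levels, no channel effects), not the transistor-level behaviour; the function name and the chosen resistance pairs are illustrative picks within the ranges quoted above.

```python
def pod_levels(vddq, r_on, r_odt):
    """Static levels of a POD link: pull-down R_on against an ODT tied to VDDQ."""
    v_oh = vddq                           # no DC path when the driver pulls high
    v_ol = vddq * r_on / (r_on + r_odt)   # resistive divider when driving low
    v_ref = (v_oh + v_ol) / 2             # mid-point of the received swing
    p_low_mw = 1e3 * vddq ** 2 / (r_on + r_odt)  # DC power while a zero is sent
    return v_oh, v_ol, v_ref, p_low_mw


# Illustrative (R_on, ODT) pairs in ohms; assumption, not a recommended setting.
for r_on, r_odt in [(34, 34), (48, 240)]:
    v_oh, v_ol, v_ref, p_low = pod_levels(1.2, r_on, r_odt)
    print(f"R_on={r_on}, ODT={r_odt}: V_ol={v_ol:.2f} V, "
          f"V_ref={v_ref:.2f} V, P_low={p_low:.2f} mW")
```

Under this model a larger $R_{on}$ plus ODT sum lowers the DC power during a zero, which matches the statement that the highest usable resistances minimize power; whether the resulting eye still meets the receiver requirements has to be confirmed by channel simulation.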

## 5.3 Channel Design for High Speed Interfaces

Memory interfaces are considered moderate bandwidth communication interfaces. The term "high speed interface" is generally used for multi-Gb/s data rates equal to or exceeding 10 Gb/s. A 13 Gb/s high speed transmitter design was presented in chapter 3 of this thesis, which dealt with the transmission circuit design for a given channel. This section deals with the trade-offs between industry SERDES transmitter calibration and channel design associated with high speed interfaces



Figure 5.22: Problem description of impact comparison of width and transmitter variation on output

[98][99]. For this purpose, an encrypted industry serial circuit IP (SERDES IP) is analyzed over different 2.5D interposer channels [100]. The analyzed interconnect channel is a differential pair consisting of two metal interconnects because most industry serial interfaces use differential current mode signalling [101].

A signal x(t) traveling through a channel with linear time invariant response h(t) shall result in an output y(t) at the other end of the channel. Output is a convolution of input sent by the transmitter circuit at one end of the channel x(t) with the impulse response of the channel h(t).

$$y(t) = x(t) * h(t)$$

In the frequency domain, the convolution becomes a multiplication of the transmitter output spectrum X(s) and the channel response H(s):

$$Y(s) = X(s) \cdot H(s)$$

The above equation implies that, to obtain a given frequency characteristic of the received signal Y(s) at the receiver end of the channel, either the channel response H(s) or the transmitter output X(s) can be changed. A change in the transmitter output has a cost in energy consumption, while a change in the channel response by increasing the interconnect width leads to higher channel area cost. This section analyzes the two methods and forms the basis for the next chapter on the channel and transmitter co-design methodology. The target problem is described in Figures 5.22 and 5.23.
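The equivalence between time-domain convolution and frequency-domain multiplication can be checked numerically on sampled signals. The sketch below uses a discrete Fourier transform as a stand-in for the continuous transform in the equations above; the sample values of `x` and `h` are invented for illustration and do not represent the SERDES IP or the interposer channel.

```python
import cmath


def convolve(x, h):
    """Direct linear convolution y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def dft(seq, n):
    """Naive n-point DFT of seq, zero-padded to length n."""
    padded = list(seq) + [0.0] * (n - len(seq))
    return [sum(padded[k] * cmath.exp(-2j * cmath.pi * m * k / n)
                for k in range(n))
            for m in range(n)]


x = [1.0, 1.0, 0.0, 1.0]   # made-up transmitted samples (assumption)
h = [0.6, 0.3, 0.1]        # made-up channel impulse response (assumption)
n = len(x) + len(h) - 1    # zero-pad so the product is a linear convolution

y = convolve(x, h)
Y = dft(y, n)
XH = [a * b for a, b in zip(dft(x, n), dft(h, n))]
max_err = max(abs(a - b) for a, b in zip(Y, XH))
print(f"max |Y - X*H| = {max_err:.2e}")
```

Because the two sides agree up to rounding error, shaping either `x` (transmitter emphasis) or `h` (channel width) changes the received spectrum in the same multiplicative way, which is the premise of the comparison in this section.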

The channel loss (attenuation) versus frequency for various widths and a 20 mm line length is shown in Figure 5.24. The $2\,\mu m$ wide interconnect has an attenuation of 10 dB per 20 mm at 5 GHz, while the $10\,\mu m$ wide interconnect has only half of that attenuation, i.e. 5 dB per 20 mm. The cost comparison of width variation on area and transmitter variation on energy consumption is done in the following flow:

- First, the impact of line length variation on the eye diagram is described for a given width and transmitter setting.
- Second, for a certain length with a closed eye, the width is varied (variation in H(s)) to analyze the signalling pitch or channel area cost of opening the received eye of signal y(t).
- Third, for the same length with a closed eye, the transmitter equalization or emphasis is varied (variation in X(s)) to analyze the energy cost of opening the received eye of signal y(t).

The eye diagrams for the $2\,\mu m$ wide line with lengths of 20, 30 and 40 mm are shown in Figures 5.25, 5.26 and 5.27 respectively. The eye progressively closes as the length increases. It should be noted that the basic transmitter settings without any equalization or emphasis are used, i.e. the minimum energy cost point of the transmitter. The 30 mm line length results in an extremely small eye height and width, which should be improved either by choosing a wider, lower loss line or by using transmitter equalization.

Width Increment Impact: The line width is increased from $2\,\mu m$ to $10\,\mu m$, and the output after a 30 mm interconnect length is shown in Figure 5.28. The eye width changes from 66 ps to 84 ps, and the eye height increases from 126 mV to 357 mV. The cost paid for this eye opening is the increase in signalling pitch or channel area. There is no extra energy consumption cost for this method. It should be noted that the spacing between the differential interconnects is kept constant at $4\,\mu m$ for all interconnect variants in this analysis.

Tx Emphasis Impact: The second method is to increase the high frequency content of the transmitted signal by using pre-emphasis or equalization in the transmitter driver. Since the channel has higher loss at higher frequencies, this technique adds extra energy to the high frequency content of the signal, which compensates for the extra channel loss at those frequencies. The received signal after applying suitable equalization tap settings at the transmitter for the $2\,\mu m$ wide line of length 30 mm is shown in Figure 5.29, where the eye height increases from 126 mV to 200 mV and the eye width increases from 66 ps to 88 ps.



Figure 5.23: Cost impact of interconnect width and transmitter variation



Figure 5.24: Losses versus width variation in silicon substrate interconnect



Figure 5.25: 10 Gb/s received signal after  $2\,\mu m$  wide, 20 mm long interconnect on silicon interposer

#### 5.3.1 Channel versus Transmitter Variation Comparison

The above analysis shows the impact of width and transmitter equalization on the received signal, and both are possible options to increase the available eye height and width. Increasing the width leads to a better channel response H(s) with lower losses in the high frequency regime, but adds signalling pitch or channel area cost



Figure 5.26: 10 Gb/s received signal after  $2\,\mu{\rm m}$  wide, 30 mm long interconnect on silicon interposer



Figure 5.27: 10 Gb/s received signal after  $2\,\mu{\rm m}$  wide, 40 mm long interconnect on silicon interposer

to the design. Transmitter equalization, in contrast, adds no area cost but adds energy cost to the design. Both methods are compared in Table 5.2 for a 30 mm long silicon substrate differential interconnect, relative to the default case of $2\,\mu m$ width and no equalization, in terms of area and energy costs.

Basis for Co-design discussion in next chapter: The above analysis makes



Figure 5.28: 10 Gb/s received signal after  $10\,\mu m$  wide,  $30\,mm$  long interconnect on silicon interposer



Figure 5.29: 10 Gb/s received signal with transmitter equalization or emphasis after  $2 \,\mu m$  wide, 30 mm long interconnect on silicon interposer

the base for the discussion of channel and transmitter co-design for minimum energy-area or minimum energy-pitch cost. Since increasing the pitch and increasing the transmitter or receiver equalization, at the cost of higher power consumption, are both possible options for designing a chip-to-chip communication interface with good signal integrity at the receiver end, the co-design connects the two methods for a jointly optimized energy-area or energy-pitch product. The above analysis

| Conditions                | Eye Height Factor | Eye Width Factor | Analog Power Factor | Area Factor |
|---------------------------|-------------------|------------------|---------------------|-------------|
| Width $5\times$ increment | $2.83\times$      | $1.27\times$     | $1\times$           | $5\times$   |
| $T_x$ equalization        | $1.59\times$      | $1.33\times$     | $1.46\times$        | $1\times$   |

Table 5.2: Channel versus Transmitter variation energy and area cost comparison

has formed the basis for the co-design optimization discussion in the next chapter.
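The eye height and eye width factors in Table 5.2 follow directly from the eye measurements quoted for the 30 mm line. The short Python sketch below recomputes them; the dictionary layout is hypothetical, and the final power-times-area product is only one possible way to combine the two cost factors, shown here as an assumption in the spirit of the energy-area product discussed above.

```python
# Eye measurements for the 30 mm line, taken from the text of this section.
baseline = {"eye_height_mv": 126, "eye_width_ps": 66}   # 2 um, no equalization
wider = {"eye_height_mv": 357, "eye_width_ps": 84}      # 10 um, no equalization
equalized = {"eye_height_mv": 200, "eye_width_ps": 88}  # 2 um, Tx equalization


def improvement_factors(option, base):
    """Ratio of each eye metric to the baseline value."""
    return {key: option[key] / base[key] for key in base}


width_factors = improvement_factors(wider, baseline)
eq_factors = improvement_factors(equalized, baseline)

# Cost factors from Table 5.2: (analog power factor, area factor).
costs = {"width": (1.0, 5.0), "equalization": (1.46, 1.0)}
cost_products = {name: p * a for name, (p, a) in costs.items()}

print({k: round(v, 2) for k, v in width_factors.items()})
print({k: round(v, 2) for k, v in eq_factors.items()})
print(cost_products)
```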

# 5.4 Conclusion

This chapter added knowledge missing from the prior art in two areas: energy and area efficient DDR3 interfaces on 2.5D silicon substrate channels, and signal integrity improvement of industrial high speed SERDES circuits on the 2.5D channel through channel or transceiver variations, along with the related area and energy cost analysis.

Previously published work focused on high bandwidth memory (HBM) and wide-I/O memory for the 2.5D silicon interposer channel, but the widely used DDR memories traditionally found in PCB memory interfaces could also be used on a silicon interposer channel. DDR3, taken as an example in this work, is normally used with termination on the receiver side, but could be used without termination (ODT) in 2.5D DDR3 interfaces, because their short lengths and higher resistance reduce the impact of reflections caused by impedance mismatch. In contrast, unterminated DDR3 signalling on PCB is limited to short interconnects of up to 20 mm, because overshoots and undershoots cause specification failures. Eye diagram analysis with industry HSPICE DDR3 models and extracted channel models was used to support the above claim. The added knowledge for 2.5D DDR3 interfaces shall help system designers to consider the widely understood and used DDR memories for highly miniaturized 2.5D memory-cpu systems.

As the data rates increase in high speed SERDES interfaces, the width and spacing are especially important for matching the impedance of the channel to the termination. The width is also critical for keeping the attenuation of the interconnect small. Since the received signal Y(s) is a product of the input signal X(s) and the channel frequency response H(s), either the channel or the input signal can be varied to obtain a good quality signal at the receiver end. Prior art for SERDES on 2.5D channels did not compare the energy and area costs together. The published work either analyzed the transmitter variations for a given channel or only analyzed the channel variations for a given input signal. This work looks at the two methods together and compares the area and energy costs. This discussion forms the basis for the co-design of channel and transceiver for minimal energy-area cost of a chip-to-chip multi-chip system communication interface, presented in the next chapter.

# Chapter 6 Design Methodologies

Two or more dies communicate with each other in a single package or on an interposer within a package through a communication interface termed as *multi-chip interface*. There are several lines or interconnects which serve the communication purpose between the dies. An example of such an interface is shown in Figure 6.1, where two dies are placed side by side in a multi-chip system (MCS) and they are communicating with each other through N interconnects, drivers and receiver cells.

The transceiver blocks are denoted as physical interface (PHY), a term typically used for such interface front ends. The spacing between the interconnects is S, while the width of each interconnect is W. The data rate per line is f Gb/s. Hence, the total aggregate data rate of the interface is $F = f \cdot N$. The length of the interconnects is L and the pitch of the interconnects is denoted as $\rho$. For a given aggregate bandwidth F, an optimal interface needs to be designed which reduces the energy and area costs of the interface. Since no transmitter topology is specified, it is also a variable which needs to be determined by the methodology, along with the interface communication parameters, i.e. N, W, and S.

One of the most common multi-chip interfaces is the memory-cpu interface. Several types of memories and integration options are available, e.g. silicon substrate, PCB substrate or package substrate. From the literature review and prior art analysis, it was found that no path finding methodology or design exploration was available for multi-chip memory-cpu interface design that can select the optimal memory type and integration method based on a cost analysis including the energy and area costs. The generic multi-chip interface design problem and the memory-cpu path finding or design exploration problem are dealt with in



Figure 6.1: Two-die multi-chip communication interface

this chapter. The problem statements for them are stated below.

#### **Problem Statement**

1. Present a holistic co-design methodology for energy-area $\psi$ minimized channel and transceiver for MCS channel. Demonstrate the methodology with examples and specifically for widely used high speed SERDES driver such as CML for 2.5D silicon substrate based interconnect.
2. Determine a memory-cpu MCS path finding / design exploration methodology which could list out the optimum choice of memory, and integration platform for given bandwidth, memory size and maximum energy-area constraints.

**Prior Art** The main work in the area of transmitter power minimization for a given channel was performed by Hatamkhani et al. [76][74][78], Balamurugan et al. [75] and Palaniappan et al. [77]. The closest work on the design problem of energy per bit minimization was performed by Hatamkhani, where the optimum energy per bit and data rate per interconnect were found for a given aggregate bandwidth F based on given channel characteristics. The minimization objective was power consumption only; specifically, the total energy per bit for aggregate bandwidth F was minimized. That work did not take into account that the interconnect pitch is also an important factor, especially for area constrained multi-chip interfaces. Minimizing energy alone could lead to a large number of interconnects, which could be prohibitive for a small MCS interface. Therefore, an extension of this work is required which is more holistic and treats the channel as a design variable rather than a constant, as it was in Hatamkhani's work.

The only prior published work on how to choose an integration technology, such as interposer, package or PCB was done by Yazdani et al. [80], but it focused only on the placements of ball, I/O drivers for a specific memory only, i.e. DDR3. There is a need for a holistic memory-cpu interface design methodology based on energy, routing area, and costs for given requirements of total bandwidth, memory size etc.

Section 6.1 describes the holistic multi-chip interface design methodology for a given aggregate interface bandwidth F and shows two examples of energy-area minimization using this co-design approach of transmitter and channel. Section 6.2 describes the design exploration (path finding) approach for memory-cpu multi-chip interfaces, i.e. an algorithm that chooses the integration technology and memory type which minimize the design objective.

# 6.1 Holistic Multi-Chip Interface Design

This work extends the work done by Hatamkhani [76][74][78] with a wider design objective, i.e. minimization of the combined energy-area metric  $\psi$  rather than energy only. It also finds the optimal width and spacing of the interconnect along with the optimal transmitter topology for a given required aggregate bandwidth F in Gb/s. The prior art design flow is depicted in Figure 6.2b. It uses the given constraints of aggregate bandwidth and channel or interconnect parameters to find the minimum energy per bit driver topology and the number of interconnects. It does not consider the width of the interconnect and the spacing between the differential



Figure 6.2: Prior art and this work design flow comparison

pair in GSSG format or single ended interconnects in GSGSG format. Hence, the area of the channel or the routing resources used on the interconnect substrate are not taken into account.

The design flow proposed in this work is shown in Figure 6.2a, where the aggregate bandwidth (Gb/s), maximum available routing area, and transmitter topology are used as constraints to find the minimum energy-area cost related interconnect width, spacing, and transceiver settings.

### 6.1.1 Design Flow Description

The flow shown in Figure 6.2a consists of a detailed characterization of the interconnect, which is then used to determine the transmitter and receiver output/input impedances, the equalization or emphasis settings, the drive strength of the drivers etc. The derivation is based upon the combined cost metric of energy usage per bit multiplied by the signalling pitch  $\rho$ , with unit  $pJ/bit \cdot \mu m$ . Pitch is used in the cost calculation instead of the routing area because the length of the interconnect is constant and hence has no impact on the cost comparison, as also depicted in the introductory Figure 6.1.

The idea behind this work is that an increase in the interconnect width W lowers the insertion loss of the interconnect and hence could lower the energy per bit of the transceiver. Conversely, a decrease in W raises the insertion loss and could therefore raise the energy per bit of the interface.

Assume that several transceiver topologies are available for a multi-chip communication interface design, and call the choice set T

$$T = \{SSTL, HSUL, LVDS, CML\}$$

The energy per bit efficiency for a given transceiver topology  $T_i \in T$  is defined by the interface power consumption  $\phi$  for both transmitter  $T_x$  and receiver  $R_x$  and can be written as  $\phi_{T_i} = \phi_{Tx} + \phi_{Rx}$  where

$$\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$$
  
$$\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$$

where  $\phi_{Drv}$  represents driver power,  $\phi_{Eq}$  is for equalization,  $\phi_{Ser}$  and  $\phi_{DeSer}$  are for serialization and de-serialization blocks, and  $\phi_{Ckbuf}$  denotes the clock buffering and distribution block. The power consumed by the extra blocks other than the front end driver, receiver and equalization blocks is determined by the data rate and the topology choice. The energy area metric is written as  $\phi/f_b * \rho$  in  $pJ/bit \cdot \mu m$  where  $f_b$  is the data bit rate in Gb/s.

For any integration technology and the substrate available, the width W is restricted by the  $W_{min}$  and the spacing restricted by the  $S_{min}$  which leads to the minimum signalling pitch  $\rho_{min}$  in the given interface integration platform. For the case of single ended signalling with routing in the format GSGSG, where W is the signal width, and  $W_{GND}$  is the ground line width, then pitch  $\rho$  is written as

$$\rho = W + W_{GND} + 2S$$

For the case, when the ground line width is set to minimum possible in the integration technology, then minimum signalling pitch  $\rho_{min}$  can be written as

$$\rho_{min} = W + W_{min} + 2S$$

Therefore, for this single ended signalling case with minimum ground width and transceiver topology  $T_i \in T$ , the energy-pitch efficiency metric  $\psi$  can be written as

$$\psi\left(T_{i},\rho\right) = \frac{\phi}{f_{b}}\left(W + W_{min} + 2S\right)$$
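As a minimal sketch of the metric above, the pitch and the energy-pitch cost can be evaluated as follows. The function names and the numeric inputs (a 5 mW interface at 10 Gb/s on 1 µm lines) are assumptions chosen for illustration only and are not results of this work.

```python
def signalling_pitch_um(w_um, w_gnd_um, s_um):
    """Pitch of one signal in a GSGSG arrangement: rho = W + W_GND + 2S."""
    return w_um + w_gnd_um + 2.0 * s_um


def energy_pitch(power_mw, bit_rate_gbps, pitch_um):
    """psi = (phi / f_b) * rho in pJ/bit*um, since mW / (Gb/s) equals pJ/bit."""
    return power_mw / bit_rate_gbps * pitch_um


# Hypothetical inputs: ground line at minimum width, 5 mW total Tx + Rx power
rho = signalling_pitch_um(w_um=1.0, w_gnd_um=1.0, s_um=1.0)
psi = energy_pitch(power_mw=5.0, bit_rate_gbps=10.0, pitch_um=rho)
print(f"rho = {rho} um, psi = {psi} pJ/bit*um")
```

With these placeholder values the pitch is 4 µm and the energy per bit is 0.5 pJ/bit, so the cost is their product.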

The design flow shown in Algorithm 1 runs, in its basic form, through all possible combinations of transceiver topology in the given set T, interconnect width W and spacing S to find the optimum cost  $\psi$  solution consisting of  $T_{opt}$ ,  $W_{opt}$ , and  $S_{opt}$  for a given aggregate bandwidth  $f_b$ . The algorithm calculates the transmitter and receiver equalization settings or tap values from the pulse response. These settings are used to calculate the energy per bit and the cost metric  $\psi$ , which is computed iteratively until the minimum is found; this minimum corresponds to the optimal solution.

Algorithm 1: Holistic MCS communication interface design flow

**Result:** Optimum solution  $T_{opt}, W_{opt}, S_{opt}$

- define width range:  $W = \{W_{min}, \ldots, W_{max}\}$
- define spacing range:  $S = \{S_{min}, \ldots, S_{max}\}$
- define transceiver types:  $T_i \in T$
- define data bit rate:  $f_b$
- define interconnect average length:  $L$
- initialize  $\psi_{old}$
- **while**  $T_i \in T$  **do**
  - **for**  $W \leq W_{max}$  **do**
    - **for**  $S \leq S_{max}$  **do**
      - find S-parameters for given  $W, S$
      - find pulse response for given  $f_b$
      - find required number of taps for Tx
      - find required number of DFE taps for Rx
      - calculate power consumption in Tx, Rx:  $\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$ ,  $\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$
      - calculate signalling pitch:  $\rho = W + W_{min} + 2S$
      - calculate interface energy-area cost:  $\psi = \frac{\phi}{f_b} \left( W + W_{min} + 2S \right)$
      - **if**  $\psi < \psi_{old}$  **then**
        - $T_{opt} = T_i, W_{opt} = W, S_{opt} = S$
        - update  $\psi_{old} = \psi$
      - **end**
    - **end**
  - **end**
- **end**
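The exhaustive search of the design flow can be sketched in Python as below. This is only an illustration under stated assumptions: `toy_power_mw` is a hypothetical stand-in for the field solver, pulse response and tap-count steps, its coefficients are arbitrary, and the sketch keeps the lowest cost seen so far as its reading of the update step.

```python
from itertools import product


def toy_power_mw(topology, w_um, s_um):
    """Hypothetical power model replacing S-parameter extraction and tap estimation."""
    base = {"SSTL": 6.0, "HSUL": 5.0, "LVDS": 4.5, "CML": 4.0}[topology]
    # Assumption: narrower or more tightly spaced lines need more equalization power
    return base + 2.0 / w_um + 1.0 / s_um


def holistic_search(topologies, widths_um, spacings_um, f_b_gbps, w_min_um, power_model):
    """Return ((T_opt, W_opt, S_opt), psi_min) over all combinations."""
    best, psi_min = None, float("inf")
    for t, w, s in product(topologies, widths_um, spacings_um):
        phi = power_model(t, w, s)          # total Tx + Rx power in mW
        rho = w + w_min_um + 2.0 * s        # signalling pitch in um
        psi = phi / f_b_gbps * rho          # energy-area cost in pJ/bit*um
        if psi < psi_min:                   # keep the running minimum
            psi_min, best = psi, (t, w, s)
    return best, psi_min


best, psi_min = holistic_search(
    ["SSTL", "HSUL", "LVDS", "CML"], [1.0, 1.5, 2.0], [1.0, 1.5, 2.0],
    f_b_gbps=10.0, w_min_um=1.0, power_model=toy_power_mw,
)
print(best, round(psi_min, 3))
```

In a real flow the power model would be replaced by the characterization steps of the listing; only the loop structure and the cost formula are taken from the text.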

#### 6.1.2 Example of low resistivity silicon substrate interface

In order to understand the basic operation of the above algorithm, consider an example of a silicon interposer with a low resistivity of  $100 \,\Omega \cdot \text{cm}$  and a dielectric constant of 11.9 as shown in Figure 6.3, where two metal layers are embedded in the  $SiO_2$  insulator. Consider the length L to be 10 mm. For simplicity, consider the data rate per line to be 10 Gb/s only and the range of width variation to be  $1-2\,\mu\text{m}$  while the spacing is kept constant at  $1\,\mu\text{m}$  as shown in Figure 6.3. The insertion loss variation with width is shown in Figure 6.4, where at the Nyquist frequency of 5 GHz for  $10 \,\text{Gb/s}$  signalling, the frequency dependent insertion loss of the  $2\,\mu\text{m}$  wide line is 7 dB higher than that of the  $1\,\mu\text{m}$  line. Also, there is 4 dB higher DC loss in the  $1\,\mu\text{m}$  wide line, which means a reduced voltage swing at the Rx input.

A channel is generally evaluated with regard to its insertion loss using the pulse response method. This method, also used here, consists of sending a pulse with a width equal to one unit interval (100 ps at 10 Gb/s) into one end of the channel. Both ends of the channel are properly terminated with a typical 50  $\Omega$  impedance to avoid reflections, which are not analyzed in this case. The pulse response obtained at the other end of the channel is shown in Figure 6.5, where the x-axis is normalized to



Figure 6.3: Stackup for silicon interposer based multi-chip system

one unit interval in order to clearly see the inter-symbol interference with previous or later data bits.

The signal rises completely within 1 UI for both the  $1\,\mu$ m and  $2\,\mu$ m wide lines. This means that there is no pre-cursor inter-symbol interference (ISI) with previous bits. But both lines cause the signal to extend into later bits: the  $2\,\mu$ m response drops to zero after about 3 UI and the  $1\,\mu$ m response after 2 UI. To cancel this post-cursor ISI, either a continuous time linear equalizer (CTLE) or a decision feedback equalizer (DFE) is used in the receiver design [77]. Complete cancellation of the ISI needs either a high impedance peaking in the CTLE or a large number of DFE taps. If a DFE is used, two DFE taps are required at the receiver end for the  $2\,\mu$ m line to cancel the 2nd and 3rd UI ISI, while only one DFE tap is required for the  $1\,\mu$ m line to cancel the 2nd UI ISI.
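The tap count argument can be illustrated with a short sketch that counts the post-cursor samples of a UI-spaced pulse response. The sample values and the threshold are hypothetical and are not read from Figure 6.5; they are only shaped to mirror the qualitative behaviour described in the text.

```python
def count_post_cursor_taps(samples, cursor_index, threshold):
    """Count UI-spaced samples after the main cursor whose magnitude exceeds threshold."""
    return sum(1 for v in samples[cursor_index + 1:] if abs(v) > threshold)


# Hypothetical pulse responses sampled once per UI (volts), main cursor at index 1
pulse_1um = [0.00, 0.30, 0.08, 0.00, 0.00]
pulse_2um = [0.00, 0.45, 0.15, 0.06, 0.00]

taps_1um = count_post_cursor_taps(pulse_1um, cursor_index=1, threshold=0.02)
taps_2um = count_post_cursor_taps(pulse_2um, cursor_index=1, threshold=0.02)
print(taps_1um, taps_2um)
```

Each counted sample corresponds to one DFE tap that would be needed to cancel the ISI at that UI.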

Palaniappan et al. demonstrated a method to estimate the power consumption of the receiver circuit based upon the equalization value in dB for a CTLE and the number of taps for a DFE [77]. It was shown that a CTLE is used only up to 12 dB of equalization, while for higher values a DFE is preferred. The total loss in our case is much higher, but since the frequency dependent insertion loss (total minus DC) S21 (dB) is



Figure 6.4: S-parameters extracted using HSPICE 2D field solver



Figure 6.5: Response for  $10 \,\mathrm{Gb/s}$  input pulse with 1 ps rise time

less than 12 dB for both lines, CTLE equalization is used. Based on this estimation methodology and a power of  $0.1 \,\mathrm{mW/Gb/s}$  per 6 dB of CTLE equalization, an extra 1 mW of power  $\phi_{Rx}$  is consumed by the receiver circuit of the interface with the 1  $\mu$ m wide interconnect. Ignoring the other blocks in the transmitter and receiver for a basic understanding of this algorithm, the energy-pitch metric  $\psi$  is calculated based on just the front-end driver, receiver and equalization blocks. Even though the power consumption of the 1  $\mu$ m interface is higher, its energy-pitch metric  $\psi$  is still lower by 0.1 pJ/bit  $\cdot \mu$ m than that of the 2  $\mu$ m interface. This means that for the combined energy-area or energy-pitch performance of a multi-chip interface, the 1  $\mu$ m wide line interface is still the better choice.
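The equalizer choice and the CTLE power estimate used in this example can be sketched as follows. The 12 dB limit and the 0.1 mW/Gb/s per 6 dB figure are taken from the text; the linear scaling of power with equalization in dB and the loss values passed to the functions are assumptions made for this illustration.

```python
def choose_equalizer(freq_dependent_loss_db, ctle_limit_db=12.0):
    """CTLE up to the 12 dB limit quoted from [77], DFE above it."""
    return "CTLE" if freq_dependent_loss_db <= ctle_limit_db else "DFE"


def ctle_power_mw(eq_db, bit_rate_gbps, mw_per_gbps_per_6db=0.1):
    """Assumes CTLE power scales linearly with the equalization in dB."""
    return mw_per_gbps_per_6db * bit_rate_gbps * (eq_db / 6.0)


equalizer = choose_equalizer(7.0)          # hypothetical frequency dependent loss
extra_power_mw = ctle_power_mw(eq_db=6.0, bit_rate_gbps=10.0)
print(equalizer, extra_power_mw)
```

At 10 Gb/s each additional 6 dB of CTLE equalization therefore costs 1 mW in this model.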

#### 6.1.3 CML Front-End and Silicon substrate Example

In the previous example, the width variation is the main factor and the spacing between interconnects is not considered. In this example, the spacing also plays an important role along with the width of the interconnect and is critical to the optimization of the energy-pitch metric  $\psi$ . Current mode logic (CML) signalling is used for this example, together with a high resistivity  $10\,000\,\Omega \cdot \text{cm}$  silicon substrate with a low  $\tan\delta$  of only 0.001 [103]. This example is based on the following observation for energy-pitch minimization of a chip-to-chip communication interface:

The increase in spacing between differential pair in CML signalling should lead to higher impedance and lower power consumption, but shall increase the signalling pitch. The decrease in spacing shall decrease the differential pair impedance and lead to higher power consumption, but shall reduce the signalling pitch. The methodology presents combined  $\psi$  energy-pitch analysis to optimize the width, spacing and power consumption.

This subsection uses the holistic energy-pitch minimization methodology to optimize the design of current mode logic (CML) transmitter front end and interposer for optimum energy and area efficiency [104]. A general schematic of the CML driver



Figure 6.6: CML driver through  $Z_{odd}$  channel to  $R_x$ 

is shown in Figure 6.6, which shows two input transistors biased in their saturation regions with common mode bias voltage  $V_{CM}$ , resulting in current  $I_{bias}$  flowing to ground. The voltages  $v_{in}^+$  and  $v_{in}^-$  represent the differential voltage input at the gates of the CML transistors. As described in [57], if the interconnect odd mode impedance  $(Z_{odd})$ , driver impedance  $(R_D)$  and receiver single ended termination  $(\frac{R_T}{2})$  are matched to suppress ringing/reflections at the receiver end, then half of  $I_{bias}$  goes into the receiver impedance  $R_T$ , giving a differential voltage swing of  $\frac{I_{bias}}{2}R_T$ . Typically, good receivers are capable of interpreting signals  $\geq 100 \text{ mV}$ , which means that the CML driver can be designed for the lowest power at which the swing still meets the minimum input voltage swing requirement of the receiver.

Generally, the impedance of the CML drivers is designed as  $50 \Omega$  for PCB based systems. But when these circuits are used for data transmission between chips in 2.5D integrated technology, then these circuits can use the high impedance design to lower the required current (I) for a given required voltage swing. The biggest drawback of CML drivers is their data rate independent power consumption. Regardless of the frequency of data transmission, static  $I_{bias}$  current flows through the driver. Therefore, power consumption in CML drivers is defined as  $V_{DD} \cdot I_{bias}$ .

Consider the stack-up shown in Figure 6.7, which contains two copper metal layers in  $SiO_2$  dielectric over a silicon substrate. A coplanar architecture is considered in which a differential pair is surrounded by ground lines for shielding purposes and has ground lines under it on the lower metal layer, all separated with a constant spacing (S).

The goal of the co-design is to investigate the performance of this coplanar architecture with different width and spacing values and then to find the W and S values for which minimum Energy\*Pitch (Enpitch) cost is achieved at the maximum possible effective -3 dB bandwidth  $(BW_{eff})$ . Once 2D field solver has extracted the RLGC model for all possible values of W and S, then the first thing to do is to find out the odd mode impedance  $Z_{odd}$  for each W, S value combination.  $Z_{odd}$  can


Figure 6.7: Stackup used for Simulation

be calculated using Eq. 6.1 where  $L_o$ ,  $L_m$ ,  $C_o$  and  $C_m$  represent the self, mutual inductances and capacitances, respectively.

$$Z_{odd} = \sqrt{\frac{L_o - L_m}{C_o + 2C_m}} \tag{6.1}$$
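A minimal sketch of Eq. 6.1 is given below. The per-unit-length RLGC values are hypothetical placeholders rather than field solver output, and the mutual inductance is held constant between the two cases purely to isolate the effect of the coupling capacitance.

```python
import math


def z_odd(l_self, l_mutual, c_self, c_mutual):
    """Odd mode impedance per Eq. 6.1: sqrt((L_o - L_m) / (C_o + 2 C_m))."""
    return math.sqrt((l_self - l_mutual) / (c_self + 2.0 * c_mutual))


# Hypothetical values in H/m and F/m
z_wide_spacing = z_odd(400e-9, 100e-9, 60e-12, 20e-12)   # weaker coupling
z_tight_spacing = z_odd(400e-9, 100e-9, 60e-12, 45e-12)  # stronger coupling
print(round(z_wide_spacing, 2), round(z_tight_spacing, 2))
```

The tighter pair has the larger mutual capacitance and hence the lower odd mode impedance.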

This equation shows that if the mutual coupling capacitance between the interconnects of a differential pair for CML signalling is increased due to reduced spacing between the pair lines, the odd mode impedance shall decrease. The impedance is directly related to the power consumption for a given voltage swing and hence, shall lead to higher power consumption.

Based upon basic transmission line theory, the insertion loss of the differential pair depends upon both the conductor resistance R and the dielectric conductance G.

$$\alpha = \alpha_C + \alpha_D$$

where  $\alpha_C$  represents the conductor resistance loss and  $\alpha_D$  represents the dielectric conductance loss. For simplicity, two assumptions are made in this example. The first is that the dielectric loss is extremely small because the dielectric conductance G of the RLGC model is not significant due to the extremely high resistivity ( $10\,000\,\Omega \cdot \text{cm}$ ) of the silicon substrate. Therefore, the attenuation over the line is due only to the conductive losses from the interconnect resistance R. The second is the low loss assumption for the differential pair, i.e. the inductive reactance of the transmission line is much larger than its resistance and the capacitive susceptance is much larger than the dielectric conductance. To show this, the general transmission line propagation constant  $\gamma$  is written as [105]

$$\gamma = \sqrt{(R + j\omega L)(G + j\omega C)} \tag{6.2}$$

which can be written as

$$\gamma = j\omega\sqrt{LC}\sqrt{\left(1 + \frac{R}{j\omega L}\right)\left(1 + \frac{G}{j\omega C}\right)}$$

Using the second assumption of a low loss transmission line,  $R \ll \omega L$  and  $G \ll \omega C$ , a first-order Taylor series expansion leads to

$$\gamma \approx \frac{1}{2} \left( R \sqrt{\frac{C}{L}} + G \sqrt{\frac{L}{C}} \right) + j \omega \sqrt{LC} = \alpha + j\beta$$

Hence, the attenuation  $\alpha$  in Nepers/meter is given as

$$\alpha = \frac{1}{2} \left( \frac{R}{Z_{odd}} + GZ_{odd} \right)$$

Using the first assumption that the dielectric conductance loss is extremely small in this example for simplicity, the attenuation in Nepers/meter is given as

$$\alpha = \frac{R}{2Z_{odd}}$$
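The quality of the low loss approximation can be checked numerically against the exact propagation constant. The per-unit-length values below are hypothetical and only chosen so that the resistance is small compared with the inductive reactance at 5 GHz; they are not extracted from the stackup of this work.

```python
import cmath
import math


def gamma_exact(r, l, g, c, f_hz):
    """Exact propagation constant sqrt((R + jwL)(G + jwC)) in 1/m."""
    w = 2.0 * math.pi * f_hz
    return cmath.sqrt((r + 1j * w * l) * (g + 1j * w * c))


def alpha_low_loss(r, l, g, c):
    """Low loss attenuation 0.5 * (R / Z + G * Z) in Np/m with Z = sqrt(L / C)."""
    z = math.sqrt(l / c)
    return 0.5 * (r / z + g * z)


# Hypothetical line: 2000 ohm/m, 400 nH/m, negligible G, 100 pF/m
R, L, G, C = 2000.0, 400e-9, 0.0, 100e-12
alpha_exact = gamma_exact(R, L, G, C, f_hz=5e9).real
alpha_approx = alpha_low_loss(R, L, G, C)
relative_error = abs(alpha_approx - alpha_exact) / alpha_exact
print(round(alpha_exact, 3), round(alpha_approx, 3), round(relative_error, 4))
```

For these values the two attenuation figures agree to well within one percent, which is the regime in which the simplified expression is meant to be used.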

Since 1 Neper = 8.686 dB, then attenuation factor  $\alpha$  in decibels for 10 mm line differential pair can be calculated by Eq. 6.3 where  $R_o$ ,  $R_s$  are dc resistance and skin effect resistance factor values.

$$\alpha_{dB} = 8.686 \left[ \frac{\left( R_o + R_s \sqrt{f} \right) / 100}{2Z_{odd}} \right] \tag{6.3}$$
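Eq. 6.3 can be evaluated with a few lines of Python. The resistance coefficients and the impedance used in the call are hypothetical; the division by 100 converts the per-meter resistance to the 10 mm line length of the example.

```python
import math


def attenuation_db_10mm(r_dc_ohm_per_m, r_skin, f_hz, z_odd_ohm):
    """Conductor attenuation of a 10 mm line in dB, following Eq. 6.3."""
    r_per_m = r_dc_ohm_per_m + r_skin * math.sqrt(f_hz)  # R_o + R_s * sqrt(f)
    nepers = (r_per_m / 100.0) / (2.0 * z_odd_ohm)       # 10 mm = 1/100 m
    return 8.686 * nepers


# Hypothetical: R_o = 1000 ohm/m, R_s = 0.01 ohm/(m*sqrt(Hz)), Z_odd = 60 ohm
loss_db = attenuation_db_10mm(1000.0, 0.01, 5e9, 60.0)
print(round(loss_db, 3))
```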

By plotting and finding the  $-3 \,\mathrm{dB}$  bandwidth  $BW_{ch}$  for each W and S configuration, 10-90% rise time  $\mathrm{tr}_{ch}$  of the link interconnect [85] can be calculated using the Eq. 6.4.

$$tr_{ch} = \frac{0.35}{BW_{ch}} \tag{6.4}$$

For perfect matching of driver, channel and receiver impedance, the approximate effective or final rise time  $tr_{tot}$  based upon the single-pole RC channel assumption at the receiver input can be calculated based upon the cascaded three RC-blocks and can be written as

$$tr_{tot} = \sqrt{(tr_{Tx})^2 + (tr_{ch})^2 + (tr_{Rx})^2}$$

Since the output impedance of  $T_x$  and the input impedance of  $R_x$  are both equal to the channel impedance  $Z_{odd}$  for best matching, their 10-90% rise time is equal to 2.2RC or  $2.2Z_{odd}C_{pad}$ . Hence, the final rise time at the input amplifier of the receiver is

$$tr_{tot} = \sqrt{(2.2C_{pad}Z_{odd})^2 + \left(\frac{0.35}{BW_{ch}}\right)^2 + (2.2C_{pad}Z_{odd})^2} \tag{6.5}$$

and finally can be simplified to Eq. 6.6.

$$tr_{tot} = \sqrt{9.68(C_{pad}Z_{odd})^2 + \left(\frac{0.35}{BW_{ch}}\right)^2} \tag{6.6}$$

Then the total or final Bandwidth  $BW_{tot}$  can be calculated using the inverse of the Eq. 6.4 and can be written as

$$BW_{tot} = \frac{0.35}{\sqrt{9.68(C_{pad}Z_{odd})^2 + \left(\frac{0.35}{BW_{ch}}\right)^2}} \tag{6.7}$$
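The chain from channel bandwidth to total bandwidth (Eq. 6.4 to Eq. 6.7) can be sketched as follows. The channel bandwidth of 20 GHz and the impedance of 50 Ω are hypothetical inputs; the 0.2 pF pad capacitance is the value used in this example.

```python
import math


def total_bandwidth_hz(bw_ch_hz, c_pad_f, z_odd_ohm):
    """Effective -3 dB bandwidth of Tx pad, channel and Rx pad in cascade."""
    tr_ch = 0.35 / bw_ch_hz                  # channel rise time, Eq. 6.4
    tr_pad = 2.2 * c_pad_f * z_odd_ohm       # 10-90% rise time of one RC pad
    tr_tot = math.sqrt(2.0 * tr_pad ** 2 + tr_ch ** 2)  # root-sum-square, Eq. 6.5
    return 0.35 / tr_tot                     # Eq. 6.7


bw_tot = total_bandwidth_hz(bw_ch_hz=20e9, c_pad_f=0.2e-12, z_odd_ohm=50.0)
print(round(bw_tot / 1e9, 3), "GHz")
```

The factor 9.68 in Eq. 6.6 is simply 2 × 2.2², which the sketch keeps in its unexpanded form.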

The next step is to find the power and signal routing pitch cost for each configuration. For current mode logic driver as shown in Figure 6.6, the power consumed is only



Figure 6.8: Attenuation vs width for S = W

static which can be calculated as the product of supply voltage  $V_{DD}$  and  $I_{bias}$ . For a given voltage swing  $V_{SW}$  requirement, the current required is  $V_{SW}/Z_{odd}$  which is equal to half of the CML driver bias current, i.e.  $I_{bias}/2$ . The signal pitch for such coplanar configuration is

$$\rho = 3 \times (S + W)$$

The power consumption  $\phi$  is product of current  $I_{bias}$  and voltage supply  $V_{DD}$ .

$$\phi = I_{bias} \cdot V_{DD} = \frac{2V_{DD}V_{SW}}{Z_{odd}}$$

Therefore, the final metric for our co-design is energy\*pitch/bit given by Eq. 6.8.

$$\text{Energy/bit} \cdot \text{Pitch} = \frac{\phi}{BW_{tot}} \cdot \rho = \frac{2V_{DD}V_{SW} \cdot 3(S+W)}{Z_{odd}BW_{tot}} \tag{6.8}$$
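A minimal sketch of the CML cost metric of Eq. 6.8 follows. The supply of 1.8 V and swing of 300 mV are the values of this example; the odd mode impedance of 60 Ω and the 10 GHz total bandwidth passed in the call are assumptions for illustration and do not reproduce the plotted results.

```python
def cml_power_w(v_dd, v_sw, z_odd_ohm):
    """Static CML power: I_bias * V_DD with I_bias = 2 * V_SW / Z_odd."""
    return 2.0 * v_dd * v_sw / z_odd_ohm


def cml_pitch_um(w_um, s_um):
    """Signal pitch of the coplanar differential configuration: 3 * (S + W)."""
    return 3.0 * (s_um + w_um)


def energy_pitch_pj_um(v_dd, v_sw, z_odd_ohm, bw_tot_hz, w_um, s_um):
    """Eq. 6.8 in pJ/bit*um, using BW_tot as the bit rate as in the text."""
    energy_per_bit_pj = cml_power_w(v_dd, v_sw, z_odd_ohm) / bw_tot_hz * 1e12
    return energy_per_bit_pj * cml_pitch_um(w_um, s_um)


metric = energy_pitch_pj_um(1.8, 0.3, 60.0, 10e9, w_um=10.0, s_um=10.0)
print(round(metric, 3))
```

Sweeping `w_um` and `s_um` with the corresponding impedance and bandwidth for each pair would give the kind of cost surface that the co-design searches for its minimum.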

The attenuation values calculated using Eq. 6.3 are plotted in Figure 6.8. The odd mode differential impedance  $Z_{diff}$  is plotted in Figure 6.9. It shows that with increasing spacing of the metal lines, the inductance increases, which results in increased impedance. With an even larger increase in width, the capacitance increases, which lowers the impedance. As can be seen in the plot,  $Z_{diff}$  reaches a peak at 5 µm width and decreases with further increments in width. One conclusion from this plot is that at 5 µm width, the differential impedance is maximum for the given stackup, which could lead to the lowest current requirement in the CML driver design.

In order to calculate the bandwidth of driver and receiver, pad capacitance is chosen as 0.2 pF to meet the JEDEC ESD requirements [81][83]. The -3 dB effective bandwidth for the whole path from  $T_x$  to  $R_x$  is plotted in Figure 6.10, which shows that bandwidth increases with increasing width and spacing. But this will drastically increase the area cost of the design. Therefore, a combined energy/bit\*pitch metric is needed for optimum configuration selection as plotted in Figure 6.11. Power supply value  $V_{DD}$  is 1.8 V and required  $V_{SW}$  is 300 mV. It can be seen from the



Figure 6.9:  $Z_{diff}$  vs width(W) and spacing(S)

plot that cost metric reaches a minimum for  $10\,\mu\text{m}$  width with  $10\,\mu\text{m}$  spacing and supports the  $-3\,\text{dB}$  bandwidth of  $10\,\text{GHz}$ .

Consider for example that a chip has to be designed with side length of 3 mm for maximum bandwidth and minimum power. Then using 10 µm width and spacing, 50 CML differential pairs running at 10 Gb/s each can be placed on the interposer resulting in total bandwidth of 500 Gb/s.
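The arithmetic of this example can be reproduced with the pitch expression for the coplanar differential configuration. The function name is an assumption; the inputs are the values quoted above.

```python
def pairs_on_edge(edge_mm, w_um, s_um):
    """Number of differential pairs fitting along one chip edge at pitch 3 * (S + W)."""
    pitch_um = 3.0 * (s_um + w_um)
    return int(edge_mm * 1000.0 // pitch_um)


pairs = pairs_on_edge(edge_mm=3.0, w_um=10.0, s_um=10.0)
total_gbps = pairs * 10          # each pair runs at 10 Gb/s
print(pairs, total_gbps)
```

With a 60 µm pitch, a 3 mm edge holds 50 pairs, which gives the 500 Gb/s aggregate bandwidth stated in the text.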



Figure 6.10: -3dB bandwidth variation with width (W) and spacing (S)



Figure 6.11: Energy\*pitch variation with width (W) and spacing(S)

# 6.2 Memory-CPU MCS Design Methodology

2.5D integration of memory and SOC on silicon interposer technology can be a better solution for high bandwidth memory systems as compared to standard printed circuit board (PCB) technology. In this section, a methodology will be shown to find the best possible memory type and integration technology for given memory interface requirements. In order to find the optimum solution, the important aspects of the memory system have to be clearly defined. Based on the cost and performance metrics, a memory interface with best cost to benefit ratio will be selected by choosing between different possible memory standards and integration technologies.

The methodology with the realistic cost metrics will be explained for a 400 Gb/s bandwidth memory system. The different factors and aspects affecting the derived solution will be discussed. The methodology to select the optimum integration technology for given memory interface requirements will be described. The impact of bandwidth, power and area requirements on the selection of memory type will be analyzed. There can be no single solution which could fit all types of applications. It shall be demonstrated that under certain circumstances, a certain solution will outperform the other in terms of performance metrics.

A flow chart describing the steps involved in the design methodology is shown in Figure 6.12. Like any methodology, it starts with the application dependent system requirements normally derived by system architect in the industry. The most important system requirements in terms of memory interface are :

- the data exchange bandwidth normally given in terms of Gb/s (Gigabits per second)
- maximum area allowed for the complete system, especially important for handheld products and mobile systems
- power used by memory interface in the whole system
- latency given in terms of number of memory interface clock cycles
- memory size given in GB (Gigabytes)

All these design parameters vary greatly from one application to another. A desktop computer allows large space for big memory modules, while a small smartphone puts strict requirements on the thickness of the memory, how much heat it generates, how many routing resources it consumes and how much bandwidth it can offer in one channel. Such small space applications are one of the biggest drivers pushing towards multi-chip and 2.5D integrated memory systems. Hence, it becomes essential for system architects and designers to understand the complex relationships between the performance metrics and the design choices available to them. These basic system requirements are given to a system designer, who looks into the available memories and how to place them in the complete system. The types of memories available and their properties in comparison to each other are discussed below.

### 6.2.1 Introduction to Memory standards

Over the years, there has been a lot of development in the memory industry. This has been pushed by the applications in high end graphics and data centers which require faster and wider memory access from central processing unit (CPU). One



Figure 6.12: Optimum memory system design methodology

of the biggest advancements is the double data rate (DDR) standard which has continuously evolved over the last decade [88]. DDR not only got its graphics version in terms of GDDR but also got a special low power version LPDDR with many interface changes and power saving extra features [92].

The trend towards low power mobile memories started as a result of the huge growth of smartphones and hand-held products, which must offer high speed performance along with extremely long battery life. Memories can be divided into three main categories: "main memory", "graphics memory" and "mobile memory". In the last few years another category has been added, which enables memory interfaces for 2.5D integration. This fourth category can be named "3D enabling memories", shortened to "3D-Mem". The main contenders in this new category are high bandwidth memory (HBM) [60] and wide-I/O [58].

Figure 6.13 shows how memories developed over the years, especially the three categories of DDR, LPDDR and 3D-Mem. The figure shows the data rate in Gb/s possible per data line (DQ), which gives an idea of how different memory standards increased the maximum possible data rate over the years. The data shown in Figure 6.13 do not take into account overclocking, which could be possible with higher voltages; they depict only the maximum bandwidth defined in the JEDEC standards. It can be seen that the DDR1 standard in 2008 supported only 400 Mb/s per DQ line in the memory channel, while the LPDDR4 standard introduced in 2015 can deliver almost 4200 Mb/s over a single DQ line.

This tremendous growth of bandwidth has helped in huge increase in the performance of main-memory and mobile systems. If one looks at the desktop computing market, it is nowadays common to have more than 8 GB of DDR4 dual in-line memory modules (DIMM), working over a single or multiple channels. Similarly, mobile systems are showing continuous increase in the size and bandwidth of memory interface, reaching to around 3-4 GB size with multiple 64-bit channels operating separately. But this memory revolution is now entering into a new era of heterogeneous



Figure 6.13: Memory bandwidth growth over the years

2.5D where memory and CPU will communicate over the interposer interconnect or through the through-silicon-vias (TSV).

The trend is due to the possibility of connecting chips together on the interposer and has also pushed the development of memories specifically designed for 2.5D integration technology, e.g. wide-I/O, and HBM. As can be seen from Figure 6.13, these memories are not as high speed per DQ line as LPDDR4 but they do offer huge number of DQ lines. These DQ lines can be easily routed over the interposer with small area as compared to large area PCB routing, which will not allow huge number of interconnects. So these new technologies offer different advantages and it is very interesting to find the optimum choice of memory and technology with given system design parameters.

Memory core and I/O voltages have also reduced drastically over the years. The reason behind the voltage reduction is the miniaturization of transistors following Moore's law. This voltage reduction has directly reduced the power usage of the memory interface as well. The reduction of the supply voltage  $V_{DDQ}$  over the years for different memories is shown in Figure 6.14. DDR3 has a  $V_{DDQ}$  of 1.5 V as compared to 1.8 V for DDR2. DDR3 also introduced a low power version called DDR3L, which has a  $V_{DDQ}$  of 1.35 V. Similarly, among the low power memories for mobile and hand held systems, LPDDR2 and LPDDR3 have a  $V_{DDQ}$  of 1.2 V while LPDDR4 reduced it to 1.1 V. There is an even lower power version of LPDDR4 in which the I/O voltage is further reduced by about  $1.83 \times$  to 0.6 V, which will dramatically decrease the power usage of these memories [93]. The trend is clearly towards lower power, but this has led to increased design complexity and very detailed and complex signal integrity compliance requirements.

A standard memory-cpu channel consists of some data lines labeled as DQ, and some command/address/control lines labeled as CA/CTL. The number of CA and DQ lines per channel plays the key role in deciding the routing resources required in either PCB or interposer based layout and directly affects the routing/integration costs. For example, Wide-I/O2 memory requires 33 CA/CTL lines per channel while it has 64 DQ lines. As shown in Figure 6.13, a single Wide-I/O2 DQ line supports a 1.066 Gb/s data rate, so one channel of Wide-I/O2, with a total of 97 interconnects, provides 68.2 Gb/s. From the JEDEC specification [58], one memory die can have a maximum of 8 channels. Hence, one Wide-I/O2 memory die can provide a total bandwidth of 68.2 GB/s.

Figure 6.14: Memory I/O voltage supply reduction trend

HBM can also have maximum of 8 channels per memory die where each channel has total of 212 signals. From these 212 signals, only 128 are DQ signals while the rest 84 signals consist of CA/CTL lines and reliability serving redundant interconnects. Hence, with 8 channels in total for HBM, 1696 interconnects are required and will provide the maximum bandwidth of 256 GB/s.

LPDDR4 is a low power mobile memory standard which targets a maximum data rate per pin of 4266 Mb/s. But this maximum speed grade is not yet widely used or available. Instead, the slightly slower speed grade of 3200 Mb/s per DQ line is used for comparative analysis in this study. A single channel of LPDDR4 consists of 45 interconnects in total, of which 16 are DQ lines. Hence, the maximum per channel bandwidth of LPDDR4 is 51.2 Gb/s. A quantitative comparison of interconnects and data rate per channel for these memories is shown in Table 6.1 below.

| Memory   | Total interconnects per channel | Data lines per channel | Per channel bandwidth (Gb/s) |
|----------|---------------------------------|------------------------|------------------------------|
| Wide-IO2 | 97                              | 64                     | 68.2                         |
| HBM      | 212                             | 128                    | 256                          |
| LPDDR4   | 45                              | 16                     | 51.2                         |

Table 6.1: Per channel memory interface comparison
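The per channel figures quoted in the text follow directly from the number of DQ lines and the data rate per line, as the short sketch below shows. The 2.0 Gb/s per DQ line for HBM is not stated explicitly in the text and is inferred here from 256 Gb/s over 128 data lines; the bandwidth per interconnect is an additional derived quantity included only for illustration.

```python
# (total interconnects, DQ lines, Gb/s per DQ line) for one channel
CHANNELS = {
    "Wide-IO2": (97, 64, 1.066),
    "HBM": (212, 128, 2.0),      # rate inferred from 256 Gb/s / 128 DQ lines
    "LPDDR4": (45, 16, 3.2),
}

# Per channel bandwidth in Gb/s: DQ lines times data rate per line
bandwidth = {name: dq * rate for name, (_, dq, rate) in CHANNELS.items()}

# Bandwidth delivered per routed interconnect (DQ plus CA/CTL) in Gb/s
per_wire = {name: bandwidth[name] / total for name, (total, _, _) in CHANNELS.items()}


def die_bandwidth_gbytes(name, channels=8):
    """Die bandwidth in GB/s for the given number of channels (8 bits per byte)."""
    return bandwidth[name] * channels / 8.0


for name in CHANNELS:
    print(name, round(bandwidth[name], 1), round(per_wire[name], 3))
print(round(die_bandwidth_gbytes("Wide-IO2"), 1), die_bandwidth_gbytes("HBM"))
```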

### 6.2.2 Design Algorithm

To understand the algorithm, it is necessary to define all the metrics, parameters and design factors mathematically. For any optimization problem, the first thing to fix is the optimization objective of the design algorithm. In memory interfaces, a linear optimization objective cost function  $\psi$  can be defined as

$$\psi = a_1 \alpha + a_2 \phi + a_3 \zeta \tag{6.9}$$

where  $\alpha$  is the silicon cost of the CPU or memory controller die derived from its size,  $\phi$  is the routing cost derived from the number of required routing layers and  $\zeta$  is the power cost of the interface derived from the required current and voltage in the I/O cells of the CPU and memory die. Also,  $a_1$ ,  $a_2$ , and  $a_3$  are conversion factors for  $\alpha$ ,  $\phi$  and  $\zeta$  into a single monetary unit of '\$'. These factors are set by the system designer based upon the prices of silicon, packages, routing layers and power. The factors in Equation 6.9 change with the choice of integration technology and memory type. The memory choice space is defined as the set M and the integration technology space as the set I. The complete design space D is defined as all possible combinations of the two sets M and I as below

$$D = \begin{pmatrix} M \\ I \end{pmatrix} \tag{6.10}$$

where

$$M = \{M_{Main-mem}, M_{Mobile-mem}, M_{3D-mem}\}$$
$$I = \{PCB, Interposer\}$$
$$M_{Main-mem} = \{DDR3, DDR4\}$$
$$M_{Mobile-mem} = \{LPDDR3, LPDDR4\}$$
$$M_{3D-mem} = \{\text{Wide-I/O}, \text{Wide-I/O2}, \text{HBM}\}$$

Although not all memories in the design space are specifically designed for interposer based 2.5D or 3D integration, their dies are available from the industry and can be placed onto an interposer. For such memories, the die stack can also be placed on the interposer with the correct bumping pitch. The best suited memories for 2.5D integration are clearly HBM, Wide-I/O and Wide-I/O2 because they are specifically developed for interposer based integration. For simplicity and a clear design space evaluation, only the 3D-Mem category will be evaluated with interposer based integration, while the others will be evaluated with PCB based integration. This means that the design space described in Eq. 6.10 needs to be filtered down further.

Furthermore, DDR1-3 are not comparable to the latest LPDDR4 standard in terms of voltage and power, as shown in Figure 6.13 and Figure 6.14. Therefore, only DDR4 and LPDDR4 will be used for the PCB based solution comparison. This makes the comparative evaluation of PCB and interposer based memory interfaces relevant for the current system designer. Also, in the 3D-Mem category, Wide-I/O is the older standard compared to Wide-I/O2, so it is more realistic to use only Wide-I/O2. These changes convert the design space $D$ of Eq. 6.10 into the reduced form $D_r$ in Eq. 6.11.

$$D_r = \begin{pmatrix} \text{LPDDR4}, \text{DDR4} \\ \text{PCB} \end{pmatrix} \bigcup \begin{pmatrix} \text{Wide-I/O2}, \text{HBM} \\ \text{Interposer} \end{pmatrix} \tag{6.11}$$

With the definition of design space  $D_r$ , the question is how the different solutions in this set  $D_r$  will prove to be optimum under different system requirements. The main system requirement is the minimum data exchange rate or bandwidth of the memory interface labeled in this work as  $\beta_{min}$  in units of gigabits per second (Gb/s). Other constraints are maximum possible CPU die size  $\alpha_{max}$  and maximum affordable power consumption  $\zeta_{max}$ . So, the best possible solution  $\gamma$  can be written as

$$\gamma \in D_r \mid (\alpha \le \alpha_{max}) \land (\zeta \le \zeta_{max}) \tag{6.12}$$

This means that it is not enough to just minimize the cost metric given in Eq. 6.9; the solution must also meet the maximum cost conditions for power and die area. The optimum solution may be one that does not give the minimum objective cost but is still chosen because it is the only feasible one within the given design space. To reach such a solution, a brute-force exhaustive search is run through the design space, which is not computationally intensive because of the small number of members of the set $D_r$ [106]. The exhaustive search algorithm computes the objective cost $\psi$ for each possible solution $d \in D_r$ and determines if it could be a possible solution based on the relationship defined in Eq. 6.12. The algorithm for the design methodology to reach the optimum solution $\gamma \in D_r$ is described in Algorithm 2.

**Algorithm 2:** Design Methodology

**Result:** Optimum solution $\gamma$

- define cost weights: $a_i \ \forall i \in \{1, 2, 3\}$
- define design space: $D_r = \{d_1, d_2, d_3, \dots, d_n\}$
- define cost objective: $\psi = a_1 \alpha + a_2 \phi + a_3 \zeta$
- define max constraints: $\alpha_{max}$ and $\zeta_{max}$
- initialize iteration variable $k = 1$
- **while** $(k < n)$ **do**
  - calculate $\alpha_k$
  - calculate $\phi_k$
  - calculate $\zeta_k$
  - calculate $\psi_k = a_1 \alpha_k + a_2 \phi_k + a_3 \zeta_k$
  - **if** $(k \neq 1) \land (k \leq n)$ **then**
    - **if** $(\alpha_k \leq \alpha_{max}) \land (\zeta_k \leq \zeta_{max})$ **then**
      - **if** $(\psi_k \leq \psi_{k-1})$ **then** $\gamma = d_k$ **end**
    - **end**
  - **else if** $(k = 1)$ **then**
    - **if** $(\alpha_1 \leq \alpha_{max}) \land (\zeta_1 \leq \zeta_{max})$ **then** $\gamma = d_1$ **else** $\gamma = \text{undefined}$ **end**
  - **end**
- **end**
- **if** $(\gamma \neq \text{undefined})$ **then**
  - Print: $\gamma$ is the optimum solution
- **else**
  - Print: no solution found as system requirements are too strict
- **end**

The algorithm performs an exhaustive enumeration search through the design space and tests all the possible solutions. It calculates the costs for each solution and finds the optimum memory interface design option. As shown at the end of the algorithm after the while loop, there may be no solution which can fulfill the system design requirements. In this case, it is necessary to either decrease the system bandwidth requirement or relax the maximum constraints on the die area cost $\alpha$ and power cost $\zeta$. The algorithm must then be run again to search for a feasible solution within the design space.
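The search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the `Candidate` class, the function names and all cost numbers are hypothetical, and the sketch keeps the cheapest feasible candidate seen so far, which is one reading of the cost comparison in Algorithm 2.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One memory/integration pairing d in D_r with its cost terms (hypothetical type)."""
    name: str
    alpha: float  # silicon cost
    phi: float    # routing cost
    zeta: float   # power cost


def objective(c: Candidate, weights: Tuple[float, float, float]) -> float:
    """Linear cost psi = a1*alpha + a2*phi + a3*zeta (Eq. 6.9)."""
    a1, a2, a3 = weights
    return a1 * c.alpha + a2 * c.phi + a3 * c.zeta


def exhaustive_search(
    space: Sequence[Candidate],
    alpha_max: float,
    zeta_max: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies Eq. 6.12, or None if none does."""
    best: Optional[Candidate] = None
    best_cost = float("inf")
    for cand in space:
        # Discard candidates that violate the die size or power constraints.
        if cand.alpha > alpha_max or cand.zeta > zeta_max:
            continue
        cost = objective(cand, weights)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best


# Illustrative numbers only; they are not results from this work.
space = [
    Candidate("A on PCB", alpha=300.0, phi=150.0, zeta=5.0),
    Candidate("B on interposer", alpha=120.0, phi=200.0, zeta=1.0),
    Candidate("C on interposer", alpha=90.0, phi=350.0, zeta=0.5),
]
gamma = exhaustive_search(space, alpha_max=200.0, zeta_max=2.0)
print(gamma.name if gamma else "no solution: requirements are too strict")
```

Tightening `alpha_max` to 100 in this toy example leaves only candidate C feasible, so it is returned even though it does not have the lowest objective cost, which is the situation described for Eq. 6.12.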

To determine the silicon cost $\alpha$, a flow diagram showing the controller die area calculation for PCB and interposer based memory interfaces is given in Figure 6.15. The silicon die cost of the memory controller depends upon the number of interconnects, the bumping pitch used for the die to interposer connection and the area consumed by the I/O cells in the die. This describes a truly I/O constrained memory controller or CPU die whose area is directly dependent upon the number of I/O. The bumping pitch defines the minimum distance allowed between the centers of adjacent connection points between the interposer and the CPU die. The bumping pitch for copper pillars can be considered to be 50 µm [68]. Figure 6.16 shows the bumps and the minimum pitch $\rho_{min}$.

Minimum copper pillar pitch plays a critical role in defining the size of the CPU die. If the bumping pitch is too large, a very large die will be required to accommodate the number of bumps. It should be noted that the bump count depends not only on the number of signal I/O but also on the number of power bumps. It is good practice to define the number of power I/O as 30-40% of the total I/O in a silicon die. The second factor in the CPU die sizing is the area consumed by the I/O cells. These two factors compete with each other and the dominant one will define the die size.

Figure 6.15: Memory controller die size and PCB based package area costs

Figure 6.16: Copper pillars minimum pitch $\rho_{min}$

There are two types of bumping: staggered bumping, required for HBM, and regular bumping, required for Wide-I/O2. To keep the comparison simple, only regular bumping is considered here, as shown in Figure 6.16. If the number of interface signals is large and the total area due to I/O cells is smaller than the area needed for the bumps, then the bumping requirement defines the die area. The integration technology is also impacted by the choice of memory: Wide-I/O2 uses 40 µm pitch bumps, which are only possible with copper pillars because C4 micro bumps are no longer usable at this pitch.

Digital logic inside the memory controller die is ignored here because the processing logic in the controller is generally too small to affect the total area cost. For a graphics processing unit (GPU), it may be large due to the many processing cores inside the GPU. This large digital core area is not considered here, as this study presents a methodology for the memory interface communication system, independent of the computing requirements of the CPU die. The study therefore gives the system designer a lower limit on the die size. The designer can then choose to increase the die size to include more processing power, while paying more in terms of die cost and system size. The memory controller die area derived from the I/O requirements and the integration technology can thus be written as

$$A_{die} = \max\left[ (s_{io} \cdot n_{io}),\ \left[ \rho_{min} \times \left( \sqrt{\text{intsq-ceiling}(n_{io})} + 1 \right) \right]^2 \right] \tag{6.13}$$

$$s_{io} \in S = \{s_{HBM}, s_{wide-IO2}\}$$
$$\rho_{min} \in P = \{p_{HBM}, p_{wide-IO2}\}$$
$$n_{io} \in N = \{n_{HBM}, n_{wide-IO2}\}$$

where $\rho_{min}$ is the minimum bumping pitch dictated by the memory device die, $n_{io}$ is the number of interface signals, and $s_{io}$ is the area of one I/O cell in the memory controller. The function intsq-ceiling($n_{io}$) gives the smallest integer square number that is not below the number of interface signals $n_{io}$. Equation 6.13 thus takes the larger of the two areas dictated by the memory choice and data rate, i.e. the total I/O cell area and the minimum area required for the interface bumps.
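Eq. 6.13 can be evaluated with a short helper. The following is a sketch only: the function names are hypothetical, lengths are assumed to be in µm, and the I/O cell area of 2000 µm² is an illustrative placeholder rather than a value from this work. The signal count of 424 and the 55 µm pitch are the HBM figures used in Section 6.2.3.

```python
import math


def intsq_ceiling(n: int) -> int:
    """Smallest perfect square that is greater than or equal to n."""
    root = math.isqrt(n)
    if root * root < n:
        root += 1
    return root * root


def die_area_um2(n_io: int, s_io_um2: float, rho_min_um: float) -> float:
    """Controller die area in um^2: the larger of I/O cell area and bump array area (Eq. 6.13)."""
    io_cell_area = s_io_um2 * n_io
    bumps_per_side = math.isqrt(intsq_ceiling(n_io))
    side_um = rho_min_um * (bumps_per_side + 1)
    return max(io_cell_area, side_um ** 2)


# HBM case: 424 signals at 55 um pitch; the I/O cell area is an assumed placeholder.
area = die_area_um2(n_io=424, s_io_um2=2000.0, rho_min_um=55.0)
side_mm = math.sqrt(area) / 1000.0
print(f"bump-limited die side: {side_mm:.2f} mm")
```

With these inputs the bump array dominates, and the resulting side length matches the 1.21 mm quoted for the HBM controller die in the design example.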

The number of routing layers required to connect the memory to the controller die defines the cost $\phi$ given in Eq. 6.9. The line spacings on a PCB are much larger than the minimum line spacings allowed on the interposer, which can be as small as a few µm. This ultra fine interconnect pitch enables thousands of interconnects on the interposer between the memory dies and the controller die [107]. Because of this fine pitch, 2.5D integration requires very few metal layers on the interposer compared to the large number of routing layers typically required in PCB designs.

The energy cost $\zeta$ of the interface is defined by the supply voltage and the I/O signalling topology, i.e. terminated or unterminated signalling. The memory interfaces in HBM may use unterminated signalling due to the low data rate, while PCB based memory interfaces typically require termination even at low data rates due to longer lengths and imperfections in the channel. The total energy consumption cost is calculated in this study by multiplying the power consumption of the memories by the duration of memory usage.

### 6.2.3 Memory Interface 400 Gb/s Design Example

In recent years, 400 Gb/s memory interfaces have become a topic of strong interest for high performance applications [108]. Both the PCB and interposer based options are compared below for the memories available in the design space.

**LPDDR4 Memory Interface** An LPDDR4 package of $15 \times 15 \text{ mm}^2$ supports four channels, each containing 16 DQ lines. The number of signals per channel for LPDDR4 is 45, which includes the required data (DQ) and command/control (CA/CTL) signals. LPDDR4 supports a data rate of 3.2 Gb/s per DQ line. Therefore, a minimum of 125 DQ lines is required for 400 Gb/s bandwidth, which can be fulfilled by 128 DQ lines using two LPDDR4 packages. The minimum area consumed by the LPDDR4 packages alone is 450 mm<sup>2</sup>. If the size of the controller package is assumed to be equal to the sum of the two memory package sizes, then the total area consumed by the controller and memory packages on the PCB will be 900 mm<sup>2</sup>. This 8 channel 400 Gb/s LPDDR4 package based memory system on PCB is shown in Figure 6.17.

To run this 400 Gb/s LPDDR4 based memory interface, the controller die must have 128 LVSTL (low voltage swing terminated logic) I/O cells [109]. LVSTL cells have low voltage swing capability and use a dynamically configured eye mask to receive the data. With termination to VDDQ, an LVSTL cell consumes 2.4 pJ/b [110]. Therefore, the power consumption for 128 DQ lines, each running at 3.2 Gb/s, is 983 mW.
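The DQ line count and the interface power quoted above follow from two short calculations, sketched here in Python. The function names are hypothetical; rates are passed in Mb/s as integers so that the ceiling division is exact.

```python
def dq_lines_needed(bandwidth_mbps: int, rate_per_line_mbps: int) -> int:
    """Minimum number of DQ lines that reaches the target bandwidth (integer ceiling)."""
    return -(-bandwidth_mbps // rate_per_line_mbps)


def interface_power_w(n_lines: int, rate_per_line_mbps: int, energy_pj_per_bit: float) -> float:
    """Interface power in watts: total bit rate times energy per bit."""
    bits_per_second = n_lines * rate_per_line_mbps * 1e6
    return bits_per_second * energy_pj_per_bit * 1e-12


# LPDDR4 figures from the text: 400 Gb/s target, 3.2 Gb/s per DQ line, 2.4 pJ/b.
min_lines = dq_lines_needed(400_000, 3_200)
power_w = interface_power_w(128, 3_200, 2.4)
print(min_lines, f"{power_w * 1000:.0f} mW")
```

The same `dq_lines_needed` helper gives 200 lines for a 2 Gb/s per line interface, which is the HBM case discussed later in this section.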

If \$50 is assumed to be the cost of one LPDDR4 package with 4 channels, then two packages will cost \$100. The cost of the controller package is calculated based upon the cost per mm<sup>2</sup> of area. Assuming \$50 per 100 mm<sup>2</sup>, the controller for the LPDDR4 interface costs \$225. The total $\alpha$ cost for packages and controller is \$325. The routing cost metric $\phi$ is derived from the number of layers required. In the octa-channel LPDDR4 system, a total of 360 interconnects need to be routed. If a standard PCB with 100 µm wide interconnects and 300 µm spacing is used, the width of one square shaped layer required for routing all of these interconnects is 144 mm. This is a very large PCB design considering the small area requirements of today's systems. As the memory package is only 15 mm wide, it is assumed that, due to the small system area requirements, the maximum width of a square shaped PCB layer is fixed at $3\times$ the memory package width, i.e. 45 mm. Therefore, in order to route 360 interconnects, a minimum of 3 such layers will be required. However, the whole top layer cannot be used for routing because the packages are bonded to the PCB on the top layer, so it can be safely assumed that at least one more routing layer is needed. Thus, 4 signal routing layers are required for routing two LPDDR4 packages. High speed signals are always routed next to a reference plane layer so that the return currents have a well-defined path, which avoids electromagnetic radiation and poor signal integrity. Hence, for 4 signal layers, a minimum of 3 solid plane layers is required, giving a PCB with 7 layers in total. If the cost per layer is \$20, then the $\phi$ cost for the LPDDR4 based PCB system would be \$140.

Figure 6.17: Octa channel 400 Gb/s LPDDR4 system

For the energy cost $\zeta$, as approximately 1 W is consumed by the LPDDR4 based system, the energy consumption over a 2 year product cycle (17520 hours) is 17.52 kWh. At an energy price of 0.13 \$/kWh, the energy cost $\zeta$ for the assumed product life cycle of 2 years would be \$2.28. Therefore, the total cost metric $\psi$ for the LPDDR4 based 400 Gb/s memory interface system over the 2 year product life cycle is \$467.
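The LPDDR4 cost roll-up can be reproduced with the short sketch below. The function names are hypothetical, and the weights $a_1 = a_2 = a_3 = 1$ are an assumption made here because the three cost terms in the example are already expressed in dollars.

```python
def energy_cost_usd(power_w: float, hours: float, price_usd_per_kwh: float) -> float:
    """Energy cost over the product life: power in kW times hours times price per kWh."""
    return power_w / 1000.0 * hours * price_usd_per_kwh


def total_cost(alpha: float, phi: float, zeta: float,
               a1: float = 1.0, a2: float = 1.0, a3: float = 1.0) -> float:
    """Objective psi of Eq. 6.9; unit weights are an assumption of this sketch."""
    return a1 * alpha + a2 * phi + a3 * zeta


# Figures from the LPDDR4 example in the text.
alpha = 2 * 50.0 + 450.0 * (50.0 / 100.0)  # two packages plus a 450 mm^2 controller package
phi = 7 * 20.0                              # 4 signal layers and 3 plane layers at $20 each
zeta = energy_cost_usd(power_w=1.0, hours=2 * 365 * 24, price_usd_per_kwh=0.13)
psi = total_cost(alpha, phi, zeta)
print(f"alpha=${alpha:.0f}, phi=${phi:.0f}, zeta=${zeta:.2f}, psi=${psi:.0f}")
```

The energy term contributes only a few dollars, so the total of about \$467 is dominated by the package and routing costs.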

**High Bandwidth Memory Interface** HBM supports a per DQ line data rate of 2 Gb/s. To support 400 Gb/s, HBM needs at least 200 DQ lines. Since a single HBM channel contains 128 DQ lines as shown in Table 6.1, a minimum of two HBM channels is required. Each channel has 212 interconnects including data and control lines, so the total number of interconnects needed to support 400 Gb/s is 424. This means that the CPU chip needs 424 I/O cells in the die to support this HBM interface, as shown in Figure 6.18.

As given in the HBM specification, the minimum pitch is 55 µm. The smallest integer square at or above 424 given by the intsq-ceiling function is 441, which is the square of 21. Using Eq. 6.13, the CPU die area calculated for such an interface is $1.21 \times 1.21 \text{ mm}^2$. So, the minimum side length for a square shaped memory controller die for the HBM based 400 Gb/s memory system is 1.21 mm. Assuming a cost of \$100 per HBM die, the $\alpha$ cost for the HBM and controller die is \$101. Assuming a square interposer layer with 2 mm side length and 10 µm interconnect pitch, 2 interposer layers are required for 424 interconnects. If \$100 is the cost of one interposer layer, then the routing cost $\phi$ of the HBM based system would be \$200. The estimated power usage of HBM is 1 mW [60]. Hence, the energy cost $\zeta$ for the 2 year life cycle is negligible. Therefore, the total cost $\psi$ for the HBM based 400 Gb/s system is \$301.

Figure 6.18: HBM based 2.5D interposer 400 Gb/s memory interface

**Wide-I/O2 Memory Interface** For the Wide-I/O2 400 Gb/s system, six data channels are required because a single channel can only provide a maximum bandwidth of 68.2 Gb/s as shown in Table 6.1. A single channel consists of 97 interconnects, yielding a total of 586 signals for the 6 channels including DQ, DQS, control and other clocking and reset signals. The Wide-I/O2 JEDEC standard defines a 40 µm bumping pitch; therefore, the minimum die size with 25% power and ground bumps is $1.3 \times 1.3$ mm<sup>2</sup>. Hence, the $\alpha$ cost for the Wide-I/O2 based system is \$102.

Figure 6.19: Wide-I/O2 based on interposer 400 Gb/s system

In order to route 586 signals on a square interposer layer with 2 mm side length and 10 µm interconnect pitch, 3 metal layers on the interposer are required. The $\phi$ cost of the Wide-I/O2 based 400 Gb/s memory system would therefore be \$300.

The $\zeta$ cost of the system is again negligible due to the low supply voltage and the low driven pad capacitances. This makes the total cost $\psi$ of the Wide-I/O2 based 400 Gb/s memory system \$402.

### 6.2.4 Final Remarks

A comparative analysis of the different memory interfaces for achieving a given data bandwidth has been shown. As the designer moves from LPDDR4 towards 2.5D specific memories, the cost goes down. High bandwidth memory (HBM) has the lowest cost of the available memories while LPDDR4 has the highest. The main deciding cost factors are the controller die and routing layer costs. With the assumed prices, HBM is the most cost-efficient memory for this system. It should be noted that the optimum choice can change with the prices at a certain time and location. Therefore, the methodology shown above is meant to give a design flow and an approach, which must be adapted to the conditions of the product design company. For a particular company, an LPDDR4 based system could be the optimum solution if the die costs and interposer routing layer manufacturing costs are extremely high. Hence, the design flow should be used with care and the cost coefficients should be updated continuously.
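For reference, the three totals from Section 6.2.3 can be collected in one place. The summary below assumes unit conversion factors $a_1 = a_2 = a_3 = 1$, since the example already expresses each term in dollars, and treats the negligible energy costs of the interposer based options as zero.

$$\begin{aligned}
\psi_{\text{LPDDR4}} &= 325 + 140 + 2.3 \approx 467 \\
\psi_{\text{Wide-I/O2}} &= 102 + 300 + 0 = 402 \\
\psi_{\text{HBM}} &= 101 + 200 + 0 = 301
\end{aligned}$$

Under these assumed prices the ordering is $\psi_{\text{HBM}} < \psi_{\text{Wide-I/O2}} < \psi_{\text{LPDDR4}}$, and the routing term $\phi$ is the largest contributor for both interposer based options.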

# 6.3 Conclusion

For multi-chip communication interfaces within a package or on an interposer, minimizing the energy per bit alone is not enough. The resources consumed by the interconnects for routing signals between the chips have to be minimized along with the energy per bit. This chapter presents a holistic design methodology for minimizing the combined energy and signalling pitch cost. The prior art is used as the starting point, and a design flow along with examples is shown to give the optimum energy-pitch solution for given constraints. A detailed example with a current mode logic driver shows the interdependency between the width and spacing of the interconnect and the power consumption of CML signalling. The co-design approach is then used to find the optimum width and spacing for a given data rate as an example.

The second contribution is the memory-CPU design exploration and path finding for MCS. The design flow, along with an algorithm and an example, is demonstrated to help the designer make the correct trade-offs between memory system bandwidth, routing and total costs. The cost factors can be adjusted by the designer according to contemporary market cost trends. The design flow also helps to compare PCB based and interposer based memory systems, and can be used for cost-to-performance comparisons of system topologies at an early design stage.

# Chapter 7

# Conclusion

Silicon integrated circuits have continuously become smaller and faster over the last few decades. This downscaling enhanced the speed of transistors while the speed of interconnects has decreased [7], [6]. Furthermore, downscaling is becoming more difficult as channel lengths reach a few nanometers. Hence, the deep nanometer complementary metal oxide semiconductor (CMOS) technologies of today are constrained by the metal interconnect performance and the difficulty of further downscaling the channel length below a few nanometers.

Over the years, multi-chip systems (MCS) have been introduced to deal with the challenges in large PCBs with packaged ICs, high global interconnect delay in large SOCs, and the More than Moore implementation of future systems. Specifically, More than Moore refers to the incorporation of new functionalities into systems which may not scale like traditional ICs, e.g. analog and mixed signal blocks, RF circuits, sensors and actuators. Hence, a multi-chip system is an integrated system within a package, or on an interposer within a package, containing many dies [21] which need to communicate with each other to meet the requirements of the application. These multi-chip integrated systems (MCS) are generally called multi-chip-module (MCM) [24], system-in-package (SiP) [25] or 2.5D integrated systems [26].



Figure 7.1: Multi-chip communication interface in a two-die MCS

# 7.1 Challenges

The integration of ICs in an MCS faces several design challenges. The interconnect and multi-chip communication environment is different from a standard PCB [31]. The transmitters that were used on PCBs must be re-designed and optimized for multi-chip interfaces. A generic two-die side-by-side multi-chip system with an N-line communication interface is shown in Figure 7.1, where the transceiver physical interface (PHY) blocks in the two dies are shown along with the channel. In this MCS, the **two main design cost metrics** are:

- The power consumption in the transmitter and receiver blocks in the dies for die to die multi-chip communication, generally measured in pJ/bit.
- The routing area resources used on the substrate for communication between dies, defined generally in the form of routing area or signalling pitch  $\rho$  in  $\mu m$  for given number of interconnects N.

The **three main design problems** based on the above design costs can be written as

- Design of a minimum energy per bit transmitter and receiver for a given MCS channel: Chapters 3 and 4
- Design of a minimum area usage channel for a given transmitter and receiver in an MCS: Chapter 5
- Co-design and optimization of channel and transceiver for a minimum energy-area or energy-pitch metric in an MCS: Chapter 6

This thesis deals with the above three design problems in detail; hence, the central theme of the thesis is the design of energy and area efficient multi-chip communication interfaces.

Several factors influence the channel behavior in these systems, e.g. different ESD requirements, shorter channel lengths, smaller widths and spacings between channels, and different channel loss metrics. The length of channel, velocity of signal on these substrates and target data rate influence the impedance matching requirements. Hence, unterminated signalling can also be used in these systems when possible and terminated signalling should be used according to the application needs.

The standard high speed communication circuits (SERDES) comprise several complex blocks at the front ends of both the transmitter and receiver side, e.g. impedance calibration, programmable equalization, drive strength control, pre-driver and driver [57]. Though these blocks cannot be removed completely for multi-chip interfaces, they can be optimized and reduced in complexity for a shorter design cycle and ideally lower power and area consumption [34].

Channel design is a critical element of multi-chip interface [33]. The high speed multi-Gb/s and moderate speed memory interfaces should be analyzed for usage in multi-chip integrated systems. The performance behavior of these circuits shall define the requirements of the channel width, spacing and length [59]. During channel analysis, insight into transmitter optimization can also be derived which could help in the performance optimization of transmitter circuits.

Design methodologies are required for high speed interfaces and memory interfaces in multi-chip systems [77], [76]. They can help the designer to optimize the system by using bandwidth, channel properties, and a transmitter co-design approach. The development of new memory types such as high bandwidth memory (HBM) and Wide-I/O demands an extensive design space exploration of memory interface design. With the availability of silicon interposers and new memory types, the design space has become large. There is a lack of methodologies to select the best memory type and integration technology for a given application. Some memories have a large number of signal lines working at lower speeds while others operate at multi-Gb/s data rates with fewer interconnects. This diversity could be beneficial if the trade-offs in the design space are correctly understood.

Based upon the detailed state of the art analysis described in chapter 2, the design problems stated above for multi-chip communication transmitter, channel and co-design methodologies are further specified and narrowed down. These design problems along with the results are stated below.

# 7.2 Research Summary and Conclusions

This thesis focuses on the three challenges in multi-chip interfaces, i.e. transmitter design, channel design and optimization methodologies for high speed and memory interfaces. The narrowed down design problems based on the state of the art analysis along with results are discussed below.

**Transmitter Problem 1: BOW Standard Transmitter** The first ever transmitter with dual driver topology for the bunch of wires (BOW) interface standard [56] was designed, manufactured and measured in 22 nm FDSOI for multi-chip interfaces. This BOW transmitter design work has been accepted for publication in [111]. According to the standard, the transmitter must support two different driver topologies for different data rates and interconnect lengths. For the target data rate range of 2-16 Gb/s, and based upon an energy per bit comparison of different topologies for the target average interconnect length of 10 mm given in the standard, the source series terminated PMOS-over-NMOS driver topology (SSTL-LCM) was selected for high speed data transmission, and the simple unterminated PMOS-over-NMOS driver topology (HSUL) was chosen for low speed data transmission over short interconnects.

The SSTL-LCM transmitter with impedance and drive strength calibration, along with the HSUL unterminated transmitter, is presented in chapter 3. The transmitter is designed in complete form including all the required transmitter blocks, i.e. clock generation and distribution, clock buffers, PRBS-7 parallel data generator, C<sup>2</sup>MOS high speed digital cells, calibration control enabled pre-driver and slicing architecture based driver. The SSTL-LCM front-end is matched to $50 \Omega$ impedance and can be calibrated using pre-driver calibration bits. Based on the required data rate, channel length and power requirements, calibration can be used to save energy while meeting the minimum signal integrity requirements. Wafer level measurements are correlated to simulation for both the terminated and unterminated drivers in the BOW transmitter. The transmitter is the first ever to report these data rates for multi-chip systems for lengths up to 12 mm, whereas in the state of the art some works achieve high data rates only over very short lengths of a few millimeters and others achieve long lengths only at low data rates. The designed SSTL driver achieves an energy-area cost per unit interconnect length of only $5.22 \times 10^{-6} \text{ pJ/bit} \cdot \text{mm}$, which is lower than the state of the art. Similarly, the unterminated low speed HSUL driver achieves $7.8 \times 10^{-7} \text{ pJ/bit} \cdot \text{mm}$, also lower than the comparable state of the art.

**Transmitter Problem 2: Driver Optimization Example** An example of driver optimization based on run-time interconnect length or data rate is shown, which helps to understand the co-design approach discussed later for transmitter and channel. An unterminated driver using source follower topology was chosen for simplicity of tuning control and is shown to achieve a 1 Gb/s data rate on MCM channels up to 11 mm. The slew rate analysis of the falling and rising edges along with their impact on jitter performance is shown. A detailed theoretical analysis of rise time and bandwidth with regard to transistor operating region and load capacitance is presented. The source follower driver is driven by a four transistor transmission gate pre-driver, which performs the multiplexing of pre-timed even and odd data streams. The driver is compared to other recently published work; it has a small area and achieves an energy-area efficiency per unit length of only $9.0 \times 10^{-7}$ pJ/bit $\cdot$ mm, lower than the state of the art, as shown in chapter 3.

**Channel or Interface Interconnect** The state of the art analysis showed that interface interconnect or channel analysis had been done for high bandwidth memories and Wide-I/O memory. The widely used DDR memories, which are also available as unpackaged dies, could be used for multi-chip 2.5D systems, and they are very well understood due to their extensive usage in industry. The missing analysis of DDR signalling over 2.5D silicon interposer interconnect was performed in chapter 5 to find the most energy-efficient channel design and the length limitations. DDR3 drivers were used as an example to demonstrate the performance over 2.5D interconnect, and optimum termination or ODT settings were derived. The analysis found that the minimum routing area and energy consumption is achieved using a $2 \,\mu$m wide interconnect without ODT or termination in DDR3 signalling, which is possible up to lengths of 30 mm. A similar analysis for a typical PCB channel found that the minimum energy no-ODT case works only up to lengths of 20 mm, which are very rare in general PCB designs.

Analysis of 2.5D channel interconnect for high speed serial data (SERDES) exists, but knowledge is missing regarding the optimal SERDES transceiver settings or the channel width or spacing variation needed to open a closed eye diagram at the receiver end. The analysis performed in chapter 5 shows that increasing the channel width by $5 \times$, from the default $2 \,\mu$m to $10 \,\mu$m, improves the response $H(s)$ for a 30 mm long interconnect at an area cost of $5 \times$ the default width. In contrast, improving the transmitted signal $X(s)$ using emphasis or equalization improves the eye at a power cost of only $1.46 \times$ the default case. This analysis forms the basis for the co-design methodology described in chapter 6 for combined energy and channel area minimization for a given interface bandwidth requirement. The details of the channel analysis for different interfaces have been published in papers [98], [99], [101].

**Design Methodologies Problem 1: Holistic Methodology** State of the art analysis only focused on the energy per bit minimization and did not take into account the cost of the routing resources, which is significant for multi-chip interfaces. A holistic methodology which minimizes the combined energy-channel-pitch cost is presented, along with an example of interconnect width increment versus energy per bit increment. The example shows that using higher energy with a smaller width saves an energy-pitch cost of $0.1 \text{ pJ/bit} \cdot \mu \text{m}$ for a 1 µm line interface as compared to the 2 µm interface. This methodology is accepted for publication in [102].

For high resistivity silicon substrate, this methodology is demonstrated for a current mode logic (CML) transmitter circuit and 2.5D interposer interconnect, as detailed in chapter 6. Compared to the previous example, this demonstrates the significance of the differential pair spacing for the transmitter power consumption and routing area usage. Increasing the spacing raises the impedance and lowers the energy per bit (pJ/bit), whereas decreasing the spacing lowers the routing cost but raises the energy per bit. A combined optimization of energy and routing area is used to get the best energy-area efficiency for a specific data rate. The methodology helps the designer to understand the trade-offs in transmitter plus interconnect design for interposer based signalling. It also shows the value of using a larger spacing between the coupled differential pair interconnects to increase the differential impedance, which can be matched at the transmitter and receiver end. Higher impedance reduces the static energy consumption of the CML driver by reducing the required current for a given voltage swing. This co-design methodology of CML signalling transmitter and channel is published in paper [104].

**Design Methodologies Problem 2: MCS CPU-Memory Interface** The state of the art analysis revealed a lack of prior knowledge on a combined energy-area-cost minimization flow for memory-CPU interface design. Hence, a path finding design exploration methodology for selecting the optimum memory and interconnect type (silicon or PCB) depending upon application requirements is developed, as detailed in chapter 6. It provides a complete design flow for the system designer to choose between the available options. The algorithm helps the designer to calculate the design costs of various memory and integration technologies for a specific required memory interface bandwidth. It analyzes the latest memory standards and uses cost approximations to determine the optimum memory interface for 400 Gb/s bandwidth. Designers can use the algorithm and the test case study to perform a cost analysis at a high level of abstraction and make the optimum decisions at the start of the design process. This work on memory-CPU interfaces is published in papers [108], [107].

# 7.3 Research Outlook

### 7.3.1 General work

The challenges of transmitter design, channel design and design methodologies for multi-chip interfaces were narrowed down into specific design problems, and the results were analysed and compared to the state of the art in chapters 3, 5 and 6, respectively. Future work in multi-chip system design should target all three areas, i.e. transceiver, channel and design methodologies. The following subsections detail possible future research problems and questions in MCS design.

### 7.3.2 BOW Transceiver

The SSTL-LCM driver for the BOW transmitter uses no transmitter equalization. It can be extended with a pre-emphasis implementation, which requires changes to both the pre-driver and driver. An additional control method would be required to generate the pre-emphasis taps in accordance with the channel properties. Another way to extend the driver would be to remove the bias current source and instead use an NMOS-NMOS unmatched push-pull technique. The pre-driver and driver power supplies could be generated by feedback control loops to keep the voltage swing under control and to make sure that the transistors operate only in the desired saturation or triode regions. The minimum possible driver power supply, in the range of 0.2-0.3 V, would be targeted, which could save a significant amount of energy.

The terminated transmitter, including all of its blocks, can be extended by adding a phase-locked loop (PLL) to reduce the jitter at the output. The driver can be extended to even higher data rates in the range of 56 Gb/s NRZ signalling, which would require bandwidth enhancement techniques, e.g. inductive peaking of the driver and pre-emphasis. Clock dividers along with a deeper serializer tree architecture would be needed. Clock duty cycle correction blocks would be required to keep the jitter to a minimum. A receiver block could also be implemented for the terminated transmitter, which would demonstrate the whole interface and could be used as a complete transceiver block.

Receiver design was not performed in this work. In future, a BOW receiver block should be designed with some form of receiver equalization (CTLE, DFE or both) along with forwarded clock synchronization blocks. The receiver and transmitter blocks should then be connected over package using different interconnect widths and lengths to test their performance and calculate the energy-area efficiency per unit length (pJ/bit\*mm).

### 7.3.3 Channel Design

For signal integrity in channel design, near end and far end crosstalk analysis could be added. The crosstalk parameters could influence the minimum possible spacing at given data rates and the routing topology in MCM and interposer technologies. The different types of available materials could be analyzed to determine which are best suited for short unterminated signalling and which substrate materials are suited for multi-chip systems with high speed communication interfaces. An analysis of the impact of ground lines in these systems could be added. Ideally, all the interconnects should be used for signal transmission only, but a thorough analysis of the minimum necessary reference ground lines around or below the signal lines in multi-chip systems could be very helpful. This would save the unnecessary extra routing area used for ground lines and therefore increase the overall signal transmission density and total bandwidth.

### 7.3.4 MCS Holistic Design Methodology

The holistic design methodology could be detailed further for different transceiver topologies, with a theoretical combined evaluation of energy per bit and routing area of the multi-chip interface. The design methodology for the CML driver can be extended to terminated source series drivers. The extended methodology could also include the pre-driver, equalization and clock distribution power consumption in the equation. Different pre-driver and equalization architectures could also be added to the methodology. An optimum energy and area methodology for different high speed transmitter topologies with all the necessary blocks could be very useful for designers of multi-chip interface communication circuits.

### 7.3.5 MCS CPU-Memory Interface

The memory interface design flow could be converted into a design exploration program with options to enter the approximate cost factors for the available memories and integration technologies. A design house could use this software to perform accurate cost analyses of the available options. Furthermore, the channel design techniques could be combined with this flow, which would shorten the design time of the interposer or MCM channels for a given memory interface. A metric for the minimum number of power and ground routing lines could be added to the design flow, which would increase the accuracy of the estimated routing costs of the interface.

# Appendix A IEEE ©MWSCAS 2020 Paper

# 13-Gb/s Transmitter for Bunch of Wires Chip-to-Chip Interface Standard

Muhammad Waqas Chaudhary, Andy Heinig Fraunhofer Institute for Integrated Circuits IIS Division Engineering of Adaptive Systems EAS Zeunerstr. 38, 01069 Dresden, Germany

Abstract—Continuous downscaling of integrated circuits has reached a bottleneck. Technologies such as system in a package, multi-chip module and integration of chips on an active or passive interposer can further improve the system performance. The bunch of wires interface standard was recently introduced for short chip to chip interfaces within a package. This standard requires both terminated and unterminated driver topologies for different data rates and interconnect lengths. This paper presents the first reported transmitter implementation of this interface. Unterminated and terminated impedance controlled drivers with feedback calibration enable transmitter power optimization for a given interconnect based on the respective signal integrity at the receiver side. Results show that this transmitter can support both low and high speed low power communication between chips for interconnects up to 11 mm in length with an energy consumption of 0.34 pJ/bit at the maximum data rate of 13 Gb/s. The transmitter is designed and taped out in the 22 nm FDSOI technology node.

*Index Terms*—Bunch of wires, co-design, interconnect, multichip communication, transmitter.

#### I. INTRODUCTION

For high speed signalling on printed circuit boards (PCB) and backplane cards, techniques such as pulse amplitude modulation (PAM4) are getting popular. Chang and coworkers presented an 80 Gb/s PAM4 transmitter which is impractical for short interfaces [1]. Poulton et al. introduced a ground referenced signalling technique for 25 Gb/s signalling in packages [2]. This supported communication on an interconnect only up to 10 mm. Carusone et al. demonstrated a parallel interface design for chip to chip communication on interposer but only for length up to 4 mm with maximum 20 Gb/s per wire [3]. Active interposer based 2.5D implementation of multi-chip system was shown by Vivet et al. [4]. Extremely short chip-to-chip interconnects were kept purely passive while longer interconnects were enabled by usage of CMOS repeater buffers in the active interposer.

Specifications given in the bunch of wires (BOW) interface proposal require transmitter physical interface (PHY) support for both unterminated and terminated signalling [5]. Lengths up to 10 mm and 2-8 Gb/s data rate should be supported by unterminated signalling. Terminated signalling should support a similar length range for data rates in the range of 4-16 Gb/s. A critical requirement is the availability of control in the interface for impedance calibration in the terminated driver. Similarly, drive strength control in unterminated signalling could help save power for short interconnects. This is required to optimize the PHY energy consumption while enabling flexibility in the interconnect routing on interposer or package.

Bhaskar Choubey, Chair of Analogue Circuits and Image Sensors, Siegen University, Hölderlinstr. 3, 57076 Siegen, Germany

Fig. 1. Transmitter and system architecture

This work presents a generic interface which can be easily transferred from one technology node to another. It presents the library for critical blocks needed in short interfaces. Unterminated and terminated driver (source series terminated SST) along with pre-driver for drive strength control and impedance matching are designed. The complete transmitter is designed from schematic to layout level in 22 nm FDSOI technology and taped out.

Section II presents the complete transmitter architecture and describes the individual blocks in detail. Section III shows the simulation results of the transmitter at different speeds on different interconnects. Finally, Section IV summarizes the work and concludes the paper.

#### II. TRANSMITTER

The transmitter and test system architecture are shown in Figure 1. It consists of a central clock management unit (CMU) with a clock generator and distribution network for complete PHY and test system. Dual phase clock is generated to drive the blocks using high speed C<sup>2</sup>MOS logic architecture [6]. The clock is distributed on chip through 3-stage fanout of 4 (FO4) buffers for driving the load consisting of data generator flip-flops and multiplexers. Pseudorandom PRBS-7 data is generated at the system clock rate for testing the transmitter outputs. For terminated drivers, data is multiplexed using 2:1 multiplexers. Half rate and double data rate signals are then sent to the respective unterminated and terminated pre-drivers and drivers for transmitting the signal out of the chip. Each block is described in detail below.



Fig. 2. Terminated driver schematic

#### A. Predriver and Driver

For chip-to-chip interfaces, the energy consumption of the transmitter can be reduced by decreasing the capacitive load at the pad. For a 60 µm medium diagonal octagonal pad with the top three metal layers, the extracted capacitance is 40 fF. Furthermore, the driver can be placed right beneath the high frequency output signal pad to reduce the wiring capacitance and also increase the bandwidth density of the transmitter. For a 13 Gb/s transmitter with 0.1 mm signal pad pitch, a bandwidth density of $1.3\,\mathrm{Tb/s/mm^2}$ can be achieved. For the unterminated driver, the pre-driver consists of 3-stage fanout of 3 (FO3) inverting buffers while the driver consists of a single large inverting buffer to drive the pad capacitance. The size of the driver inverter stages could be changed in order to cope with different interconnect losses and different pad capacitances in older technology nodes. For the terminated driver, the pre-driver produces pull up and pull down enable signals for the driver pull up and pull down slices. Due to the impedance matching requirements of the driver, the size of the pull up and pull down transistors must be tunable.
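The bandwidth density figure quoted above can be checked with a short calculation. The sketch below assumes one signal pad per square cell of pitch × pitch; the function name is hypothetical.

```python
def bandwidth_density_tbps_per_mm2(rate_gbps, pad_pitch_mm):
    """Bandwidth density in Tb/s/mm^2, assuming one pad per pitch x pitch cell."""
    cell_area_mm2 = pad_pitch_mm ** 2
    return rate_gbps / cell_area_mm2 / 1000.0  # Gb/s -> Tb/s


# 13 Gb/s per pad on a 0.1 mm pitch, as stated in the text.
print(f"{bandwidth_density_tbps_per_mm2(13.0, 0.1):.2f} Tb/s/mm^2")
```

With the values from the text this evaluates to approximately 1.3 Tb/s/mm², in agreement with the stated figure.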

The schematic of the terminated driver topology in this work is shown in Figure 2, where the pull up and pull down pre-driver signals enable or disable the 1-2-4-8$\times$ sized transistors based on the calibration control signals. The fixed transistors are always on, representing the minimum possible drive strength setting. All enabled transistors in the driver operate in the linear or triode region so that their drain-source impedance is defined by the current and voltage relationship across the drain and source terminals. If transistors of the driver enter the saturation region even for a short part of the data bit or unit interval (UI), the reflection coming from the other end of the channel can degrade signal integrity due to the uncontrolled driver output impedance. The driver must be able to absorb the reflections caused by about 20% termination mismatch at the far end. The driver is designed to operate in a 50 $\Omega$ environment. A series resistor is used at the output to enhance linearity. This means that the pull up and pull down impedance of the transistors must be only a few ohms during most of the unit interval.

It should be noted that the widths of the PMOS and NMOS transistors in the driver are chosen to be the same, which can cause the crossing point of the output to move down from the midpoint. This is avoided by reducing the pull up PMOS width and increasing the pull down NMOS width in the pre-driver. This technique helps to reduce the PMOS width of the driver by 0.5 to $0.3\times$ as compared to standard CMOS, where the PMOS is generally designed to be 2 to $3\times$ larger than the NMOS.

#### B. Driver Calibration

In order to calibrate the drive strength of unterminated drivers for different interconnect lengths and widths with different losses, a feedback topology with a pattern checker could be used, as shown in Figure 3. Periodic steady-state (PSS) data is sent on the forwarded clock channels with a driver similar to the data driver and a similar drive strength setting. The periodic steady-state reference voltage $V_{ref}$ extracted at the receiver end is used to slice the incoming test pattern 1110101000, which includes long and short runs of 1 and 0 bits. Until a given number of patterns is detected at the pattern checker in the receiver, the size of the drivers is incremented by a *calibcounter* block in the transmitter. Due to this feedback topology, the interconnect variations are automatically taken into account along with any process and temperature variations at the receiver or transmitter end.

Fig. 3. Unterminated driver size feedback tuning

For terminated drivers, the impedance is calibrated in a similar manner using a pattern checker at the receiver end, but $V_{ref}$ is not extracted from the clock inputs. Instead, a constant reference voltage vdda/4 is used. The calibration bits are incremented until the pattern matches are found. This tuning mechanism has the advantage that the link always starts with minimum drive strength, which is incremented only until the link achieves the minimum acceptable signal quality. Thus, feedback tuning circuitry is necessary for interconnect and driver co-design optimization.
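The feedback calibration described in this subsection can be summarized as a simple search loop. The following is a behavioural sketch only: `toy_receiver`, the 4-bit code range (matching the 1-2-4-8× slices) and the number of required matches are illustrative assumptions, not the implemented *calibcounter* logic.

```python
TEST_PATTERN = "1110101000"  # calibration pattern given in the text


def toy_receiver(code, min_code):
    """Hypothetical link model: below min_code the isolated short bits
    are lost at the slicer, so the pattern checker sees a wrong pattern."""
    return TEST_PATTERN if code >= min_code else "1110000000"


def calibrate(receive, max_code=15, required_matches=4):
    """Start at minimum drive strength and increment the calibration code
    until the checker sees required_matches correct patterns in a row.
    Returns the selected code, or None if no code is sufficient."""
    for code in range(max_code + 1):
        if all(receive(code) == TEST_PATTERN for _ in range(required_matches)):
            return code
    return None


# Example: a hypothetical channel that needs at least code 5.
selected = calibrate(lambda c: toy_receiver(c, min_code=5))
print("selected calibration code:", selected)
```

Because the search starts from the lowest code and stops at the first setting that passes, the link settles at the weakest drive strength that still meets the pattern check, which is the power saving property argued for above.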

#### C. 2:1 Mux

Traditional multiplexers have a 5-latch architecture, where one bit-stream is latched through three latches and the other bit-stream through two latches, followed by a selector run at the half-rate clock ($clk_{hr}$). Chang et al. showed that even a latch-less topology can be used to multiplex the two data streams [7]. However, that is only possible with 4-phase clocks and is shown to work only up to 5 Gb/s full rate ($D_{fr}$). This work does not use 5 latches as in traditional architectures and also avoids the completely latch-less topology due to speed requirements. The 2:1 Mux shown in Figure 4 uses a single latch topology which works up to 13 Gb/s $D_{fr}$. The single latch 2:1 Mux in this transmitter is shown in Figure 5; it is based on the C<sup>2</sup>MOS topology and requires complementary clocks.

Fig. 4. (i) 5-Latch traditional 2:1 mux (ii) Used single Latch 2:1 mux

Fig. 5. C<sup>2</sup>MOS 2:1 Multiplexer

#### D. Clock Management Unit (CMU) and Test Data

A two-phase complementary clock for the system is generated by a 3-stage CMOS inverter based ring oscillator. Instead of placing metal-on-metal or MOS capacitors at the nodes of the ring oscillator, transistors with channel lengths longer than the minimum 20 nm are used. Since CMOS ring oscillators are very sensitive to power supply noise, an on-chip voltage regulator could be used. In this test system, the oscillator power supply is separate from the rest of the system to avoid supply noise. By changing the supply voltage, the clock rate can be changed to meet the required bunch of wires interface clock rate. In order to reduce the phase and duty cycle variation between the two clock phases, complementary small inverters are placed at the three nodes of the ring oscillator. The simulated phase noise of the RCC extracted ring oscillator is shown in Figure 6.

For chip to chip communication in packages and on interposers, chips can operate with a high rate internal clock. In order to fulfill this requirement, this work designs a PRBS-7 parallel data generator instead of a slow shift register based architecture. In order to run the PRBS-7 generator at high clock rates, the flip-flops and XOR cells must be fast enough. The C<sup>2</sup>MOS architecture uses two clock phases and is very suitable for high speed digital structures [6]. The width of the transistors is chosen to be $0.75\,\mu\text{m}$ with the minimum possible length of $20\,\text{nm}$.

Fig. 6. Phase Noise of Oscillator
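For readers unfamiliar with PRBS-7, a behavioural model of the sequence is sketched below. It assumes the commonly used polynomial $x^7 + x^6 + 1$, which the paper does not state explicitly, and it only groups the serial stream into words; a hardware parallel generator would instead compute several bits per clock with XOR trees. All function names are hypothetical.

```python
def prbs7_bits(n_bits, seed=0x7F):
    """Return n_bits of PRBS-7 from a 7-bit Fibonacci LFSR.

    Assumption: polynomial x^7 + x^6 + 1 (feedback from stages 7 and 6).
    """
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def prbs7_words(n_words, width=2, seed=0x7F):
    """Group the serial stream into width-bit words, as a parallel
    generator feeding a 2:1 multiplexer would deliver per clock."""
    bits = prbs7_bits(n_words * width, seed)
    return [bits[i:i + width] for i in range(0, len(bits), width)]


one_period = prbs7_bits(127)
print(sum(one_period), "ones in one 127-bit period")
print(prbs7_words(4))
```

A width of 2 is chosen here only because the transmitter multiplexes two half-rate streams; wider words follow the same pattern.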

#### **III. SIMULATION RESULTS**

The total area consumed by the transmitter, including the oscillator, data generator, multiplexer and driver, is $115\,\mu\text{m} \times 40\,\mu\text{m}$, which is ideal for placing these transmitter blocks in $100\,\mu\text{m}$ pitch pads. The top view of the transmitter layout is shown in Figure 7.

In order to simulate the transmitter and evaluate its performance, s-parameters are measured and simulated for an organic substrate package channel. The length of the measured package interconnect is 3.8 mm. For longer interconnects, the s-parameters can be cascaded. The comparison of measured and simulated s-parameters is shown in Figure 8. The measured line is $10\,\mu\text{m}$ wide with a spacing of $10\,\mu\text{m}$ from the ground lines on both sides.

The extracted capacitance of the pad along with the wiring capacitance of the terminated driver is around 70 fF. Similarly, the receiver chip pad capacitance is assumed to be about 70 fF. The receiver is assumed to be $50\,\Omega$ terminated to ground for terminated signalling. It is important to evaluate the transmitter termination performance for 15-20% far end receiver termination mismatch. Tolerating this mismatch relaxes the receiver termination circuitry and improves the signal integrity. For a 1 ns delay transmission line with $50\,\Omega$ impedance, the received signal with 20% mismatched (i.e. $40\,\Omega$) receiver termination is shown in Figure 9a. The reflections in the eye diagram and the small voltage swing reflect the high output impedance of the driver. The output with full drive strength of the SST driver, set using the calibration bits in the pre-driver, is shown in Figure 9b. It can be seen that with higher drive strength, which lowers the driver output impedance, the reflections are removed from the output and the output swing is increased. In the extracted layout of the transmitter, the maximum generated clock frequency is 6.66 GHz at 1 V supply, which limits the transmitter to 13 Gb/s data rate. For the SST terminated driver, the maximum possible 13.3 Gb/s and 9.16 Gb/s data rate outputs are shown for an 11.4 mm long interconnect in Figure 9c and 9d. For the 13.3 Gb/s simulation, some lines show charge sharing problems in the preceding multiplexer stage. This problem limits the maximum data rate achievable with this architecture to $13\,\mathrm{Gb/s}$.

Fig. 7. Layout of the proposed transmitter

Fig. 8. Measured and simulated s-parameters of 3.8 mm organic interconnect

Fig. 9. (a)(b) 9.16 Gb/s terminated output (SST) at $40\,\Omega$ termination after 1 ns delay $50\,\Omega$ impedance transmission line with minimum and maximum driver strength, respectively (c)(d) 13.3 Gb/s and 9.16 Gb/s SST driver output with maximum drive strength at far end $50\,\Omega$ termination after 11.4 mm organic interconnect, respectively
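The effect of the mismatched far end termination can be estimated with the standard reflection coefficient $\Gamma = (Z_L - Z_0)/(Z_L + Z_0)$. The sketch below evaluates it for the $40\,\Omega$ case from the text; the $100\,\Omega$ value for a weak driver setting is a purely illustrative assumption, not an extracted impedance.

```python
def reflection_coefficient(z_load, z0=50.0):
    """Voltage reflection coefficient of z_load on a line of impedance z0."""
    return (z_load - z0) / (z_load + z0)


# Far end: 40 ohm termination on a 50 ohm line (20% mismatch from the text).
gamma_rx = reflection_coefficient(40.0)

# Near end: the reflected wave is re-reflected by the driver output impedance.
gamma_tx_weak = reflection_coefficient(100.0)   # hypothetical weak setting
gamma_tx_matched = reflection_coefficient(50.0)  # calibrated, matched driver

print(f"far-end reflection:           {gamma_rx:+.3f}")
print(f"round trip, weak driver:      {gamma_rx * gamma_tx_weak:+.3f}")
print(f"round trip, matched driver:   {gamma_rx * gamma_tx_matched:+.3f}")
```

A matched driver absorbs the wave returning from the far end, so the second arrival at the receiver vanishes, which is consistent with the cleaner eye reported for the full drive strength setting.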

In unterminated signalling mode, this work achieves 5 Gb/s on organic interconnects up to 4 mm long and higher data rates on shorter interconnects. The eye diagrams for a 70 fF load with and without interconnect are shown in Figure 10.

The oscillator consumes $200\,\mu\text{A}$ at 6.66 GHz with 1 V supply. The drivers are designed with the same thin oxide transistors as the digital blocks and can sustain a 0.8-1 V power supply. With 1 V oscillator supply and 0.8 V supply for the rest of the transmitter, the total energy consumption is 0.34 pJ/bit at 13.3 Gb/s. This energy performance is very close to recently published works [2][3].
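The energy figure can be converted to average power with the identity pJ/bit × Gb/s = mW. The sketch below performs this conversion and, assuming the oscillator is included in the quoted total, estimates its share; the function name is hypothetical.

```python
def power_mw(energy_pj_per_bit, rate_gbps):
    """Average power in mW, since pJ/bit * Gb/s = mW."""
    return energy_pj_per_bit * rate_gbps


total_mw = power_mw(0.34, 13.3)         # total transmitter power
oscillator_mw = 200e-6 * 1.0 * 1e3      # 200 uA at 1 V, expressed in mW

print(f"total power:      {total_mw:.2f} mW")
print(f"oscillator power: {oscillator_mw:.2f} mW")
print(f"oscillator share: {100.0 * oscillator_mw / total_mw:.1f} %")
```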



Fig. 10. (a)  $4.6\,{\rm Gb/s}$  unterminated driver output at 70 fF load (b)  $4.6\,{\rm Gb/s}$  unterminated driver output after  $3.8\,{\rm mm}$  long interconnect and 70 fF load

#### IV. CONCLUSION

This paper presents a transmitter for the bunch of wires chip-to-chip communication interface standard in multi-chip systems. It offers interconnect based co-design of terminated and unterminated drivers along with the digital $C^2MOS$ library. The design can be conveniently transferred to other technology nodes. It offers an energy efficiency of $0.34\,pJ/bit$ at $13.3\,Gb/s$ on an $11\,mm$ long organic substrate channel.

#### REFERENCES

- [1] Y. Chang, A. Manian *et al.*, "An 80-Gb/s 44-mW wireline PAM4 transmitter," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 8, pp. 2214–2226, 2018.
- [2] J. W. Poulton, J. M. Wilson *et al.*, "A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 43–54, 2019.
- [3] B. Dehlaghi and A. Chan Carusone, "A 0.3 pJ/bit 20 Gb/s/wire parallel interface for die-to-die communication," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 11, pp. 2690–2701, 2016.
- [4] P. Vivet, E. Guthmuller *et al.*, "2.3 A 220GOPS 96-core processor with 6 chiplets 3D-stacked on an active interposer offering 0.6ns/mm latency, 3Tb/s/mm² inter-chiplet interconnects and 156mW/mm² @ 82%-peak-efficiency DC-DC converters," in 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2/16/2020 - 2/20/2020, pp. 46–48.
- [5] M. Kuemerle, R. Farjad, and B. Vinnakota. (2019) Bunch of wires interface proposal rev 0.7. [Online]. Available: https://www.opencompute.org/wiki/Server/ODSA
- [6] B. Razavi, "TSPC logic [A circuit for all seasons]," *IEEE Solid-State Circuits Magazine*, vol. 8, no. 4, pp. 10–13, 2016.
- [7] Y. Chang, A. Manian *et al.*, "A 32-mW 40-Gb/s CMOS NRZ transmitter," in 2018 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 4/8/2018 - 4/11/2018, pp. 1–4.

# Appendix B IEEE ©EPEPS 2020 Paper

# Energy-Area Aware Channel Design for Multi-Chip Interfaces

Muhammad Waqas Chaudhary, Andy Heinig Fraunhofer Institute for Integrated Circuits IIS Division Engineering of Adaptive Systems EAS Zeunerstr. 38, 01069 Dresden, Germany {muhammad.chaudhary, andy.heinig}@eas.iis.fraunhofer.de

Abstract—Multi-chip communication interfaces on an interposer or a package substrate must consume minimum routing area while consuming low power in the transceiver blocks. An algorithm is presented to design the channel in view of energy and area metrics for a given transceiver topology.

*Index Terms*—2.5D/3D interconnects and packages, electronic packages and microsystems, high-speed channels

#### I. INTRODUCTION

Moore's law is going to reach a bottleneck soon which has led to development of multi-chip systems to further enhance the system performance [1]. A memory-processor system on an interposer is shown in Figure 1. These chips must transfer high speed data between each other which has led to the development of chip-to-chip high speed interfaces [2].

These transceivers are designed for a specific channel represented by scattering (S) parameters. They are then optimized at circuit level to achieve minimum power consumption for given interconnect at required data rate [3]. But for optimal usage of space in multi-chip systems, the routing area is an important constraint which should be co-optimized with the transmitter or at least optimized for a given transceiver architecture.

A co-design of area and current mode logic driver was previously presented [4], but it does not consider the equalization needs of the transceiver and the associated power consumption. Lho et al. describe an optimization approach for high speed channels, but the relationship with technology node, equalization requirements and combined energy-area performance is not discussed [5]. This work presents an algorithm for combined optimization of transceiver and channel for minimum energy-area costs.

#### II. DESIGN FLOW AND ALGORITHM

An overview diagram of the design flow is shown in Figure 2. It consists of an extensive interconnect characterization, which is then used to derive the transceiver design constraints, especially with regard to drive strength, impedance matching and equalization. The energy consumption variation of the transceiver for various interconnects is used to develop a combined performance metric of routing area and energy consumption. The energy-area minimum, measured by the performance metric in $pJ/bit \cdot \mu m$ (product of energy efficiency in pJ/bit and signalling pitch in $\mu m$), can then be derived for a given data rate, type of transceiver, substrate material and interconnect length.

Bhaskar Choubey, Chair of Analogue Circuits and Image Sensors, Siegen University, Hölderlinstr. 3, 57076 Siegen, Germany, acis@eti.uni-siegen.de

Fig. 1. Multi-chip interposer system model

The concept behind this flow is that an increase in the width of the interconnect leads to lower interconnect insertion loss but increases the signal routing pitch $\rho$, measured in $\mu m$. The very first step in the flow is to characterize the interconnect for various widths (W) and spacings (S) for a given length and substrate material. The interconnect S-parameters are then evaluated for a given data rate per wire (GSG) in the case of single ended systems and per two wires (GSSG) in differential signalling transceiver architectures.

The decrease in width shall lead to higher transceiver energy consumption and increase in width shall lead to higher signalling pitch. This concept as shown in Figure 2 requires detailed analysis in each step which is described in the algorithm. Design flow in combination with detailed algorithm shall lead to an optimal channel design for a given substrate, bandwidth and transceiver topology.

The algorithm is designed to be holistic and keeps the fixed constraints to a minimum. The design space in the algorithm is widened to include different kinds of signalling topologies and their correlation with channel area consumption and total interface power consumption, which provides an overall system level optimization. Consider $T$ as the set of possible transceiver topologies, e.g. source series terminated signalling (SST), low voltage swing terminated logic (LVSTL) and high swing push-pull signalling (CMOS), i.e. $T = \{SST, LVSTL, CMOS\}$. The power consumption $\phi$ for a given transceiver topology $T_i \in T$ is a function of the signalling pitch $\rho$, defined by interconnect width, spacing and ground width, and is the sum of transmitter and receiver power consumption, given as

$$\phi_{T_i} = \phi_{Tx} + \phi_{Rx}$$

where

$$\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$$

$$\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$$

Here $\phi_{Drv}$ represents the driver power, $\phi_{Eq}$ the equalization power, $\phi_{Ser}$ and $\phi_{DeSer}$ the power of the serialization and de-serialization blocks, and $\phi_{Ckbuf}$ the power of the clock buffering and distribution block. The back-end blocks in transmitter and receiver, e.g. serializer, de-serializer, clock buffers and samplers, are only indirectly influenced by the interconnect width and spacing variations; they are instead defined by the transmitter and receiver front-ends, i.e. driver, receiver amplifier and equalization. The energy-area metric $\psi$ is given as $(\phi / f_b) \cdot \rho$ in $pJ/bit \cdot \mu m$, where $\phi$ is in mW and $f_b$ is the data bit rate in Gb/s.

Fig. 2. Channel design flow

The range of the width $W$ is defined by the minimum $W_{min}$ and maximum $W_{max}$ values in the given interconnect technology. The spacing between interconnects is restricted by the minimum value $S_{min}$ and generally does not exceed a few times the width of the signal line, e.g. $3 \times W$. For single ended GSG signalling using a minimum width ground interconnect, the signalling pitch is given as $\rho = W + W_{min} + 2S$. The final energy-area performance metric $\psi$ is then given as

$$\psi(T_i,\rho) = \frac{\phi}{f_b} \left( W + W_{min} + 2S \right)$$

The algorithm iterates in exhaustive manner through all possible combinations of width, spacing and transceiver topologies to find the minimum energy-area cost combination of (W, S, T).

**Algorithm 1:** Channel design

**Result:** Optimum solution $T_{opt}$, $W_{opt}$, $S_{opt}$

- define width range: $W = \{W_{min}, \dots, W_{max}\}$
- define spacing range: $S = \{S_{min}, \ldots, S_{max}\}$
- define transceiver types: $T_i \in T$
- define data bit rate: $f_b$
- define interconnect average length: $L$
- initialize $\psi_{old}$ (to a value larger than any achievable cost)
- **while** $T_i \in T$ **do**
  - **for** $W \leq W_{max}$ **do**
    - **for** $S \leq S_{max}$ **do**
      - find S-parameters for given $W$, $S$
      - find pulse response for given $f_b$
      - find required number of taps for Tx
      - find required number of DFE taps for Rx
      - calculate power consumption in Tx, Rx: $\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$, $\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$
      - calculate signalling pitch: $\rho = W + W_{min} + 2S$
      - calculate interface energy-area cost: $\psi = \frac{\phi}{f_b} \left( W + W_{min} + 2S \right)$
      - **if** $\psi < \psi_{old}$ **then**
        - $T_{opt} = T_i$, $W_{opt} = W$, $S_{opt} = S$
        - update $\psi_{old} = \psi$
      - **end**
    - **end**
  - **end**
- **end**
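The exhaustive search of Algorithm 1 can be sketched in a few lines. In the sketch below the steps that require field solver and circuit data (S-parameters, pulse response, tap counts, block powers) are hidden behind a user supplied `power_mw` callable. `toy_power` and all of its coefficients are hypothetical placeholders chosen only to make the example run; they are not data from this work.

```python
from itertools import product


def channel_design(topologies, widths, spacings, f_b_gbps, w_min, power_mw):
    """Return ((T_opt, W_opt, S_opt), psi_min) for single-ended GSG signalling.

    power_mw(t, w, s, f_b_gbps) must return phi = phi_Tx + phi_Rx in mW.
    """
    best, psi_min = None, float("inf")
    for t, w, s in product(topologies, widths, spacings):
        phi = power_mw(t, w, s, f_b_gbps)    # interface power in mW
        rho = w + w_min + 2 * s              # signalling pitch in um
        psi = phi / f_b_gbps * rho           # energy-area cost in pJ/bit*um
        if psi < psi_min:                    # keep the best point seen so far
            best, psi_min = (t, w, s), psi
    return best, psi_min


def toy_power(t, w, s, f_b_gbps):
    """Hypothetical model: fixed power per topology plus penalties that
    shrink with wider lines (less equalization) and larger spacing."""
    base = {"SST": 4.0, "LVSTL": 3.0, "CMOS": 6.0}[t]
    return base + 2.0 / w + 0.5 / s


best, psi_min = channel_design(
    ["SST", "LVSTL", "CMOS"], [1.0, 1.5, 2.0], [1.0, 2.0, 3.0],
    f_b_gbps=10.0, w_min=1.0, power_mw=toy_power)
print(best, round(psi_min, 3))
```

With this toy model the minimum lies at an intermediate width, which illustrates the trade-off the algorithm is meant to expose: narrower lines cost equalization power while wider lines cost pitch.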

#### III. CASE STUDY: SILICON SUBSTRATE CHANNEL

In order to explain the algorithm, a silicon interposer chip to chip interface is used here as a case study. The stackup for this system is shown in Figure 3, where two metal layers in silicon dioxide are placed on a silicon substrate. The loss tangent ($\tan\delta$) of silicon depends upon the resistivity and, for a typical $100\,\Omega \cdot cm$ substrate, is chosen to be 0.1 for frequencies around 5-10 GHz [6]. The length of the interconnect is chosen as 10 mm. The impact of varying the width from 1 to 2 µm on the channel insertion loss S21 is shown in Figure 4. The data rate for this study is chosen as 10 Gb/s, which has a Nyquist frequency of 5 GHz, at which the 2 µm wide line has a frequency dependent loss of only -2 dB while the 1 µm wide line has an insertion loss of -7 dB. It should be noted that there is 6 dB higher DC loss in the 1 µm wide line, which means a reduced voltage swing at the Rx input.

In order to understand the equalization and voltage swing requirements, the channel is excited at the transmitter side with a 10 Gb/s pulse with ideal rise time (1 ps) and a unit interval (UI) of 0.1 ns. The received pulse response after the channel is shown in Figure 5. As expected from the high resistance of the interconnect and the resulting DC loss, the voltage swing is just 0.2 V for the 1 $\mu$m wide line.

From the pulse response in Figure 5, it can be seen that there is no pre-cursor inter symbol interference (ISI) for both lines.



Fig. 3. Stackup for silicon interposer based multi-chip system

The signal rises completely within 1 UI, as depicted by the dotted blue line at the 1 UI tick of the x-axis. But both interconnects show some post-cursor ISI, as shown by the red dashed lines. The behavior is similar to an RC exponential voltage decay and is especially significant in the 1 $\mu$m wide line. In order to completely cancel the post-cursor ISI, strong continuous time linear equalization (CTLE) or a number of decision feedback equalization (DFE) taps will be required, which impacts the power consumption of the transceiver. For the 1 $\mu$m wide line, at least two DFE taps are needed to cancel the 2nd and 3rd UI ISI, whereas for the 2 $\mu$m wide line, cancelling the 2nd UI ISI is enough.
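The way a DFE tap count follows from a pulse response can be sketched as follows. The pulse is modelled here as an ideal RC exponential tail after the main cursor; the time constants, the 5% ISI threshold and the function names are illustrative assumptions and do not reproduce the data of Figure 5.

```python
import math


def rc_tail_pulse(amplitude, tau, t_peak):
    """Hypothetical pulse response: zero before t_peak, RC decay after it."""
    def pulse(t):
        return amplitude * math.exp(-(t - t_peak) / tau) if t >= t_peak else 0.0
    return pulse


def sample_cursors(pulse, t_peak, ui, n_post):
    """Sample the main cursor and n_post post-cursors at UI spacing."""
    return [pulse(t_peak + k * ui) for k in range(n_post + 1)]


def required_dfe_taps(cursors, isi_limit=0.05):
    """Index of the last post-cursor above isi_limit * main cursor."""
    main = abs(cursors[0])
    taps = 0
    for k, value in enumerate(cursors[1:], start=1):
        if abs(value) > isi_limit * main:
            taps = k
    return taps


UI = 0.1e-9  # unit interval at 10 Gb/s
for tau in (0.05e-9, 0.08e-9):  # hypothetical fast and slow tails
    cursors = sample_cursors(rc_tail_pulse(0.2, tau, UI), UI, UI, 4)
    print(f"tau = {tau * 1e9:.2f} ns -> {required_dfe_taps(cursors)} DFE tap(s)")
```

A slower tail, as would be expected for the more resistive narrow line, leaves significant residual voltage over more unit intervals and therefore needs more taps.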

The power consumption estimation for different data rates and equalization requirements is based upon the work by Palaniappan et al. [3]. CTLE based post-cursor ISI cancellation is generally used up to 12 dB insertion loss. This is because the CTLE is part of the Rx input amplifier and amplifies the signal and the noise equally. Therefore, for losses much higher than 12 dB, DFE taps are used, which are calculated directly from the pulse response shown in Figure 5.

Fig. 4. S-parameters extracted using HSPICE 2D field solver

Fig. 5. Response for 10 Gb/s input pulse with 1 ps rise time

By using a CTLE for equalization and 0.1 mW/Gb/s of power for every 6 dB of bandwidth peaking [3] at 10 Gb/s in the 90 nm technology node, an extra  $\phi_{Eq}$  of 1 mW is added to the total power consumption  $\phi_{Rx}$  of the 1 µm wide wire interface as compared to the 2 µm wire interface. The equalization constraints in CTLE and DFE directly impact other design parameters for the driver, samplers and clock buffers in Tx and Rx. If these are ignored for a quick comparison of the energy-area metric, then based on the CTLE requirements alone,  $\psi$  is 0.7 and 0.8 pJ/bit  $\cdot \mu m$  for the 1 and 2  $\mu m$  wide wire interfaces respectively.
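The equalization power figure quoted above follows from simple arithmetic, sketched here in Python. The function names are illustrative, and the linear scaling of power with each 6 dB of peaking is an assumed reading of "for every 6 dB", which, as the text notes, applies to the 90 nm node.

```python
def ctle_power_mw(data_rate_gbps, peaking_db, mw_per_gbps_per_6db=0.1):
    """CTLE power in mW, scaled linearly per 6 dB of peaking (assumption)."""
    return mw_per_gbps_per_6db * data_rate_gbps * (peaking_db / 6.0)


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ: mW divided by Gb/s equals pJ/bit."""
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    extra_power = ctle_power_mw(data_rate_gbps=10.0, peaking_db=6.0)
    print(extra_power)                           # extra equalization power in mW
    print(energy_per_bit_pj(extra_power, 10.0))  # same overhead in pJ/bit
```

At 10 Gb/s, 1 mW of additional equalization power corresponds to 0.1 pJ/bit of energy overhead for the narrower line.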

This shows that at specific data rates and equalization requirements, higher power consumption is not that detrimental if the overall energy-area cost metric is used. Moreover, since the CTLE power consumption is constant for 6 and 12 dB peaking in 45 nm technology [3], the  $\psi$  of the wider 2 µm interconnect interface would compare even less favorably there than in the 90 nm node. This leads to the conclusion that wider lines for high speed chip-to-chip links are useful from an energy-area perspective in older technology nodes. For newer nodes in the range of 45, 20 and 14 nm, however, thin wires with high receiver-side equalization requirements are a better choice.

#### IV. CONCLUSION

A design flow for energy-area aware channel design for high speed chip-to-chip links is presented, using a silicon interposer interface as a case study. The flow shows that energy-area tradeoffs can lead to an optimized interconnect width and spacing for a given data rate, transceiver type and technology node.

#### REFERENCES

- [1] P. Vivet, E. Guthmuller *et al.*, "2.3 A 220GOPS 96-core processor with 6 chiplets 3D-stacked on an active interposer offering 0.6ns/mm latency, 3Tb/s/mm² inter-chiplet interconnects and 156mW/mm² @ 82%-peak-efficiency DC-DC converters," in *2020 IEEE International Solid-State Circuits Conference (ISSCC)*. IEEE, 2020, pp. 46–48.
- [2] T. O. Dickson, Y. Liu *et al.*, "A 1.4 pJ/bit, power-scalable 16×12 Gb/s source-synchronous I/O with DFE receiver in 32 nm SOI CMOS technology," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 8, pp. 1917–1931, 2015.
- [3] A. Palaniappan and S. Palermo, "A design methodology for power efficiency optimization of high-speed equalized-electrical I/O architectures," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1421–1431, 2013.
- [4] M. W. Chaudhary and A. Heinig, "Co-design of CML IO and interposer channel for low area and power signaling," in *Formal proceedings of the 2016 IEEE 19th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)*, J. Brenkuš and Stopjaková, Eds. IEEE, 2016, pp. 1–6.
- [5] D. Lho, J. Park *et al.*, "Bayesian optimization of high-speed channel for signal integrity analysis," in *EPEPS 2019*. IEEE, 2019, pp. 1–3.
- [6] R.-Y. Yang, C.-Y. Hung *et al.*, "Loss characteristics of silicon substrate with different resistivities," *Microwave and Optical Technology Letters*, vol. 48, no. 9, pp. 1773–1776, 2006.

# Bibliography

- R. G. Arns, "The other transistor: early history of the metal-oxide semiconductor field-effect transistor," *Engineering Science & Education Journal*, vol. 7, no. 5, pp. 233–240, 1998.
- [2] W. Jacobi/SIEMENS AG, "Halbleiterverstärker," Patent DE833366C, 14 April 1949.
- [3] N. H. E. Weste and D. M. Harris, *CMOS VLSI design: A circuits and systems perspective*, 4th ed. Boston: Addison Wesley, 2011.
- [4] M. T. Bohr and I. A. Young, "Cmos scaling trends and beyond," *IEEE Micro*, vol. 37, no. 6, pp. 20–29, 2017.
- [5] WikiChip. Process technology history intel. Accessed on: May 06, 2021.
   [Online]. Available: https://en.wikichip.org/wiki/intel/process
- [6] R. Brain, "Interconnect scaling: Challenges and opportunities," in 2016 IEEE International Electron Devices Meeting (IEDM). IEEE, 12/3/2016 - 12/7/2016, pp. 9.3.1–9.3.4.
- [7] M. T. Bohr, "Interconnect scaling-the real limiter to high performance ulsi," in *International electron devices meeting*. IEEE, 1995, pp. 241–244.
- [8] J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and B. Sinharoy, "Power4 system microarchitecture," *IBM Journal of Research and Development*, vol. 46, no. 1, pp. 5–25, 2002.
- [9] Samsung. (2020) Spezifikation | galaxy s20, s20+ und s20 ultra | samsung de. Accessed on: May 06, 2021. [Online]. Available: https://www.samsung.com/de/smartphones/galaxy-s20/specs/
- [10] Barrett, "Microprocessor evolution and technology impact," in 1997 symposium on VLSI technology. IEEE, 1997, pp. 7–10.
- [11] J. Chang. (2020, 01) Processors. Accessed on: May 06, 2021. [Online]. Available: http://www.qdpma.com/CPU/CPU.html
- [12] (2020, 06) Lpdram. Accessed on: May 06, 2021. [Online]. Available: https://www.micron.com/products/dram/lpdram/
- [13] S. I. Association. (2009,(09)The international technology for semiconductors (itrs). roadmap Accessed on: May 06. 2021.Online. Available: https://www.semiconductors.org/resources/ 2009-international-technology-roadmap-for-semiconductors-itrs/

- [14] A. L. S. Loke, D. Yang, T. T. Wee, J. L. Holland, P. Isakanian, K. Rim, S. Yang, J. S. Schneider, G. Nallapati, S. Dundigal, H. Lakdawala, B. Amelifard, C. Lee, B. McGovern, P. S. Holdaway, X. Kong, and B. M. Leary, "Analog/mixed-signal design challenges in 7-nm cmos and beyond," in 2018 *IEEE Custom Integrated Circuits Conference (CICC)*. IEEE, 4/8/2018 -4/11/2018, pp. 1–8.
- [15] A. B. Kahng, "Scaling: More than moore's law," IEEE Design & Test of Computers, vol. 27, no. 3, pp. 86–87, 2010.
- [16] B. E. Owens, S. Adluri, P. Birrer, R. Shreeve, S. K. Arunachalam, K. Mayaram, and T. S. Fiez, "Simulation and measurement of supply and substrate noise in mixed-signal ics," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 2, pp. 382–391, 2005.
- [17] R. Harris. (2014, 08) Why doesn't intel put dram on the cpu? Accessed on: May 06, 2021. [Online]. Available: https://www.zdnet.com/article/ why-doesnt-intel-put-dram-on-the-cpu/
- [18] Shukri Souri, "3d ics interconnect performance modeling and analysis," Ph.D. dissertation, StanfordUniversity, 2002.
- [19] A. B. Kahng and S. Muddu, "An analytical delay model for rlc interconnects," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Sys*tems, vol. 16, no. 12, pp. 1507–1514, 1997.
- [20] A. Blodgett, "A multilayer ceramic multichip module," *IEEE Transactions on Components, Hybrids, and Manufacturing Technology*, vol. 3, no. 4, pp. 634–637, 1980.
- [21] R. R. Tummala, *Fundamentals of microsystems packaging*. New York: McGraw-Hill, 2001.
- [22] J. Lau. (2017,(08)Mcm, and heterogesip, soc, neous integration defined and explained. Accessed on: May 06, 2021. [Online]. Available: https://www.3dincites.com/2017/08/ mcm-sip-soc-and-heterogeneous-integration-defined-and-explained/
- [23] D. B. Papworth, "Tuning the pentium pro microarchitecture," *IEEE Micro*, vol. 16, no. 2, pp. 8–15, 1996.
- [24] D. Suggs, D. Bouvier, M. Clark, K. Lepak, and M. Subramony, "Amd "zen 2"," in 2019 IEEE Hot Chips 31 Symposium (HCS). IEEE, 2019, pp. 1–24.
- [25] W. W.-M. Dai, "Historical perspective of system in package (sip)," IEEE Circuits and Systems Magazine, vol. 16, no. 2, pp. 50–61, 2016.
- [26] B. Banijamali, S. Ramalingam, K. Nagarajan, and R. Chaware, "Advanced reliability study of tsv interposers and interconnects for the 28nm technology fpga," in *Sixty First Electronic Components & Technology Conference*. IEEE, 2011, pp. 285–290.

- [27] S. Lakka, "Xilinx ssi technology concept to silicon development overview," in HCS. IEEE, 2016, pp. 1–22.
- [28] H. Kim, S.-J. Ahn, Y. G. Shin, K. Lee, and E. Jung, "Evolution of nand flash memory: From 2d to 3d as a storage market leader," in 2017 IEEE International Memory Workshop (IMW). IEEE, 5/14/2017 - 5/17/2017, pp. 1–4.
- [29] P. Leduc, N. Sillon, S. Maitrejean, D. Louis, G. Passemard, F. de Crecy, M. Fayolle, B. Charlet, T. Enot, M. Zussy, B. Jones, J.-C. Barbe, and N. Kernevez, "Challenges for 3d ic integration: bonding quality and thermal management," in *Proceedings of the IEEE 2007 International Interconnect Technology Conference*. IEEE, 2007, pp. 210–212.
- [30] Z. Toprak-Deniz, J. E. Proesel, J. F. Bulzacchelli, H. A. Ainspan, T. O. Dickson, M. P. Beakes, and M. Meghelli, "A 128-gb/s 1.3-pj/b pam-4 transmitter with reconfigurable 3-tap ffe in 14-nm cmos," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 1, pp. 19–26, 2020.
- [31] M. Shimada, "Low cost mcm package for mobile telecommunication," in 1998 International conference on multichip modules and high density packaging. Institute of Electrical and Electronics Engineers, 1998, pp. 124–128.
- [32] N. Zhou, L. Wu, Z. Wang, X. Zheng, W. Cao, C. Zhang, F. Li, and Z. Wang, "A 28-gb/s transmitter with 3-tap ffe and t-coil enhanced terminal in 65-nm cmos technology," in 2016 14th IEEE International New Circuits and Systems Conference (NEWCAS). IEEE, 2016, pp. 1–4.
- [33] B. Dehlaghi and A. Chan Carusone, "A 0.3 pj/bit 20 gb/s/wire parallel interface for die-to-die communication," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 11, pp. 2690–2701, 2016.
- [34] M.-S. Lin, C.-C. Tsai, C.-H. Chang, W.-H. Huang, Y.-Y. Hsu, S.-C. Yang, C.-M. Fu, M.-H. Chou, T.-C. Huang, C.-F. Chen, T.-C. Huang, S. Adham, M.-J. Wang, W. W. Shen, and A. Mehta, "A 1 tbit/s bandwidth 1024 b pll/dll-less edram phy using 0.3 v 0.105 mw/gbps low-swing io for cowos application," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 4, pp. 1063–1074, 2014.
- [35] "Design techniques for a 60 gb/s 173 mw wireline receiver frontend in 65 nm cmos technology," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 4, pp. 871–880, 2016.
- [36] M.-H. Chien, Y.-L. Lee, J.-R. Goh, and S.-J. Chang, "A low power duobinary voltage mode transmitter," in 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017, pp. 1–6.
- [37] G.-S. Jeong, S.-H. Chu, Y. Kim, S. Jang, S. Kim, W. Bae, S.-Y. Cho, H. Ju, and D.-K. Jeong, "A 20 gb/s 0.4 pj/b energy-efficient transmitter driver architecture utilizing constant gm," in *Solid-State Circuits Conference (A-SSCC)*, 2015 IEEE Asian. IEEE, 11/9/2015 - 11/11/2015, pp. 1–4.
- [38] B. Kim, Y. Liu, T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, "A 10gb/s compact low-power serial i/o with dfe-iir equalization in 65-nm cmos," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 12, pp. 3526–3538, 2009.
- [39] K.-L. Wong, H. Hatamkhani, M. Mansuri, and C.-K. Yang, "A 27-mw 3.6gb/s i/o transceiver," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 4, pp. 602–612, 2004.
- [40] T. O. Dickson, Y. Liu, S. V. Rylov, B. Dang, C. K. Tsang, P. S. Andry, J. F. Bulzacchelli, H. A. Ainspan, X. Gu, L. Turlapati, M. P. Beakes, B. D. Parker, J. U. Knickerbocker, and D. J. Friedman, "An 8x 10-gb/s source-synchronous i/o system based on high-density silicon carrier interconnects," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 4, pp. 884–896, 2012.
- [41] J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell, J. M. Wilson, and C. T. Gray, "A 0.54 pj/b 20 gb/s ground-referenced single-ended short-reach serial link in 28 nm cmos for advanced packaging applications," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 12, pp. 3206–3218, 2013.
- [42] J. W. Poulton, J. M. Wilson, W. J. Turner, B. Zimmer, X. Chen, S. S. Kudva, S. Song, S. G. Tell, N. Nedovic, W. Zhao, S. R. Sudhakaran, C. T. Gray, and W. J. Dally, "A 1.17-pj/b, 25-gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperatureadaptive voltage regulator," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 43–54, 2019.
- [43] J. H. Shim, S. Byun, J. C. Lee, K. Kim, and C. S. Kim, "A low-power 10gb/s 0.13-um cmos transmitter for oc-192/stm-64 applications," in 2007 50th Midwest Symposium on Circuits and Systems, 2007, pp. 1165–1168.
- [44] F. Celik, A. Akkaya, A. Tajalli, A. Burg, and Y. Leblebici, "Jesd204b compliant 12.5 gb/s lvds and sst transmitters in 28 nm fd-soi cmos," in 2019 15th Conference on Ph.D Research in Microelectronics and Electronics (PRIME), 2019, pp. 101–104.
- [45] H. S. Gupta, R. M. Parmar, and R. K. Dave, "High speed lvds driver for serdes," in 2009 International Conference on Emerging Trends in Electronic and Photonic Devices & Systems (ELECTRO-2009), Varanasi, India, 22-24 December 2009. IEEE, 2009, pp. 92–95.
- [46] Wonki-Park and S.-C. Lee, "Design of lvds driver based cmos transmitter for a high speed serial link," in *International Conference on Electronics and Information Engineering (ICEIE)*, 2010. IEEE, 2010, pp. V1–300–V1–302.
- [47] A. Tajalli and Y. Leblebici, "Tajalli slew controlled lvds driver 180nm," IEEE Journal of Solid-State Circuits, vol. 44, no. 2, pp. 538–548, 2009.
- [48] W. He, F. Ye, and J. Ren, "A 40gb/s low power transmitter with 2-tap ffe and 40:1 mux in 28nm cmos technology," in 2019 IEEE 13th International Conference on ASIC (ASICON), 2019, pp. 594–597.

- [49] J. Kim, A. Balankutty, R. Dokania, A. Elshazly, H. S. Kim, S. Kundu, S. Weaver, K. Yu, and F. O'Mahony, "A 112gb/s pam-4 transmitter with 3-tap ffe in 10nm cmos," in 2018 IEEE International Solid-State Circuits Conference (ISSCC 2018). IEEE, 2018, pp. 102–104.
- [50] F. Lv, X. Zheng, Z. Wang, J. Wang, and F. Li, "A 50gb/s low power pam4 serdes transmitter with 4-tap ffe and high linearity output voltage in 65nm cmos technology," in 2015 IEEE 11th International Conference on ASIC (ASI-CON), 2015, pp. 1–4.
- [51] N. K. Ramamoorthy, J. R. M, and V. Muniyappa, "High speed serial link transmitter for 10gig ethernet applications," in 2010, 23rd International Conference on VLSI Design. IEEE Computer Society, 2010, pp. 246–251.
- [52] J. Wang, S. Ma, P. D. S. Manoj, M. Yu, R. Weerasekera, and H. Yu, "Highspeed and low-power 2.5d i/o circuits for memory-logic-integration by throughsilicon interposer," in 2013 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 2013, pp. 1–4.
- [53] S.-H. Lee, S.-K. Lee, B. Kim, H.-J. Park, and J.-Y. Sim, "Current-mode transceiver for silicon interposer channel," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 9, pp. 2044–2053, 2014.
- [54] T. O. Dickson, Y. Liu, S. V. Rylov, A. Agrawal, S. Kim, P.-H. Hsieh, J. F. Bulzacchelli, M. Ferriss, H. A. Ainspan, A. Rylyakov, B. D. Parker, M. P. Beakes, C. Baks, L. Shan, Y. Kwark, J. A. Tierno, and D. J. Friedman, "A 1.4 pj/bit, power-scalable 16×12 gb/s source-synchronous i/o with dfe receiver in 32 nm soi cmos technology," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 8, pp. 1917–1931, 2015.
- [55] A. Tajalli, K. L. Hofstra, B. Holden, A. Hormati, J. Keay, Y. Mogentale, V. Perrin, J. Phillips, S. Raparthy, A. Shokrollahi, D. Stauffer, M. B. Parizi, R. Simpson, A. Stewart, G. Surace, O. T. Amiri, E. Truffa, A. Tschank, R. Ulrich, C. Walter, A. Singh, D. A. Carnelli, C. Cao, K. Gharibdoust, D. Gorret, A. Gupta, C. Hall, and A. Hassanin, "A 1.02-pj/b 20.83-gb/s/wire usr transceiver using cnrz-5 in 16-nm finfet," *IEEE Journal of Solid-State Circuits*, pp. 1–16, 2020.
- [56] M. Kuemerle, R. Farjad, and B. Vinnakota. (2019, 10) Bunch of wires interface proposal rev 0.7. Accessed on: May 06, 2021. [Online]. Available: http://files.opencompute.org/oc/public.php?service=files& t=6bfc2493f2f3e0a1d1a14a3314062bdd&download
- [57] Y. Chang, A. Manian, L. Kong, and B. Razavi, "An 80-gb/s 44-mw wireline pam4 transmitter," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 8, pp. 2214–2226, 2018.
- [58] J.-S. Kim, C. S. Oh, H. Lee, D. Lee, H.-R. Hwang, S. Hwang, B. Na, J. Moon, J.-G. Kim, H. Park, J.-W. Ryu, K. Park, S.-K. Kang, S.-Y. Kim, H. Kim, J.-M. Bang, H. Cho, M. Jang, C. Han, J.-B. Lee, K. Kyung, J.-S. Choi, and Y.-H.

Jun, "A 1.2v 12.8gb/s 2gb mobile wide-i/o dram with 4x128 i/os using tsvbased stacking," in 2011 IEEE International Solid-State Circuits Conference, 2011, pp. 496–498.

- [59] C.-C. Wang, H.-H. Cheng, M.-F. Chung, P.-C. Pan, C.-T. Chiu, and C.-P. Hung, "High bandwidth application with wide i/o memory on 2.5d-ic silicon interposer," in 2013 IEEE 3rd CPMT Symposium Japan (ICSJ). IEEE, 2013, pp. 1–4.
- [60] J. C. Lee, J. Kim, K. W. Kim, Y. J. Ku, D. S. Kim, C. Jeong, T. S. Yun, H. Kim, H. S. Cho, Y. O. Kim, J. H. Kim, J. H. Kim, S. Oh, H. S. Lee, K. H. Kwon, D. B. Lee, Y. J. Choi, J. Lee, H. G. Kim, J. H. Chun, J. Oh, and S. H. Lee, "18.3 a 1.2v 64gb 8-channel 256gb/s hbm dram with peripheral-base-die architecture and small-swing technique on heavy load interface," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 318–319.
- [61] K. Cho, H. Lee, and J. Kim, "Signal and power integrity design of 2.5d hbm (high bandwidth memory module) on si interposer," in 2016 Pan Pacific Microelectronics Symposium (Pan Pacific). IEEE, 2016, pp. 1–5.
- [62] H. Lee, K. Cho, H. Kim, S. Choi, J. Lim, and J. Kim, "Electrical performance of high bandwidth memory (hbm) interposer channel in terabyte/s bandwidth graphics module," in 2015 International 3D Systems Integration Conference (3DIC). IEEE / Institute of Electrical and Electronics Engineers Incorporated, 2015, pp. TS2.2.1–TS2.2.4.
- [63] C.-C. Wang, H.-H. Cheng, M.-F. Chung, P.-C. Pan, C.-Y. Ho, C.-T. Chiu, and C.-P. Hung, "High bandwidth application on 2.5d ic silicon interposer," in *Electronic Packaging Technology (ICEPT)*, 2014 15th International Conference on. IEEE, 8/12/2014 - 8/15/2014, pp. 568–572.
- [64] H. Lee, K. Cho, H. Kim, S. Choi, J. Lim, H. Shim, and J. Kim, "Design and signal integrity analysis of high bandwidth memory (hbm) interposer in 2.5d terabyte/s bandwidth graphics module," in *Electrical Performance of Electronic Packaging and Systems (EPEPS), 2015 IEEE 24th.* IEEE, 2015, pp. 145–148.
- [65] S. Choi, H. Kim, D. H. Jung, J. J. Kim, J. Lim, H. Lee, K. Cho, and J. Kim, "Eye-diagram estimation and analysis of high-bandwidth memory (hbm) interposer channel with crosstalk reduction schemes on 2.5d and 3d ic," in 2016 IEEE International Symposium on Electromagnetic Compatibility (EMC 2016). IEEE, 2016, pp. 425–429.
- [66] K. Chandrasekar, D. Oh, and A. Rahman, "Timing analysis for wide io memory interface applications with silicon interposer," in *Electromagnetic Compatibility (EMC), 2014 IEEE International Symposium on.* IEEE, 8/4/2014 -8/8/2014, pp. 46–51.
- [67] R. Egawa, M. Sato, J. Tada, and H. Kobayashi, "Vertically integrated processor and memory module design for vector supercomputers," in 2013 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 2013, pp. 1–6.

- [68] B. Dehlaghi, R. Beerkens, D. Tonietto, and A. C. Carusone, "Interconnect technologies for terabit-per-second die-to-die interfaces," in 2016 IEEE Compound Semiconductor Integrated Circuit Symposium (CSICS). IEEE, 10/23/2016 -10/26/2016, pp. 1–4.
- [69] M. A. Karim and P. D. Franzon, "A 0.65 mw/gbps 30 gbps capacitive coupled 10 mm serial link in 2.5d silicon interposer," in 2015 IEEE 24th Conference on Electrical Performance of Electronic Packaging and Systems. IEEE, 2015, pp. 131–134.
- [70] H. Kim, J. Cho, D. H. Jung, J. J. Kim, S. Choi, J. Kim, J. Lee, and K. Park, "Design and measurement of a compact on-interposer passive equalizer for chip-to-chip high-speed differential signaling," in 9th Intl. Workshop on Electromagnetic Compatibility of Integrated Circuits (EMC Compo), 2013. IEEE, 2013, pp. 5–9.
- [71] S.-J. Xue, X. Chen, and J.-F. Jiang, "Modeling and optimization of high speed transmission structure on silicon interposer," in *ICSICT-2016*. IEEE Press, 2016, pp. 536–538.
- [72] B. Sawyer, B. C. Chou, J. Tong, W. Vis, K. Panayappan, S. Deng, H. Tournier, V. Sundaram, and R. Tummala, "Design and demonstration of 2.5d glass interposers as a superior alternative to silicon interposers for 28 gbps signal transmission," in 2016 IEEE 66th Electronic Components and Technology Conference (ECTC). IEEE, 5/31/2016 - 6/3/2016, pp. 972–977.
- [73] N. Kim, D. Wu, J. Carrel, J.-H. Kim, and P. Wu, "Channel design methodology for 28gb/s serdes fpga applications with stacked silicon interconnect technology," in *IEEE 62nd Electronic Components and Technology Conference (ECTC)*, 2012. IEEE, 2012, pp. 1786–1793.
- [74] H. Hatamkhani and C.-K. K. Yang, "Power analysis for high-speed i/o transmitters," in 2004 symposium on VLSI circuits. IEEE, 2003, pp. 142–145.
- [75] G. Balamurugan, B. Casper, J. E. Jaussi, M. Mansuri, F. O'Mahony, and J. Kennedy, "Modeling and analysis of high-speed i/o links," *IEEE Transactions on Advanced Packaging*, vol. 32, no. 2, pp. 237–247, 2009.
- [76] H. Hatamkhani and C.-K. Ken Yang, "A study of the optimal data rate for minimum power of i/os," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 53, no. 11, pp. 1230–1234, 2006.
- [77] A. Palaniappan and S. Palermo, "A design methodology for power efficiency optimization of high-speed equalized-electrical i/o architectures," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1421–1431, 2013.
- [78] H. Hatamkhani, F. Lambrecht, V. Stojanovic, and C.-K. K. Yang, "Powercentric design of high-speed i/os," in 2006 43rd ACM. IEEE, 2006, p. 867.
- [79] D. Xu, N. Yu, P. D. Sai Manoj, K. Wang, H. Yu, and M. Yu, "A 2.5-d memorylogic integration with data-pattern-aware memory controller," *IEEE Design & Test*, vol. 32, no. 4, pp. 1–10, 2015.

- [80] F. Yazdani and J. Park, "Pathfinding methodology for optimal design and integration of 2.5d/3d interconnects," in 2014 Ieee 64Th Electronic Components and Technology Conference (Ectc). IEEE, May 2014, pp. 1667–1672.
- [81] M. Huo, Q. Guo, Y. Han, L. Shen, Q. Liu, B. Song, Q. Ma, K. Zhu, Y. Shen, X. Du, and S. Dong, "A case study of problems in jedec hbm esd test standard," *IEEE Transactions on Device and Materials Reliability*, vol. 9, no. 3, pp. 361– 366, 2009.
- [82] A. Righter, T. Welsher, M. Farris, M. Johnson, S. Ward, M. Dekker, T. Maloney, R. Ashton, L. G. Henry, T. Meuse, J. Barth, E. Grund, T. Smedes, and P. Ngan, "Progress towards a joint esda/jedec cdm standard: Methods, experiments, and results," in *Electrical Overstress / Electrostatic Discharge* Symposium Proceedings 2012, 2012, pp. 10–10.
- [83] G. S. Alliance. (2015, 01) Electrostatic discharge (esd) in 3d-ic packages version 1.0. Accessed on: May 06, 2021. [Online]. Available: https://www.3dincites. com/wp-content/uploads/GSA-ESDA-3D-IC\_ESD\_Whitepaper\_1.pdf
- [84] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolić, Digital integrated circuits: A design perspective / Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolić, 2nd ed., ser. Prentice Hall electronics and VLSI series. Upper Saddle River, N.J.: Pearson Education, 2003.
- [85] E. Bogatin, Signal integrity Simplified, ser. Prentice Hall modern semiconductor design series. Upper Saddle River, N.J. and Great Britain: Prentice Hall, 2004.
- [86] L. Shufang and W. Dong, "Trace termination design," in 2005 IEEE International Symposium on Microwave, Antenna, Propagation and EMC Technologies for Wireless Communications, vol. 1, 2005, pp. 682–687.
- [87] MIT. Polyimide. Accessed on: May 06, 2021. [Online]. Available: http://www.mit.edu/~6.777/matprops/polyimide.htm
- [88] H. Y. Song, S. J. Jang, J. S. Kwak, C. S. Kim, C. M. Kang, D. H. Jeong, Y. S. Park, M. S. Park, K. S. Byun, W. J. Lee, Y. C. Cho, W. H. Shin, Y. U. Jang, S. W. Hwang, Y. H. Jun, and S. I. Cho, "A 1.2 gb/s/pin double data rate sdram with on-die-termination," in 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC., vol. 1, 2003, pp. 314–496.
- [89] B. Razavi, "Tspc logic [a circuit for all seasons]," IEEE Solid-State Circuits Magazine, vol. 8, no. 4, pp. 10–13, 2016.
- [90] E. Laskin and S. P. Voinigescu, "A 60 mw per lane, 4x, 23-gb/s 2<sup>7</sup> -1 prbs generator," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 10, pp. 2198– 2208, 2006.
- [91] B. Nikolic, V. G. Oklobdzija, V. Stojanovic, W. Jia, J. K.-S. Chiu, and M. Ming-Tak Leung, "Improved sense-amplifier-based flip-flop: design and measurements," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 6, pp. 876– 884, 2000.

- [92] C.-K. Lee, Y.-J. Eom, J.-H. Park, J. Lee, H.-R. Kim, K. Kim, Y. Choi, H.-J. Chang, J. Kim, J.-M. Bang, S. Shin, H. Park, S. Park, Y.-R. Choi, H. Lee, K.-H. Jeon, J.-Y. Lee, H.-J. Ahn, K.-H. Kim, J.-S. Kim, S. Chang, H.-R. Hwang, D. Kim, Y.-H. Yoon, S.-H. Hyun, J.-Y. Park, Y.-G. Song, Y.-S. Park, H.-J. Kwon, S.-J. Bae, T.-Y. Oh, I.-D. Song, Y.-C. Bae, J.-H. Choi, K.-I. Park, S.-J. Jang, and G.-Y. Jin, "23.2 a 5gb/s/pin 8gb lpddr4x sdram with power-isolated lvstl and split-die architecture with 2-die zq calibration scheme," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 390–391.
- [93] JEDEC, "Jedec addendum no. 1 to jesd209-4 low power double data rate 4x (lpddr4x)," JANUARY 2017.
- [94] T. C. Carusone, D. Johns, and K. W. Martin, Analog integrated circuit design, 2nd ed. Hoboken NJ: John Wiley & Sons, 2012.
- [95] C.-Y. Hung and M.-H. Weng, "Investigation of the silicon substrate with different substrate resistivities for integrated filters with excellent performance," *IEEE Transactions on Electron Devices*, vol. 59, no. 4, pp. 1164–1171, 2012.
- [96] Micron. Ddr3 sdram. Accessed on: May 08, 2021. [Online]. Available: https://www.micron.com/products/dram/ddr3-sdram
- [97] K. Sohn, T. Na, I. Song, Y. Shim, W. Bae, S. Kang, D. Lee, H. Jung, H. Jeoung, K.-W. Lee, J. Lee, and B. Lee, "A 1.2v 30nm 3.2gb/s/pin 4gb ddr4 sdram with dual-error detection and pvt-tolerant data-fetch scheme," in 2012 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2012, pp. 38–40.
- [98] M. W. Chaudhary and A. Heinig, "Heterogeneous interposer based integration of chips onto interposer to achieve high speed interfaces for adc application," in 2017 IEEE 19th Electronics Packaging Technology Conference (EPTC). IEEE, 2017, pp. 1–5.
- [99] M. W. Chaudhary and A. Heinig, "High speed serial interfaces in 2.5d integrated systems," *International Symposium on Microelectronics*, vol. 2016, no. 1, pp. 000155–000159, 2016.
- [100] Intel. Spice models for intel fpgas. Accessed on: May 08, 2021. [Online]. Available: https://www.intel.de/content/www/de/de/programmable/ support/support-resources/download/board-layout-test/hspice.html
- [101] M. Waqas Chaudhary, A. Heinig, and M. Dittrich, "Interposer based integration to achieve high speed interfaces for adc application," in 2016 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 2016, pp. 1–4.
- [102] M. W. Chaudhary, A. Heinig, and B. Choubey, "Energy-area aware channel design for multi-chip interfaces," in 29th Conference on Electrical Performance of Electronic Packaging and Systems. IEEE, 2020.
- [103] R.-Y. Yang, C.-Y. Hung, Y.-K. Su, M.-H. Weng, and H.-W. Wu, "Loss characteristics of silicon substrate with different resistivities," *Microwave and Optical Technology Letters*, vol. 48, no. 9, pp. 1773–1776, 2006.

- [104] M. W. Chaudhary and A. Heinig, "Co-design of cml io and interposer channel for low area and power signaling," in *Formal proceedings of the 2016 IEEE 19th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS).* IEEE, 2016, pp. 1–6.
- [105] M. B. Steer, Microwave and RF design: A systems approach. Raleigh N.C.: SciTech Pub, 2010.
- [106] T. H. Cormen, Introduction to algorithms, 3rd ed. Cambridge, Mass. and London: MIT Press, 2009.
- [107] A. Heinig, M. W. Chaudhary, R. Fischbach, and M. Dittrich, "Design challenges in interposer based 3-d memory logic interface," *International Sympo*sium on Microelectronics, vol. 2015, no. 1, pp. 000050–000054, 2015.
- [108] M. W. Chaudhary and A. Heinig, "Interposer based integration of advanced memories and an asic," in *IWLPC International wafer level packaging conference*, ser. IWLPC Conference Proceedings, Oct 2015.
- [109] S.-M. Lee, J. Oh, J. Choi, S. Ko, D. Kim, K. Koo, J. Choi, Y. Nam, S. Park, H. Lee, E. Kim, S. Jung, K. Chae, S. Kim, S. Park, S. Lee, and S. Park, "23.6 a 0.6v 4.266gb/s/pin lpddr4x interface with auto-dqs cleaning and writevwm training for memory controller," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 398–399.
- [110] Y. Cho, Y. Bae, B. Moon, Y. Eom, M. Ahn, W. Lee, C. Cho, M. Park, Y. Jeon, J. Ahn, B. Choi, D. Kang, S. Yoon, Y. Yang, K. Park, J. Choi, and J. Lee, "A sub-1.0v 20nm 5gb/s/pin post-lpddr3 i/o interface with low voltage-swing terminated logic and adaptive calibration scheme for mobile application," in 2013 Symposium on VLSI Circuits, 2013, pp. C240–C241.
- [111] M. W. Chaudhary, A. Heinig, and B. Choubey, "13-gb/s transmitter for bunch of wires chip-to-chip interface standard," in *IEEE 63rd International Midwest* Symposium on circuits and systems (MWSCAS). IEEE, 2020, pp. 333–336.