The long-overdue consumer CPU roadmap
The industry has been eager to see Intel's future architecture roadmap, but ever since Skylake it has stayed half-hidden. In recent months Intel announced several data center roadmaps covering Cascade Lake, Cooper Lake, Ice Lake and later generations, while details on consumer products remained hard to come by.
At this Architecture Day event, Intel finally presented a roadmap for its consumer PC processor architectures and another for its Atom architectures.
On the high-performance Core line, Intel listed three new codenames for the next three years: Sunny Cove, Willow Cove, and Golden Cove. The nearest one, Sunny Cove, will be available in 2019 (PS: care to guess whether it will stand us up? ^_^).
Sunny Cove is reportedly designed to raise performance per clock and lower power consumption in common computing tasks. It will have AVX-512 units and new features that accelerate specialized workloads such as artificial intelligence and encryption, and it will become the foundation of Intel's next-generation PC and server processors.
Willow Cove follows on the roadmap in 2020, probably on 10nm. Intel lists its focus as a cache redesign (which may mean L1/L2 tuning), new transistor optimizations (manufacturing-based), and additional security features, which may mean further hardening against new classes of side-channel attacks.
Golden Cove sits at 2021 on the chart. Its process is still a question mark, perhaps 10nm or 7nm. Intel plans to further raise single-threaded and artificial intelligence performance, add possible networking and AI features to the core design, and apparently improve security features as well.
The architecture roadmap for the low-power Atom series moves more slowly than the Core roadmap, which is not surprising given its history. Because Atom must fit a wide variety of devices, the industry expects its products to offer a broader range of features, especially at the SoC level.
The architecture launching in 2019 is called Tremont and focuses on single-threaded performance, network server performance, and battery life. Tremont will be followed by Gracemont, which Intel lists as a 2021 product, possibly with wider vector units or support for new vector instructions.
According to the roadmap, Gracemont will be followed by another core in the "XXXmont" series, and Intel is still studying the performance, frequency and features that this new core might offer in 2023.
These are architecture names. Actual products may carry a different codename, namely the XXX-Lake names that the Core series has used in recent years. For example, the processor codenamed Ice Lake combines Sunny Cove CPU cores with Gen11 integrated graphics.
Another noteworthy item from the event is that Intel's future architectures are likely to be decoupled from the process node. Dr. Raja Koduri and Dr. Murthy Renduchintala explained that, to give the product line a degree of flexibility, new products based on these architectures will be brought to market on the best process available at the time.
Although it was not stated outright, this should mean that the "Tick-Tock" strategy, which already existed in name only, has been swept into the dustbin of history. In the future, it may become normal for the same core design to appear on different processes.
A peek at the Sunny Cove architecture
Whenever a new processor architecture is announced, what people look forward to most is a detailed analysis of the new design and of what changed from the previous generation.
Since Skylake first launched in 2015, Intel has shipped three further generations: Kaby Lake, Coffee Lake and Coffee Lake Refresh. Because each generation brought only small gains, enthusiasts have mocked the pace as "squeezing toothpaste". Although Intel showed the new Sunny Cove architecture this time, the information is unfortunately incomplete and concentrates mainly on the back end of the design.
Intel divides its microarchitecture updates into two distinct parts: general-purpose performance improvements and special-purpose performance improvements. General-purpose improvements mean higher raw IPC (instructions per clock) throughput or higher frequency. IPC gains can come from making the core wider (more instructions executed per clock), deeper (more parallelism exposed per clock) or smarter (better data delivery through the front end), while frequency is usually a function of the implementation and the process. Special-purpose improvements speed up the workloads of a particular scenario by other means, such as dedicated IP or dedicated instructions.
Sunny Cove reportedly improves both general-purpose and special-purpose performance. In the back end, Intel's changes include larger caches, a wider execution core, and higher L1 store bandwidth.
The Sunny Cove L1 data cache grows from 32KB to 48KB. As a rule of thumb, the cache miss rate falls with the square root of the capacity increase, so a 1.5x larger cache should in theory cut L1 misses by a factor of about 1.22 (the square root of 1.5), which is the source of the 22% figure quoted for Sunny Cove. The L2 caches of Sunny Cove based Core and Xeon processors will also grow beyond today's 256KB and 1MB respectively, though the new capacities have not been disclosed.
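The square-root rule of thumb is easy to check numerically. The sketch below assumes only that miss rate scales as one over the square root of capacity; it is a back-of-the-envelope model, not a description of how Sunny Cove actually behaves, and the function name is made up for illustration.

```python
import math


def relative_miss_rate(old_kb: float, new_kb: float) -> float:
    """Miss rate of the new cache relative to the old one.

    Assumption: miss rate is proportional to 1 / sqrt(capacity).
    Real workloads can deviate a lot from this rule of thumb.
    """
    return math.sqrt(old_kb / new_kb)


ratio = relative_miss_rate(32, 48)   # new misses / old misses
old_over_new = 1 / ratio             # how many times more misses the old cache takes

print(f"new miss rate is {ratio:.3f}x the old one")
print(f"old cache misses {old_over_new:.3f}x as often")
```

Under this model the old cache misses about 1.22 times as often as the new one, which matches the quoted 22%; expressed as a drop relative to the old miss rate, the same change is closer to 18%.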
In addition, the micro-op (uOp) cache and the second-level TLB, although not part of the back end, are also larger than in the current design, which will help with machine address translation. Other changes can be seen in the figure. The number of execution ports rises from 8 to 10, allowing more instructions to leave the schedulers per cycle. Allocation into the reorder buffer rises from 4 to 5 instructions per cycle. Ports 4 and 9 are linked to a cycling data store, which doubles store bandwidth, and the store AGU capability is doubled as well, which the larger L1-D will help to feed.
The execution ports of the Sunny Cove architecture have also changed substantially. See the following figure for details:
Intel has added more LEA units to the integer part of the core to help with memory addressing calculations. This may recover some of the performance lost to security mitigations that require frequent memory calculations, or help high-performance array code that uses constant offsets. Port 1 gets the MUL (multiply) unit from Skylake's Port 5, possibly for rebalancing, and there is also an integer divider unit. This is a subtle change: Cannon Lake also has a 64-bit IDIV (signed integer division) unit in its design, where it brings 64-bit integer division down from 97 clocks (a mix of instructions) to 18 clocks, and Sunny Cove is probably similar.
Also on the integer side, the multiply unit on Port 5 has become a "MulHi" unit. In other architectures such a unit leaves the most significant nibble of the result in a register for further use, but its exact role in the Sunny Cove core is not yet clear.
On the floating-point side, Intel has added shuffle resources to remove bottlenecks in code. Intel did not describe the FMA (fused multiply-add) capabilities of the floating-point part of the core, but since the core has an AVX-512 unit, at least one of the FMAs should work with it. Cannon Lake has only one 512-bit FMA, which most likely carries over here, and the Xeon Scalable version may have two.
Other updates listed by Intel include improvements to the branch predictor and a reduced effective load latency (thanks to the TLB and L1-D changes). It was pointed out that these improvements will not help every user, and that new algorithms may be needed to exploit these specific parts of the core.
Beyond the structural changes, Sunny Cove adds new instructions to accelerate specialized computing tasks. Alongside the AVX-512 unit, the new architecture supports IFMA (integer fused multiply-add) instructions for big-number arithmetic, which is very useful in cryptography. Sunny Cove also supports Vector-AES, Vector Carryless Multiply, SHA, SHA-NI, and Galois Field instructions, which are basic building blocks of several cryptographic algorithms.
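To show what kind of arithmetic the Vector Carryless Multiply and Galois Field instructions speed up, here is a minimal scalar sketch in Python. It illustrates the mathematics only and does not model the actual instruction semantics or operand widths. The reduction polynomial 0x11B (the one AES uses) is chosen purely as a familiar example, and the function names are made up for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: long multiplication where partial
    products are combined with XOR instead of addition."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes in GF(2^8).

    Assumption: `poly` is a degree-8 reduction polynomial; 0x11B is
    the AES polynomial, used here only as an illustration.
    """
    product = clmul(a, b)
    # Clear every bit at position 8 or above, from the top down.
    for shift in range(product.bit_length() - 9, -1, -1):
        if product & (1 << (shift + 8)):
            product ^= poly << shift
    return product


print(hex(clmul(0b11, 0b11)))       # (x + 1)^2 = x^2 + 1
print(hex(gf256_mul(0x57, 0x83)))   # worked example from the AES standard
```

Hardware support matters because software loops like these take many cycles per byte, whereas a dedicated instruction produces the same result for a whole vector of operands at once.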
Sunny Cove supports larger memory capacities. Its page tables grow from 4 to 5 levels, supporting a linear address space of up to 57 bits and a physical address space of up to 52 bits, which means a server processor can in theory support 4TB of memory per socket.
According to Intel's earlier Xeon roadmap, Sunny Cove will reach the server market with Ice Lake-SP in 2020. On the security side, Sunny Cove features Multi-Key Total Memory Encryption and User Mode Instruction Prevention.
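The 57-bit figure follows directly from the page table layout. The sketch below assumes the usual x86-64 arrangement of 4 KiB pages (12 offset bits) and 512-entry tables (9 index bits per level); the helper names are hypothetical.

```python
PAGE_OFFSET_BITS = 12      # 4 KiB pages
INDEX_BITS_PER_LEVEL = 9   # 512 entries per page table


def linear_address_bits(levels: int) -> int:
    """Width of the linear address covered by `levels` of page tables."""
    return PAGE_OFFSET_BITS + INDEX_BITS_PER_LEVEL * levels


def split_linear_address(addr: int, levels: int):
    """Split an address into per-level table indices (top level first)
    and the offset within the page."""
    offset = addr & ((1 << PAGE_OFFSET_BITS) - 1)
    rest = addr >> PAGE_OFFSET_BITS
    indices = []
    for _ in range(levels):
        indices.append(rest & ((1 << INDEX_BITS_PER_LEVEL) - 1))
        rest >>= INDEX_BITS_PER_LEVEL
    return list(reversed(indices)), offset


for levels in (4, 5):
    bits = linear_address_bits(levels)
    print(f"{levels}-level paging: {bits}-bit linear addresses, "
          f"{2 ** bits // 2 ** 40} TiB of virtual address space")
```

Each extra level adds 9 bits, so moving from 4 to 5 levels takes the linear address from 48 to 57 bits and multiplies the virtual address space by 512.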
Gen11 integrated graphics
In 2015, Intel introduced the Skylake processor with Gen9 integrated graphics, but the graphics in the later Kaby Lake and Coffee Lake were only Gen9.5, not Gen10. The 10nm Cannon Lake processor should in fact have been paired with Gen10, but Intel never released a PC Cannon Lake processor with integrated graphics.
Today, Raja Koduri, Intel's chief architect, senior vice president of the Core and Visual Computing Group and general manager of edge computing solutions, announced the new Gen11 integrated graphics and reiterated plans to launch a discrete graphics processor in 2020.
According to the roadmap, Gen11 integrated graphics will arrive with the 10nm processors in 2019, with 64 enhanced execution units (EUs), more than twice as many as the previous Gen9 integrated graphics, and floating-point performance of 1 TFLOPS. The 64 EUs are divided into 4 slices, each consisting of 2 sub-slices of 8 EUs. Each sub-slice has an instruction cache and a 3D sampler, while each larger slice has 2 media samplers, 1 PixelFE and additional load/store hardware.
Intel did not disclose much about how EU performance was improved, but said that the floating-point unit interfaces inside the EU were redesigned to support faster (2x) FP16. Each EU supports 7 threads as before, and the entire GPU is described as having 512 concurrent pipelines. Intel says it has redesigned the memory interface and increased the GPU's L3 cache to 3MB, four times that of Gen9.5.
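The EU count, the 512 pipelines and the 1 TFLOPS figure can be tied together with simple arithmetic. The sketch below rests on assumptions that are not stated in the text: that each EU has two SIMD4 floating-point units (8 FP32 lanes), that each lane can retire one fused multiply-add (2 floating-point operations) per clock, and that the GPU runs at roughly 1 GHz. Treat it as a plausibility check rather than Intel's specification.

```python
SLICES = 4
SUBSLICES_PER_SLICE = 2
EUS_PER_SUBSLICE = 8
LANES_PER_EU = 8       # assumption: two SIMD4 FPUs per EU
FLOPS_PER_LANE = 2     # assumption: one FMA per lane per clock


def total_eus() -> int:
    """EU count implied by the slice / sub-slice layout."""
    return SLICES * SUBSLICES_PER_SLICE * EUS_PER_SUBSLICE


def fp32_lanes(eus: int) -> int:
    """Number of FP32 lanes across the whole GPU."""
    return eus * LANES_PER_EU


def peak_gflops(eus: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in GFLOPS."""
    return eus * LANES_PER_EU * FLOPS_PER_LANE * clock_ghz


eus = total_eus()
print(eus, "EUs,", fp32_lanes(eus), "FP32 lanes")
print(peak_gflops(eus, clock_ghz=1.0), "GFLOPS at an assumed 1.0 GHz")
```

Under these assumptions, 64 EUs give 512 lanes, which lines up with the 512 concurrent pipelines mentioned above, and a clock near 1 GHz is enough to reach about 1 TFLOPS.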
A major improvement in Gen11 integrated graphics is support for tile-based rendering, which makes Intel the last PC GPU vendor to implement the feature, after NVIDIA in 2014 and AMD in 2017. Tile-based rendering is not a cure for every GPU performance problem, but a well-optimized implementation suits the limited memory bandwidth of integrated graphics.
Intel's lossless memory compression has also improved, raising performance by 10% in the best case and by 4% on average. The GTI interface now supports reading and writing 64 bytes per clock, increasing throughput to match the redesigned memory interface.
Gen11 integrated graphics also supports Intel's new multi-rate shading technology, Coarse Pixel Shading (CPS), which is similar to NVIDIA's Variable Rate Shading and lets the GPU reduce the amount of shading work needed for part of the image. Intel showed two CPS demonstrations in which the pixel shading rate is a function of distance from the camera and from the center of the screen: the farther an object is from the camera or from the screen center, the less shading work is done. The design is meant to help VR implement features such as foveated rendering, and Intel said games that support the technology can raise frame rates by about 30%.
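As a rough illustration of the idea behind Coarse Pixel Shading, the sketch below picks a coarser shading block as an object moves away from the camera or from the screen center, then counts how many shader invocations a full frame would need at each rate. The thresholds, the 1x1 / 2x2 / 4x4 block sizes and the function names are all hypothetical; Intel did not describe its selection logic in this detail.

```python
def _level(value: float, low: float, high: float) -> int:
    """0 below `low`, 1 between `low` and `high`, 2 at or above `high`."""
    if value < low:
        return 0
    if value < high:
        return 1
    return 2


def coarse_block(depth: float, center_offset: float,
                 depth_bounds=(10.0, 50.0),
                 offset_bounds=(0.3, 0.7)) -> int:
    """Side length (in pixels) of the block shaded by one invocation.

    depth:          distance from the camera (arbitrary units)
    center_offset:  0.0 at the screen center, 1.0 at the edge
    All thresholds are made-up values for illustration.
    """
    level = max(_level(depth, *depth_bounds),
                _level(center_offset, *offset_bounds))
    return 1 << level   # 1, 2 or 4


def shader_invocations(width: int, height: int, side: int) -> int:
    """Invocations needed to cover the frame with side x side blocks
    (ceiling division so partial blocks at the edges are still shaded)."""
    return -(-width // side) * -(-height // side)


for side in (1, 2, 4):
    print(f"{side}x{side} blocks:", shader_invocations(1920, 1080, side))
```

A 2x2 block cuts pixel shader invocations to a quarter and a 4x4 block to a sixteenth, which is where the frame-rate gain comes from when coarse rates are applied only where the loss of detail is hard to notice.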
Raja Koduri also announced the brand name for Intel's discrete graphics business: Xe, until now informally known as the "Gen12" series. From 2020 it will span everything from the client to the data center, including future integrated graphics solutions, and Intel hopes Xe can compete with rivals' best products from entry level and mid-range up to enthusiast and AI parts.
Xe will start on the 10nm node and lay the foundation for future generations of graphics. It will follow Intel's single-stack software philosophy: Intel wants software developers to be able to use CPUs, GPUs, FPGAs and AI accelerators through the same set of APIs, which suggests that Intel is ready to rally around a single brand.
As part of the Architecture Day event, Intel ran several live chip demonstrations, reportedly based on the new Sunny Cove cores and Gen11 integrated graphics. The demos included the 7-Zip application and the game Tekken 7.
The 7-Zip demo is relatively straightforward: at the same frequency, the demo machine was 75% faster than a Skylake platform, showing the special-purpose gains from new Sunny Cove instructions such as Vector-AES and SHA-NI. In Tekken 7, the Sunny Cove + Gen11 demo machine ran more smoothly than Skylake + Gen9, staying well above the 30fps minimum.
Foveros: 3D packaging that changes how chips are made
Anyone who follows semiconductor chip design knows that most CPUs and SoCs produced today are monolithic: everything needed sits on a single piece of silicon, which is then packaged and placed into the system. There are also multi-chip packages, which join different dies through high-speed interconnects using shared connections, a carrier, or an embedded bridge.
One of the biggest challenges in modern chip design is minimizing die area, which lowers cost and power consumption and makes system integration easier. When it comes to raising performance, however, a weakness of large monolithic dies and multi-chip packages is that the compute logic sits too far from memory, so Intel is preparing to bring 3D stacking to the mass market.
Raja said that Intel has focused on high-performance process nodes for decades, trying to extract as much core performance as possible. Alongside them, Intel runs an IO-optimized process node at a similar cadence, which is better suited to PCH or SoC-type functions.
126x and 127x are Intel's internal numbers for process node technologies, although node variants with a "+" suffix are not distinguished. Raja presented the process technologies available for 2019: the 10nm process 1274 for compute, the 14nm process 1273 for IO, and P1222 for the Foveros 3D stacking technology introduced here. Looking ahead, Intel will expand its node portfolio so that it can cover more power and performance points.
One way to do this is to choose the best transistor for each job. Whether the block is a CPU, GPU, IO, FPGA, RF, or anything else, each piece is built on the process that suits it, and with the right packaging the pieces can be put together for the best overall result.
This is where Foveros comes in. Foveros is Intel's new active interposer technology. Its design builds on the EMIB (Embedded Multi-Die Interconnect Bridge) 2D packaging technology introduced in 2018, and it is better suited to small form factor products or products with extremely high memory bandwidth requirements. In these designs the energy spent per bit of data transferred is very low, and the packaging technology has to deal with a reduced bump pitch, increased bump density, and die stacking. Intel said that Foveros is ready for mass production.
The first iteration of the technology is not as complicated as the slide above suggests: it simply places a set of CPU cores on top of a PCH. Even so, Intel can use different transistor types on different dies, for example placing a set of 10nm CPU cores on a carrier die built on the 22FFL process.
Intel demonstrated a Foveros chip at Architecture Day. It uses a 22FFL IO die as the active carrier and connects a 10nm die to it with TSVs (through-silicon vias). The 10nm die contains one Sunny Cove core and four Atom cores (said to be Tremont). The chip measures 12 x 12 mm and has a standby power of only 2mW, so it appears to be aimed at mobile devices.
As the Intel slide shows, the "Big CPU" Sunny Cove core has a private 0.5MB L2 cache, the four small Atom cores have a shared 1.5MB L2 cache, and the two core types share a 4MB L3 cache. The chip also integrates 64-EU Gen11 graphics, a quad-channel LPDDR4 memory controller (4 x 16-bit), and MIPI (Mobile Industry Processor Interface) along with DisplayPort 1.4 support.
Jim Keller said that Intel is using Foveros to build many new experimental devices to see which ones might make good products, so the industry should see more Foveros products in 2019 and 2020.
Other news from the event
The least exciting part of this Architecture Day was probably the discussion of data center products. Intel had already announced its next two products for the enterprise market, Cascade Lake and Cooper Lake, both built on 14nm and focused on enhanced security and AI acceleration instructions, to be followed by the 10nm Ice Lake Scalable, and that was about all.
At the event, however, Intel confirmed that Ice Lake will be built on the Sunny Cove architecture and showed the package of the 10nm Ice Lake Xeon processor, which counts as a small piece of reassuring news.
In addition, Intel presented Optane technology, the One API software project and the Deep Learning Reference Stack at the event.
One API software: Intel announced the "One API" project to simplify programming across diverse computing engines, including CPUs, GPUs, FPGAs, artificial intelligence and other accelerators. The project includes a comprehensive, unified set of developer tools for mapping software onto the hardware that can best accelerate it. A public release is expected in 2019.
Optane technology: Intel Optane DC persistent memory is a new product that combines memory-like performance with data persistence and large storage capacity. By placing more data closer to the CPU, the technology lets applications in artificial intelligence and large databases process larger data sets faster. Its large capacity and data persistence reduce the latency penalty of going to storage, improving workload performance.
Intel Optane DC persistent memory delivers cache line (64B) reads to the CPU. On average, the idle read latency of Optane persistent memory is about 350ns when an application directs the read to persistent memory or requests data that is not cached in DRAM. For scale, an Optane DC SSD has an average idle read latency of about 10,000ns (10 µs), so this is a significant improvement. In cases where the requested data is in DRAM, whether cached there by the CPU's memory controller or placed there by the application, the memory subsystem is expected to respond as fast as DRAM (less than 100ns).
Intel also demonstrated the combination of Optane and QLC SSDs, which will reduce access latency for the most frequently used data. Overall, these platform and memory improvements reshape the memory and storage hierarchy and give systems and applications a complete set of options.
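To put these latencies side by side, the sketch below computes an average read latency for a simple two-tier model in which a fraction of reads hit DRAM and the rest go to a slower tier. The three latency constants come from the figures quoted above (with 100ns taken as DRAM's upper bound); the model itself and the example hit rate are illustrative assumptions, and real systems have more tiers and queuing effects.

```python
DRAM_NS = 100.0      # "less than 100 ns" quoted above, used as an upper bound
PMEM_NS = 350.0      # average idle read latency of persistent memory
SSD_NS = 10_000.0    # average idle read latency of a data center SSD


def effective_read_latency_ns(dram_hit_rate: float,
                              miss_ns: float = PMEM_NS,
                              hit_ns: float = DRAM_NS) -> float:
    """Weighted average latency for a two-tier memory model.

    Assumption: a read either hits DRAM (hit_ns) or goes to the
    slower tier (miss_ns); nothing else is modeled.
    """
    if not 0.0 <= dram_hit_rate <= 1.0:
        raise ValueError("dram_hit_rate must be between 0 and 1")
    return dram_hit_rate * hit_ns + (1.0 - dram_hit_rate) * miss_ns


hit_rate = 0.9  # hypothetical share of reads served from DRAM
print("backed by persistent memory:", effective_read_latency_ns(hit_rate), "ns")
print("backed by SSD:", effective_read_latency_ns(hit_rate, miss_ns=SSD_NS), "ns")
print("SSD / persistent memory latency ratio:", round(SSD_NS / PMEM_NS, 1))
```

Even with most reads served from DRAM, the slower tier dominates the average, which is why moving cold data from SSD-class latency to persistent-memory-class latency has such a visible effect.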
Deep Learning Reference Stack: This is an integrated, high-performance open source stack optimized for Intel Xeon Scalable platforms. The open source community release is meant to ensure that AI developers have easy access to all the features and functionality of Intel platforms. The Deep Learning Reference Stack is highly tuned and built for cloud native environments. The release reduces the complexity of integrating multiple software components, helping developers prototype quickly while still giving users the flexibility to create customized solutions. The stack consists of the following components:
Operating System: the Clear Linux operating system can be customized to individual development needs and is tuned for Intel platforms and for specific use cases such as deep learning;
Orchestration: Kubernetes manages and orchestrates containerized applications for multi-node clusters with awareness of the Intel platform;
Containers: Docker containers and Kata Containers use Intel virtualization technology to help secure containers;
Libraries: the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN) is Intel's highly optimized math library for mathematical function performance;
Runtimes: Python, highly tuned and optimized for Intel architecture, provides runtime support for executing applications and services;
Frameworks: TensorFlow is a leading deep learning and machine learning framework;
Deployment: Kubeflow is an open source, industry-driven deployment tool that provides a fast experience on Intel architecture and is easy to install and use.