Where CUDA leads: practical applications of GPGPU technology

07.09.2021

DirectX - a set of low-level programming interfaces (APIs) for creating games and other high-performance multimedia applications. Includes support for high-performance 2D and 3D graphics, sound, and input devices.

Direct3D (D3D) - an interface for rendering three-dimensional primitives (geometric bodies). Part of DirectX.

OpenGL (from the English Open Graphics Library, literally an open graphics library) - a specification that defines a cross-platform, language-independent programming interface for writing applications that use two-dimensional and three-dimensional computer graphics. It includes over 250 functions for drawing complex 3D scenes from simple primitives and is used in video games, virtual reality, and scientific visualization. On the Windows platform it competes with Direct3D.

OpenCL (from the English Open Computing Language, literally an open computing language) - a framework for writing computer programs that perform parallel computations on various graphics (GPU) and central (CPU) processors. The OpenCL framework includes a programming language and an application programming interface (API). OpenCL provides instruction-level and data-level parallelism and is an implementation of the GPGPU technique.

GPGPU (abbreviated from the English General-Purpose Graphics Processing Units, literally general-purpose GPUs) - a technique of using the graphics processor of a video card for general-purpose computations that are normally performed by the CPU.

Shader (eng. shader) - a program for shading synthesized images, used in three-dimensional graphics to determine the final parameters of an object or image. It typically includes descriptions, of arbitrary complexity, of light absorption and scattering, texture mapping, reflection and refraction, shading, surface displacement, and post-processing effects. Complex surfaces can be rendered from simple geometric shapes.

Rendering (eng. rendering) - visualization; in computer graphics, the process of producing an image from a model using software.

SDK (abbreviated from the English Software Development Kit) - a set of software development tools.

CPU (abbreviated from the English Central Processing Unit, literally the central/main computing device) - the central (micro)processor; a device that executes machine instructions; the piece of hardware responsible for performing the computational operations specified by the operating system and application software and for coordinating the work of all other devices.

GPU (abbreviated from the English Graphics Processing Unit, literally a graphics computing device) - a graphics processor; a separate device in a personal computer or game console that performs graphics rendering. Modern GPUs process and render computer graphics very efficiently. The graphics processor in modern video adapters is used as a 3D graphics accelerator, but in some cases it can also be used for general computation (GPGPU).

CPU problems

For a long time, the performance of traditional processors grew mainly through successive increases in clock frequency (about 80% of performance was determined by clock frequency alone), combined with an increase in the number of transistors on a single die. However, further increases in clock frequency (above roughly 3.8 GHz the chips simply overheat!) run up against a number of fundamental physical barriers (the fabrication process has come very close to atomic dimensions: the lattice constant of silicon is about 0.543 nm):

First, as die features shrink and the clock frequency rises, transistor leakage current increases, which drives up power consumption and heat dissipation;

Second, the benefits of higher clock speeds are partly offset by memory latency, because memory access times cannot keep up with the rising clock rates;

Third, for some applications traditional sequential architectures become inefficient as clock rates rise because of the so-called "von Neumann bottleneck", a performance limitation imposed by a sequential flow of computation. At the same time, resistive-capacitive signal propagation delays grow, an additional bottleneck associated with increasing clock frequency.

GPU development

CPUs and GPUs developed in parallel:

November 2008 - Intel introduced the 4-core Intel Core i7 line based on the new-generation Nehalem microarchitecture. The processors run at clock speeds of 2.6-3.2 GHz and are made on a 45 nm process.

December 2008 - shipments of the 4-core AMD Phenom II 940 (codenamed Deneb) began. It runs at 3 GHz and is manufactured on a 45 nm process.

May 2009 - AMD introduced a version of the ATI Radeon HD 4890 GPU with the core clock raised from 850 MHz to 1 GHz, the first graphics processor to reach 1 GHz. Thanks to the higher frequency, the chip's computing power grew from 1.36 to 1.6 teraflops. The processor contains 800 (!) compute cores and supports GDDR5 video memory, DirectX 10.1, ATI CrossFireX, and all the other technologies found in modern video cards. The chip is made on a 55 nm process.

The main differences of the GPU

The distinctive features of the GPU (compared with the CPU) are:

- an architecture aimed at maximizing the speed of computing textures and complex graphical objects;

- peak computing power of a typical GPU that is much higher than that of a CPU;

- thanks to its specialized pipeline architecture, much greater efficiency at processing graphical information than a CPU.

"Crisis of the genre"

The "crisis of the genre" for CPUs had matured by 2005 - that is when dual-core processors appeared. But despite advances in technology, the performance growth of conventional processors slowed markedly, while GPU performance kept growing. So by 2003 a revolutionary idea had crystallized: use the computing power of the graphics processor for general tasks. GPUs came to be actively used for "non-graphical" computing (physics simulation, signal processing, computational mathematics and geometry, database operations, computational biology, computational economics, computer vision, etc.).

The main problem was that there was no standard interface for GPU programming. Developers used OpenGL or Direct3D, but that was not very convenient. NVIDIA (one of the largest makers of graphics, media, and communications processors, as well as wireless media processors; founded in 1993) took up the development of a unified and convenient standard and introduced CUDA technology.

How it started

2006 - NVIDIA demonstrates CUDA™, the beginning of a revolution in GPU computing.

2007 - NVIDIA releases the CUDA architecture (the original version of the CUDA SDK was introduced on February 15, 2007); the technology earns a "best innovation" nomination from Popular Science magazine and a "Readers' Choice" award from HPCWire.

2008 - NVIDIA CUDA technology wins PC Magazine's "Technical Excellence" award.

What CUDA is

CUDA (abbreviated from the English Compute Unified Device Architecture, literally a unified device computing architecture) - an architecture (a combination of software and hardware) that allows general-purpose computations to be performed on the GPU, with the GPU effectively acting as a powerful coprocessor.

NVIDIA CUDA™ technology is the only C-language development environment that lets developers create software for solving complex computational problems in less time, thanks to the processing power of GPUs. Millions of CUDA-capable GPUs are already in use around the world, and thousands of programmers are already using the (free!) CUDA tools to accelerate applications and tackle the most complex, resource-intensive tasks - from video and audio encoding to oil and gas exploration, product modeling, medical imaging, and scientific research.

CUDA gives developers the ability, at their discretion, to access the graphics accelerator's instruction set and manage its memory, and to organize complex parallel computations on it. A CUDA-capable graphics accelerator becomes a powerful, programmable, open architecture, much like today's CPUs. All this gives the developer low-level, distributed, high-speed access to the hardware, making CUDA a necessary foundation for building serious high-level tools such as compilers, debuggers, math libraries, and software platforms.

Uralsky, a leading NVIDIA technology specialist, compares the GPU and the CPU this way: "A CPU is an SUV. It can drive anytime, anywhere, but not very fast. A GPU is a sports car: on a bad road it simply won't go anywhere, but give it good pavement and it will show a speed the SUV never dreamed of!.."

CUDA technology capabilities

Moore's Law, which states that the number of transistors on a chip doubles every two years, has held for decades. But it was formulated back in 1965, and only in the last five years has the idea of physical multi-core in consumer-class processors developed rapidly: in 2005 Intel introduced the Pentium D and AMD the Athlon X2. Back then, applications using two cores could be counted on the fingers of one hand. But the next generation of Intel processors, which made a revolution, had exactly two physical cores. Moreover, the Quad series appeared in January 2007, around the same time that Moore himself admitted his law would soon cease to hold.

And now? Dual-core processors sit even in budget office systems, and four physical cores became the norm in just two or three years. Processor frequency is no longer rising; instead, the architecture improves and the number of physical and virtual cores grows. Meanwhile, the idea of using video adapters equipped with dozens or even hundreds of computing "blocks" has been around for a long time.

And although the prospects for GPU computing are huge, and the most popular solution - NVIDIA CUDA - is free, well documented, and on the whole fairly simple to adopt, there are not many applications that use the technology. Mostly these are all kinds of specialized computations that the average user usually does not care about. But there are also programs aimed at the mass user, and those are what this article is about.

To begin with, a few words about the technology itself and what it is all about. Since this article is aimed at a broad range of readers, I will try to explain it in accessible language, without complex terms and somewhat briefly.

CUDA (English: Compute Unified Device Architecture) is a software and hardware architecture that allows computations to be performed on NVIDIA GPUs that support GPGPU technology (general-purpose computing on video cards). The CUDA architecture first appeared on the market with the release of the eighth-generation NVIDIA G80 chip and is present in all subsequent series of graphics chips used in the GeForce, Quadro, and Tesla accelerator families. (c) Wikipedia.org

Incoming threads are processed independently of one another, i.e. in parallel.

The threads are organized into three levels (a short code sketch follows the list):

Grid - the top level, corresponding to one kernel; it contains a one-, two-, or three-dimensional array of blocks.

Block - contains many threads. Threads in different blocks cannot interact with each other. Why introduce blocks at all? Each block is essentially responsible for its own subtask. For example, a large image (which is a matrix) can be split into several smaller parts (matrices), with each part processed in parallel.

Thread - a single thread. Threads within one block can interact either through shared memory (which, by the way, is much faster than global memory) or through thread-synchronization primitives.

Warp - a group of threads that execute together; for all modern GPUs the warp size is 32. There is also the half-warp, half of a warp, since memory accesses are usually issued separately for the first and second halves of a warp.
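
To make the hierarchy concrete, here is a minimal sketch of my own (it is not code from this article, and the block size of 256 threads is an arbitrary choice) showing how a kernel computes a global index from its block and thread indices, and how the launch configuration ties grid, blocks, and threads together.

__global__ void scale(float *data, float factor, int n)
{
    // Global index: which block we are in, times the block size,
    // plus our position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard against the partially filled last block
        data[i] *= factor;
}

// Host side: a one-dimensional grid of blocks, 256 threads per block.
// dim3 block(256);
// dim3 grid((n + block.x - 1) / block.x);
// scale<<<grid, block>>>(d_data, 2.0f, n);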

As you can see, this architecture lends itself well to parallelizing tasks. And although programming is done in the C language with some restrictions, in practice it is not that simple, because not everything can be parallelized. There are no standard functions for generating random numbers (or for initialization); all of that has to be implemented separately, and although plenty of ready-made options exist, none of this is much fun. The ability to use recursion appeared only relatively recently.

For clarity, a small console program (to keep the code minimal) was written that performs operations on two arrays of type float, i.e. with non-integer values. For the reasons given above, initialization (filling the arrays with assorted arbitrary values) was done on the CPU. Then 25 different operations were performed on the corresponding elements of each array, with intermediate results written to a third array. The array size was varied; the results are given below.
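
First, for readers curious what such a kernel looks like, here is a rough sketch of my own (the chain of float operations is arbitrary and merely stands in for the "25 operations" mentioned above):

// Each thread combines one pair of elements from a and b (filled on the CPU)
// through a chain of float operations and writes the result to c.
__global__ void arrayOps(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    float x = a[i], y = b[i], r = 0.0f;
    for (int k = 0; k < 25; ++k)      // stand-in for the 25 operations
        r = r * 0.5f + (x + y) * 0.1f + x * y;
    c[i] = r;
}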

In total, 4 tests were carried out:

1024 elements in each array:

It is clear that with such a small number of elements there is little benefit from parallel computation, since the calculations themselves take far less time than their preparation.

4096 elements in each array:

Now you can see that the video card performs the array operations 3 times faster than the processor. Moreover, the execution time of this test on the video card did not increase (the slight decrease can be attributed to measurement error).

Now there are 12288 elements in each array:

The video card's lead has doubled. Note again that the execution time on the video card grew only slightly, while on the processor it grew more than 3 times, i.e. in proportion to the complexity of the task.

And the last test - 36864 elements in each array:

In this case the speedup reaches an impressive value - almost 22 times faster on the video card. And again, the execution time on the video card increased only slightly, while on the processor it grew the expected 3 times, again in proportion to the growing complexity of the task.

If you keep making the calculations more complex, the video card wins by an ever larger margin. Although the example is somewhat exaggerated, it shows the overall picture clearly. But, as mentioned above, not everything can be parallelized. Take the calculation of pi: the only examples around use the Monte Carlo method, and their accuracy is 7 decimal places, i.e. ordinary float precision. Increasing the accuracy requires long arithmetic, and that is where problems arise, because it is very, very difficult to implement efficiently. I could not find any examples on the Internet that use CUDA to compute pi to 1 million decimal places. Attempts have been made to write such an application, but the simplest and most efficient methods for computing pi are the Brent-Salamin algorithm and the Gauss-Legendre formula. The well-known SuperPI most likely uses the Gauss-Legendre formula (judging by its speed and the number of iterations). And judging by the fact that SuperPI is single-threaded, by the lack of CUDA examples, and by the failure of my own attempts, it is not feasible to parallelize the calculation of pi effectively.

By the way, you can watch the GPU load, as well as the memory allocation, grow while the calculations are running.

Now let's move on to the more practical benefits of CUDA, namely the programs that currently use this technology. For the most part, these are all kinds of audio/video converters and editors.

We used 3 different video files for testing:

      * The history of the movie Avatar - 1920x1080, MPEG4, h.264.
      * "Lie to me" series - 1280x720, MPEG4, h.264.
      * Series "It's Always Sunny in Philadelphia" - 624x464, xvid.

The container and size of the first two files were MKV and 1.55 GB, and of the last one, AVI and 272 MB.

Let's start with a much-hyped and popular product - Badaboom. The version used was 1.2.1.74. The program costs $29.90.

The program interface is simple and intuitive: on the left we select the source file or disc, and on the right the target device we will encode for. There is also a user mode in which parameters are set manually, and that is the mode we used.

To begin with, let's look at how quickly and how well video is encoded "into itself", i.e. at the same resolution and roughly the same size. Speed will be measured in fps rather than elapsed time - that makes it easier both to compare and to estimate how long a video of arbitrary length will take to compress. And since today we are looking at the "green" camp's technology, the graphs are colored accordingly. :)

Encoding speed obviously depends directly on the quality. A light resolution (let's call it SD, as usual) is no problem for Badaboom: the encoding speed is 5.5 times the frame rate of the source video (24 fps). Even heavy 1080p video is converted in real time. And the quality of the resulting video is very close to the original, i.e. Badaboom encodes at very, very high quality.

But video is usually transcoded to a lower resolution, so let's see how things stand in that mode. As the resolution was lowered, the bitrate was lowered as well: 9500 kbps for the 1080p output file, 4100 kbps for 720p, and 2400 kbps for 720x404. The values were chosen for a reasonable size/quality ratio.

The numbers speak for themselves. Going from 720p to ordinary SD quality, transcoding a two-hour movie takes about 30 minutes, and the processor load stays low enough that you can go about your business without any discomfort.

But what if you convert the video to a format for a mobile device? To do this, select the iPhone profile (bitrate 1 Mbps, 480x320) and look at the encoding speed:

What more is there to say? A two-hour movie is transcoded to normal iPhone quality in under 15 minutes. HD source material is harder, but still quite fast. The main thing is that the quality of the output video remains quite high when viewed on the phone's display.

Overall, the impression of Badaboom is positive: the speed is pleasing and the interface is simple and clear. The assorted bugs of earlier versions (I used the beta back in 2008) have been fixed, with one exception - the path to the source file and to the folder where the finished video is saved must not contain Russian letters. Against the program's merits, though, this drawback is minor.

Next in line is Super LoiLoScope. The regular version costs 3,280 rubles, and the touch version, which supports touch control in Windows 7, costs a hefty 4,440 rubles. Let's try to figure out why the developer wants that kind of money and why a video editor needs multitouch support. The latest version, 1.8.3.3, was used.

It is rather difficult to describe the program's interface in words, so I decided to shoot a short video. I should say right away that, like all CUDA video converters, it supports GPU acceleration only for MPEG4 output with the h.264 codec.

During encoding, the processor load is 100%, but this does not cause discomfort. The browser and other non-heavy applications do not slow down.

Now let's move on to performance. To begin with, everything is the same as with Badaboom - video transcoding into the same quality.

The results are much better than Badaboom's. The quality is also top-notch; the difference from the original can only be seen by comparing frames side by side under a magnifying glass.

Wow - here LoiLoScope outpaces Badaboom by 2.5 times. At the same time you can easily cut and encode another video in parallel, read the news, and even watch a movie - even FullHD plays without problems - although the processor load is at its maximum.

Now let's try to make a video for a mobile device, we will name the profile the same as it was called in Badaboom - iPhone (480x320, 1 Mbps):

This is not a mistake: everything was checked several times, and each time the result was the same. Most likely it happens simply because the SD file is encoded with a different codec and stored in a different container. When transcoding, the video is first decoded, split into matrices of a certain size, and then compressed. The ASP decoder used for XviD is slower at parallel decoding than the AVC decoder (for h.264). Still, 192 fps is 8 times the original frame rate, and a 23-minute episode is compressed in less than 4 minutes. The situation repeated itself with other files compressed in XviD/DivX.

LoiLoScope left only pleasant impressions: the interface, unusual as it is, is convenient and functional, and the speed is beyond praise. The relatively limited functionality is a bit disappointing, but for simple editing you often only need to adjust the colors slightly, add smooth transitions, and overlay text - and LoiLoScope handles that excellently. The price is also a little frightening: more than $100 for the regular version is normal abroad, but such figures still seem somewhat wild to us. Though I admit that if I shot and edited home videos often, I might consider buying it. Along the way, by the by, I checked editing HD (or rather AVCHD) content straight from a camcorder, without first converting it to another format: LoiLoScope had no problems with .mts files.

A new technology is like a newly emerged species: a strange creature, unlike the many old-timers, awkward in places, funny in places. At first its new qualities seem utterly unsuited to this settled and stable world.

But a little time passes, and it turns out that the newcomer runs faster, jumps higher, and is generally stronger. And it catches more flies than its retrograde neighbors. Then those same neighbors begin to realize there is no point in quarreling with the awkward upstart. Better to be friends with it - better still, to form a symbiosis. Who knows, there may be more flies to go around.

GPGPU (General-Purpose Graphics Processing Units) technology long existed only in the theoretical calculations of brainy academics. How could it be otherwise? Only theorists would propose radically changing a computational process that had taken shape over decades by entrusting the calculation of its parallel branches to a video card.

The CUDA technology logo reminds us that it grew up in the depths of 3D graphics.

But GPGPU technology was not about to gather dust on the pages of university journals for long. Having preened the feathers of its best qualities, it attracted the attention of manufacturers. Thus CUDA was born - an implementation of GPGPU on NVIDIA GeForce graphics processors.

Thanks to CUDA, GPGPU technology has become mainstream. Today only the most short-sighted developer of programming systems, one covered in a thick layer of laziness, fails to declare CUDA support for his product. IT publications considered it an honor to lay out the details of the technology in numerous plump popular-science articles, and competitors hurriedly sat down with design patterns and cross-compilers to develop something similar.

Public recognition is the dream not only of aspiring starlets but also of newly emerging technologies. And CUDA has been lucky: it is well known, and people talk and write about it.

The trouble is that they write as if they were still discussing GPGPU in thick scientific journals. The reader is bombarded with a heap of terms like "grid", "SIMD", "warp", "host", "texture and constant memory"; plunged up to the crown in diagrams of nVidia GPU organization; led down the winding paths of parallel algorithms; and (the strongest move of all) shown long listings of C code. As a result, at the entrance to the article we have a fresh reader eager to understand CUDA, and at the exit the same reader, but with a head swollen with a jumble of facts, diagrams, code, algorithms, and terms.

Meanwhile, the goal of any technology is to make our life easier. And CUDA does that job well. The results of its work will convince any skeptic better than a hundred diagrams and algorithms.

Far from everywhere

CUDA is supported by high-performance nVidia Tesla supercomputers.

And yet, before looking at what CUDA does to make life easier for the average user, it is worth understanding all of its limitations. Like a genie: any wish, but only one. CUDA has its Achilles' heels, and one of them is the limited set of platforms it can run on.

The video cards manufactured by nVidia that support CUDA are presented in a special list called CUDA Enabled Products. The list is quite impressive but easily categorized. CUDA support is provided for:

    nVidia GeForce 8-, 9-, 100-, 200-, and 400-series models with at least 256 MB of video memory on board. Support extends to both desktop cards and mobile solutions.

    The vast majority of nVidia Quadro desktop and mobile graphics cards.

    All solutions in the nVidia ION netbook line.

    High-performance nVidia Tesla HPC (High Performance Computing) and supercomputer solutions, used both for personal computing and for scalable cluster systems.

It is therefore worth checking this list of favorites before using CUDA-based software products.

In addition to the video card itself, CUDA support requires the corresponding driver. It is the link between the central and graphics processors, acting as a kind of software interface through which program code and data reach the multicore treasury of the GPU. To be on the safe side, nVidia recommends visiting the drivers page for the most recent version.

... but the process itself

How does CUDA work? How do you explain the complex process of parallel computing on the GPU's special hardware architecture without plunging the reader into an abyss of specialized terms?

You can try to do this by imagining how the central processor executes the program in symbiosis with the graphics processor.

Architecturally, the central processing unit (CPU) and its graphics sibling (GPU) are built differently. To borrow an analogy from the automotive world, the CPU is a station wagon, the kind people call a "barn". It looks like a passenger car, yet (from the developers' point of view) it is a jack of all trades, playing the role of a small truck, a bus, and an oversized hatchback all at once. A wagon, in short. It has few cylinder-cores, but they can "pull" almost any task, and its impressive cache memory can hold a heap of data.

The GPU, on the other hand, is a sports car. It has only one function: to get the pilot to the finish line as quickly as possible. So there is no big trunk of memory and there are no extra seats, but it has hundreds of times more cylinder-cores than the CPU.

Thanks to CUDA, developers of GPGPU programs do not need to delve into the complexities of programming for graphics engines such as DirectX and OpenGL.

Unlike the central processor, which can solve any task, including graphics, but only with average performance, the graphics processor is adapted to solving one problem at high speed: turning heaps of polygons at the input into heaps of pixels at the output. Moreover, this problem can be solved in parallel on the GPU's hundreds of relatively simple computing cores.

So what can a tandem of station wagon and sports car achieve? CUDA works roughly like this: the program runs on the CPU until it reaches a piece of code that can be executed in parallel. Then, instead of crawling along on the two (or even eight) cores of the fanciest CPU, that piece is handed over to hundreds of GPU cores. The execution time of that section drops sharply, which means the execution time of the whole program drops as well.

Technologically, little changes for the programmer. CUDA code is written in the C language - more precisely, in a special dialect of it, "C with streams". Developed at Stanford, this C extension is called Brook. The interface that hands Brook code to the GPU is the driver of a CUDA-capable video card. It organizes the processing of that section of the program so that, to the programmer, the GPU looks like a coprocessor to the CPU - much like the math coprocessors of the early days of personal computing. With the advent of Brook, CUDA-capable video cards, and drivers for them, any programmer can reach the GPU from his programs. Before that, this shamanism belonged to a narrow circle of the elite who had spent years honing programming techniques for the DirectX and OpenGL graphics engines.

Into this barrel of honey - the praises of CUDA - we should drop a spoonful of tar, that is, its restrictions. Not every programming task is suited to CUDA. Speeding up routine office tasks will not work, but asking CUDA to calculate the behavior of a thousand identical fighters in World of Warcraft - by all means. That, however, is an example plucked out of thin air, so let's look at what CUDA already solves very effectively in practice.

Righteous works

CUDA is a very pragmatic technology. By adding its support to its graphics cards, nVidia quite rightly expected the CUDA banner to be picked up by a multitude of enthusiasts both in academia and in commerce. And so it happened: CUDA-based projects live on and bring benefits.

NVIDIA PhysX

When advertising their next game masterpiece, publishers often lean on its 3D realism. But no matter how real the 3D game world looks, if elementary physical laws such as gravity, friction, and hydrodynamics are implemented incorrectly, the falseness is felt instantly.

One of the features of the NVIDIA PhysX physics engine is realistic cloth simulation.

Implementing algorithms for computer simulation of basic physical laws is very laborious. The most famous companies in this field are the Irish company Havok, with its cross-platform Havok Physics engine, and California-based Ageia, progenitor of the world's first physics processor (PPU - Physics Processing Unit) and of the corresponding PhysX engine. The former, although acquired by Intel, is now actively optimizing the Havok engine for ATI video cards and AMD processors. Ageia, with its PhysX engine, became part of nVidia, which then solved the rather difficult task of adapting PhysX to CUDA.

This became possible thanks to statistics. It has been statistically proven that no matter how complex rendering is performed by the GPU, some of its cores are still idle. It is on these cores that the PhysX engine works.

Thanks to CUDA, the lion's share of the computation related to game-world physics is now performed on the video card, and the freed-up CPU power goes to other gameplay tasks. The result was not long in coming: according to experts, the performance of PhysX-based gameplay on CUDA has increased by at least an order of magnitude. The realism of the simulated physics has grown as well. CUDA takes over the routine calculation of friction, gravity, and the other familiar phenomena for multidimensional objects. Now not only the heroes and their machinery obey the familiar laws of the physical world, but so do dust, fog, blast waves, flames, and water.

CUDA version of NVIDIA Texture Tools 2

Do you like realistic-looking objects in modern games? Thank the texture artists. But the more reality there is in a texture, the larger it is and the more precious memory it takes up. To avoid this, textures are compressed in advance and decompressed dynamically as needed - and compressing and decompressing means a lot of computation. For working with textures, nVidia released the NVIDIA Texture Tools package, which supports efficient compression and decompression of DirectX textures (the DDS format). The second version of the package boasts support for the BC4 and BC5 compression algorithms introduced with DirectX 11. But the main thing is that NVIDIA Texture Tools 2 supports CUDA. According to nVidia, this gives a 12x speedup in texture compression and decompression tasks, which means game frames load faster and delight the player with their realism.

The NVIDIA Texture Tools 2 package is tailored for working with CUDA. The performance gains for compressing and decompressing textures are evident.

The use of CUDA can significantly improve the efficiency of video surveillance.

Real-time video stream processing

Say what you will, but from the standpoint of surveillance the modern world is much closer to Orwell's world of Big Brother than it seems. Car drivers and visitors to public places alike feel the gaze of video cameras.

Full-flowing rivers of video information pour into processing centers and... run into a weak link: a human being. In most cases it is a human who is the last authority watching over the video world, and not the most effective one - he blinks, gets distracted, and tries to fall asleep.

Thanks to CUDA, it became possible to implement algorithms that track multiple objects in a video stream simultaneously. The process runs in real time on full 30 fps video. Compared with implementing such an algorithm on modern multi-core CPUs, CUDA gives a two- to three-fold performance increase, which, you will agree, is quite a lot.

Converting video, filtering audio

Badaboom Video Converter is the first to use CUDA to speed up conversions.

It is nice to watch a new release in FullHD quality on a big screen. But you cannot take the big screen on the road, and a FullHD video codec will choke on the low-power processor of a mobile gadget. Conversion comes to the rescue - though most people who have dealt with it in practice complain about how long it takes. That is understandable: the process is routine and well suited to parallelization, and running it on the CPU is far from optimal.

CUDA, however, copes with it brilliantly. The first swallow was the Badaboom converter from Elemental. Its developers did not miscalculate in choosing CUDA: tests show that a standard hour-and-a-half movie is converted to iPhone/iPod Touch format in less than twenty minutes, whereas on the CPU alone the same process takes more than an hour.

CUDA also helps professional audio enthusiasts. Any of them would give half a kingdom for an efficient FIR crossover - a set of filters that split the audio spectrum into several bands. The process is very laborious, and with a large volume of audio material it sends the sound engineer out for a "smoke break" of several hours. A CUDA-based implementation of the FIR crossover speeds it up by a factor of hundreds.

The future of CUDA

Having made GPGPU technology a reality, CUDA does not intend to rest on its laurels. As happens everywhere, a feedback principle is at work: not only does the architecture of nVidia's video processors influence the development of CUDA SDK versions, but CUDA technology itself pushes nVidia to revise the architecture of its chips. An example of this feedback is the nVidia ION platform, whose second version is specially optimized for CUDA tasks. This means that even relatively inexpensive hardware will give consumers the full power and brilliance of CUDA.

Let us turn to history - back to 2003, when Intel and AMD were racing each other for the most powerful processor. In just a few years, clock speeds rose significantly as a result of this race, especially after the release of the Intel Pentium 4.

But the race was rapidly approaching its limit. After a wave of huge clock-speed increases (between 2001 and 2003 the Pentium 4's clock doubled from 1.5 to 3 GHz), users had to be content with the tenths of a gigahertz that manufacturers managed to squeeze out (from 2003 to 2005, clock speeds rose from 3 to 3.8 GHz).

Even architectures optimized for high clock speeds, such as Prescott, began to run into difficulties, and this time not only manufacturing ones: chip makers had simply run into the laws of physics. Some analysts even predicted that Moore's Law would cease to hold. But it did not. The original meaning of the law is often distorted; it concerns the number of transistors on the surface of a silicon die. For a long time, growth in the number of transistors in a CPU was accompanied by a corresponding growth in performance, which led to the distortion. But then the situation became more complicated: CPU architects ran up against the law of diminishing returns - the number of transistors that had to be added for a given performance gain kept growing, leading to a dead end.



While CPU makers were tearing out their last hairs trying to solve their problems, GPU makers continued to benefit remarkably from Moore's Law.

Why did they not hit the same dead end as the CPU architects? The reason is very simple: CPUs are designed for maximum performance on a stream of instructions that process different data (both integers and floating-point numbers), perform random memory accesses, and so on. To this day, developers try to extract more instruction-level parallelism - that is, to execute as many instructions in parallel as possible. With the Pentium, for example, superscalar execution appeared, making it possible under certain conditions to execute two instructions per clock. The Pentium Pro added out-of-order execution, which allowed the computational units to be used more efficiently. The problem is that parallel execution of a sequential instruction stream has obvious limits, so blindly increasing the number of computational units gives no benefit: they would simply sit idle most of the time.

In contrast, the GPU's job is relatively straightforward: it accepts a group of polygons on one side and produces a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in a GPU a large part of the die can be devoted to computational units which, unlike those in a CPU, will actually be used.




The GPU differs from the CPU in other ways too. Memory access in the GPU is highly coherent: if one texel is read, a few clock cycles later the adjacent texel will be read; when one pixel is written, a few cycles later the adjacent one will be written. By organizing memory sensibly, you can get performance close to the theoretical bandwidth. This means the GPU, unlike the CPU, does not need a huge cache; its cache's role is simply to speed up texturing operations, and a few kilobytes holding the texels used by bilinear and trilinear filters are enough.




Long live GeForce FX!

For a long time the two worlds remained separate. The CPU (or even several CPUs) was used for office tasks and web applications, while GPUs were only good at accelerating rendering. But one feature changed everything: the advent of programmable GPUs. At first the central processors had nothing to fear - the first so-called programmable GPUs (NV20 and R200) were hardly a threat. Programs were limited to roughly 10 instructions and worked on very exotic data types such as 9- or 12-bit fixed-point numbers.




But Moore's Law again showed its best side. The growth in transistor count not only increased the number of computing units but also improved their flexibility. The appearance of the NV30 can be considered a significant step forward for several reasons. Gamers, admittedly, did not much care for NV30 cards, but the new GPUs introduced two features that would change the perception of GPUs as something more than graphics accelerators:

  • support for single-precision floating-point calculations (even if it did not comply with the IEEE 754 standard);
  • support for more than a thousand instructions.

Thus all the conditions were in place to attract pioneering researchers, who are always on the lookout for additional computing power.

The idea of using graphics accelerators for mathematical calculations is not new. The first attempts were made back in the 1990s. Of course, they were very primitive, limited for the most part to exploiting certain hardwired features such as rasterization and Z-buffers to speed up tasks like path finding or drawing Voronoi diagrams.




In 2003, with the advent of more evolved shaders, a new milestone was reached - this time, performing matrix calculations. This was the year an entire section of SIGGRAPH ("Computations on GPUs") was dedicated to the new area of IT. This early initiative was named GPGPU (General-Purpose computation on GPU). And an early turning point was the appearance of BrookGPU.

To understand the role of BrookGPU, you need to understand how things were done before it appeared. The only way to get at GPU resources in 2003 was to use one of the two graphics APIs, Direct3D or OpenGL. Consequently, developers who wanted GPU power for their computations had to rely on those two APIs. The problem was that they were not always experts in video-card programming, which made access to the technology very difficult. Where 3D programmers think in terms of shaders, textures, and fragments, specialists in parallel programming think in terms of threads, kernels, scatter operations, and so on. So the first step was to draw analogies between the two worlds.

  • Stream - a flow of elements of the same type; on the GPU it can be represented by a texture. In classical programming the closest analogue is an array.
  • Kernel - a function applied independently to each element of the stream; the equivalent of a pixel shader. In classical programming the analogue would be a loop applied to a large number of elements.
  • To read back the results of applying a kernel to a stream, a texture must be created. On the CPU there is no equivalent, since there is full access to memory.
  • The memory location where a write will land (in scatter operations) is controlled through the vertex shader, since the pixel shader cannot change the coordinates of the pixel being processed.

As you can see, even with these analogies in hand, the task does not look simple. And that is where Brook came to the rescue. The name refers to extensions to the C language ("C with streams"), as the Stanford developers called them. In essence, Brook's job was to hide all the components of the 3D API from the programmer, which made it possible to present the GPU as a coprocessor for parallel computations. To do this, the Brook compiler processed a .br file containing C++ code and extensions and then generated C++ code linked against a library supporting different back ends (DirectX, OpenGL ARB, OpenGL NV3x, x86).




Brook has several merits, the first being that it brought GPGPU out of the shadows and introduced the technology to the general public. After the project was announced, a number of IT sites were optimistic to the point of suggesting that Brook's release cast doubt on the very existence of CPUs, which would soon be replaced by more powerful GPUs. But, as we can see, five years later that has not happened, and frankly we do not think it ever will. On the other hand, looking at the successful evolution of CPUs, which are increasingly oriented toward parallelism (more cores, SMT multithreading, wider SIMD blocks), and at GPUs, which on the contrary are becoming ever more universal (support for single-precision floating-point calculations, integer computation, double-precision support), it looks as if the GPU and the CPU may eventually simply merge. What then? Will GPUs be swallowed by CPUs, as happened with math coprocessors? Quite possibly; Intel and AMD are both working on similar projects today. But much can still change.

But back to our topic. Brook's achievement was to popularize the GPGPU concept: it significantly simplified access to GPU resources, which let more and more users master the new programming model. On the other hand, for all of Brook's qualities, there was still a long way to go before GPU resources could be used routinely for computation.

One problem is the difference in levels of abstraction and, in particular, the considerable extra overhead created by the 3D API. But more serious was the compatibility problem, about which the Brook developers could do nothing. Competition among GPU manufacturers is fierce, so they frequently optimize their drivers; while such optimizations are mostly good for gamers, they could break Brook compatibility in a moment. It is therefore hard to imagine using this API in production code meant to run somewhere real, and for a long time Brook remained the preserve of hobbyist researchers and programmers.

However, Brook's success was enough to attract the attention of ATI and nVidia: they became interested in the initiative, since it could expand the market by opening up a significant new sector for both companies.

Researchers originally involved with the Brook project soon joined the development teams in Santa Clara to present a global strategy for developing the new market. The idea was to create a combination of hardware and software suited to GPGPU tasks. Since nVidia's developers know all the secrets of their GPUs, there was no need to rely on the graphics API; instead, they could talk to the GPU directly through the driver - which, of course, has problems of its own. So the CUDA (Compute Unified Device Architecture) team created a set of software layers for working with the GPU.




As you can see in the diagram, CUDA provides two APIs:

  • High-level API: CUDA Runtime API;
  • low-level API: CUDA Driver API.

Since the high-level API is implemented on top of the low-level one, each runtime function call is broken down into simpler instructions processed by the Driver API. Note that the two APIs are mutually exclusive: the programmer can use one or the other, but mixing calls to both will not work. The term "high-level API" is relative, though. Even the Runtime API is one that many would call low-level; still, it provides functions that are quite handy for initialization and context management. But don't expect a particularly high level of abstraction: you still need a good understanding of nVidia GPUs and how they work.
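
As a rough illustration (a sketch of my own, not from the original article), here is the same device-memory allocation in the two APIs; the Driver API makes initialization and context creation explicit, while the Runtime API hides them:

// Runtime API (high level): the context is created implicitly.
float *d_buf;
cudaMalloc((void **)&d_buf, 1024 * sizeof(float));

// Driver API (low level): everything is explicit.
CUdevice  dev;
CUcontext ctx;
CUdeviceptr d_ptr;
cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
cuMemAlloc(&d_ptr, 1024 * sizeof(float));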

The Driver API is more demanding to work with: it takes more effort to get GPU processing running. On the other hand, the low-level API is more flexible, giving the programmer extra control when needed. Both APIs can work with OpenGL and Direct3D resources (only version 9 of the latter for now). The benefit of this feature is obvious: CUDA can be used to create resources (geometry, procedural textures, and so on) that are then passed to the graphics API, or, conversely, the 3D API can hand its rendering results to a CUDA program that performs post-processing. There are many examples of such interaction, and the advantage is that the resources stay in GPU memory and do not have to travel across the PCI Express bus, which is still a bottleneck.
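
As a sketch of what this interop looks like in practice (my own illustration using the early cudaGL* calls of that period; the exact functions have since been superseded, so treat the names as an assumption):

// vbo is an OpenGL buffer object created elsewhere with glGenBuffers/glBufferData.
GLuint vbo;

cudaGLRegisterBufferObject(vbo);                  // register the buffer with CUDA once

float *d_vertices;
cudaGLMapBufferObject((void **)&d_vertices, vbo); // map it into CUDA's address space
// ... launch a kernel that reads or writes d_vertices in GPU memory ...
cudaGLUnmapBufferObject(vbo);                     // hand the buffer back to OpenGL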

However, sharing resources in video memory is not always smooth and can cause some headaches. For example, when the resolution or color depth changes, graphics data takes priority: if the framebuffer needs more resources, the driver will happily take them away from CUDA applications, which will simply crash with an error. Not very elegant, but it should not happen very often. And since we are on the subject of drawbacks: if you want to use several GPUs for CUDA applications, you must first disable SLI mode, otherwise CUDA applications will be able to "see" only one GPU.

Finally, the third software layer consists of libraries - two of them, to be precise:

  • CUBLAS, which provides the building blocks for linear-algebra computations on the GPU;
  • CUFFT, which supports the computation of Fourier transforms, an algorithm widely used in signal processing (a brief usage sketch follows).
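
As a minimal illustration of how these libraries are called (a sketch of my own; device-memory allocation, data upload, and error checking are omitted), a one-dimensional complex-to-complex FFT with CUFFT looks roughly like this:

#include <cufft.h>

// d_signal is assumed to already hold N cufftComplex values in GPU memory.
cufftHandle plan;
cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // plan one 1D C2C transform
cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
cufftDestroy(plan);                                     // release the plan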

Before we dive into CUDA, let me define a number of terms scattered throughout the nVidia documentation. The company has chosen a very particular terminology that takes some getting used to. First of all, a thread in CUDA has far from the same meaning as a CPU thread, nor is it the equivalent of a thread in our articles on GPUs. A GPU thread here is an elementary portion of the data to be processed. Unlike CPU threads, CUDA threads are very "light": switching context between two threads is by no means a resource-intensive operation.

The second term often found in the CUDA documentation is warp. There is no risk of confusion here, since there is no direct analogue in Russian (unless you are a fan of Star Trek or Warhammer). The term actually comes from the textile industry, where the weft yarn is pulled through the warp yarns stretched on the loom. A warp in CUDA is a group of 32 threads and is the minimum amount of data processed in SIMD fashion by a CUDA multiprocessor.

But this granularity is not always convenient for the programmer, so in CUDA, rather than working with warps directly, you work with blocks containing from 64 to 512 threads.

Finally, these blocks are combined into grids. The advantage of this grouping is that the number of blocks processed simultaneously by the GPU is closely tied to the hardware resources, as we will see below. Grouping blocks into grids lets you abstract away from this limitation entirely and apply the kernel to more threads in a single call without worrying about fixed resources; the CUDA libraries take care of all that. The model also scales well: if the GPU has few resources, it executes the blocks sequentially; if it has many computing processors, the blocks can execute in parallel. The same code can thus run on entry-level GPUs, on top-end ones, and even on future models.

A couple of other terms in the CUDA API designate the CPU (host) and the GPU (device). If this little introduction has not scared you off, it is time to take a closer look at CUDA.

If you read Tom's Hardware Guide regularly, the architecture of the latest GPUs from nVidia will be familiar to you. If not, we recommend the article "nVidia GeForce GTX 260 and 280: the next generation of graphics cards". For CUDA, nVidia presents the architecture in a slightly different way, exposing some details that were previously hidden.

As you can see from the illustration above, the nVidia shader core is made up of several texture processor clusters (Texture Processor Cluster, TPC). The 8800 GTX, for example, used eight clusters, the 8800 GTS six, and so on. Each cluster essentially consists of a texture unit and two streaming multiprocessors. Each multiprocessor comprises a pipeline front end, which fetches and decodes instructions and dispatches them for execution, and a back end, which consists of eight computing units and two super function units (SFU, Super Function Unit), where instructions are executed in SIMD fashion: one instruction is applied to all the threads in a warp. nVidia calls this execution style SIMT (single instruction, multiple threads). It is important to note that the back end runs at twice the frequency of the front end. In practice this means the back end looks twice as "wide" as it really is (i.e. like a 16-way SIMD block instead of an eight-way one). The streaming multiprocessors work as follows: each clock cycle, the front end selects a warp that is ready to execute and issues an instruction. Applying the instruction to all 32 threads of the warp would take four clock cycles, but since the back end runs at double frequency it takes only two (counted in front-end cycles). Therefore, so that the front end does not sit idle every other cycle and the hardware stays as busy as possible, the ideal is to alternate instruction types each cycle: a classic instruction on one cycle and an SFU instruction on the next.

Each multiprocessor has a specific set of resources that are worth understanding. There is a small area of memory called shared memory, 16 KB per multiprocessor. This is by no means a cache: the programmer is free to use it as he sees fit. In other words, it is something close to the Local Store of the SPUs in the Cell processor. This detail is interesting because it underlines that CUDA is a combination of software and hardware technologies. This memory area is not used by pixel shaders - as nVidia wittily puts it, "we don't like it when pixels talk to each other."

This memory area lets threads within one block exchange information. It is important to stress the limitation: all threads of a block are guaranteed to run on one multiprocessor, whereas the assignment of blocks to multiprocessors is not defined at all, and two threads from different blocks cannot exchange information at run time. Using shared memory well is therefore not entirely simple. Still, shared memory is worthwhile except when several threads hit the same memory bank and cause a conflict; in all other situations, access to shared memory is as fast as access to registers.
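
To illustrate (a minimal sketch of my own, not from the article), here is a kernel in which the threads of one block stage data in shared memory and synchronize before reading their neighbours' values; it assumes a block size of 256 threads, which is an arbitrary choice:

// Each block loads a tile of the input into shared memory, waits until the
// whole tile is there, and then each thread adds its left neighbour's value.
__global__ void neighbourSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();                              // all threads of the block sync here
    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = tile[threadIdx.x] + tile[left];
    }
}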

Shared memory is not the only memory the multiprocessors can access. They can also use video memory, but it has lower bandwidth and higher latency. Therefore, to reduce how often video memory is accessed, nVidia equipped the multiprocessors with caches (roughly 8 KB per multiprocessor) for constants and textures.

Each multiprocessor has 8192 registers, shared by all threads of all blocks active on that multiprocessor. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads). The 8800 GTX can therefore handle up to 12,288 threads at a time. These limits are worth mentioning because they let an algorithm be optimized around the available resources.

Optimizing a CUDA program therefore comes down to striking the right balance between the number of blocks and their size. More threads per block helps hide memory latency, but it reduces the number of registers available per thread. Moreover, a block of 512 threads would be inefficient, since only one such block could be active on a multiprocessor, wasting the potential for 256 more threads. nVidia therefore recommends blocks of 128 or 256 threads, which gives the best compromise between latency hiding and register count for most kernels.
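
A quick worked example using the figures above (purely illustrative, not a statement about any particular kernel): with the maximum of 768 active threads sharing 8192 registers, each thread can use at most about 10 registers; if a kernel needs 16 registers per thread, occupancy drops to two 256-thread blocks instead of three.

8192 registers / 768 threads ≈ 10 registers per thread at full occupancy
8192 registers / 16 registers per thread = 512 threads = 2 blocks of 256 threads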

From the programming point of view, CUDA consists of a set of C extensions reminiscent of BrookGPU, plus a few specific API calls. The extensions include qualifiers applied to functions and variables. The key word to remember is __global__: placed before a function, it marks that function as a kernel - it will be called by the CPU and executed on the GPU. The __device__ prefix indicates that a function will execute on the GPU (which CUDA calls the "device") and can only be called from the GPU, in other words from another __device__ function or from a __global__ function. Finally, the optional __host__ prefix denotes a function that is called by the CPU and executed by the CPU - in other words, an ordinary function.
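
A small sketch of my own (the names are arbitrary) showing the three qualifiers side by side:

// __device__: runs on the GPU, callable only from GPU code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel - called from the CPU, executed on the GPU.
__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

// __host__ (optional): an ordinary function, called and executed on the CPU.
__host__ int elementsPerBlock(int n, int blocks) { return n / blocks; }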

There are a few restrictions on __device__ and __global__ functions: they cannot be recursive (that is, call themselves), and they cannot have a variable number of arguments. Also, since __device__ functions live in the GPU's memory space, it makes sense that you cannot take their address. Variables likewise have qualifiers that indicate in which memory area they will be stored; a variable prefixed with __shared__, for example, is placed in the shared memory of the streaming multiprocessor. Calling a __global__ function is slightly different: at the call site you have to specify the execution configuration, namely the size of the grid to which the kernel will be applied and the size of each block. Take, for example, a kernel with the following signature:

__global__ void Func(float* parameter);

It will be called as

Func<<<Dg, Db>>>(parameter);

where Dg is the grid size and Db is the block size. These two variables are of type dim3, a small vector type introduced with CUDA.
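
To make the execution configuration concrete, here is a short sketch of our own (the kernel name and sizes are ours, not from the documentation) that covers a 1,024 x 1,024 set of elements with 16 x 16 blocks:

__global__ void Fill(float* out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = 1.0f;                 // each thread handles one element
}

void LaunchFill(float* d_out)                  // host side
{
    dim3 Db(16, 16);                           // 256 threads per block
    dim3 Dg(1024 / Db.x, 1024 / Db.y);         // enough blocks to cover 1024 x 1024
    Fill<<<Dg, Db>>>(d_out, 1024);
}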

The CUDA API also contains functions for working with memory in video RAM: cudaMalloc to allocate it, cudaFree to release it, and cudaMemcpy to copy data between system RAM and video RAM in either direction.
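
In practice these calls frame every kernel launch. The following sketch, with hypothetical names of ours and reusing the Func kernel declared above, shows the typical round trip: allocate device memory, copy the input up, run the kernel, copy the result back, free the buffer:

#include <cuda_runtime.h>

void RunOnGpu(const float* host_in, float* host_out, int n)
{
    float* d_buf = 0;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void**)&d_buf, bytes);                          // allocate VRAM
    cudaMemcpy(d_buf, host_in, bytes, cudaMemcpyHostToDevice);  // RAM -> VRAM

    Func<<<n / 256, 256>>>(d_buf);                              // the kernel from above

    cudaMemcpy(host_out, d_buf, bytes, cudaMemcpyDeviceToHost); // VRAM -> RAM
    cudaFree(d_buf);                                            // release VRAM
}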

We will end this overview with the rather interesting way in which a CUDA program is compiled: compilation happens in several steps. First the CPU code is extracted and passed to a standard compiler. The code intended for the GPU is first translated into the intermediate language PTX, which resembles assembly and lets you inspect the code for potential inefficiencies. In the final phase the intermediate language is translated into the specific instructions of the target GPU and packaged into a binary.
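
For reference, a plausible way to drive this from the command line with the nvcc compiler shipped in the SDK (our own invocation, given as an assumption rather than a recipe) looks like this; the --ptx option stops after the intermediate stage so you can read the PTX, while the normal invocation produces the final binary:

nvcc --ptx kernel.cu -o kernel.ptx    # emit only the intermediate PTX for inspection
nvcc kernel.cu -o program             # full build: PTX is translated to GPU code and linked with the host part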

After looking through the nVidia documentation, we simply had to try CUDA for ourselves. Indeed, what better way to evaluate an API than to write your own program with it? That is when most problems surface, even if everything looks perfect on paper. Practice is also the best test of how well you have understood the principles laid out in the CUDA documentation.

It is fairly easy to dive into a project like this: plenty of free yet high-quality tools are available for download today. For our test we used Visual C++ Express 2005, which has everything you need. The hardest part was finding a program that would not take weeks to port to the GPU, yet was interesting enough that our effort would not be wasted. In the end we chose a piece of code that takes a height map and computes the corresponding normal map. We will not go into the function in detail, since it is hardly the point of this article. In short, the program works on the curvature of areas: for each pixel of the source image, a matrix is applied to the neighbouring pixels to determine, via a more or less complex formula, the colour of the corresponding pixel in the generated image. The advantage of this function is that it is very easy to parallelize, so the test is a good demonstration of what CUDA can do.
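
To give an idea of what such a kernel might look like, here is a simplified sketch of our own (not the actual code we used; the names and the formula are illustrative only): each thread reads the heights around its pixel, builds a gradient and writes the resulting normal.

__global__ void HeightToNormal(const float* height, float4* normal, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Neighbouring heights (clamped at the borders for simplicity).
    float left  = height[y * w + max(x - 1, 0)];
    float right = height[y * w + min(x + 1, w - 1)];
    float up    = height[max(y - 1, 0) * w + x];
    float down  = height[min(y + 1, h - 1) * w + x];

    // Gradient -> unnormalized normal, then normalize it.
    float3 n = make_float3(left - right, up - down, 2.0f);
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    normal[y * w + x] = make_float4(n.x / len, n.y / len, n.z / len, 1.0f);
}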


Another advantage is that we already had a CPU implementation, so we could compare its output with the CUDA version without reinventing the wheel.

Let us repeat once more that the purpose of this test was to get familiar with the CUDA SDK tools, not to run a comparative benchmark of CPU and GPU versions. Since this was our first attempt at a CUDA program, we did not really expect high performance. And since this part of the code is not critical, the CPU version was not optimized, so a direct comparison of the results is not particularly meaningful.

Performance

Nevertheless, we measured execution times to see whether there is any benefit in using CUDA even with the crudest implementation, or whether a long and tedious apprenticeship is needed before the GPU yields any gain. The test machine came from our development lab: a laptop with a Core 2 Duo T5450 processor and a GeForce 8600M GT graphics card, running Vista. That is far from a supercomputer, but the results are all the more interesting for it, since the test is not "tuned" for the GPU. It is always nice to see nVidia demonstrate huge gains on systems with monstrous GPUs and enormous bandwidth, but in practice many of the 70 million CUDA-capable GPUs on the PC market today are nowhere near as powerful, which is exactly why our test is relevant.

For a 2,048 x 2,048 pixel image, we got the following results.

  • CPU 1 thread: 1 419 ms;
  • CPU 2 threads: 749 ms;
  • CPU 4 threads: 593 ms;
  • GPU (8600M GT) blocks of 256 threads: 109 ms;
  • GPU (8600M GT) blocks of 128 threads: 94 ms;
  • GPU (8800 GTX) blocks of 128 threads / 256 threads: 31 ms.

Several conclusions can be drawn from these results. Let us start with the fact that, despite everything said about programmers' supposed laziness, we did modify the original CPU version to use multiple threads. As we mentioned, the code is ideal for this: all it takes is to split the source image into as many zones as there are threads. Note that the speed-up from going from one thread to two on our dual-core CPU is almost linear, which again points to the parallel nature of the test program. Quite unexpectedly, the four-thread version also turned out to be faster, although on our processor one would rather expect a drop in efficiency from the overhead of managing the extra threads. How can this result be explained? It is hard to say, but the Windows thread scheduler may be responsible; in any case, the result is reproducible. With smaller textures (512 x 512), the gain from multithreading was less pronounced (about 35% versus 100%), and the four-thread version behaved more logically, showing no gain over the two-thread version. The GPU was still faster, but by a smaller margin (the 8600M GT was three times faster than the two-thread version).




The second significant observation is that even the slowest GPU implementation turned out to be nearly six times faster than the best-performing CPU version. For a first program and an unoptimized version of the algorithm, that is very encouraging. Note also that we got noticeably better results with the smaller blocks, although intuition might suggest the opposite. The explanation is simple: our program uses 14 registers per thread, so with 256-thread blocks it needs 3,584 registers per block, and 768 threads are required to fully load a multiprocessor, as we showed earlier. In our case that would be three blocks, or 10,752 registers. But a multiprocessor has only 8,192 registers, so it can keep only two such blocks active. With 128-thread blocks, by contrast, we need 1,792 registers per block; dividing 8,192 by 1,792 and rounding down gives four blocks. In practice the number of threads is the same (512 per multiprocessor, although theoretically 768 are needed for a full load), but having more blocks gives the GPU extra flexibility in memory access: when an operation with high latency is in flight, it can start executing instructions from another block while waiting for the results. Four blocks clearly hide latency better, especially since our program makes multiple memory accesses.
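
The little calculation above generalizes easily; a rough sketch of it as a helper (our own simplification, since the real limits also involve shared memory and the warp count) might look like this:

// Rough estimate of how many blocks fit on one G80 multiprocessor,
// looking only at the register budget (8,192 registers per SM).
int MaxBlocksByRegisters(int regsPerThread, int threadsPerBlock)
{
    const int regsPerSM = 8192;
    int regsPerBlock = regsPerThread * threadsPerBlock;   // e.g. 14 * 256 = 3,584
    return regsPerSM / regsPerBlock;                      // e.g. 8192 / 3584 = 2
}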

Analysis

Finally, despite what we said above, we could not resist the temptation to run the program on the 8800 GTX, which turned out to be three times faster than the 8600, regardless of block size. You might expect the difference to be a factor of four or more given the respective architectures: 128 ALUs/shader processors versus 32, and a higher clock speed (1.35 GHz versus 950 MHz), but in practice it was not. The most likely hypothesis is that memory access is the limiting factor. To be more precise, the source image is accessed as a multidimensional CUDA array, a rather convoluted term for what is nothing more than a texture. That brings several benefits:

  • accesses benefit from texture cache;
  • we use wrapping mode, which does not need to handle image borders, unlike the CPU version.

We could also have taken advantage of "free" filtering with normalized addressing (coordinates in the range 0 to 1 instead of 0 to the image size), but that is of little use in our case. As you know, the 8600 has 16 texture units compared with 32 for the 8800 GTX, so the ratio between the two architectures on that front is only two to one. Add the difference in clock speed and we get (32 x 0.575) / (16 x 0.475) = 2.4, close to the "three to one" we actually measured. This theory also conveniently explains why the block size changes little on the G80: the ALUs are still limited by the texture units.
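
For the curious, binding an image to a texture with wrap addressing looked roughly like this with the CUDA 2.0-era texture reference API (a sketch under our assumptions, with illustrative names; this interface has since been superseded):

texture<float, 2, cudaReadModeElementType> heightTex;   // file-scope texture reference

void BindHeightMap(const float* host_data, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* arr;
    cudaMallocArray(&arr, &desc, w, h);                  // the "CUDA array" from the text
    cudaMemcpyToArray(arr, 0, 0, host_data, w * h * sizeof(float),
                      cudaMemcpyHostToDevice);

    heightTex.normalized = 1;                            // coordinates in [0, 1]
    heightTex.addressMode[0] = cudaAddressModeWrap;      // wraps around, no border handling
    heightTex.addressMode[1] = cudaAddressModeWrap;
    cudaBindTextureToArray(heightTex, arr);
}

// In the kernel a height is then fetched as: tex2D(heightTex, (x + 0.5f) / w, (y + 0.5f) / h);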




Promising results aside, our first encounter with CUDA went very well, considering the not exactly ideal conditions we chose. Developing on a Vista laptop meant using the CUDA SDK 2.0, still in beta, with driver 174.55, also a beta. Despite that, we have no unpleasant surprises to report, only the usual early errors during the first debugging sessions, when our still very buggy program tried to access memory outside the allocated space.

The monitor flickered wildly, then the screen went black... until Vista started its driver recovery service and everything returned to normal. It is still somewhat startling to watch if you are used to the ordinary Segmentation Fault you get from a comparable mistake in a normal program. Finally, a small criticism of nVidia: in all the documentation available for CUDA there is no short guide that walks you step by step through setting up a development environment in Visual Studio. It is not a big problem, since the SDK is full of examples you can dissect to understand how a CUDA application is put together, but a beginner's guide would have been welcome.




nVidia introduced CUDA with the GeForce 8800. The promises were tempting at the time, but we kept our enthusiasm in check until the actual test. Back then it looked more like staking out territory to stay on the GPGPU wave; without an SDK you could download, it was hard to rule out yet another piece of marketing vapour that would come to nothing. It would not be the first time a good initiative was announced too early and never materialised for lack of follow-through, especially in such a competitive sector. Now, a year and a half after the announcement, we can confidently say that nVidia has kept its word.

The SDK appeared in beta fairly quickly, in early 2007, and has been updated rapidly ever since, which shows how much the project matters to nVidia. Today CUDA is developing very nicely: the SDK is available in beta 2.0 for the main operating systems (Windows XP and Vista, Linux, and 1.1 for Mac OS X), and nVidia has dedicated an entire section of its site to developers.

On a more personal level, our impression of these first steps with CUDA was very positive. Even if you are familiar with GPU architecture, it is easy to find your way around. The API looks clear at first glance, and you quickly start believing you will get convincing results. But won't the computation time gained be lost to the numerous transfers between CPU and GPU? And how do you put those thousands of threads to work with almost no synchronization primitives? We began our experiments with all these worries in mind. They dissipated quickly when the first version of our algorithm, trivial as it was, turned out to be significantly faster than on the CPU.

So CUDA is not just a toy for researchers trying to persuade university officials to buy them a GeForce. CUDA is a genuinely available technology that any C programmer can use, provided they are willing to invest the time and effort to get used to the new programming paradigm. That effort will not be wasted if your algorithms parallelize well. We would also like to thank nVidia for providing complete, high-quality documentation for beginning CUDA programmers.

What does CUDA need to become a recognized API? In one word: portability. We know the future of IT lies in parallel computing; everyone is already preparing for the shift, and every initiative, software or hardware, points in that direction. At the moment, though, in terms of development paradigms we are still at the very beginning: we create threads by hand and try to schedule access to shared resources ourselves, which is manageable as long as the number of cores can be counted on the fingers of one hand. But in a few years, when processors have them by the hundred, that will no longer be possible. With CUDA, nVidia has taken a first step towards solving the problem, but of course this solution only works on GPUs from this company, and not even all of them: only the GeForce 8 and 9 families (and their Quadro/Tesla derivatives) can run CUDA programs today. And the new 260/280 line, of course.




nVidia can boast of having sold 70 million CUDA-compatible GPUs worldwide, but that is still not enough to become the de facto standard, especially since the competition is not sitting idle. AMD offers its own SDK (Stream Computing), and Intel has announced a solution (Ct), though it is not yet available. A standards war is coming, and there will clearly not be room for three competitors in the market, unless another player, such as Microsoft, steps in with a common API, which would of course make developers' lives easier.

nVidia therefore still faces plenty of difficulties in getting CUDA adopted. Technologically it is without doubt a successful solution, but it remains to convince developers of its prospects, and that will not be easy. However, judging by the many recent announcements and news items around the API, the future looks far from bleak.

According to Darwin's theory of evolution, the first great ape (to be precise, homo antecessor, the predecessor of humans) eventually turned into us. Multi-ton computing centres with a thousand or more vacuum tubes, occupying entire rooms, have given way to half-kilogram laptops which, incidentally, are no slouches in comparison. Antediluvian typewriters have become multifunctional devices that print anything on anything (even on the human body). The processor giants suddenly decided to wall a graphics core into their "stone". And video cards have begun not only to show a picture at acceptable FPS and graphics quality, but also to perform all sorts of computations. And how! This article is about the technology of multithreaded computing on the GPU.

Why GPU?

One wonders why anybody decided to shift computing power onto the graphics adapter in the first place. Processors are clearly still in fashion and unlikely to give up their warm spot, but the GPU has a couple of trump cards up its sleeve, plus a joker, and the sleeve is roomy. A modern central processor is tuned to get maximum performance when processing integer and floating-point data, without worrying much about parallel processing of information. The architecture of a video card, on the other hand, makes it quick and easy to "parallelize" data processing: on one side polygons are being counted (thanks to the 3D pipeline), on the other, pixel texture processing; there is a well-coordinated division of labour inside the card's core. In addition, the cooperation of memory and the video processor is more efficient than the "RAM-cache-CPU" combination. The moment one unit of data starts being processed by one GPU stream processor, another unit is loaded in parallel into another, so in principle it is easy to achieve a GPU load comparable to the bandwidth of the bus; for this, however, the pipelines must be loaded uniformly, without conditional jumps and branching. The central processor, by virtue of its universality, needs a cache stuffed with information to do its work.

Pundits have pondered the use of GPUs for parallel computing and mathematics and concluded that many scientific calculations are in many ways similar to 3D graphics processing. Many experts believe that the fundamental factor in the development of GPGPU (General Purpose computation on GPU, general-purpose computing on a video card) was the appearance of the Brook GPU project in 2003.

The project's creators at Stanford University had to solve a difficult problem: to make the graphics adapter perform general-purpose computations, in hardware and in software. And they did it. Using the generic C language, the American scientists made the GPU work like a processor, adjusted for parallel processing. After Brook, a number of VGA-computing projects appeared, such as the Accelerator library, the Brahma library, the GPU++ metaprogramming system and others.

CUDA!

A presentiment of where this could lead made AMD and NVIDIA latch onto Brook GPU like pit bulls. Marketing policy aside, by implementing everything correctly a company can gain a foothold not only in the graphics sector of the market but also in computing (look at the dedicated compute cards and Tesla servers with their hundreds of multiprocessors), displacing the usual CPUs.

Naturally, the "FPS masters" parted ways at this stumbling block and each went their own way, but the basic principle remained unchanged: do the computing with GPU resources. And now we will take a closer look at the technology of the "greens", CUDA (Compute Unified Device Architecture).

Our "heroine's" job is to provide two APIs at once. The first is the high-level CUDA Runtime API, whose functions are broken down into simpler ones and passed to the lower-level CUDA Driver API, so the phrase "high-level" should be taken with a grain of salt. All the salt is in the driver, and in the libraries kindly created by NVIDIA's developers: CUBLAS (routines for mathematical computations) and CUFFT (computations based on the Fourier transform). Now let us move on to the practical part of the material.
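
As an aside, here is a minimal sketch of what calling one of those libraries might look like, using the old single-precision CUBLAS interface of that period; treat the exact calls as an assumption on our part rather than a recipe:

#include <cublas.h>

// y = alpha * x + y on the GPU via CUBLAS (legacy interface).
void SaxpyWithCublas(int n, float alpha, const float* x, float* y)
{
    float *d_x, *d_y;
    cublasInit();                                        // start the library
    cublasAlloc(n, sizeof(float), (void**)&d_x);         // allocate VRAM
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);     // RAM -> VRAM
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);               // the actual computation
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);     // VRAM -> RAM
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}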

CUDA terminology

NVIDIA uses rather peculiar definitions for the CUDA API, which differ from the ones used when working with a central processor.

Thread: a single stream of execution that processes its own portion of the data (it does not require large processing resources).

Warp: a group of 32 threads. Data is processed in warps only, so a warp is the minimum unit scheduled for execution.

Block: a set of threads (from 64 to 512), or equivalently a set of warps (from 2 to 16).

Grid: a collection of blocks. This division of the data exists purely for performance: if the number of multiprocessors is large, blocks execute in parallel; if you were less lucky with the card (the developers recommend an adapter no lower than a GeForce 8800 GTS 320 MB), blocks are processed sequentially.

NVIDIA also introduces the concepts of kernel, host and device.
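
To connect this terminology to actual code, here is a small sketch of our own showing how a thread finds its place inside its block and inside the grid; the built-in variables (threadIdx, blockIdx, blockDim, gridDim) are part of the C extensions discussed below:

__global__ void WhereAmI(int* out)
{
    int inBlock = threadIdx.x;                         // index of the thread inside its block
    int block   = blockIdx.x;                          // index of the block inside the grid
    int global  = block * blockDim.x + inBlock;        // global thread index across the grid
    out[global] = global;
}

// A grid of 16 blocks of 256 threads, i.e. 128 warps of 32 threads in total:
// WhereAmI<<<16, 256>>>(d_out);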

We are working!

To work fully with CUDA, you need to:

1. Know the structure of the GPU's shader cores, since the essence of the programming is distributing the load evenly between them.
2. Be able to program in C, taking a few particularities into account.

NVIDIA's developers revealed the "insides" of the video card somewhat differently from what we are used to seeing, so, like it or not, you have to study all the subtleties of the architecture. Let us take apart the structure of the G80 "stone" of the legendary GeForce 8800 GTX.

The shader core consists of eight TPCs (Texture Processor Cluster), clusters of texture processors (the GeForce GTX 280 has ten of them, the 8800 GTS six, the 8600 two, and so on). Each cluster in turn consists of two streaming multiprocessors (hereafter SM), so the G80 has 16 of them in total. Each SM consists of a pipeline front end (which handles reading and decoding instructions) and back end (the final issue and execution of instructions), as well as eight scalar SPs (shader processors) and two SFUs (super function units). Each cycle the front end selects a warp and processes it; for all 32 threads of the warp (remember, there are 32 of them) to be processed, 32 / 8 = 4 cycles are needed at the back end of the pipeline.

Each multiprocessor has so-called shared memory. Its size is 16 kilobytes, and it gives the programmer complete freedom of action: distribute it however you like :). Shared memory enables communication between the threads of one block and is not intended for work with pixel shaders.

SMs can also access GDDR. For that, each was given 8 kilobytes of cache storing what matters most for the work (for example, computing constants).

Each multiprocessor has 8,192 registers. The number of active blocks cannot exceed eight, and the number of active warps cannot exceed 768 / 32 = 24. From this it follows that the G80 can process a maximum of 32 * 16 * 24 = 12,288 threads per unit of time. These figures cannot be ignored when optimizing a program later (block size on one side of the scales, the number of threads on the other). The balance of these parameters can play an important role, which is why NVIDIA recommends using blocks of 128 or 256 threads; a block of 512 threads is inefficient, as it suffers from increased latencies. Taking into account all the subtleties of the GPU's structure, plus good programming skills, you can create a very productive tool for parallel computing. Speaking of programming...

Programming

To get "creative" with CUDA you need a GeForce video card no lower than the eighth series. From the official site you need to download three software packages: a driver with CUDA support (each OS has its own), the CUDA SDK package itself (the second beta) and additional libraries (the CUDA toolkit). The technology supports the operating systems Windows (XP and Vista), Linux and Mac OS X. For this study I chose Vista Ultimate Edition x64 (looking ahead, I will say that the system behaved just perfectly). At the time of writing, the relevant driver was ForceWare 177.35. The toolset used was the Borland C++ Builder 6 package (although any environment that works with the C language will do).

Anyone who knows the language will find it easy to get comfortable in the new environment; it only takes remembering the basic points. The keyword __global__ (placed before a function) indicates that the function belongs to the kernel: it will be called by the central processor, and all the work will happen on the GPU. Calling a __global__ function requires more specific details, namely the grid size and the block size of the kernel being applied. For example, a kernel declared as __global__ void saxpy_parallel(...) is launched as saxpy_parallel<<<X, Y>>>(...), where X is the grid size and Y is the block size.

The __device__ qualifier means that the function will be called by the graphics core, which is also where all its instructions will execute. This function resides in the GPU's memory space, so it is impossible to obtain its address. The __host__ prefix means that the call and the processing take place with the participation of the CPU only. Bear in mind that __global__ and __device__ functions cannot call themselves (no recursion); moreover, a __device__ function can only be called from GPU code, while a __global__ function can only be launched from the CPU.

The language for CUDA also has a number of functions for working with video memory: cudaFree (releasing memory in GDDR), cudaMemcpy and cudaMemcpy2D (copying memory between GDDR and RAM) and cudaMalloc (allocating memory).
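
Putting the keywords and the memory functions together, a complete saxpy along the lines of the kernel named above might look like this (our own sketch; the exact body is an assumption, not NVIDIA's sample code):

// y = alpha * x + y, one element per thread.
__global__ void saxpy_parallel(int n, float alpha, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

void saxpy_on_gpu(int n, float alpha, const float* x, float* y)
{
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;                                   // the recommended block size
    int grid  = (n + block - 1) / block;               // enough blocks to cover n elements
    saxpy_parallel<<<grid, block>>>(n, alpha, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}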

All the program code is compiled by the CUDA toolchain. First the code intended exclusively for the CPU is taken and put through standard compilation, while the code for the graphics adapter is rewritten into the intermediate language PTX (very similar to assembler), which helps identify possible mistakes. After all these "dances", the final translation of the commands into a language the GPU/CPU understands takes place.

Study set

Almost all aspects of programming are described in the documentation that comes with the driver and the two packages, as well as on the developer site. There is not enough room in this article to describe them all (the interested reader should put in a little effort and study the material on their own).

The CUDA SDK Browser is made specifically for beginners. Anyone can feel the power of parallel computing first-hand (the best stability test is the examples running without artifacts or crashes). The application contains a large number of demonstration mini-programs (61 "tests"), each with detailed code documentation plus a PDF; it is immediately obvious that the people represented in the browser are doing serious work. You can also compare the speed of the processor and the video card at processing data. For example, scanning multidimensional arrays on a GeForce 8800 GT 512 MB video card with a 256-thread block takes 0.17109 milliseconds. Note that the technology does not recognize SLI tandems, so if you have a duo or a trio installed, turn off the "pairing" function before working, otherwise CUDA will see only one device. A dual-core AMD Athlon 64 X2 (core frequency 3000 MHz) completes the same test in 2.761528 milliseconds. So the G92 turns out to be more than 16 times faster than the AMD "stone". As you can see, even a far-from-extreme system, in tandem with an operating system unloved by the masses, shows good results.

Besides the browser, there are a number of programs useful to society. Adobe has adapted its products to the new technology: Photoshop CS4 now makes at least partial use of graphics adapter resources (you need to download a special plug-in). With programs such as the Badaboom media converter and RapiHD you can decode video into MPEG-2 format. The free Accelero utility does a decent job of sound processing. The amount of software tailored to the CUDA API will undoubtedly grow.

And at this time ...

While you are reading this material, hard workers at the processor concerns are developing their own technologies for integrating a GPU into the CPU. With AMD everything is clear: they have the tremendous experience gained with ATI.

The "micro-developers'" creation, Fusion, will consist of several cores codenamed Bulldozer and the RV710 video chip (Kong), linked by an improved HyperTransport bus. Depending on the number of cores and their clock characteristics, AMD plans to build an entire price hierarchy of "stones". Processors are also planned for laptops (Falcon) and for multimedia gadgets (Bobcat). Moreover, it is the use of the technology in portable devices that will be the initial challenge for the Canadians. With the development of parallel computing, such "stones" should be very popular.

Intel is slightly behind schedule with its Larrabee. AMD's products, if nothing goes wrong, will appear on store shelves in late 2009 or early 2010, while the rival's solution will see the light of day only almost two years later.

Larrabee will have a large number (read: hundreds) of cores. Initially there will be products designed for 8 to 64 cores. They are very similar to the Pentium, but fairly heavily reworked. Each core will have 256 kilobytes of L2 cache (which will grow over time). Communication will take place over a 1024-bit bidirectional ring bus. Intel says its "child" will work perfectly with the DirectX and OpenGL APIs (the latter for Apple), so no software intervention will be needed.

Why am I telling you all this? Obviously, Larrabee and Fusion will not push ordinary desktop processors off the market, just as they will not push out video cards. For gamers and extreme enthusiasts, the ultimate dream will remain a multi-core CPU and a tandem of several top-end VGAs. But the very fact that even the processor companies are switching to parallel computing based on principles similar to GPGPU says a lot; in particular, that a technology such as CUDA has the right to exist and will most likely be very popular.

Small summary

Parallel computing on a video card is just a good tool in the hands of a hard-working programmer. Processors, driven by Moore's law, are hardly about to come to an end. NVIDIA still has a long way to go in bringing its API to the masses (the same can be said of the ATI/AMD brainchild). What it will look like, the future will show. So CUDA will be back :).

P.S. I recommend that beginner programmers and interested readers visit the following "virtual establishments": the official NVIDIA website and GPGPU.com. All the information there is in English, but at least it is not in Chinese. So go for it! I hope the author has helped you at least a little in the exciting endeavour of learning CUDA!
