
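In Programming Massively Parallel Processors (English Edition, 2nd Edition), the authors draw on many years of teaching parallel computing courses to dissect, in a concise, intuitive, and practical way, the techniques needed to write parallel programs. A rich set of case studies illustrates the whole development process of parallel programming, from computational thinking through to the final implementation of efficient, workable parallel programs.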
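Editor's Recommendation
Programming Massively Parallel Processors (English Edition, 2nd Edition) is a book the authors wrote with care for a broad readership. This edition has been comprehensively revised and updated. It presents parallel programming more systematically, introducing basic parallel algorithm patterns, adding more background material, and covering some new practical programming techniques and tools.
About the Authors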
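Authors: (USA) David Kirk, Wen-mei W. Hwu
David Kirk is a member of the U.S. National Academy of Engineering and an NVIDIA Fellow, and formerly served as NVIDIA's chief scientist. He led NVIDIA's graphics technology development, helping make it today's most popular mass-market entertainment platform, and is one of the founders of CUDA technology. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award in recognition of his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds B.S. and M.S. degrees in mechanical engineering from the Massachusetts Institute of Technology and a Ph.D. in computer science from the California Institute of Technology. Dr. Kirk is the inventor of 50 patents and patent applications related to graphics chip design, has published more than 50 papers on graphics processing technology, and is an authority on visual computing.
Wen-mei W. Hwu holds a Ph.D. in computer science from the University of California, Berkeley. He is currently the Jerry Sanders (founder of AMD) Chair Professor of Electrical and Computer Engineering in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign (UIUC), co-director of the Universal Parallel Computing Research Center jointly funded by Microsoft and Intel, and principal investigator of the world's first NVIDIA CUDA Center. Professor Hwu is a world-leading expert in parallel processor architecture and compilers, and serves as a principal investigator of Blue Waters, the next-generation U.S. petascale computer system. He is an IEEE Fellow and an ACM Fellow.
Table of Contents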
Preface
Acknowledgements
CHAPTER 1 Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
References
CHAPTER 2 History of GPU Computing
2.1 Evolution of Graphics Pipelines
The Era of Fixed-Function Graphics Pipelines
Evolution of Programmable Real-Time Graphics
Unified Graphics and Computing Processors
2.2 GPGPU: An Intermediate Step
2.3 GPU Computing
Scalable GPUs
Recent Developments
Future Trends
References and Further Reading
CHAPTER 3 Introduction to Data Parallelism and CUDA C
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Vector Addition Kernel
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
Function Declarations
Kernel Launch
Predefined Variables
Runtime API
3.7 Exercises
References
CHAPTER 4 Data-Parallel Execution Model
4.1 CUDA Thread Organization
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix-Matrix Multiplication: A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
4.8 Summary
4.9 Exercises
CHAPTER 5 CUDA Memories
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix-Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
5.6 Summary
5.7 Exercises
CHAPTER 6 Performance Considerations
6.1 Warps and Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
6.5 Summary
6.6 Exercises
References
CHAPTER 7 Floating-Point Considerations
7.1 Floating-Point Format
Normalized Representation of M
Excess Encoding of E
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Numerical Stability
7.7 Summary
7.8 Exercises
References
CHAPTER 8 Parallel Patterns: Convolution
8.1 Background
8.2 1D Parallel Convolution: A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution: General Caching
8.6 Summary
8.7 Exercises
CHAPTER 9 Parallel Patterns: Prefix Sum
9.1 Background
9.2 A Simple Parallel Scan
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
9.6 Summary
9.7 Exercises
Reference
CHAPTER 10 Parallel Patterns: Sparse Matrix-Vector Multiplication
10.1 Background
10.2 Parallel SpMV Using CSR
10.3 Padding and Transposition
10.4 Using Hybrid to Control Padding
10.5 Sorting and Partitioning for Regularization
10.6 Summary
10.7 Exercises
References
CHAPTER 11 Application Case Study: Advanced MRI Reconstruction
11.1 Application Background
11.2 Iterative Reconstruction
11.3 Computing F^H D
Step 1: Determine the Kernel Parallelism Structure
Step 2: Getting Around the Memory Bandwidth Limitation
Step 3: Using Hardware Trigonometry Functions
Step 4: Experimental Performance Tuning
11.4 Final Evaluation
11.5 Exercises
References
CHAPTER 12 Application Case Study: Molecular Visualization and Analysis
12.1 Application Background
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
12.4 Memory Coalescing
12.5 Summary
12.6 Exercises
References
CHAPTER 13 Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
13.2 Problem Decomposition
13.3 Algorithm Selection
13.4 Computational Thinking
13.5 Summary
13.6 Exercises
References
CHAPTER 14 An Introduction to OpenCL™
14.1 Background
14.2 Data Parallelism Model
14.3 Device Architecture
14.4 Kernel Functions
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
14.7 Summary
14.8 Exercises
References
CHAPTER 15 Parallel Programming with OpenACC
15.1 OpenACC Versus CUDA C
15.2 Execution Model
15.3 Memory Model
15.4 Basic OpenACC Programs
Parallel Construct
Loop Construct
Kernels Construct
Data Management
Asynchronous Computation and Data Transfer
15.5 Future Directions of OpenACC
15.6 Exercises
……
CHAPTER 16 Thrust: A Productivity-Oriented Library for CUDA
CHAPTER 17 CUDA FORTRAN
CHAPTER 18 An Introduction to C++ AMP
CHAPTER 19 Programming a Heterogeneous Computing Cluster
CHAPTER 20 CUDA Dynamic Parallelism
CHAPTER 21 Conclusion and Future Outlook
Appendix A: Matrix Multiplication Host-Only Version Source Code
Appendix B: GPU Compute Capabilities
Index
Excerpt
Copyright page:
Chapter 15 presents the OpenACC programming interface. It shows how to use directives and pragmas to tell the compiler that a loop can be parallelized and, if desirable, instruct the compiler how to parallelize the loop. It also uses concrete examples to illustrate how one can take advantage of the interface and make their code more portable across vendor systems. With the foundational concepts in this book, readers will find the OpenACC programming directives and pragmas easy to learn and master.
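As a rough illustration of the directive style this chapter summary describes, the sketch below marks a simple loop as parallelizable with an OpenACC pragma. It is not taken from the book: the function name `saxpy` and its parameters are hypothetical, chosen only for this example. A compiler without OpenACC support typically ignores the pragma and runs the loop serially, which is part of what makes directive-based code portable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the book): computes y[i] = a * x[i] + y[i].
 * The pragma tells an OpenACC compiler that the loop iterations are
 * independent and may run in parallel. The data clauses copy x to the
 * device and copy y to the device and back. A compiler without OpenACC
 * support ignores the pragma and executes the loop serially. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* Host-side argument check, performed before the parallel region. */
    assert(n >= 0 && x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is ordinary C, so the same source should build with or without an OpenACC-capable compiler; only the pragma line carries the parallelization hint.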
Chapter 16 covers Thrust, a productivity-oriented C++ library for building CUDA applications. This chapter shows how modern object-oriented programming interfaces and techniques can be used to increase productivity in a parallel programming environment. In particular, it shows how generic programming and abstractions can significantly reduce the effort and code complexity of applications.
Chapter 17 presents CUDA FORTRAN, an interface that supports FORTRAN-style programming based on the CUDA model. All concepts and techniques learned using CUDA C can be applied when programming in CUDA FORTRAN. In addition, the CUDA FORTRAN interface has strong support for multidimensional arrays, which makes programming of 3D models much more readable. It also assumes the FORTRAN array data layout convention and works better with an existing application written in FORTRAN.
Chapter 18 is an overview of the C++ AMP programming interface from Microsoft. This programming interface uses a combination of language extensions and API support to support data-parallel computation patterns. It allows programmers to use C++ features to increase their productivity. Like OpenACC, C++ AMP abstracts away some of the parallel programming details that are specific to the hardware, so the code is potentially more portable across vendor systems.
Chapter 19 presents an introduction to joint MPI/CUDA programming. We cover the key MPI concepts that a programmer needs to understand to scale their heterogeneous applications to multiple nodes in a cluster environment. In particular, we will focus on domain partitioning, point-to-point communication, and collective communication in the context of scaling a CUDA kernel to multiple nodes.
Chapter 20 introduces the dynamic parallelism capability available in the Kepler GPUs and their successors. Dynamic parallelism can potentially help implementations of sophisticated algorithms reduce CPU-GPU interaction overhead, free up the CPU for other tasks, and improve the utilization of GPU execution resources. We describe the basic concepts of dynamic parallelism and why some algorithms can benefit from it.

| ISBN | |
|---|---|
| Publisher | China Machine Press (机械工业出版社) |
| Author | David Kirk (柯克) |
| Size | 16 |