2011
Escaping the SIMD vs. MIMD mindset: a new class of hybrid microarchitectures between GPUs and CPUs
By Sylvain Collange (Università degli Studi di Siena, Italy) on 2011-12-15
-
Parallel applications running on GPUs are made of thousands of fine-grained threads that run in parallel.
-
Different threads in SPMD applications tend to execute the same instructions at the same time. GPUs take advantage of this parallel manifestation of instruction locality to amortize the cost of instruction fetch and memory access over many execution units by means of SIMD execution.
-
Divergent control flow between threads is handled by predication, which serializes the execution of the divergent branches until they reconverge.
-
However, this has a severe impact on performance as application control flow becomes less regular. Reconsidering the assumption of SIMD execution, we will design microarchitectures that execute more than a single instruction over multiple data, while maintaining the same ratio of instruction fetch units to execution units as existing GPUs. Additional instructions may come either from other branches in the same warp (SIMD thread) or from other warps. Combining both sources of parallelism allows performance improvements of 40% for an area overhead of less than 4%.
Software for squaring floats on ST231: a case study in bringing floating-point to VLIW integer processors
By Jingyan Jourdan-Lu (LIP-ENS Lyon / STMicroelectronics) on 2011-11-24
-
In this talk we will consider the problem of computing IEEE floating-point squares by means of integer arithmetic.
-
We will show how the specific properties of squaring can be exploited in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithm descriptions are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes.
-
We will further show that their C implementation for the binary32 format yields efficient code for targets like the ST231 VLIW integer processor from STMicroelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
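To give a flavor of what integer-only floating-point squaring looks like, here is a simplified sketch for round-to-nearest-even on normal binary32 inputs (my own illustration, not the algorithm from the talk; subnormals, overflow, and underflow are deliberately not handled):

```c
#include <stdint.h>
#include <string.h>

/* Correctly rounded square of a normal binary32 x, using only integer
 * arithmetic. Assumes the result neither overflows nor underflows. */
static float square_binary32(float x) {
    uint32_t bx;
    memcpy(&bx, &x, sizeof bx);
    uint32_t e = (bx >> 23) & 0xFF;            /* biased exponent */
    uint64_t m = (bx & 0x7FFFFF) | 0x800000;   /* 24-bit significand */
    uint64_t p = m * m;                        /* exact 47- or 48-bit product */
    int norm = (int)(p >> 47) & 1;             /* 1 iff p >= 2^47 */
    int shift = 23 + norm;                     /* low bits to discard */
    uint32_t E = 2 * e - 127 + (uint32_t)norm; /* biased result exponent */
    uint64_t keep = p >> shift;
    uint64_t rem  = p & ((1ULL << shift) - 1);
    uint64_t half = 1ULL << (shift - 1);
    if (rem > half || (rem == half && (keep & 1)))
        keep++;                                /* round to nearest, ties to even */
    if (keep == (1ULL << 24)) { keep >>= 1; E++; }  /* rounding carried out */
    uint32_t br = (E << 23) | ((uint32_t)keep & 0x7FFFFF); /* sign always + */
    float r;
    memcpy(&r, &br, sizeof r);
    return r;
}
```

Even this toy version shows where squaring-specific savings come from: there is no sign logic, only one operand to unpack, and the product's leading-bit position can take just two values.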
-
We will conclude by examining the impact of such special operators on some FFT-related floating-point applications. The speedup is around 1.28x to 1.46x.
-
Key words: squaring, binary floating-point arithmetic, correct rounding, IEEE 754, instruction level parallelism, C software implementation, VLIW integer processor
Accelerating De Novo Peptide Spectrum Matching Using FPGAs
By Yoginder S. Dandass (Mississippi State University, MS, USA) on 2011-07-12
-
Identifying proteins in new organisms under study is an important problem in computational biology. One approach to this problem is tandem mass spectrometry (MS/MS). MS/MS produces spectra that correspond to peptide ions in a sample mixture of proteins. These sample spectra are matched against the theoretical spectra computed from peptides of known proteins. Because the number of theoretical spectra is very large and the computations used for matching are complex, this technique is computationally challenging and time consuming. Therefore, techniques for accelerating this processing have the potential to make a significant impact on the scientific productivity of bioinformatics researchers. This talk introduces the problem and describes ongoing research and development using field-programmable gate array (FPGA) architectures to accelerate the matching process at the Institute of Genomics, Biocomputing, and Biotechnology (IGBB) at Mississippi State University (MSU).
Practical program verification for the working programmer with CodeContracts and Abstract Interpretation
By Francesco Logozzo (Microsoft Research, Redmond, WA, USA) on 2011-06-29
-
In this talk I will present Clousot, an abstract-interpretation-based static analyzer used as the verifier for CodeContracts. Clousot is used daily by many .NET programmers.
-
In the first part of the talk I will recall what contracts are (essentially preconditions, postconditions, and object invariants) and why they are almost universally accepted as good software engineering practice. Nevertheless, their adoption is very low in practice, mainly for two reasons: (i) they require a non-negligible change in the build environment that very few professional programmers are willing to pay for (e.g., a new language, a new or non-standard compiler, or poor IDE integration); (ii) the static verification tools are either absent, or require far too much help from the user and lack automation.
-
The CodeContracts API is an answer to (i). Clousot is an answer to (ii).
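CodeContracts is a .NET API, but the discipline it captures is language-neutral; as a hedged sketch, the same preconditions and postconditions can be mimicked in C with asserts (the function and its contract are invented for illustration):

```c
#include <assert.h>

/* Integer square root with an explicit contract:
 * precondition  n >= 0
 * postcondition r*r <= n < (r+1)*(r+1)
 * A verifier like Clousot discharges such conditions statically,
 * instead of relying on these runtime asserts. */
static int isqrt(int n) {
    assert(n >= 0);                               /* precondition */
    int r = 0;
    while ((r + 1) * (r + 1) <= n)
        r++;
    assert(r * r <= n && n < (r + 1) * (r + 1));  /* postcondition */
    return r;
}
```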
-
In the second part of the talk, I will dig deeper into the Clousot architecture, explaining: (i) why, unlike similar tools, we decided to base it on abstract interpretation (essentially because of its automation, inference power, generality, and fine control of the cost/precision tradeoff); and (ii) the new abstract domains (numerical, universal, existential, etc.) that we designed and implemented.
-
The CodeContracts API is part of .NET 4.0. Clousot can be downloaded from the DevLabs: http://msdn.microsoft.com/es-AR/devlabs/dd491992.aspx
A Decremental Analysis Tool for Fine-Grained Bottleneck Detection
By Eric Petit (Exascale Computing Research Center, Performance Evaluation Team, Versailles) on 2011-04-20
-
In this talk we will present the decremental analysis (DECAN) performance tuning tool for simple and automatic detection of performance anomalies. DECAN operates by patching machine code instructions in hot functions or loops to alter the semantics of the original program.
-
It can substitute memory access instructions, execute the modified code, and associate performance degradations with individual instructions. In particular, DECAN replaces x86 SSE memory access instructions inside regular loops with "nop" instructions to identify memory bottlenecks.
-
DECAN was applied to three large industrial HPC applications from Dassault Aviation, RECOM Services, and MAGMA Gießereitechnologie GmbH on an Intel Xeon X7350 processor. It detected several performance anomalies that were successfully removed through memory optimizations, resulting in speedups of up to 2.5x.
Modified algorithms for accurate floating-point summation
By Takeshi Ogita (Division of Mathematical Sciences, Tokyo Woman's Christian University, Japan) on 2011-02-24
-
Several years ago, we proposed an accurate summation algorithm for floating-point vectors. For a given floating-point vector, the algorithm returns a result faithfully rounded from the exact sum of the vector. It has been shown that the algorithm is very fast in terms of not only flop counts but also measured computing time, due to its high instruction-level parallelism. In this talk, we try to accelerate the algorithm further. To this end, we use a certain grouping of the elements of a given vector, without any sorting. The summation algorithm with our acceleration can be faster than, and is at worst equal to, the original one. The acceleration does not change any of the analysis of the original summation algorithm, i.e., it returns exactly the same (faithful) result as the original.
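The error-free transformations underlying such algorithms can be sketched briefly; below is Knuth's TwoSum combined into an Ogita-Rump-Oishi style compensated sum (a simplified relative of the faithfully rounded algorithm discussed here, not the accelerated version; compile without value-changing optimizations such as -ffast-math):

```c
/* TwoSum: returns s = fl(a + b) and the exact rounding error *e,
 * so that a + b = s + *e holds exactly in binary64. */
static double two_sum(double a, double b, double *e) {
    double s  = a + b;
    double bb = s - a;
    *e = (a - (s - bb)) + (b - bb);
    return s;
}

/* Compensated summation: accumulate the rounding errors of the main
 * recursive sum in c and add them back at the end. */
static double sum2(const double *x, int n) {
    double s = 0.0, c = 0.0, e;
    for (int i = 0; i < n; i++) {
        s = two_sum(s, x[i], &e);
        c += e;
    }
    return s + c;
}
```

For the ill-conditioned vector {1e16, 1, -1e16}, a plain accumulation loop returns 0, while sum2 recovers the exact sum 1.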
-
The effectiveness of the proposed method strongly depends on how the data of various magnitude are distributed in a given vector. Numerical results are presented showing performance of the proposed method.
PerPI: a tool to measure, observe and analyze the ILP present in a code
By Bernard Goossens (DALI, Univ. de Perpignan Via Domitia - LIRMM) on 2011-02-17
-
Comparing algorithms and their program implementations by flop counts leads to surprising results where, for example, algorithm A has a better (lower) flop count than algorithm B, yet the program implementing A runs slower than the program implementing B on a given machine. We will show that this kind of paradox usually comes from the difference in the instruction-level parallelism (ILP) present in each program. Modern microarchitectures offer resources to capture some of this ILP. We will present the PerPI tool, developed by the DALI team, to measure, observe and analyze the ILP present in a code.