BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20131121T180000Z
DTEND:20131121T183000Z
LOCATION:205/207
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: OpenCL has become the de facto data-parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed with the goal of guaranteeing program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device.=0A=0AIn this paper we consider a data-parallel compiler transformation, thread coarsening, and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters, giving over 43,000 different experiments.=0AWe achieve speedups of over 9x on individual applications and average speedups ranging from 1.15x on the NVIDIA Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyze and explain program performance in terms of hardware-based performance counters.
SUMMARY:A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening
PRIORITY:3
END:VEVENT
END:VCALENDAR