
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

Published: 01 August 2013

Abstract

We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We introduce Maven, a new VT microarchitecture based on the traditional vector-SIMD microarchitecture, which is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the varying tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of compiled microbenchmarks and application kernels. We find that the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.
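To make the abstract's distinction between regular and fairly irregular kernels concrete, the two C loops below are a minimal illustrative sketch (written for this summary, not taken from the paper's microbenchmarks or application kernels; the function names and constants are assumptions). The first loop performs identical work on every element and maps directly onto a vector-SIMD or VT machine; the data-dependent branch in the second is the kind of per-element control flow that SIMT and VT designs must handle through masking or divergence management, which is why efficiency on irregular kernels is a key point of comparison.

/* Illustrative sketch only; not code from the paper's benchmark suite. */
#include <stddef.h>

/* Regular data-level parallelism: every element does identical work. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Fairly irregular: per-element control flow depends on the data itself. */
void threshold_scale(size_t n, float t, const float *x, float *y) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] > t)              /* data-dependent branch */
            y[i] = 0.5f * x[i];
        else
            y[i] = x[i];
    }
}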



• Published in

  ACM Transactions on Computer Systems, Volume 31, Issue 3
  August 2013, 94 pages
  ISSN: 0734-2071
  EISSN: 1557-7333
  DOI: 10.1145/2518037

          Copyright © 2013 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

Publication History

• Received: 1 March 2013
• Revised: 1 March 2013
• Accepted: 1 March 2013
• Published: 1 August 2013

Published in TOCS Volume 31, Issue 3


          Qualifiers

          • research-article
          • Research
          • Refereed
