Publications

Published Papers


  • Ananta Tiwari, Vamsi Sripathi, Sarat Sreepathi, Shreyas Ramalingam, Gabriel Marin, Kumar Mahinthakumar, Jeffrey K. Hollingsworth, Mary Hall, Chun Chen, and Jacqueline Chame, "PERI Autotuning of PFLOTRAN", Journal of Physics, (to appear) Proceedings of SciDAC 2011, July 2011
  • G. W. Stewart, Jeffrey K. Hollingsworth, and Michael O. Lam, "Dynamic Floating-Point Cancellation Detection", First International Workshop on High-performance Infrastructure for Scalable Tools, Tucson, AZ, June 2011
  • Jeffrey K. Hollingsworth and Nick Rutar, "Software Analysis Techniques to Approximate Data Centric Direct Measurements", First International Workshop on High-performance Infrastructure for Scalable Tools, Tucson, AZ, June 2011
  • Jeffrey K. Hollingsworth and Geoffrey Stoker, "Towards a Methodology for Deliberate Sample-Based Statistical Performance Analysis", 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments, May 2011
  • Jeffrey K. Hollingsworth and Nick Rutar, "Data Centric Techniques for Mapping Performance Measurements", 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments, May 2011
  • Jacqueline Chame, Dan Quinlan, C. Liao, Mary Hall, Chun Chen, Jeffrey K. Hollingsworth, and Ananta Tiwari, "Auto-tuning Full Applications: A Case Study", International Journal of High Performance Computing, (to appear)
  • Jeffrey K. Hollingsworth and Ananta Tiwari, "Online Adaptive Code Generation and Tuning", IPDPS, (to appear) May 2011
  • Jeffrey K. Hollingsworth and Ananta Tiwari, "End-to-end Auto-tuning with Active Harmony", in Performance Tuning in Scientific Computing, D. Bailey & S. Williams, eds.
  • Jeffrey K. Hollingsworth, Vahid Tabatabaee, and Ananta Tiwari, "Tuning Parallel Applications in Parallel", Parallel Computing, Aug. 2009
  • Jeffrey K. Hollingsworth and Tugrul Ince, "Profile Driven Selective Program Loading", EuroPar 2010, Italy, Aug. 2010
  • Jeffrey K. Hollingsworth and Nick Rutar, "Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions", EuroPar'09, Delft, Aug. 2009
  • Jeffrey K. Hollingsworth, Mary Hall, Jacqueline Chame, Chun Chen, and Ananta Tiwari, "A Scalable Autotuning Framework for Compiler Optimization", IPDPS 2009, Rome, May 2009
  • Haihang You, Sam Williams, Ananta Tiwari, Jaewook Shin, Keith Seymour, Shirley Moore, Paul Hovland, Jeffrey K. Hollingsworth, Mary Hall, Jack Dongarra, Chun Chen, Jacqueline Chame, and David Bailey, "PERI Auto-Tuning", Journal of Physics: Conference Series 125 (2008), Nov. 2008
  • Jeffrey K. Hollingsworth and Mustafa M. Tikir, "Hardware Monitors for Dynamic Page Migration", JPDC (68)9, September 2008
    Bibtex for PERI Autotuning of PFLOTRAN:

    @article{Tiwari2011,
      author  = "Ananta Tiwari and Vamsi Sripathi and Sarat Sreepathi and Shreyas Ramalingam and Gabriel Marin and Kumar Mahinthakumar and Jeffrey K. Hollingsworth and Mary Hall and Chun Chen and Jacqueline Chame",
      title   = "PERI Autotuning of PFLOTRAN",
      journal = "Journal of Physics, (to appear) Proceedings of SciDAC 2011",
      year    = "2011"
    }
    

    Abstract for Dynamic Floating-Point Cancellation Detection:

    As scientific computation continues to scale, it is crucial to use floating-point arithmetic processors as efficiently as possible. Lower precision allows streaming architectures to perform more operations per second and can reduce memory bandwidth pressure on all architectures. However, using a precision that is too low for a given algorithm and data set will result in inaccurate results. Thus, developers must balance speed and accuracy when choosing the floating-point precision of their subroutines and data structures. We are building tools to help developers learn about the runtime floating-point behavior of their programs, and to help them make decisions concerning the choice of precision in implementation. We propose a tool that performs automatic binary instrumentation of floating-point code to detect mathematical cancellations, as well as to automatically run calculations in alternate precisions. In particular, we show how our prototype can detect the variation in cancellation patterns for different pivoting strategies in Gaussian elimination, as well as how our prototype can detect a program's sensitivity to ill-conditioned input sets.

    Bibtex for Dynamic Floating-Point Cancellation Detection:

    @article{Stewart2011,
      author  = "G. W. Stewart and Jeffrey K. Hollingsworth and Michael O. Lam",
      title   = "Dynamic Floating-Point Cancellation Detection",
      journal = "First International Workshop on High-performance Infrastructure for Scalable Tools, Tucson, AZ",
      year    = "2011"
    }
    

    Abstract for Software Analysis Techniques to Approximate Data Centric Direct Measurements:

    Data centric analysis using direct measurements has been established as a successful performance analysis technique. The information gathered with this technique can be used to address data locality problems and other issues. Existing approaches rely on special hardware support that is needed to negate a `skid' factor. Our approach is viable on hardware where the skid factor is an issue. Prior methods also rely on maintaining runtime information about memory allocation addresses for variables, which may lead to program perturbation. Our approach uses software analysis to eliminate the need for maintaining allocation and free records. We show that by using heuristics our technique can attribute data centric values to program variables and maintain the approximate rank-order found by using traditional techniques.

    Bibtex for Software Analysis Techniques to Approximate Data Centric Direct Measurements:

    @article{Hollingsworth2011a,
      author  = "Jeffrey K. Hollingsworth and Nick Rutar",
      title   = "Software Analysis Techniques to Approximate Data Centric Direct Measurements",
      journal = "First International Workshop on High-performance Infrastructure for Scalable Tools, Tucson, AZ",
      year    = "2011"
    }
    

    Abstract for Towards a Methodology for Deliberate Sample-Based Statistical Performance Analysis:

    Dynamic performance analysis of long-running programs in the high performance computing community increasingly relies on statistical profiling techniques to provide performance measurement results. Systematic sampling rates used to generate the statistical data are typically selected in an ad hoc manner with little formal regard for the context provided by the program being analyzed and the underlying system on which it is run. In an effort to provide a more effective statistical profiling process and additional rigor, we argue in favor of the general principle of deliberate sampling rate selection. We present our idea for a methodology of systematic sample rate selection based on a performance measurement model incorporating the effect of sampling on both measurement precision and perturbation effects.

    Bibtex for Towards a Methodology for Deliberate Sample-Based Statistical Performance Analysis:

    @article{Hollingsworth2011b,
      author  = "Jeffrey K. Hollingsworth and Geoffrey Stoker",
      title   = "Towards a Methodology for Deliberate Sample-Based Statistical Performance Analysis",
      journal = "16th International Workshop on High-Level Parallel Programming Models and Supportive Environments",
      year    = "2011"
    }
    

    Abstract for Data Centric Techniques for Mapping Performance Measurements:

    Traditional methods of performance analysis offer a code centric view, presenting performance data in terms of blocks of contiguous code (statement, basic block, loop, function). Data centric techniques, combined with hardware counter information, allow various program properties including cache misses and cycle count to be mapped directly to variables. We introduce mechanisms for efficiently collecting data centric performance numbers independent of hardware support. We create extended data centric mappings, which we call variable blame, that relate data centric information to high level data structures. Finally, we show performance data gathered from three parallel programs using our technique.

    Bibtex for Data Centric Techniques for Mapping Performance Measurements:

    @article{Hollingsworth2011c,
      author  = "Jeffrey K. Hollingsworth and Nick Rutar",
      title   = "Data Centric Techniques for Mapping Performance Measurements",
      journal = "16th International Workshop on High-Level Parallel Programming Models and Supportive Environments",
      year    = "2011"
    }
    

    Abstract for Auto-tuning Full Applications: A Case Study:

    In this paper, we take a concrete step towards materializing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. We show that the proposed system first extracts computationally-intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. We show that our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the application's performance improves by 27%.

    Bibtex for Auto-tuning Full Applications: A Case Study:

    @article{Chame,
      author  = "Jacqueline Chame and Dan Quinlan and C. Liao and Mary Hall and Chun Chen and Jeffrey K. Hollingsworth and Ananta Tiwari",
      title   = "Auto-tuning Full Applications: A Case Study",
      journal = "International Journal of High Performance Computing",
      note    = "to appear"
    }
    

    Abstract for Online Adaptive Code Generation and Tuning:

    In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code-variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.

    Bibtex for Online Adaptive Code Generation and Tuning:

    @article{Hollingsworth2011d,
      author  = "Jeffrey K. Hollingsworth and Ananta Tiwari",
      title   = "Online Adaptive Code Generation and Tuning",
      journal = "IPDPS",
      year    = "2011",
      note    = "to appear"
    }
    

    Bibtex for End-to-end Auto-tuning with Active Harmony:

    @article{Hollingsworth,
      author  = "Jeffrey K. Hollingsworth and Ananta Tiwari",
      title   = "End-to-end Auto-tuning with Active Harmony",
      journal = "in Performance Tuning in Scientific Computing, D. Bailey \& S. Williams, eds."
    }
    

    Abstract for Tuning Parallel Applications in Parallel:

    In this paper, we present and evaluate a parallel algorithm for parameter tuning of parallel applications. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithm and propose a strategy to minimize the impact of this variability. We evaluate our algorithm within the Active Harmony system, an automated online/offline tuning framework. We study its performance on three benchmark codes: PSTSWM, HPL and POP. Compared to the Nelder-Mead algorithm, our algorithm finds better configurations up to 7 times faster. For POP, we were able to improve the performance of a production sized run by 59%.

    Bibtex for Tuning Parallel Applications in Parallel:

    @article{Hollingsworth2009a,
      author  = "Jeffrey K. Hollingsworth and Vahid Tabatabaee and Ananta Tiwari",
      title   = "Tuning Parallel Applications in Parallel",
      journal = "Parallel Computing",
      year    = "2009"
    }
    

    Abstract for Profile Driven Selective Program Loading:

    Complex software systems use many shared libraries, frequently composed of large off-the-shelf components. Only a limited number of functions are used from these shared libraries. Historically, demand paging prevented this from wasting large amounts of memory. Many high-end systems lack virtual memory and thus must load the entire shared library into each node’s memory. In this paper we propose a system that decreases the memory footprint of applications by selectively loading only the used portions of the shared libraries. After profiling executables and shared libraries, our system rewrites all target shared libraries with a new function ordering and updated ELF program headers so that the loader only loads those functions that are likely to be used by a given application. The system also includes a fallback user-level paging system to recover in the case of failures in our analysis. We present a case study that shows our system achieves more than 80% reduction in the number of pages that are loaded for several HPC applications while causing no performance overhead for reasonably long running programs.

    Bibtex for Profile Driven Selective Program Loading:

    @article{Hollingsworth2010,
      author  = "Jeffrey K. Hollingsworth and Tugrul Ince",
      title   = "Profile Driven Selective Program Loading",
      journal = "EuroPar 2010, Italy",
      year    = "2010"
    }
    

    Abstract for Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions:

    Parallel programs are increasingly being written using programming frameworks and other environments that allow parallel constructs to be programmed with greater ease. The data structures used allow the modeling of complex mathematical structures like linear systems and partial differential equations using high-level programming abstractions. While this allows programmers to model complex systems in a more intuitive way, it also makes the debugging and profiling of these systems more difficult due to the complexity of mapping these high level abstractions down to the low level parallel programming constructs. This work discusses mapping mechanisms, called variable blame, for creating these mappings and using them to assist in the profiling and debugging of programs created using advanced parallel programming techniques. We also include an example of a prototype implementation of the system profiling three programs.

    Bibtex for Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions:

    @article{Hollingsworth2009b,
      author  = "Jeffrey K. Hollingsworth and Nick Rutar",
      title   = "Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions",
      journal = "EuroPar'09, Delft",
      year    = "2009"
    }
    

    Abstract for A Scalable Autotuning Framework for Compiler Optimization:

    We describe a scalable and general-purpose framework for autotuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the one with the best-performing implementation. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in a few tens of search-steps.

    Bibtex for A Scalable Autotuning Framework for Compiler Optimization:

    @article{Hollingsworth2009c,
      author  = "Jeffrey K. Hollingsworth and Mary Hall and Jacqueline Chame and Chun Chen and Ananta Tiwari",
      title   = "A Scalable Autotuning Framework for Compiler Optimization",
      journal = "IPDPS 2009, Rome",
      year    = "2009"
    }
    

    Abstract for PERI Auto-Tuning:

    The enormous and growing complexity of today's high-end systems has increased the already significant challenges of obtaining high performance on equally complex scientific applications. Application scientists are faced with a daunting challenge in tuning their codes to exploit performance-enhancing architectural features. The Performance Engineering Research Institute (PERI) is working toward the goal of automating portions of the performance tuning process. This paper describes PERI's overall strategy for auto-tuning tools and recent progress in both building auto-tuning tools and demonstrating their success on kernels, some taken from large-scale applications.

    Bibtex for PERI Auto-Tuning:

    @article{You2008,
      author  = "Haihang You and Sam Williams and Ananta Tiwari and Jaewook Shin and Keith Seymour and Shirley Moore and Paul Hovland and Jeffrey K. Hollingsworth and Mary Hall and Jack Dongarra and Chun Chen and Jacqueline Chame and David Bailey",
      title   = "PERI Auto-Tuning",
      journal = "Journal of Physics: Conference Series 125 (2008)",
      year    = "2008"
    }
    

    Abstract for Hardware Monitors for Dynamic Page Migration:

    In this paper, we first introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use centralized lightweight, inexpensive plug-in hardware monitors to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. We also investigate the use of several other potential sources of data gathered from hardware monitors and compare their effectiveness to using data from centralized hardware monitors. In particular, we investigate the effectiveness of using cache miss profiles, Translation Lookaside Buffer (TLB) miss profiles and the content of the on-chip TLBs using the valid bit information. Moreover, we also introduce a modest hardware feature, called Address Translation Counters (ATC), and compare its effectiveness with other sources of hardware profiles. Using the Dyninst runtime instrumentation combined with hardware monitors, we were able to add page migration capabilities to a Sun Fire 6800 server without having to modify the operating system kernel, or to re-compile application programs. Our dynamic page migration scheme reduced the total number of non-local memory accesses of applications by up to 90% and improved the execution times by up to 16%. We also conducted a simulation based study and demonstrated that cache miss profiles gathered from on-chip CPU monitors, which are typically available in current microprocessors, can be effectively used to guide dynamic page migrations in applications.

    Bibtex for Hardware Monitors for Dynamic Page Migration:

    @article{Hollingsworth2008,
      author  = "Jeffrey K. Hollingsworth and Mustafa M. Tikir",
      title   = "Hardware Monitors for Dynamic Page Migration",
      journal = "JPDC (68)9",
      year    = "2008"
    }
    
