LIVIVO - The Search Portal for Life Sciences

Search results

Results 1 - 10 of 496

  1. Book ; Online: Defining a canonical unit for accounting purposes

    Andrijauskas, Fabio / Sfiligoi, Igor / Würthwein, Frank

    2023  

    Abstract Compute resource providers often put in place batch compute systems to maximize the utilization of such resources. However, compute nodes in such clusters, both physical and logical, contain several complementary resources, with notable examples being CPUs, GPUs, memory and ephemeral storage. User jobs will typically require more than one such resource, resulting in co-scheduling trade-offs of partial nodes, especially in multi-user environments. When accounting for either user billing or scheduling overhead, it is thus important to consider all such resources together. We thus define the concept of a threshold-based "canonical unit" that combines several resource types into a single discrete unit and use it to characterize scheduling overhead and make resource billing more fair for both resource providers and users. Note that the exact definition of a canonical unit is not prescribed and may change between resource providers. Nevertheless, we provide a template and two example definitions that we consider appropriate in the context of the Open Science Grid.

    Comment: 6 pages, 2 figures, To be published in proceedings of PEARC23
    Keywords Computer Science - Distributed, Parallel, and Cluster Computing
    Publishing date 2023-05-17
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
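
    A minimal sketch of one possible threshold-based canonical-unit definition, in Python. The paper deliberately leaves the exact definition to each resource provider; the per-unit thresholds below (1 CPU core, 1 GPU, 4 GB memory) and the function name are invented for illustration.

      import math

      # Hypothetical per-unit thresholds: one canonical unit (CU) corresponds
      # to this bundle of resources; real providers would choose their own.
      CU_THRESHOLDS = {"cpus": 1, "gpus": 1, "mem_gb": 4}

      def canonical_units(request: dict) -> int:
          """Charge a job by its most-demanding resource, rounded up to a
          whole number of discrete units."""
          fractions = [request.get(r, 0) / limit
                       for r, limit in CU_THRESHOLDS.items()]
          return math.ceil(max(fractions))

      # A job asking for 2 CPUs, 1 GPU and 16 GB of memory is dominated by
      # its memory request: 16 / 4 = 4 canonical units.
      print(canonical_units({"cpus": 2, "gpus": 1, "mem_gb": 16}))  # -> 4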


  2. Book ; Online: Analyzing Transatlantic Network Traffic over Scientific Data Caches

    Deng, Z. / Sim, A. / Wu, K. / Guok, C. / Hazen, D. / Monga, I. / Andrijauskas, F. / Wuerthwein, F. / Weitzel, D.

    2023  

    Abstract Large scientific collaborations often share huge volumes of data around the world. Consequently, a significant amount of network bandwidth is needed for data replication and data access. Users in the same region may share resources as well as data, especially when they are working on related topics with similar datasets. In this work, we study the network traffic patterns and resource utilization for scientific data caches connecting European networks to the US. We explore the efficiency of resource utilization, especially for network traffic, which consists mostly of transatlantic data transfers, and the potential for having more caching node deployments. Our study shows that these data caches reduced network traffic volume by 97% during the study period. This demonstrates that such caching nodes are effective in reducing wide-area network traffic.
    Keywords Computer Science - Networking and Internet Architecture ; Computer Science - Distributed, Parallel, and Cluster Computing
    Subject code 303
    Publishing date 2023-05-01
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
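
    The 97% figure is, in essence, the fraction of served bytes that never had to cross the wide-area link. A back-of-the-envelope sketch with invented volumes (the study's actual numbers are not reproduced here):

      # Invented volumes, for illustration only.
      bytes_served_tb = 1000.0   # data delivered to local users from caches
      bytes_fetched_tb = 30.0    # cache misses that crossed the transatlantic link

      reduction = 1.0 - bytes_fetched_tb / bytes_served_tb
      print(f"WAN traffic reduced by {reduction:.0%}")  # -> 97%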


  3. Book ; Online: Testing GitHub projects on custom resources using unprivileged Kubernetes runners

    Sfiligoi, Igor / McDonald, Daniel / Knight, Rob / Würthwein, Frank

    2023  

    Abstract GitHub is a popular repository for hosting software projects, both due to ease of use and the seamless integration with its testing environment. Native GitHub Actions make it easy for software developers to validate new commits and have confidence that new code does not introduce major bugs. The freely available test environments are limited to only a few popular setups but can be extended with custom Action Runners. Our team had access to a Kubernetes cluster with GPU accelerators, so we explored the feasibility of automatically deploying GPU-providing runners there. All available Kubernetes-based setups, however, require cluster-admin level privileges. To address this problem, we developed a simple custom setup that operates in a completely unprivileged manner. In this paper we provide a summary description of the setup and our experience using it in the context of two Knight lab projects on the Prototype National Research Platform system.

    Comment: 5 pages, 1 figure, To be published in proceedings of PEARC23
    Keywords Computer Science - Software Engineering
    Publishing date 2023-05-17
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
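
    The paper only summarizes its setup; as a hedged illustration of the unprivileged approach, the sketch below creates a runner pod through the official kubernetes Python client using nothing beyond namespace-scoped pod-create rights. The image name, namespace, and token handling are placeholders, not the authors' configuration.

      from kubernetes import client, config

      def launch_runner(token: str, namespace: str = "ci-runners"):
          """Create a GitHub Actions runner pod with only namespace-scoped
          permissions: no cluster-admin role, no privileged containers."""
          pod = client.V1Pod(
              metadata=client.V1ObjectMeta(generate_name="gha-runner-"),
              spec=client.V1PodSpec(
                  restart_policy="Never",
                  containers=[client.V1Container(
                      name="runner",
                      image="registry.example.org/actions-runner:latest",  # placeholder
                      env=[client.V1EnvVar(name="RUNNER_TOKEN", value=token)],
                      resources=client.V1ResourceRequirements(
                          limits={"nvidia.com/gpu": "1"}),  # expose one GPU
                      security_context=client.V1SecurityContext(
                          run_as_non_root=True,
                          allow_privilege_escalation=False))])))
          config.load_kube_config()  # plain user credentials suffice
          client.CoreV1Api().create_namespaced_pod(namespace, pod)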


  4. Book ; Online: Auto-scaling HTCondor pools using Kubernetes compute resources

    Sfiligoi, Igor / DeFanti, Thomas / Würthwein, Frank

    2022  

    Abstract HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.

    Comment: 6 pages, 3 figures, to be published in proceedings of PEARC22
    Keywords Computer Science - Distributed, Parallel, and Cluster Computing
    Subject code 000
    Publishing date 2022-05-02
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
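
    A minimal sketch of a demand-driven provisioning loop in this spirit, assuming the htcondor Python bindings (8.9+) and the kubernetes client; the Deployment name, namespace, and one-worker-per-idle-job policy are assumptions, not the paper's architecture.

      import time

      import htcondor                      # HTCondor Python bindings
      from kubernetes import client, config

      def autoscale(deployment="htcondor-workers", namespace="osg",
                    max_workers=50):
          """Naive demand-driven loop: size the worker Deployment to the
          number of idle jobs in the local HTCondor queue (capped)."""
          config.load_kube_config()        # or load_incluster_config()
          apps = client.AppsV1Api()
          schedd = htcondor.Schedd()
          while True:
              # JobStatus == 1 means "Idle" in HTCondor.
              idle = len(schedd.query(constraint="JobStatus == 1",
                                      projection=["ClusterId"]))
              apps.patch_namespaced_deployment_scale(
                  deployment, namespace,
                  {"spec": {"replicas": min(idle, max_workers)}})
              time.sleep(60)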


  5. Book ; Online: Data intensive physics analysis in Azure cloud

    Sfiligoi, Igor / Würthwein, Frank / Davila, Diego

    2021  

    Abstract The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is one of the largest data producers in the scientific world, with standard data products centrally produced, and then used by often competing teams within the collaboration. This work is focused on how a local institution, University of California San Diego (UCSD), partnered with the Open Science Grid (OSG) to use Azure cloud resources to augment its available computing to accelerate time to results for multiple analyses pursued by a small group of collaborators. The OSG is a federated infrastructure allowing many independent resource providers to serve many independent user communities in a transparent manner. Historically the resources would come from various research institutions, spanning small universities to large HPC centers, based on either community needs or grant allocations, so adding commercial clouds as resource providers is a natural evolution. The OSG technology allows for easy integration of cloud resources, but the data-intensive nature of CMS compute jobs required the deployment of additional data caching infrastructure to ensure high efficiency.

    Comment: 11 pages, 5 figures, to be published in proceedings of ICOCBI 2021
    Keywords Computer Science - Distributed, Parallel, and Cluster Computing
    Publishing date 2021-10-25
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)


  6. Book ; Online: Comparing single-node and multi-node performance of an important fusion HPC code benchmark

    Belli, Emily A. / Candy, Jeff / Sfiligoi, Igor / Würthwein, Frank

    2022  

    Abstract Fusion simulations have traditionally required the use of leadership-scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes now allow some problems that once required a multi-node setup to be solved on a single node. When possible, the increased interconnect bandwidth can result in an order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup, we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSwitch does, however, promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.

    Comment: 6 pages, 1 table, 1 figure, to be published in proceedings of PEARC22
    Keywords Computer Science - Distributed, Parallel, and Cluster Computing ; Physics - Plasma Physics
    Subject code 000
    Publishing date 2022-05-19
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
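
    The GPU arithmetic behind the comparison can be made explicit. The per-node GPU counts below (Perlmutter Phase1: 4x A100, Summit: 6x V100) are publicly documented system specs supplied here as assumptions, not numbers taken from the paper:

      # GPUs used by each nl03 benchmark setup, under assumed node configs.
      setups = {
          "Google Cloud, 1 instance": 1 * 16,     # 16x A100 in one instance
          "Perlmutter Phase1, 8 nodes": 8 * 4,    # assumed 4x A100 per node
          "Summit, 16 nodes": 16 * 6,             # assumed 6x V100 per node
      }
      single = setups["Google Cloud, 1 instance"]
      for name, gpus in setups.items():
          print(f"{name}: {gpus} GPUs "
                f"({gpus / single:.1f}x the single-instance count)")
      # 8 Perlmutter nodes hold 32 GPUs, so the 16-GPU Cloud instance
      # matches them with half as many GPUs.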


  7. Book ; Online: 400Gbps benchmark of XRootD HTTP-TPC

    Arora, Aashay / Guiang, Jonathan / Davila, Diego / Würthwein, Frank / Balcas, Justas / Newman, Harvey

    2023  

    Abstract Due to the increased network traffic demand expected during the HL-LHC era, the T2 sites in the USA will be required to have 400Gbps of available bandwidth to their storage solution. With the above in mind, we are pursuing a scale test of the XRootD software when used to perform Third Party Copy transfers using the HTTP protocol. Our main objective is to understand the possible limitations in the software stack to achieve the target transfer rate; to that end, we have set up a testbed of multiple XRootD servers at both UCSD and Caltech, which are connected through a dedicated link capable of 400 Gbps end-to-end. Building upon our experience deploying containerized XRootD servers, we use Kubernetes to easily deploy and test different configurations of our testbed. In this work, we present our experience doing these tests and the lessons learned.

    Comment: 8 pages, 4 figures, submitted to CHEP'23
    Keywords Computer Science - Networking and Internet Architecture
    Subject code 303
    Publishing date 2023-12-19
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
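
    XRootD HTTP-TPC is a WebDAV-style third-party copy: the client sends a COPY request to one endpoint naming the other endpoint in a header, and the two storage servers then move the data directly between themselves. A minimal pull-mode sketch using the requests library; URLs and the credential are placeholders.

      import requests

      def http_tpc_pull(src_url: str, dst_url: str, token: str):
          """Ask the destination server to pull the file from the source;
          the client only brokers the transfer and watches progress."""
          resp = requests.request(
              "COPY", dst_url,
              headers={"Source": src_url,                   # pull mode
                       "Authorization": f"Bearer {token}",  # placeholder credential
                       "Overwrite": "T"},
              stream=True)
          # The server streams back performance-marker progress lines.
          for line in resp.iter_lines():
              print(line.decode())

      # http_tpc_pull("https://src.example.org:1094/store/file.root",
      #               "https://dst.example.org:1094/store/file.root",
      #               "EXAMPLE_TOKEN")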


  8. Book ; Online: Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing

    Sfiligoi, I. / Schultz, D. / Riedel, B. / Wuerthwein, F. / Barnet, S. / Brik, V.

    2020  

    Abstract Scientific computing needs are growing dramatically with time and are expanding into science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of their local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice, but the available capacity of cost-effective instances is not well understood. This paper presents the expansion of IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind the Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.

    Comment: 5 pages, 7 figures, to be published in proceedings of PEARC'20. arXiv admin note: text overlap with arXiv:2002.06667
    Keywords Computer Science - Performance ; Computer Science - Distributed, Parallel, and Cluster Computing
    Publishing date 2020-04-18
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
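
    A back-of-the-envelope check of the headline numbers, taking the "whole workday" to be roughly 8 hours (an assumption; the paper's exact window may differ):

      gpus = 15_000        # sustained GPU count from the abstract
      pflop32 = 170.0      # aggregate fp32 throughput, PFLOP32/s
      hours = 8            # assumed length of the "whole workday"
      cost_usd = 60_000    # approximate price tag from the abstract

      eflop32_hours = pflop32 * hours / 1000.0
      print(f"integrated compute: ~{eflop32_hours:.2f} EFLOP32 hours")   # ~1.36
      print(f"effective price: ~${cost_usd / (gpus * hours):.2f} per GPU hour")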


  9. Article ; Online: Running a Pre-exascale, Geographically Distributed, Multi-cloud Scientific Simulation

    Sfiligoi, Igor / Würthwein, Frank / Riedel, Benedikt / Schultz, David

    High Performance Computing

    Abstract As we approach the Exascale era, it is important to verify that the existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototyping and urgent computing. Using the elasticity of the Cloud, we have thus put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with the chosen application being IceCube’s photon propagation simulation. That is, this was not purely a demonstration run; it was also used to produce valuable and much-needed scientific results for the IceCube collaboration. In order to reach the desired scale, we aggregated GPU resources across 8 GPU models from many geographic regions across Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we reached a peak of over 51k GPUs corresponding to almost 380 PFLOP32s, for a total integrated compute of about 100k GPU hours. In this paper we provide the description of the setup, the problems that were discovered and overcome, as well as a short description of the actual science output of the exercise.
    Keywords covid19
    Publisher PMC
    Document type Article ; Online
    DOI 10.1007/978-3-030-50743-5_2
    Database COVID19


  10. Book ; Online: Characterizing network paths in and out of the clouds

    Sfiligoi, Igor / Graham, John / Wuerthwein, Frank

    2020  

    Abstract Commercial Cloud computing is becoming mainstream, with funding agencies moving beyond prototyping and starting to fund production campaigns, too. An important aspect of any scientific computing production campaign is data movement, both incoming and outgoing, and while the performance and cost of VMs are relatively well understood, the network performance and cost are not. This paper provides a characterization of networking in various regions of Amazon Web Services, Microsoft Azure and Google Cloud Platform, both between Cloud resources and major DTNs in the Pacific Research Platform, including OSG data federation caches in the network backbone, and inside the clouds themselves. The paper contains a qualitative analysis of the results as well as latency and throughput measurements. It also includes an analysis of the costs involved with Cloud-based networking.

    Comment: 7 pages, 1 figure, 5 tables, to be published in CHEP19 proceedings
    Keywords Computer Science - Networking and Internet Architecture ; Computer Science - Distributed, Parallel, and Cluster Computing
    Subject code 303
    Publishing date 2020-02-11
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
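
    A minimal sketch of how one-shot latency and throughput numbers can be taken from Python with a plain HTTPS fetch; the endpoint is a placeholder, and a real characterization would use dedicated tools and many repetitions:

      import time
      import urllib.request

      def measure(url: str):
          """One-shot time-to-first-byte (latency proxy) and throughput."""
          t0 = time.perf_counter()
          with urllib.request.urlopen(url) as resp:
              first = resp.read(1)              # time to first byte
              ttfb = time.perf_counter() - t0
              body = first + resp.read()        # drain the rest
          total = time.perf_counter() - t0
          mbps = len(body) * 8 / total / 1e6
          print(f"TTFB {ttfb * 1e3:.1f} ms, ~{mbps:.1f} Mbit/s "
                f"over {len(body)} bytes")

      # measure("https://example.org/testfile")  # placeholder endpoint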

