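GPU-Accelerated Spark ETL

For analytics and machine learning data jobs

With Apache Spark for on-premises or cloud data processing, organizations can now use GPUs to accelerate data science workflows without changing code, significantly lowering infrastructure costs.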

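Why Apache Spark 3.0?

Key benefits of Spark on NVIDIA GPUs

Faster execution time

Data scientists and engineers can accelerate Apache Spark ETL workloads on NVIDIA GPUs to speed up queries and shorten the total end-to-end time of long-running workflows. That frees up time and attention for more important work.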

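Streamlined analytics to AI

Spark 3.0 can orchestrate end-to-end jobs that span data ingestion, model training, and visualization. The same GPU-accelerated infrastructure can serve both Spark and ML/DL (deep learning) frameworks, which removes the need for separate clusters and lets the entire pipeline benefit from GPU acceleration.

Reduced infrastructure costs

Thanks to their inherent parallelism, GPUs can complete more work than CPUs. Spark on NVIDIA GPUs therefore needs less total hardware to finish a job, saving organizations capital costs on-premises or operating costs in the cloud.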

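Spark 3.0 innovations

Because many data processing tasks are embarrassingly parallel, the GPU architecture is a natural fit for Spark data processing queries, much as GPUs accelerate deep learning workloads in AI. GPU acceleration is transparent to developers, who get these benefits without changing code. Three advances make transparent GPU acceleration possible in Spark 3.0.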
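To make "embarrassingly parallel" concrete, here is a minimal conceptual sketch in plain Python. It is not Spark or GPU code; the data and function names are hypothetical. Each partition is filtered and aggregated without reference to any other partition, which is the property that lets such work spread across many parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into roughly equal, independent partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def transform_partition(partition):
    """Filter and aggregate one partition without touching any other."""
    return sum(amount for amount in partition if amount > 0)


# Hypothetical input: transaction amounts, some of them non-positive.
rows = [12, -3, 7, 0, 25, -8, 4, 19]
partitions = split_into_partitions(rows, 4)

# Each partition is processed independently; only the final combine
# step needs the results from every worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(transform_partition, partitions))

total = sum(partial_sums)
print(total)
```

Because no worker needs another worker's intermediate state, the partition count can grow with the number of available parallel units, whether those are CPU threads as in this sketch or GPU cores.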

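New GPU-accelerated libraries on CUDA-X AI

NVIDIA® CUDA® is a revolutionary parallel computing architecture that accelerates compute operations, such as matrix multiplication, on NVIDIA GPUs. RAPIDS, developed by NVIDIA, is a suite of open-source libraries built on CUDA that enables end-to-end data science and analytics jobs to run entirely on GPUs. For Spark 3.0, NVIDIA has extended RAPIDS with the APIs that Spark query plans use. RAPIDS now includes Java bindings for these APIs, so Spark can call them directly.

Modifications to Spark components

Spark 3.0 adds columnar processing support to the Catalyst query optimizer, which the RAPIDS Accelerator uses to accelerate SQL and DataFrame operations. When the query plan executes, those operations can run on GPUs across the Spark cluster. NVIDIA has also created a new Spark shuffle implementation that optimizes data transfer between Spark processes. It is built on GPU-accelerated communication libraries and transports, including UCX, RDMA, and NCCL.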
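As a rough sketch of what "no code changes" can look like in practice, the accelerator is typically switched on through Spark configuration rather than in the application itself. The helper below only assembles a `spark-submit` command line. The plugin class `com.nvidia.spark.SQLPlugin` and the `spark.rapids.sql.enabled` setting come from the RAPIDS Accelerator documentation; the jar path and the application file name are placeholders, and the function name is hypothetical.

```python
import shlex


def build_submit_command(app_path, rapids_jar, extra_conf=None):
    """Assemble spark-submit arguments that enable the RAPIDS Accelerator plugin."""
    conf = {
        # Load the accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow SQL and DataFrame operations to be planned for the GPU.
        "spark.rapids.sql.enabled": "true",
    }
    conf.update(extra_conf or {})

    args = ["spark-submit", "--jars", rapids_jar]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


# Placeholder paths: substitute the jar and application for your deployment.
command = build_submit_command("etl_job.py", "/path/to/rapids-4-spark.jar")
print(shlex.join(command))
```

The application file itself is unchanged; the same job can be submitted without the plugin settings to run on CPUs.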

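GPU-aware scheduling in Spark

NVIDIA optimized the job scheduler in Spark 3.0 so that Spark applications can be launched on specific GPU resources. Spark 3.0 treats GPUs as a first-class resource alongside CPUs and system memory. As a result, Spark 3.0 places GPU-accelerated workloads directly on the servers that have the GPU resources those jobs need.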
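Spark 3.0 exposes this scheduling through resource settings such as `spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount`. The sketch below is a simplified model, not Spark's scheduler code: it estimates how many tasks can run concurrently on one executor as the minimum of the CPU and GPU limits. The function name and the example values are hypothetical, and the defaults used when a key is missing are assumptions of this sketch.

```python
from fractions import Fraction


def task_slots(conf):
    """Estimate concurrent tasks per executor from CPU and GPU limits."""
    cores = int(conf.get("spark.executor.cores", "1"))
    cpus_per_task = int(conf.get("spark.task.cpus", "1"))
    gpus = int(conf.get("spark.executor.resource.gpu.amount", "0"))
    # A task may request a fraction of a GPU, so parse the amount exactly.
    gpu_per_task = Fraction(conf.get("spark.task.resource.gpu.amount", "0"))

    cpu_slots = cores // cpus_per_task
    if gpus == 0 or gpu_per_task == 0:
        # No GPU requested: only the CPU limit applies.
        return cpu_slots
    gpu_slots = int(gpus / gpu_per_task)
    return min(cpu_slots, gpu_slots)


# Illustrative values: 8 cores and 1 GPU per executor, 4 tasks sharing the GPU.
example_conf = {
    "spark.executor.cores": "8",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
}
print(task_slots(example_conf))
```

With these illustrative values the GPU is the limiting resource, so half of the executor's cores would sit idle; lowering the per-task GPU amount to 0.125 would let all eight cores run tasks.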

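Accelerated ETL and AI on Spark

As ML and DL are applied to ever larger datasets, Spark has become a common tool for data preprocessing and feature engineering when preparing raw input data for the learning phase. The Spark community has worked to bring the two phases of this end-to-end job together so that data scientists can work with a single Spark cluster and avoid the cost of moving data to an external data lake between phases. Horovod (backed by Uber) and TensorFlowOnSpark (backed by Yahoo) are examples of this approach.

Spark 3.0 marks a key milestone: Spark can now schedule GPU-accelerated ML and DL applications on Spark clusters with GPUs. The complete stack for this accelerated data science job is shown below: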

[Figure: Accelerated ETL and AI on Spark]

Get started with GPU-accelerated Spark ETL

For early access to the preview release of the RAPIDS Accelerator for Apache Spark 3.0, see the installation documentation or contact NVIDIA's Spark team.

Adobe

We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs. With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.

- William Yan, Senior Director of Machine Learning, Adobe

Databricks

Our continued work with NVIDIA improves performance with RAPIDS optimizations for Apache Spark 3.0 and Databricks to benefit our joint customers like Adobe. These contributions lead to faster data pipelines, model training and scoring, that directly translate to more breakthroughs and insights for our community of data engineers and data scientists.

- Matei Zaharia, original creator of Apache Spark and Chief Technologist at Databricks

Cisco

Cisco has thousands of customers with big data deployments for their data lake who are constantly looking to accelerate their workloads. Apache Spark 3.0 brings newer capabilities to access NVIDIA GPUs natively, thereby defining the next generation of data lakes accelerating AI/ML, ETL, and other workloads. Cisco is working closely with NVIDIA to bring this next phase of data lake innovation to our customers.

- Siva Sivakumar, Senior Director Data Center Solutions, Cisco

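Download our free ebook!

Want to harness the power of AI to unlock the value of big data? Download our new ebook, "Start Accelerating Data Science with Apache Spark 3.0," to learn more about the next evolution of Apache Spark.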