PlasmaENGINE has several modes of operation to give you the flexibility you need. The default mode, chosen based on your infrastructure, will generally be the best choice.
In GPU mode, PlasmaENGINE runs vectorized native code on the GPU. Row-based data is automatically converted to vectorized (columnar) data. GPUs are well suited to vectorized data because they have thousands of cores built for data-parallel work.
GPU mode will be the default on machines with NVIDIA GPUs. To force GPU mode, use
spark.plasma.task.use_gpu (see PlasmaENGINE Configuration Parameters).
When a GPU is not available, PlasmaENGINE can use the same native code generation and vectorized operations on CPUs. This mode is still faster than Apache Spark, but not as fast as GPU mode. It can be useful for jobs that might not see enough of a benefit from GPUs to justify the infrastructure cost difference.
CPU mode will be the default on machines with no GPU. To force CPU mode, use
spark.plasma.task.use_cpu (see PlasmaENGINE Configuration Parameters).
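As a rough sketch, the two mode-forcing parameters could be passed on the `spark-submit` command line with the standard `--conf key=value` option. The parameter names come from the text above, but the value `true`, the helper `spark_submit_args`, and the application name `my_job.py` are assumptions made for illustration; check PlasmaENGINE Configuration Parameters for the accepted values.

```python
# Parameter names are taken from the documentation text.
# ASSUMPTION: both parameters are boolean flags that accept "true".
MODE_PARAMS = {
    "gpu": "spark.plasma.task.use_gpu",
    "cpu": "spark.plasma.task.use_cpu",
}


def spark_submit_args(mode, app):
    """Build a spark-submit argument list that forces the given mode.

    Hypothetical helper, not part of PlasmaENGINE.
    """
    if mode not in MODE_PARAMS:
        raise ValueError(f"unknown mode: {mode!r}")
    # --conf key=value is the standard way to set a Spark property.
    return ["spark-submit", "--conf", f"{MODE_PARAMS[mode]}=true", app]


if __name__ == "__main__":
    # Print the command line rather than running it.
    print(" ".join(spark_submit_args("gpu", "my_job.py")))
```

Building the argument list separately from running it makes it easy to inspect the command before submitting a job.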
When neither GPU nor CPU mode makes sense, or when a feature is not supported, PlasmaENGINE delegates part of the plan to vanilla Apache Spark. Suppose your job has three steps and the middle one is not supported on our engine (for example, some proprietary native code). PlasmaENGINE will run the first step, let Apache Spark handle the second step, and then run the third step. The first and third steps run as generated native code on either CPU or GPU. The second step runs as normal Apache Spark code.
You can control mixed mode with
spark.plasma.mixedModeDelegation (see PlasmaENGINE Configuration Parameters).
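The three-step example can be pictured with a small toy model. This is only an illustration of the delegation idea, not PlasmaENGINE's actual planner: the function `assign_engines`, the step names, and the `supported` flag are all hypothetical.

```python
def assign_engines(steps, native_target="gpu"):
    """Map each (name, supported) step to the engine that would run it.

    Toy model only: supported steps go to the native target (CPU or GPU),
    unsupported steps are delegated to vanilla Apache Spark.
    """
    return [
        (name, native_target if supported else "spark")
        for name, supported in steps
    ]


# Hypothetical three-step job whose middle step is not supported natively.
plan = [("scan", True), ("proprietary_udf", False), ("aggregate", True)]

if __name__ == "__main__":
    for name, engine in assign_engines(plan):
        print(f"{name}: {engine}")
```

In this model the first and third steps land on the native target, and only the unsupported middle step falls back to Apache Spark.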
PlasmaENGINE can be completely disabled by using
spark.plasma.enabled (see PlasmaENGINE Configuration Parameters). In this case, it runs in permanent Apache Spark mode, which is almost the same as using vanilla Apache Spark.