Optimization Techniques


| Technique | Description | API |
| --- | --- | --- |
| Caching¹ | Save intermediate DataFrames for reuse or for branching compute (sketch below). | `df.cache()`, or `df.persist(StorageLevel.MEMORY_AND_DISK)` for more fine-grained control |
| Broadcast variables | Copy data over to all workers to minimize data transfer. This applies to the smaller operand (sketch below). | `from pyspark.sql.functions import broadcast` |
| Update to the latest Spark version | Newer releases pick up more optimizations from the Tungsten engine and the Catalyst optimizer. | |
| Use DataFrame APIs | They are higher-level and more efficient than RDDs. | |
| Tune memory configuration | This feels a bit like an art, though (sketch below). | `SparkConf().set("spark.executor.memory", "4g").set("spark.driver.memory", "2g")` |
| Avoid UDFs as much as possible | They are not optimized compared to the native APIs (sketch below). | |
| Repartition | Helpful when data is heavily skewed, leaving one node with more work to do than the others. Rule of thumb: each partition should be about 128 MB (sketch below). | |
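
A minimal caching sketch: one filtered DataFrame feeds two downstream aggregations, so persisting it avoids recomputing the read and filter twice. The path and column names are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

# Hypothetical input: an expensive read + filter we want to reuse.
events = spark.read.parquet("/data/events").filter("status = 'active'")

# Persist once so both branches below reuse the materialized result.
# df.cache() is shorthand for persist() with the default storage level.
events.persist(StorageLevel.MEMORY_AND_DISK)

daily_counts = events.groupBy("event_date").count()
user_counts = events.groupBy("user_id").count()

daily_counts.show()
user_counts.show()

events.unpersist()  # release the cached data once both branches are done
```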
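
A broadcast-join sketch: the smaller operand is copied to every worker so the large table never has to be shuffled across the cluster. Table and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (hypothetical)
countries = spark.read.parquet("/data/countries")  # small lookup table (hypothetical)

# The broadcast() hint tells the optimizer that `countries` fits in
# memory on each worker, avoiding a shuffle of `orders`.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()
```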
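
The memory configuration from the table, shown in context. The 4g/2g values are just the table's example settings, not recommendations; the right numbers depend on your cluster and workload.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")  # memory per executor
    .set("spark.driver.memory", "2g")    # memory for the driver
)

# Driver memory must be set before the driver JVM starts, so pass the
# conf at session creation (or via spark-submit flags), not afterwards.
spark = SparkSession.builder.config(conf=conf).appName("memory-sketch").getOrCreate()
```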
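
A sketch of why UDFs hurt: the Python UDF below is opaque to the Catalyst optimizer and pays serialization costs crossing the JVM/Python boundary, while the equivalent built-in function does not.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Slower: a Python UDF is a black box to the optimizer.
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", upper_udf("name")).show()

# Faster: the native function stays inside the JVM and is optimized.
df.withColumn("name_upper", F.upper("name")).show()
```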
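
A repartition sketch following the ~128 MB rule of thumb: for roughly 25 GB of data, 25 GB / 128 MB suggests about 200 partitions. The input path, key, and sizes are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

df = spark.read.parquet("/data/skewed_events")  # hypothetical skewed input

# Repartitioning by a high-cardinality key spreads the work more
# evenly across the cluster's nodes.
balanced = df.repartition(200, "user_id")
print(balanced.rdd.getNumPartitions())  # 200
```
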

Footnotes

  1. While Spark’s lazy API shares some optimization techniques with Polars, such as predicate pushdown and projection pushdown, caching is not one of them.