Optimization Techniques
Techniques | Description | API |
---|---|---|
Caching[^1] | Save an intermediate DataFrame for re-use when several downstream computations branch off it (see the caching sketch below) | df.cache() OR df.persist(StorageLevel.MEMORY_AND_DISK) for more fine-grained control |
Broadcast variables | Copy the smaller operand (e.g. a small lookup table) to every worker so joins avoid shuffling the larger side across the network (see the broadcast join sketch below) | from pyspark.sql.functions import broadcast |
Update to the latest Spark version | To receive more optimizations from the Tungsten engine and Catalyst optimizer | |
Use DataFrame APIs | They are higher-level than RDDs and benefit from Catalyst query optimization | |
Tune memory configuration | Match executor and driver memory to the workload; this usually takes some experimentation (see the config sketch below) | SparkConf().set("spark.executor.memory", "4g").set("spark.driver.memory", "2g") |
Avoid UDFs as much as possible | Python UDFs are opaque to the Catalyst optimizer and add serialization overhead; prefer native column functions (see the comparison below) | |
Repartition | Helpful when data is skewed and one node ends up with far more work than the others. A common rule of thumb is to aim for roughly 128 MB per partition (see the repartition sketch below). | df.repartition(numPartitions) or df.repartition(numPartitions, "col") |
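
A minimal caching sketch, assuming a hypothetical events DataFrame with status, country, and device columns: cache() keeps the filtered result around so the two aggregations that branch off it do not recompute the filter.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-example").getOrCreate()

# Hypothetical input; substitute your own source and column names.
events = spark.read.parquet("/data/events.parquet")

# An intermediate result that two downstream aggregations will branch from.
active = events.filter(events["status"] == "active")

# cache() stores the DataFrame with the default storage level (memory,
# spilling to disk as needed). For finer control, pick the level explicitly:
#   active.persist(StorageLevel.MEMORY_AND_DISK)
active.cache()

# Both actions reuse the cached rows instead of re-reading and re-filtering.
active.groupBy("country").count().show()
active.groupBy("device").count().show()

# Free the cached data once it is no longer needed.
active.unpersist()
```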
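
A broadcast-join sketch, assuming a large orders table and a small countries lookup table (both paths and column names are placeholders): the broadcast() hint copies the small table to every executor so the large table is not shuffled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-example").getOrCreate()

orders = spark.read.parquet("/data/orders.parquet")        # large fact table
countries = spark.read.parquet("/data/countries.parquet")  # small lookup table

# Hint Spark to ship the small table to every executor; the join then runs
# locally on each node without shuffling the large table over the network.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()
```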
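
A memory-tuning sketch; the 4g/2g values are examples only, and the right numbers depend on the cluster and workload. Note that spark.driver.memory generally has to be set before the driver JVM starts (for example via spark-submit --driver-memory), so setting it in code only applies when building a fresh session.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Example values only; tune to your cluster and workload.
conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")
    .set("spark.driver.memory", "2g")  # usually must be set before the driver JVM launches
)

spark = (
    SparkSession.builder
    .config(conf=conf)
    .appName("memory-tuning")
    .getOrCreate()
)
```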
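
A small comparison of a Python UDF versus the equivalent native column function, using a toy two-row DataFrame: both produce the same result, but the native version stays inside the JVM and can be optimized by Catalyst.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-native").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: a black box to the Catalyst optimizer, with Python<->JVM
# serialization overhead for every row.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
with_udf = df.withColumn("name_upper", upper_udf("name"))

# Native column function: same result, but optimized and executed in the JVM.
with_native = df.withColumn("name_upper", F.upper("name"))

with_native.show()
```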
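
A repartitioning sketch on a hypothetical clickstream table: repartition() triggers a full shuffle to spread rows evenly (optionally by key), while coalesce() only merges partitions and is cheaper when reducing the partition count. The 200/50 counts are placeholders; the ~128 MB-per-partition rule of thumb from the table is one way to pick them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

clicks = spark.read.parquet("/data/clickstream.parquet")  # hypothetical skewed data

# Full shuffle: spread rows evenly across 200 partitions
# (aim for roughly 128 MB of data per partition).
evened = clicks.repartition(200)

# Shuffle by key so rows for the same user land in the same partition,
# which helps later joins and aggregations on that key.
by_user = clicks.repartition(200, "user_id")

# coalesce() merges existing partitions without a full shuffle; handy for
# shrinking the partition count after a selective filter.
fewer = evened.coalesce(50)
```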
[^1]: While Spark's lazy API shares some optimization techniques, such as predicate pushdown and projection pushdown, with Polars, caching is not one of them.