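Optimizing Matrix Factorization in Apache Spark (PySpark)
In this post, drawing on the MF optimization work I carried out as a data engineer, I summarize the perspective from which I approached the optimization and collect the related references.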
References
[1] Advanced Apache Spark Training - Sameer Farooqui (Databricks)
[2] Tuning Apache Spark for Large-Scale Workloads
[3] SOS: Optimizing Shuffle I/O
[4] Deep Dive: Apache Spark Memory Management
[5] Matrix Computations and Optimization in Apache Spark
[6] Getting The Best Performance With PySpark
[7] Apache Spark @Scale: A 60 TB+ production use case
[8] Implementing Large-Scale Matrix Factorization on Apache Spark
[9] Optimizing Apache Spark SQL Joins
[10] Optimal Strategies for Large-Scale Batch ETL Jobs
[11] Tuning Spark
[12] Tuning Spark application tasks
[13] Troubleshooting and Tuning Spark for Heavy Workloads
[14] Why Your Spark Apps Are Slow Or Failing, Part II: Data Skew and Garbage Collection
[15] Tuning G1 GC for spark jobs
[16] How do I get a cartesian product of a huge dataset?
[17] Scaling Apache Spark at Facebook (https://www.slideshare.net/databricks/scaling-apache-spark-at-facebook)