Introduction xix
Chapter 1 Finishing Your Spark Job 1
Installation of the Necessary Components 2
Native Installation Using a Spark Standalone Cluster 3
The History of Distributed Computing That Led to Spark 3
Enter the Cloud 4
Understanding Resource Management 5
Using Various Formats for Storage 8
Text Files 10
Sequence Files 11
Avro Files 11
Parquet Files 12
Making Sense of Monitoring and Instrumentation 13
Spark UI 13
Spark Standalone UI 15
Metrics REST API 16
Metrics System 16
External Monitoring Tools 16
Summary 17
Chapter 2 Cluster Management 19
Background 21
Spark Components 24
Driver 25
Workers and Executors 26
Configuration 27
Spark Standalone 30
Architecture 31
Single-Node Setup Scenario 31
Multi-Node Setup 32
YARN 33
Architecture 35
Dynamic Resource Allocation 37
Scenario 39
Mesos 40
Setup 41
Architecture 42
Dynamic Resource Allocation 44
Basic Setup Scenario 44
Comparison 46
Summary 50
Chapter 3 Performance Tuning 53
Spark Execution Model 54
Partitioning 56
Controlling Parallelism 56
Partitioners 58
Shuffling Data 59
Shuffling and Data Partitioning 61
Operators and Shuffling 63
Shuffling Is Not That Bad After All 67
Serialization 67
Kryo Registrators 69
Spark Cache 69
Spark SQL Cache 73
Memory Management 73
Garbage Collection 74
Shared Variables 75
Broadcast Variables 76
Accumulators 78
Data Locality 81
Summary 82
Chapter 4 Security 83
Architecture 84
Security Manager 84
Setup Configurations 85
ACL 86
Configuration 86
Job Submission 87
Web UI 88
Network Security 95
Encryption 96
Event Logging 101
Kerberos 101
Apache Sentry 102
Summary 102
Chapter 5 Fault Tolerance or Job Execution 105
Lifecycle of a Spark Job 106
Spark Master 107
Spark Driver 109
Spark Worker 111
Job Lifecycle 112
Job Scheduling 112
Scheduling within an Application 113
Scheduling with External Utilities 120
Fault Tolerance 122
Internal and External Fault Tolerance 122
Service Level Agreements (SLAs) 123
Resilient Distributed Datasets (RDDs) 124
Batch versus Streaming 130
Testing Strategies 133
Recommended Configurations 139
Summary 142
Chapter 6 Beyond Spark 145
Data Warehousing 146
Spark SQL CLI 147
Thrift JDBC/ODBC Server 147
Hive on Spark 148
Machine Learning 150
DataFrame 150
MLlib and ML 153
Mahout on Spark 158
Hivemall on Spark 160
External Frameworks 161
Spark Package 161
XGBoost 163
spark-jobserver 164
Future Works 166
Integration with the Parameter Server 167
Deep Learning 175
Enterprise Usage 182
Collecting User Activity Log with Spark and Kafka 183
Real-Time Recommendation with Spark 184
Real-Time Categorization of Twitter Bots 186
Summary 186
Index 189