Apache Spark Boilerplate Template using Scala

In our previous post, we discussed how to set up Spark and PySpark. In this post, let us put together a simple boilerplate project that we can use as the base for many Spark projects.

We will use Gradle as our build tool. It is good to build a fat jar, i.e., a jar that packages all the dependencies within it. This gives us a deployment package that does not depend on a Maven repository or on the Spark cluster having access to the internet.

Step 1: Create directories

We first create the necessary directories for our setup.

mkdir -p ~/projects/spark-boilerplate
mkdir -p ~/projects/spark-boilerplate/src/main/scala
mkdir -p ~/projects/spark-boilerplate/src/main/resources
mkdir -p ~/projects/spark-boilerplate/src/test/scala
mkdir -p ~/projects/spark-boilerplate/src/test/resources

The spark-boilerplate directory is the root of our project, src/main/scala will hold our Scala source files, and src/main/resources will hold the properties or other files required by the application.

Similarly, src/test/scala will hold the Scala files for our unit tests, and src/test/resources will hold any additional files required to test the application.

Step 2: Download necessary gradle files

We have to set up the Gradle wrapper for our project. However, as I keep saying, I do not suggest installing anything at the system level, so we are going to use a temporary download of Gradle to create the necessary wrapper files for our project.

Let us download and unzip the gradle library.

wget https://services.gradle.org/distributions/gradle-8.4-bin.zip -P ~/Downloads
unzip ~/Downloads/gradle-8.4-bin.zip -d /opt

Now head over to the project directory and execute the gradle wrapper command to create the gradle files in our project.

cd ~/projects/spark-boilerplate
/opt/gradle-8.4/bin/gradle wrapper --gradle-version=8.4 --distribution-type=bin --no-daemon

This should create ./gradle/wrapper/gradle-wrapper.jar, ./gradle/wrapper/gradle-wrapper.properties, ./gradlew and ./gradlew.bat. We can now delete our download, or you can choose to keep it as it is.

rm -r /opt/gradle-8.4
rm ~/Downloads/gradle-8.4-bin.zip

Step 3: Create gradle properties file

We will now create a ./gradle.properties file at the root of our project. This file defines the various properties that Gradle uses while building our project.

org.gradle.caching=true
org.gradle.configureondemand=true
org.gradle.daemon=false
org.gradle.jvmargs=-Xms256m -Xmx2048m -XX:MaxMetaspaceSize=512m -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -Dfile.encoding=UTF-8
org.gradle.parallel=true
org.gradle.warning.mode=all
org.gradle.welcome=never

You can choose not to disable the Gradle daemon. It is one more Java process that keeps running and takes up resources, so I tend to disable it. Keeping it enabled saves the daemon initialization time on every Gradle invocation.

Step 4: Create the version catalog

The ./gradle/libs.versions.toml file is where we declare all the versions and libraries referenced by the project. It acts as a centralized dependency management file that we can reuse in a multi-module project or even reference from other projects.

[versions]
apache-spark = "3.5.0"
scala = "2.13.8"
scala-test = "3.2.17"

[libraries]
scala-library = { module = "org.scala-lang:scala-library", version.ref = "scala" }
scala-test = { module = "org.scalatest:scalatest_2.13", version.ref = "scala-test" }
spark-core = { module = "org.apache.spark:spark-core_2.13", version.ref = "apache-spark" }
spark-sql = { module = "org.apache.spark:spark-sql_2.13", version.ref = "apache-spark" }

[bundles]
spark = ["spark-core", "spark-sql"]

Step 5: Create gradle build file

The ./build.gradle.kts file is where we define the project's dependencies and configure the various tasks we need.

plugins {
    scala
}

project.group = "com.barrelsofdata"
project.version = "1.0.0"

dependencies {
    compileOnly(libs.scala.library)
    compileOnly(libs.bundles.spark)
    
    testImplementation(libs.scala.test)
}

tasks.withType<Test>().configureEach {
    maxParallelForks = (Runtime.getRuntime().availableProcessors() / 2).coerceAtLeast(1)
}

configurations {
    implementation {
        resolutionStrategy.failOnVersionConflict()
    }
    testImplementation {
        extendsFrom(configurations.compileOnly.get())
    }
}

// Run the ScalaTest suites through the ScalaTest Runner, since Gradle's default test task only discovers JUnit/TestNG-style tests
tasks.register<JavaExec>("scalaTest") {
    dependsOn("testClasses")
    mainClass = "org.scalatest.tools.Runner"
    args = listOf("-R", "build/classes/scala/test", "-o")
    jvmArgs = listOf("-Xms128m", "-Xmx512m", "-XX:MetaspaceSize=300m", "-ea", "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED") // https://lists.apache.org/thread/p1yrwo126vjx5tht82cktgjbmm2xtpw9
    classpath = sourceSets.test.get().runtimeClasspath
}

tasks.withType<Test> {
    dependsOn(":scalaTest")
}

tasks.withType<Jar> {
    manifest {
        attributes["Main-Class"] = "com.barrelsofdata.sparkexamples.Driver"
    }
    from (configurations.runtimeClasspath.get().map { if (it.isDirectory()) it else zipTree(it) })
    archiveFileName.set("${archiveBaseName.get()}-${project.version}.${archiveExtension.get()}")
}

tasks.clean {
    doFirst {
        delete("logs/")
    }
}

Note how we add the Spark dependencies as compileOnly. This is because we do not need them to be part of our fat jar; the Spark libraries will already be available on the cluster where we run the job.

Step 6: Create gradle settings file

This will be ./settings.gradle.kts in the root of the project. It sets the project name and configures the repositories from which the dependencies declared in the previous steps are resolved.

pluginManagement {
    repositories {
        mavenCentral()
        gradlePluginPortal()
    }
}

dependencyResolutionManagement {
    repositories {
        mavenCentral()
    }
}

rootProject.name = "spark-boilerplate"

Step 7: Create driver class

Any Spark job needs a main class that runs on the driver; this is the class that holds our main method. Let us first create the package structure com.barrelsofdata.sparkexamples.

mkdir -p ~/projects/spark-boilerplate/src/main/scala/com/barrelsofdata/sparkexamples

And now for the Driver class: the below code initializes a SparkSession and uses spark.sql() to print the output of a simple query. We also write a few log messages just to ensure our logging is working. This file will be called Driver.scala and will go into the package we created above.

package com.barrelsofdata.sparkexamples

import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object Driver {
  val JOB_NAME: String = "Boilerplate"
  val LOG: Logger = Logger.getLogger(this.getClass.getCanonicalName)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName(JOB_NAME).getOrCreate()
    spark.sql("SELECT 'hello' AS col1").show()
    LOG.info("Dummy info message")
    LOG.warn("Dummy warn message")
    LOG.error("Dummy error message")
  }
}
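
The args parameter is unused in this boilerplate driver. In a real job, you would typically parse inputs from it and delegate the actual work to a small, testable function. The sketch below is only a hypothetical illustration of that pattern (the Transformations object and its wordCount method are not part of this template); it keeps the Spark logic in a plain function that a driver can call and that, as we will see in the next step, unit tests can exercise directly.

package com.barrelsofdata.sparkexamples

import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical example, not part of the boilerplate: factoring the transformation into its
// own object keeps it callable from both a driver's main method and the unit tests.
object Transformations {
  def wordCount(lines: Dataset[String]): DataFrame = {
    val spark = lines.sparkSession
    import spark.implicits._

    lines.flatMap(_.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)          // drop empty tokens
      .groupBy("value")            // a Dataset[String] exposes its contents as column "value"
      .count()
  }
}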

Step 8: Create unit testing setup

Unit testing is of utmost importance for any project: the tests validate that the code is functioning as it should and help prevent unexpected situations from arising in production.

We create a similar package structure under src/test/scala and add a DriverTest class to it.

package com.barrelsofdata.sparkexamples

import java.util.Properties
import org.apache.log4j.{LogManager, Logger, PropertyConfigurator}
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class DriverTest extends AnyFunSuite with BeforeAndAfterAll {
  val JOB_NAME: String = "Driver Test Job"
  val LOGGER_PROPERTIES: String = "log4j-test.properties"
  val LOG: Logger = Logger.getLogger(this.getClass.getCanonicalName)
  var spark: SparkSession = _

  def setupLogger(): Unit = {
    val properties = new Properties
    properties.load(getClass.getClassLoader.getResource(LOGGER_PROPERTIES).openStream())
    LogManager.resetConfiguration()
    PropertyConfigurator.configure(properties)
  }

  override def beforeAll(): Unit = {
    setupLogger()
    LOG.info("Setting up spark session")
    spark = SparkSession.builder().appName(JOB_NAME).master("local[*]").getOrCreate()
  }

  override def afterAll(): Unit = {
    LOG.info("Closing spark session")
    spark.close()
  }

  test("Check if spark session is working") {
    LOG.info("Testing spark job")
    assertResult("hello")(spark.sql("SELECT 'hello'").collect().head.get(0))
  }
}

Do note that this is just a dummy test class and does not exercise any functionality from the driver; we create it only to ensure our build process runs the tests.
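
For contrast, a test that exercises real application code could look like the sketch below. It again assumes the hypothetical Transformations.wordCount helper from Step 7 and is not part of the boilerplate; it only illustrates how the same ScalaTest setup would be used against actual logic.

package com.barrelsofdata.sparkexamples

import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical sketch: exercises the Transformations.wordCount helper sketched earlier,
// which is not shipped with the boilerplate itself.
class TransformationsTest extends AnyFunSuite with BeforeAndAfterAll {
  var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().appName("Transformations Test").master("local[*]").getOrCreate()
  }

  override def afterAll(): Unit = {
    spark.close()
  }

  test("Word count aggregates repeated words") {
    val session = spark // stable val so that the implicits can be imported
    import session.implicits._

    val input = Seq("hello world", "hello spark").toDS()
    val result = Transformations.wordCount(input)
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    assertResult(2L)(result("hello"))
    assertResult(1L)(result("world"))
  }
}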

Step 9: Create logger properties

We will now create a properties file for the log4j logger, to be used during tests. This will go into src/test/resources/log4j-test.properties. We disable all logs at WARN level and below for Spark, but keep INFO and above for our project.

log4j.rootLogger=INFO, console
log4j.logger.com.barrelsofdata.sparkexamples=INFO, console
log4j.additivity.com.barrelsofdata.sparkexamples=false
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.immediateFlush=true
log4j.appender.console.encoding=UTF-8
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.conversionPattern=%d{yyyy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.apache=ERROR
log4j.logger.org.spark_project=ERROR
log4j.logger.org.sparkproject=ERROR
log4j.logger.parquet=ERROR

We will not add a logger configuration to our application in this post, as the logger cannot be configured programmatically for jobs running on a cluster.

Step 10: Readme file

No project is complete without a readme file. People looking at your project should know what it is about and have all the necessary instructions right there. So, let's create a README.md file in the root directory.

# Spark Boilerplate
This is a boilerplate project for Apache Spark. The related blog post can be found at [https://barrelsofdata.com/spark-boilerplate-using-scala](https://barrelsofdata.com/spark-boilerplate-using-scala)
## Build instructions
From the root of the project execute the below commands
- To clear all compiled classes, build and log directories
```shell script
./gradlew clean
```
- To run tests
```shell script
./gradlew test
```
- To build jar
```shell script
./gradlew build
```
## Run
```shell script
spark-submit --master yarn --deploy-mode cluster build/libs/spark-boilerplate-1.0.0.jar
```

Step 11: Create gitignore file

Every project needs to be maintained through a version control system, and I prefer Git. We will have a lot of temporary files or unrelated files that shouldn't make it to the repository.

We use the below setup for the .gitignore file in the root of our project.

# Compiled classes
*.class
# Gradle files
.gradle
# IntelliJ IDEA files
.idea
# Build files
build
# Log files
*.log
logs

At the end of all these steps, your directory structure would look like this:

[Image: Spark Boilerplate directory structure]

Step 12: Test, Build and Run

Finally, it is time to check that all of the above setup is working. We first test and build the project, which will create a jar file named spark-boilerplate-1.0.0.jar in the ./build/libs directory.

./gradlew clean test build

You should see a report saying the tests passed and the build was successful. We can now run the jar on Spark and check that our project is working.

spark-submit --master yarn --deploy-mode client ~/projects/spark-boilerplate/build/libs/spark-boilerplate-1.0.0.jar

Use --master local[*] without the deploy-mode option if you have not set up a Hadoop cluster in pseudo-distributed mode.

[Image: Spark Boilerplate output]

If you get an output with the query results and log messages as shown above, you have successfully created the boilerplate project for Spark. We can now use this to build many more projects without having to set up the base again.

You can find the above project on git, as a template repository, at https://git.barrelsofdata.com/barrelsofdata/spark-boilerplate.