
Connecting Spark 2.4+ to Hive 1.2 via JDBC

Bridging backward compatibility gaps between legacy Hive and modern Spark


My name is Pulkit, and I am a seasoned Data Engineer. Alongside my expertise in Spark and Hadoop applications, I am deeply fond of the AWS Cloud. I love learning new tech and broadening my horizons every single day.

🧩 Scenario

Apache Spark remains the de facto framework for large-scale data processing. Spark has evolved rapidly since its release, and so has Apache Hive. But one persistent challenge with Hive is the lack of backward compatibility across its APIs.

If your production environment uses the latest Hive + Hadoop setup and you need to connect Spark to an older Hive 1.2 instance, you’ll soon hit compatibility walls. So, how can we make it work?


☕ JDBC to the Rescue

Incompatibility between Spark 2.4+ and Hive < 2.0 arises primarily from protocol and packaging changes.

  • Hive < 2.0 uses hive-service.jar to implement the Thrift protocol.

  • Hive ≥ 2.0 moved the Thrift service definitions into hive-service-rpc.jar.

These architectural differences break backward compatibility, making direct integration with legacy Hive difficult.

Backward compatibility in software is where ancient relics and cutting-edge tech awkwardly high-five each other.

Fortunately, Hive supports multiple connection protocols — including JDBC, which allows Java (and by extension, Spark) to interact with HiveServer2 for query execution and table operations.
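Under the hood this is ordinary JDBC: register org.apache.hive.jdbc.HiveDriver and open a session against HiveServer2. A minimal Java sketch of the idea (host, port, and database below are placeholders; the commented-out portion assumes hive-jdbc-1.2.1.jar and its dependencies are on the classpath and a live HiveServer2 is reachable):

```java
// Builds a HiveServer2 JDBC URL; the connection itself is sketched in comments
// because it needs a running HiveServer2 plus the Hive 1.2 driver jars.
public class HiveJdbcSketch {
    static String hiveJdbcUrl(String host, int port, String db) {
        return String.format("jdbc:hive2://%s:%d/%s", host, port, db);
    }

    public static void main(String[] args) {
        String url = hiveJdbcUrl("hive-host.example.com", 10000, "default");
        System.out.println(url);

        // With the driver on the classpath, the connection looks like:
        // Class.forName("org.apache.hive.jdbc.HiveDriver");
        // try (java.sql.Connection conn =
        //          java.sql.DriverManager.getConnection(url, "hiveuser", "")) {
        //     java.sql.ResultSet rs = conn.createStatement().executeQuery("SHOW TABLES");
        //     while (rs.next()) System.out.println(rs.getString(1));
        // }
    }
}
```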


🧱 Required JARs for Hive 1.2 JDBC

To connect Spark 2.4+ to Hive 1.2 via JDBC, you’ll need the following JARs in your Spark classpath:

  • hive-jdbc-1.2.1.jar (v1.2.1): contains the driver class org.apache.hive.jdbc.HiveDriver

  • hive-shims-common-1.2.1.jar and hive-shims-0.23-1.2.1.jar (v1.2.1): provide compatibility shims and Kerberos auth support

  • libthrift-0.9.3.jar (v0.9.3): implements the Thrift communication protocol

  • hive-service-1.2.jar (v1.2): contains the Thrift service definitions for Hive < 2.0

  • hive-serde-1.2.1.jar (v1.2.1): supports serialization/deserialization logic

🛠️ Required Modifications for Spark 2.4 Compatibility

Using Hive 1.2 JDBC drivers directly with Spark 2.4.8 (or newer) can lead to runtime errors. Below are the two common issues and their workarounds.


❌ Error 1: Unknown Hadoop Version 3

Error message:

Illegal Hadoop Version: 3.x (expected A.B.* format)

Root cause:
In ShimLoader.java (inside hive-shims-common.jar), the method getMajorVersion() does not recognize Hadoop 3.x.

// Original code in ShimLoader.getMajorVersion()
// vers comes from VersionInfo.getVersion(); parts = vers.split("\\.")
switch (Integer.parseInt(parts[0])) {
  case 1:
    return HADOOP20SVERSIONNAME;
  case 2:
    return HADOOP23VERSIONNAME;
  default:
    throw new IllegalArgumentException("Unrecognized Hadoop major version number: " + vers);
}

Fix:
Modify the default block to return a valid version instead of throwing an exception:

default:
    return HADOOP23VERSIONNAME; // Cheat: treat 3.x as Hadoop 2.3

Then recompile the JAR:

mvn clean package

This quick patch allows Hive 1.2 shims to function under Hadoop 3.x.
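To see why the one-line change is enough, the version check can be reproduced in isolation. A standalone sketch of the patched logic (the constants are inlined here on the assumption that they match ShimLoader's values):

```java
// Standalone reproduction of the patched ShimLoader.getMajorVersion() logic.
public class ShimVersionCheck {
    static final String HADOOP20SVERSIONNAME = "0.20S";
    static final String HADOOP23VERSIONNAME = "0.23";

    static String getMajorVersion(String vers) {
        String[] parts = vers.split("\\.");
        switch (Integer.parseInt(parts[0])) {
            case 1:
                return HADOOP20SVERSIONNAME;
            case 2:
                return HADOOP23VERSIONNAME;
            default:
                // Patched: treat Hadoop 3.x (and later) like 2.x instead of throwing
                return HADOOP23VERSIONNAME;
        }
    }

    public static void main(String[] args) {
        System.out.println(getMajorVersion("1.2.1")); // legacy branch
        System.out.println(getMajorVersion("3.3.6")); // default branch after the patch
    }
}
```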


❌ Error 2: Unsupported Method HiveStatement.setQueryTimeout()

Error message:

java.sql.SQLException: Method not supported

Root cause:
In HiveStatement.java, the method setQueryTimeout unconditionally throws; Spark 2.4+ calls it when preparing JDBC statements.

public void setQueryTimeout(int seconds) throws SQLException {
    throw new SQLException("Method not supported");
}

Fix:
Comment out or remove the throw statement:

public void setQueryTimeout(int seconds) throws SQLException {
    // No-op to maintain compatibility
}

Rebuild the JAR to produce your patched version. Once both fixes are applied, your Hive 1.2 driver will operate smoothly with Spark 2.4.


🚀 Launching Spark with Hive 1.2 JDBC

After preparing all the patched JARs (e.g., in a directory named hive1.2jars), you can start spark-shell as follows:

cd hive1.2jars

spark-shell \
  --master yarn \
  --jars hive-jdbc.jar,hive-shims-common.jar,hive-shims-0.23.0.jar,libthrift.jar,hive-service.jar,hive-serde.jar \
  --conf spark.driver.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.executor.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
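From inside the shell, reads then go through Spark's generic JDBC source rather than the Hive catalog. A cluster-dependent sketch of what you might type at the spark-shell prompt (host, port, and table name are placeholders):

```scala
val df = spark.read
  .format("jdbc")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("url", "jdbc:hive2://legacy-hive-host:10000/default")
  .option("dbtable", "my_table")
  .load()

df.show()
```

Because this path goes through JDBC, the legacy warehouse is treated like any external database; Spark's built-in Hive catalog integration is bypassed entirely.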

🔐 Kerberos & JAAS Configuration

When connecting to HiveServer2 in a secure (Kerberized) environment, the Hive JDBC driver may fail to automatically obtain service tokens.
You’ll need to manually configure JAAS for authentication:

--files ./jaas.conf \
--conf spark.executor.extraJavaOptions="-Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=./jaas.conf" \
--conf spark.security.credentials.hiveserver2.enabled=true

Ensure that jaas.conf defines the correct serviceName="hiveserver2" entry for your environment.
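For reference, one common shape such a file takes is shown below. This is only an illustrative sketch: the entry name, principal, and keytab path are all environment-specific placeholders, and the exact options your login module honors depend on your JVM and cluster setup.

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./user.keytab"
  principal="user@EXAMPLE.COM"
  serviceName="hiveserver2"
  doNotPrompt=true;
};
```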


✅ Summary

By applying minor source-level patches to the Hive 1.2 JDBC driver and shims, you can connect modern Spark clusters (Hadoop 3.x) to legacy Hive instances without major rewrites.
This approach is especially useful for data migration, legacy audits, or cross-version compatibility testing.


