
Connecting Spark 2.4+ to Hive 1.2 via JDBC

Bridging backward compatibility gaps between legacy Hive and modern Spark


My name is Pulkit, and I am a seasoned Data Engineer. Alongside my expertise in Spark and Hadoop applications, I am deeply fond of the AWS Cloud. I love learning new tech and broadening my horizons every single day.

🧩 Scenario

Apache Spark remains the de facto framework for large-scale data processing. Spark has evolved rapidly since its release, and so has Apache Hive. But one persistent challenge with Hive is the lack of backward compatibility across its APIs.

If your production environment uses the latest Hive + Hadoop setup and you need to connect Spark to an older Hive 1.2 instance, you’ll soon hit compatibility walls. So, how can we make it work?


☕ JDBC to the Rescue

Incompatibility between Spark 2.4+ and Hive < 2.0 arises primarily from protocol and packaging changes.

  • Hive < 2.0 uses hive-service.jar to implement the Thrift protocol.

  • Hive ≥ 2.0 moved the Thrift service definitions into hive-service-rpc.jar.

These architectural differences break backward compatibility, making direct integration with legacy Hive difficult.

Backward compatibility in software is where ancient relics and cutting-edge tech awkwardly high-five each other.

Fortunately, Hive supports multiple connection protocols — including JDBC, which allows Java (and by extension, Spark) to interact with HiveServer2 for query execution and table operations.
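Under the hood this is ordinary JDBC: register org.apache.hive.jdbc.HiveDriver and open a session against HiveServer2. A minimal Java sketch of the idea (host, port, and database below are placeholders; the commented-out portion assumes hive-jdbc-1.2.1.jar and its dependencies are on the classpath and a live HiveServer2 is reachable):

```java
// Builds a HiveServer2 JDBC URL; the connection itself is sketched in comments
// because it needs a running HiveServer2 plus the Hive 1.2 driver jars.
public class HiveJdbcSketch {
    static String hiveJdbcUrl(String host, int port, String db) {
        return String.format("jdbc:hive2://%s:%d/%s", host, port, db);
    }

    public static void main(String[] args) {
        String url = hiveJdbcUrl("hive-host.example.com", 10000, "default");
        System.out.println(url);

        // With the driver on the classpath, the connection looks like:
        // Class.forName("org.apache.hive.jdbc.HiveDriver");
        // try (java.sql.Connection conn =
        //          java.sql.DriverManager.getConnection(url, "hiveuser", "")) {
        //     java.sql.ResultSet rs = conn.createStatement().executeQuery("SHOW TABLES");
        //     while (rs.next()) System.out.println(rs.getString(1));
        // }
    }
}
```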


🧱 Required JARs for Hive 1.2 JDBC

To connect Spark 2.4+ to Hive 1.2 via JDBC, you’ll need the following JARs in your Spark classpath:

  • hive-jdbc-1.2.1.jar (v1.2.1): contains the driver class org.apache.hive.jdbc.HiveDriver

  • hive-shims-common-1.2.1.jar and hive-shims-0.23-1.2.1.jar (v1.2.1): provide compatibility shims and Kerberos auth support

  • libthrift-0.9.3.jar (v0.9.3): implements the Thrift communication protocol

  • hive-service-1.2.jar (v1.2): contains the Thrift service definitions for Hive < 2.0

  • hive-serde-1.2.1.jar (v1.2.1): supports serialization/deserialization logic

🛠️ Required Modifications for Spark 2.4 Compatibility

Using Hive 1.2 JDBC drivers directly with Spark 2.4.8 (or newer) can lead to runtime errors. Below are the two common issues and their workarounds.


❌ Error 1: Unknown Hadoop Version 3

Error message:

Illegal Hadoop Version: 3.x (expected A.B.* format)

Root cause:
In ShimLoader.java (inside hive-shims-common.jar), the method getMajorVersion() does not recognize Hadoop 3.x.

// Original code in ShimLoader.getMajorVersion()
// vers comes from VersionInfo.getVersion(); parts = vers.split("\\.")
switch (Integer.parseInt(parts[0])) {
  case 1:
    return HADOOP20SVERSIONNAME;
  case 2:
    return HADOOP23VERSIONNAME;
  default:
    throw new IllegalArgumentException("Unrecognized Hadoop major version number: " + vers);
}

Fix:
Modify the default block to return a valid version instead of throwing an exception:

default:
    return HADOOP23VERSIONNAME; // Cheat: treat 3.x as Hadoop 2.3

Then recompile the JAR:

mvn clean package

This quick patch allows Hive 1.2 shims to function under Hadoop 3.x.
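To see why the one-line change is enough, the version check can be reproduced in isolation. A standalone sketch of the patched logic (the constants are inlined here on the assumption that they match ShimLoader's values):

```java
// Standalone reproduction of the patched ShimLoader.getMajorVersion() logic.
public class ShimVersionCheck {
    static final String HADOOP20SVERSIONNAME = "0.20S";
    static final String HADOOP23VERSIONNAME = "0.23";

    static String getMajorVersion(String vers) {
        String[] parts = vers.split("\\.");
        switch (Integer.parseInt(parts[0])) {
            case 1:
                return HADOOP20SVERSIONNAME;
            case 2:
                return HADOOP23VERSIONNAME;
            default:
                // Patched: treat Hadoop 3.x (and later) like 2.x instead of throwing
                return HADOOP23VERSIONNAME;
        }
    }

    public static void main(String[] args) {
        System.out.println(getMajorVersion("1.2.1")); // legacy branch
        System.out.println(getMajorVersion("3.3.6")); // default branch after the patch
    }
}
```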


❌ Error 2: Unsupported Method HiveStatement.setQueryTimeout()

Error message:

java.sql.SQLException: Method not supported

Root cause:
In HiveStatement.java, the method setQueryTimeout unconditionally throws; Spark 2.4+ calls it when preparing JDBC statements.

public void setQueryTimeout(int seconds) throws SQLException {
    throw new SQLException("Method not supported");
}

Fix:
Comment out or remove the throw statement:

public void setQueryTimeout(int seconds) throws SQLException {
    // No-op to maintain compatibility
}

Rebuild the JAR to produce your patched version. Once both fixes are applied, your Hive 1.2 driver will operate smoothly with Spark 2.4.


🚀 Launching Spark with Hive 1.2 JDBC

After preparing all the patched JARs (e.g., in a directory named hive1.2jars), you can start spark-shell as follows:

cd hive1.2jars

spark-shell \
  --master yarn \
  --jars hive-jdbc.jar,hive-shims-common.jar,hive-shims-0.23.0.jar,libthrift.jar,hive-service.jar,hive-serde.jar \
  --conf spark.driver.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.executor.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
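From inside the shell, reads then go through Spark's generic JDBC source rather than the Hive catalog. A cluster-dependent sketch of what you might type at the spark-shell prompt (host, port, and table name are placeholders):

```scala
val df = spark.read
  .format("jdbc")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("url", "jdbc:hive2://legacy-hive-host:10000/default")
  .option("dbtable", "my_table")
  .load()

df.show()
```

Because this path goes through JDBC, the legacy warehouse is treated like any external database; Spark's built-in Hive catalog integration is bypassed entirely.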

🔐 Kerberos & JAAS Configuration

When connecting to HiveServer2 in a secure (Kerberized) environment, the Hive JDBC driver may fail to automatically obtain service tokens.
You’ll need to manually configure JAAS for authentication:

--files ./jaas.conf \
--conf spark.executor.extraJavaOptions="-Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=./jaas.conf" \
--conf spark.security.credentials.hiveserver2.enabled=true

Ensure that jaas.conf defines the correct serviceName="hiveserver2" entry for your environment.
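For reference, one common shape such a file takes is shown below. This is only an illustrative sketch: the entry name, principal, and keytab path are all environment-specific placeholders, and the exact options your login module honors depend on your JVM and cluster setup.

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./user.keytab"
  principal="user@EXAMPLE.COM"
  serviceName="hiveserver2"
  doNotPrompt=true;
};
```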


✅ Summary

By applying minor source-level patches to the Hive 1.2 JDBC driver and shims, you can connect modern Spark clusters (Hadoop 3.x) to legacy Hive instances without major rewrites.
This approach is especially useful for data migration, legacy audits, or cross-version compatibility testing.


