# Connecting Spark 2.4+ to Hive 1.2 via JDBC

## 🧩 Scenario

Apache Spark remains the de facto framework for large-scale data processing. Since its release, Spark has evolved rapidly, and so has Apache Hive. One persistent challenge with Hive is the *lack of backward compatibility* across its APIs.

If your production environment uses the latest Hive + Hadoop setup and you need to connect Spark to an older **Hive 1.2** instance, you’ll soon hit compatibility walls. So, how can we make it work?

---

## ☕ JDBC to the Rescue

Incompatibility between **Spark 2.4+** and **Hive < 2.0** arises primarily from protocol and packaging changes on the Hive side.

* **Hive < 2.0** uses `hive-service.jar` to implement the Thrift protocol.
* **Hive ≥ 2.0** moved the Thrift definitions to `hive-service-rpc.jar`.

These architectural differences break backward compatibility, making direct integration with legacy Hive difficult.

> *Backward compatibility in software is where ancient relics and cutting-edge tech awkwardly high-five each other.*

Fortunately, Hive supports multiple connection protocols, including **JDBC**, which lets Java code (and by extension, Spark) talk to HiveServer2 for query execution and table operations.
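
To make the JDBC path concrete, here is a minimal plain-Java sketch of a HiveServer2 client. The class and method names (`HiveJdbcSketch`, `buildUrl`, `showTables`) are hypothetical, the empty password assumes an unsecured test cluster, and the code only works once the Hive JDBC driver and its dependencies are on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustrative plain-JDBC client for HiveServer2 (hypothetical helper, not part of Hive). */
final class HiveJdbcSketch {

    // Driver class shipped in hive-jdbc.
    static final String DRIVER = "org.apache.hive.jdbc.HiveDriver";

    /** Builds a HiveServer2 JDBC URL such as jdbc:hive2://host:10000/default. */
    static String buildUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    /** Runs SHOW TABLES and prints the first column of each row. */
    static void showTables(String url, String user)
            throws SQLException, ClassNotFoundException {
        Class.forName(DRIVER); // fails fast if hive-jdbc is missing from the classpath
        // Assumption: no password is required (unsecured cluster).
        try (Connection conn = DriverManager.getConnection(url, user, "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

A call such as `HiveJdbcSketch.showTables(HiveJdbcSketch.buildUrl("hs2.example.com", 10000, "default"), "hive")` would list the tables in the `default` database; the host name and user here are placeholders.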

---

## 🧱 Required JARs for Hive 1.2 JDBC

To connect Spark 2.4+ to Hive 1.2 via JDBC, you’ll need the following JARs in your Spark classpath:

| **JAR** | **Version** | **Purpose** |
| --- | --- | --- |
| `hive-jdbc-1.2.1.jar` | 1.2.1 | Contains the driver class `org.apache.hive.jdbc.HiveDriver` |
| `hive-shims-common-1.2.1.jar`, `hive-shims-0.23-1.2.1.jar` | 1.2.1 | Provides compatibility shims and Kerberos auth support |
| `libthrift-0.9.3.jar` | 0.9.3 | Implements Thrift communication protocol |
| `hive-service-1.2.1.jar` | 1.2.1 | Contains Thrift service definitions for Hive < 2.0 |
| `hive-serde-1.2.1.jar` | 1.2.1 | Supports serialization/deserialization logic |
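
Before debugging anything else, it can help to confirm that each JAR from the table is actually visible to the JVM. The sketch below tries to load one representative class per JAR. The helper `HiveClasspathCheck` is hypothetical, and the class names in `EXPECTED` are assumptions based on the usual Hive 1.2 package layout, so adjust them if your build differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reports which expected classes cannot be loaded (hypothetical helper). */
final class HiveClasspathCheck {

    // One representative class per JAR in the table above (assumed names).
    static final List<String> EXPECTED = Arrays.asList(
        "org.apache.hive.jdbc.HiveDriver",                // hive-jdbc
        "org.apache.hadoop.hive.shims.ShimLoader",        // hive-shims-common
        "org.apache.thrift.TException",                   // libthrift
        "org.apache.hive.service.cli.thrift.TCLIService", // hive-service
        "org.apache.hadoop.hive.serde2.SerDeException");  // hive-serde

    /** Returns the subset of classNames that the current class loader cannot find. */
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<>();
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : classNames) {
            try {
                Class.forName(name, false, loader); // load without running static initializers
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }
}
```

Calling `HiveClasspathCheck.missing(HiveClasspathCheck.EXPECTED)` returns an empty list when every JAR is in place; any name it returns points at the JAR to check.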

---

## 🛠️ Required Modifications for Spark 2.4 Compatibility

Using Hive 1.2 JDBC drivers directly with Spark 2.4.8 (or newer) can lead to runtime errors. Below are the two common issues and their workarounds.

---

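### ❌ Error 1: *Unrecognized Hadoop Major Version 3*

**Error message:**

```plaintext
java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.x
```

**Root cause:**  
In `ShimLoader.java` (packaged in `hive-shims-common.jar`), the method `getMajorVersion()` only recognizes Hadoop major versions 1 and 2, so Hadoop 3.x falls through to the `default` branch and throws.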

```java
// Original code
switch (Integer.parseInt(parts[0])) {
  case 1:
    return HADOOP20SVERSIONNAME;
  case 2:
    return HADOOP23VERSIONNAME;
  default:
    throw new IllegalArgumentException("Unrecognized Hadoop major version number: " + vers);
}
```

**Fix:**  
Modify the `default` block to return a valid version instead of throwing an exception:

```java
default:
    return HADOOP23VERSIONNAME; // Cheat: treat 3.x like Hadoop 2.x (the 0.23 shims)
```

Then recompile the JAR:

```bash
mvn clean install   # the install phase already includes package
```

This quick patch allows Hive 1.2 shims to function under Hadoop 3.x.
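
To check the effect of the patch without a full Hive build, the version-mapping logic can be reproduced in isolation. The sketch below is a standalone approximation, not the actual `ShimLoader` class: the constant values (`"0.20S"`, `"0.23"`) and the `A.B.*` format check are assumptions modeled on the Hive 1.2 source, and the version string is passed in as a parameter instead of being read from Hadoop.

```java
/** Standalone approximation of the patched version mapping (not the real ShimLoader). */
final class ShimVersionSketch {

    // Assumed values of the shim name constants in Hive 1.2.
    static final String HADOOP20SVERSIONNAME = "0.20S";
    static final String HADOOP23VERSIONNAME = "0.23";

    /** Maps a Hadoop version string such as "3.1.2" to a shim name. */
    static String getMajorVersion(String vers) {
        String[] parts = vers.split("\\.");
        if (parts.length < 2) {
            // Version strings without a dot (for example "Unknown") are rejected.
            throw new RuntimeException(
                "Illegal Hadoop Version: " + vers + " (expected A.B.* format)");
        }
        switch (Integer.parseInt(parts[0])) {
            case 1:
                return HADOOP20SVERSIONNAME;
            case 2:
                return HADOOP23VERSIONNAME;
            default:
                // Patched branch: any other major version reuses the 0.23 shims.
                return HADOOP23VERSIONNAME;
        }
    }
}
```

With this mapping, `ShimVersionSketch.getMajorVersion("3.1.2")` returns the same shim name as `"2.7.3"`. The trade-off is that future major versions are silently accepted as well, which is acceptable for a client-side JDBC driver but worth keeping in mind.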

---

### ❌ Error 2: *Unsupported Method HiveStatement.setQueryTimeout()*

**Error message:**

```plaintext
java.sql.SQLException: Method not supported
```

**Root cause:**  
In `HiveStatement.java`, `setQueryTimeout` is not implemented and always throws. Spark 2.4+ calls it on the statements its JDBC data source creates, so reads fail before any query runs.

```java
public void setQueryTimeout(int seconds) throws SQLException {
    throw new SQLException("Method not supported");
}
```

**Fix:**  
Comment out or remove the `throw` statement:

```java
public void setQueryTimeout(int seconds) throws SQLException {
    // No-op to maintain compatibility
}
```

Rebuild the JAR to produce your patched version. Once both fixes are applied, the Hive 1.2 driver should work with Spark 2.4+.
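
A quick way to confirm that the rebuilt driver is the one being loaded is to probe a statement directly. The helper below is a hypothetical sketch: it calls `setQueryTimeout(0)` (zero means no limit in JDBC) and reports whether the driver accepted the call.

```java
import java.sql.SQLException;
import java.sql.Statement;

/** Probes whether a JDBC statement tolerates setQueryTimeout (hypothetical helper). */
final class QueryTimeoutProbe {

    /** Returns true if setQueryTimeout(0) succeeds, false if the driver throws. */
    static boolean acceptsQueryTimeout(Statement stmt) {
        try {
            stmt.setQueryTimeout(0); // 0 = no timeout
            return true;
        } catch (SQLException e) {
            return false; // unpatched Hive 1.2 driver: "Method not supported"
        }
    }
}
```

Passing a statement obtained from `conn.createStatement()` on a Hive 1.2 connection should return `true` with the patched JAR and `false` with the stock one.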

---

## 🚀 Launching Spark with Hive 1.2 JDBC

After preparing all the patched JARs (e.g., in a directory named `hive1.2jars`), you can start `spark-shell` as follows:

```bash
cd hive1.2jars

spark-shell \
  --master yarn \
  --jars hive-jdbc.jar,hive-shims-common.jar,hive-shims-0.23.0.jar,libthrift.jar,hive-service.jar,hive-serde.jar \
  --conf spark.driver.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.executor.extraClassPath=hive-jdbc.jar:hive-shims-common.jar:hive-shims-0.23.0.jar:libthrift.jar:hive-service.jar:hive-serde.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
```
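
Note that `--jars` takes a comma-separated list while the two `extraClassPath` settings take a colon-separated classpath, which is easy to get wrong when editing the command by hand. As a rough sketch, the argument list can be generated from a single list of JAR names. The class `SparkJarArgs` and its methods are hypothetical, and the colon separator assumes Linux hosts.

```java
import java.util.Arrays;
import java.util.List;

/** Builds spark-shell arguments from one list of JAR names (hypothetical helper). */
final class SparkJarArgs {

    /** --jars expects a comma-separated list. */
    static String jarsOption(List<String> jars) {
        return String.join(",", jars);
    }

    /** extraClassPath is a regular classpath; ":" assumes Linux hosts. */
    static String classPath(List<String> jars) {
        return String.join(":", jars);
    }

    /** Assembles the same flags as the spark-shell command above. */
    static List<String> sparkShellArgs(List<String> jars) {
        String cp = classPath(jars);
        return Arrays.asList(
            "spark-shell",
            "--master", "yarn",
            "--jars", jarsOption(jars),
            "--conf", "spark.driver.extraClassPath=" + cp,
            "--conf", "spark.executor.extraClassPath=" + cp,
            "--conf", "spark.driver.userClassPathFirst=true",
            "--conf", "spark.executor.userClassPathFirst=true");
    }
}
```

For example, `SparkJarArgs.sparkShellArgs(Arrays.asList("hive-jdbc.jar", "libthrift.jar"))` yields a list that can be handed to `ProcessBuilder` or joined with spaces for a shell script.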

---

## 🔐 Kerberos & JAAS Configuration

When connecting to HiveServer2 in a secure (Kerberized) environment, the Hive JDBC driver may fail to automatically obtain service tokens.  
You’ll need to manually configure JAAS for authentication:

```bash
--files ./jaas.conf \
--conf spark.executor.extraJavaOptions="-Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=./jaas.conf" \
--conf spark.security.credentials.hiveserver2.enabled=true
```

Ensure that `jaas.conf` defines the correct `serviceName="hiveserver2"` entry for your environment.
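
When authentication still fails, a first step is to verify that the two JVM options above actually reached the process. The sketch below is a hypothetical diagnostic (`JaasSettingsCheck` is not part of Spark or Hive) that inspects the corresponding system properties and checks that the JAAS file exists in the working directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Checks the JAAS-related JVM settings used above (hypothetical diagnostic). */
final class JaasSettingsCheck {

    /** Returns a list of human-readable problems; empty means both settings look right. */
    static List<String> problems() {
        List<String> problems = new ArrayList<>();

        String subjectOnly = System.getProperty("javax.security.auth.useSubjectCredsOnly");
        if (!"false".equals(subjectOnly)) {
            problems.add("javax.security.auth.useSubjectCredsOnly is not set to false");
        }

        String config = System.getProperty("java.security.auth.login.config");
        if (config == null) {
            problems.add("java.security.auth.login.config is not set");
        } else if (!new File(config).isFile()) {
            problems.add("JAAS file not found: " + config);
        }
        return problems;
    }
}
```

Running `JaasSettingsCheck.problems()` inside an executor task shows whether `--files` shipped `jaas.conf` and whether `spark.executor.extraJavaOptions` was applied there.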

---

## ✅ Summary

By applying minor source-level patches to the Hive 1.2 JDBC driver and shims, you can connect modern Spark clusters (Hadoop 3.x) to legacy Hive instances without major rewrites.  
This approach is especially useful for **data migration, legacy audits, or cross-version compatibility testing**.
