Implementing Kerberos authentication for Apache Spark jobs on Amazon EMR on EKS to access a Kerberos-enabled Hive Metastore


Many organizations run their Apache Spark analytics platforms on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), using Kerberos authentication to secure connectivity between Spark jobs and a centralized shared Apache Hive Metastore (HMS). With Amazon EMR on Amazon EKS, they gained a new option for running Spark jobs with the benefits of Kubernetes-based container orchestration, improved resource utilization, and faster job startup times. However, an HMS deployment supports only one authentication mechanism at a time. This means that they must configure Kerberos authentication for their Spark jobs on Amazon EMR on EKS to connect to the existing Kerberos-enabled HMS.

In this post, we show how to configure Kerberos authentication for Spark jobs on Amazon EMR on EKS, authenticating against a Kerberos-enabled HMS so you can run both Amazon EMR on EC2 and Amazon EMR on EKS workloads against a single, secure HMS deployment.

Overview of solution

Consider an enterprise data platform team that’s been running Spark jobs on Amazon EMR on EC2 for several years. Their architecture includes a Kerberos-enabled standalone HMS that serves as the centralized data catalog, with Microsoft Active Directory functioning as the Key Distribution Center (KDC). As the team evaluates Amazon EMR on EKS for new workloads, their existing HMS must continue serving Amazon EMR on EC2, with both authenticating through the same Kerberos infrastructure. To address this, the platform team must configure their Spark jobs running on Amazon EMR on EKS to authenticate with the same KDC, so that the jobs can obtain valid Kerberos tickets and establish authenticated connections to the HMS while maintaining a unified security posture across the data platform.

Scope of Kerberos in this solution

Kerberos authentication in this solution secures the connection between Spark jobs and the HMS. Other components in the architecture use AWS and Kubernetes security mechanisms instead.

Solution architecture

Our solution implements Kerberos authentication to secure the connection between Spark jobs and the HMS. The architecture spans two Amazon Virtual Private Clouds (Amazon VPCs) connected using VPC peering, with distinct components handling identity management, compute, and metadata services.

Identity and Authentication layer

A self-managed Microsoft Active Directory Domain Controller is deployed in a dedicated VPC and serves as the KDC for Kerberos authentication. The Active Directory server hosts service principals for both the HMS service and Spark jobs. This separate VPC deployment mirrors real-world enterprise architectures where Active Directory is typically managed by identity teams in their own network boundary, whether on-premises or in AWS.

Data Platform layer

The data platform components reside in a separate VPC and include an EKS cluster that hosts both the HMS service and the Amazon EMR on EKS based Spark jobs, which persist data in an Amazon Simple Storage Service (Amazon S3) bucket.

Hive Metastore service

The HMS is deployed in the EKS hive-metastore namespace and simulates a pre-existing, standalone Kerberos-enabled HMS, a common enterprise pattern where HMS is managed independently of any data processing platform. You can learn more about other enterprise design patterns in the post Design patterns for implementing Hive Metastore for Amazon EMR on EKS. The HMS service authenticates with the KDC using its service principal and keytab mounted from a Kubernetes secret.

Apache Spark Execution layer

Apache Spark jobs are deployed using the Spark Operator on EKS. The Spark driver and executor pods are configured with Kerberos credentials through mounted ConfigMaps containing krb5.conf and jaas.conf, together with keytab files from Kubernetes secrets. When a Spark job must access Hive tables, the driver authenticates with the KDC and establishes a secure Simple Authentication and Security Layer (SASL) connection to the HMS.

Authentication flow

The HMS runs as a long-running Kubernetes service that must be deployed and authenticated before Spark jobs can connect.

During HMS deployment:

  1. The HMS pod validates its Kerberos configuration. krb5.conf and jaas.conf are mounted from ConfigMaps
  2. The service authenticates with the KDC using its principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
  3. The keytab is mounted from a Kubernetes secret for credential access
  4. A secure Thrift endpoint is established on port 9083 with SASL authentication enabled
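The HMS principal above follows the usual primary/instance@REALM shape, with the instance set to the Kubernetes Service DNS name. As a rough illustration of that mapping, the following sketch builds and splits such principals. The function names are hypothetical and not part of the sample repository, and the sketch assumes the default cluster domain cluster.local.

```shell
#!/usr/bin/env bash
# Sketch: derive a Kerberos service principal from a Kubernetes Service name.
# Function and argument names are illustrative, not from the sample repository.

# build_service_principal <primary> <k8s-service> <k8s-namespace> <realm>
build_service_principal() {
  local primary="$1" service="$2" namespace="$3" realm="$4"
  # Kubernetes Services resolve as <service>.<namespace>.svc.cluster.local
  # (assuming the default cluster domain).
  printf '%s/%s.%s.svc.cluster.local@%s\n' "$primary" "$service" "$namespace" "$realm"
}

# split_principal <principal> : prints primary, instance, and realm on separate lines
split_principal() {
  local principal="$1"
  local name="${principal%@*}"        # everything before the realm
  printf 'primary=%s\n'  "${name%%/*}"
  printf 'instance=%s\n' "${name#*/}"
  printf 'realm=%s\n'    "${principal##*@}"
}

build_service_principal hive hive-metastore-svc hive-metastore CORP.KERBEROS
split_principal "spark/analytics-team@CORP.KERBEROS"
```

Clients request a service ticket for exactly this name, so the instance part has to match the host name that Spark uses to reach the metastore.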

When a Spark job must interact with the HMS:

  1. Spark job submission:
    1. User submits the Spark job through the Spark Operator
    2. Driver and executor pods are created with Kerberos configuration mounted as volumes
    3. The krb5.conf ConfigMap provides KDC connection details, including realm and server addresses
    4. The jaas.conf ConfigMap specifies a login module configuration with keytab path and principal
    5. The keytab secret contains encrypted credentials for the Spark service principal spark/analytics-team@CORP.KERBEROS
  2. Authentication and connection:
    1. The Spark driver authenticates with the KDC using its principal and keytab to obtain a Ticket Granting Ticket (TGT)
    2. When connecting to HMS, Spark requests a service ticket from the KDC for the HMS principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
    3. The KDC issues a service ticket encrypted with the HMS’s secret key
    4. Spark presents this service ticket to HMS over the Thrift connection on port 9083
    5. HMS decrypts the ticket using its keytab, verifies Spark’s identity, and establishes the authenticated SASL session
    6. Executor pods use the same configuration for authenticated operations
  3. Data access:
    1. The authenticated Spark job queries HMS for table metadata
    2. HMS validates Kerberos tickets before serving metadata requests
    3. Spark accesses the underlying data in Amazon S3 using IAM Roles for Service Accounts (IRSA)
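When troubleshooting, it can help to replay the ticket exchange by hand. The following is a minimal sketch that assumes the MIT Kerberos client tools (kinit, kvno, klist) are available in a pod or host that already has krb5.conf and the keytab in place. The function name is hypothetical, and Spark itself performs these steps through the JVM login module rather than these commands.

```shell
#!/usr/bin/env bash
# Sketch: replay the ticket flow by hand with the MIT Kerberos client tools.
# The function name and the example paths are illustrative assumptions.

verify_kerberos_flow() {
  local keytab="$1" client_principal="$2" service_principal="$3"

  # Step 1: obtain a TGT from the KDC using the keytab (no password prompt).
  kinit -k -t "$keytab" "$client_principal" \
    || { echo "TGT request failed" >&2; return 1; }

  # Step 2: request a service ticket for the HMS principal.
  kvno "$service_principal" \
    || { echo "service ticket request failed" >&2; return 2; }

  # Step 3: list cached tickets; the TGT and the HMS ticket should both appear.
  klist
}

# Example invocation (commented out because it needs a reachable KDC):
# verify_kerberos_flow /etc/security/keytab/analytics-team.keytab \
#   "spark/analytics-team@CORP.KERBEROS" \
#   "hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS"
```

A failure at step 1 usually points at the keytab, principal, or KDC connectivity, while a failure at step 2 suggests that the HMS service principal is not registered as expected.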

Sequence diagram illustrating the Kerberos authentication flow between a Spark job and the Hive Metastore. The flow proceeds in five phases: (1) Job Submission — a Data Engineer submits a SparkApplication via kubectl, and the Spark Operator creates a driver pod with krb5.conf, jaas.conf, and keytab mounted. (2) Kerberos Authentication — the Spark driver loads its keytab for the spark/analytics-team@CORP.KERBEROS principal and sends an AS-REQ to the Active Directory KDC, which validates the credentials and returns a TGT (Ticket Granting Ticket). (3) Service Ticket Request — the Spark driver sends a TGS-REQ to the KDC requesting a service ticket for the Hive Metastore principal, and the KDC returns a service ticket encrypted with the HMS key. (4) Authenticated Connection — the Spark driver connects to the Hive Metastore over Thrift (port 9083) using SASL with the service ticket; HMS decrypts the ticket using its own keytab, verifies the Spark identity, and establishes an authenticated session. (5) Data Operations — the Spark driver queries table metadata from HMS (backed by Aurora PostgreSQL) and reads/writes table data directly from Amazon S3 using IRSA credentials.

Implementation workflow

The implementation involves three key stakeholders working together to establish the Kerberos-enabled communication:

Microsoft Active Directory Administrator

The Active Directory Administrator creates the service accounts that are used for HMS and Spark jobs. This involves setting up the service principal names using the setspn utility and generating keytab files using ktpass for secure credential storage. The administrator configures the appropriate Active Directory permissions and the Kerberos AES256 encryption type. Finally, the keytab files are uploaded to AWS Secrets Manager for secure distribution to Kubernetes workloads.

Data Platform Team

The platform team handles the Amazon EMR on EKS and Kubernetes configurations. They retrieve keytabs from Secrets Manager and create Kubernetes secrets for the workloads. They configure Helm charts for the HMS deployment with Kerberos settings and set up ConfigMaps for krb5.conf, jaas.conf, and core-site.xml.
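The hand-off from Secrets Manager to a Kubernetes secret could look roughly like the following sketch. It assumes the keytab was stored as a binary secret. The secret ID, the Kubernetes secret name, and the function name are hypothetical, and the scripts in the sample repository may do this differently.

```shell
#!/usr/bin/env bash
# Sketch: copy a keytab from AWS Secrets Manager into a Kubernetes Secret.
# The secret ID, Secret name, and namespace passed in are placeholders.

# sync_keytab_secret <secret-id> <k8s-secret> <namespace> <key-name>
sync_keytab_secret() {
  local secret_id="$1" k8s_secret="$2" namespace="$3" key_name="$4"
  local workdir encoded rc=0
  workdir="$(mktemp -d)" || return 1

  # Secrets Manager returns binary secrets base64-encoded in SecretBinary.
  encoded="$(aws secretsmanager get-secret-value \
    --secret-id "$secret_id" \
    --query SecretBinary --output text)" || { rm -rf "$workdir"; return 1; }

  # Decode into a temporary file that only exists for the duration of the call.
  printf '%s' "$encoded" | base64 --decode > "$workdir/$key_name" \
    || { rm -rf "$workdir"; return 1; }

  # Create the Secret that the HMS or Spark pods mount as a volume.
  kubectl create secret generic "$k8s_secret" \
    --namespace "$namespace" \
    --from-file="$key_name=$workdir/$key_name" || rc=$?

  rm -rf "$workdir"
  return "$rc"
}

# Example (hypothetical names):
# sync_keytab_secret kerberos/hive-keytab hive-keytab hive-metastore hive.keytab
```

Removing the temporary directory on every exit path keeps the decoded keytab off the local disk once the secret has been created.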

Data Engineering Operations

Data engineers submit jobs using the configured service account with Kerberos authentication. They monitor job execution and verify authenticated access to HMS.

Deploy the solution

In the remainder of this post, you’ll explore the implementation details for this solution. You can find the sample code in the AWS Samples GitHub repository. For more details, including verification steps for each deployment stage, refer to the README in the repository.

Prerequisites

Before you deploy this solution, make sure that the following prerequisites are in place:

Clone the repository and set up environment variables

Clone the repository to your local machine and set the two environment variables. Set AWS_REGION to the AWS Region where you want to deploy these resources.

# Clone the Git repository
git clone https://github.com/aws-samples/sample-emr-eks-spark-kerberos-hms.git
cd sample-emr-eks-spark-kerberos-hms

# Set environment variables
export REPO_DIR=$(pwd)
export AWS_REGION=

Set up Microsoft Active Directory infrastructure

In this section, we deploy a self-managed Microsoft Active Directory with KDC on a Windows Server EC2 instance in a dedicated VPC. This is an intentionally minimal implementation that highlights only the key components required for this blog post.

cd ${REPO_DIR}/microsoft-ad
./setup.sh

Set up EKS infrastructure

This section provisions the Amazon EMR on EKS infrastructure stack, including the VPC, EKS cluster, Amazon Aurora PostgreSQL database, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, Amazon EMR on EKS virtual clusters, and the Spark Operator. Run the following script:

cd ${REPO_DIR}/data-infra
./setup.sh

Set up VPC peering

This section establishes network connectivity between the Active Directory VPC and the EKS VPC for Kerberos authentication. Run the following script:

cd ${REPO_DIR}/vpc-peering
./setup.sh
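After peering is in place, a quick reachability check against the KDC can save debugging time later. Kerberos uses port 88, and the krb5.conf used in this post sets udp_preference_limit = 1, so clients talk to the KDC over TCP. The following sketch relies on bash's built-in /dev/tcp redirection and the coreutils timeout command. The function name and the example address are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a TCP port on the KDC is reachable across the VPC peering.
# The function name and the example IP address are illustrative placeholders.

# port_open <host> <port> [wait-seconds] : succeeds when a TCP connection opens
port_open() {
  local host="$1" port="$2" wait_seconds="${3:-3}"
  # The child shell opens the connection on fd 3 and closes it when it exits.
  timeout "$wait_seconds" \
    bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example, run from a host or pod inside the EKS VPC:
# if port_open 10.0.0.10 88; then
#   echo "KDC reachable on TCP 88"
# else
#   echo "KDC not reachable; check peering routes and security groups"
# fi
```

A failed check points to routing or security group rules rather than to the Kerberos configuration itself.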

Deploy Hive Metastore with Kerberos authentication

This section deploys a Kerberos-enabled HMS service on the EKS cluster. Complete the following steps:

  1. Create the Kerberos service principal for the HMS service
cd ${REPO_DIR}/microsoft-ad/
# Create HMS service principal
./manage-ad-service-principals.sh create hive "hive/hive-metastore-svc.hive-metastore.svc.cluster.local"
# Verify the service principal was created
./manage-ad-service-principals.sh list

  2. Deploy the HMS service with Kerberos authentication
cd ${REPO_DIR}/hive-metastore
./deploy.sh

Set up Amazon EMR on Amazon EKS with Kerberos authentication

This section configures Spark jobs to authenticate with the Kerberos-enabled HMS. This involves creating service principals for Spark jobs and generating the required configuration files. Complete the following steps:

  1. Create the service principal for Spark jobs
cd ${REPO_DIR}/microsoft-ad/
# Create Spark service principal
./manage-ad-service-principals.sh create spark "spark/analytics-team"
# Verify the service principal was created
./manage-ad-service-principals.sh list

  2. Generate Kerberos configurations for Spark jobs
cd ${REPO_DIR}/spark-jobs/
./generate-spark-configs.sh --principal "spark/analytics-team@CORP.KERBEROS" --namespace emr

Submit Spark jobs

This section verifies Kerberos authentication by running a Spark job that connects to the Kerberized HMS. Complete the following steps:

  1. Submit the test Spark job
cd ${REPO_DIR}/spark-jobs
kubectl apply -f spark-job.yaml

  2. Monitor job execution
# Watch the SparkApplication status
kubectl get sparkapplications -n emr -w
# Check pod status
kubectl get pods -n emr | grep "spark-kerberos"

  3. Verify Kerberos authentication and HMS connection
# Check Spark driver logs for successful authentication
kubectl logs spark-kerberos-job-driver -n emr

The logs should confirm successful authentication, including a listing of sample databases and tables.
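For scripted runs, for example in a CI pipeline, watching interactively is less convenient than polling. The following sketch polls the state that the Spark Operator records on the SparkApplication resource. It assumes the application is named spark-kerberos-job (inferred from the driver pod name above) and that the operator reports state in .status.applicationState.state. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll a SparkApplication until it reaches a terminal state.
# The application name used in the example is an assumption.

# wait_for_spark_app <name> <namespace> [attempts] [interval-seconds]
wait_for_spark_app() {
  local app="$1" namespace="$2" attempts="${3:-60}" interval="${4:-10}"
  local state="" i
  for ((i = 0; i < attempts; i++)); do
    # Read the state recorded by the Spark Operator; empty until it is set.
    state="$(kubectl get sparkapplication "$app" -n "$namespace" \
      -o jsonpath='{.status.applicationState.state}' 2>/dev/null)" || true
    case "$state" in
      COMPLETED)
        echo "$app completed"; return 0 ;;
      FAILED|SUBMISSION_FAILED)
        echo "$app ended in state $state" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
  echo "timed out waiting for $app (last state: ${state:-unknown})" >&2
  return 2
}

# Example:
# wait_for_spark_app spark-kerberos-job emr && \
#   kubectl logs spark-kerberos-job-driver -n emr
```

The distinct return codes let a calling script tell a failed job (1) apart from one that is still running when the attempts run out (2).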

Understanding Kerberos configuration

The HMS requires specific configuration parameters to enable Kerberos authentication, which are applied through the previously mentioned steps. The key configurations are outlined in the following sections.

HMS configuration (metastore-site.xml)

The following configurations are added to the metastore-site.xml file.

| Setting | Value | Purpose |
| --- | --- | --- |
| hive.metastore.sasl.enabled | true | Enable SASL authentication |
| hive.metastore.kerberos.principal | hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS | HMS service principal |
| hive.metastore.kerberos.keytab.file | /etc/security/keytab/hive.keytab | Keytab path |
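In the XML file, each of these settings becomes a property element with a name and a value. The following sketch renders the three settings in that form. The function names are hypothetical, the values are not XML-escaped, and in a real deployment these properties are merged into the existing metastore-site.xml rather than written as a standalone file.

```shell
#!/usr/bin/env bash
# Sketch: render the Kerberos-related HMS settings as Hadoop-style XML.
# Function names are illustrative; values are written as-is (no XML escaping).

# hadoop_property <name> <value> : prints one <property> element
hadoop_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' \
    "$1" "$2"
}

# render_metastore_kerberos_site <principal> <keytab-path>
render_metastore_kerberos_site() {
  local principal="$1" keytab="$2"
  echo '<configuration>'
  hadoop_property hive.metastore.sasl.enabled true
  hadoop_property hive.metastore.kerberos.principal "$principal"
  hadoop_property hive.metastore.kerberos.keytab.file "$keytab"
  echo '</configuration>'
}

render_metastore_kerberos_site \
  "hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS" \
  /etc/security/keytab/hive.keytab
```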

Hadoop security (core-site.xml)

The following configurations are added to the core-site.xml file.

| Setting | Value |
| --- | --- |
| hadoop.security.authentication | kerberos |
| hadoop.security.authorization | true |

Spark configuration

| Setting | Value | Purpose |
| --- | --- | --- |
| spark.security.credentials.kerberos.enabled | true | Enable Kerberos for Spark |
| spark.hadoop.hive.metastore.sasl.enabled | true | SASL for HMS connection |
| spark.kerberos.principal | spark/analytics-team@CORP.KERBEROS | Spark service principal |
| spark.kerberos.keytab | local:///etc/security/keytab/analytics-team.keytab | Keytab path |
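With the Spark Operator, settings like these are typically passed through the sparkConf map of the SparkApplication spec. The following sketch prints such a fragment for a given principal and keytab. Whether spark-job.yaml in the sample repository is laid out this way is an assumption, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a sparkConf fragment for a SparkApplication manifest.
# The function name is illustrative; the keys are the settings listed above.

# render_spark_conf <principal> <keytab-uri>
render_spark_conf() {
  local principal="$1" keytab="$2"
  cat <<EOF
sparkConf:
  "spark.security.credentials.kerberos.enabled": "true"
  "spark.hadoop.hive.metastore.sasl.enabled": "true"
  "spark.kerberos.principal": "${principal}"
  "spark.kerberos.keytab": "${keytab}"
EOF
}

render_spark_conf \
  "spark/analytics-team@CORP.KERBEROS" \
  "local:///etc/security/keytab/analytics-team.keytab"
```

Quoting the values keeps YAML from interpreting true as a boolean, because sparkConf expects string values.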

Shared Kerberos files

Both HMS and Spark pods mount two common Kerberos configuration files, krb5.conf and jaas.conf, using ConfigMaps and Kubernetes secrets. The krb5.conf file is identical across both services and defines how each component connects to the KDC. The jaas.conf file follows the same structure but differs in the principal and keytab path for each service.

  1. krb5 Configuration
[libdefaults]
	default_realm = CORP.KERBEROS
	dns_lookup_realm = false
	dns_lookup_kdc = false
	ticket_lifetime = 24h
	forwardable = true
	udp_preference_limit = 1
	default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96

[realms]
	CORP.KERBEROS = {
		kdc = 
		admin_server = 
	}

[domain_realm]
	.corp.kerberos = CORP.KERBEROS
	corp.kerberos = CORP.KERBEROS

For more information, see the online documentation for krb5.conf.

  2. JAAS configuration
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/etc/security/keytab/hive.keytab"
 principal="hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS"
 useTicketCache=false
 storeKey=true
 debug=false;
};
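Because jaas.conf differs between services only in the principal and keytab path, it lends itself to templating. The following sketch renders a Client section for any service and checks that the principal's realm agrees with default_realm in krb5.conf. The function names are hypothetical, and the generate-spark-configs.sh script in the sample repository may work differently.

```shell
#!/usr/bin/env bash
# Sketch: render a per-service jaas.conf and check it against krb5.conf.
# Function names are illustrative and not taken from the sample repository.

# render_jaas <principal> <keytab-path> : prints a Client login section
render_jaas() {
  local principal="$1" keytab="$2"
  cat <<EOF
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="${keytab}"
 principal="${principal}"
 useTicketCache=false
 storeKey=true
 debug=false;
};
EOF
}

# krb5_default_realm <krb5.conf> : prints default_realm from [libdefaults]
krb5_default_realm() {
  awk -F'=' '
    /^\[/ { in_libdefaults = ($0 ~ /^\[libdefaults\]/) }
    in_libdefaults && $1 ~ /^[ \t]*default_realm[ \t]*$/ {
      gsub(/[ \t]/, "", $2); print $2; exit
    }
  ' "$1"
}

# realm_matches <krb5.conf> <principal> : succeeds when the realms agree
realm_matches() {
  [ "$(krb5_default_realm "$1")" = "${2##*@}" ]
}

# Example:
# render_jaas "spark/analytics-team@CORP.KERBEROS" \
#   /etc/security/keytab/analytics-team.keytab > jaas.conf
# realm_matches krb5.conf "spark/analytics-team@CORP.KERBEROS" || echo "realm mismatch"
```

A realm mismatch between the two files is an easy mistake to make when copying configuration between environments, so checking it before creating the ConfigMaps is cheap insurance.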

Additional security considerations

This post focuses on the core Kerberos authentication mechanics between Spark and HMS. We recommend two additional security hardening steps based on your organization’s security posture and compliance requirements.

Protecting Keytabs at Rest with AWS KMS Envelope Encryption

Keytabs stored as Kubernetes Secrets are only base64-encoded by default, not encrypted at rest. We recommend enabling EKS envelope encryption using an AWS Key Management Service (AWS KMS) customer managed key. With envelope encryption, secret data is encrypted with a data encryption key (DEK), which is in turn encrypted by your customer managed key. This protects keytab content even if the etcd datastore is compromised. To enable this on an existing EKS cluster:

aws eks associate-encryption-config \
  --cluster-name  \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:::key/"}}]'

Refer to the Amazon EKS documentation on envelope encryption for full setup guidance.
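To confirm whether a cluster already has envelope encryption configured, you can inspect the cluster description. The following sketch wraps aws eks describe-cluster and prints the KMS key ARN if one is associated. The function name and the cluster name in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report the KMS key used for secrets encryption on an EKS cluster.
# The function name and the example cluster name are illustrative.

# secrets_encryption_key <cluster-name> : prints the key ARN or fails
secrets_encryption_key() {
  local cluster="$1" key_arn
  key_arn="$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.encryptionConfig[0].provider.keyArn' \
    --output text)" || return 1

  # With --output text, a missing value is printed as the literal "None".
  if [ -z "$key_arn" ] || [ "$key_arn" = "None" ]; then
    echo "no envelope encryption configured for $cluster" >&2
    return 1
  fi
  echo "$key_arn"
}

# Example:
# secrets_encryption_key my-eks-cluster
```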

Encrypting the Thrift Data Channel with TLS

SASL with Kerberos provides mutual authentication but doesn’t automatically encrypt data over the Thrift connection. Many deployments default to the auth quality of protection (QoP), leaving the data channel unencrypted. We recommend one of the following:

  • Set SASL QoP to auth-conf: enables SASL-layer encryption using Kerberos session keys
  • Layer TLS over Thrift (preferred): enables transport-level encryption using modern cipher suites

Enabling TLS on HiveServer2 / Hive Metastore Thrift:


<property>
  <name>hive.server2.use.SSL</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/etc/tls/keystore.jks</value>
</property>

Refer to the Hive SSL/TLS configuration documentation for full details.

Cleaning up

To avoid incurring future charges, clean up all resources provisioned during this setup by running the following cleanup script.

cd ${REPO_DIR}/
./cleanup.sh

Conclusion

In this post, we demonstrated how to implement Kerberos authentication for Amazon EMR on EKS to securely connect to a Kerberos-enabled HMS. This solution addresses a common challenge faced by organizations with existing Kerberos-enabled HMS deployments who want to adopt Amazon EMR on EKS while maintaining their Kerberos-enabled security posture.

This pattern applies whether you’re migrating from on-premises Hadoop, running hybrid Amazon EMR on EC2 and Amazon EMR on EKS environments, or building a new cloud-native platform. It fits any scenario where Spark jobs on Kubernetes must authenticate with a shared, Kerberos-enabled HMS.

You can use this post as a starting point to implement this pattern and extend it further to suit your organization’s data platform needs.


About the authors

Krishna Kumar Venkateswaran is a Cloud Infrastructure Architect at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.

Sunil Chakrapani Sundararaman is a DevOps Architect at Amazon Web Services (AWS), where he helps enterprise customers architect and implement Data and Machine Learning platforms in the AWS Cloud. He brings extensive experience in Data Platform engineering, MLOps, DevOps, and Kubernetes implementations. Sunil focuses on guiding organizations through their cloud transformation journey, specializing in building scalable and efficient solutions that drive business value.

Avinash Desireddy is a Specialist Solutions Architect (Containers) at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers and partners containerize applications, streamline deployments, and optimize cloud-native environments.

Suvojit Dasgupta is an Engineering Leader at Amazon Web Services (AWS). He leads engineering teams, guiding them in designing and implementing scalable, high-performance data platforms for AWS customers. With expertise spanning distributed systems, real-time and batch data architectures, and cloud-native infrastructure, he drives technical strategy and engineering excellence across teams. He’s passionate about raising the bar on engineering practices and solving large-scale problems at the intersection of data and business impact.
