Many organizations run their Apache Spark analytics platforms on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), using Kerberos authentication to secure connectivity between Spark jobs and a centralized shared Apache Hive Metastore (HMS). With Amazon EMR on Amazon EKS, they gained a new option for running Spark jobs with the benefits of Kubernetes-based container orchestration, improved resource utilization, and faster job startup times. However, an HMS deployment supports only one authentication mechanism at a time. This means that they must configure Kerberos authentication for their Spark jobs on Amazon EMR on EKS to connect to the existing Kerberos-enabled HMS.
In this post, we show how to configure Kerberos authentication for Spark jobs on Amazon EMR on EKS, authenticating against a Kerberos-enabled HMS so you can run both Amazon EMR on EC2 and Amazon EMR on EKS workloads against a single, secure HMS deployment.
Overview of solution
Consider an enterprise data platform team that’s been running Spark jobs on Amazon EMR on EC2 for several years. Their architecture includes a Kerberos-enabled standalone HMS that serves as the centralized data catalog, with Microsoft Active Directory functioning as the Key Distribution Center (KDC). As the team evaluates Amazon EMR on EKS for new workloads, their existing HMS must continue serving Amazon EMR on EC2, with both authenticating through the same Kerberos infrastructure. To address this, the platform team must configure their Spark jobs running on Amazon EMR on EKS to authenticate with the same KDC, so that the jobs can obtain valid Kerberos tickets and establish authenticated connections to the HMS while maintaining a unified security posture across the data platform.
Scope of Kerberos in this solution
Kerberos authentication in this solution secures the connection between Spark jobs and the HMS. Other components in the architecture use AWS and Kubernetes security mechanisms instead.
Solution architecture
Our solution implements Kerberos authentication to secure the connection between Spark jobs and the HMS. The architecture spans two Amazon Virtual Private Clouds (Amazon VPCs) connected using VPC peering, with distinct components handling identity management, compute, and metadata services.
Identity and Authentication layer
A self-managed Microsoft Active Directory Domain Controller is deployed in a dedicated VPC and serves as the KDC for Kerberos authentication. The Active Directory server hosts service principals for both the HMS service and Spark jobs. This separate VPC deployment mirrors real-world enterprise architectures where Active Directory is typically managed by identity teams in their own network boundary, whether on-premises or in AWS.
Data Platform layer
The data platform components reside in a separate VPC and include an EKS cluster that hosts both the HMS service and the Amazon EMR on EKS based Spark jobs, which persist data in an Amazon Simple Storage Service (Amazon S3) bucket.
Hive Metastore service
The HMS is deployed in the EKS hive-metastore namespace and simulates a pre-existing, standalone Kerberos-enabled HMS, a common enterprise pattern where HMS is managed independently of any data processing platform. You can learn more about other enterprise design patterns in the post Design patterns for implementing Hive Metastore for Amazon EMR on EKS. The HMS service authenticates with the KDC using its service principal and a keytab mounted from a Kubernetes secret.
Apache Spark Execution layer
Apache Spark jobs are deployed using the Spark Operator on EKS. The Spark driver and executor pods are configured with Kerberos credentials through mounted ConfigMaps containing krb5.conf and jaas.conf, along with keytab files from Kubernetes secrets. When a Spark job must access Hive tables, the driver authenticates with the KDC and establishes a secure Simple Authentication and Security Layer (SASL) connection to the HMS.
Authentication flow
The HMS runs as a long-running Kubernetes service that must be deployed and authenticated before Spark jobs can connect.
During HMS deployment:
- HMS pod validates its Kerberos configuration. krb5.conf and jaas.conf are mounted from ConfigMaps
- Service authenticates with KDC using its principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
- keytab is mounted from a Kubernetes secret for credential access
- Secure Thrift endpoint is established on port 9083 with SASL authentication enabled
When a Spark job must interact with the HMS:
- Spark job submission:
  - User submits Spark job through Spark Operator
  - Driver and executor pods are created with Kerberos configuration mounted as volumes
  - krb5.conf ConfigMap provides KDC connection details including realm and server addresses
  - jaas.conf ConfigMap specifies a login module configuration with keytab path and principal
  - Keytab secret contains encrypted credentials for the Spark service principal spark/analytics-team@CORP.KERBEROS
- Authentication and connection:
  - Spark driver authenticates with KDC using its principal and keytab to obtain a Ticket Granting Ticket (TGT)
  - When connecting to HMS, Spark requests a service ticket from the KDC for the HMS principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
  - KDC issues a service ticket encrypted with HMS’s secret key
  - Spark presents this service ticket to HMS over the Thrift connection on port 9083
  - HMS decrypts the ticket using its keytab, verifies Spark’s identity, and establishes the authenticated SASL session
  - Executor pods use the same configuration for authenticated operations
- Data access:
  - Authenticated Spark job queries HMS for table metadata
  - HMS validates Kerberos tickets before serving metadata requests
  - Spark accesses underlying data in Amazon S3 using IAM roles for service accounts (IRSA)
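The service ticket request in the flow above depends on the client naming the HMS principal exactly. The principal used in this post has the form primary/host@REALM, where the host part is the in-cluster DNS name of the HMS Kubernetes service. The following Python sketch shows how that name is composed and parsed. The helper names (KerberosPrincipal, service_principal, parse_principal) are hypothetical and are not part of the sample repository; the default cluster domain cluster.local is an assumption that matches the principal shown in this post.

```python
from typing import NamedTuple


class KerberosPrincipal(NamedTuple):
    primary: str   # service or user name, for example "hive"
    instance: str  # host part of a service principal
    realm: str     # Kerberos realm, for example "CORP.KERBEROS"

    def __str__(self) -> str:
        return f"{self.primary}/{self.instance}@{self.realm}"


def service_principal(primary: str, service: str, namespace: str, realm: str,
                      cluster_domain: str = "cluster.local") -> KerberosPrincipal:
    """Build primary/<service>.<namespace>.svc.<cluster_domain>@REALM."""
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return KerberosPrincipal(primary, host, realm)


def parse_principal(text: str) -> KerberosPrincipal:
    """Split 'primary/instance@REALM' into its three parts."""
    name, _, realm = text.rpartition("@")
    primary, _, instance = name.partition("/")
    if not (primary and instance and realm):
        raise ValueError(f"not a primary/instance@REALM principal: {text!r}")
    return KerberosPrincipal(primary, instance, realm)


# The HMS principal from this post, derived from service name and namespace.
hms = service_principal("hive", "hive-metastore-svc", "hive-metastore", "CORP.KERBEROS")
print(hms)
```

If the principal configured on the Spark side differs from the one registered in the KDC for the HMS service, the service ticket request cannot succeed, so deriving both from one helper like this is one way to keep them aligned.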
Implementation workflow
The implementation involves three key stakeholders working together to establish the Kerberos-enabled communication:
Microsoft Active Directory Administrator
The Active Directory Administrator creates the service accounts that are used for HMS and Spark jobs. This involves setting up the service principal names using the setspn utility and generating keytab files using ktpass for secure credential storage. The administrator configures the appropriate Active Directory permissions and the Kerberos AES256 encryption type. Finally, the keytab files are uploaded to AWS Secrets Manager for secure distribution to Kubernetes workloads.
Data Platform Team
The platform team handles the Amazon EMR on EKS and Kubernetes configurations. They retrieve keytabs from Secrets Manager and create Kubernetes secrets for the workloads. They configure Helm charts for HMS deployment with Kerberos settings and set up ConfigMaps for krb5.conf, jaas.conf, and core-site.xml.
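As a rough illustration of the keytab hand-off, the following Python sketch wraps keytab bytes in a Kubernetes Secret manifest. It is not the repository's script: the function name, the secret name hive-keytab, and the placeholder bytes are assumptions for illustration. In practice the bytes would come from AWS Secrets Manager (for example, the SecretBinary field returned by the boto3 get_secret_value call).

```python
import base64
import json


def keytab_secret_manifest(name: str, namespace: str, keytab_file: str,
                           keytab_bytes: bytes) -> dict:
    """Return a Kubernetes Secret manifest holding one keytab file."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under 'data' must be base64-encoded strings.
        "data": {keytab_file: base64.b64encode(keytab_bytes).decode("ascii")},
    }


# Placeholder bytes stand in for a keytab retrieved from AWS Secrets Manager.
manifest = keytab_secret_manifest(
    name="hive-keytab",          # hypothetical secret name
    namespace="hive-metastore",
    keytab_file="hive.keytab",
    keytab_bytes=b"\x05\x02placeholder",
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to kubectl apply -f - so that the keytab never has to be written to local disk.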
Data Engineering Operations
Data engineers submit jobs using the configured service account with Kerberos authentication. They monitor job execution and verify authenticated access to HMS.
Deploy the solution
In the remainder of this post, you’ll explore the implementation details for this solution. You can find the sample code in the AWS Samples GitHub repository. For more details, including verification steps for each deployment stage, refer to the README in the repository.
Prerequisites
Before you deploy this solution, make sure that the following prerequisites are in place:
- Access to a valid AWS account and permission to create AWS resources.
- The AWS Command Line Interface (AWS CLI) is installed on your local machine.
- Git, Docker, eksctl, kubectl, Helm, envsubst, jq, and yq utilities are installed on your local machine.
- Familiarity with Kerberos, Apache Hive Metastore (HMS), Apache Spark, Kubernetes, Amazon EKS, and Amazon EMR on Amazon EKS.
Clone the repository and set up environment variables
Clone the repository to your local machine and set the two environment variables. Replace
Set up Microsoft Active Directory infrastructure
In this section, we deploy a self-managed Microsoft Active Directory with KDC on a Windows Server EC2 instance into a dedicated VPC. This is an intentionally minimal implementation highlighting only the key components required for this blog post.
Set up EKS infrastructure
This section provisions the Amazon EMR on EKS infrastructure stack, including the VPC, EKS cluster, Amazon Aurora PostgreSQL database, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, Amazon EMR on EKS virtual clusters, and the Spark Operator. Run the following script.
Set up VPC peering
This section establishes network connectivity between the Active Directory VPC and the EKS VPC for Kerberos authentication. Run the following script:
Deploy Hive Metastore with Kerberos authentication
This section deploys a Kerberos-enabled HMS service on the EKS cluster. Complete the following steps:
- Create Kerberos Service Principal for HMS service
- Deploy HMS service with Kerberos authentication
Set up Amazon EMR on Amazon EKS with Kerberos authentication
This section configures Spark jobs to authenticate with the Kerberos-enabled HMS. This involves creating service principals for Spark jobs and generating the required configuration files. Complete the following steps:
- Create Service Principal for Spark jobs
- Generate Kerberos configurations for Spark jobs
Submit Spark jobs
This section verifies Kerberos authentication by running a Spark job that connects to the Kerberized HMS. Complete the following steps:
- Submit the test Spark job
- Monitor job execution
- Verify Kerberos authentication and HMS connection
The logs should confirm successful authentication, including a listing of sample databases and tables.
Understanding Kerberos configuration
The HMS requires specific configuration parameters to enable Kerberos authentication, which are applied through the previously mentioned steps. The key configurations are outlined in the following sections.
HMS configuration (metastore-site.xml)
The following configurations are added to the metastore-site.xml file.
| Setting | Value | Purpose |
| --- | --- | --- |
| hive.metastore.sasl.enabled | true | Enable SASL authentication |
| hive.metastore.kerberos.principal | hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS | HMS service principal |
| hive.metastore.kerberos.keytab.file | /etc/security/keytab/hive.keytab | Keytab path |
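To make the shape of these entries concrete, here is a minimal Python sketch that renders the three settings in the Hadoop-style configuration/property XML layout used by metastore-site.xml. The render_site_xml helper is hypothetical; the sample repository may generate the file differently (for example, through Helm templates).

```python
import xml.etree.ElementTree as ET

# The HMS Kerberos settings listed in this post.
HMS_KERBEROS_SETTINGS = {
    "hive.metastore.sasl.enabled": "true",
    "hive.metastore.kerberos.principal":
        "hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS",
    "hive.metastore.kerberos.keytab.file": "/etc/security/keytab/hive.keytab",
}


def render_site_xml(settings: dict) -> str:
    """Render settings as Hadoop-style <configuration><property> XML."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


metastore_site = render_site_xml(HMS_KERBEROS_SETTINGS)
print(metastore_site)
```

The same helper would work for the core-site.xml settings, because both files share the property layout.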
Hadoop security (core-site.xml)
The following configurations are added to the core-site.xml file.
| Setting | Value |
| --- | --- |
| hadoop.security.authentication | kerberos |
| hadoop.security.authorization | true |
Spark configuration
| Setting | Value | Purpose |
| --- | --- | --- |
| spark.security.credentials.kerberos.enabled | true | Enable Kerberos for Spark |
| spark.hadoop.hive.metastore.sasl.enabled | true | SASL for HMS connection |
| spark.kerberos.principal | spark/analytics-team@CORP.KERBEROS | Spark service principal |
| spark.kerberos.keytab | local:///etc/security/keytab/analytics-team.keytab | Keytab path |
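The following Python sketch shows two ways these settings might be passed to a job: as the spec.sparkConf map of a Spark Operator SparkApplication, or as spark-submit --conf arguments. The helper names are hypothetical, and a real SparkApplication needs many more fields (image, main application file, volumes, and so on) that are omitted here.

```python
import shlex

# The Spark Kerberos settings listed in this post.
SPARK_KERBEROS_CONF = {
    "spark.security.credentials.kerberos.enabled": "true",
    "spark.hadoop.hive.metastore.sasl.enabled": "true",
    "spark.kerberos.principal": "spark/analytics-team@CORP.KERBEROS",
    "spark.kerberos.keytab": "local:///etc/security/keytab/analytics-team.keytab",
}


def as_spark_application_patch(conf: dict) -> dict:
    """Wrap the settings in the spec.sparkConf map of a SparkApplication."""
    return {"spec": {"sparkConf": dict(conf)}}


def as_submit_args(conf: dict) -> list:
    """Express the same settings as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(as_spark_application_patch(SPARK_KERBEROS_CONF))
print(shlex.join(as_submit_args(SPARK_KERBEROS_CONF)))
```

Keeping the settings in one dictionary means the operator manifest and any ad hoc spark-submit command stay consistent.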
Shared Kerberos files
Both HMS and Spark pods mount two common Kerberos configuration files, krb5.conf and jaas.conf, using ConfigMaps and Kubernetes secrets. The krb5.conf file is identical across both services and defines how each component connects to the KDC. The jaas.conf file follows the same structure but differs in the principal and keytab path for each service.
- krb5 configuration
For more information, see the online documentation for krb5.conf.
- JAAS configuration
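As a hedged illustration of what these two files typically contain, the following Python sketch renders a minimal krb5.conf and a keytab-based JAAS entry. The KDC host name dc01.corp.kerberos, the DNS domain corp.kerberos, and the JAAS entry name Client are assumptions for illustration only; the files in the sample repository may use different values and additional options.

```python
def render_krb5_conf(realm: str, kdc_host: str, dns_domain: str) -> str:
    """Render a minimal krb5.conf that points one realm at one KDC."""
    return "\n".join([
        "[libdefaults]",
        f"  default_realm = {realm}",
        "  dns_lookup_kdc = false",
        "",
        "[realms]",
        f"  {realm} = {{",
        f"    kdc = {kdc_host}",
        f"    admin_server = {kdc_host}",
        "  }",
        "",
        "[domain_realm]",
        f"  .{dns_domain} = {realm}",
        f"  {dns_domain} = {realm}",
        "",
    ])


def render_jaas_conf(entry: str, principal: str, keytab_path: str) -> str:
    """Render a JAAS login entry that authenticates from a keytab."""
    return "\n".join([
        f"{entry} {{",
        "  com.sun.security.auth.module.Krb5LoginModule required",
        "  useKeyTab=true",
        f'  keyTab="{keytab_path}"',
        f'  principal="{principal}"',
        "  storeKey=true",
        "  doNotPrompt=true;",
        "};",
        "",
    ])


# Hypothetical KDC host and DNS domain; the realm is the one used in this post.
krb5 = render_krb5_conf("CORP.KERBEROS", "dc01.corp.kerberos", "corp.kerberos")

# Same structure for both services; only principal and keytab path differ.
hms_jaas = render_jaas_conf(
    "Client",
    "hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS",
    "/etc/security/keytab/hive.keytab",
)
spark_jaas = render_jaas_conf(
    "Client",
    "spark/analytics-team@CORP.KERBEROS",
    "/etc/security/keytab/analytics-team.keytab",
)
print(krb5)
print(spark_jaas)
```

Rendering both JAAS files from one template reflects the point made earlier: the structure is shared, and only the principal and keytab path vary per service.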
Additional security considerations
This post focuses on core Kerberos authentication mechanics between Spark and HMS. We recommend two additional security hardening steps based on your organization’s security posture and compliance requirements.
Protecting Keytabs at Rest with AWS KMS Envelope Encryption
Keytabs stored as Kubernetes Secrets are only base64-encoded by default, not encrypted at rest. We recommend enabling EKS envelope encryption using an AWS Key Management Service (AWS KMS) customer managed key. With envelope encryption, secret data is encrypted with a Data Encryption Key (DEK), which is in turn encrypted by your customer managed key. This protects keytab content even if the etcd datastore is compromised. To enable this on an existing EKS cluster:
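One way to do this programmatically is the Amazon EKS AssociateEncryptionConfig API, sketched here with boto3. The cluster name, Region, and key ARN are hypothetical placeholders, and this is an illustrative sketch rather than the command used in the sample repository. The request builder is separated from the API call so that the payload can be inspected without AWS credentials.

```python
def encryption_config_request(cluster_name: str, kms_key_arn: str) -> dict:
    """Build the arguments for the EKS AssociateEncryptionConfig call."""
    return {
        "clusterName": cluster_name,
        "encryptionConfig": [
            # Encrypt Kubernetes secrets with the given customer managed key.
            {"resources": ["secrets"], "provider": {"keyArn": kms_key_arn}}
        ],
    }


def enable_secret_encryption(cluster_name: str, kms_key_arn: str, region: str) -> str:
    """Call EKS and return the ID of the asynchronous cluster update."""
    import boto3  # imported lazily so the request builder works without boto3

    eks = boto3.client("eks", region_name=region)
    response = eks.associate_encryption_config(
        **encryption_config_request(cluster_name, kms_key_arn)
    )
    return response["update"]["id"]


# Hypothetical cluster name and key ARN, for illustration only.
request = encryption_config_request(
    "my-eks-cluster",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(request)
```

The update runs asynchronously, so a real script would poll the returned update ID until it completes before relying on the encryption being active.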
Refer to the Amazon EKS documentation on envelope encryption for complete setup guidance.
Encrypting the Thrift Data Channel with TLS
SASL with Kerberos provides mutual authentication but doesn’t automatically encrypt data over the Thrift connection. Many deployments default to the auth quality of protection (QoP), leaving the data channel unencrypted. We recommend one of the following:
- Set SASL QoP to auth-conf: enables SASL-layer encryption using Kerberos session keys
- Layer TLS over Thrift (preferred): enables transport-level encryption using modern cipher suites
Enabling TLS on HiveServer2 / Hive Metastore Thrift:
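The following Python sketch lists the Hive Metastore properties that are commonly used for this, split into server and client sides, along with the mapping between Hadoop's hadoop.rpc.protection values and SASL QoP levels for the auth-conf alternative. The helper names and example paths are hypothetical. Verify the property names and whether your Hive version reads hadoop.rpc.protection for the metastore before relying on them, and source passwords from a secret rather than hardcoding them.

```python
# Hadoop's hadoop.rpc.protection values and the SASL QoP level each selects.
RPC_PROTECTION_TO_SASL_QOP = {
    "authentication": "auth",   # authentication only
    "integrity": "auth-int",    # adds integrity checks
    "privacy": "auth-conf",     # adds encryption
}


def hms_tls_settings(keystore_path: str, keystore_password: str) -> dict:
    """Server-side metastore settings that turn on TLS for Thrift."""
    return {
        "hive.metastore.use.SSL": "true",
        "hive.metastore.keystore.path": keystore_path,
        "hive.metastore.keystore.password": keystore_password,
    }


def client_tls_settings(truststore_path: str, truststore_password: str) -> dict:
    """Client-side settings so that the caller trusts the HMS certificate."""
    return {
        "hive.metastore.use.SSL": "true",
        "hive.metastore.truststore.path": truststore_path,
        "hive.metastore.truststore.password": truststore_password,
    }


# Hypothetical paths; the password placeholder would come from a secret.
server = hms_tls_settings("/etc/security/tls/hms-keystore.jks", "<from-secret>")
client = client_tls_settings("/etc/security/tls/truststore.jks", "<from-secret>")
print(server)
print(client)
```

On the Spark side, the client properties would typically be passed with the spark.hadoop. prefix, in the same way as the SASL setting shown in the Spark configuration table.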
Refer to the Hive SSL/TLS configuration documentation for complete details.
Cleaning up
To avoid incurring future charges, clean up all resources provisioned during this setup by running the following cleanup script.
Conclusion
In this post, we demonstrated how to implement Kerberos authentication for Amazon EMR on EKS to securely connect to a Kerberos-enabled HMS. This solution addresses a common challenge faced by organizations with existing Kerberos-enabled HMS deployments who want to adopt Amazon EMR on EKS while maintaining their Kerberos-enabled security posture.
This pattern applies whether you’re migrating from on-premises Hadoop, running hybrid Amazon EMR on EC2 and Amazon EMR on EKS environments, or building a new cloud-native platform. It fits any scenario where Spark jobs on Kubernetes must authenticate with a shared, Kerberos-enabled HMS.
You can use this post as a starting point to implement this pattern and extend it further to suit your organization’s data platform needs.