This is a guest post by Aakash Pradeep, Principal Software Engineer, and Venkatram Bondugula, Software Engineer at Twilio, in partnership with AWS.
Twilio is a cloud communications platform that provides programmable APIs and tools for developers to easily integrate voice, messaging, email, video, and other communication features into their applications and customer engagement workflows.
In this blog series, we discuss how we built a multi-engine query platform at Twilio. The first part introduces the use case that led us to build a new platform and why we selected Amazon Athena alongside our open-source Presto implementation. This second part discusses how Twilio’s query infrastructure platform integrates with AWS Lake Formation to provide fine-grained access control to all their data.
At Twilio, we faced critical challenges in managing our multi-engine query platform across a complex data mesh architecture spanning multiple AWS accounts and Lines of Business. We needed a unified permissions model that could work consistently across different query engines like OSS Presto and Amazon Athena, eliminating the fragmented authentication experiences in our infrastructure. The growing demand for secure cross-account data sharing required moving beyond manual, multi-step provisioning processes that depended heavily on human intervention. Additionally, Twilio’s compliance and data stewardship requirements demanded fine-grained access controls at row, column, and cell levels, necessitating a scalable and flexible approach to permission management. By adopting the AWS Glue Data Catalog as our managed metastore and AWS Lake Formation for governance, we implemented Tag-Based Access Control (LF-TBAC) to simplify access management, enabled data sharing through automated workflows, and established a centralized governance framework that provided uniform permissions management across all AWS services.
Transitioning to a managed metastore and governance solution
We discussed in part 1 how we were looking to move to managed services to relieve us of the burden of managing the underlying infrastructure of a query platform. Along with our decision to adopt Amazon Athena, we also began to evaluate Amazon EMR Serverless for our Spark workloads, which made us aware that we needed to migrate to a managed solution for our Apache Hive metastore.
We selected the AWS Glue Data Catalog as our managed metastore repository to support our enterprise-wide data mesh architecture. For managing permissions to the Data Catalog assets, we chose AWS Lake Formation, a service that enables data governance and security at scale using familiar database-like permissions. Lake Formation provides both the unified permissions model and the support for a data mesh architecture that we were looking for.
Lake Formation’s support for row, column, and cell-level access controls provides the fine-grained access control (FGAC) capabilities required by our compliance and data stewardship policies. Additionally, Lake Formation’s tag-based access control (LF-TBAC) feature allows us to define FGAC permissions based on tags attached to the Data Catalog resources, enabling flexible and scalable permission management.
Integrating Odin with AWS Lake Formation
Odin, our Presto-based gateway, serves as a central hub for query processing, managing authentication, routing, and the complete workflow throughout a query’s lifecycle. As the primary interface, Odin enables users to connect through JDBC or APIs from various BI tools, SQL IDEs, and other applications.
Beyond its core routing capabilities, Odin uses local caches implemented with Google’s Guava caching library to optimize performance across the platform. Guava delivers efficient in-memory caching for Java applications by storing data locally within the application instance, resulting in significantly faster retrieval times. Odin employs multiple Guava caching layers across various modules to ensure optimal response times for frequently accessed data and metadata.
Building on this performance foundation, Odin implements authentication and authorization layers to ensure secure and controlled access to data across multiple query engines. These security components work together to verify user identities and enforce data access policies, providing a unified security framework that abstracts away the complexities of individual engine implementations while maintaining strict governance standards.
The authentication layer
Different query engines like OSS Presto and Amazon Athena each implement their own authentication mechanisms. To create a consistent user experience, Odin provides a unified authentication layer that shields users from these underlying differences. Currently, Odin’s pluggable authentication system supports LDAP integration, with plans to expand this capability to include Okta authentication using IAM Identity Center in the future.
The authorization layer
For data consumers using AWS Analytics services such as AWS Glue, Amazon EMR, and Athena through IAM federated role-based access, AWS Lake Formation provided critical authorization capabilities for data governance through its existing integrations. However, we needed to extend its capabilities to integrate with OSS Presto. Additionally, the users of our query infrastructure platform weren’t mapped to IAM users, so we needed to build a custom authorization layer in Odin to verify permissions and integrate with Lake Formation. Our challenge was creating a consistent way to control data access across all our query engines.
When a user runs a query, Odin’s authorization layer checks three key pieces of information:
- Table details: which database and table the query is accessing
- User permissions: what data tags the user has access to
- Resource tags: what security tags are attached to the requested table
We store user permissions in Amazon DynamoDB, which allows us to quickly look up what each user can access. By matching the user’s tags with the table’s Lake Formation tags, we can determine if the query should be allowed. To keep things fast, we cache this information temporarily, allowing us to expedite authorization for recent requests.
How the authorization works:
- Initial check: First, we see if this user recently ran a similar successful query (within the last 5 minutes).
- Gather information: We collect the table details, user permissions, and security tags, first checking our cache, then fetching from AWS Glue Data Catalog and Lake Formation if needed.
- Match permissions: We compare the user’s access tags stored in a DynamoDB table against the table’s security tags in Lake Formation.
- Make decision: If the user’s permissions match what’s required for their query action (like SELECT or INSERT), access is granted.
This approach allows us to make use of Lake Formation tag-based access control while keeping our authorization logic separate from the individual query engines. By using smart caching and efficient lookups, we can verify permissions in just milliseconds.
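The four steps above can be illustrated with a minimal Python sketch. It is not Odin's code (Odin is a Java service). The `Authorizer` class, its field names, and the in-memory dictionaries standing in for the DynamoDB table and the Lake Formation tags are all hypothetical. The matching rule is also an assumption: a user's grant is modeled as a tag expression, and a table matches when its tag value for every key in the expression is among the allowed values.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

DECISION_TTL_SECONDS = 300  # "within the last 5 minutes" in the text


@dataclass
class Authorizer:
    # Hypothetical stand-ins for the DynamoDB user-to-tag table and the
    # LF-Tags attached to tables in Lake Formation.
    user_tags: dict   # user -> {action -> {tag_key -> set of allowed values}}
    table_tags: dict  # (database, table) -> {tag_key -> tag_value}
    clock: Callable[[], float] = time.monotonic
    _recent: dict = field(default_factory=dict)  # request key -> expiry time

    def is_allowed(self, user, database, table, action):
        key = (user, database, table, action)
        now = self.clock()

        # Step 1: initial check against recently approved requests.
        expiry = self._recent.get(key)
        if expiry is not None and expiry > now:
            return True

        # Step 2: gather the user's grants and the table's tags.
        expression = self.user_tags.get(user, {}).get(action, {})
        tags = self.table_tags.get((database, table), {})

        # Step 3: match every key of the granted expression against the table.
        allowed = bool(expression) and all(
            tags.get(tag_key) in values for tag_key, values in expression.items()
        )

        # Step 4: decide, remembering only successful decisions.
        if allowed:
            self._recent[key] = now + DECISION_TTL_SECONDS
        return allowed
```

One consequence of caching approvals in this sketch is that a revoked permission can remain effective until the cached entry expires, which is the usual trade-off of a short-lived decision cache.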
Building a data mesh
At Twilio, we have multiple lines of business (LoBs), each managing its own data platform infrastructure. The individual platforms are spread across multiple AWS accounts and primarily store data on Amazon S3 in a variety of open table formats, such as Apache Hudi, Apache Iceberg, and Delta Lake. Each platform independently supports analytics and machine learning use cases; however, there was a growing need for secure sharing of data across LoBs. Additionally, we needed to enable self-service discovery and provisioning of access to the data with a centralized governance framework.
Data consumers bring their own AWS accounts and choice of tools, which include not only AWS services such as Amazon Athena, AWS Glue ETL jobs (Spark), and Amazon EMR, but also AWS partner solutions. To improve access fulfillment and data auditability, and to reduce the operational overhead involved, we needed an automated framework that required minimal human intervention and oversight.
Implementing a data subscription workflow
Previously, users requiring access to specific datasets would need to go through multiple steps to secure access, which involved several dependencies and manual actions. To simplify this process and provide a self-service capability, we decided to build a custom integration solution between ServiceNow and AWS Lake Formation. At Twilio, ServiceNow is used extensively to automate workflows and build custom applications to connect disparate systems and improve operational efficiency.
We automated key parts of the data access process using Twilio’s standard tools: Git for version control, Terraform for infrastructure management, and custom scripts to execute the necessary AWS actions.
We automated three main use cases:
1. Sharing data between accounts
When one team needs to share data with another team or with our central governance account, the process begins with a Git pull request (PR). This triggers our custom Lake Formation automation tool, which:
- Connects to the source AWS account with admin permissions
- Sets up data sharing using the security tags (LF-Tags) specified in a YAML configuration file
- Completes the share using AWS Resource Access Manager (RAM)
- Creates resource links in the target account so the data appears in their catalog
- Updates ServiceNow with the newly shared database and table information
2. Granting permissions to user roles
When users request access to data, our automation tool grants tag-based permissions directly to their IAM roles in Lake Formation. This happens after approval of either a Git PR or a ServiceNow ticket.
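As a rough illustration of what a tag-based grant to an IAM role looks like, the following Python sketch builds the request body for the Lake Formation `GrantPermissions` API using an `LFTagPolicy` resource. The helper name `build_lf_tag_grant`, the role ARN, and the tag values are hypothetical; the post does not describe how Twilio's automation tool is written.

```python
def build_lf_tag_grant(principal_arn, tag_expression, permissions,
                       catalog_id=None, resource_type="TABLE"):
    """Build keyword arguments for Lake Formation's GrantPermissions call.

    tag_expression maps an LF-Tag key to the tag values the principal may
    access, for example {"domain": ["billing"]}.
    """
    policy = {
        "ResourceType": resource_type,  # "DATABASE" or "TABLE"
        "Expression": [
            {"TagKey": key, "TagValues": sorted(values)}
            for key, values in sorted(tag_expression.items())
        ],
    }
    if catalog_id:
        # Account ID of the catalog that owns the tags (cross-account case).
        policy["CatalogId"] = catalog_id
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"LFTagPolicy": policy},
        "Permissions": list(permissions),
    }


# Hypothetical role ARN and tag values, for illustration only.
request = build_lf_tag_grant(
    "arn:aws:iam::111122223333:role/analytics-readers",
    {"domain": ["billing"], "sensitivity": ["public", "internal"]},
    ["SELECT", "DESCRIBE"],
)
# With credentials configured, the request could be sent with boto3:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Building the request separately from sending it keeps the grant reviewable in a PR and easy to unit test without AWS access.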
3. Granting access to individual users
For individual user access requests:
- Users submit a request in ServiceNow for specific tables
- After approval, ServiceNow calls our internal API, which checks the associated Lake Formation tags
- The request is validated and sent to an Amazon Simple Queue Service (Amazon SQS) queue
- A consumer service processes the request, updates the user’s permissions in our DynamoDB table (which Odin uses for authorization checks), and includes retry logic for reliability
- Once complete, the service updates the ServiceNow ticket to notify the user
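The consumer step can be sketched as follows. The post says only that the consumer "includes retry logic", so the exponential backoff, the attempt limit, and the names `process_with_retry` and `apply_update` are assumptions made for illustration. A plain dictionary stands in for the DynamoDB table.

```python
import time


def process_with_retry(message, apply_update, max_attempts=3,
                       base_delay=0.5, sleep=time.sleep):
    """Apply one access-request message, retrying failures with backoff.

    Returns the attempt number that succeeded and re-raises the last
    error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            apply_update(message)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait 0.5s, 1s, 2s, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical in-memory stand-in for the user-to-tag DynamoDB table.
user_to_tags = {}


def apply_update(message):
    user_to_tags.setdefault(message["user"], set()).update(message["tags"])


attempts = process_with_retry(
    {"user": "alice", "tags": ["domain=billing"]}, apply_update
)
```

In a real consumer, a message that still fails after the final attempt would typically be left on the queue or moved to a dead-letter queue rather than dropped.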
The overall subscription and authorization flow is shown in the diagram below:
- Users submit a request in ServiceNow for access to a database, table, or LF-Tag
- The system retrieves the associated LF-Tags from Lake Formation through our API integration
- Upon approval, the automation task adds the user to the User-To-Tag DynamoDB table, grants IAM role permissions in Lake Formation, and sets up cross-account sharing via RAM as needed
- Users submit a SQL query to the Odin Presto gateway
- Odin authenticates the user through LDAP
- Odin parses the SQL query to identify the tables involved and the action being performed (SELECT, DDL, and more)
- Odin validates permissions using the User to LF-Tag mapping and Lake Formation grants to authorize the SQL query based on granted permissions
- If authorized, Odin routes the query to Amazon Athena or Presto
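To make the parsing step concrete, here is a deliberately naive Python sketch that pulls the action and the referenced tables out of a SQL string with a regular expression. The function name `describe_query` is hypothetical, and this is not how Odin parses queries; a production gateway would rely on a full SQL parser, since a regex misreads constructs such as `DROP TABLE IF EXISTS`, quoted identifiers, and CTE names.

```python
import re

# Identifiers that follow keywords which introduce a table reference.
_TABLE_REF = re.compile(
    r"\b(?:from|join|into|update|table)\s+"
    r"([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)",
    re.IGNORECASE,
)
_DDL_KEYWORDS = {"CREATE", "ALTER", "DROP"}


def describe_query(sql):
    """Return (action, sorted table names) for a simple SQL statement."""
    words = sql.strip().split(None, 1)
    keyword = words[0].upper() if words else ""
    action = "DDL" if keyword in _DDL_KEYWORDS else keyword
    tables = sorted({name.lower() for name in _TABLE_REF.findall(sql)})
    return action, tables


action, tables = describe_query(
    "SELECT o.id FROM sales.orders o JOIN sales.customers c ON o.cid = c.id"
)
```

Each table returned here would then be checked against the user's tag permissions before the query is routed to an engine.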
Using standardized tools and processes to provide self-service capabilities to users helped us scale the governance framework and support broader use cases. Critical capabilities in Lake Formation, such as tag-based access control (TBAC) and cross-account sharing of data, simplified creating automations and our overall approach to governance.
Lessons learned: Cache is king
“By adopting AWS Glue Data Catalog as our managed metastore and AWS Lake Formation for Tag-Based Access Control, we simplified access management and enabled data sharing by reducing auth overhead to just 6-10 milliseconds through caching and targeted scaling.”
As Odin began handling queries at scale, we encountered performance bottlenecks in our customized authorization process because we had to retrieve information from multiple services, particularly with complex queries spanning multiple tables. These authorization checks frequently caused query timeouts, which impacted overall system reliability. The root of the problem lay in our sequential authorization workflow: our system first had to parse each query to identify all tables requiring identity verification, then make separate API calls to the AWS Glue Data Catalog and Lake Formation for each table’s permissions. It became clear that we needed to optimize this authentication process to reduce response times and improve the overall query experience.
We also recognized there were different caching needs between our POST operations and GET/DELETE HTTP calls, so we decided to separate them into two different Application Load Balancer (ALB) target groups. For POST requests, which required Lake Formation authentication, we found that concentrating traffic through just 2-3 target instances distributed across multiple Availability Zones (AZs) was more efficient. This approach allowed authentication information to be effectively cached locally on these dedicated instances, dramatically reducing the volume of API calls to the Lake Formation service.
GET and DELETE requests follow a simpler workflow. Because users have already completed initial authorization, there is no need to perform further authorization checks. Although the workflow is simpler, these requests have a much higher volume, numbering in the tens of millions per hour. Due to this scale, we opted for horizontal scaling, growing the ALB target group to 10 Amazon EC2 instances that fetch the query history from the DynamoDB table. These EC2 instances employ local LRU caching with a 5-minute expiration policy for authentication data.
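The local caching described here (least-recently-used eviction combined with a 5-minute expiration) can be sketched in a few lines of Python. Odin uses Guava caches in Java, so this is an analogy, not its implementation. The class name `LruTtlCache`, the default capacity, and the loader callback are assumptions.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Small in-process cache with LRU eviction and per-entry expiry."""

    def __init__(self, max_entries=1024, ttl_seconds=300, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expiry time)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self._entries.move_to_end(key)  # mark as most recently used
            return entry[0]
        value = loader(key)
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Because each instance keeps its own cache, sending the same kind of request to a small, fixed set of instances raises the hit rate, which is consistent with the reasoning given above for concentrating POST traffic on 2-3 targets.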
By implementing authentication caching and adopting specialized approaches for different HTTP request types with targeted scaling groups, we successfully reduced Odin’s overall overhead to a maximum of 6-10 milliseconds for both authentication and authorization.
Conclusion and what’s next
In this post, we explored how we enhanced Odin, our unified multi-engine query platform, with authentication and authorization capabilities using AWS Lake Formation and a custom authorization workflow. By using AWS services including Lake Formation, AWS Glue Data Catalog, and Amazon DynamoDB alongside Twilio’s existing infrastructure, we created a scalable self-service governance framework that streamlines user access management, simplifies auditing, and enables seamless data sharing across our complex cloud environment. With this workflow automation, we eliminated operational overhead while building a secure, robust platform that serves as the foundation for Twilio’s data mesh architecture.
Going forward, we’re focusing on strengthening our authentication and authorization framework by enabling trusted federation with an identity provider (IdP) through AWS IAM Identity Center, which integrates directly with Lake Formation. Using the Trusted Identity Propagation capabilities supported by IAM Identity Center will allow us to establish a consistent governance flow based on user identity and to unlock the full capabilities of AWS Lake Formation, such as fine-grained access control with data filters.
To learn more and get started with building with AWS Lake Formation, see Getting started with Lake Formation, and How to build a data mesh architecture at scale using AWS Lake Formation tag-based access control.
About the authors
