The Benefits of an All-in-One Data Lakehouse


In a recent blog post, Cloudera Chief Technology Officer Ram Venkatesh described the evolution of the data lakehouse, as well as the benefits of using an open data lakehouse, especially the open Cloudera Data Platform (CDP). If you missed it, you can read up about it here.

Modern data lakehouses are typically deployed in the cloud. Cloud computing brings several distinct advantages that are core to the lakehouse value proposition. The first is near-unlimited storage. Leveraging cloud-based object storage frees analytics platforms from fixed storage constraints, so your data can keep growing without a capacity ceiling that you have to plan around. The second advantage is virtualized compute power. Analytical engines can be scaled up (or down) on demand, as your workload requires. Finally, cloud computing adds low cost and high resiliency to these services.

These advantages provide the foundation for the modern data lakehouse architectural pattern. Cloud computing allows for on-demand provisioning of infrastructure and services. There are, however, two ways that you can deploy a data lakehouse:

  1. First, you can build and configure a data lakehouse within your cloud account, in a manner known as Platform as a Service (PaaS).
  2. Second, you can subscribe to a data lakehouse service, in a manner known as Software as a Service (SaaS).

This article will dive deeper into the characteristics of both types of data lakehouse deployments, introducing the benefits of Cloudera’s new all-in-one lakehouse offering, CDP One.

PaaS data lakehouses

Platform as a Service (PaaS) data lakehouses are virtualized deployments of the data lakehouse that are provisioned within your cloud account. Cloudera Data Platform (CDP) public cloud is an example of a PaaS data lakehouse. Let’s dive into the characteristics of these PaaS deployments:

Hardware (compute and storage): With PaaS deployments, the data lakehouse will be provisioned within your cloud account. Your team will make the decision on the size and shape of the infrastructure that comprises the data lakehouse deployment. You will have access to on-demand compute and storage at your discretion.

Security: Even though the PaaS data lakehouse is provisioned for you, it is up to you to define and enforce the security of your cloud deployment. You are responsible for securing the perimeter, defining network rules, and establishing endpoint protection that detects and prevents threats.

Additionally, you are responsible for the security of the cloud-resident data. This data exists outside of your corporate network perimeter, so it is prudent to set up your own security information and event management (SIEM) system to capture and log all access to the components and data.

Cloud platform security offers a wide range of tools and techniques to make your cloud deployment as secure as, or even more secure than, your on-premises footprint. Integrating these components to conform to your security controls, however, is your responsibility.

Operations: Operational activities for PaaS-deployed data lakehouses need to be executed by your operations team. Typically one or more cloud engineers deploy the data lakehouse and subsequently provide operational support for the deployment. Once deployed, the health of the lakehouse needs to be continually monitored for availability and connectivity issues. Should an issue arise, it is up to this cloud ops team to apply corrective measures. 

In addition to health monitoring, your ops team would also be responsible for executing operational and maintenance activities. Software upgrades and security patches need to be tested, scheduled, and delivered by the ops team. Should system resources such as CPU or system memory become constrained, this ops team is responsible for correcting the problem. In short, just like on-premises deployments, a small team of operations personnel is required to successfully deploy and manage this type of data lakehouse deployment.

Cost: PaaS data lakehouses run in your cloud account. You are responsible for paying the monthly cloud bill. Given that, it is wise to create a cloud spend budget, define cloud controls to prevent runaway spend, and regularly monitor cloud spend. Beyond budget monitoring, you also need to constantly monitor the cost performance of the lakehouse. This allows you to run workloads that conform to your service level agreement and fit within the budget set.
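As a rough illustration of the budget monitoring described above, the sketch below projects month-end spend from the daily spend observed so far and compares it to a monthly budget. The function names, the 80% warning threshold, and the linear projection are all assumptions made for this example. They are not part of CDP or of any cloud provider's billing tools, which offer their own budgeting features.

```python
def projected_month_end_spend(daily_spend, days_in_month):
    """Linearly project month-end spend from the daily spend seen so far."""
    if not daily_spend:
        return 0.0
    average_per_day = sum(daily_spend) / len(daily_spend)
    return average_per_day * days_in_month


def budget_status(daily_spend, days_in_month, monthly_budget, warn_ratio=0.8):
    """Return "over", "warning", or "ok" for the projected spend.

    warn_ratio is a hypothetical threshold: warn once the projection
    reaches this fraction of the budget.
    """
    projected = projected_month_end_spend(daily_spend, days_in_month)
    if projected > monthly_budget:
        return "over"
    if projected >= warn_ratio * monthly_budget:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Ten days of spend at 100 (illustrative currency units) per day.
    spend_so_far = [100.0] * 10
    print(budget_status(spend_so_far, days_in_month=30, monthly_budget=2500.0))
```

A linear projection is the simplest possible choice here. Real workloads are often bursty, so a team would likely combine a check like this with the alerts and hard limits their cloud provider supports.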

PaaS data lakehouses are ideal for companies that want to do it themselves (DIY). PaaS deployments give companies finer control over all aspects of the environment. You own the cloud account and can access all the configurations and services that the cloud provider offers.

While PaaS data lakehouses provide agility and a quicker path to analytics as compared to on-premises deployments, they do require ongoing operations staffing to ensure successful delivery of analytic services.

SaaS data lakehouses

Software as a Service (SaaS) data lakehouse deployments are turnkey solutions offered as a service. For example, the recently announced CDP One all-in-one data lakehouse is a SaaS offering that runs in the cloud (Amazon Web Services). CDP One provides a self-service experience, meaning low friction and low touch. Your business and your users should be focused on generating business value in the form of analytics, rather than focusing on IT, operations, and support. Let’s dive into each category and compare it to PaaS data lakehouse deployments.

Hardware (compute and storage): As with PaaS data lakehouses, the CDP One data lakehouse resides in the cloud and uses virtualized compute. The size and shape of a SaaS data lakehouse are determined for you automatically. It can grow automatically as needed, driven by your usage and budget. Cloud storage is versioned as well, so should you inadvertently delete important data, the CDP One ops team can quickly recover it for you. To the user, it is a serverless experience.

Security: CDP One is a SaaS offering with a single-tenant cloud architecture that enables private and secure access to Cloudera Data Platform. CDP One participates in industry certification and accreditation programs to provide the highest level of assurance regarding our operations, infrastructure, and security controls. Cloudera partners with leading AICPA-certified, third-party auditors to maintain a SOC 2 Type 2 report and ISO 27001 certification. Protecting your data is part of the CDP One offering. Access to the data lakehouse is secure, and data is encrypted in motion and at rest and is continuously monitored. Threat vectors take all forms, and the CDP One security service detects and responds to anomalous activity. The CDP One security framework is regularly updated to detect and block the most current security threats. Finally, all activity is captured and logged into the CDP One security information and event management system for full auditing, security alerting, and activity transparency.

Operations: Operations, DevOps, and SecOps are part of the CDP One offering. The CDP One data lakehouse is continuously monitored for availability. Any infrastructure issues are automatically detected and quickly resolved. Patches for security issues are regularly and automatically applied to the compute nodes and containers with minimal downtime. Software upgrades, always a complex and often lengthy activity, are automatically applied for you on a quarterly basis at a mutually agreed upon time. With CDP One, you do not have to staff or worry about DevOps and SecOps activities. These operations are part of the service and a key feature that drives lower total cost of ownership: you do not have to hire or staff an operations team to manage the data lakehouse.

Cost: CDP One is consumption-based. You pay for the compute power and storage you use to drive your analytics. Your data warehouse dashboards might be running during business hours and remain unused during other hours. CDP One can automatically schedule availability of the analytic engines for just the times you need them. Under the covers, the service performs extensive cloud benchmarks to ensure that you always get the best cost performance.
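To make the effect of scheduled availability concrete, here is a small back-of-the-envelope sketch comparing compute that is billed around the clock with compute that is available only during business hours. The hourly rate, the 10-hour business day, and the function name are hypothetical values chosen for illustration. They do not reflect actual CDP One pricing or scheduling behavior.

```python
def weekly_compute_cost(hours_per_day, days_per_week, rate_per_hour):
    """Cost of one week of compute billed only for the hours it is available."""
    return hours_per_day * days_per_week * rate_per_hour


# Hypothetical rate in arbitrary currency units per compute hour.
RATE_PER_HOUR = 2.0

# Always on: 24 hours a day, 7 days a week.
always_on = weekly_compute_cost(24, 7, RATE_PER_HOUR)

# Scheduled: assumed 10 business hours a day, 5 days a week.
business_hours = weekly_compute_cost(10, 5, RATE_PER_HOUR)

# Fraction of the always-on cost avoided by scheduling.
savings_fraction = 1 - business_hours / always_on

if __name__ == "__main__":
    print(f"always on:      {always_on:.2f}")
    print(f"business hours: {business_hours:.2f}")
    print(f"savings:        {savings_fraction:.0%}")
```

Under these assumed numbers, the scheduled engines are billed for 50 of the 168 hours in a week, which is where a consumption-based model gets most of its savings for dashboards that sit idle overnight and on weekends.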

The benefits of all-in-one data lakehouses

Operating a production-ready data lakehouse can be challenging. Challenges include deploying and maintaining the data platform as well as managing cloud compute costs. Additionally, your data within the data lakehouse must be kept secure, yet at the same time easily accessible by authorized staff and business intelligence tools within your enterprise. 

If you like to do it yourself, and have the staff and time to configure and manage it, a PaaS data lakehouse deployment might be the best option for you. However, if you’d rather focus on the analytical workloads that power your business, then consider Cloudera’s recently announced CDP One, a self-service data lakehouse based on Cloudera Data Platform (CDP) Public Cloud, an open data lakehouse software suite. CDP One is an all-in-one data lakehouse Software as a Service (SaaS) offering that enables fast and easy self-service analytics and exploratory data science on any type of data. CDP One requires zero ops, so you do not need specialized ops or cloud expertise. Try it today for free here!