Addressing the Three Scalability Challenges in Fashionable Knowledge Platforms



In legacy analytical programs akin to enterprise knowledge warehouses, the scalability challenges of a system had been primarily related to computational scalability, i.e., the flexibility of a knowledge platform to deal with bigger volumes of knowledge in an agile and cost-efficient approach. Open supply frameworks akin to Apache Impala, Apache Hive and Apache Spark supply a extremely scalable programming mannequin that’s able to processing large volumes of structured and unstructured knowledge via parallel execution on a lot of commodity computing nodes. 

Whereas that programming paradigm was a very good match for the challenges it addressed when it was initially launched, current know-how provide and demand drivers have launched various levels of scalability complexity to fashionable Enterprise Knowledge Platforms that have to adapt to a dynamic panorama characterised by:

  • Proliferation of knowledge processing capabilities and elevated specialization by technical use case and even particular variations of technical use circumstances (for instance, sure households of AI algorithms, akin to Machine Studying, require purposely-built frameworks for environment friendly processing). As well as, knowledge pipelines embody an increasing number of levels, thus making it troublesome for knowledge engineers to compile, handle, and troubleshoot these analytical workloads
  • Explosion of knowledge availability from a wide range of sources, together with on-premises knowledge shops utilized by enterprise knowledge warehousing / knowledge lake platforms, knowledge on cloud object shops usually produced by heterogenous, cloud-only processing applied sciences, or knowledge produced by SaaS purposes which have now advanced into distinct platform ecosystems (e.g., CRM platforms). As well as, extra knowledge is turning into obtainable for processing / enrichment of current and new use circumstances e.g., just lately we have now skilled a fast development in knowledge assortment on the edge and a rise in availability of frameworks for processing that knowledge
  • Rise in polyglot knowledge motion due to the explosion in knowledge availability and the elevated want for complicated knowledge transformations (because of, e.g., completely different knowledge codecs utilized by completely different processing frameworks or proprietary purposes). Because of this, various knowledge integration applied sciences (e.g., ELT versus ETL) have emerged to deal with – in probably the most environment friendly approach – present knowledge motion wants
  • Rise in knowledge safety and governance wants because of a posh and ranging regulatory panorama imposed by completely different sovereigns and, additionally, because of the improve in variety of knowledge customers each inside the boundaries of a corporation (on account of knowledge democratization efforts and self-serve enablement) but in addition exterior these boundaries as corporations develop knowledge merchandise that they commercialize to a broader viewers of finish customers.

These challenges have outlined the guiding rules for the metamorphosis of the Fashionable Knowledge Platform to leverage a composite deployment mannequin (e.g., hybrid multi-cloud), that delivers fit-for-purpose analytics to energy the end-to-end knowledge lifecycle with constant safety and governance and in an open method (utilizing open supply frameworks to keep away from vendor lock-ins and proprietary applied sciences). These 4 capabilities collectively outline the Enterprise Knowledge Cloud.

Understanding Scalability Challenges in Fashionable Enterprise Knowledge Platforms

A consequence of the aforementioned shaping forces is the rise in scalability challenges for contemporary Enterprise Knowledge Platforms. These scalability challenges could be organized in three main classes:

  • Computational Scalability: How can we deploy analytical processing capabilities at scale and in a cost-efficient method, when analytical wants develop at an exponential fee, and we have to implement a large number of technical use circumstances in opposition to large quantities of knowledge?
  • Operational Scalability: How can we handle / function an Enterprise Knowledge Platform in an operationally environment friendly method, notably when that knowledge platform grows in scale and complexity? As well as, how can we allow completely different software growth groups to effectively collaborate and apply agile DevOps disciplines once they leverage completely different programming constructs (e.g., completely different analytical frameworks) for complicated use circumstances that span completely different levels throughout the information lifecycle?
  • Architectural Scalability: How can we preserve architectural coherence when the enterprise knowledge platform wants to meet an growing number of useful and non-functional necessities that require extra refined analytical processing capabilities, whereas delivering enterprise-grade knowledge safety and governance capabilities for knowledge and use circumstances hosted on completely different environments (e.g., public, non-public, hybrid cloud)?

Usually, organizations that leverage narrow-scope, single public cloud options for knowledge processing face incremental prices as they scale to deal with extra complicated use circumstances or an elevated variety of customers. These incremental prices derive from a wide range of causes:

  • Elevated knowledge processing prices related to legacy deployment sorts (e.g., Digital Machine-based autoscaling) as a substitute of utilizing superior deployment sorts akin to containers that cut back time to scale up / down compute assets
  • Restricted flexibility to make use of extra complicated internet hosting fashions (e.g., multi-public cloud or hybrid cloud) that would scale back analytical value per question utilizing probably the most cost-efficient infrastructure atmosphere (leveraging, e.g., pricing disparities between completely different public cloud service suppliers for particular compute occasion sorts / areas)
  • Duplication of storage prices as analytical outputs must be saved in silo-ed knowledge shops, and, oftentimes, utilizing proprietary knowledge codecs between completely different levels of a broader knowledge ecosystem that makes use of completely different instruments for analytical use circumstances
  • Increased prices for third get together instruments required for knowledge safety / governance and workload observability and optimization; The necessity for these instruments stems from both lack of native safety and governance capabilities in public cloud-only options or the dearth of uniformity in safety and governance frameworks employed by completely different options inside the identical knowledge ecosystem
  • Elevated integration prices utilizing completely different free or tight coupling approaches between disparate analytical applied sciences and internet hosting environments. For instance, organizations with current on-premises environments which are attempting to increase their analytical atmosphere to the general public cloud and deploy hybrid-cloud use circumstances have to construct their very own metadata synchronization and knowledge replication capabilities
  • Elevated operational prices to handle Hadoop-as-a-Service environments, given the dearth of area experience by Cloud Service Suppliers that merely bundle open supply frameworks in their very own PaaS runtimes however don’t supply refined proactive or reactive help capabilities, lowering Median Time To Uncover and Restore (MTTD / MTTR) for vital Severity-1 points.

The above challenges and prices could be simply ignored in PoC deployments or on the early levels of a public cloud migration, notably when a corporation is transferring small and fewer vital workloads to the general public cloud. Nevertheless, because the scope of the information platforms extends to incorporate extra complicated use circumstances or course of bigger volumes of knowledge, these ‘overhead prices’ develop into increased and the price for analytical processing will increase. That scenario could be simply illustrated with the notion of marginal value for a unit of analytical processing, i.e., the price to service the subsequent use case or present an analytical atmosphere to a brand new enterprise unit: 

How Cloudera Knowledge Platform (CDP) Addresses Scalability Challenges

In contrast to different platforms, CDP is an Enterprise Knowledge Cloud and allows  organizations to handle scalability challenges by providing a fully-integrated, multi-function, and infrastructure-agnostic knowledge platform. CDP contains all obligatory capabilities associated to knowledge safety, governance and workload observability which are stipulations for a big scale, complicated enterprise-grade deployment: 

Computational Scalability

  • For Knowledge Warehousing use circumstances which are some the most typical and important large knowledge workloads (within the sense that they’re being utilized by many alternative personas and different downstream analytical purposes), CDP delivers decrease cost-per-query vis-a-vis cloud-native knowledge warehouses and different Hadoop-as-a-Service options, based mostly on comparisons carried out utilizing reference efficiency benchmarks for giant knowledge workloads (e.g., benchmarking research carried out by unbiased third get together)
  • CDP leverages containers for almost all of the Knowledge Companies thus enabling nearly instantaneous scale up / down of compute swimming pools, as a substitute of utilizing Digital Machines for auto-scaling, an strategy nonetheless utilized by many distributors
  • CDP provides the flexibility to deploy workloads on versatile internet hosting fashions akin to hybrid cloud or public multi-cloud environments, permitting organizations to run use circumstances on probably the most environment friendly atmosphere all through the use case lifecycle with out even incurring migration / use case refactoring prices

Operational Scalability

  • CDP has launched many operational efficiencies and a single pane of glass for full operational management and for composing complicated knowledge ecosystems by providing pre-integrated analytical processing capabilities as “Knowledge Companies” (beforehand referred to as experiences) , thus lowering operational effort and value to combine completely different levels in a broader knowledge ecosystem and handle their dependencies.
  • For every particular person Knowledge Service, CDP reduces time to configure, deploy and handle completely different analytical environments. That’s achieved by offering templates based mostly on completely different workload necessities (e.g., Excessive Availability Operational Databases) and by automating proactive situation identification and determination (e.g., auto-tuning and auto-healing options offered by CDP Operational Database or COD) 
  • That stage of automation and ease allows knowledge practitioners to face up analytical environments in a self-service method (i.e., with out involvement from the Platform Engineering workforce to configure every Knowledge Service) inside the safety and governance boundaries outlined by the IT Perform

With CDP, software growth groups that leverage the assorted Knowledge Companies can speed up growth of use circumstances and time-to-insights by leveraging the end-to-end knowledge visibility options provided by the Shared Knowledge Expertise (SDX) akin to knowledge lineage and collaborative visualizations Architectural Scalability

  • CDP provides completely different analytical processing capabilities as pre-integrated Knowledge Companies, thus eliminating the necessity for complicated ETL / ELT pipelines which are usually used to combine heterogeneous knowledge processing capabilities
  • CDP contains out-of-the-box, purposely constructed capabilities that allow automated atmosphere administration (for hybrid cloud and public multi-cloud environments), use case orchestration, observability and optimization. CDP Knowledge Engineering (CDE) for instance, contains three capabilities (Managed Airflow, Visible Profiler and Workload Supervisor) to empower knowledge engineers to handle complicated Directed Acyclic Graphs (DAGs) / knowledge pipelines  
  • SDX, which is an integral a part of CDP , delivers uniform knowledge safety and governance, coupled with knowledge visualization capabilities enabling fast onboarding of knowledge and knowledge platform customers and entry to insights for all of CDP throughout hybrid clouds at no additional value.


The sections above current how the Cloudera Knowledge Platform helps organizations overcome scalability challenges throughout computational, architectural and operational areas which are related to implementing Enterprise Knowledge Clouds at scale. Particulars across the Shared Knowledge Expertise (SDX) that removes architectural complexities of enormous knowledge ecosystems could be discovered right here and for an outline of the Cloudera Knowledge Platform processing capabilities please go to