Cloud Data Warehousing

Friday, February 26, 2021

Introduction

A cloud data warehouse is a database delivered in public cloud as a managed service that is optimized for analytics, scale and ease of use. We believe that cloud data warehouses are a game changer and the next wave in data warehousing. Used thoughtfully, cloud data warehouses can dramatically lower your operating costs while giving you the agility to keep up with the demands of the business.

In The late 2000s, working with Oracle 7.3/8i and Microsoft SQL 2000, a“relational” database where data was formatted into tables. The concept of adata service using SQL with dimension fact modeling was a game changer. In the 2005, when relational databases began to struggle with the size and complexity of analytical workloads, we saw the emergence of the Massive Parallel Processing (MPP) data warehouses like Teradata, Netezza and later, Vertica and Greenplum. In 2010, a sea change in data management with an open source project called Hadoop. The concept of a “data lake” where I could query raw unstructured data was a huge leap forward in my ability to capture, store and process more data with more agility at a substantially lower cost.

We’re Now witnessing a third wave of innovation in data warehousing technology with the advent of cloud data warehouses from AWS, Microsoft, Google, Oracle, IBMetc. with Serverless technology services; e.g. AWS EMR, Redshift with Lambda Functions and Azure SQL Datawarehouse DataFactory, Azure Data Lake Storage,Azure Databricks, Azure Event Hub, Azure IOT Hub, Azure Analysis Services, etc.and Google BigQuery, Google Dataflow, Google Data Catalog, Google Dataproc,Google Datafusion, Workflow Orchestration using Airflow, Cloud Data transfer services . As enterprises move to the cloud, they are abandoning their legacy on-premise data warehousing technologies, including Hadoop, for these new cloud data platforms. This transformation is a huge tectonic shift in data management and has profound implications for enterprises.

 

Cloud DATA WAREHOUSE CAPABILITIES

Each of the major public cloud vendors offer their own flavor of a cloud data warehouse service: Microsoft has Azure SQL Data Warehouse, Google offers BigQuery and Amazon has Redshift. There are also cloud offerings from the likes of Snowflake that provide the same capabilities via a service that runs on the public cloud but is managed independently. Foreach of these services, the cloud vendor or data warehouse provider delivers the following capabilities “out of the box”:

  1. Data storage and management: data is stored in a cloud-based file system (i.e. AWS S3, Azure Blob, google Cloud storage, Persistent disks).
  2. Automatic upgrades: there’s no concept of a “version” or software upgrade.
  3. Capacity management: it’s easy to expand (or contract) your data footprint.
  4. Native Services to make fast and ease the application development.
  5. High Scalability, Easy to manage and Flexible infrastructure.
  6. Automated Continuous data replication.

 

THE BENEFITS OF A CLOUD DATA WAREHOUSE

Cloud-based data warehouses free up companies to focus on running their business, rather than running a room full of servers,and they allow business intelligence teams to deliver faster and better insights due to robust, reliable and improved access, scalability, and performance.

  • Data Access: Putting their data in the cloud enables companies to give their analysts access to real-time data from numerous sources like IOT devices, CCTV camera, social media files and streaming, any unstructured audio video files, allowing them to run better analytics quickly.
  • Scalability: It is much faster and less expensive to scale a cloud data warehouse than an on-premise system because it doesn’t require purchasing new hardware (and possibly over- or under-provisioning) and the scaling can happen automatically as needed.
  • Performance: A cloud data warehouse allows for queries to be run much more quickly than they are against a traditional on-premises data warehouse, for lower cost.
  • Low cost pay as you use for even using server less high volume big data resources, e.g. a few hours usage of 500 core CPU, 1000 GB RAM, Multi-terabyte storage.

 

Cloud Data Warehouse Comparison Chart

Features Key Differentiator
Amazon Redshift High-performance and massively parallel processing capabilities.Network isolation security. Direct integration with S3 cloud storage.
Google BigQuery Part of Google Cloud.Full SQL query support. Integration with BigQuery ML for machine learning workloads.
IBM Db2 Warehouse Includes an in-memory columnar database.Cloud deployment options include both IBM Cloud as well as AWS. Integrated support for Apache Spark data analytics.
Microsoft Azure SQL Data Warehouse Data masking security capabilities.Integrated with broader Azure cloud services. Inclusion of Microsoft SQL Server support.
Oracle Autonomous Data Warehouse Based on the latest Oracle Autonomous Database release.Migration support for other databases and cloud data warehouse services. Delivered by purpose-built Oracle Exadata hardware.
SAP Data Warehouse Cloud Pre-built business templates.Integration with existing SAP apps and services. Based on SAP HANA database.
Snowflake SQL based queries for analytics.Support for JSON and XML as well as structured data. Multi-cloud deployment options.

 

FACTORS TO CONSIDER WHEN CHOOSING A CLOUD DATA WAREHOUSE

How these cloud data warehouse vendors deliver these capabilities and how they charge, Let’s dive deeper into the different deployment implementations and pricing models.

Cloud Architecture: Cluster versus Serverless

There are two main camps of cloud data warehouse architectures. The first, older deployment architecture is cluster-based: Amazon Redshift and Azure SQL Data Warehouse fall into this category. Typically, clustered cloud data warehouses are really just clustered Postgres derivatives, ported to run as a service in the cloud. The other flavor, serverless, is more modern and counts Google BigQuery and Snowflake as examples. Essentially, serverless cloud data warehouses make the database cluster “invisible” or shared across many clients. Each architecture has their pros and cons (see below).

Consideration Serverless Clustered
Elasticity No Cluster to manage. Queries automatically scaled up to some throttle point. Customer needs to expand (or reduce) cluster size as data and load expand or reduce.
Management Service is managed by Cloud Vendor. Cluster health and capacity management sometimes necessary.
Cost Per Query or utilization based so more difficult to predict. Priced per node so easier to predict.

Cloud Data Pricing

Besides deployment architecture,another major difference between the cloud data warehouse options is pricing.In all cases, you pay some nominal fee for the amount of data stored. But the pricing differs for compute.

For example, Google BigQuery and Snowflake offer on-demand pricing options based on the amount of data scanned or compute time used. Amazon Redshift and Azure SQL Data Warehouse offerre source pricing based on the number or types of nodes in the cluster. There are pros and cons to both types of pricing models. The on-demand models only charge you for what you use which can make budgeting difficult as it is hard to predict the number of users and the number and size of the queries they will be running. If a user mistakenly ran large number of query e.g. $1,000+ query,charge will be high for the mistake.

Pricing is a major consideration and requires a great deal of use case and workload modeling to find the right fit for your organization.

Challenges and Considerations for Cloud Migration

We’ve seen lots of enterprise sattempt a migration from their on-premise data lakes and/or relational data warehouses to the cloud. For many, their migrations “stall” after the first pilot project due to the following reasons:

  1. Disruption: downstream users (business analysts, data scientists) have to change their habits and re-design, re-tool their reports and dashboards.
  2. Performance: If one does not use Cloud native services then no major performance gain. If your data lake or data     warehouse platform is based on old legacy technology then there is high performance gain but if it is tuned with latest technology then in comparison with on-premise, no significant performance.
  3. Low cost on infrastructure on cloud in comparison to on-premise, Cloud is scalable so no need to keep 30-50%     extra CPU, Memory, Storage resources as contingency.

Where CLOUD PRISMA can help

  • Successful Cloud migration
  • Minimizes or eliminates business disruption due to platform migrations
  • Boost Performance by applying best practices and tools
  • Enhance monitoring of resources,costs and application services

You might also like
this new related posts

If you’re interested to view all our use cases, click here.