Data Warehouse and BigQuery
A data warehouse is a repository where data from multiple sources is stored (raw data, metadata, and summary data) in a structured way. The data serves a variety of users: data analysts, data scientists, and business analysts. Data may need to be cleaned to ensure quality before use. Subsets of warehouse data aggregated for a specific need (e.g., marketing, sales, or inventory) are called data marts.
What is Google BigQuery?
Google BigQuery is part of Google Cloud Platform (GCP). Google describes it as an autonomous data-to-AI platform that automates the entire data life cycle, from ingestion to AI-driven insights, so you can go from data to AI to action faster [1]. BigQuery is serverless, meaning users don't have to manage infrastructure, and it separates storage from compute. It automatically scales to handle large datasets and runs complex, petabyte-scale queries quickly. BigQuery supports standard ANSI SQL (GoogleSQL), which makes it accessible to data analysts and engineers who are already familiar with SQL.
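As a quick illustration (a sketch of my own, not from this post; it assumes access to BigQuery's public sample datasets), querying BigQuery takes nothing more than familiar SQL:

```sql
-- Total word counts per work in the public Shakespeare sample dataset.
SELECT corpus, SUM(word_count) AS total_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 5;
```

This runs as-is in the BigQuery console with no infrastructure to set up, which is the serverless point made above.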
Ways to Minimize Cost
The cost and time of a query depend on the amount of data scanned by BigQuery (don't run SELECT * over millions of rows of data). However, we can optimize query time and cost using partitioning and clustering. These two approaches let BigQuery scan only the rows that match the filter condition on a column.
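Because BigQuery bills by bytes scanned and stores data by column, the simplest saving is to select only the columns you need. A sketch, using the taxi table that appears later in this post:

```sql
-- Scans every column in the table (avoid this):
-- SELECT * FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`;

-- Scans only two columns, so far fewer bytes are billed:
SELECT VendorID, tpep_dropoff_datetime
FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`;
```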
Partitioning
Partitioning segments a table into groups of rows based on the values in a column (e.g., a date, timestamp, or integer range). The query cost can be estimated before running. BigQuery skips partitions that do not match the query's filter on the partitioning column; this process is called partition pruning. A table is limited to 4,000 partitions.
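A partition-only variant of the taxi table might look like this (the target table name is my own invention; the source table is the one created later in this post):

```sql
CREATE OR REPLACE TABLE `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_partitioned`
PARTITION BY DATE(tpep_dropoff_datetime) AS
SELECT * FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`;

-- A filter on the partitioning column lets BigQuery prune whole partitions:
SELECT COUNT(*)
FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_partitioned`
WHERE DATE(tpep_dropoff_datetime) = '2024-03-01';
```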
Clustering
Clustering provides additional performance on top of partitioning. Clustering sorts the data blocks within a single table, or within each partition of a partitioned table. It further improves performance within the scanned data by ensuring that rows with similar values are stored physically close together in storage blocks. We can specify up to four clustering columns. However, the query cost is only known after the query runs (the upfront estimate may be an upper bound), because data pruning happens during query execution. Note that tables smaller than 1 GB will not see a significant improvement from clustering.
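Clustering can also be applied on its own, without partitioning. A minimal sketch (again, the target table name here is my own, not from this post):

```sql
CREATE OR REPLACE TABLE `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_clustered`
CLUSTER BY VendorID AS
SELECT * FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`;
```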
The image shows a query that retrieves the distinct VendorIDs with tpep_dropoff_datetime between 2024-03-01 and 2024-03-15 from the NYC yellow taxi dataset, without partitioning or clustering.
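That query would look roughly like this (reconstructed from the description above; the table name matches the non-partitioned table used later in this post):

```sql
SELECT DISTINCT VendorID
FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`
WHERE tpep_dropoff_datetime BETWEEN '2024-03-01' AND '2024-03-15';
```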
Let's see the performance gain from partitioning and clustering. The SQL below partitions the NYC yellow taxi data by tpep_dropoff_datetime and clusters it by VendorID.
CREATE OR REPLACE TABLE `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `de-zoomcamp-487008.de_zoomcamp.yellow_tripdata_non_partitioned`;
Then we can run the previous query against the partitioned and clustered table to compare performance.
With partitioning and clustering, the data scanned is reduced to 26.84 MB (about a 91% decrease in data scanned by BigQuery).
References
- From data warehouse to autonomous data and AI platform https://cloud.google.com/bigquery?hl=en

