Posts

Showing posts from February, 2026

Data Warehouse and BigQuery

A data warehouse is a repository where data from multiple sources is stored in a structured way: raw data, metadata, and summary data. The data is used by a range of users, from data analysts and data scientists to business analysts, and it may need to be cleaned to ensure quality before use. Subsets of a data warehouse aggregated for a specific need (e.g., marketing, sales, or inventory) are called data marts.

What is Google BigQuery? Google BigQuery is part of Google Cloud Platform (GCP). It is an autonomous data-to-AI platform that automates the entire data life cycle, from ingestion to AI-driven insights, so you can go from data to AI to action faster [1]. BigQuery is serverless, meaning users don't have to manage infrastructure. It separates storage and compute, so it can scale automatically to handle large datasets and run complex, petabyte-scale queries quickly. BigQuery s...
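As a quick illustration of the serverless model, here is a minimal sketch of running a query with the google-cloud-bigquery Python client. The project ID and table name are hypothetical placeholders, not from the post; the dry-run step is a common way to estimate how much data a query would scan before paying for compute.

```python
from google.cloud import bigquery

# Assumes GCP credentials are configured (e.g., GOOGLE_APPLICATION_CREDENTIALS).
# "my-project" and the table below are hypothetical placeholders.
client = bigquery.Client(project="my-project")

sql = """
    SELECT passenger_count, COUNT(*) AS trips
    FROM `my-project.trips_data.yellow_tripdata`  -- hypothetical table
    GROUP BY passenger_count
    ORDER BY trips DESC
"""

# Dry run: BigQuery reports the bytes it would scan without executing the query.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Query would scan {dry.total_bytes_processed:,} bytes")

# Actual run: no clusters to provision; compute scales behind the scenes.
for row in client.query(sql).result():
    print(row.passenger_count, row.trips)
```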

Building a Data Ingestion Pipeline with Kestra

This week, I built data ingestion pipelines for the same NYC taxi dataset as last week. Instead of running separate tasks, I used Kestra to organize and manage my workflow. I will write a blog post on the basics of Kestra, but let me give a brief introduction here: Kestra is a workflow orchestrator that uses flow code or a no-code interface to automate work, e.g., building a data pipeline. Here is how my pipelines are built.

Ingestion to PostgreSQL: in this flow, we take the NYC data, transform it, and load it into PostgreSQL. As input, we specify the year, month, and taxi type (green or yellow) that we want from the dataset. The flow is illustrated below, and the flow code can be found here. Here is a description of each task. Set label: sets the specific file name and the target database table name based on these inputs (e.g., yellow_tripdata_2019-01.csv). Data extraction: extracts taxi data from https://github.com/DataTalksClub/nyc-tlc-data/r...
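To make the set-label and extraction steps concrete, here is a minimal Python sketch of the same extract-and-load logic outside Kestra. The release URL pattern, table name, and PostgreSQL connection string are assumptions for illustration, not the post's actual flow code; in Kestra itself these steps would be expressed as YAML flow tasks.

```python
import pandas as pd
from sqlalchemy import create_engine  # requires a PostgreSQL driver such as psycopg2

# Hypothetical inputs, mirroring the flow's parameters.
taxi, year, month = "yellow", "2019", "01"

# Set label: build the file name and table name from the inputs.
file = f"{taxi}_tripdata_{year}-{month}.csv"
url = (
    "https://github.com/DataTalksClub/nyc-tlc-data/"  # assumed release URL pattern
    f"releases/download/{taxi}/{file}.gz"
)

# Data extraction: pandas reads the gzipped CSV straight from the URL.
df = pd.read_csv(url, compression="gzip")

# Load: write the rows to PostgreSQL; the connection string is a local placeholder.
engine = create_engine("postgresql://user:password@localhost:5432/ny_taxi")
df.to_sql(name=file.removesuffix(".csv"), con=engine, if_exists="replace", index=False)
```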
