Docker Basics for Data Engineering

Hello there! I have started the Data Engineering Zoomcamp 2026 cohort led by Alexey Grigorev, and I want to share what I am learning as I go. My goal is to explain how things work and why we use them. The code for each learning module is in my GitHub.

Knowing what to do as a data engineer

A data engineer builds and maintains what is called a "data pipeline": a program that ingests data from sources, cleans and transforms it, and loads it into a database. The goal is to make sure the pipeline works, whatever it takes. Now, imagine you build a pipeline using Python 3.11 while the production server runs Python 3.8: your pipeline cannot run on the server. What we need is software that lets us run the same versions and dependencies on every machine. That is what Docker does.

What is Docker

Docker is containerization software; it isolates an environment from the host machine while still using its resources. Docker is a leaner alternative to a virtual machine because it shares the operating system with the host machine (faster startup time) and contains only the application and its necessary dependencies (less disk space). To ensure portability, such as running a data pipeline on a different machine, Docker uses images: lightweight, standalone packages of a container that can run on any system with Docker installed [1].

Base images are pre-built templates, often maintained by official developers or the open-source community, that provide a starting point for others [2]. Examples include official images for programming languages like Python or databases like PostgreSQL. One benefit of using a base image is that we don't need to install the program on our local host machine; we only need Docker. Running a base image is simple and straightforward, requiring only the base image name, environment variables, and whatever the base image needs (e.g., a port), all from your terminal. The code block below is an example of running a PostgreSQL version 18 database from the image postgres:18.
docker run \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
Where
  • -e sets environment variables (user name, password, and database name in this example)
  • -v mounts a volume (discussed in more detail later)
  • -p maps a port (host_port:container_port)
Interactive containers: when we need to interact with a container, e.g., to pass additional input via a bash or Python shell, we add -it to the docker run statement. A prompt will then appear in the terminal to receive your input.
To end the process of a running container, press Ctrl + C.
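For example, the following command (purely illustrative) starts a throwaway interactive Python shell from the official Python base image; --rm removes the container when you exit:
docker run -it --rm python:3.13-slim python
Type exit() or press Ctrl + D to leave the shell and stop the container.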

However, for specialized tasks like building a data pipeline, we must create a custom image. This is done by taking a base image and adding our specific code, configurations, and environment variables. To create a custom image, Docker requires specific instructions: the programming language (base image), the necessary dependencies, the file structure, and the build procedure. These are defined in a Dockerfile, a line-by-line script that directs Docker on how to assemble the image. Below is an example of the Dockerfile I used to build a data pipeline for the Data Engineering Zoomcamp (referencing the ingest_data.py script).
# Use a slim Python image to keep the final container size small
FROM python:3.13-slim

# Copy the 'uv' binary from the official Astral image for fast dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set the working directory inside the container
WORKDIR /code

# Ensure the virtual environment created by uv is automatically used by Python
ENV PATH="/code/.venv/bin:$PATH"

# Copy dependency files first to leverage Docker's layer caching
# Modifying code won't trigger a full re-install of packages
COPY pyproject.toml .python-version uv.lock ./

# Install dependencies exactly as specified in the lockfile
RUN uv sync --locked

# Copy the application script into the container
COPY ingest_data.py .

# Define the command to run when the container starts
ENTRYPOINT ["python", "ingest_data.py"]
In the example Dockerfile, I use uv as the Python package manager. When building the image, Docker reads the pyproject.toml and uv.lock files to pull dependencies from the cache, ensuring a faster and more predictable build. Unlike a standard pip install, which may download the latest compatible versions and cause dependency drift, uv locks specific versions the moment libraries are added. This provides much-needed consistency across environments; the production environment becomes a 1:1 match with the development environment.
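For reference, this is roughly how those lock files are produced on the development machine (the library names here are only illustrative; the real pipeline's dependencies live in its pyproject.toml):
# Create pyproject.toml and .python-version for a new project
uv init
# Add dependencies; uv resolves versions and pins them in uv.lock
uv add pandas sqlalchemy psycopg2-binary
Inside the Dockerfile, uv sync --locked then installs exactly those pinned versions.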

To build the custom image, run the following from the terminal in the directory where the Dockerfile and its dependencies are located:
docker build -t image_name:image_version .
The -t flag stands for tag and lets us specify the image name and version (name:version). The dot (.) tells Docker to use the current directory as the build context. Then, we can run the image the same way we run a base image.
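For example, assuming we tag the pipeline image taxi_ingest:v001 (the name is illustrative), anything placed after the image name is passed as arguments to the ENTRYPOINT script:
# Build the image from the Dockerfile in the current directory
docker build -t taxi_ingest:v001 .
# Run it; arguments after the image name are forwarded to ingest_data.py
docker run -it taxi_ingest:v001
The exact arguments depend on what ingest_data.py expects (e.g., database credentials and a source URL).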

By default, Docker containers are stateless, meaning any changes made within a running container are lost once it is deleted. While this prevents the host machine from becoming cluttered with temporary files, stopped containers can still leave behind logs and filesystem layers. To maintain a clean environment and follow best practices, it is better to remove stopped containers (the corpses) and spin up new ones rather than restarting stopped ones; the relevant cleanup commands are sketched below. Most of the time, however, we want our work to persist, to be stateful, especially as data engineers who store data in an organization's database. That is why we need a Docker volume.
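A minimal sketch of that cleanup:
docker ps -a              # list all containers, including stopped ones
docker rm <container_id>  # remove a specific stopped container
docker container prune    # remove all stopped containers at once
Alternatively, adding --rm to docker run removes the container automatically when it exits.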

Docker Volume

A Docker volume is a persistent storage space that exists outside the container's temporary file system, like a dedicated virtual disk. While a container's internal storage is wiped when the container is deleted, a volume ensures the data persists. There are two primary methods for creating volumes:

1. Named Volume
Docker manages the storage location on the host machine automatically (e.g., -v volume_name:/container_path). The data is persistent; if you stop or delete the container and then create a new one using the same volume, your data (like a database's records) will still be there.

2. Bind Mount
You explicitly map a specific directory on your local host machine to a path inside the container (e.g., -v /local_path:/container_path). This is excellent for development because any change made on your local computer immediately reflects inside the container.
Caution! If we map a bind mount to a non-existent directory on the host machine, Docker will create it automatically. However, because Docker runs with root privileges, the newly created folder will be owned by the root user. You may then need sudo or administrative permissions to modify or delete that directory later.
For data engineers, volumes are essential for databases like PostgreSQL. To interact with that data from the local host machine, we must also map a port, connecting a port on your local machine to the database port inside the Docker environment. Both volume forms are sketched below.
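As a quick comparison (a sketch based on the earlier postgres:18 example), the only difference on the command line is the left-hand side of -v:
# Named volume: Docker manages where the data lives on the host
docker run -e POSTGRES_PASSWORD="root" -v ny_taxi_postgres_data:/var/lib/postgresql -p 5432:5432 postgres:18
# Bind mount: the data lives in an explicit host directory (here, under the current working directory)
docker run -e POSTGRES_PASSWORD="root" -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql -p 5432:5432 postgres:18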

Working with Multiple Containers

We now know that we need one image for spinning up our database. We can interact with the database using the Python tool pgcli, or we can run a pgAdmin image in another container, which provides a GUI for managing the database. However, due to the isolated nature of Docker containers, two different containers cannot see each other by default, even when their ports are mapped to the host (e.g., PostgreSQL, the data pipeline, and pgAdmin). To connect different containers, we need to create a Docker network for them [3].

As an illustrative example of using a Docker network (the container and network names below are my own), we create the network first and then attach each container to it so they can reach each other by name:
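docker network create pg-network

docker run -d \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  --network=pg-network \
  --name pgdatabase \
  postgres:18

docker run -d \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8085:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4
Inside the network, pgAdmin reaches the database using the hostname pgdatabase and port 5432 (the container port), not localhost.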

Managing and launching multiple containers individually like this is complex and error-prone. To streamline the process, we use a Docker Compose file: a configuration file in YAML format that instructs Docker to deploy multiple containers simultaneously, defined as services [4].

Containers defined within the same Docker Compose file are automatically joined to the same internal network, allowing them to communicate using their service names. Any container created outside of this file is isolated by default and must be explicitly connected to the Compose network to interact with the services.

For instance, we can use Docker Compose to set up a PostgreSQL database and pgAdmin as shown below. However, it is often better to run the data ingestion pipeline as a separate, one-off container. This separation ensures we only ingest data when necessary, rather than every time the database environment is started.
services:
  pgdatabase:
    image: postgres:18
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8085:80"
volumes:
  ny_taxi_postgres_data:
  pgadmin_data:
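To bring the stack up and tear it down, run the following from the directory containing the Compose file:
docker compose up -d   # start all services in the background
docker compose down    # stop and remove the containers (named volumes are kept)
Because both services share the Compose network, pgAdmin connects to the database with hostname pgdatabase and port 5432 (the container port), not localhost.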


References

  1. Use containers to Build, Share and Run your applications. https://www.docker.com/resources/what-container/
  2. Docker Official Images. https://docs.docker.com/docker-hub/image-library/trusted-content/#docker-official-images
  3. docker network create. https://docs.docker.com/reference/cli/docker/network/create/
  4. Docker Compose. https://docs.docker.com/compose/
