Docker Basics for Data Engineering
Hello there! I have started the Data Engineering Zoomcamp 2026 cohort led by Alexey Grigorev. I want to share what I am learning as I go. My goal is to explain how things work and why we use them. The code for each learning module is in my GitHub.
Knowing what to do as a data engineer
A data engineer builds and maintains what is called a "data pipeline": a program that ingests data from sources, cleans and transforms it, and loads it into a database. The goal is to make sure the pipeline works, whatever it takes. Now, imagine that you build a pipeline using Python 3.11 while the production server runs 3.8: your pipeline cannot run on the server. What we need is software that lets us run the same versions and dependencies on every machine. That is what Docker does.
What is Docker?
Docker is containerization software; it isolates an environment from the local host machine while still using its resources. Docker is a leaner alternative to a virtual machine because it shares the operating system with the host machine (faster startup time) and contains only the application and its necessary dependencies (less disk space). To ensure portability, such as running a data pipeline on a different machine, Docker uses images: lightweight, standalone packages of a container that can run on any system with Docker installed [1].
Base images are pre-built templates, often maintained by official developers or the open-source community, that provide a starting point for others [2]. Examples include official images for programming languages like Python or databases like PostgreSQL. One benefit of using a base image is that we don't need to install the program on our local host machine; we only need Docker. Running a base image from the terminal is straightforward: we provide the image name, environment variables, and any other requirements of the image (e.g., a port). The code block below is an example of running a PostgreSQL version 18 database from the postgres:18 image.
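This is a minimal sketch: the user name, password, database name, and volume name are placeholders, and the in-container data path should be checked against the documentation of the postgres image tag you use (newer tags keep data under /var/lib/postgresql, while older ones document /var/lib/postgresql/data).

# Placeholder credentials and database name
docker run \
  -e POSTGRES_USER=root \
  -e POSTGRES_PASSWORD=root \
  -e POSTGRES_DB=ny_taxi \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18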
-e sets environment variables (the user name, password, and database name in this example)
-v mounts a volume (discussed in more detail later)
-p maps a port (host_port:container_port)
Interactive container
When we need to interact with a container, e.g., to pass additional input via a bash or Python shell, we add -it to the docker run command. A prompt will appear in the terminal to receive your input.
To end the process of a running container, press Ctrl + C.
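For example, a quick interactive sketch (the image tag is just an illustration) that drops you into a Python shell running inside the container:

# Start an interactive Python shell inside a container
docker run -it python:3.13 python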
However, for specialized
tasks like building a data pipeline, we must create
a custom image. This is done by taking a base image and adding our specific code, configurations, and environment
variables. To create a custom image, Docker requires specific instructions: the programming language (base image), the
necessary dependencies, the file structure, and the build procedures. These are defined in a Dockerfile—a
line-by-line script that directs Docker on how to assemble the image. Below is an example of the Dockerfile
I used to build a data pipeline for the Data Engineering Zoomcamp (referencing the ingest_data.py script).
# Use a slim Python image to keep the final container size small
FROM python:3.13.11-slim
# Copy the 'uv' binary from the official Astral image for fast dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/
# Set the working directory inside the container
WORKDIR /code
# Ensure the virtual environment created by uv is automatically used by Python
ENV PATH="/code/.venv/bin:$PATH"
# Copy dependency files first to leverage Docker's layer caching
# Modifying code won't trigger a full re-install of packages
COPY pyproject.toml .python-version uv.lock ./
# Install dependencies exactly as specified in the lockfile
RUN uv sync --locked
# Copy the application script into the container
COPY ingest_data.py .
# Define the command to run when the container starts
ENTRYPOINT ["python", "ingest_data.py"]
In the example Dockerfile, I use uv as the Python package manager. When building the image, Docker reads the pyproject.toml and uv.lock files to pull dependencies from the cache, ensuring a faster and more predictable build. Unlike a standard pip install, which may download the latest compatible versions and cause dependency drift, uv locks specific versions the moment libraries are added. This provides much-needed consistency across environments; the production environment becomes a 1:1 match with the development environment.
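For instance (the package name is just an example), adding a library with uv pins the exact resolved version in uv.lock, and uv sync --locked recreates exactly that environment, which is what the Dockerfile does during the build:

# Add a dependency; uv records the exact resolved version in uv.lock
uv add pandas
# Recreate the environment exactly as pinned in the lockfile
uv sync --locked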
To build a custom image, run the following from the terminal in the directory where the Dockerfile and its dependencies are located:
docker build -t image_name:image_version .
The -t stands for tag, which lets us specify the image name and version. The dot (.) tells Docker to use the current directory as the build context. Then, we can run it the same way we run a base image.
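As a sketch (the tag taxi_ingest:v001 is a placeholder), running the custom image looks like running any other image; anything placed after the image name is passed as arguments to the ENTRYPOINT script:

# Run the pipeline image; any extra arguments would be forwarded to ingest_data.py
docker run -it taxi_ingest:v001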
By default, Docker containers are stateless, meaning any changes made within a running container are lost once it is deleted. While this prevents the host
machine from becoming cluttered with temporary files, stopped containers can still leave behind logs and
filesystem layers. To maintain a clean environment and follow best practices, it is better to remove stopped
containers (the corpses) and spin up a new container rather than restarting stopped ones.
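A few standard Docker CLI commands help with this housekeeping:

# List all containers, including stopped ones
docker ps -a
# Remove a specific stopped container by ID or name
docker rm <container_id>
# Remove all stopped containers at once
docker container prune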
But most of the time we want our work to persist, i.e., to be stateful, especially as data engineers who store data in an organization's database. That is why we need a Docker volume.
Docker Volume
A Docker volume is a persistent storage space that exists outside the container's temporary file system, like a dedicated virtual disk. While a container's internal storage is wiped when the container is deleted, a volume ensures data persists. There are two primary methods for creating a volume:
1. Named Volume
Docker manages the storage location on the host machine automatically (e.g., -v volume_name:/container_path). The data is persistent; if you stop or delete the container and then create a new one using the same volume, your data (like a database's records) will still be there.
2. Bind Mount
You explicitly map a specific directory on your local host machine to a path inside the container (e.g., -v /local_path:/container_path). This is excellent for development because any change made on your local computer is immediately reflected inside the container.
Caution!!
If we map a bind mount to a non-existent directory on the host machine, Docker will automatically create it. However, because Docker runs with root privileges, the newly created folder will be owned by the root user. This may require running sudo in the terminal or administrative permissions to modify or delete that directory later.
For data engineers, volumes are essential for databases like PostgreSQL. To interact with that data from the local host machine, we must also map a port, connecting a port on your local machine to the database port inside the Docker environment.
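As a small sketch (pg_data is a placeholder name), the docker volume commands show how named volumes are managed and where Docker keeps them on the host:

# Create a named volume explicitly (docker run -v name:path also creates it on first use)
docker volume create pg_data
# List volumes and inspect one to see its mount point on the host
docker volume ls
docker volume inspect pg_data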
Working with Multiple Containers
We now know that we need one image to spin up our database. We can interact with the database using the Python library pgcli, or we can run a pgAdmin image in another container to manage the database through a GUI. However, due to the isolated nature of Docker containers, two different containers (e.g., PostgreSQL, the data pipeline, and pgAdmin) cannot see each other, even if we map them to the same port. To connect different containers, we need to create a Docker network for them [3].
Example of using a Docker network
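The following is a minimal sketch: the network and container names (pg-network, pg-database, pgadmin) and the credentials are placeholders, and pgAdmin is configured through the documented dpage/pgadmin4 environment variables.

# Create a user-defined network
docker network create pg-network

# Start PostgreSQL on that network; other containers can reach it by its name (pg-database)
docker run -d \
  -e POSTGRES_USER=root \
  -e POSTGRES_PASSWORD=root \
  -e POSTGRES_DB=ny_taxi \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  --network pg-network \
  --name pg-database \
  postgres:18

# Start pgAdmin on the same network; in its UI, register a server with host pg-database and port 5432
docker run -d \
  -e PGADMIN_DEFAULT_EMAIL=admin@admin.com \
  -e PGADMIN_DEFAULT_PASSWORD=root \
  -p 8080:80 \
  --network pg-network \
  --name pgadmin \
  dpage/pgadmin4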
Managing and launching multiple containers individually can be complex and error-prone. To streamline this process, we use a Docker Compose file—a configuration file in YAML format that instructs Docker to deploy multiple containers simultaneously, defined as services [4].
Containers defined within the same Docker Compose file are automatically joined to the same internal network, allowing them to communicate using their service names. Any container created outside of this file is isolated by default and must be explicitly connected to the Compose network to interact with the services.
For instance, we can use Docker Compose to set up a PostgreSQL database and pgAdmin as shown below. However, it is often better to run the data ingestion pipeline as a separate, one-off container. This separation ensures we only ingest data when necessary, rather than every time the database environment is started.
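A sketch of such a Compose file, with placeholder credentials, names, and ports (and the same caveat as earlier about the postgres data path, which should be checked against the image documentation):

services:
  pgdatabase:
    image: postgres:18
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root
      POSTGRES_DB: ny_taxi
    volumes:
      - ny_taxi_postgres_data:/var/lib/postgresql
    ports:
      - "5432:5432"

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8080:80"
    depends_on:
      - pgdatabase

volumes:
  ny_taxi_postgres_data:

Running docker compose up -d starts both services on a shared network, and docker compose down stops and removes them.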