Data Warehouse and BigQuery

          A data warehouse is a repository that stores data from multiple sources (raw data, metadata, and summary data) in a structured way. The data serves various users, such as data analysts, data scientists, and business analysts. Data may need to be cleaned to ensure quality before use. Subsets of a data warehouse that are aggregated for a specific need (e.g., marketing, sales, inventory) are called data marts.

What is Google BigQuery?

          Google BigQuery is part of Google Cloud Platform (GCP). It is described as an autonomous data-to-AI platform that automates the entire data life cycle, from ingestion to AI-driven insights, so you can go from data to AI to action faster [1]. BigQuery is serverless, meaning users don't have to manage infrastructure. It separates storage from compute, so it can automatically scale to handle large datasets and run complex, petabyte-scale queries quickly. BigQuery s...

Using Secrets as Environment Variables in GitHub Codespaces

My Problem with Secret Management

        In the second module of the Data Engineering Zoomcamp, I learned about Kestra, an open-source workflow orchestrator, to automate and provision data engineering tasks. Today, data engineers inevitably work with cloud platforms such as Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS). Thus, we need to pass our credentials to Kestra, or to any other workflow orchestrator, to streamline our work. I use GitHub for version control and write my code in its Codespaces. Putting my credentials in a public repository is not a good idea.

I followed the "Manage Secrets in Kestra | How-to Guide" from Kestra to keep my credentials secret and use them as variables in my workflows. Since I use the open-source version, I cannot add my credentials as secrets directly via the Kestra UI. Instead, I store my credentials in a .env file, convert them to Base64, save the result in a .env_encoded file, and make sure that both .env and .env_encoded are listed in .gitignore to prevent leaks. However, with this approach I have to re-add my credentials every time I re-open the Codespace, which gets annoying 😅. So I ended up storing them as a secret in GitHub Codespaces.

Add a Secret to GitHub Codespaces

  1. Go to your repository and click ⚙️Settings (the rightmost icon on the repo navigation tab).
  2. On the left-hand side, click the Secrets and variables dropdown under the Security section and select Codespaces.
  3. Add the credentials and name the secret (e.g., GCP_SERVICE_ACCOUNT).
    Note that a Google Cloud service account key is in JSON format, so we need to minify it before adding it as a secret.

Normal JSON
{
  "name": "Sam",
  "city": "Bangkok"
}
Minified JSON
{"name":"Sam","city":"Bangkok"}
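
One way to minify the key file from a terminal is sketched below. It assumes that either jq or python3 is installed in the Codespace; the helper name minify_json is made up for illustration and is not part of any tool.

```shell
#!/bin/bash
# minify_json: hypothetical helper that reads JSON on stdin
# and prints it on a single line without extra whitespace.
minify_json() {
  if command -v jq >/dev/null 2>&1; then
    # jq -c prints compact (one-line) JSON
    jq -c .
  else
    # Fallback: re-serialize with Python's json module, no padding
    python3 -c 'import json, sys; print(json.dumps(json.load(sys.stdin), separators=(",", ":"), ensure_ascii=False))'
  fi
}

# Demo with the sample document from above (not a real key)
minify_json <<'EOF'
{
  "name": "Sam",
  "city": "Bangkok"
}
EOF
```

For a real key, something like `minify_json < service-account.json` (the file name is an assumption) would print the one-line version to paste into the secret.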

Parse the Secret into Codespaces Environment

        Once the secret is added, we can go into our working directory in the Codespace and create a new file, say setup_secrets.sh. Open that file and put the following code into it.

  #!/bin/bash
  # setup_secrets.sh

  # 1. Create the .env_encoded file (which Kestra will read)
  # We take the GitHub Secret, Base64 encode it, and prefix it with SECRET_
  # $GCP_SERVICE_ACCOUNT must match the name given to the Codespaces secret
  echo "SECRET_GCP_CREDS=$(printf '%s' "$GCP_SERVICE_ACCOUNT" | base64 -w 0)" > .env_encoded
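
Before wiring the file into Kestra, it can be worth checking that the encoding step round-trips. The sketch below uses a dummy value instead of a real credential; the variable names sample_json, encoded, line, and decoded are illustrative only. It pipes through `tr -d '\n'` rather than relying on `base64 -w 0`, on the assumption that this is a bit more portable across base64 implementations.

```shell
#!/bin/bash
# Round-trip check with a dummy value (hypothetical, not a real credential)
sample_json='{"name":"Sam","city":"Bangkok"}'

# Encode on one line, the same shape that ends up in .env_encoded
encoded="$(printf '%s' "$sample_json" | base64 | tr -d '\n')"
line="SECRET_GCP_CREDS=${encoded}"

# Strip the key and decode again
decoded="$(printf '%s' "${line#SECRET_GCP_CREDS=}" | base64 --decode)"

if [ "$decoded" = "$sample_json" ]; then
  echo "round trip OK"
else
  echo "round trip mismatch" >&2
  exit 1
fi
```
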
Then update docker-compose.yaml, following what Will explained in the YouTube video.
services:
  kestra:
    image: kestra/kestra:latest
    env_file:
      - .env_encoded  # Kestra reads the Base64 secrets from here
    environment:
      KESTRA_CONFIGURATION: |
        kestra:
          secret:
            type: ENVIRONMENT
Lastly, we want this .env_encoded file to be regenerated every time we re-open our Codespace. Add the following line to .devcontainer.json:
"postStartCommand": "bash setup_secrets.sh"
Now we can use the credentials as a secret environment variable in Kestra without re-creating .env and .env_encoded every time we re-open the Codespace.
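
One caveat: if the Codespaces secret is not set, a script like setup_secrets.sh would still write a line with an empty value. A minimal sketch of a guard is shown below. The helper write_secret_line is hypothetical, and whether you want the script to fail in this case is a design choice rather than anything Kestra requires.

```shell
#!/bin/bash
# write_secret_line NAME VALUE FILE (hypothetical helper)
# Writes SECRET_<NAME>=<base64 of VALUE> to FILE, or fails if VALUE is empty.
write_secret_line() {
  local name="$1"
  local value="$2"
  local file="$3"
  if [ -z "$value" ]; then
    echo "error: no value for ${name}; is the Codespaces secret set?" >&2
    return 1
  fi
  printf 'SECRET_%s=%s\n' "$name" "$(printf '%s' "$value" | base64 | tr -d '\n')" > "$file"
}

# Demo with a dummy value written to a temporary file
demo_file="$(mktemp)"
write_secret_line GCP_CREDS '{"name":"Sam","city":"Bangkok"}' "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```

In setup_secrets.sh this could be called as `write_secret_line GCP_CREDS "$GCP_SERVICE_ACCOUNT" .env_encoded`, assuming the secret is named GCP_SERVICE_ACCOUNT.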
