Data Engineer Project Workflow


Data Engineer Pipeline Productionization: High-Level Task Document

Each task is listed as Task: Description.

1. Understand use case and expected input/output: Get an initial understanding of the project and its expected inputs and outputs.
2. Handover checklist: Get the handover checklist from the Data Scientist and review it.
3. Create SR for data source access: Raise an SR for the required data sources. It is better to set up a service account instead of getting individual account access.
4. Run the scripts locally: Run all scripts locally and monitor the execution time.
5. SRs for SSO setup, if it is a chatbot: Required for Cloud Memo approval per ITD.
6. Architecture diagram: High-level diagram showing the flow and the resources used.
7. Data model: Relationships between data elements.
8. Table creation: Snowflake table creation for storing the output.
9. Process flow: ER diagram, flow diagram, etc.
10. Code changes to dump output to Snowflake: Write a function to dump the output into the Snowflake table.
11. Roles and policy creation: Terraform scripts for roles and policy creation.
12. Repo creation for infrastructure (communicates with roles and policy): Terraform scripts through which the infrastructure repo uses the created roles and policies.
13. Raise SR for GitHub Actions setup and repo whitelisting: GitHub Actions setup.
14. Build Docker image and run the code on Docker: Docker image creation and testing the code on Docker.
15. Pipeline scheduler testing: Test the pipeline end to end.

Roles and AWS Infrastructure Deployment

Overview

The diagram describes a workflow for deploying AWS infrastructure using roles and permissions via two separate pipelines. These pipelines handle different responsibilities: one manages roles and policies, and the other deploys the application infrastructure using those roles.


Key Components

1. Repositories & Pipelines

Two repositories are involved: pbc-gitactions-role-management, which manages roles and policies, and pbc-ghg-recommender, which deploys the infrastructure. Both use GitHub Actions for their CI/CD pipelines.

2. Terraform

Both pipelines execute Terraform scripts to define and provision their AWS resources.

Workflow Steps

Step 1: Roles and Policies Deployment
Repository: pbc-gitactions-role-management
  1. Code Commit or Workflow Dispatch
    • A developer pushes changes to the repository, or manually triggers a workflow.
  2. GitHub Actions Pipeline
    • The pipeline runs the configured workflow.
  3. Terraform Execution
    • The workflow uses Terraform scripts to define and create AWS roles and policies.
    • These roles and policies specify what actions various users or services can perform in AWS.
  4. AWS Cloud
    • Terraform, authenticated via pre-created permissions, creates/updates:
      • Roles
      • Permissions
      • Policies
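
As an illustration of what such a Terraform script might contain (all names, account IDs, and policy contents below are assumptions for illustration, not the actual scripts):

```hcl
# Sketch: an IAM role assumable by GitHub Actions via OIDC, with an attached policy.
resource "aws_iam_role" "deploy_role" {
  name = "example-deploy-role"

  # Trust policy: allow GitHub's OIDC provider to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          # Restrict assumption to workflows from a specific repository
          "token.actions.githubusercontent.com:sub" = "repo:my-org/my-repo:*"
        }
      }
    }]
  })
}

# Policy describing what the role is allowed to do in AWS
resource "aws_iam_policy" "deploy_policy" {
  name = "example-deploy-policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:*", "lambda:*"]
      Resource = "*"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "attach" {
  role       = aws_iam_role.deploy_role.name
  policy_arn = aws_iam_policy.deploy_policy.arn
}
```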
Key Point: This pipeline authenticates with pre-created permissions and is the one that creates roles and policies; the roles it creates are what the infrastructure pipeline later assumes.

Step 2: Infrastructure Deployment
Repository: pbc-ghg-recommender
  1. Code Commit or Workflow Dispatch
    • A developer pushes infrastructure code (e.g., for a data quality pipeline), or manually triggers a workflow.
  2. GitHub Actions Pipeline
    • The pipeline runs the configured workflow.
  3. Assume Role for AWS Authentication
    • This pipeline does not have direct permissions to create resources.
    • Instead, it assumes a role created by the first pipeline for authentication.
    • This mechanism is called role assumption in AWS, where an entity temporarily gets the permissions of the assumed role.
  4. Terraform Execution
    • The pipeline uses Terraform scripts to define and create specific AWS infrastructure (e.g., Lambda functions, S3 buckets, etc.).
  5. AWS Cloud
    • Terraform, authenticated via the assumed role, provisions the required infrastructure.
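
The role assumption in step 3 works because the assumed role's trust policy names the pipeline's role as an allowed principal. A minimal Terraform sketch, with illustrative role names and a placeholder account ID:

```hcl
# Sketch: a role whose trust policy permits another IAM role (rather than
# the OIDC provider) to assume it. Names and account ID are assumptions.
resource "aws_iam_role" "infra_role" {
  name = "example-infra-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        # The role used by the first pipeline is trusted to assume this one
        AWS = "arn:aws:iam::123456789012:role/example-pipeline-role"
      }
      Action = "sts:AssumeRole"
    }]
  })
}
```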
Key Point: This pipeline has no direct permissions of its own; everything it provisions happens under the role created by the first pipeline.
GitHub Actions YAML file for reference:
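
The original YAML file is not reproduced here; the following is a minimal sketch of what such a workflow could look like, assuming the aws-actions/configure-aws-credentials action, a us-east-1 region, and the role names and account IDs described below. Workflow and job names are assumptions.

```yaml
name: role-management-deploy

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # enables GitHub's OIDC integration with AWS
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Dynamic account selection: prod account on main, dev account elsewhere
      account: ${{ github.ref == 'refs/heads/main' && '992382849580' || '301691044089' }}
    steps:
      - uses: actions/checkout@v4

      # First role assumption via OIDC (no long-term AWS secrets stored)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ env.account }}:role/pbc-gitactions-role-management-role
          aws-region: us-east-1

      # Role hopping: use the first role's credentials to assume a role
      # in a different account (centralized permissions management)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::022592466027:role/aws-oidc-legacy-role-2
          role-chaining: true
          aws-region: us-east-1

      # Verify successful authentication
      - run: aws sts get-caller-identity
```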

The AWS authentication part in more detail:

How It Works
  1. Initial OIDC Authentication:
    • Uses GitHub’s OIDC (OpenID Connect) integration with AWS
    • The permissions: id-token: write at the top of the workflow enables this
    • No long-term AWS credentials need to be stored as secrets
  2. Dynamic Account Selection:
    • ${{ env.account }} resolves to:
      • 992382849580 when running on main branch (prod)
      • 301691044089 when running on other branches (dev)
  3. First Role Assumption:
    • Assumes the role pbc-gitactions-role-management-role in the selected account
  4. Role Hopping (Cross-Account Access):
    • After assuming the first role, it then assumes a second role
    • This second role aws-oidc-legacy-role-2 is in a different AWS account (022592466027)
    • This is a common pattern for centralized permissions management
  5. Result:
    • After this step completes, all subsequent AWS operations (like Terraform commands) will use the permissions of the final role
    • The workflow verifies successful authentication in the next step using aws sts get-caller-identity

This approach improves security by:
  • Eliminating long-term AWS credentials stored as GitHub secrets; authentication uses short-lived OIDC tokens.
  • Centralizing permissions management in a single account via the role-hopping pattern.
  • Limiting each workflow to the permissions of the role it assumes.


Workflow Interactions

The roles-and-policies pipeline must run first: the infrastructure pipeline cannot authenticate until the role it assumes exists. After that, the two pipelines run independently, with the infrastructure pipeline assuming the role on every run.


Security Model

Neither pipeline holds long-term AWS credentials. The role-management pipeline authenticates via OIDC with pre-created permissions; the infrastructure pipeline only ever acts under an assumed role, so its access is limited to what that role's policies allow.


Summary Table

Pipeline | Purpose | Trigger | AWS Access Method | Terraform Action
pbc-gitactions-role-management | Manage roles & policies | Push/Dispatch | Pre-created permissions | Create roles & policies
pbc-warrant-data-quality | Deploy infrastructure | Push/Dispatch | Assume role (from pipeline 1) | Create specific infrastructure

Illustrative Flow

  1. Roles & Policies Pipeline:
    • Runs → Creates/updates roles and policies in AWS via Terraform.
  2. Infrastructure Pipeline:
    • Runs after roles are in place → Assumes a role → Provisions infrastructure via Terraform.

Benefits of This Workflow

  • Separation of duties: role management and infrastructure deployment live in separate repositories and pipelines.
  • No long-term AWS credentials stored in GitHub; OIDC provides short-lived tokens.
  • Roles and policies are managed centrally and reused by infrastructure pipelines through role assumption.
  • All changes are version-controlled and applied through Terraform, making deployments repeatable.