Deploy your data pipelines with GitHub Actions

Automate, customize, and execute your software development workflows right in your repository with GitHub Actions. You can discover, create, and share actions to perform any job you'd like, including CI/CD, and combine actions in a completely customized workflow.

GitHub Actions is a powerful tool for automating software development workflows, and it can also be used to automate data pipeline processes. In this post, we will walk through an example of using GitHub Actions to automate a data pipeline for a simple data analysis project.

The first step in setting up a data pipeline with GitHub Actions is to create a new repository for your project. Once you have a repository, you can create a new workflow by adding a YAML file to the .github/workflows directory (for example, .github/workflows/data-pipeline.yml).

Here's an example workflow file that runs a data pipeline using Python and pandas:

name: Data Pipeline

on:
  push:
    branches:
      - main

jobs:
  data-pipeline:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: 3.8

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pandas

    - name: Run data pipeline
      run: |
        python data_pipeline.py

This workflow will run when code is pushed to the main branch of your repository. The workflow starts by checking out the code from the repository, then sets up a Python environment with version 3.8 and installs the dependencies needed for the pipeline. The last step runs the data_pipeline.py script.

Here's an example of the data_pipeline.py script, which uses pandas to process a CSV file and write the results to another file.

import pandas as pd

def main():
    # read data from input file
    df = pd.read_csv('input.csv')
    # process data
    df['new_column'] = df['column1'] + df['column2']
    # write data to output file
    df.to_csv('output.csv', index=False)

if __name__ == '__main__':
    main()

This script reads data from an input file called input.csv, performs some processing on the data using pandas, and then writes the results to an output file called output.csv.

Once your workflow and script are set up, you can push the code to the main branch of your repository and see the workflow run automatically. You can also view the logs for each step of the workflow to troubleshoot any issues that may arise.

  • Input and Output files: In the example above, the data_pipeline.py script reads data from an input file called input.csv and writes the results to an output file called output.csv. In a real-world scenario, you might need to read data from multiple files, or write data to a database or a cloud storage service. You can adjust the script accordingly and use the appropriate library to read and write data from different sources.
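
As an illustration, here is a minimal sketch of a variant of data_pipeline.py that combines several CSV files and writes the result to a local SQLite database. The data/ directory, the column names, and the output.db filename are assumptions made for this example:

import glob
import sqlite3

import pandas as pd

def main():
    # read and combine every CSV file in the (hypothetical) data/ directory
    frames = [pd.read_csv(path) for path in glob.glob('data/*.csv')]
    df = pd.concat(frames, ignore_index=True)

    # process data (same derived column as the original example)
    df['new_column'] = df['column1'] + df['column2']

    # write the results to a local SQLite database instead of a CSV file
    with sqlite3.connect('output.db') as conn:
        df.to_sql('results', conn, if_exists='replace', index=False)

if __name__ == '__main__':
    main()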

  • Environment Variables: In some cases, you might need to pass sensitive information (e.g. database credentials or API keys) to your script. Instead of hardcoding these values in the script, you can pass them securely as environment variables. Define them in your GitHub Actions workflow file, backed by repository secrets, and then read them in your script through os.environ from Python's os module.

    - name: Run data pipeline
      env:
        DB_USER: ${{ secrets.DB_USER }}
        DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
        API_KEY: ${{ secrets.API_KEY }}
      run: |
        python data_pipeline.py

The script can then read these values from the environment:

import os

def main():
    db_user = os.environ['DB_USER']
    db_password = os.environ['DB_PASSWORD']
    api_key = os.environ['API_KEY']
    # use the credentials to connect to the database
    # or use the api_key to make requests

if __name__ == '__main__':
    main()

  • Dependency Management: In the example above, the workflow installs the Python dependencies for the pipeline using pip. In some cases, you might also need to install system-level dependencies or use a different package manager. GitHub Actions lets you run arbitrary setup commands on the runner, and the GitHub Marketplace offers setup actions for many languages and package managers.
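
For instance, here is a minimal sketch of an install step that adds a system-level package with apt before installing Python packages from a requirements file. The libpq-dev package and the requirements.txt file are assumptions made for this example:

    - name: Install dependencies
      run: |
        sudo apt-get update
        # hypothetical system-level dependency (PostgreSQL client headers)
        sudo apt-get install -y libpq-dev
        python -m pip install --upgrade pip
        # assumes the repository contains a requirements.txt listing pandas etc.
        pip install -r requirements.txt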

  • Parallelization: One of the advantages of GitHub Actions is that jobs in a workflow run in parallel by default. This is useful when parts of your pipeline don't depend on each other. For example, you can split the pipeline into separate jobs, one that reads data, one that processes the data, and one that writes the results, and let the independent ones run side by side; jobs that do depend on each other can be chained together and their results combined in a final step (see the sketch after the example below).

jobs:
  read-data:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Read data
      run: |
        python read_data.py

  process-data:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Process data
      run: |
        python process_data.py

  write-data:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Write data
      run: |
        python write_data.py
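
If jobs do depend on each other, you can chain them with needs and pass files between them as artifacts, then combine everything at the end. Here is a minimal sketch of that pattern; the processed.csv filename and the behaviour of the scripts are assumptions made for this example:

jobs:
  process-data:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Process data
      run: |
        python process_data.py   # assumed to write processed.csv
    - name: Upload intermediate result
      uses: actions/upload-artifact@v2
      with:
        name: processed-data
        path: processed.csv

  write-data:
    runs-on: ubuntu-latest
    needs: process-data          # runs only after process-data succeeds
    steps:
    - uses: actions/checkout@v2
    - name: Download intermediate result
      uses: actions/download-artifact@v2
      with:
        name: processed-data
    - name: Write data
      run: |
        python write_data.py     # assumed to read processed.csv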

With GitHub Actions, you can easily automate data pipeline processes and take advantage of powerful GitHub features, such as version control and collaboration, to streamline your data analysis workflows. I hope this additional information and these examples help you better understand how to use GitHub Actions with data pipelines. Remember that this is a basic example; you can adjust it to your needs and add more complexity to your pipeline.
