Data Engineering and Data Science Bulletproof using Git and GitHub

Flávio Brito
13 min read · Aug 5, 2021

Introduction

GitHub is an online, language-agnostic platform built on Git that tracks and stores a project’s code and data in public or private repositories.

Before the version control era, we kept multiple complete copies of the same project directory as backups, with names like version_01 or ver_01.10. It looked like a mess, and it really was. As developers we still needed a way to track changes, so we used a simple change file or put comments in the code. When something went wrong, we had to search through many folders to figure out the differences. Needless to say, many hours were spent redoing tasks.

Over the years, version control solutions were created to help developers manage project code better. Git became popular when a well-developed portal built a set of features that helped the open source community develop and share code. Inspired by the bazaar idea, anybody could download a project and start collaborating.

The idea of this article is to show you how Git and GitHub work together, and to give you some ideas to keep in your utility belt. Let’s rock!

Install GH CLI

There is a visual way to create a remote repository on GitHub (GH).

To do it, log in to your GH account and click the (+) at the top right of the GH page, as shown in the following figure.

As a data person, you will probably be curious about how to do things from the command line. Let’s talk about how to do it.

The other way is to use the GH CLI, which you need to install first. On a Mac, you can download the client from the https://cli.github.com/ page or run:

brew install gh

Authentication

After installing it, authenticate by running:

gh auth login
? What account do you want to log into? [Use arrows to move, type to filter]
> GitHub.com
GitHub Enterprise Server

GitHub.com is the default, so you can just hit <ENTER>.

Next, the GH CLI asks which protocol you prefer for Git operations. Let’s use HTTPS.

? What is your preferred protocol for Git operations?  [Use arrows to move, type to filter]
> HTTPS
SSH

Now you need to choose how to provide your credentials.

? How would you like to authenticate GitHub CLI?  [Use arrows to move, type to filter]
> Login with a web browser
Paste an authentication token

Let’s accept the default, “Login with a web browser”

! First copy your one-time code: E2**-****
- Press Enter to open github.com in your browser...

Copy the code to the clipboard and hit <ENTER>.

Your default browser will open and ask you to enter the code.

Paste the code and click <Continue>.

After that, a new page appears asking you to authorize GitHub CLI. You can review which permissions you are granting. In this case we can keep all the defaults.

In the following step you confirm the access by typing your GH account password.

Congratulations, your device is now recognized by GH.

Check your terminal for this message:

✓ Authentication complete. Press Enter to continue...
- gh config set -h github.com git_protocol https
✓ Configured git protocol
✓ Logged in as flaviobrito

Let’s Start!

Create a Remote Repository

gh repo create data_eng

You can decide whether this repository will be public, private, or internal. For the purposes of this article I will keep the default, public.

? Visibility  [Use arrows to move, type to filter]
> Public
Private
Internal

Now GH asks whether you want to add files such as a .gitignore or a license.

I will keep the default (N).

? Would you like to add a .gitignore? (y/N)
? Would you like to add a license? (y/N)

Now we can add origin git remote to our local repository.

? This will add an "origin" git remote to your local repository. Continue? (Y/n)

Done!

✓ Created repository flaviobrito/data_eng on GitHub
✓ Added remote <https://github.com/flaviobrito/data_eng.git>

This is the starting point of every project: a remote repository. Yes, you can version everything without creating a remote repository, but keeping only a local copy of everything you are doing is not good practice. If you want to keep your project private at first and open it to the community later, that is also possible. Whether you open it or keep it private, the most important thing is to have a remote copy, so that your work survives if something happens to your machine.

Check the Remote Repository

Before you start working with the remote, it is good to test whether your machine can see the remote repository. To do it, run:

git remote show origin

The following outcome is expected:

* remote origin
Fetch URL: <https://github.com/flaviobrito/data_eng.git>
Push URL: <https://github.com/flaviobrito/data_eng.git>
HEAD branch: (unknown)

Manage Repositories on GitHub

GitHub checks your credentials every time you try to write to the remote project. Before you start issuing git commands, you need to identify yourself to git:

$ git config --global user.name "YOUR NAME"
$ git config --global user.email <your email>

Your name needs to be in double quotes ("), and your email is typed without the <> symbols.
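The same settings can also be stored per repository instead of globally, which is handy when one machine is used for both work and personal projects. The following is a minimal sketch, assuming a throwaway repository created with mktemp and the placeholder identity "Example User"; neither comes from the article.

```shell
set -eu

# Hypothetical scratch repository; any existing repository works the same way.
repo="$(mktemp -d)"
git init -q "$repo"

# --local writes to .git/config of this repository only,
# leaving the global ~/.gitconfig untouched.
git -C "$repo" config --local user.name "Example User"
git -C "$repo" config --local user.email "user@example.com"

# Read the values back to confirm what git will record in commits made here.
configured_name="$(git -C "$repo" config --get user.name)"
configured_email="$(git -C "$repo" config --get user.email)"
echo "$configured_name <$configured_email>"
```

A local value takes precedence over the global one, so commits made in that repository carry the repository-specific identity.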

Edit Repository Config

It’s possible to edit the repository configuration by running:

git config -e

A text editor such as vim opens with content like this:

[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = <https://github.com/flaviobrito/data_eng.git>
fetch = +refs/heads/*:refs/remotes/origin/*
~
~
~

To list all the settings that apply to the repository (system, global, and local), use:

git config -l

The output looks something like this:

core.excludesfile=~/.gitignore
core.legacyheaders=false
core.quotepath=false
mergetool.keepbackup=true
push.default=simple
color.ui=auto
color.interactive=auto
repack.usedeltabaseoffset=true
alias.s=status
alias.a=!git add . && git status
alias.au=!git add -u . && git status
alias.aa=!git add . && git add -u . && git status
alias.c=commit
alias.cm=commit -m
alias.ca=commit --amend
alias.ac=!git add . && git commit
alias.acm=!git add . && git commit -m
alias.l=log --graph --all --pretty=format:'%C(yellow)%h%C(cyan)%d%Creset %s %C(white)- %an, %ar%Creset'
alias.ll=log --stat --abbrev-commit
alias.lg=log --color --graph --pretty=format:'%C(bold white)%h%Creset -%C(bold green)%d%Creset %s %C(bold green)(%cr)%Creset %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative
alias.llg=log --color --graph --pretty=format:'%C(bold white)%H %d%Creset%n%s%n%+b%C(bold blue)%an <%ae>%Creset %C(bold green)%cr (%ci)' --abbrev-commit
alias.d=diff

As you can see, it is possible to work with aliases in git, as in the next example.

The command to check the repository status is

git status

or, following the alias list (alias.s=status):

git s
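The listing above shows aliases that already exist, but not how one is defined. Here is a minimal sketch, assuming a scratch repository and the --local scope so that nothing in your real configuration changes; replace --local with --global to make the aliases available in every repository.

```shell
set -eu

# Hypothetical scratch repository so the example does not touch ~/.gitconfig.
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# An alias is just a config entry named alias.<shortcut>.
git config --local alias.s status
git config --local alias.cm "commit -m"

# Read one back, then use it: "git s --short" runs "git status --short".
alias_value="$(git config --get alias.s)"
echo "alias.s = $alias_value"
git s --short
```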

Git stores the user name and e-mail in a .gitconfig file in your home folder. To check it, open the file with vi or another text editor.

vi ~/.gitconfig

Ignoring Unwanted Files Globally

Git checks the .gitignore file inside the repository to find out which directories or files you want to ignore. You can also ignore files globally. To do it, create a file named .gitignore_global in your home directory:

vi ~/.gitignore_global

with the contents:

*~
.*.swp
.DS_Store

To apply this new setting, run

git config --global core.excludesfile ~/.gitignore_global
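To confirm that the patterns behave as intended, git check-ignore reports which paths are ignored. This is a sketch under stated assumptions: the excludes file lives in a temporary directory and is wired in with --local rather than --global, and the file names notes.txt~ and analysis.py are made up for the demonstration.

```shell
set -eu

work="$(mktemp -d)"

# Same three patterns as above, written to a scratch excludes file.
printf '%s\n' '*~' '.*.swp' '.DS_Store' > "$work/gitignore_global"

git init -q "$work/repo"
cd "$work/repo"

# Assumption: --local keeps the experiment out of your real global config.
git config --local core.excludesfile "$work/gitignore_global"

# Two files that should be ignored and one that should not.
touch .DS_Store 'notes.txt~' analysis.py

# check-ignore prints only the paths that match an ignore pattern.
ignored="$(git check-ignore .DS_Store 'notes.txt~' analysis.py || true)"
echo "$ignored"

# Only the non-ignored file shows up as untracked.
git status --short
```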

Validate Important Commands Rights

Create a very simple file named README.md with a simple markdown line like:

echo "# data_eng" >> README.md

Initialize Git Repository Locally

Initialize your repository if you haven’t already done so. From this point git starts tracking the changes locally.

git init

Add a File to Be Versioned

Doing this stages the file, so that git records its current content in the next commit.

git add README.md
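Staging alone does not record anything in the history; a commit is still needed, and the later sections rely on one existing. The steps so far, plus the commit, can be sketched as follows, assuming a scratch directory and a placeholder identity. The message "first commit" matches the one that appears in the log output later in the article.

```shell
set -eu

# Hypothetical scratch directory standing in for the data_eng project folder.
repo="$(mktemp -d)"
cd "$repo"

git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "# data_eng" >> README.md
git add README.md

# Record the staged snapshot in the local history.
git commit -q -m "first commit"

commit_count="$(git rev-list --count HEAD)"
last_subject="$(git log -1 --pretty=format:%s)"
echo "$commit_count commit(s); last: $last_subject"
```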

Branches

It is nice to start off on the right foot with Git. A git branch brings to mind a tree: you can derive as many branches as you need, each one pointing to a snapshot of your changes.

With branches you can work on your changes without waiting for another developer. Other features, such as releases, can also be used to expand on this.

The list of branches can be seen using:

git branch

flaviobrito@brito data_eng % git branch
* master
flaviobrito@brito data_eng %

This is the same as git branch --list.

The outcome in this case is:

* master

Rename a Branch

It’s important to know how to rename a branch. Sometimes we need to change the name because of a typo, for example. Let’s rename the current branch (master) to main using the -M option, which is equivalent to --move --force. According to the git documentation, this moves/renames a branch and the corresponding reflog.

git branch -M main
git branch

The outcome will be:

* main

The master branch was renamed to main.

The Repo Structure and Files

A list of all directories and files is a good way to follow a complex project and to document it outside of git. The ls-tree command, followed by the branch name (main in this case), prints the tree; with --name-only it prints only the names.

git ls-tree -r main --name-only
README.md

In our case there is only one file. A more complex structure was created to demonstrate the tree:

git ls-tree -r BI-2316-Bigquery_Uploader --name-only
Makefile
README.md
common/__init__.py
config/__init__.py
db/__init__.py
module/__init__.py
tests/__init__.py

Create a Branch

A best practice is to commit to the main (master) branch only code that has already been tested and approved for production. Until then, it is better to work in another branch to avoid introducing unapproved or untested features, or a bug.

Before creating a new branch, let’s talk about how to name it. Avoid branch names made only of a mnemonic, a code, or numbers. I recommend the following syntax:

<JIRA TICKET NUMBER>-<TITLE>

BI-2316-Bigquery_Uploader

where BI-2316 is the ticket number in Jira, where you can easily find more information about the features or issues related to this branch. An even better practice is to add a short title. It is important to know that the name cannot contain spaces between words.

Avoid titles like these:

new-pre-merge
master-new-last
BI-1371

Names like these cause confusion and force people to search for more information to find out what they mean.

git checkout -b BI-2316-Bigquery_Uploader

In this case, git creates the new branch and switches to it, so it becomes the current branch.

Listing the branches

* BI-2316-Bigquery_Uploader
main

The current branch is now marked with an asterisk (*).

To see local and remote branches you need the --all option:

git branch --all
* BI-2316-Bigquery_Uploader
main
remotes/origin/BI-2316-Bigquery_Uploader
remotes/origin/main

Create a Branch From Another Branch

Creating a new branch from another one is useful in special cases, such as a test merge. To do it, run:

Syntax: git branch new old

git branch BI-2316-Bigquery_Uploader_GCS BI-2316-Bigquery_Uploader

Change to another branch

To move to another branch, run git checkout followed by the name of the target branch, like:

git checkout main

In this case the current branch will be main.
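The branch commands from the last sections can be tried end to end in a scratch repository. This is only a sketch: the temporary directory and the placeholder identity are assumptions, while the branch names are the ones used in the article.

```shell
set -eu

repo="$(mktemp -d)"          # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main           # make sure the first branch is called main

# Create a branch and switch to it in one step.
git checkout -q -b BI-2316-Bigquery_Uploader

# Create a second branch from the first one without switching to it.
git branch BI-2316-Bigquery_Uploader_GCS BI-2316-Bigquery_Uploader

# Go back to main.
git checkout -q main

current="$(git rev-parse --abbrev-ref HEAD)"
branches="$(git for-each-ref --format='%(refname:short)' refs/heads)"
echo "current: $current"
echo "$branches"
```

Since no commits were made on the new branches, all three names point at the same commit; a branch is only a movable label until work is committed on it.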

How Do I Clean Outdated Branches?

In a corporate environment it is good practice to remove a branch after it has been merged successfully. The fetch command below fetches from all remotes and, with --prune, removes remote-tracking references (such as origin/branch_name) whose branches no longer exist on the remote. Your local branches are not touched, so local work is safe.

git fetch --all --prune
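The effect of --prune is easiest to see with a remote you control. The sketch below assumes a local bare repository standing in for GitHub and a made-up branch name, old-feature; it deletes the branch on the "remote" and shows that the stale origin/old-feature reference disappears only after the pruning fetch.

```shell
set -eu

work="$(mktemp -d)"

# Assumption: a local bare repository plays the role of the GitHub remote.
git init -q --bare "$work/remote.git"

git init -q "$work/local"
cd "$work/local"
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Publish a second branch (hypothetical name) from the same commit.
git push -q origin main:old-feature

# Simulate a teammate deleting the branch on the remote side.
git -C "$work/remote.git" branch -q -D old-feature

before="$(git for-each-ref --format='%(refname:short)' refs/remotes)"
git fetch -q --all --prune
after="$(git for-each-ref --format='%(refname:short)' refs/remotes)"

echo "before prune:"; echo "$before"
echo "after prune:";  echo "$after"
```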

Delete a Branch

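It is possible to delete a branch by running git branch with -D and the branch name. The -D option forces the deletion even if the branch has not been merged; the lowercase -d refuses to delete unmerged work. You cannot delete the branch you are currently on.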

git branch -D branch_name

To delete the branch BI-2316-Bigquery_Uploader you first need to switch to another branch, main in this case. Only then is it possible to delete BI-2316-Bigquery_Uploader.

git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
git branch -D BI-2316-Bigquery_Uploader
Deleted branch BI-2316-Bigquery_Uploader (was e9cf5ec).
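Because -D discards unmerged commits without asking, it can be worth trying the safer lowercase -d first. A small sketch of the difference, assuming a scratch repository and a hypothetical branch called experiment that holds one unmerged commit:

```shell
set -eu

repo="$(mktemp -d)"          # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# A branch with work that was never merged into main.
git checkout -q -b experiment
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "unmerged work"
git checkout -q main

# -d refuses because the branch is not fully merged.
if git branch -d experiment 2>/dev/null; then
  safe_delete="succeeded"
else
  safe_delete="refused"
fi
echo "git branch -d: $safe_delete"

# -D deletes it anyway; the unmerged commit is no longer reachable by name.
git branch -q -D experiment
remaining="$(git for-each-ref --format='%(refname:short)' refs/heads)"
echo "remaining branches: $remaining"
```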

Push the Changes

We already added a file and committed it, but so far all changes are still local. To push them to the remote branch:

Syntax:

git push <remote> <branch>

git push -u origin main

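where -u (--set-upstream) makes origin/main the upstream of the local main branch, so that later you can run git push and git pull without arguments.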
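What -u records can be inspected after the push. The following sketch assumes a local bare repository standing in for GitHub, so it runs without network access; the paths and identity are placeholders.

```shell
set -eu

work="$(mktemp -d)"
git init -q --bare "$work/remote.git"   # assumption: stand-in for GitHub

git init -q "$work/local"
cd "$work/local"
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main
git remote add origin "$work/remote.git"

# Push and remember origin/main as the upstream of main.
git push -q -u origin main

# Ask git which upstream it recorded for main.
upstream="$(git rev-parse --abbrev-ref 'main@{upstream}')"
echo "upstream of main: $upstream"

# The short status header shows the local branch and its upstream.
git status -sb
```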

Always check the Status

Before you start working in a branch, it is good to check the status:

git status

the outcome is

On branch main
Your branch is up to date with 'origin/main'.

Pull

To bring the changes from the remote repository into your local repository, use the git pull command.

Syntax: git pull remote_name branch_name

git pull origin main
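A pull only has something to do when the remote is ahead of the local copy. The sketch below sets that situation up with two working copies sharing one local bare repository; the directory names first_copy and second_copy are made up, and --ff-only is added as a cautious choice so the pull refuses to create a merge when histories have diverged.

```shell
set -eu

work="$(mktemp -d)"
git init -q --bare "$work/remote.git"   # assumption: stand-in for GitHub
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# First working copy publishes the initial commit.
git init -q "$work/first_copy"
cd "$work/first_copy"
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Second working copy starts from the remote.
git clone -q "$work/remote.git" "$work/second_copy"

# The first copy pushes one more commit.
echo "More documentation" >> README.md
git commit -q -am "structure"
git push -q origin main

# The second copy is now one commit behind; pull brings it up to date.
cd "$work/second_copy"
commits_before="$(git rev-list --count HEAD)"
git pull -q --ff-only origin main
commits_after="$(git rev-list --count HEAD)"
echo "commits before pull: $commits_before, after pull: $commits_after"
```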

Merge

After working in a branch and testing everything, it’s time to merge the new features from master into the branch you are working on.

Merge a Master/Main to a Branch

You can do:

git checkout BI-2316-Bigquery_Uploader
git pull
git merge main

or

git merge origin/BI-2316-Bigquery_Uploader main

In this case all changes in main are merged into BI-2316-Bigquery_Uploader. This creates a “merge commit” (unless git can fast-forward). Merging is nice because it’s a non-destructive operation, and a merge that stops with conflicts can be aborted; see Undo a Merge below.

Merge a Branch to Master/Main

git checkout main
git pull
git merge origin/BI-2316-Bigquery_Uploader
git push -u origin main

To test a merge before committing it, add --no-commit; the --no-ff option prevents a fast-forward, so a merge commit is always created:

git merge --no-ff --no-commit BI-2316-Bigquery_Uploader
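A test merge with --no-ff --no-commit leaves the result staged but uncommitted, which gives you a chance to run tests before concluding it. A minimal sketch, assuming a scratch repository, a one-line Makefile as the branch's change, and a made-up merge message:

```shell
set -eu

repo="$(mktemp -d)"          # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Work done on the feature branch.
git checkout -q -b BI-2316-Bigquery_Uploader
echo "all:" > Makefile
git add Makefile
git commit -q -m "structure"
git checkout -q main

# Merge, but stop before creating the merge commit.
git merge -q --no-ff --no-commit BI-2316-Bigquery_Uploader

# MERGE_HEAD exists while a merge is in progress.
if git rev-parse -q --verify MERGE_HEAD >/dev/null; then
  merge_pending="yes"
else
  merge_pending="no"
fi
staged="$(git diff --cached --name-only)"
echo "merge pending: $merge_pending; staged: $staged"

# This is the point to run your tests. If they pass, conclude the merge;
# if they fail, "git merge --abort" restores the previous state instead.
git commit -q -m "Merge BI-2316-Bigquery_Uploader into main"
merge_commits="$(git rev-list --count --merges HEAD)"
echo "merge commits on main: $merge_commits"
```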

What Changed?

Always check the log. To see the history of all changes, run:

git log

the outcome is

commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

Relax! The commit hashes (such as e9cf5ec40c683fd38316ca589ee4c46088ea743a) will be different in your case.

Do you want more details? Try adding --stat:

git log --stat
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

 Makefile | 27 +++++++++++++++++++++++++++
 common/__init__.py | 0
 config/__init__.py | 0
 db/__init__.py | 0
 module/__init__.py | 0
 tests/__init__.py | 0
 6 files changed, 27 insertions(+)

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

 README.md | 1 +
 1 file changed, 1 insertion(+)

A short log version

git shortlog
flaviobrito (2):
first commit
structure

Or a graph version

git log --graph --oneline --decorate
* e8b2e09 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader) structure
* e9cf5ec (origin/main) first commit

Filter Log

In a long project, a simple log check can produce a huge output.

To see only the last 3 commits, use the -n option. In our case the outcome is the same, because the repository has only two commits.

git log -n 3

commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

Filter by Date

git log --after="2021-7-1"

the outcome is

commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

Filter by Author

git log --author="flaviobrito"

the outcome is

commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

Filter by File

It’s important to check what happened to a particular file or a list of files. To do it you can add -- and a list of files.

git log -- README.md

the outcome is

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200
first commit

or

git log --follow --all -p .

where . is the current directory and -p shows the patch introduced by each commit.

Log What Was Merged or Not

There are two options to separate merge commits from regular commits. To hide merge commits:

git log --no-merges

and to show only merge commits:

git log --merges

Troubleshooting

Can’t push to GitHub because of large file which I already deleted

This happened to me when a large file was generated without my noticing. After I tried another push, I received a message telling me that every push was blocked. Removing the file after the failed push does not fix it, because git has a great memory; in that case I also had to remove the file from the git history.

The same approach can be used to avoid keeping a huge list of unwanted files in the git repository.

Suppose you are reading 200MB of JSON files that you don’t want git to track, and you didn’t put the data/requests folder into .gitignore. Bam! Git tracks them. Here is how to remove them from the git index (cache):

git rm --cached data/requests/2015/* -r

The -r option is recursive: all subdirectories are removed from the index. This is not a physical removal; the files stay on disk and only stop being tracked by git.
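Untracking is usually paired with a .gitignore entry, otherwise the next git add brings the files straight back. A sketch of the full sequence, assuming a scratch repository with a single small JSON file in place of the real 200MB dump:

```shell
set -eu

repo="$(mktemp -d)"          # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# A small stand-in for the large request dumps, committed by mistake.
mkdir -p data/requests/2015
echo '{"id": 1}' > data/requests/2015/a.json
echo "# data_eng" > README.md
git add .
git commit -q -m "first commit"

# Stop tracking the folder; the files stay on disk.
git rm -r -q --cached data/requests

# Ignore the folder from now on and record both changes.
echo "data/requests/" >> .gitignore
git add .gitignore
git commit -q -m "Stop tracking request dumps"

tracked="$(git ls-files)"
echo "tracked files:"; echo "$tracked"
ls data/requests/2015
```

Note that this only affects new commits. The file is still present in the earlier commit, so a file that is too large to push also has to be removed from the history, as mentioned above.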

Undo a Merge

A wrong merge can cause a lot of work if we didn’t check for conflicts beforehand. A piece of advice to keep things simple: don’t wait too long to merge into master, so that merges stay small. If you do end up with a big merge, a best practice is to create a new branch, merge it with the target branch there, and only when the result is clean repeat the merge in master. If the merge stops with many conflicts, you can undo it by running:

git merge --abort
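To see the abort in action, the sketch below provokes a conflict on purpose: both branches change the same line of README.md in a scratch repository. The file contents and commit messages are made up for the demonstration.

```shell
set -eu

repo="$(mktemp -d)"          # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Both branches edit the same line, which guarantees a conflict.
git checkout -q -b BI-2316-Bigquery_Uploader
echo "# data_eng uploader" > README.md
git commit -q -am "branch edit"
git checkout -q main
echo "# data_eng main" > README.md
git commit -q -am "main edit"

head_before="$(git rev-parse HEAD)"

# The merge stops with a conflict and returns a non-zero exit code.
if git merge -q BI-2316-Bigquery_Uploader >/dev/null 2>&1; then
  merge_result="clean"
else
  merge_result="conflict"
fi
echo "merge result: $merge_result"

# Undo the half-finished merge and go back to the state before it.
git merge --abort

head_after="$(git rev-parse HEAD)"
git status --short
```

The abort only works while the merge is still in progress, that is, before the merge commit has been created.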

Summary

In this article, we took a focused look at the daily use of git and GitHub and at how it can improve your productivity.

You can check the author’s GitHub repositories for code, ideas, and resources in data science and data engineering. Please feel free to add me on LinkedIn or follow me on Twitter.
