Bulletproof Data Engineering and Data Science Using Git and GitHub
Introduction
GitHub is an online, language-agnostic platform built on git that tracks and stores project code and data in private or public repositories.
Before the version control era, we kept multiple backups, each a complete copy of the same project directory, with names like version_01 and ver_01.10. It sounds like a mess, and it really was. As developers we still needed a way to control the changes, so we used a simple file or put comments in the code. When something went wrong, we had to search through many folders to figure out the differences. Needless to say, many hours were spent redoing tasks.
Over the years, other version control solutions were created to help developers manage project code better. Git started to become popular when a well-developed portal built a set of features to help the open source community develop and share code. Inspired by the bazaar idea, anybody could download a project and start collaborating.
The idea of this article is to show you how git and GitHub work together, and to give you some ideas to keep in your utility belt. Let’s rock!
Install GH CLI
There is a visual way to create a remote repository in GH.
To do it, log in to your GH account and click the (+) at the top right of the GH page, as shown in the following figure.
As a data guy, you will probably be curious about how to do things from the command line. Let’s talk about how to do it.
The other way is to use the GH CLI, which you need to install first. On a Mac, you can download the client from the https://cli.github.com/ page or run:
brew install gh
Authentication
After installing it, authenticate by running:
gh auth login
? What account do you want to log into? [Use arrows to move, type to filter]
> GitHub.com
GitHub Enterprise Server
GitHub.com is the default, so you can just hit <ENTER>.
After that, GH CLI will ask which protocol you prefer for git operations. Let’s use HTTPS.
? What is your preferred protocol for Git operations? [Use arrows to move, type to filter]
> HTTPS
SSH
Now you need to choose how to provide your credentials.
? How would you like to authenticate GitHub CLI? [Use arrows to move, type to filter]
> Login with a web browser
Paste an authentication token
Let’s accept the default, “Login with a web browser”.
! First copy your one-time code: E2**-****
- Press Enter to open github.com in your browser...
Copy the code to the clipboard and hit <ENTER>.
Your default browser will open and ask you to enter the code.
Paste the code and click <Continue>.
After that, a new page will ask you to Authorize GitHub CLI. You can review which services it will be allowed to access. In this case we can keep all the defaults.
In the following step you need to confirm the access by typing your GH account password.
Congratulations, your device is now recognized by GH.
Check in your terminal that you received this message:
✓ Authentication complete. Press Enter to continue...
- gh config set -h github.com git_protocol https
✓ Configured git protocol
✓ Logged in as flaviobrito
Let’s Start!
Create a Remote Repository
gh repo create data_eng
You can decide whether this repository will be public, private or internal. For the purposes of this article I will keep the default, public.
? Visibility [Use arrows to move, type to filter]
> Public
Private
Internal
Now GH will ask whether you want to add files such as a .gitignore or a license.
I will keep the default (N).
? Would you like to add a .gitignore? (y/N)
? Would you like to add a license? (y/N)
Now we can add the origin git remote to our local repository.
? This will add an "origin" git remote to your local repository. Continue? (Y/n)
Done!
✓ Created repository flaviobrito/data_eng on GitHub
✓ Added remote https://github.com/flaviobrito/data_eng.git
A remote repository is the starting point of every project. Yes, you can version everything without creating one, but keeping only a local copy of everything you are doing is not a good practice. It is also possible to keep your project private at first and open it to the community later. Whether you open it or keep it private, the most important thing is to have a remote copy, so that your work is still there tomorrow.
Check the Remote Repository
Before you start working remotely, it is good to test whether your machine can see the remote repository. To do it, run:
git remote show origin
The following outcome is expected:
* remote origin
Fetch URL: https://github.com/flaviobrito/data_eng.git
Push URL: https://github.com/flaviobrito/data_eng.git
HEAD branch: (unknown)
Manage Repositories on GitHub
GitHub checks your credentials every time you try to write to the remote project. Before you start running git commands, you also need to identify yourself to git; this name and email are recorded in every commit you make.
$ git config --global user.name "YOUR NAME"
$ git config --global user.email <your email>
Your name needs to be in double quotes (") and your email is typed without the <> symbols.
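As a minimal sketch of the same settings scoped to a single repository: the lines below assume a throwaway repository and the placeholder identity Jane Doe / jane@example.com, neither of which comes from the article. Leaving out --global writes the values to the repository's own .git/config instead of the file in your home folder.

```shell
# Work in a throwaway repository so the global configuration is untouched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Without --global, the values are stored in this repository's .git/config.
git config user.name "Jane Doe"          # placeholder name
git config user.email jane@example.com   # placeholder email, typed without <>

# Read the values back to confirm what git will record in commits.
name="$(git config --get user.name)"
email="$(git config --get user.email)"
echo "$name <$email>"
```

A repository-level identity like this can be handy when one machine is used for both work and personal projects.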
Edit Repository Config
It’s possible to edit the repository configuration by running:
git config -e
A text editor such as vim will open with content like this:
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = https://github.com/flaviobrito/data_eng.git
fetch = +refs/heads/*:refs/remotes/origin/*
~
~
~
To see more information about the repository configuration, use:
git config -l
You will see something like this:
core.excludesfile=~/.gitignore
core.legacyheaders=false
core.quotepath=false
mergetool.keepbackup=true
push.default=simple
color.ui=auto
color.interactive=auto
repack.usedeltabaseoffset=true
alias.s=status
alias.a=!git add . && git status
alias.au=!git add -u . && git status
alias.aa=!git add . && git add -u . && git status
alias.c=commit
alias.cm=commit -m
alias.ca=commit --amend
alias.ac=!git add . && git commit
alias.acm=!git add . && git commit -m
alias.l=log --graph --all --pretty=format:'%C(yellow)%h%C(cyan)%d%Creset %s %C(white)- %an, %ar%Creset'
alias.ll=log --stat --abbrev-commit
alias.lg=log --color --graph --pretty=format:'%C(bold white)%h%Creset -%C(bold green)%d%Creset %s %C(bold green)(%cr)%Creset %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative
alias.llg=log --color --graph --pretty=format:'%C(bold white)%H %d%Creset%n%s%n%+b%C(bold blue)%an <%ae>%Creset %C(bold green)%cr (%ci)' --abbrev-commit
alias.d=diff
As you can see, it is possible to work with aliases in git, as in the next example.
The command to check the repository status is:
git status
or, using alias.s=status from the alias list:
git s
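If the alias list above is not yet in your configuration, an alias can be defined with git config. The sketch below assumes a throwaway repository and defines the alias only there; adding --global would make it available in every repository.

```shell
# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Define the alias: "git s" will expand to "git status".
git config alias.s status

# Confirm the stored value and run the alias (--short keeps the output compact).
alias_value="$(git config --get alias.s)"
git s --short
echo "alias.s=$alias_value"
```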
Git stores the username and e-mail in a .gitconfig file under your home folder. To check it, open the file with vi or another text editor.
vi ~/.gitconfig
Ignoring Unwanted Files Globally
Git reads the .gitignore file inside the repository to find out which directories or files you want to ignore. You can also ignore files globally. To do it, create a file named .gitignore_global under your home directory:
vi ~/.gitignore_global
with the contents:
*~
.*.swp
.DS_Store
To apply this new setting, run
git config --global core.excludesfile ~/.gitignore_global
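One way to confirm that the patterns are picked up is git check-ignore. The sketch below is only an illustration: it assumes a throwaway repository and a hypothetical excludes file named excludes_demo, configured at repository level so that your real ~/.gitignore_global is not touched.

```shell
# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Hypothetical excludes file with the same three patterns as above.
excludes="$repo/excludes_demo"
printf '%s\n' '*~' '.*.swp' '.DS_Store' > "$excludes"
git config core.excludesfile "$excludes"

touch .DS_Store notes.txt

# check-ignore exits with 0 when the path is ignored and 1 when it is not.
if git check-ignore -q .DS_Store; then ds_store="ignored"; else ds_store="not ignored"; fi
if git check-ignore -q notes.txt; then notes="ignored"; else notes="not ignored"; fi
echo ".DS_Store: $ds_store, notes.txt: $notes"
```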
Validate Important Commands Rights
Create a very simple file named README.md with a simple markdown line like:
echo "# data_eng" >> README.md
Initialize Git Repository Locally
Initialize your repository if you have not already done so. From then on, git will keep track of the changes locally.
git init
Add a File to Version Control
Doing this stages the file, so git records its current content the next time you commit.
git add README.md
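The push step later in the article assumes that a commit already exists. A minimal sketch of the add-then-commit sequence follows; the throwaway repository and the placeholder identity are assumptions, and the message "first commit" simply matches the log output shown later in the article.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "# data_eng" >> README.md
git add README.md                 # stage the file
git commit -q -m "first commit"   # record the staged snapshot

# Confirm that exactly one commit exists and show its message.
commit_count="$(git rev-list --count HEAD)"
subject="$(git log -1 --pretty=format:%s)"
echo "$commit_count commit(s), latest: $subject"
```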
Branches
It is nice to start off on the right foot with git. A git branch follows the idea of a tree: you can derive as many branches as you need, each pointing to a snapshot of your changes.
With branches you can work on your changes without waiting for another developer. Other features, such as releases, can also be used to extend this.
The list of branches can be seen using:
git branch
flaviobrito@brito data_eng % git branch
* master
flaviobrito@brito data_eng %
This is the same as git branch --list.
The outcome in this case is:
* master
Rename a Branch
It’s important to know how to rename a branch. Sometimes we need to change the name because of a typo, for example. Let’s rename the current branch (master) to main using the -M option, which is equivalent to --move --force. The git documentation describes it as: move/rename a branch and the corresponding reflog.
git branch -M main
git branch
The outcome will be:
* main
What happened? The master branch was renamed to main.
The Repo Structure and Files
Listing all directories and the files inside each one is a good way to follow a complex project and document it outside of git. The ls-tree command, followed by the branch name (main in this case), prints the tree with only the file names.
git ls-tree -r main --name-only
README.md
In our case there is only one file, so a more complex structure will be created to demonstrate the tree.
git ls-tree -r BI-2316-Bigquery_Uploader --name-only
Makefile
README.md
common/__init__.py
config/__init__.py
db/__init__.py
module/__init__.py
tests/__init__.py
Create a Branch
A best practice is to commit to the main (master) branch only code that is already tested and approved for production. Until then, it is better to work in another branch, to avoid introducing unapproved or untested features, or a bug.
Before creating a new branch, let’s talk about how to name it. Avoid branch names made only of a mnemonic, a code or numbers. I recommend you follow this syntax:
<JIRA TICKET NUMBER>-<TITLE>
BI-2316-Bigquery_Uploader
where BI-2316 is the ticket number on Jira, where you can easily find more information about the features or issues related to this branch. Adding a short title is an even better practice. It is important to know that the name cannot contain spaces between words.
Avoid titles like these:
new-pre-merge
master-new-last
BI-1371
Names like these cause a lot of confusion and force people to search for more information to understand what they mean.
git checkout -b BI-2316-Bigquery_Uploader
In this case, git creates the new branch and switches to it, so it becomes the current branch.
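The rule about spaces can be checked before creating anything, using git check-ref-format --branch. The helper check_name below is a hypothetical function written for this sketch, and the throwaway repository is an assumption.

```shell
# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# check-ref-format --branch exits non-zero when the name is not a valid branch name.
check_name() {
  if git check-ref-format --branch "$1" >/dev/null 2>&1; then
    echo "valid"
  else
    echo "invalid"
  fi
}

with_space="$(check_name "BI-2316-Bigquery Uploader")"
with_underscore="$(check_name "BI-2316-Bigquery_Uploader")"
echo "with space: $with_space, with underscore: $with_underscore"
```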
Listing the branches
* BI-2316-Bigquery_Uploader
main
The current branch is now marked with an asterisk (*).
To see local and remote branches, you need the --all option:
git branch --all
* BI-2316-Bigquery_Uploader
main
remotes/origin/BI-2316-Bigquery_Uploader
remotes/origin/main
Create a Branch From Another Branch
Creating a new branch from another one is useful in special cases, such as a test merge. To do it, run:
Syntax: git branch new old
git branch BI-2316-Bigquery_Uploader_GCS BI-2316-Bigquery_Uploader
Change to another branch
To move to another branch, run git checkout followed by the name of the target branch:
git checkout main
In this case the current branch will be main.
How Do I Clean Outdated Branches?
In a corporate environment it is a good practice to remove a branch after it has been merged successfully. The command below fetches from all remotes and prunes the remote-tracking references (such as origin/branch_name) whose branches no longer exist on the remote. Local branches are not deleted, so local work is left untouched.
git fetch --all --prune
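To see the effect of pruning without touching a real project, the sketch below uses a local bare repository as a stand-in for GitHub. The branch name demo_feature and the placeholder identity are hypothetical; the branch is deleted on the remote side to imitate a branch removed after a merge.

```shell
work="$(mktemp -d)"
cd "$work"

# A local bare repository plays the role of the remote.
git init -q --bare remote.git
git init -q local
cd local
git config user.name "Jane Doe"          # placeholder identity
git config user.email jane@example.com
git remote add origin "$work/remote.git"

echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main
git push -q origin main main:demo_feature   # also create the remote branch demo_feature
git fetch -q origin                         # make sure origin/demo_feature is known locally

# Delete the branch on the remote side only.
git --git-dir="$work/remote.git" update-ref -d refs/heads/demo_feature

# The stale reference survives until it is pruned.
before="$(git branch -r | grep -c demo_feature)"
git fetch -q --all --prune
after="$(git branch -r | grep -c demo_feature)"
echo "stale references before: $before, after: $after"
```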
Delete a Branch
It is possible to delete a branch by running git branch with -D and the branch name. The -D option forces the deletion even if the branch has unmerged changes; -d is the safer variant and refuses in that case. You cannot delete the branch you are currently on.
git branch -D branch_name
To delete the branch BI-2316-Bigquery_Uploader, it is necessary to switch to another branch first, main in this case. Only then is it possible to delete BI-2316-Bigquery_Uploader.
git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
git branch -D BI-2316-Bigquery_Uploader
Deleted branch BI-2316-Bigquery_Uploader (was e9cf5ec).
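Because -D deletes without checking, it can help to see how it differs from the lowercase -d. The following sketch assumes a throwaway repository, a placeholder identity and a hypothetical branch named demo_branch holding one unmerged commit.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "base" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Hypothetical branch with one commit that is not merged into main.
git checkout -q -b demo_branch
echo "change" >> README.md
git commit -q -am "unmerged work"
git checkout -q main

# -d refuses to delete a branch whose commits are not merged...
if git branch -d demo_branch >/dev/null 2>&1; then safe_delete="deleted"; else safe_delete="refused"; fi

# ...while -D deletes it regardless.
git branch -D demo_branch >/dev/null
remaining="$(git branch --list demo_branch)"
echo "git branch -d: $safe_delete"
```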
Push the Changes
We already added a file and did a commit, but in this case all changes are still local. To push all of them into the remote branch:
Syntax:
git push <remote> <branch>
git push -u origin main
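where -u (short for --set-upstream) makes the local main branch track origin/main, so that later git push and git pull commands can be run without arguments.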
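What -u records can be inspected after the push. This sketch replaces GitHub with a local bare repository, which is an assumption made only so that the example runs without network access; the identity is a placeholder too.

```shell
work="$(mktemp -d)"
cd "$work"

# A local bare repository plays the role of GitHub.
git init -q --bare remote.git
git init -q local
cd local
git config user.name "Jane Doe"          # placeholder identity
git config user.email jane@example.com

echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main
git remote add origin "$work/remote.git"

# Push and remember origin/main as the upstream of main.
git push -q -u origin main

# Ask git which remote branch main is tracking now.
upstream="$(git rev-parse --abbrev-ref 'main@{upstream}')"
echo "main tracks $upstream"
```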
Always check the Status
Before you start working in a branch, it is good to check the status:
git status
the outcome is
On branch main
Your branch is up to date with 'origin/main'.
Pull
To bring the changes from the remote repository to your local repository, you need to use the git pull command.
Syntax: git pull remote_name branch_name
git pull origin main
Merge
Once you have been working in a branch and have tested everything, it’s time to merge the new changes from master into the branch that you are working on.
Merge a Master/Main to a Branch
You can do:
git checkout BI-2316-Bigquery_Uploader
git pull
git merge main
or
git merge origin/BI-2316-Bigquery_Uploader main
In this case all changes in main will be merged into BI-2316-Bigquery_Uploader. Unless git can fast-forward, this creates a “merge commit”. Merging is nice because it’s a non-destructive operation, and a merge that stops on conflicts can be aborted. See the Undo a Merge topic below.
Merge a Branch to Master/Main
git checkout main
git pull
git merge origin/BI-2316-Bigquery_Uploader
git push -u origin main
To test the merge before committing it, use --no-commit; --no-ff avoids a fast-forward:
git merge --no-ff --no-commit BI-2316-Bigquery_Uploader
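A trial merge of this kind can be inspected and then dropped. The sketch below is an illustration under stated assumptions: a throwaway repository, a placeholder identity, and a hypothetical branch demo_feature that adds one file.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Hypothetical feature branch that adds one file.
git checkout -q -b demo_feature
echo "new feature" > feature.txt
git add feature.txt
git commit -q -m "add feature"
git checkout -q main

# Stage the merge result without creating the merge commit.
git merge --no-ff --no-commit demo_feature >/dev/null

# Inspect what the merge would bring in.
staged="$(git diff --cached --name-only)"
echo "files staged by the trial merge: $staged"

# Nothing has been committed yet, so the trial merge can still be dropped.
git merge --abort
head_subject="$(git log -1 --pretty=format:%s)"
```

If the staged result looks right, running git commit instead of git merge --abort would conclude the merge.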
What Changed?
Always check the log. To verify the history of all changes, run:
git log
the outcome is
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit
Relax! The hash e9cf5ec40c683fd38316ca589ee4c46088ea743a will be different in your case.
Do you want more details? Try adding --stat:
git log --stat
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

 Makefile           | 27 +++++++++++++++++++++++++++
 common/__init__.py |  0
 config/__init__.py |  0
 db/__init__.py     |  0
 module/__init__.py |  0
 tests/__init__.py  |  0
 6 files changed, 27 insertions(+)

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit

 README.md | 1 +
 1 file changed, 1 insertion(+)
A short log version
git shortlog
flaviobrito (2):
first commit
structure
Or a graph version
git log --graph --oneline --decorate
* e8b2e09 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader) structure
* e9cf5ec (origin/main) first commit
Filter Log
In a long project a simple log check can have a huge output.
To show only the last 3 commits, use git log -3. In our case, the outcome will be the same.
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit
Filter by Date
git log --after="2021-7-1"
the outcome is
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit
Filter by Author
git log --author="flaviobrito"
the outcome is
commit e8b2e09e09486e56bd25d1486cd487ca02637bd7 (HEAD -> main, origin/BI-2316-Bigquery_Uploader, BI-2316-Bigquery_Uploader)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 16:44:19 2021 +0200

    structure

commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit
Filter by File
It’s important to be able to check what happened to a particular file or a list of files. To do it, add -- followed by a list of files.
git log -- README.md
the outcome is
commit e9cf5ec40c683fd38316ca589ee4c46088ea743a (origin/main)
Author: flaviobrito <EMAIL>
Date: Sun Aug 1 15:11:56 2021 +0200

    first commit
or
git log --follow --all -p .
where . is the current directory.
Log What Was Merged or Not
There are 2 options to filter the log by merge commits. To hide the merge commits:
git log --no-merges
and to show only the merge commits:
git log --merges
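The two filters split the history into complementary sets, which the following sketch makes visible by counting lines. It assumes a throwaway repository, a placeholder identity and a hypothetical branch demo_feature merged with --no-ff so that a merge commit exists.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "# data_eng" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Hypothetical feature branch with one commit.
git checkout -q -b demo_feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "add feature"
git checkout -q main

# --no-ff forces a merge commit even though a fast-forward would be possible.
git merge -q --no-ff -m "merge demo_feature" demo_feature

# Count the commits reported by each filter.
merges="$(git log --merges --oneline | wc -l | tr -d ' ')"
non_merges="$(git log --no-merges --oneline | wc -l | tr -d ' ')"
echo "merge commits: $merges, other commits: $non_merges"
```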
Troubleshooting
Can’t push to GitHub because of large file which I already deleted
This happened to me when a large file was generated without me realizing it. When I tried to push again, I received a message telling me that every push was blocked. Removing the file after trying to push it does not fix the problem, because git has a great memory; in that case I also had to remove it from the git history.
The same approach can be used to keep a huge list of unwanted files out of the git repository.
Suppose that you are reading 200MB of JSON files, you don’t want git to track them, and you didn’t put the data/requests folder into the .gitignore. Bam! Git tracks them. Here is how to remove them from the git cache:
git rm --cached data/requests/2015/* -r
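The -r option is recursive, so all subdirectories are removed from the cache too. This is not a physical removal: the files stay on disk and are only removed from git’s index. Commits that already contain the files keep them until the history itself is rewritten.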
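A small sketch of the whole untracking sequence, under stated assumptions: a throwaway repository, a placeholder identity, and a hypothetical file sample.json standing in for the large JSON files. Adding the folder to .gitignore afterwards keeps git from picking the files up again.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

# A small file stands in for the large JSON files.
mkdir -p data/requests/2015
echo '{}' > data/requests/2015/sample.json
git add .
git commit -q -m "first commit"

# Stop tracking the folder; the files stay on disk.
git rm -r -q --cached data/requests

# Ignore the folder from now on and record both changes.
echo "data/requests/" >> .gitignore
git add .gitignore
git commit -q -m "stop tracking request data"

tracked="$(git ls-files data/requests)"
if [ -f data/requests/2015/sample.json ]; then on_disk="yes"; else on_disk="no"; fi
echo "still on disk: $on_disk, tracked files: ${tracked:-none}"
```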
Undo a Merge
A wrong merge can cause a lot of work if we didn’t check for conflicts beforehand. For a big merge, one piece of advice to keep things simple: don’t wait too long before merging into master. If you did wait, a best practice is to create a new branch, merge it with the target branch there, and only if the result is clean repeat the merge in master. If a merge stops with many conflicts, you can undo it by running:
git merge --abort
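To illustrate, the sketch below provokes a conflict on purpose and then aborts the merge. The throwaway repository, the placeholder identity and the branch name demo_feature are assumptions made for the example.

```shell
# Throwaway repository with a placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "original line" > README.md
git add README.md
git commit -q -m "first commit"
git branch -M main

# Both branches change the same line in different ways.
git checkout -q -b demo_feature
echo "feature version" > README.md
git commit -q -am "change on feature"
git checkout -q main
echo "main version" > README.md
git commit -q -am "change on main"

# The merge stops because of the conflict.
if git merge demo_feature >/dev/null 2>&1; then merge_result="clean"; else merge_result="conflict"; fi

# Abort: the working tree goes back to the state before the merge started.
git merge --abort
status="$(git status --porcelain)"
content="$(cat README.md)"
echo "merge result: $merge_result, README.md after abort: $content"
```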
Summary
In this article, we took a focused look at the daily use of git and GitHub and at how it can improve your productivity.
You can check the author’s GitHub repositories for code, ideas, and resources in data science and data engineering. Please feel free to add me on LinkedIn or follow me on Twitter.