Git deep learning and the underlying mechanisms

In this article we would like to give some insight into why we use Git. One of the main tasks in software development is managing the changes made to a project’s code. This process is called version control, and Git is currently the most popular version control system. Learning it is a core skill for developing software efficiently.

Although Git can be simple enough to use in small projects with few developers, the hurdles increase in more complex projects with more developers, and productivity may suffer because of issues with its processes.

One reason for these difficulties is the way Git is usually learned. Because of time pressure in projects, many Git users learn only the basic commands they need to commit and push their changes to the repository as quickly as possible.

Another approach to learning Git is to understand, at some level, the ‘mechanics’ of Git and the main concepts behind its operation, instead of only the add-commit-push sequence (the standard process of recording changes). Although this ‘deep learning’ process costs some more time for thought at the beginning, it can reduce the time lost later to Git issues and to solving them.

So why do we use this software again?

The simplest way to think of Git is as a way to manage changes to a project’s source code and to track the history of the changes made by developers.

Where are the project’s files stored?

File Storage Image

Files are normally stored in three places:

  • The working directory contains the actual files that a developer creates and changes. It is stored inside the project’s folder.
  • The local repository is stored inside the .git folder (which is inside the project’s folder). It holds the project’s content and history in Git’s own internal format, so that Git can effectively manage changes to the original files written by developers.
  • The remote repository is normally stored on a server. It holds the same kind of data as the local repository (NOT the working directory), and the two are kept in sync by pushing and fetching changes.

Every developer has both a working directory and a local repository on their own computer. The working directory is used by the developer and the local repository by Git. A project typically shares a single remote repository, although Git allows more than one.

The term bare repository describes a Git repository without a working directory. Remote repositories on servers are usually bare.
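The difference between these layouts can be made visible with a few lines of Python. The following is only a rough sketch: it assumes that a Git directory can be recognised by the presence of a HEAD file plus objects and refs folders, and it ignores special cases such as linked worktrees, where .git is a file rather than a folder. The function names and the folder names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def looks_like_git_dir(path: Path) -> bool:
    """Heuristic: a Git directory normally contains HEAD, objects/ and refs/."""
    return (
        (path / "HEAD").is_file()
        and (path / "objects").is_dir()
        and (path / "refs").is_dir()
    )


def classify(project: Path) -> str:
    """Classify a folder by where (and whether) repository data is found."""
    if looks_like_git_dir(project / ".git"):
        # Working directory with the local repository inside .git
        return "working copy"
    if looks_like_git_dir(project):
        # Repository data only, no working directory around it
        return "bare repository"
    return "not a repository"


def make_fake_git_dir(path: Path) -> None:
    """Create an empty skeleton that only imitates a Git directory (demo only)."""
    (path / "objects").mkdir(parents=True)
    (path / "refs").mkdir()
    (path / "HEAD").write_text("ref: refs/heads/main\n")


def demo() -> list:
    """Build three hypothetical folders and classify each of them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_fake_git_dir(root / "project" / ".git")  # developer's folder
        make_fake_git_dir(root / "project.git")       # server-style bare repo
        (root / "plain").mkdir()                       # ordinary folder
        return [classify(root / name) for name in ("project", "project.git", "plain")]


if __name__ == "__main__":
    print(demo())
```

Pointing classify at a real project folder on your machine should report a working copy, while a repository created with git init --bare should be reported as bare.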

What lies in the local repository?

As already said above, the local repository contains the information that Git uses to manage changes in your project’s code.

The first thing Git does is take the files you add to the project, turn the content of each file into a ‘piece’ of data, and organize these pieces in a way that is more efficient for it to manage. Note that Git does not cut a file into fragments: one piece is the complete content of one version of one file.

Local repository Image

To understand the concept, let’s assume that several files in your project contain exactly the same text, for example only the word ‘time’. Git computes a ‘code’ from that content and stores the content once in a database of ‘file pieces’ organized by their codes. Identical content always produces the same code, so it is never stored twice. The code also makes it possible for Git to retrieve a piece later, when a file has to be recreated in its original format (which developers can understand). This ‘file-pieces’ database is stored inside the directory .git/objects.

The technical term for these ‘file pieces’ is blobs. Git stores them in compressed form, so the files under .git/objects make no sense to a human who opens them directly, only to Git.

The code assigned to each blob is a 40-character hexadecimal SHA-1 hash computed from the blob’s contents (prefixed with a short header), which makes it practically unique for each distinct content.

How can Git recreate files and directories?

As stated above, Git stores file contents as blobs, in a format that cannot be read directly by humans but only by Git. A blob holds content only, not the name or location of the file it came from, so the file and directory structure of the project must be recomposed from additional information.

Files and directories Image

This is done through tree objects. A tree object is stored in the .git/objects directory along with blobs and commit objects, and is identified by a hash code as well. It contains the information needed:

  • To restore files: each file entry records the file’s name, its mode, and the hash code of the blob that holds its content.
  • To restore directories: each subdirectory entry records the directory’s name and the hash code of another tree object.

A tree object can therefore reference blobs as well as other tree objects.

With blobs and trees, Git can recreate a project in a format that a developer can understand (the actual project’s files).
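The way trees tie names to blobs can be sketched in Python as well. The sketch below assumes the commonly documented binary layout of a tree: for every entry the mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object’s hash, with entries sorted by name (directories sorted as if their name ended in a slash). The modes 100644 (regular file) and 40000 (directory) are the usual ones; the file names and contents are invented for the example.

```python
import hashlib


def object_id(kind: str, content: bytes) -> str:
    """SHA-1 over '<kind> <size>', a NUL byte, and the content."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def build_tree(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries into the body of a tree object."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories are ordered as if their name had a trailing slash.
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def parse_tree(body: bytes) -> list:
    """Turn the body of a tree object back into (mode, name, hex_id) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)       # end of the mode
        nul = body.index(b"\x00", space)    # end of the name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes of SHA-1
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries


def demo():
    """Describe a tiny hypothetical project: README and src/main.py."""
    readme_id = object_id("blob", b"hello\n")
    main_id = object_id("blob", b"print('hi')\n")
    src_tree = build_tree([("100644", "main.py", main_id)])
    src_id = object_id("tree", src_tree)
    root_tree = build_tree([("40000", "src", src_id), ("100644", "README", readme_id)])
    return parse_tree(root_tree), readme_id, src_id


if __name__ == "__main__":
    for entry in demo()[0]:
        print(entry)
```

The root tree lists README with the hash of its blob and src with the hash of a second tree, which in turn lists main.py. Following these references from the root is how the whole folder structure can be rebuilt.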

How does Git keep track of the history of changes?

With blobs and trees, Git manages to recreate a project from chunks of binary data in a format that can be understood and processed by a developer. However, the purpose of Git is also to grant access to the different states/versions in the project’s history.

This is possible through commit objects. Commit objects are also stored in the .git/objects directory and are identified through a hash code as well. A commit object contains:

  • The hash code of a tree object that contains the information to recreate the entire project at a specific state/version.
  • The hash codes of the previous (parent) commits: none for the very first commit, one for an ordinary commit, and two (occasionally more) when the commit is the result of merging states of the project’s history.
  • Extra information such as the author, the committer, the commit message etc.
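The structure of a commit, and the way history emerges from the links between commits, can be illustrated with a short Python sketch. It assumes the commonly documented text layout of a commit object: a tree line, zero or more parent lines, an author line and a committer line, then a blank line followed by the message. The developer name, e-mail address and timestamp are invented, and the objects live in a plain dictionary instead of a real repository.

```python
import hashlib

# Well-known identifier of the empty tree, used here to keep the demo small.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def object_id(kind: str, content: bytes) -> str:
    """SHA-1 over '<kind> <size>', a NUL byte, and the content."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def build_commit(tree_id, parent_ids, author, committer, message) -> bytes:
    """Serialize the fields of a commit into the body of a commit object."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode()


def parse_commit(body: bytes) -> dict:
    """Split a commit body into tree, parents, author, committer and message."""
    head, _, message = body.decode().partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


def history(store: dict, start_id: str):
    """Walk from a commit back to the first one, following the first parent."""
    current = start_id
    while current is not None:
        commit = parse_commit(store[current])
        yield current, commit
        current = commit["parents"][0] if commit["parents"] else None


def demo() -> list:
    """Create two linked commits in a dictionary and list their messages."""
    who = "A Developer <dev@example.com> 1700000000 +0000"  # hypothetical identity
    store = {}
    first = build_commit(EMPTY_TREE, [], who, who, "first\n")
    first_id = object_id("commit", first)
    store[first_id] = first
    second = build_commit(EMPTY_TREE, [first_id], who, who, "second\n")
    second_id = object_id("commit", second)
    store[second_id] = second
    return [commit["message"] for _, commit in history(store, second_id)]


if __name__ == "__main__":
    print(demo())
```

Because each commit records the hash of its parent, starting from the newest commit and following the parent links is enough to list the history in reverse order. A merge commit would simply carry more than one parent line.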

That’s almost all there is to the structure (not the functionality!) of Git, in very simple words.