Data Visualization
Analyzing Programmers' Knowledge Network

Project Summary

The goal of this project was to analyze the behavior of GitHub users: to explore how programming languages are used overall, and how much the sets of languages that individual programmers use overlap.

This analysis considers the 50 most frequently used programming languages on GitHub, ranked by the number of opened pull requests.
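As a rough illustration of this ranking step, the sketch below counts opened pull requests per language and keeps the most frequent ones. The function name top_languages and the (language, action) record shape are assumptions made for this example; they are not taken from the project's code.

```python
from collections import Counter


def top_languages(events, n=50):
    """Return the n languages with the most opened pull requests.

    `events` is an iterable of (language, action) pairs. This record
    shape is hypothetical and chosen only for the example.
    """
    counts = Counter(
        language
        for language, action in events
        # Count only opened pull requests on repositories with a known language.
        if action == "opened" and language is not None
    )
    return [language for language, _ in counts.most_common(n)]
```

Calling top_languages(events, n=50) on the full set of pull request records would yield a list of languages like the one shown in the diagram, assuming the records have been reduced to this shape beforehand.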

The GitHub repository of this visualization can be found here.
The Process Book for this visualization can be found here.
The screencast of this visualization can be seen here.

How to use the visualization

The visualization below shows the interdependence between users of multiple programming languages.

The amount of space that each language occupies in the chart is proportional to its number of unique contributors, and each edge represents the number of unique contributors that two languages have in common.
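The two quantities described above can be sketched as follows: one value per language for the arc size, and one value per pair of languages for the edge. This is a minimal sketch, assuming the data has already been reduced to (user, language) pairs; the function name contributor_overlap and that input shape are hypothetical.

```python
from itertools import combinations


def contributor_overlap(contributions):
    """Compute arc sizes and edge weights from (user, language) pairs.

    The input shape is an assumption for this example. Repeated pairs
    are harmless because contributors are stored in sets.
    """
    users_by_language = {}
    for user, language in contributions:
        users_by_language.setdefault(language, set()).add(user)

    # Arc size: number of unique contributors per language.
    arc_sizes = {lang: len(users) for lang, users in users_by_language.items()}

    # Edge weight: number of unique contributors shared by two languages.
    edges = {}
    for a, b in combinations(sorted(users_by_language), 2):
        shared = len(users_by_language[a] & users_by_language[b])
        if shared:
            edges[(a, b)] = shared
    return arc_sizes, edges
```

Pairs of languages with no common contributor are left out of edges, so they would simply have no connection in the diagram.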

The languages on the diagram are sorted by the programming paradigm they belong to, which is indicated by the arc color. To change the sorting, click the Ascending or Descending button. To revert to the original sorting, click the Paradigm button.

Hovering over the arc of a specific language shows some general statistics for that language: the average number of pull requests per month, the average number of actors that used the language per month, and the average number of pull requests per actor. It also shows the connection strength between that language and all the others on the diagram, indicated by the thickness of the ribbons that connect them.
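The three statistics can be illustrated with a short sketch. It assumes one (month, actor, language) record per pull request, and it assumes that "pull requests per actor" means total pull requests divided by the number of unique actors over the whole window; the page does not state the exact definition, so this is only one plausible reading. The function name language_stats is hypothetical.

```python
from collections import defaultdict


def language_stats(pull_requests, language):
    """Summarize pull request activity for one language.

    `pull_requests` is an iterable of (month, actor, language) records,
    one per pull request (an assumed shape). Months without any pull
    request for the language are not counted in the averages.
    """
    prs_per_month = defaultdict(int)
    actors_per_month = defaultdict(set)
    for month, actor, lang in pull_requests:
        if lang != language:
            continue
        prs_per_month[month] += 1
        actors_per_month[month].add(actor)

    months = len(prs_per_month)
    if months == 0:
        return None  # the language does not appear in the data

    total_prs = sum(prs_per_month.values())
    all_actors = set().union(*actors_per_month.values())
    return {
        "avg_prs_per_month": total_prs / months,
        "avg_actors_per_month": sum(len(a) for a in actors_per_month.values()) / months,
        # Assumed definition: total pull requests over unique actors.
        "avg_prs_per_actor": total_prs / len(all_actors),
    }
```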

Additionally, clicking the arc of a specific language shows more detailed information about it: the programming paradigm it belongs to, the languages with which it is most strongly correlated, and further statistics about the language, including how they compare to the mean value across all languages.

Dragging the slider below the diagram updates the language usage statistics to reflect the selected month.

Clicking the name of a language removes it from the diagram. It can be restored by selecting it in the filter list.


    The data was collected from the GitHub Archive. The datasets are divided by hour, and the initial time window considered in this analysis runs from 2017-07-01 to 2017-11-30. Only the PullRequest events are considered, and the programming language of each repository was extracted from the payload.pull_request.base.repo.language field.

    The code to extract and aggregate the data is available here, and more information about the dataset's metadata can be found here.
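For readers who want a feel for the extraction step without opening the linked code, here is a minimal sketch. It is not the project's extraction code. It assumes each archive line is one JSON event, that pull request events carry the type value "PullRequestEvent", and that the actor's name is stored under actor.login; only the payload.pull_request.base.repo.language path is taken from the description above. The function name pull_request_languages is hypothetical.

```python
import json


def pull_request_languages(lines):
    """Yield (actor_login, language) for each pull request event.

    `lines` is an iterable of JSON strings, one event per line.
    The "type" and "actor.login" fields are assumptions about the
    event layout; the language path follows the description above.
    """
    for line in lines:
        event = json.loads(line)
        # Assumed event type name for pull request events.
        if event.get("type") != "PullRequestEvent":
            continue
        repo = event["payload"]["pull_request"]["base"]["repo"]
        language = repo.get("language")
        # Repositories without a detected language are skipped.
        if language is not None:
            yield event["actor"]["login"], language
```

Because the function accepts any iterable of lines, it could be fed from a decompressed hourly file opened in text mode, and its output matches the (user, language) shape that a per-language aggregation would need.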