On Sunday at FOSDEM, I am giving a 5-minute lightning talk about extracting data from open source communities in the HPC, Big Data, Data Science devroom (slides).
Open source communities are filled with huge amounts of data just waiting to be analyzed. Getting this data into a format that can be easily used for analysis may seem intimidating at first, but there are some very useful open source tools that make this task relatively easy.
The primary tools used in this talk are the open source Metrics Grimoire tools that take data from various community sources and store it in a database where it can be easily queried and analyzed.
Tools covered:
- CVSAnalY to gather and analyze source code repository data
- MLStats to gather and analyze mailing list data
- Other Metrics Grimoire tools for bug trackers, IRC, Wikis and more
- Gource to visualize source code repository data
MLStats and CVSAnalY – Installation and data import:
It’s very easy to get started with MLStats and CVSAnalY and use them to import data from your mailing lists and code repositories.
- Install (run from each tool's source directory):
$ python setup.py install
- Create the databases:
mysql> create database mlstats;
mysql> create database cvsanaly;
- Import data:
$ mlstats http://URLOFYOURLIST
$ cvsanaly2 /path/to/repo
MLStats – Queries to extract data:
- Top 100 messages (most replied to threads):
SELECT subject, COUNT(*) AS total
FROM messages
GROUP BY subject
ORDER BY total DESC
LIMIT 100;
- Other queries:
  - # of messages from a specific person
  - # of messages per person from email domain
  - Find all messages with specific word in subject line (patch)
  - More queries
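The "other queries" in this section can be sketched in the same spirit. The example below is only an illustration, not MLStats itself: it uses an in-memory SQLite table as a stand-in for the MLStats MySQL database. Only the `messages` table and its `subject` column come from the query in this post; the `message_id` and `sender_email` columns, the helper function names, and the sample rows are assumptions invented for the demo. The real MLStats schema may store sender information differently, so check it before adapting these queries.

```python
import sqlite3

# Stand-in for the MLStats database: an in-memory SQLite table.
# Only `subject` is taken from the post; `message_id` and `sender_email`
# are hypothetical columns used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_id TEXT, subject TEXT, sender_email TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("1", "[PATCH] Fix typo in README", "alice@example.org"),
        ("2", "Re: [PATCH] Fix typo in README", "bob@company.com"),
        ("3", "Release schedule", "alice@example.org"),
        ("4", "Re: Release schedule", "carol@company.com"),
        ("5", "Re: Release schedule", "bob@company.com"),
    ],
)


def messages_from(conn, email):
    """Count the messages sent from one email address."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE sender_email = ?", (email,)
    ).fetchone()
    return row[0]


def messages_per_person_in_domain(conn, domain):
    """Count messages per sender for addresses in the given email domain."""
    return conn.execute(
        "SELECT sender_email, COUNT(*) AS total FROM messages "
        "WHERE sender_email LIKE ? "
        "GROUP BY sender_email "
        "ORDER BY total DESC, sender_email",
        ("%@" + domain,),
    ).fetchall()


def subjects_containing(conn, word):
    """Return subjects containing a word (SQLite LIKE ignores ASCII case)."""
    rows = conn.execute(
        "SELECT subject FROM messages WHERE subject LIKE ? ORDER BY message_id",
        ("%" + word + "%",),
    )
    return [row[0] for row in rows]


print(messages_from(conn, "alice@example.org"))
print(messages_per_person_in_domain(conn, "company.com"))
print(subjects_containing(conn, "patch"))
```

Passing values as `?` parameters rather than pasting them into the SQL string keeps the queries safe to reuse with arbitrary names and search words. With MySQL the same SQL should carry over, though the placeholder style depends on the driver you use.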
CVSAnalY – Queries to extract data:
- Number of commits per person by email domain:
SELECT p.name, p.email,
COUNT(DISTINCT s.id) AS num_commits
FROM people p
JOIN scmlog s ON p.id = s.author_id
WHERE p.email LIKE '%company.com'
GROUP BY p.name, p.email
ORDER BY num_commits DESC;
- Other queries:
  - Top commit authors all time
  - # of commits for specific person
  - More Queries
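The remaining CVSAnalY queries follow the same pattern as the commits-per-domain query. Here is a minimal sketch, assuming only the `people` and `scmlog` tables and the columns that query uses (`id`, `name`, `email`, `author_id`). It runs against an in-memory SQLite stand-in rather than a real CVSAnalY database, and the helper function names and sample rows are made up for illustration.

```python
import sqlite3

# Stand-in for the CVSAnalY database. Table and column names follow the
# query in this post; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE scmlog (id INTEGER PRIMARY KEY, author_id INTEGER);
    INSERT INTO people VALUES (1, 'Alice', 'alice@example.org');
    INSERT INTO people VALUES (2, 'Bob', 'bob@company.com');
    INSERT INTO scmlog VALUES (1, 1);
    INSERT INTO scmlog VALUES (2, 2);
    INSERT INTO scmlog VALUES (3, 2);
    INSERT INTO scmlog VALUES (4, 2);
    """
)


def top_commit_authors(conn, limit=10):
    """Return (name, email, num_commits) for the most active authors."""
    return conn.execute(
        "SELECT p.name, p.email, COUNT(DISTINCT s.id) AS num_commits "
        "FROM people p JOIN scmlog s ON p.id = s.author_id "
        "GROUP BY p.name, p.email "
        "ORDER BY num_commits DESC, p.name "
        "LIMIT ?",
        (limit,),
    ).fetchall()


def commits_for(conn, email):
    """Count the commits authored by the person with this email address."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT s.id) "
        "FROM people p JOIN scmlog s ON p.id = s.author_id "
        "WHERE p.email = ?",
        (email,),
    ).fetchone()
    return row[0]


print(top_commit_authors(conn))
print(commits_for(conn, "bob@company.com"))
```

Dropping the `WHERE` clause from the domain query and adding a `LIMIT` is all it takes to get an all-time ranking, which is why the two helpers look so similar.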
Other Metrics Grimoire Tools:
- Bicho – Bug / issue tracker data
- IRCAnalysis
- MediaWikiAnalysis
- Many more tools
Gource:
Gource is an amazing tool to visualize activity from your source code repositories. I did a full talk about Gource on Friday at the FLOSS Community Metrics meeting, so have a look at that blog post for details about using Gource.