Querying our GitHub & NuGet data

In my previous posts, I’ve discussed how to retrieve the list of top repositories from GitHub,
get the list of packages each one uses, and store all that informationin a RavenDb database. Now, it’s time to query all the data.

RavenDb provides Map/Reduce indexes for querying the data. Let’s see some of the basic ones that interest us.

Repositories/ByName

from repo in docs.Repos
select new {
    Name = repo.Name,
    FullName = repo.FullName
}

Packages/ByName

from package in docs.Packages
select new {
    Name = package.Name
}

Packages/ByTag

from package in docs.Packages
from tag in package.Tags
select new {
    Tag = tag
}

These are all simple indexes, which are there mostly for searching and viewing our data. Much more interesting is to see the relations between repositories and packages. Let’s look at an index that allows us to get a list of repositories that use a certain package.Here’s the index definition:

Repositories/ByPackage

from repo in docs.Repos
from package in repo.Packages
from tag in package.Tags
select new {
    PackageName = package.Name,
    PackageVersion = package.Version
}

Querying this index gives us a nice list of repositories:

Using RavenDB Index to query NuGet packages used in GitHub

We can also create a more complex index, that will tell us which package (regardless of its versions) is the most popularly used:

MAP

from repo in docs.Repos
from packageRef in repo.Packages
let package = LoadDocument(packageRef.Id)
select new {
    PackageName = packageRef.Name,
    Repos = new string[] { repo.FullName },
    Tags = package.Tags,
    Count = 1
}

REDUCE

from r in results
group r by r.PackageName into g
select new {
    PackageName = g.Key,
    Repos = g.SelectMany(x = x.Repos).Distinct(),
    Tags = g.SelectMany(x = x.Tags).Distinct(),
    Count = g.Sum(x = x.Count)
}

FIELDS

  • Tags:
    • Storage: Yes
    • Indexing: Analyze
  • Count:
    • Sort: Int

This index allows us to see which packages are the most widely used:

Using Raven DB to query NuGet packages used in GitHub

And even drill further into the data, and for example, see which of the packages that have the sql tag is most widely used (and by which repositories):

Query NuGet packages used in GitHub from Raven DB

Here I’ve shown you how to query the data. In my next post, I’ll try to gain some insights from the data, and see if I can see some usage patterns.