[2024-06-19]
Can great data be of bad quality?
I'd argue there's a case for that; even more, I'd push back against the claim that 'to adopt AI, companies need to fix data quality first'.
So, if there's a domain that's very interesting (well, to me), but the data is of bad quality in the traditional sense, does it make sense to explore what can be done with it as-is, without attempting to fix it at all?
When talking about quality here and below, it's worth noting that to build superb AI capability, a company won't benefit much from having a classical warehouse in the textbook sense: accurate, valid, correct data, all covered by tests, metrics, etc.
To put it more formally: in my opinion, the focus should be on 'relative' fitness for purpose, a measure of how well the data represents the world/domain in all of its messy richness.
Recently, I noticed a mention of an open-source GitHub dataset (the GitHub Innovation Graph), which is meant to capture developer activity mapped to countries.
While macro-policy itself isn't that interesting to me, the topic and content seemed fascinating, with supposed stats on:
The last one looked potentially the most exciting to explore a bit.
I'll try to get a bit of intuition from that data about where Germany stands on the tech map (first I need to answer whether it's a) possible/relevant, and b) reasonable, based on a few quick checks; a sketch of such checks is below).
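As a minimal sketch of what those quick checks could look like, here is some pandas code. Assumptions not from the post: the Innovation Graph data is published as CSVs in the github/innovationgraph repository, and the file path and column names used here (languages.csv, economy, language, num_pushers, year) may differ from the actual export, so treat them as placeholders.

```python
# Sketch: quick fitness-for-purpose checks on the GitHub Innovation Graph data
# for Germany. File path and column names below are assumptions.
import pandas as pd

URL = (
    "https://raw.githubusercontent.com/github/innovationgraph/main/"
    "data/languages.csv"  # assumed path; check the repo for the real layout
)

df = pd.read_csv(URL)

# Basic checks before trusting any conclusions:
print(df.shape)                    # how much data is there at all?
print(df.columns.tolist())         # what is actually recorded?
print(df.isna().mean().round(3))   # share of missing values per column

# Is Germany ("DE") present, and over which period?
de = df[df["economy"] == "DE"]
print(de["year"].min(), de["year"].max())

# A first, rough look: top languages by pusher count in the latest year.
latest = de[de["year"] == de["year"].max()]
print(
    latest.groupby("language")["num_pushers"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
```

The point of these few lines isn't analysis yet, just answering whether the dataset is usable as-is: coverage, missingness, and whether Germany shows up at all.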
Well, first of all, a few considerations off the top of my head: