TL;DR
Canada’s Innovative Solutions Canada program publishes the grants it awards to help SMEs solve public-sector challenges. But our deep dive into the 405 award records, worth about CAD $150 million, shows that messy formats, duplicate names and inconsistent geography make it hard to get the real story. Clean data isn’t a luxury; it is the bedrock of trustworthy digital government and of the program team’s own ability to manage and track its activity.
The Innovative Solutions Canada Program
Innovative Solutions Canada is an initiative designed to stimulate research, development, and commercialization of Canadian innovations. The program’s Challenge Stream and Testing Stream help startups and small and medium-sized enterprises (SMEs) overcome technology testing and development hurdles so they can produce globally demanded products and services, while also improving government operations.
The Awarded Dataset at a Glance
- Phase 1: 297 award recipients
- Phase 2: 108 award recipients
- Total records: 405
- Total public funding tracked: CAD $150 072 387.07
Take a look at the program: https://ised-isde.canada.ca/site/innovative-solutions-canada/en
Take a look at the data: https://ised-isde.canada.ca/site/innovative-solutions-canada/en/innovative-solutions-canada-awarded-companies
How We Got the Data (Yes, You Can Too)
No coding required: we used Excel Power Query (Data → Get Data → From Web) to pull the two HTML tables (“Phase 1 award recipients” and “Phase 2 award recipients”) directly from the government site. Power Query keeps the raw values intact, so every quirk you see below is exactly what’s published. The Power Query code is included at the end of this blog post; copy and paste it into a blank Power Query editor to reproduce the pull.
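If you would rather script the pull, the sketch below does the same thing in Python with pandas. This is our illustration, not the post’s official method: it assumes pandas and lxml are installed and that the first two tables on the page are the Phase 1 and Phase 2 recipient lists (the page layout may change).

```python
import pandas as pd

URL = ("https://ised-isde.canada.ca/site/innovative-solutions-canada/en/"
       "innovative-solutions-canada-awarded-companies")

# read_html parses every <table> on the page; we assume the first two are
# "Phase 1 award recipients" and "Phase 2 award recipients".
phase1, phase2 = pd.read_html(URL)[:2]

# Stack the two phases into one frame, keeping a phase label on every row.
awards = pd.concat([phase1, phase2], keys=["Phase 1", "Phase 2"],
                   names=["Phase", "Row"])
print(len(awards))  # expect 405 if the page still matches this post
```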
Data Quality Scoreboard
- Overall data quality score: ≈ 27 % of records have at least one data quality issue
- 100 % of rows need award-amount cleaning
- 31 % are affected by department name drift
- 12 % have duplicate business names
Data Quality Issues
| Problem | How often? | Real example |
| --- | --- | --- |
| Currency clutter | Every record needs cleaning; 2 are unreadable as published | *$168,270.30 · $1, 000, 000.00 |
| Department name chaos | 39 spellings for 32 real departments | National Research Council vs. National Research Council Canada (NRC) |
| Company duplicates | 25 duplicate spellings (≈ 8 %) | BI Expertise Inc vs. BI Expertise Inc. |
| Geography clashes | 2 rows where city & province disagree | “Scarborough, ON” filed under “Quebec” |
Problem #1: The Money Formatting Nightmare
Examples straight from the source:
• $149,605.75*
• *$168,270.30
• $1, 000, 000.00
• $ 1,147,427.45 CAD
• $150 000.00
(non-breaking space)
With dollar signs, asterisks, currency codes and rogue spaces sprinkled at random, no tool can sum the column without a clean-up script.
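To show how small the fix is once you commit to a clean-up script, here is a minimal Python sketch that strips everything except digits and the decimal point before converting. The helper name is ours, and genuinely unreadable cells simply become NaN rather than being repaired:

```python
import re

def clean_amount(raw: str) -> float:
    """Strip $, *, 'CAD', commas, regular and non-breaking spaces, etc."""
    digits = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        return float("nan")  # unreadable as published

samples = ["$149,605.75*", "*$168,270.30", "$1, 000, 000.00",
           "$ 1,147,427.45 CAD", "$150\u00a0000.00"]
print([clean_amount(s) for s in samples])
# [149605.75, 168270.3, 1000000.0, 1147427.45, 150000.0]
```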
Problem #2: The Department Identity Crisis
Five ways to spell one department:
National Research Council Canada (NRC)
National Research Council
National Research Council Canada
National Research Council of Canada
Natiocal Research Council Canada (typo)
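One way to collapse these variants is to strip the parenthetical abbreviation, normalise whitespace and case, and map known spellings to a single canonical name. The sketch below covers only this NRC example; the variant set, canonical spelling and function name are ours:

```python
import re

CANONICAL = "National Research Council Canada"
VARIANTS = {
    "national research council",
    "national research council canada",
    "national research council of canada",
    "natiocal research council canada",  # typo seen in the source data
}

def normalize_department(name: str) -> str:
    base = re.sub(r"\([^)]*\)", "", name)         # drop "(NRC)"-style suffixes
    base = re.sub(r"\s+", " ", base).strip().lower()
    return CANONICAL if base in VARIANTS else name.strip()

print(normalize_department("National Research Council Canada (NRC)"))
# National Research Council Canada
```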
Problem #3: When Businesses Have Multiple Identities
Corporate-suffix chaos means one company can look like three:
BI Expertise Inc ↔ BI Expertise Inc.
Biosa Technologies Limited ↔ Biosa Technologies Ltd.
Few-cycle Inc. ↔ few-cycle Inc.
Pyrogenesis Canada Inc ↔ PyroGenesis Canada Inc. ↔ Pyrogenesis Inc.
Terragon Environmental Technologies Inc. ↔ Terragon Environmental Technologies Inc
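A simple way to make these pairs compare equal is to lower-case names, drop punctuation, and map suffix variants (“Limited” vs “Ltd.”, “Inc” vs “Inc.”) to one spelling. The suffix table below is an illustrative assumption, not an exhaustive standard:

```python
import re

SUFFIXES = {"inc": "inc", "incorporated": "inc",
            "ltd": "ltd", "limited": "ltd",
            "corp": "corp", "corporation": "corp"}

def normalize_company(name: str) -> str:
    tokens = re.sub(r"[.,]", "", name.lower()).split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)

assert normalize_company("BI Expertise Inc") == normalize_company("BI Expertise Inc.")
assert normalize_company("Biosa Technologies Limited") == normalize_company("Biosa Technologies Ltd.")
assert normalize_company("Pyrogenesis Canada Inc") == normalize_company("PyroGenesis Canada Inc.")
```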
Problem #4: The Geographic Redundancy Puzzle
Two columns tell the same story—“City, Province or Territory” and “Province”—but they sometimes disagree. Example: Scarborough, ON tagged as “Quebec”.
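Catching these clashes programmatically is straightforward: parse the two-letter abbreviation out of the combined field and compare it with the spelled-out province. A minimal sketch, assuming the published column layout and a standard abbreviation map:

```python
PROVINCE_BY_ABBREV = {
    "ON": "Ontario", "QC": "Quebec", "BC": "British Columbia",
    "AB": "Alberta", "MB": "Manitoba", "SK": "Saskatchewan",
    "NS": "Nova Scotia", "NB": "New Brunswick",
    "PE": "Prince Edward Island", "NL": "Newfoundland and Labrador",
    "YT": "Yukon", "NT": "Northwest Territories", "NU": "Nunavut",
}

def provinces_disagree(city_prov: str, province: str) -> bool:
    """True when 'City, XX' and the Province column point at different places."""
    abbrev = city_prov.rsplit(",", 1)[-1].strip()
    expected = PROVINCE_BY_ABBREV.get(abbrev)
    return expected is not None and expected != province.strip()

print(provinces_disagree("Scarborough, ON", "Quebec"))  # True
```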
The Quick Take: Why Anyone Should Care
Publishing raw datasets is a cornerstone of modern, digital-by-default government. In theory, anyone can slice the data to see which departments spend what, where the money goes, and which innovators rise to the top. In practice, data quality often stands between citizens and insight.
We need 100 % quality data:
- Clean dollar amounts you can add up.
- One official name per federal department.
- Consistent spelling for every winning company.
- Location fields that agree (city vs province).
Why it matters: When “National Research Council” is spelled five different ways, its spending is split into five buckets. Unparseable dollar cells prevent calculating accurate totals, and a single province mismatch breaks any map. Transparency fails when the data is too dirty to use.
Industry studies peg the price of dirty data at 15–25 % of revenue. For government, the cost is measured in public confidence. Data that is technically “open” but practically unusable creates what experts call the transparency paradox: the appearance of openness without real insight.
It also calls into question how well the Innovative Solutions Canada team itself can build internal metrics to manage its operations, gauge its effectiveness, and account for the tax dollars being spent. These public award records are likely just the tip of the iceberg: behind the scenes, the more detailed operational data and documentation probably hold further quality issues that undermine confidence in how grants and project execution are tracked and managed. Are all projects managed by the same criteria? Are multiple, inconsistent sets of metrics in use?
The Real-World Impact
Every audience pays a different price:
- Citizens: Want totals for your province? Not without cleaning the file first.
- Journalists: Investigations turn into data-wrangling marathons.
- Researchers: Academic studies on innovation funding stall on preprocessing.
- Government: The dataset meant to boost transparency instead erodes trust.
- The Innovative Solutions Canada team itself: it needs this data to run its operations and measure its effectiveness. If our one-off analysis had to wrestle with these issues, the team is dealing with them every day.
Looking Forward
The Innovative Solutions Canada program is still a laudable example of digital-government transparency. But good intentions need good data plumbing. Clean, consistent, analysis-ready datasets aren’t a technical nicety; they are a democratic requirement. Only then can open data deliver on its promise of accountability and informed public debate.
What Needs to Change
- Data Standards: lock down formats before publishing (e.g., dropdowns for department names).
- Validation: automated checks should catch malformed currencies, inconsistent suffixes and mismatched provinces on upload (see the sketch after this list).
- Master Data Management: maintain authoritative lists of departments, companies and geographies.
- Quality Monitoring: schedule routine audits and publish a changelog with each data refresh.
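As promised above, here is a minimal sketch of what an upload-time validation gate could look like, reusing the clean_amount and provinces_disagree helpers sketched earlier. The field names are assumptions based on the published table, not a real ISED schema:

```python
import math

def validate_rows(rows):
    """rows: list of dicts keyed by 'Amount', 'City, Province or Territory'
    and 'Province' (field names assumed). Returns (row index, problem)
    pairs; a publishing pipeline could block upload until this is empty."""
    failures = []
    for i, row in enumerate(rows):
        if math.isnan(clean_amount(row["Amount"])):
            failures.append((i, "unparseable amount"))
        if provinces_disagree(row["City, Province or Territory"], row["Province"]):
            failures.append((i, "city/province mismatch"))
    return failures

print(validate_rows([{"Amount": "$1, 000, 000.00",
                      "City, Province or Territory": "Scarborough, ON",
                      "Province": "Quebec"}]))
# [(0, 'city/province mismatch')]
```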
Analysis Summary
The analysis merged the Phase 1 and Phase 2 tables, profiled each column, and applied rule-based cleaning: stripping non-numeric characters from money fields, collapsing parenthetical abbreviations in department names, normalising corporate suffixes, and comparing city abbreviations to formal province names. No fancy AI: just careful regexes, a touch of fuzzy matching, and plenty of validation checks.
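As a taste of the fuzzy-matching step, the standard library alone can flag near-duplicate company names for manual review. This is our illustration rather than the repo’s code, and the 0.9 cutoff is our choice:

```python
from difflib import get_close_matches

names = ["Pyrogenesis Canada Inc", "PyroGenesis Canada Inc.",
         "Terragon Environmental Technologies Inc.", "BI Expertise Inc"]

for name in names:
    # Compare each name against all the others at a high similarity cutoff.
    candidates = get_close_matches(name, [n for n in names if n != name],
                                   n=1, cutoff=0.9)
    if candidates:
        print(f"possible duplicate: {name!r} ~ {candidates[0]!r}")
```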
The analysis code and datasets can be found in the GitHub repo:
https://github.com/sitrucp/goc_ised_data_quality