TL;DR
Canada’s Innovative Solutions Canada program publishes the grants it awards to help SMEs solve public-sector challenges. But our deep dive into the 405 award records, worth about CAD $150 million, shows that messy formats, duplicate names and inconsistent geography make it hard to get the real story. Clean data isn’t a luxury; it is the bedrock of trustworthy digital government and of the program team’s own ability to manage and track its activity.
The Innovative Solutions Canada Program
Innovative Solutions Canada is an initiative designed to stimulate research, development, and commercialization of Canadian innovations. The program’s Challenge Stream and Testing Stream help startups and small and medium-sized enterprises (SMEs) overcome technology testing and development hurdles so they can produce globally demanded products and services, while also improving government operations.
The Awarded Dataset at a Glance
- Phase 1: 297 award recipients
- Phase 2: 108 award recipients
- Total records: 405
- Total public funding tracked: CAD $150 072 387.07
Take a look at the program: https://ised-isde.canada.ca/site/innovative-solutions-canada/en
Take a look at the data: https://ised-isde.canada.ca/site/innovative-solutions-canada/en/innovative-solutions-canada-awarded-companies
How We Got the Data (Yes, You Can Too)
No coding required: we used Excel Power Query (Data → Get Data → From Web) to pull the two HTML tables (“Phase 1 award recipients” and “Phase 2 award recipients”) directly from the government site. Power Query keeps the raw values intact, so every quirk you see below is exactly what’s published. The Power Query code is included at the end of this blog post; copy and paste it into a blank Power Query editor to reproduce the pull.
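If you would rather script the pull, the sketch below does the same thing in Python with pandas. This is our illustration, not the post’s official method: it assumes pandas and lxml are installed and that the first two tables on the page are the Phase 1 and Phase 2 recipient lists (the page layout may change).

```python
import pandas as pd

URL = ("https://ised-isde.canada.ca/site/innovative-solutions-canada/en/"
       "innovative-solutions-canada-awarded-companies")

# read_html parses every <table> on the page; we assume the first two are
# "Phase 1 award recipients" and "Phase 2 award recipients".
phase1, phase2 = pd.read_html(URL)[:2]

# Stack the two phases into one frame, keeping a phase label on every row.
awards = pd.concat([phase1, phase2], keys=["Phase 1", "Phase 2"],
                   names=["Phase", "Row"])
print(len(awards))  # expect 405 if the page still matches this post
```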
Data Quality Scoreboard
- Overall data quality score: ≈ 27 % of records have at least one data quality issue
- 100 % of rows need award-amount cleaning
- 31 % are affected by department name drift
- 12 % have duplicate business names
Data Quality Issues
| Problem | How often? | Real example |
| --- | --- | --- |
| Currency clutter | Every record needs cleaning; 2 are unreadable as published | *$168,270.30 · $1, 000, 000.00 |
| Department name chaos | 39 spellings for 32 real departments | National Research Council vs. National Research Council Canada (NRC) |
| Company duplicates | 25 duplicate spellings (≈ 8 %) | BI Expertise Inc vs. BI Expertise Inc. |
| Geography clashes | 2 rows where city & province disagree | “Scarborough, ON” filed under “Quebec” |
Problem #1: The Money Formatting Nightmare
Examples straight from the source:
• $149,605.75*
• *$168,270.30
• $1, 000, 000.00
• $ 1,147,427.45 CAD
• $150 000.00
(non-breaking space)
With dollar signs, asterisks, currency codes and rogue spaces sprinkled at random, no tool can sum the column without a clean-up script.
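To show how small the fix is once you commit to a clean-up script, here is a minimal Python sketch that strips everything except digits and the decimal point before converting. The helper name is ours, and genuinely unreadable cells simply become NaN rather than being repaired:

```python
import re

def clean_amount(raw: str) -> float:
    """Strip $, *, 'CAD', commas, regular and non-breaking spaces, etc."""
    digits = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        return float("nan")  # unreadable as published

samples = ["$149,605.75*", "*$168,270.30", "$1, 000, 000.00",
           "$ 1,147,427.45 CAD", "$150\u00a0000.00"]
print([clean_amount(s) for s in samples])
# [149605.75, 168270.3, 1000000.0, 1147427.45, 150000.0]
```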
Problem #2: The Department Identity Crisis
Five ways to spell one department:
National Research Council Canada (NRC)
National Research Council
National Research Council Canada
National Research Council of Canada
Natiocal Research Council Canada (typo)
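One way to collapse these variants is to strip the parenthetical abbreviation, normalise whitespace and case, and map known spellings to a single canonical name. The sketch below covers only this NRC example; the variant set, canonical spelling and function name are ours:

```python
import re

CANONICAL = "National Research Council Canada"
VARIANTS = {
    "national research council",
    "national research council canada",
    "national research council of canada",
    "natiocal research council canada",  # typo seen in the source data
}

def normalize_department(name: str) -> str:
    base = re.sub(r"\([^)]*\)", "", name)         # drop "(NRC)"-style suffixes
    base = re.sub(r"\s+", " ", base).strip().lower()
    return CANONICAL if base in VARIANTS else name.strip()

print(normalize_department("National Research Council Canada (NRC)"))
# National Research Council Canada
```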
Problem #3: When Businesses Have Multiple Identities
Corporate-suffix chaos means one company can look like three:
BI Expertise Inc ↔ BI Expertise Inc.
Biosa Technologies Limited ↔ Biosa Technologies Ltd.
Few-cycle Inc. ↔ few-cycle Inc.
Pyrogenesis Canada Inc ↔ PyroGenesis Canada Inc. ↔ Pyrogenesis Inc.
Terragon Environmental Technologies Inc. ↔ Terragon Environmental Technologies Inc
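A simple way to make these pairs compare equal is to lower-case names, drop punctuation, and map suffix variants (“Limited” vs “Ltd.”, “Inc” vs “Inc.”) to one spelling. The suffix table below is an illustrative assumption, not an exhaustive standard:

```python
import re

SUFFIXES = {"inc": "inc", "incorporated": "inc",
            "ltd": "ltd", "limited": "ltd",
            "corp": "corp", "corporation": "corp"}

def normalize_company(name: str) -> str:
    tokens = re.sub(r"[.,]", "", name.lower()).split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)

assert normalize_company("BI Expertise Inc") == normalize_company("BI Expertise Inc.")
assert normalize_company("Biosa Technologies Limited") == normalize_company("Biosa Technologies Ltd.")
assert normalize_company("Pyrogenesis Canada Inc") == normalize_company("PyroGenesis Canada Inc.")
```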
Problem #4: The Geographic Redundancy Puzzle
Two columns tell the same story—“City, Province or Territory” and “Province”—but they sometimes disagree. Example: Scarborough, ON tagged as “Quebec”.
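Catching these clashes programmatically is straightforward: parse the two-letter abbreviation out of the combined field and compare it with the spelled-out province. A minimal sketch, assuming the published column layout and a standard abbreviation map:

```python
PROVINCE_BY_ABBREV = {
    "ON": "Ontario", "QC": "Quebec", "BC": "British Columbia",
    "AB": "Alberta", "MB": "Manitoba", "SK": "Saskatchewan",
    "NS": "Nova Scotia", "NB": "New Brunswick",
    "PE": "Prince Edward Island", "NL": "Newfoundland and Labrador",
    "YT": "Yukon", "NT": "Northwest Territories", "NU": "Nunavut",
}

def provinces_disagree(city_prov: str, province: str) -> bool:
    """True when 'City, XX' and the Province column point at different places."""
    abbrev = city_prov.rsplit(",", 1)[-1].strip()
    expected = PROVINCE_BY_ABBREV.get(abbrev)
    return expected is not None and expected != province.strip()

print(provinces_disagree("Scarborough, ON", "Quebec"))  # True
```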
The Quick Take: Why Anyone Should Care
Publishing raw datasets is a cornerstone of modern, digital-by-default government. In theory, anyone can slice the data to see which departments spend what, where the money goes, and which innovators rise to the top. In practice, data quality often stands between citizens and insight.
We need 100 % quality data:
- Clean dollar amounts you can add up.
- One official name per federal department.
- Consistent spelling for every winning company.
- Location fields that agree (city vs province).
Why it matters: When “National Research Council” is spelled five different ways, its spending is split into five buckets. Unparseable dollar cells prevent calculating accurate totals, and a single province mismatch breaks any map. Transparency fails when the data is too dirty to use.
Industry studies peg the price of dirty data at 15–25 % of revenue. For government, the cost is measured in public confidence. Data that is technically “open” but practically unusable creates what experts call the transparency paradox: the appearance of openness without real insight.
It also calls into question how well the Innovative Solutions Canada team itself can build internal metrics to manage its operations, gauge its effectiveness, and account for the tax dollars being spent. These public award records are likely just the tip of the iceberg: behind the scenes, the more detailed operational data and documentation probably hold further quality issues that undermine confidence in how grants and project execution are tracked and managed. Are all projects managed by the same criteria? Are multiple, inconsistent sets of metrics in use?
The Real-World Impact
Every audience pays a different price:
- Citizens: Want totals for your province? Not without cleaning the file first.
- Journalists: Investigations turn into data-wrangling marathons.
- Researchers: Academic studies on innovation funding stall on preprocessing.
- Government: The dataset meant to boost transparency instead erodes trust.
- The Innovative Solutions Canada team itself: it needs this data to run its operations and measure its effectiveness. If our one-off analysis had to wrestle with these issues, the team is dealing with them every day.
Looking Forward
The Innovative Solutions Canada program is still a laudable example of digital-government transparency. But good intentions need good data plumbing. Clean, consistent, analysis-ready datasets aren’t a technical nicety; they are a democratic requirement. Only then can open data deliver on its promise of accountability and informed public debate.
What Needs to Change
- Data Standards: lock down formats before publishing (e.g., dropdowns for department names).
- Validation: automated checks should catch malformed currencies, inconsistent suffixes and mismatched provinces on upload (see the sketch after this list).
- Master Data Management: maintain authoritative lists of departments, companies and geographies.
- Quality Monitoring: schedule routine audits and publish a changelog with each data refresh.
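As promised above, here is a minimal sketch of what an upload-time validation gate could look like, reusing the clean_amount and provinces_disagree helpers sketched earlier. The field names are assumptions based on the published table, not a real ISED schema:

```python
import math

def validate_rows(rows):
    """rows: list of dicts keyed by 'Amount', 'City, Province or Territory'
    and 'Province' (field names assumed). Returns (row index, problem)
    pairs; a publishing pipeline could block upload until this is empty."""
    failures = []
    for i, row in enumerate(rows):
        if math.isnan(clean_amount(row["Amount"])):
            failures.append((i, "unparseable amount"))
        if provinces_disagree(row["City, Province or Territory"], row["Province"]):
            failures.append((i, "city/province mismatch"))
    return failures

print(validate_rows([{"Amount": "$1, 000, 000.00",
                      "City, Province or Territory": "Scarborough, ON",
                      "Province": "Quebec"}]))
# [(0, 'city/province mismatch')]
```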
Analysis Summary
The analysis merged the Phase 1 and Phase 2 tables, profiled each column, and applied rule-based cleaning: stripping non-numeric characters from money fields, collapsing parenthetical abbreviations in department names, normalising corporate suffixes, and comparing city abbreviations to formal province names. No fancy AI: just careful regexes, a touch of fuzzy matching, and plenty of validation checks.
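As a taste of the fuzzy-matching step, the standard library alone can flag near-duplicate company names for manual review. This is our illustration rather than the repo’s code, and the 0.9 cutoff is our choice:

```python
from difflib import get_close_matches

names = ["Pyrogenesis Canada Inc", "PyroGenesis Canada Inc.",
         "Terragon Environmental Technologies Inc.", "BI Expertise Inc"]

for name in names:
    # Compare each name against all the others at a high similarity cutoff.
    candidates = get_close_matches(name, [n for n in names if n != name],
                                   n=1, cutoff=0.9)
    if candidates:
        print(f"possible duplicate: {name!r} ~ {candidates[0]!r}")
```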
The analysis code and datasets can be found in the GitHub repo:
https://github.com/sitrucp/goc_ised_data_quality