Legacy pay rate bands extraction v2 — independent validation
Validator: Senior Data Engineer (peer-review pass, v2 — role-columns variant) Date: 2026-05-18 Subject of validation:
/docs/30-49-domains/35-gig/statistics/2026-05-18-legacy-pay-rate-bands.csv(10-column variant)/docs/30-49-domains/35-gig/scratchpad/2026-05-18-legacy-pay-rate-bands-query.md(overwritten v2)- Baseline: prior
CONFIRMEDvalidator file (...-validation.md).
Verdict
CONFIRMED. Every numeric headline claim in v2 was independently reproduced from the source database; the schema additions (job_types LEFT JOIN, three new projected columns) are correctly applied; the CSV is well-formed; and the title-pollution finding is, if anything, understated rather than overstated — the seeding plan's Stage-2 admin review needs to be more robust than the v1 design assumed.
Per-claim audit
| # | Claim | Executor value | Validator value | Match | Notes |
|---|---|---|---|---|---|
| 1 | CSV row count | 10,166 | 10,166 (1 header + 10,166 data; CSV reader and wc -l = 10,167) | YES | |
| 2 | Reproducing executor's SQL gives the same count | 10,166 | 10,166 | YES | Ran the executor's verification SELECT COUNT(*) block verbatim. |
| 3 | Distinct (company_id, jod_job_title) | 2,231 | 2,231 (from CSV; from DB extract output: 2,231) | YES | Caveat: A naive SELECT COUNT(DISTINCT company_id, job_title) returns 2,170 because the MySQL collation utf8mb4_0900_ai_ci is case-insensitive. The CSV preserves case-sensitive distinct values because each variant appears on a different row (distinguished by other columns), so Python's case-sensitive dedup gives 2,231. The executor's number is the right number for CSV-shaped reporting; both numbers are correct under their respective semantics. |
| 4 | Distinct (company_id, job_type_id) | 80 | 80 | YES | Reproduced both from DB extract and direct SQL. |
| 5 | Good Idea Concepts (219) distinct titles | 499 | 499 (from CSV); 478 (DB-side COUNT DISTINCT due to ci collation) | YES | Same case-sensitivity caveat as #3. |
| 6 | NTUC Foodfare (106) distinct titles | 262 | 262 (CSV); 250 (DB-side ci) | YES | Same caveat. |
| 7 | The Robertson House (244) distinct titles | 233 | 233 (CSV); 233 (DB-side ci) | YES | No case collisions for this account. |
| 8 | Top 10 companies by distinct title count | as reported | identical ordering and counts | YES | See top-15 table in the §"Independent re-query" section below. |
| 9 | Only 7 of 57 job_type_id values in use, IDs 51–57 = Retail, F&B, Hotel, Healthcare, Logistics, Events, Others | 7 of 57 | 7 of 57 — IDs and titles exactly as claimed | YES | SELECT DISTINCT job_type_id against the same filters. |
| 10 | 4 of top 5 companies: distinct (title, job_type_id) == distinct title | claimed | confirmed: 219, 106, 224, 78 are equal; 244 has 236 vs 233 (delta = 3 because 3 titles span 2 job_type_id values, see §"Independent re-query") | YES | |
| 11 | 52 near-duplicate title groups (case/whitespace) | 52 | 52 | YES | Independent grouping: collapse whitespace runs to single space + lowercase, group by (company_id, normalised_title), keep groups with ≥2 raw variants. |
| 12 | 0 NULL jod_job_title rows | 0 | 0 | YES | Both DB-side (NULL/empty count = 0) and CSV-side (empty string count = 0). |
| 13 | 0 NULL job_type_id rows in deduped output | 0 | 0 | YES | Same approach. The LEFT JOIN means NULLs would be preserved as empty string in the CSV; none exist in the filtered data. |
| 14 | Title pollution finding (admins encoding date/worker/time into title) | claimed real | CONFIRMED and understated — see §"Title pollution finding" below | YES | Quantified: 34.5% of distinct titles contain a date pattern; 8.7% contain a time pattern; 77.1% contain parentheses (usually worker name). For the top 8 accounts this is 86–97%. |
Schema verification
DESCRIBE job_types:
| Field | Type | Null | Key | Default |
|---|---|---|---|---|
id | bigint unsigned | NO | PRI | (auto_increment) |
title | varchar(255) | NO | NULL | |
description | varchar(255) | YES | NULL | |
wage_range | varchar(255) | YES | NULL | |
image | varchar(255) | NO | default.jpeg | |
hex_color | varchar(7) | YES | NULL | |
status | smallint | NO | NULL | |
created_by, updated_by | bigint | YES | NULL | |
created_at, updated_at | timestamp | YES | NULL |
SELECT COUNT(*) FROM job_types = 57✓ (matches executor's claim).titleis the human-readable label ✓ — the column the executor joined.- The LEFT JOIN is the right call: an inner join would silently drop
jj.job_type_id IS NULLrows. In the filtered set the count of NULLjob_type_idis 0, so the join shape doesn't affect this extraction, but the LEFT JOIN remains defensible against future data. statusfiltering onjob_typeswas deliberately not applied (executor noted this). This is the correct call — we want the label for whatever taxonomy entry the legacy job points to, regardless of whether the platform-managed taxonomy row is currently active.
Independent re-query
Used to cross-check every numeric headline:
SELECT COUNT(*) FROM (
SELECT DISTINCT
jj.company_id, c.name AS company_name,
jj.location_id, l.name AS location_name,
jj.job_title AS jod_job_title,
jj.job_type_id, jt.title AS job_type_title,
TIME(s.slot_start_date) AS starts_at,
TIME(s.slot_end_date) AS ends_at,
jj.hourly_rate
FROM jod_jobs jj
JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
LEFT JOIN job_types jt ON jt.id = jj.job_type_id
JOIN slots s ON s.jod_job_id = jj.id AND s.deleted_at IS NULL
JOIN slot_user su ON su.slot_id = s.id AND su.deleted_at IS NULL
JOIN payments p ON p.id = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18' AND jj.deleted_at IS NULL AND jj.status = 3
) x;
-- 10,166
SELECT DISTINCT jj.job_type_id, jt.title FROM jod_jobs jj ... (same filters) ORDER BY jj.job_type_id;
-- 51 Retail, 52 F&B, 53 Hotel, 54 Healthcare, 55 Logistics, 56 Events, 57 Others
-- Top 5 companies by distinct title vs distinct (title, job_type_id):
SELECT jj.company_id, c.name,
COUNT(DISTINCT jj.job_title) AS distinct_title,
COUNT(DISTINCT CONCAT(jj.job_title,'|',IFNULL(jj.job_type_id,'NULL'))) AS distinct_title_type
FROM ... GROUP BY jj.company_id, c.name ORDER BY distinct_title DESC LIMIT 5;
-- 219 Good Idea 478 | 478
-- 106 NTUC Foodfare 250 | 250
-- 244 Robertson 233 | 236 <-- 3 titles span 2 job_type_id values
-- 224 Food Theory 172 | 172
-- 78 NTUC Hyper 151 | 151
(The DB-side counts are case-insensitive because of the utf8mb4_0900_ai_ci collation. The CSV-side counts are case-sensitive and equal 499, 262, 233, 175, 160. Both are correct under their respective semantics. The "4 of 5" claim holds in both — the only company with distinct_title_type > distinct_title is Robertson House, regardless of which counting convention you use.)
Top 15 from the DB extract output (case-sensitive, matches CSV):
| company_id | company_name | distinct titles |
|---|---|---|
| 219 | Good Idea Concepts Pte Ltd | 499 |
| 106 | NTUC Foodfare Co-operative Ltd | 262 |
| 244 | The Robertson House | 233 |
| 224 | THE FOOD THEORY GROUP PTE. LTD | 175 |
| 78 | NTUC Fairprice (Hyper) | 160 |
| 240 | GoGoX Singapore | 125 |
| 223 | BBZ DESIGN INTERNATIONAL | 107 |
| 211 | NTUC Fairprice Northeast | 80 |
| 83 | NTUC Fairprice (Finest) | 53 |
| 260 | Shokudo Concepts | 49 |
| 114 | Creative Eateries | 45 |
| 151 | McDonald's | 36 |
| 206 | NTUC Fairprice North | 36 |
| 205 | NTUC Fairprice East | 32 |
| 209 | NTUC Fairprice Northwest | 31 |
CSV integrity
- File line count: 10,167 (= 1 header + 10,166 data rows). ✓
- Header:
company_id,company_name,location_id,location_name,jod_job_title,job_type_id,job_type_title,starts_at,ends_at,hourly_rate— matches the spec exactly (10 columns). ✓ - Column count consistency: every data row has exactly 10 fields under Python
csv.reader. No malformed rows. - NULL handling: the executor's spec says NULL
job_type_idbecomes empty string in the CSV; the filtered data happens to contain zero such rows, so the rule was vacuously applied. No emptyjod_job_titleeither. ✓ - Set match against DB extract: the 10,166 CSV rows exactly equal the 10,166 row output of the extraction SQL (compared by sorted-line equality after running mysql with
-B -N).
Spot checks (10 random CSV rows)
Sampled with random.seed(73). For each row, an exact-match query was run against the source DB using BINARY job_title (case-sensitive) and the row's full grain (company_id, location_id, title, job_type_id, starts_at, ends_at, hourly_rate). The DB returned ≥1 matching slot for every sample.
| # | company | location | title | job_type | times | rate | DB matching slots |
|---|---|---|---|---|---|---|---|
| 1 | 204 | 1552 | Experienced FP Cashier | 51 Retail | 13:00–22:00 | 11.41 | 6 ✓ |
| 2 | 106 | 2266 | Kitchen Assistant (Isaac Lim) | 52 F&B | 10:00–18:00 | 15.35 | 2 ✓ |
| 3 | 219 | 2198 | Service Crew (MMVC) | 52 F&B | 19:00–23:00 | 13.00 | 1 ✓ |
| 4 | 219 | 1732 | Service Crew (20/12/25 Sandeep) | 52 F&B | 22:04–22:30 | 13.00 | 1 ✓ |
| 5 | 240 | 1804 | Customer Support Admin (Siti) | 55 Logistics | 09:00–17:00 | 10.00 | 13 ✓ |
| 6 | 150 | 1323 | Part Time Service Crew | 52 F&B | 12:30–17:00 | 14.00 | 1 ✓ |
| 7 | 219 | 1731 | Service Crew | 52 F&B | 12:00–22:30 | 13.00 | 16 ✓ |
| 8 | 205 | 1569 | Experienced FP Cashier | 51 Retail | 09:00–17:00 | 11.41 | 1 ✓ |
| 9 | 244 | 1838 | Bartender (11/10/25 Yunkai) | 53 Hotel | 16:01–16:30 | 17.00 | 1 ✓ |
| 10 | 78 | 731 | RA Night Refill | 51 Retail | 22:30–07:00 | 13.60 | 24 ✓ |
All ten samples reproduce. The jod_job_title and job_type_title columns are projected verbatim from the source.
Title pollution finding — independent assessment
The executor's finding is CONFIRMED and, if anything, understated. The legacy job_title is being abused as free-form per-shift commentary by a non-trivial fraction of accounts, especially the top-volume ones.
Independent random sample of 15 distinct (company, title) pairs
Seed random.seed(2026):
| company | title |
|---|---|
| 106 NTUC Foodfare | Food Packer (Fu Ang) |
| 114 Creative Eateries | Service Crew (19/05/25 Sebastien) |
| 114 Creative Eateries | Service Crew - Weekday (1/06/25 See Ming) |
| 145 Winter Time | Events@ Changi City Point L1 Atrium |
| 205 NTUC FP East | Experienced Retail Assistant Fresh |
| 206 NTUC FP North | RA Trolley |
| 211 NTUC FP Northeast | RA Dry Non-food |
| 219 Good Idea | Service Crew (MMTP1) |
| 219 Good Idea | Service Crew (Sandeep)(12PM-10PM) |
| 219 Good Idea | Service Crew (Shu Hui)(28/12/25) |
| 224 The Food Theory | Service Crew (1/8/25 Sandeep) |
| 240 GoGoX | Warehouse Assistant (Irfaan 14/2/26) |
| 244 Robertson House | F&B Kitchen Assistant (Hong Wei Xian) 19/9/25 |
| 244 Robertson House | Finance Executive (Mei 1/6/25) |
| 281 Jas Prestige | Service Crew |
11 of 15 sampled titles carry per-shift metadata (worker name, date, time, outlet code). Only 4 are clean role names (Experienced Retail Assistant Fresh, RA Trolley, RA Dry Non-food, Service Crew). The pollution is not concentrated in one outlier account — it appears across at least 7 different companies in this random sample, including F&B (Creative Eateries, Robertson House), retail/logistics (GoGoX), and Good Idea Concepts.
Quantification across all 2,231 distinct titles
| Pattern | Distinct titles matching | % of total |
|---|---|---|
Contains date (e.g. 12/7/25, 1/06/25, 19 Feb) | 769 | 34.5% |
Contains time-of-day (e.g. 8AM-12NN, 0800-1300) | 194 | 8.7% |
| Contains parentheses (usually a worker name) | 1,721 | 77.1% |
Per-company pollution rate (date-pattern OR parens-pattern, across distinct titles)
| company_id | company_name | polluted titles / total | % |
|---|---|---|---|
| 219 | Good Idea Concepts | 478 / 499 | 96% |
| 244 | The Robertson House | 225 / 233 | 97% |
| 106 | NTUC Foodfare | 226 / 262 | 86% |
| 224 | The Food Theory Group | 166 / 175 | 95% |
| 78 | NTUC Fairprice (Hyper) | 117 / 160 | 73% |
| 240 | GoGoX | 116 / 125 | 93% |
| 223 | BBZ Design | 103 / 107 | 96% |
| 260 | Shokudo Concepts | 47 / 49 | 96% |
| 211 | NTUC FP Northeast | 43 / 80 | 54% |
| 114 | Creative Eateries | 36 / 45 | 80% |
For the eight top accounts by title-count, 73–97% of "distinct roles" are actually role + per-shift metadata. The pollution is the dominant phenomenon in this dataset, not a side observation.
What the pollution actually encodes
The "Irfaan" deep dive on GoGoX is illustrative. Within just the suffix-variants for a single worker at one company, the DB shows ~55 distinct titles that are all variations of Customer Support Admin (Irfaan) or Warehouse Assistant (Irfaan) with appended (DD/MM/YY), (HHam-HHpm), (start-end), or combinations. The admin was clearly using the title field as a free-text per-shift assignment label. None of these belong on Org::JobRole.title.
Near-duplicate count cross-check
| Method | Group count |
|---|---|
| Executor | 52 |
Validator (independent: lowercase + collapse whitespace, group by (company_id, normalised_title), count groups with ≥2 raw variants) | 52 |
Top 10 sample groups inspected by hand — all are genuine duplicates (case-only or whitespace-only differences). Examples (validated):
(106) "food packer (chee rong)"— 3 variants:Food Packer (Chee rong),Food Packer (Chee Rong),Food Packer (CHEE RONG).(219) "service crew (leonard)"— 3 variants:Service Crew (LEONARD),Service Crew (leonard),Service Crew (Leonard).(240) "customer support admin (irfaan)"— 3 variants:... (Irfaan),... (irfaan),... (IRFAAN).(78) "retail assistant dairy"— 2 variants differing only by a U+00A0 non-breaking space:Retail Assistant\xa0DairyvsRetail Assistant Dairy.(78) "the bar @ experienced fp cashier (wyley)"— 2 variants:The bAR @ ...vsThe Bar @ ....
Note: this 52 is a lower bound. The true near-duplicate rate is much higher once date/time suffixes are stripped — the deeper "Irfaan" example shows ~55 variants for the same effective role, and that's just one worker at one company.
Issues found
None that block the validation conclusion
No discrepancy was found in the row count, the schema, the JOIN, the projection, the per-company counts, the NULL counts, the job_type identification, or the near-duplicate count. The v2 extract is a faithful execution of the executor's stated query and the CSV is internally well-formed.
Documentation-clarity nit (not blocking)
- The executor's claim "Distinct (company_id, jod_job_title) = 2,231" is correct for CSV reporting but a naive
SELECT COUNT(DISTINCT company_id, job_title)in MySQL returns 2,170 because the database collation isutf8mb4_0900_ai_ci(case- and accent-insensitive). Both numbers are correct under their respective semantics. The query record should add a one-line note that the case-sensitivity of downstream CSV-reading tools and the case-insensitivity of MySQL counting can produce different "distinct" answers on the same dataset.
Material observation for seed-data planning (not a discrepancy)
- The title-pollution finding is at least as bad as the executor described. Stage-2 admin review in UC-8 cannot be a "spot-check the CSV in a spreadsheet" exercise — at top accounts, 73–97% of rows need normalisation, and the noisiness is structural (admins are using
job_titleas per-shift commentary, not as a role label). This affects the design of Stage 2, not Stage 1, and is consistent with the existing model spec; it just raises the cost of the admin step.
Independent recommendation on the seeding implication
The data still supports Org::JobRole as the right design. In fact, the title-pollution evidence makes the design more necessary, not less:
- The pollution is exactly what
Org::JobRoleexists to fix. The reason 219, 244, 106, 224, 240, and 223 each have 100–500 "distinct titles" is precisely that the legacy system has no concept separating "the role being staffed" from "the specific shift, worker, date".Org::JobRoleintroduces that separation. Without the model, these accounts will keep generating one new "role" per shift forever. - The 80 distinct
(company, job_type_id)pairs vs 2,231 distinct(company, title)pairs is the right framing of the consolidation target.job_type_id(Retail / F&B / Hotel / Healthcare / Logistics / Events / Others) is too coarse to be the role spine — McDonald's would have one row for "F&B" and NTUC would have one row for "Retail", which doesn't help anyone configure pay rates. But it sets a useful upper bound: any account with N(title, job_type_id)pairs and Mjob_type_idvalues has somewhere between M and N legitimate roles, and the rest is pollution. For Good Idea Concepts this is somewhere between 1 (F&B only) and ~10–20 (Service Crew, Kitchen Assistant, Barista, Cleaner, etc.), not 499. - The 4-of-5-top-companies-don't-need-job_type_id finding is not a reason to drop
job_type_idfrom the seed pipeline. It means the existing taxonomy is consistent within each role for those four — useful as a sanity-check column at Stage 2, not as a disambiguator. Robertson House (244) shows that the same title can carry twojob_type_ids; the seed importer should pick the modal value and surface the conflicts. - Stage 2 needs more design work than UC-8 currently describes. The current UC-8 spec describes admin review of a CSV; the data shows that at the top accounts, an admin would need to manually consolidate 100–500 rows into 5–20 roles. This is too much manual work to do safely in a spreadsheet without tooling. Recommendation: before running the importer, build a suffix-stripping pre-processor that removes patterns like
(<worker-name>),(<date>),(<time-range>),- DD/MM/YY, etc., and outputs a "suggested canonical role" column for each row. The admin then reviews the suggestions instead of doing the work from scratch.
Bottom line: the v2 extract is solid. Use it. The seeding plan stays directionally right but Stage 2 should be re-scoped from "admin spot-checks the CSV" to "admin reviews a pre-processor's suggested consolidation" — this is a UC-8 update, not a design rethink.