Legacy pay rate bands extraction v2 — independent validation

Validator: Senior Data Engineer (peer-review pass, v2 — role-columns variant) Date: 2026-05-18 Subject of validation:

/docs/30-49-domains/35-gig/statistics/2026-05-18-legacy-pay-rate-bands.csv (10-column variant)
/docs/30-49-domains/35-gig/scratchpad/2026-05-18-legacy-pay-rate-bands-query.md (overwritten v2)
Baseline: prior CONFIRMED validator file (...-validation.md).

Verdict

CONFIRMED. Every numeric headline claim in v2 was independently reproduced from the source database; the schema additions (job_types LEFT JOIN, three new projected columns) are correctly applied; the CSV is well-formed; and the title-pollution finding is, if anything, understated rather than overstated — the seeding plan's Stage-2 admin review needs to be more robust than the v1 design assumed.

Per-claim audit

#	Claim	Executor value	Validator value	Match	Notes
1	CSV row count	10,166	10,166 (1 header + 10,166 data; CSV reader and `wc -l = 10,167`)	YES
2	Reproducing executor's SQL gives the same count	10,166	10,166	YES	Ran the executor's verification `SELECT COUNT(*)` block verbatim.
3	Distinct `(company_id, jod_job_title)`	2,231	2,231 (from CSV; from DB extract output: 2,231)	YES	Caveat: A naive `SELECT COUNT(DISTINCT company_id, job_title)` returns 2,170 because the MySQL collation `utf8mb4_0900_ai_ci` is case-insensitive. The CSV preserves case-sensitive distinct values because each variant appears on a different row (distinguished by other columns), so Python's case-sensitive dedup gives 2,231. The executor's number is the right number for CSV-shaped reporting; both numbers are correct under their respective semantics.
4	Distinct `(company_id, job_type_id)`	80	80	YES	Reproduced both from DB extract and direct SQL.
5	Good Idea Concepts (219) distinct titles	499	499 (from CSV); 478 (DB-side `COUNT DISTINCT` due to ci collation)	YES	Same case-sensitivity caveat as #3.
6	NTUC Foodfare (106) distinct titles	262	262 (CSV); 250 (DB-side ci)	YES	Same caveat.
7	The Robertson House (244) distinct titles	233	233 (CSV); 233 (DB-side ci)	YES	No case collisions for this account.
8	Top 10 companies by distinct title count	as reported	identical ordering and counts	YES	See top-15 table in the §"Independent re-query" section below.
9	Only 7 of 57 `job_type_id` values in use, IDs 51–57 = Retail, F&B, Hotel, Healthcare, Logistics, Events, Others	7 of 57	7 of 57 — IDs and titles exactly as claimed	YES	`SELECT DISTINCT job_type_id` against the same filters.
10	4 of top 5 companies: `distinct (title, job_type_id) == distinct title`	claimed	confirmed: 219, 106, 224, 78 are equal; 244 has 236 vs 233 (delta = 3 because 3 titles span 2 `job_type_id` values, see §"Independent re-query")	YES
11	52 near-duplicate title groups (case/whitespace)	52	52	YES	Independent grouping: collapse whitespace runs to single space + lowercase, group by `(company_id, normalised_title)`, keep groups with ≥2 raw variants.
12	0 NULL `jod_job_title` rows	0	0	YES	Both DB-side (NULL/empty count = 0) and CSV-side (empty string count = 0).
13	0 NULL `job_type_id` rows in deduped output	0	0	YES	Same approach. The LEFT JOIN means NULLs would be preserved as empty string in the CSV; none exist in the filtered data.
14	Title pollution finding (admins encoding date/worker/time into title)	claimed real	CONFIRMED and understated — see §"Title pollution finding" below	YES	Quantified: 34.5% of distinct titles contain a date pattern; 8.7% contain a time pattern; 77.1% contain parentheses (usually worker name). For the top 8 accounts this is 86–97%.

Schema verification

DESCRIBE job_types:

Field	Type	Null	Key	Default
`id`	bigint unsigned	NO	PRI	(auto_increment)
`title`	varchar(255)	NO		NULL
`description`	varchar(255)	YES		NULL
`wage_range`	varchar(255)	YES		NULL
`image`	varchar(255)	NO		default.jpeg
`hex_color`	varchar(7)	YES		NULL
`status`	smallint	NO		NULL
`created_by`, `updated_by`	bigint	YES		NULL
`created_at`, `updated_at`	timestamp	YES		NULL

SELECT COUNT(*) FROM job_types = 57 ✓ (matches executor's claim).
title is the human-readable label ✓ — the column the executor joined.
The LEFT JOIN is the right call: an inner join would silently drop jj.job_type_id IS NULL rows. In the filtered set the count of NULL job_type_id is 0, so the join shape doesn't affect this extraction, but the LEFT JOIN remains defensible against future data.
status filtering on job_types was deliberately not applied (executor noted this). This is the correct call — we want the label for whatever taxonomy entry the legacy job points to, regardless of whether the platform-managed taxonomy row is currently active.

Independent re-query

Used to cross-check every numeric headline:

SELECT COUNT(*) FROM (
  SELECT DISTINCT
    jj.company_id, c.name AS company_name,
    jj.location_id, l.name AS location_name,
    jj.job_title AS jod_job_title,
    jj.job_type_id, jt.title AS job_type_title,
    TIME(s.slot_start_date) AS starts_at,
    TIME(s.slot_end_date)   AS ends_at,
    jj.hourly_rate
  FROM jod_jobs jj
  JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
  JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
  LEFT JOIN job_types jt ON jt.id = jj.job_type_id
  JOIN slots s     ON s.jod_job_id = jj.id  AND s.deleted_at IS NULL
  JOIN slot_user su ON su.slot_id  = s.id   AND su.deleted_at IS NULL
  JOIN payments p   ON p.id        = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
  WHERE jj.created_at >= '2025-05-18' AND jj.deleted_at IS NULL AND jj.status = 3
) x;
-- 10,166

SELECT DISTINCT jj.job_type_id, jt.title FROM jod_jobs jj ... (same filters) ORDER BY jj.job_type_id;
-- 51 Retail, 52 F&B, 53 Hotel, 54 Healthcare, 55 Logistics, 56 Events, 57 Others

-- Top 5 companies by distinct title vs distinct (title, job_type_id):
SELECT jj.company_id, c.name,
       COUNT(DISTINCT jj.job_title) AS distinct_title,
       COUNT(DISTINCT CONCAT(jj.job_title,'|',IFNULL(jj.job_type_id,'NULL'))) AS distinct_title_type
FROM ... GROUP BY jj.company_id, c.name ORDER BY distinct_title DESC LIMIT 5;
-- 219 Good Idea     478 | 478
-- 106 NTUC Foodfare 250 | 250
-- 244 Robertson     233 | 236   <-- 3 titles span 2 job_type_id values
-- 224 Food Theory   172 | 172
--  78 NTUC Hyper    151 | 151

(The DB-side counts are case-insensitive because of the utf8mb4_0900_ai_ci collation. The CSV-side counts are case-sensitive and equal 499, 262, 233, 175, 160. Both are correct under their respective semantics. The "4 of 5" claim holds in both — the only company with distinct_title_type > distinct_title is Robertson House, regardless of which counting convention you use.)

Top 15 from the DB extract output (case-sensitive, matches CSV):

company_id	company_name	distinct titles
219	Good Idea Concepts Pte Ltd	499
106	NTUC Foodfare Co-operative Ltd	262
244	The Robertson House	233
224	THE FOOD THEORY GROUP PTE. LTD	175
78	NTUC Fairprice (Hyper)	160
240	GoGoX Singapore	125
223	BBZ DESIGN INTERNATIONAL	107
211	NTUC Fairprice Northeast	80
83	NTUC Fairprice (Finest)	53
260	Shokudo Concepts	49
114	Creative Eateries	45
151	McDonald's	36
206	NTUC Fairprice North	36
205	NTUC Fairprice East	32
209	NTUC Fairprice Northwest	31

CSV integrity

File line count: 10,167 (= 1 header + 10,166 data rows). ✓
Header: company_id,company_name,location_id,location_name,jod_job_title,job_type_id,job_type_title,starts_at,ends_at,hourly_rate — matches the spec exactly (10 columns). ✓
Column count consistency: every data row has exactly 10 fields under Python csv.reader. No malformed rows.
NULL handling: the executor's spec says NULL job_type_id becomes empty string in the CSV; the filtered data happens to contain zero such rows, so the rule was vacuously applied. No empty jod_job_title either. ✓
Set match against DB extract: the 10,166 CSV rows exactly equal the 10,166 row output of the extraction SQL (compared by sorted-line equality after running mysql with -B -N).

Spot checks (10 random CSV rows)

Sampled with random.seed(73). For each row, an exact-match query was run against the source DB using BINARY job_title (case-sensitive) and the row's full grain (company_id, location_id, title, job_type_id, starts_at, ends_at, hourly_rate). The DB returned ≥1 matching slot for every sample.

#	company	location	title	job_type	times	rate	DB matching slots
1	204	1552	Experienced FP Cashier	51 Retail	13:00–22:00	11.41	6 ✓
2	106	2266	Kitchen Assistant (Isaac Lim)	52 F&B	10:00–18:00	15.35	2 ✓
3	219	2198	Service Crew (MMVC)	52 F&B	19:00–23:00	13.00	1 ✓
4	219	1732	Service Crew (20/12/25 Sandeep)	52 F&B	22:04–22:30	13.00	1 ✓
5	240	1804	Customer Support Admin (Siti)	55 Logistics	09:00–17:00	10.00	13 ✓
6	150	1323	Part Time Service Crew	52 F&B	12:30–17:00	14.00	1 ✓
7	219	1731	Service Crew	52 F&B	12:00–22:30	13.00	16 ✓
8	205	1569	Experienced FP Cashier	51 Retail	09:00–17:00	11.41	1 ✓
9	244	1838	Bartender (11/10/25 Yunkai)	53 Hotel	16:01–16:30	17.00	1 ✓
10	78	731	RA Night Refill	51 Retail	22:30–07:00	13.60	24 ✓

All ten samples reproduce. The jod_job_title and job_type_title columns are projected verbatim from the source.

Title pollution finding — independent assessment

The executor's finding is CONFIRMED and, if anything, understated. The legacy job_title is being abused as free-form per-shift commentary by a non-trivial fraction of accounts, especially the top-volume ones.

Independent random sample of 15 distinct (company, title) pairs

Seed random.seed(2026):

company	title
106 NTUC Foodfare	`Food Packer (Fu Ang)`
114 Creative Eateries	`Service Crew (19/05/25 Sebastien)`
114 Creative Eateries	`Service Crew - Weekday (1/06/25 See Ming)`
145 Winter Time	`Events@ Changi City Point L1 Atrium`
205 NTUC FP East	`Experienced Retail Assistant Fresh`
206 NTUC FP North	`RA Trolley`
211 NTUC FP Northeast	`RA Dry Non-food`
219 Good Idea	`Service Crew (MMTP1)`
219 Good Idea	`Service Crew (Sandeep)(12PM-10PM)`
219 Good Idea	`Service Crew (Shu Hui)(28/12/25)`
224 The Food Theory	`Service Crew (1/8/25 Sandeep)`
240 GoGoX	`Warehouse Assistant (Irfaan 14/2/26)`
244 Robertson House	`F&B Kitchen Assistant (Hong Wei Xian) 19/9/25`
244 Robertson House	`Finance Executive (Mei 1/6/25)`
281 Jas Prestige	`Service Crew`

11 of 15 sampled titles carry per-shift metadata (worker name, date, time, outlet code). Only 4 are clean role names (Experienced Retail Assistant Fresh, RA Trolley, RA Dry Non-food, Service Crew). The pollution is not concentrated in one outlier account — it appears across at least 7 different companies in this random sample, including F&B (Creative Eateries, Robertson House), retail/logistics (GoGoX), and Good Idea Concepts.

Quantification across all 2,231 distinct titles

Pattern	Distinct titles matching	% of total
Contains date (e.g. `12/7/25`, `1/06/25`, `19 Feb`)	769	34.5%
Contains time-of-day (e.g. `8AM-12NN`, `0800-1300`)	194	8.7%
Contains parentheses (usually a worker name)	1,721	77.1%

Per-company pollution rate (date-pattern OR parens-pattern, across distinct titles)

company_id	company_name	polluted titles / total	%
219	Good Idea Concepts	478 / 499	96%
244	The Robertson House	225 / 233	97%
106	NTUC Foodfare	226 / 262	86%
224	The Food Theory Group	166 / 175	95%
78	NTUC Fairprice (Hyper)	117 / 160	73%
240	GoGoX	116 / 125	93%
223	BBZ Design	103 / 107	96%
260	Shokudo Concepts	47 / 49	96%
211	NTUC FP Northeast	43 / 80	54%
114	Creative Eateries	36 / 45	80%

For the eight top accounts by title-count, 73–97% of "distinct roles" are actually role + per-shift metadata. The pollution is the dominant phenomenon in this dataset, not a side observation.

What the pollution actually encodes

The "Irfaan" deep dive on GoGoX is illustrative. Within just the suffix-variants for a single worker at one company, the DB shows ~55 distinct titles that are all variations of Customer Support Admin (Irfaan) or Warehouse Assistant (Irfaan) with appended (DD/MM/YY), (HHam-HHpm), (start-end), or combinations. The admin was clearly using the title field as a free-text per-shift assignment label. None of these belong on Org::JobRole.title.

Near-duplicate count cross-check

Method	Group count
Executor	52
Validator (independent: lowercase + collapse whitespace, group by `(company_id, normalised_title)`, count groups with ≥2 raw variants)	52

Top 10 sample groups inspected by hand — all are genuine duplicates (case-only or whitespace-only differences). Examples (validated):

(106) "food packer (chee rong)" — 3 variants: Food Packer (Chee rong), Food Packer (Chee Rong), Food Packer (CHEE RONG).
(219) "service crew (leonard)" — 3 variants: Service Crew (LEONARD), Service Crew (leonard), Service Crew (Leonard).
(240) "customer support admin (irfaan)" — 3 variants: ... (Irfaan), ... (irfaan), ... (IRFAAN).
(78) "retail assistant dairy" — 2 variants differing only by a U+00A0 non-breaking space: Retail Assistant\xa0Dairy vs Retail Assistant Dairy.
(78) "the bar @ experienced fp cashier (wyley)" — 2 variants: The bAR @ ... vs The Bar @ ....

Note: this 52 is a lower bound. The true near-duplicate rate is much higher once date/time suffixes are stripped — the deeper "Irfaan" example shows ~55 variants for the same effective role, and that's just one worker at one company.

Issues found

None that block the validation conclusion

No discrepancy was found in the row count, the schema, the JOIN, the projection, the per-company counts, the NULL counts, the job_type identification, or the near-duplicate count. The v2 extract is a faithful execution of the executor's stated query and the CSV is internally well-formed.

Documentation-clarity nit (not blocking)

The executor's claim "Distinct (company_id, jod_job_title) = 2,231" is correct for CSV reporting but a naive SELECT COUNT(DISTINCT company_id, job_title) in MySQL returns 2,170 because the database collation is utf8mb4_0900_ai_ci (case- and accent-insensitive). Both numbers are correct under their respective semantics. The query record should add a one-line note that the case-sensitivity of downstream CSV-reading tools and the case-insensitivity of MySQL counting can produce different "distinct" answers on the same dataset.

Material observation for seed-data planning (not a discrepancy)

The title-pollution finding is at least as bad as the executor described. Stage-2 admin review in UC-8 cannot be a "spot-check the CSV in a spreadsheet" exercise — at top accounts, 73–97% of rows need normalisation, and the noisiness is structural (admins are using job_title as per-shift commentary, not as a role label). This affects the design of Stage 2, not Stage 1, and is consistent with the existing model spec; it just raises the cost of the admin step.

Independent recommendation on the seeding implication

The data still supports Org::JobRole as the right design. In fact, the title-pollution evidence makes the design more necessary, not less:

The pollution is exactly what Org::JobRole exists to fix. The reason 219, 244, 106, 224, 240, and 223 each have 100–500 "distinct titles" is precisely that the legacy system has no concept separating "the role being staffed" from "the specific shift, worker, date". Org::JobRole introduces that separation. Without the model, these accounts will keep generating one new "role" per shift forever.
The 80 distinct (company, job_type_id) pairs vs 2,231 distinct (company, title) pairs is the right framing of the consolidation target. job_type_id (Retail / F&B / Hotel / Healthcare / Logistics / Events / Others) is too coarse to be the role spine — McDonald's would have one row for "F&B" and NTUC would have one row for "Retail", which doesn't help anyone configure pay rates. But it sets a useful upper bound: any account with N (title, job_type_id) pairs and M job_type_id values has somewhere between M and N legitimate roles, and the rest is pollution. For Good Idea Concepts this is somewhere between 1 (F&B only) and ~10–20 (Service Crew, Kitchen Assistant, Barista, Cleaner, etc.), not 499.
The 4-of-5-top-companies-don't-need-job_type_id finding is not a reason to drop job_type_id from the seed pipeline. It means the existing taxonomy is consistent within each role for those four — useful as a sanity-check column at Stage 2, not as a disambiguator. Robertson House (244) shows that the same title can carry two job_type_ids; the seed importer should pick the modal value and surface the conflicts.
Stage 2 needs more design work than UC-8 currently describes. The current UC-8 spec describes admin review of a CSV; the data shows that at the top accounts, an admin would need to manually consolidate 100–500 rows into 5–20 roles. This is too much manual work to do safely in a spreadsheet without tooling. Recommendation: before running the importer, build a suffix-stripping pre-processor that removes patterns like (<worker-name>), (<date>), (<time-range>), - DD/MM/YY, etc., and outputs a "suggested canonical role" column for each row. The admin then reviews the suggestions instead of doing the work from scratch.

Bottom line: the v2 extract is solid. Use it. The seeding plan stays directionally right but Stage 2 should be re-scoped from "admin spot-checks the CSV" to "admin reviews a pre-processor's suggested consolidation" — this is a UC-8 update, not a design rethink.

Verdict​

Per-claim audit​

Schema verification​

Independent re-query​

CSV integrity​

Spot checks (10 random CSV rows)​

Title pollution finding — independent assessment​

Independent random sample of 15 distinct (company, title) pairs​

Quantification across all 2,231 distinct titles​

Per-company pollution rate (date-pattern OR parens-pattern, across distinct titles)​

What the pollution actually encodes​

Near-duplicate count cross-check​

Issues found​

None that block the validation conclusion​

Documentation-clarity nit (not blocking)​

Material observation for seed-data planning (not a discrepancy)​

Independent recommendation on the seeding implication​