Skip to main content

Legacy pay rate bands extraction v2 — independent validation

Validator: Senior Data Engineer (peer-review pass, v2 — role-columns variant) Date: 2026-05-18 Subject of validation:

  • /docs/30-49-domains/35-gig/statistics/2026-05-18-legacy-pay-rate-bands.csv (10-column variant)
  • /docs/30-49-domains/35-gig/scratchpad/2026-05-18-legacy-pay-rate-bands-query.md (overwritten v2)
  • Baseline: prior CONFIRMED validator file (...-validation.md).

Verdict

CONFIRMED. Every numeric headline claim in v2 was independently reproduced from the source database; the schema additions (job_types LEFT JOIN, three new projected columns) are correctly applied; the CSV is well-formed; and the title-pollution finding is, if anything, understated rather than overstated — the seeding plan's Stage-2 admin review needs to be more robust than the v1 design assumed.

Per-claim audit

#ClaimExecutor valueValidator valueMatchNotes
1CSV row count10,16610,166 (1 header + 10,166 data; CSV reader and wc -l = 10,167)YES
2Reproducing executor's SQL gives the same count10,16610,166YESRan the executor's verification SELECT COUNT(*) block verbatim.
3Distinct (company_id, jod_job_title)2,2312,231 (from CSV; from DB extract output: 2,231)YESCaveat: A naive SELECT COUNT(DISTINCT company_id, job_title) returns 2,170 because the MySQL collation utf8mb4_0900_ai_ci is case-insensitive. The CSV preserves case-sensitive distinct values because each variant appears on a different row (distinguished by other columns), so Python's case-sensitive dedup gives 2,231. The executor's number is the right number for CSV-shaped reporting; both numbers are correct under their respective semantics.
4Distinct (company_id, job_type_id)8080YESReproduced both from DB extract and direct SQL.
5Good Idea Concepts (219) distinct titles499499 (from CSV); 478 (DB-side COUNT DISTINCT due to ci collation)YESSame case-sensitivity caveat as #3.
6NTUC Foodfare (106) distinct titles262262 (CSV); 250 (DB-side ci)YESSame caveat.
7The Robertson House (244) distinct titles233233 (CSV); 233 (DB-side ci)YESNo case collisions for this account.
8Top 10 companies by distinct title countas reportedidentical ordering and countsYESSee top-15 table in the §"Independent re-query" section below.
9Only 7 of 57 job_type_id values in use, IDs 51–57 = Retail, F&B, Hotel, Healthcare, Logistics, Events, Others7 of 577 of 57 — IDs and titles exactly as claimedYESSELECT DISTINCT job_type_id against the same filters.
104 of top 5 companies: distinct (title, job_type_id) == distinct titleclaimedconfirmed: 219, 106, 224, 78 are equal; 244 has 236 vs 233 (delta = 3 because 3 titles span 2 job_type_id values, see §"Independent re-query")YES
1152 near-duplicate title groups (case/whitespace)5252YESIndependent grouping: collapse whitespace runs to single space + lowercase, group by (company_id, normalised_title), keep groups with ≥2 raw variants.
120 NULL jod_job_title rows00YESBoth DB-side (NULL/empty count = 0) and CSV-side (empty string count = 0).
130 NULL job_type_id rows in deduped output00YESSame approach. The LEFT JOIN means NULLs would be preserved as empty string in the CSV; none exist in the filtered data.
14Title pollution finding (admins encoding date/worker/time into title)claimed realCONFIRMED and understated — see §"Title pollution finding" belowYESQuantified: 34.5% of distinct titles contain a date pattern; 8.7% contain a time pattern; 77.1% contain parentheses (usually worker name). For the top 8 accounts this is 86–97%.

Schema verification

DESCRIBE job_types:

FieldTypeNullKeyDefault
idbigint unsignedNOPRI(auto_increment)
titlevarchar(255)NONULL
descriptionvarchar(255)YESNULL
wage_rangevarchar(255)YESNULL
imagevarchar(255)NOdefault.jpeg
hex_colorvarchar(7)YESNULL
statussmallintNONULL
created_by, updated_bybigintYESNULL
created_at, updated_attimestampYESNULL
  • SELECT COUNT(*) FROM job_types = 57 ✓ (matches executor's claim).
  • title is the human-readable label ✓ — the column the executor joined.
  • The LEFT JOIN is the right call: an inner join would silently drop jj.job_type_id IS NULL rows. In the filtered set the count of NULL job_type_id is 0, so the join shape doesn't affect this extraction, but the LEFT JOIN remains defensible against future data.
  • status filtering on job_types was deliberately not applied (executor noted this). This is the correct call — we want the label for whatever taxonomy entry the legacy job points to, regardless of whether the platform-managed taxonomy row is currently active.

Independent re-query

Used to cross-check every numeric headline:

SELECT COUNT(*) FROM (
SELECT DISTINCT
jj.company_id, c.name AS company_name,
jj.location_id, l.name AS location_name,
jj.job_title AS jod_job_title,
jj.job_type_id, jt.title AS job_type_title,
TIME(s.slot_start_date) AS starts_at,
TIME(s.slot_end_date) AS ends_at,
jj.hourly_rate
FROM jod_jobs jj
JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
LEFT JOIN job_types jt ON jt.id = jj.job_type_id
JOIN slots s ON s.jod_job_id = jj.id AND s.deleted_at IS NULL
JOIN slot_user su ON su.slot_id = s.id AND su.deleted_at IS NULL
JOIN payments p ON p.id = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18' AND jj.deleted_at IS NULL AND jj.status = 3
) x;
-- 10,166
SELECT DISTINCT jj.job_type_id, jt.title FROM jod_jobs jj ... (same filters) ORDER BY jj.job_type_id;
-- 51 Retail, 52 F&B, 53 Hotel, 54 Healthcare, 55 Logistics, 56 Events, 57 Others
-- Top 5 companies by distinct title vs distinct (title, job_type_id):
SELECT jj.company_id, c.name,
COUNT(DISTINCT jj.job_title) AS distinct_title,
COUNT(DISTINCT CONCAT(jj.job_title,'|',IFNULL(jj.job_type_id,'NULL'))) AS distinct_title_type
FROM ... GROUP BY jj.company_id, c.name ORDER BY distinct_title DESC LIMIT 5;
-- 219 Good Idea 478 | 478
-- 106 NTUC Foodfare 250 | 250
-- 244 Robertson 233 | 236 <-- 3 titles span 2 job_type_id values
-- 224 Food Theory 172 | 172
-- 78 NTUC Hyper 151 | 151

(The DB-side counts are case-insensitive because of the utf8mb4_0900_ai_ci collation. The CSV-side counts are case-sensitive and equal 499, 262, 233, 175, 160. Both are correct under their respective semantics. The "4 of 5" claim holds in both — the only company with distinct_title_type > distinct_title is Robertson House, regardless of which counting convention you use.)

Top 15 from the DB extract output (case-sensitive, matches CSV):

company_idcompany_namedistinct titles
219Good Idea Concepts Pte Ltd499
106NTUC Foodfare Co-operative Ltd262
244The Robertson House233
224THE FOOD THEORY GROUP PTE. LTD175
78NTUC Fairprice (Hyper)160
240GoGoX Singapore125
223BBZ DESIGN INTERNATIONAL107
211NTUC Fairprice Northeast80
83NTUC Fairprice (Finest)53
260Shokudo Concepts49
114Creative Eateries45
151McDonald's36
206NTUC Fairprice North36
205NTUC Fairprice East32
209NTUC Fairprice Northwest31

CSV integrity

  • File line count: 10,167 (= 1 header + 10,166 data rows). ✓
  • Header: company_id,company_name,location_id,location_name,jod_job_title,job_type_id,job_type_title,starts_at,ends_at,hourly_rate — matches the spec exactly (10 columns). ✓
  • Column count consistency: every data row has exactly 10 fields under Python csv.reader. No malformed rows.
  • NULL handling: the executor's spec says NULL job_type_id becomes empty string in the CSV; the filtered data happens to contain zero such rows, so the rule was vacuously applied. No empty jod_job_title either. ✓
  • Set match against DB extract: the 10,166 CSV rows exactly equal the 10,166 row output of the extraction SQL (compared by sorted-line equality after running mysql with -B -N).

Spot checks (10 random CSV rows)

Sampled with random.seed(73). For each row, an exact-match query was run against the source DB using BINARY job_title (case-sensitive) and the row's full grain (company_id, location_id, title, job_type_id, starts_at, ends_at, hourly_rate). The DB returned ≥1 matching slot for every sample.

#companylocationtitlejob_typetimesrateDB matching slots
12041552Experienced FP Cashier51 Retail13:00–22:0011.416 ✓
21062266Kitchen Assistant (Isaac Lim)52 F&B10:00–18:0015.352 ✓
32192198Service Crew (MMVC)52 F&B19:00–23:0013.001 ✓
42191732Service Crew (20/12/25 Sandeep)52 F&B22:04–22:3013.001 ✓
52401804Customer Support Admin (Siti)55 Logistics09:00–17:0010.0013 ✓
61501323Part Time Service Crew52 F&B12:30–17:0014.001 ✓
72191731Service Crew52 F&B12:00–22:3013.0016 ✓
82051569Experienced FP Cashier51 Retail09:00–17:0011.411 ✓
92441838Bartender (11/10/25 Yunkai)53 Hotel16:01–16:3017.001 ✓
1078731RA Night Refill51 Retail22:30–07:0013.6024 ✓

All ten samples reproduce. The jod_job_title and job_type_title columns are projected verbatim from the source.

Title pollution finding — independent assessment

The executor's finding is CONFIRMED and, if anything, understated. The legacy job_title is being abused as free-form per-shift commentary by a non-trivial fraction of accounts, especially the top-volume ones.

Independent random sample of 15 distinct (company, title) pairs

Seed random.seed(2026):

companytitle
106 NTUC FoodfareFood Packer (Fu Ang)
114 Creative EateriesService Crew (19/05/25 Sebastien)
114 Creative EateriesService Crew - Weekday (1/06/25 See Ming)
145 Winter TimeEvents@ Changi City Point L1 Atrium
205 NTUC FP EastExperienced Retail Assistant Fresh
206 NTUC FP NorthRA Trolley
211 NTUC FP NortheastRA Dry Non-food
219 Good IdeaService Crew (MMTP1)
219 Good IdeaService Crew (Sandeep)(12PM-10PM)
219 Good IdeaService Crew (Shu Hui)(28/12/25)
224 The Food TheoryService Crew (1/8/25 Sandeep)
240 GoGoXWarehouse Assistant (Irfaan 14/2/26)
244 Robertson HouseF&B Kitchen Assistant (Hong Wei Xian) 19/9/25
244 Robertson HouseFinance Executive (Mei 1/6/25)
281 Jas PrestigeService Crew

11 of 15 sampled titles carry per-shift metadata (worker name, date, time, outlet code). Only 4 are clean role names (Experienced Retail Assistant Fresh, RA Trolley, RA Dry Non-food, Service Crew). The pollution is not concentrated in one outlier account — it appears across at least 7 different companies in this random sample, including F&B (Creative Eateries, Robertson House), retail/logistics (GoGoX), and Good Idea Concepts.

Quantification across all 2,231 distinct titles

PatternDistinct titles matching% of total
Contains date (e.g. 12/7/25, 1/06/25, 19 Feb)76934.5%
Contains time-of-day (e.g. 8AM-12NN, 0800-1300)1948.7%
Contains parentheses (usually a worker name)1,72177.1%

Per-company pollution rate (date-pattern OR parens-pattern, across distinct titles)

company_idcompany_namepolluted titles / total%
219Good Idea Concepts478 / 49996%
244The Robertson House225 / 23397%
106NTUC Foodfare226 / 26286%
224The Food Theory Group166 / 17595%
78NTUC Fairprice (Hyper)117 / 16073%
240GoGoX116 / 12593%
223BBZ Design103 / 10796%
260Shokudo Concepts47 / 4996%
211NTUC FP Northeast43 / 8054%
114Creative Eateries36 / 4580%

For the eight top accounts by title-count, 73–97% of "distinct roles" are actually role + per-shift metadata. The pollution is the dominant phenomenon in this dataset, not a side observation.

What the pollution actually encodes

The "Irfaan" deep dive on GoGoX is illustrative. Within just the suffix-variants for a single worker at one company, the DB shows ~55 distinct titles that are all variations of Customer Support Admin (Irfaan) or Warehouse Assistant (Irfaan) with appended (DD/MM/YY), (HHam-HHpm), (start-end), or combinations. The admin was clearly using the title field as a free-text per-shift assignment label. None of these belong on Org::JobRole.title.

Near-duplicate count cross-check

MethodGroup count
Executor52
Validator (independent: lowercase + collapse whitespace, group by (company_id, normalised_title), count groups with ≥2 raw variants)52

Top 10 sample groups inspected by hand — all are genuine duplicates (case-only or whitespace-only differences). Examples (validated):

  • (106) "food packer (chee rong)" — 3 variants: Food Packer (Chee rong), Food Packer (Chee Rong), Food Packer (CHEE RONG).
  • (219) "service crew (leonard)" — 3 variants: Service Crew (LEONARD), Service Crew (leonard), Service Crew (Leonard).
  • (240) "customer support admin (irfaan)" — 3 variants: ... (Irfaan), ... (irfaan), ... (IRFAAN).
  • (78) "retail assistant dairy" — 2 variants differing only by a U+00A0 non-breaking space: Retail Assistant\xa0Dairy vs Retail Assistant Dairy.
  • (78) "the bar @ experienced fp cashier (wyley)" — 2 variants: The bAR @ ... vs The Bar @ ....

Note: this 52 is a lower bound. The true near-duplicate rate is much higher once date/time suffixes are stripped — the deeper "Irfaan" example shows ~55 variants for the same effective role, and that's just one worker at one company.

Issues found

None that block the validation conclusion

No discrepancy was found in the row count, the schema, the JOIN, the projection, the per-company counts, the NULL counts, the job_type identification, or the near-duplicate count. The v2 extract is a faithful execution of the executor's stated query and the CSV is internally well-formed.

Documentation-clarity nit (not blocking)

  • The executor's claim "Distinct (company_id, jod_job_title) = 2,231" is correct for CSV reporting but a naive SELECT COUNT(DISTINCT company_id, job_title) in MySQL returns 2,170 because the database collation is utf8mb4_0900_ai_ci (case- and accent-insensitive). Both numbers are correct under their respective semantics. The query record should add a one-line note that the case-sensitivity of downstream CSV-reading tools and the case-insensitivity of MySQL counting can produce different "distinct" answers on the same dataset.

Material observation for seed-data planning (not a discrepancy)

  • The title-pollution finding is at least as bad as the executor described. Stage-2 admin review in UC-8 cannot be a "spot-check the CSV in a spreadsheet" exercise — at top accounts, 73–97% of rows need normalisation, and the noisiness is structural (admins are using job_title as per-shift commentary, not as a role label). This affects the design of Stage 2, not Stage 1, and is consistent with the existing model spec; it just raises the cost of the admin step.

Independent recommendation on the seeding implication

The data still supports Org::JobRole as the right design. In fact, the title-pollution evidence makes the design more necessary, not less:

  1. The pollution is exactly what Org::JobRole exists to fix. The reason 219, 244, 106, 224, 240, and 223 each have 100–500 "distinct titles" is precisely that the legacy system has no concept separating "the role being staffed" from "the specific shift, worker, date". Org::JobRole introduces that separation. Without the model, these accounts will keep generating one new "role" per shift forever.
  2. The 80 distinct (company, job_type_id) pairs vs 2,231 distinct (company, title) pairs is the right framing of the consolidation target. job_type_id (Retail / F&B / Hotel / Healthcare / Logistics / Events / Others) is too coarse to be the role spine — McDonald's would have one row for "F&B" and NTUC would have one row for "Retail", which doesn't help anyone configure pay rates. But it sets a useful upper bound: any account with N (title, job_type_id) pairs and M job_type_id values has somewhere between M and N legitimate roles, and the rest is pollution. For Good Idea Concepts this is somewhere between 1 (F&B only) and ~10–20 (Service Crew, Kitchen Assistant, Barista, Cleaner, etc.), not 499.
  3. The 4-of-5-top-companies-don't-need-job_type_id finding is not a reason to drop job_type_id from the seed pipeline. It means the existing taxonomy is consistent within each role for those four — useful as a sanity-check column at Stage 2, not as a disambiguator. Robertson House (244) shows that the same title can carry two job_type_ids; the seed importer should pick the modal value and surface the conflicts.
  4. Stage 2 needs more design work than UC-8 currently describes. The current UC-8 spec describes admin review of a CSV; the data shows that at the top accounts, an admin would need to manually consolidate 100–500 rows into 5–20 roles. This is too much manual work to do safely in a spreadsheet without tooling. Recommendation: before running the importer, build a suffix-stripping pre-processor that removes patterns like (<worker-name>), (<date>), (<time-range>), - DD/MM/YY, etc., and outputs a "suggested canonical role" column for each row. The admin then reviews the suggestions instead of doing the work from scratch.

Bottom line: the v2 extract is solid. Use it. The seeding plan stays directionally right but Stage 2 should be re-scoped from "admin spot-checks the CSV" to "admin reviews a pre-processor's suggested consolidation" — this is a UC-8 update, not a design rethink.