Legacy pay rate bands extraction — independent validation
Validator: Senior Data Engineer (peer-review pass) Date: 2026-05-18 Subject of validation:
/docs/30-49-domains/35-gig/statistics/2026-05-18-legacy-pay-rate-bands.csv/docs/30-49-domains/35-gig/scratchpad/2026-05-18-legacy-pay-rate-bands-query.md
Verdict
CONFIRMED. Every numeric headline claim was independently reproduced from the source database; enum values, active filters, and date arithmetic were re-derived from the legacy PHP code; the CSV is internally well-formed and exactly matches the projection of the underlying query at the (company, location, starts_at, ends_at, hourly_rate) level. One minor caveat noted under "Issues found" regarding seed-data usage (record-of-evidence vs. seed-shape mismatch).
Per-claim audit
| # | Claim | Executor value | Validator value | Match | Notes |
|---|---|---|---|---|---|
| 1 | Rows in CSV | 50,746 | 50,746 (50,747 lines incl. header; csv.reader = 50,746 data rows) | YES | Verified with Python csv reader; awk-on-comma is unreliable because some company names are quoted strings containing commas. |
| 2 | Distinct (company_id, location_id, starts_at, ends_at) | 5,648 | 5,648 | YES | Independently reproduced via DB temp-table aggregate AND from a csv.reader pass over the CSV. |
| 3 | Active companies covered | 69 | 69 | YES | CSV-side: 69 distinct company_id. DB-side aggregate: 69. |
| 4 | Active outlets covered | 384 | 384 | YES | CSV-side: 384 distinct location_id (initial awk reported 383 because some location_name fields contain commas; csv.reader gives 384). |
| 5 | jod_jobs.status = 3 for :completed | 3 | 3 | YES | Constants.php:28 — public const JOB_JOD_STATUS_COMPLETED = 3;. |
| 6 | payments.payment_status = 2 for :processed | 2 | 2 | YES | Constants.php:140 — public const PAYMENT_STATUS_PROCESSED = 2;. |
| 7 | Last-12-month window starts 2025-05-18 | 2025-05-18 | 2025-05-18 | YES | Today is 2026-05-18; 2026-05-18 minus 12 months = 2025-05-18. |
| 8 | companies.status = 1 AND deleted_at IS NULL is "active" | correct | correct | YES | Constants.php:134 COMPANY_STATUS_ENABLE = 1; CompanyRepositoryEloquent.php:114, 221 use this filter pattern. DB counts: 166 active vs 149 disabled (matches executor's "166 active"). |
| 9 | locations.status = 1 AND deleted_at IS NULL is "active" | correct | correct | YES | Constants.php:126 LOCATION_STATUS_ENABLE = 1; LocationRepositoryEloquent.php:498, 539, 560, 581, 605 use this filter pattern. DB counts: 1,312 active vs 1,003 disabled. |
| 10 | Top 5 companies by row count | NTUC FP Hyper (6,145), NTUC FP North (4,793), NTUC FP Northwest (3,879), NTUC Foodfare (3,123), McDonald's (2,812) | identical, exact match | YES | Re-derived from CSV with Counter and cross-checked with GROUP BY company_id, c.name on the DB. |
| 11 | Multi-rate-per-bucket violation rate 19.2% | 19.2% | 19.196% on (company, outlet, role, day_type, band); 19.224% on (company, outlet, day_type, band) | YES | Both groupings round to 19.2%. |
| 12 | One row per slot (no double-counting from multi-worker slots) | claimed | confirmed | YES | DB temp-table: COUNT(*) = COUNT(DISTINCT slot_id) = 50,746. |
| 13 | CSV is a true projection of the DB query (no spurious or missing combinations) | implicit | confirmed | YES | DB-side SELECT DISTINCT company_id, location_id, TIME(start), TIME(end), hourly_rate = 7,286 unique rows. CSV-side same = 7,286 unique rows. Set membership match. |
| 14 | hourly_rate = 0 slots are included in CSV | "few" | 0 | NO (minor) | DB count of hourly_rate = 0 rows passing the same filter is 0. The CSV correctly contains 0 zero-rate rows. The executor's note 5 is wrong in fact but harmless. |
| 15 | Midnight-wrap rows exist | "some" | 1,971 | YES (qualitative) | 1,971 rows have ends_at < starts_at (3.88% of all rows). |
| 16 | Active filter exclusion is small | implicit | 77 slots out of 50,823 dropped by the active filter (0.15%) | YES | Material risk: the rate evidence for those 77 slots from inactive companies/locations is dropped. Not a concern for analysis; relevant for seed data (none of those companies/locations would be migrated anyway). |
Enum verification
Source file
/Users/alaay/jod/repo/jodgig-api/app/Constants/Constants.php
Snippets
// line 13
public const USER_STATUS_ENABLE = 1;
// line 28
public const JOB_JOD_STATUS_COMPLETED = 3;
// line 126
public const LOCATION_STATUS_ENABLE = 1;
// line 134
public const COMPANY_STATUS_ENABLE = 1;
// line 140
public const PAYMENT_STATUS_PROCESSED = 2;
// line 152
public const ENABLE = 1;
Active-filter usage in PHP
app/Repositories/CompanyRepositoryEloquent.php:114—->where('companies.status', Constants::ENABLE).app/Repositories/CompanyRepositoryEloquent.php:221—->where('companies.status', Constants::COMPANY_STATUS_ENABLE).app/Repositories/LocationRepositoryEloquent.php:498—->where('locations.status', 1).app/Repositories/JodJobRepositoryEloquent.php:3154—->where('locations.status', Constants::ENABLE).
The "status = 1" convention is universal in the legacy code. Soft-delete column is deleted_at IS NULL (Laravel default). The executor's filter exactly matches this convention.
Cross-check on payment_status (not status)
The executor was careful to note that payments carries both status and payment_status, and that the processed enum lives on payment_status. Verified: grepping the legacy code shows payment_status = 2 is the universal "processed" filter (e.g., PaymentRepositoryEloquent.php, PaymentService.php); the use of payment_status here is correct.
Independent re-query
Using a different query shape (temp table to materialise once, then aggregate from it) to cross-check all key counts:
CREATE TEMPORARY TABLE tmp_valid_slots AS
SELECT DISTINCT s.id AS slot_id, jj.company_id, jj.location_id,
TIME(s.slot_start_date) AS starts_at,
TIME(s.slot_end_date) AS ends_at,
jj.hourly_rate
FROM jod_jobs jj
JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
JOIN slots s ON s.jod_job_id = jj.id AND s.deleted_at IS NULL
JOIN slot_user su ON su.slot_id = s.id AND su.deleted_at IS NULL
JOIN payments p ON p.id = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18' AND jj.deleted_at IS NULL AND jj.status = 3;
SELECT COUNT(*), COUNT(DISTINCT slot_id),
COUNT(DISTINCT company_id), COUNT(DISTINCT location_id)
FROM tmp_valid_slots;
-- Result: 50746 | 50746 | 69 | 384
SELECT COUNT(*) FROM (
SELECT DISTINCT company_id, location_id, starts_at, ends_at FROM tmp_valid_slots
) x;
-- Result: 5648
A different shape was also tested using EXISTS instead of JOIN to slot_user/payments — but it timed out under the harness's interactive limits. The temp-table approach above is materially independent of the executor's SELECT DISTINCT join-chain query and produced the same row counts.
The 5,648 distinct band keys, 69 distinct companies, 384 distinct locations, and 50,746 total slot rows match the executor's numbers exactly.
CSV integrity
- Row count: 50,747 file lines; 1 header + 50,746 data rows. ✓
- Header:
company_id,company_name,location_id,location_name,starts_at,ends_at,hourly_rate— exactly matches the spec. ✓ - Malformed rows: 0. All 50,746 rows have exactly 7 fields when parsed with the Python
csvmodule. - Hourly-rate parsing: every row's
hourly_rateis a validfloat. No zero-rate rows (executor's note 5 is incorrect on this point — there are zero). - Quoting: company names like
"NTUC Fairprice Co-operative Ltd Supermarkets (Central)"(no embedded commas) are unquoted;"Four Points by Sheraton Singapore, Riverview"is correctly quoted with embedded commas escaped.csv.QUOTE_MINIMALis applied. ✓ - Set match against DB: the set of distinct
(company_id, location_id, starts_at, ends_at, hourly_rate)tuples in the CSV (7,286) is identical to the set computed by running the sameSELECT DISTINCTdirectly on the DB (7,286). No spurious or missing tuples. - Midnight-wrap rows: 1,971 rows have
ends_at < starts_at(overnight shifts). Consistent with the new spec's "midnight wrap is allowed". - Row-count vs distinct-key inflation: 50,746 rows / 5,648 distinct keys = ~9.0 slots per band on average. The duplication is the desired one-row-per-slot shape (used for downstream counting).
Spot checks
Five rows sampled with random.seed(42). For each, the band shape was queried against the DB and the slot count compared with the CSV occurrence count.
| Sample | company_id | location_id | starts_at | ends_at | rate | CSV count | DB COUNT(DISTINCT s.id) | Match |
|---|---|---|---|---|---|---|---|---|
| 1 | 212 (NTUC FP Central) | 1666 (City Square Mall) | 08:00 | 13:00 | 11.41 | 86 | 86 | YES |
| 2 | 78 (NTUC FP Hyper) | 731 (FairPrice Xtra Parkway Parade) | 13:45 | 22:05 | 12.10 | 210 | 210 | YES |
| 3 | 10 (JP Pepperdine) | 997 (Jack's Place Clementi 321) | 22:00 | 23:00 | 13.00 | 8 | 8 | YES (DB raw COUNT(*) = 13 is post-join-inflation; COUNT(DISTINCT s.id) = 8 is the correct figure that matches the CSV) |
| 4 | 244 (The Robertson House) | 1838 (F&B Speakeasy Bar) | 17:00 | 23:00 | 17.00 | 2 | 2 | YES |
| 5 | 151 (McDonald's) | 1161 (McDonald's Downtown East 2) | 09:00 | 17:00 | 16.00 | 5 | 5 | YES |
All five spot checks pass.
Issues found
Minor (do not change the conclusion)
- Executor note 5 wrong on zero-rate rows. The executor wrote "A few jod_jobs have a zero hourly_rate ... not filtered out". In fact there are zero such rows in the filtered set. The CSV does not contain any zero-rate rows. Severity: cosmetic. Update the note.
- Executor's
SELECT DISTINCTincludesslot_id, so theDISTINCTis functionally a no-op for deduplication of the inflated slot_user×payments cross-product. The result is correct (one row per slot) becauseslot_idis part of theDISTINCTkey, but the comment "DISTINCT deduplicates the row-explosion from multiple processed payments per slot" is slightly misleading — what actually deduplicates is the combination ofs.idbeing unique per slot andDISTINCTbeing applied to the projected columns. Severity: doc-clarity nit.
Material (relevant to seeding decision)
- The CSV is not in the shape of
Gig::PayRateand must not be loaded directly as seed data. The CSV is shaped as "one row per slot observation" — it is a record of what shifts actually ran in the last 12 months, with their posted hourly_rate. AGig::PayRateseed would be the deduplicated projection at the(company, outlet, role, day_type, starts_at, ends_at)level with a canonical rate chosen for each bucket. Loading the CSV verbatim would create 50,746Gig::PayRaterows (massively redundant), and 19.2% of the buckets have multiple rates so a unique-constraint violation on the proposed key is guaranteed. Severity: high for seed-data usage; none for analysis usage. Before seeding, a separate step is required to:- aggregate rows to the bucket grain,
- pick a canonical rate per bucket (median? most-recent? most-frequent?),
- infer
day_type(legacy data has no PH calendar; treat PH-fallen-on-weekday slots accordingly), - infer
org_job_role_idmapping (legacyjob_type_id→ newOrg::JobRole), - resolve the 1,404 multi-rate buckets to a single rate per bucket using a documented rule.
role/job_type_idis absent from the CSV but is required for the proposed unique key. Anyone using the CSV for hypothesis-testing of the role-included key needs to re-pull or run the executor's supplementary query that addsjob_type_id. Severity: moderate for seed-data preparation. The validation here used the supplementary query and confirmed the 19.2% figure on the role-inclusive grain.
Methodological
- The day-type approximation is an upper bound. The executor flagged this in note 4. Confirmed:
day_typeis derived fromDAYOFWEEK(slot_start_date)only, with no public-holiday calendar. Singapore has 11 public holidays / year, so a single-digit-percent of "weekday" slots are actuallypublic_holidayin the new schema. This systematically inflates the 19.2% rate-disagreement statistic. The true(company, outlet, role, day_type)rate-disagreement rate with PH detection would be lower — exactly as the executor noted.
Independent recommendation on the unique-key hypothesis
Agree with the executor (with one refinement). The proposed unique key (org_company_id, org_outlet_id, org_job_role_id, day_type, starts_at, ends_at) is the right shape. Evidence:
- 80.8% of legacy buckets at this grain already have a single rate (5,910 of 7,314 buckets). The proposed unique key is correct for the dominant case.
- The 19.2% with rate disagreement decomposes into three causes, none of which invalidate the key:
- Public holidays. Some of these are PH-on-weekday slots that the legacy DB can't distinguish — they would route to a different
day_type = public_holidaybucket in the new schema and disappear from the violation set. Per the executor's note 4, this is the upper bound; the true figure is lower. - Rate revisions over time. A company that raised the Service Crew weekday-day rate from $11 to $12 mid-year has two rates for the same band — but this is a temporal signal, not a key-design issue. The new schema's
Gig::PayRateHistorycaptures the change properly. - Per-shift admin overrides. The job-templates-pay-rate analysis (
statistics/job-templates-pay-rate.md) showed 46% of jobs have a rate override. Some overrides will manifest as multi-rate bands here. The new schema treats the override as a per-Shift attribute, not a PayRate row.
- Public holidays. Some of these are PH-on-weekday slots that the legacy DB can't distinguish — they would route to a different
- The data-quality alternative — making the key looser (drop
starts_at/ends_at) — collapses 7,314 buckets to ~1,500(company, outlet, role, day_type)buckets. That throws away the band signal that the legacy data clearly carries (e.g. McDonald's day vs night, NTUC overnight bands). Wrong direction. - The data-quality alternative — making the key tighter (add
is_default,currency,label) — adds zero discriminating power for this dataset (all legacy data is single-currency, single-default-implied) and would only matter once Indonesia expansion or rate-label proliferation creates a real need. Defer.
Refinement: the unique key should be enforced with nulls_not_distinct=true (Postgres 15+) so that two "whole day_type" rows (both starts_at and ends_at NULL) for the same (company, outlet, role, day_type) cannot coexist. This is already stated in the spec (Invariant 4) — re-affirming after seeing the data: a small number of legacy bands span full days, and treating them as not-distinct in the index avoids needing the Manager-layer NULL-band check on the write path for the most common case. Standard SQL treats NULLs as distinct, so the index alone is not enough — nulls_not_distinct=true is the Postgres feature that makes the constraint behave the way the spec describes.
Bottom line: the data supports the proposed unique key. Build it. The 19.2% multi-rate-per-bucket finding is a data-quality observation, not a key-shape concern — it should be expected and handled at seed-import time by choosing a single canonical rate per bucket, with the legacy slot rows preserved separately as historical evidence (not seeded as PayRate rows).