Legacy pay rate bands extraction — independent validation

Validator: Senior Data Engineer (peer-review pass) Date: 2026-05-18 Subject of validation:

/docs/30-49-domains/35-gig/statistics/2026-05-18-legacy-pay-rate-bands.csv
/docs/30-49-domains/35-gig/scratchpad/2026-05-18-legacy-pay-rate-bands-query.md

Verdict

CONFIRMED. Every numeric headline claim was independently reproduced from the source database; enum values, active filters, and date arithmetic were re-derived from the legacy PHP code; the CSV is internally well-formed and exactly matches the projection of the underlying query at the (company, location, starts_at, ends_at, hourly_rate) level. One minor caveat noted under "Issues found" regarding seed-data usage (record-of-evidence vs. seed-shape mismatch).

Per-claim audit

#	Claim	Executor value	Validator value	Match	Notes
1	Rows in CSV	50,746	50,746 (50,747 lines incl. header; csv.reader = 50,746 data rows)	YES	Verified with Python `csv` reader; awk-on-comma is unreliable because some company names are quoted strings containing commas.
2	Distinct `(company_id, location_id, starts_at, ends_at)`	5,648	5,648	YES	Independently reproduced via DB temp-table aggregate AND from a `csv.reader` pass over the CSV.
3	Active companies covered	69	69	YES	CSV-side: 69 distinct `company_id`. DB-side aggregate: 69.
4	Active outlets covered	384	384	YES	CSV-side: 384 distinct `location_id` (initial awk reported 383 because some location_name fields contain commas; csv.reader gives 384).
5	`jod_jobs.status = 3` for `:completed`	`3`	`3`	YES	`Constants.php:28` — `public const JOB_JOD_STATUS_COMPLETED = 3;`.
6	`payments.payment_status = 2` for `:processed`	`2`	`2`	YES	`Constants.php:140` — `public const PAYMENT_STATUS_PROCESSED = 2;`.
7	Last-12-month window starts `2025-05-18`	`2025-05-18`	`2025-05-18`	YES	Today is 2026-05-18; 2026-05-18 minus 12 months = 2025-05-18.
8	`companies.status = 1 AND deleted_at IS NULL` is "active"	correct	correct	YES	`Constants.php:134` `COMPANY_STATUS_ENABLE = 1`; `CompanyRepositoryEloquent.php:114, 221` use this filter pattern. DB counts: 166 active vs 149 disabled (matches executor's "166 active").
9	`locations.status = 1 AND deleted_at IS NULL` is "active"	correct	correct	YES	`Constants.php:126` `LOCATION_STATUS_ENABLE = 1`; `LocationRepositoryEloquent.php:498, 539, 560, 581, 605` use this filter pattern. DB counts: 1,312 active vs 1,003 disabled.
10	Top 5 companies by row count	NTUC FP Hyper (6,145), NTUC FP North (4,793), NTUC FP Northwest (3,879), NTUC Foodfare (3,123), McDonald's (2,812)	identical, exact match	YES	Re-derived from CSV with `Counter` and cross-checked with `GROUP BY company_id, c.name` on the DB.
11	Multi-rate-per-bucket violation rate 19.2%	19.2%	19.196% on `(company, outlet, role, day_type, band)`; 19.224% on `(company, outlet, day_type, band)`	YES	Both groupings round to 19.2%.
12	One row per slot (no double-counting from multi-worker slots)	claimed	confirmed	YES	DB temp-table: `COUNT(*) = COUNT(DISTINCT slot_id) = 50,746`.
13	CSV is a true projection of the DB query (no spurious or missing combinations)	implicit	confirmed	YES	DB-side `SELECT DISTINCT company_id, location_id, TIME(start), TIME(end), hourly_rate` = 7,286 unique rows. CSV-side same = 7,286 unique rows. Set membership match.
14	`hourly_rate = 0` slots are included in CSV	"few"	0	NO (minor)	DB count of `hourly_rate = 0` rows passing the same filter is 0. The CSV correctly contains 0 zero-rate rows. The executor's note 5 is wrong in fact but harmless.
15	Midnight-wrap rows exist	"some"	1,971	YES (qualitative)	1,971 rows have `ends_at < starts_at` (3.88% of all rows).
16	Active filter exclusion is small	implicit	77 slots out of 50,823 dropped by the active filter (0.15%)	YES	Material risk: the rate evidence for those 77 slots from inactive companies/locations is dropped. Not a concern for analysis; relevant for seed data (none of those companies/locations would be migrated anyway).

Enum verification

Source file

/Users/alaay/jod/repo/jodgig-api/app/Constants/Constants.php

Snippets

// line 13
public const USER_STATUS_ENABLE = 1;
// line 28
public const JOB_JOD_STATUS_COMPLETED = 3;
// line 126
public const LOCATION_STATUS_ENABLE = 1;
// line 134
public const COMPANY_STATUS_ENABLE = 1;
// line 140
public const PAYMENT_STATUS_PROCESSED = 2;
// line 152
public const ENABLE = 1;

Active-filter usage in PHP

app/Repositories/CompanyRepositoryEloquent.php:114 — ->where('companies.status', Constants::ENABLE).
app/Repositories/CompanyRepositoryEloquent.php:221 — ->where('companies.status', Constants::COMPANY_STATUS_ENABLE).
app/Repositories/LocationRepositoryEloquent.php:498 — ->where('locations.status', 1).
app/Repositories/JodJobRepositoryEloquent.php:3154 — ->where('locations.status', Constants::ENABLE).

The "status = 1" convention is universal in the legacy code. Soft-delete column is deleted_at IS NULL (Laravel default). The executor's filter exactly matches this convention.

Cross-check on payment_status (not status)

The executor was careful to note that payments carries both status and payment_status, and that the processed enum lives on payment_status. Verified: grepping the legacy code shows payment_status = 2 is the universal "processed" filter (e.g., PaymentRepositoryEloquent.php, PaymentService.php); the use of payment_status here is correct.

Independent re-query

Using a different query shape (temp table to materialise once, then aggregate from it) to cross-check all key counts:

CREATE TEMPORARY TABLE tmp_valid_slots AS
SELECT DISTINCT s.id AS slot_id, jj.company_id, jj.location_id,
       TIME(s.slot_start_date) AS starts_at,
       TIME(s.slot_end_date)   AS ends_at,
       jj.hourly_rate
FROM jod_jobs jj
JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
JOIN slots s     ON s.jod_job_id = jj.id AND s.deleted_at IS NULL
JOIN slot_user su ON su.slot_id = s.id  AND su.deleted_at IS NULL
JOIN payments p   ON p.id = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18' AND jj.deleted_at IS NULL AND jj.status = 3;

SELECT COUNT(*), COUNT(DISTINCT slot_id),
       COUNT(DISTINCT company_id), COUNT(DISTINCT location_id)
FROM tmp_valid_slots;
-- Result: 50746 | 50746 | 69 | 384

SELECT COUNT(*) FROM (
  SELECT DISTINCT company_id, location_id, starts_at, ends_at FROM tmp_valid_slots
) x;
-- Result: 5648

A different shape was also tested using EXISTS instead of JOIN to slot_user/payments — but it timed out under the harness's interactive limits. The temp-table approach above is materially independent of the executor's SELECT DISTINCT join-chain query and produced the same row counts.

The 5,648 distinct band keys, 69 distinct companies, 384 distinct locations, and 50,746 total slot rows match the executor's numbers exactly.

CSV integrity

Row count: 50,747 file lines; 1 header + 50,746 data rows. ✓
Header: company_id,company_name,location_id,location_name,starts_at,ends_at,hourly_rate — exactly matches the spec. ✓
Malformed rows: 0. All 50,746 rows have exactly 7 fields when parsed with the Python csv module.
Hourly-rate parsing: every row's hourly_rate is a valid float. No zero-rate rows (executor's note 5 is incorrect on this point — there are zero).
Quoting: company names like "NTUC Fairprice Co-operative Ltd Supermarkets (Central)" (no embedded commas) are unquoted; "Four Points by Sheraton Singapore, Riverview" is correctly quoted with embedded commas escaped. csv.QUOTE_MINIMAL is applied. ✓
Set match against DB: the set of distinct (company_id, location_id, starts_at, ends_at, hourly_rate) tuples in the CSV (7,286) is identical to the set computed by running the same SELECT DISTINCT directly on the DB (7,286). No spurious or missing tuples.
Midnight-wrap rows: 1,971 rows have ends_at < starts_at (overnight shifts). Consistent with the new spec's "midnight wrap is allowed".
Row-count vs distinct-key inflation: 50,746 rows / 5,648 distinct keys = ~9.0 slots per band on average. The duplication is the desired one-row-per-slot shape (used for downstream counting).

Spot checks

Five rows sampled with random.seed(42). For each, the band shape was queried against the DB and the slot count compared with the CSV occurrence count.

Sample	company_id	location_id	starts_at	ends_at	rate	CSV count	DB COUNT(DISTINCT s.id)	Match
1	212 (NTUC FP Central)	1666 (City Square Mall)	08:00	13:00	11.41	86	86	YES
2	78 (NTUC FP Hyper)	731 (FairPrice Xtra Parkway Parade)	13:45	22:05	12.10	210	210	YES
3	10 (JP Pepperdine)	997 (Jack's Place Clementi 321)	22:00	23:00	13.00	8	8	YES (DB raw `COUNT(*)` = 13 is post-join-inflation; `COUNT(DISTINCT s.id) = 8` is the correct figure that matches the CSV)
4	244 (The Robertson House)	1838 (F&B Speakeasy Bar)	17:00	23:00	17.00	2	2	YES
5	151 (McDonald's)	1161 (McDonald's Downtown East 2)	09:00	17:00	16.00	5	5	YES

All five spot checks pass.

Issues found

Minor (do not change the conclusion)

Executor note 5 wrong on zero-rate rows. The executor wrote "A few jod_jobs have a zero hourly_rate ... not filtered out". In fact there are zero such rows in the filtered set. The CSV does not contain any zero-rate rows. Severity: cosmetic. Update the note.
Executor's SELECT DISTINCT includes slot_id, so the DISTINCT is functionally a no-op for deduplication of the inflated slot_user×payments cross-product. The result is correct (one row per slot) because slot_id is part of the DISTINCT key, but the comment "DISTINCT deduplicates the row-explosion from multiple processed payments per slot" is slightly misleading — what actually deduplicates is the combination of s.id being unique per slot and DISTINCT being applied to the projected columns. Severity: doc-clarity nit.

Material (relevant to seeding decision)

The CSV is not in the shape of Gig::PayRate and must not be loaded directly as seed data. The CSV is shaped as "one row per slot observation" — it is a record of what shifts actually ran in the last 12 months, with their posted hourly_rate. A Gig::PayRate seed would be the deduplicated projection at the (company, outlet, role, day_type, starts_at, ends_at) level with a canonical rate chosen for each bucket. Loading the CSV verbatim would create 50,746 Gig::PayRate rows (massively redundant), and 19.2% of the buckets have multiple rates so a unique-constraint violation on the proposed key is guaranteed. Severity: high for seed-data usage; none for analysis usage. Before seeding, a separate step is required to:
- aggregate rows to the bucket grain,
- pick a canonical rate per bucket (median? most-recent? most-frequent?),
- infer day_type (legacy data has no PH calendar; treat PH-fallen-on-weekday slots accordingly),
- infer org_job_role_id mapping (legacy job_type_id → new Org::JobRole),
- resolve the 1,404 multi-rate buckets to a single rate per bucket using a documented rule.
role/job_type_id is absent from the CSV but is required for the proposed unique key. Anyone using the CSV for hypothesis-testing of the role-included key needs to re-pull or run the executor's supplementary query that adds job_type_id. Severity: moderate for seed-data preparation. The validation here used the supplementary query and confirmed the 19.2% figure on the role-inclusive grain.

Methodological

The day-type approximation is an upper bound. The executor flagged this in note 4. Confirmed: day_type is derived from DAYOFWEEK(slot_start_date) only, with no public-holiday calendar. Singapore has 11 public holidays / year, so a single-digit-percent of "weekday" slots are actually public_holiday in the new schema. This systematically inflates the 19.2% rate-disagreement statistic. The true (company, outlet, role, day_type) rate-disagreement rate with PH detection would be lower — exactly as the executor noted.

Independent recommendation on the unique-key hypothesis

Agree with the executor (with one refinement). The proposed unique key (org_company_id, org_outlet_id, org_job_role_id, day_type, starts_at, ends_at) is the right shape. Evidence:

80.8% of legacy buckets at this grain already have a single rate (5,910 of 7,314 buckets). The proposed unique key is correct for the dominant case.
The 19.2% with rate disagreement decomposes into three causes, none of which invalidate the key:
- Public holidays. Some of these are PH-on-weekday slots that the legacy DB can't distinguish — they would route to a different day_type = public_holiday bucket in the new schema and disappear from the violation set. Per the executor's note 4, this is the upper bound; the true figure is lower.
- Rate revisions over time. A company that raised the Service Crew weekday-day rate from $11 to $12 mid-year has two rates for the same band — but this is a temporal signal, not a key-design issue. The new schema's Gig::PayRateHistory captures the change properly.
- Per-shift admin overrides. The job-templates-pay-rate analysis (statistics/job-templates-pay-rate.md) showed 46% of jobs have a rate override. Some overrides will manifest as multi-rate bands here. The new schema treats the override as a per-Shift attribute, not a PayRate row.
The data-quality alternative — making the key looser (drop starts_at/ends_at) — collapses 7,314 buckets to ~1,500 (company, outlet, role, day_type) buckets. That throws away the band signal that the legacy data clearly carries (e.g. McDonald's day vs night, NTUC overnight bands). Wrong direction.
The data-quality alternative — making the key tighter (add is_default, currency, label) — adds zero discriminating power for this dataset (all legacy data is single-currency, single-default-implied) and would only matter once Indonesia expansion or rate-label proliferation creates a real need. Defer.

Refinement: the unique key should be enforced with nulls_not_distinct=true (Postgres 15+) so that two "whole day_type" rows (both starts_at and ends_at NULL) for the same (company, outlet, role, day_type) cannot coexist. This is already stated in the spec (Invariant 4) — re-affirming after seeing the data: a small number of legacy bands span full days, and treating them as not-distinct in the index avoids needing the Manager-layer NULL-band check on the write path for the most common case. Standard SQL treats NULLs as distinct, so the index alone is not enough — nulls_not_distinct=true is the Postgres feature that makes the constraint behave the way the spec describes.

Bottom line: the data supports the proposed unique key. Build it. The 19.2% multi-rate-per-bucket finding is a data-quality observation, not a key-shape concern — it should be expected and handled at seed-import time by choosing a single canonical rate per bucket, with the legacy slot rows preserved separately as historical evidence (not seeded as PayRate rows).

Verdict​

Per-claim audit​

Enum verification​

Source file​

Snippets​

Active-filter usage in PHP​

Cross-check on payment_status (not status)​

Independent re-query​

CSV integrity​

Spot checks​

Issues found​

Minor (do not change the conclusion)​

Material (relevant to seeding decision)​

Methodological​

Independent recommendation on the unique-key hypothesis​