Migration: Legacy Pay-Rate Data → `Gig::PayRate`

Source of truth for migrating legacy jodgig pay-rate data into the new Gig::PayRate + Org::JobRole system. Living document — updated as findings, decisions, and pipeline rules evolve.

Purpose

When a client moves from the legacy jodgig PHP system to the new jodapp Rails system, their Gig::PayRate rate cards must be pre-seeded from their last 12 months of activity. The client should log into the new system on day 1 and see their roles, their outlets, and their rate cards already configured — no manual setup required. This document records the validated findings from data analysis and the seeding pipeline that delivers that experience.

What we're migrating

From (legacy):

jodgig.jod_jobs — one row per shift (admins create new rows per shift, not per role)
jodgig.slots — the actual shift timing inside each jod_jobs row
jodgig.payments — gig payment records (status :processed means rate was actually applied)
jodgig.job_templates — per-company template metadata, including cleaner job_title
jodgig.job_types — platform-wide industry taxonomy (Retail, F&B, Hotel, etc.)

To (new system):

Org::JobRole — per-company role taxonomy (title, employment_type, status)
Gig::PayRate — rate cards per (org_company, org_outlet, org_job_role, day_type, starts_at, ends_at)
Gig::CompanySetting / Gig::OutletSetting — gig configuration (seeded with system defaults, not from legacy)

Validated findings from data analysis

Every number here was reproduced independently by a second engineer. See the validation files in scratchpad/ for the per-claim audit.

1. Data volumes (validated)

Metric	Value
Legacy paid completed slots (last 12 months)	50,746
Unique `(company, location, band, rate)` tuples	7,286
Unique `(company, location, role, band, rate)` tuples	10,166
Active companies in scope	69
Active outlets in scope	384
Distinct `(company, raw role title)` pairs	2,231

Source: statistics/2026-05-18-legacy-pay-rate-bands.csv.

2. Per-outlet variation is the norm, not the exception

38 of 40 multi-outlet companies have different band-sets across their outlets. McDonald's has 48 outlets with 48 distinct band-sets. NTUC FairPrice regions similarly diverge. The strategic accounts are exactly the ones where per-outlet variation dominates.

Implication: per-outlet Gig::PayRate rows (and per-outlet Gig::OutletSetting rows) are required. Company-only PayRates would lose 90%+ of the variation in real customer data.

3. Title pollution is severe but already solved in Rails

Raw legacy titles are 86–97% polluted at top accounts (admins encoding worker names, dates, time ranges, and outlet codes into the job_title field). Examples:

Service Crew (Leonard 1/6/25)
Food Packer (Hui Min 25 Feb 11.30-15.30)
F&B Kitchen Assistant 9/1/2026 kenji
Customer Support Admin (Irfaan) — ~55 distinct variants for one effective role for one worker

The Rails sync already normalises these. Gig::TempJobs::JobTitleNormalizer (in jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb) strips suffixes, dates, parens, quantifiers, and modifiers, then titleizes the result. It is the canonical canonicalisation layer.

For NTUC FP Hyper: 151 raw titles → 27 normalised roles (5.6× compression).

Known gap: the normaliser does not yet expand abbreviations. RA Dry and Retail Assistant Dry land as parallel rows. Tracked in jodapp-api #1644.

4. Bands are mostly typing noise, not rate-driven (the key seeding insight)

81% of (outlet, role, day_type) groups have a single hourly_rate across all bands. Admins type bespoke time windows per shift (07:45–16:05, 07:50–16:00, 08:00–17:00…) but every one of those bands for the same outlet+role+day_type pays the same rate.

The remaining 19% of groups with multiple rates are not band-driven either:

Most show temporal rate revisions ($11.41 → $12.10 → $12.91 over the 12-month window)
Some show PH rates leaking into weekday/weekend ($12.10 + $24.20 clean 2× multiplier)

Zero examples found of "rate differs between 09:00–17:00 and 14:00–22:00 within the same day_type". The bands themselves are typing noise.

Seeding rule that follows: collapse to one row per (outlet, role, day_type) with starts_at: NULL, ends_at: NULL (meaning "applies to whole day_type"), using the most recent rate. Keep banded rows only when rates genuinely vary by more than ~20% across bands for the same key — rare in practice, but the schema supports it for the companies that do (e.g. McDonald's at outlets with a real night premium).

Source: NTUC FP Hyper analysis — statistics/2026-05-18-seed-ntuc-fp-hyper.csv and scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.

5. The legacy platform taxonomy is effectively abandoned

Of 57 defined job_types (the legacy platform-wide taxonomy: Cashier, Waiter, Service Crew, etc.), only 7 appear in 12 months of completed-and-paid shifts, and the 7 in use are coarse industry buckets:

`job_type_id`	`job_type_title`
51	Retail
52	F&B
53	Hotel
54	Healthcare
55	Logistics
56	Events
57	Others

The fine-grained roles (IDs 1–30) are dead. job_type_id is useless as a per-company role discriminator. The per-company role taxonomy lives entirely in the (now normalised) jod_jobs.job_title → Org::JobRole.

Seeding pipeline (the production design)

Six stages. Stages 1–2 already exist in production (the gig_temp_jobs Rails sync). Stages 3–6 are the new migration-specific logic that produces the seed CSV.

[Legacy MySQL]
   │
   │ (1) Extract  — last 12 months, active, completed, processed
   │   SQL filter: 12 mo, active company+location, jod_jobs.status=3, payment.status=2
   ▼
[Raw legacy rows with date]
   │
   │ (2) Normalise titles  — Gig::TempJobs::JobTitleNormalizer
   │   Strip suffixes / dates / parens / quantifiers / modifiers
   │   Titleize + capitalise abbreviations
   │   Output: job_title_normalized + role_slug
   ▼
[Normalised legacy rows]
   │
   │ (3) Derive day_type  — calendar lookup (NO heuristic)
   │   Geo::PublicHoliday  → public_holiday
   │   DAYOFWEEK ∈ (Sat, Sun) → weekend
   │   else → weekday
   ▼
[Rows with day_type assigned]
   │
   │ (4) Consolidate rates  — group + collapse
   │   Group by (company, outlet, role, day_type)
   │   If single rate across all bands → one row, NULL/NULL band
   │   If rates vary >20% across bands → keep banded rows (rare)
   │   Otherwise (temporal variation) → pick most recent rate, NULL/NULL band
   │   Set confidence: high (≥10 slots) / medium (3–9) / low (1–2)
   ▼
[Consolidated rate cards]
   │
   │ (5) Generate defaults  — one per (company, role)
   │   org_outlet_id = NULL, starts_at = NULL, ends_at = NULL
   │   hourly_rate = most common rate across all outlets for that role
   │   is_default = true
   ▼
[Seed-ready rows]
   │
   │ (6) Output seed CSV  — 13 columns, importable
   ▼
[gig-pay-rates-seed.csv]
   │
   │ Importer (separate concern)
   ▼
[Org::JobRole + Gig::PayRate rows in production]

Stage 1 — Extract

SQL query reference: scratchpad/2026-05-18-legacy-pay-rate-bands-query.md. Includes the slot date for downstream day_type derivation.

Filters (validated):

jod_jobs.created_at within the last 12 months (relative to migration date)
Active companies and locations (status = 1 or appropriate per jodgig schema)
jod_jobs.status = 3 (:completed)
payment.status = 2 (:processed)

Stage 2 — Title normalisation

Source: jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb.

The normaliser already runs as part of the gig_temp_jobs sync. For migration seeding, two options:

Read from gig_temp_jobs directly if the sync is current. job_title_normalized and role_slug columns are already populated per (remote_company_id, remote_location_id, normalised title).
Re-run the normaliser in-process (Ruby or a Python port of the rules) over a fresh legacy extract. Useful for one-off analyses where gig_temp_jobs is not available.

Known gap, tracked in jodapp-api #1644: the normaliser does not yet expand RA → Retail Assistant, FP → FairPrice, etc. Until that lands, parallel naming survives normalisation and admins will see both RA Dry and Retail Assistant Dry as separate roles in the seed output. Acceptable for v1; will be cleaner once the issue is resolved and the sync is re-run.

Stage 3 — `day_type` derivation

Calendar lookup. No heuristic.

Input data: migration/sg-public-holidays-2025-2026.csv. This file holds Singapore Ministry of Manpower official holidays for 2025 and 2026 and is the seed source for the production Geo::PublicHoliday table.

Logic per slot:

If slot.date exists in Geo::PublicHoliday for SG → public_holiday
Else if DAYOFWEEK(slot.date) is Saturday or Sunday → weekend
Else → weekday

The 1.5× rate-ratio heuristic used in the first NTUC FP Hyper draft is deprecated — only used while Geo::PublicHoliday is unpopulated. For NTUC it happened to work (clean 2× multiplier); for McDonald's (~17% PH premium) it would silently mis-classify every PH slot as weekday.

Stage 4 — Rate consolidation (with band collapse)

The load-bearing stage. For each (remote_company_id, remote_location_id, role_slug, day_type) group:

Collect all distinct hourly_rate values across all bands within the group.
If single rate: emit one row with starts_at: NULL, ends_at: NULL (whole day_type), hourly_rate = the single value, source_slot_count = sum across all bands in the group. Confidence by slot count.
If multiple rates AND the spread is ≤ 20% (temporal variation): emit one row with starts_at: NULL, ends_at: NULL, hourly_rate = the most recent rate. Older rates are dropped (they would have been history entries in the new system). Confidence is medium at best.
If multiple rates AND the spread is > 20% (genuine time-of-day variation): emit one row per band, preserving starts_at / ends_at. Each row's hourly_rate is the most recent for that exact band. This case is rare — only companies like McDonald's with real night premiums hit it.

The 20% threshold is a starting heuristic. Adjust based on cross-company analysis once the pipeline runs at scale.

Stage 5 — Company-wide default selection

For each (remote_company_id, role_slug):

Across all outlet-specific rows already emitted (from Stage 4), find the most common hourly_rate.
Emit one additional row with remote_location_id: NULL, starts_at: NULL, ends_at: NULL, hourly_rate = the most common rate, is_default: true, day_type: weekday (the canonical fallback), confidence: inferred.

This row is the fallback used when a new outlet (created post-migration) has no outlet-specific rate yet. Honours the invariant "exactly one is_default per (company, role)".

Stage 6 — Output seed CSV

Schema: 13 columns. See the seed CSV format section below.

Seed CSV format

File: gig-pay-rates-seed.csv (one file per company, or one consolidated file with all companies).

Column	Type	Notes
`remote_company_id`	int	For resolving to `org_company_id` via the legacy → new mapping
`remote_location_id`	int (nullable)	Empty for company-wide defaults
`role_title`	string	Canonical, from `JobTitleNormalizer.normalize`
`role_slug`	string	Parameterized form; find-or-create key for `Org::JobRole`
`employment_type`	string	Always `gig` for this seed
`day_type`	string	`weekday`, `weekend`, or `public_holiday`
`starts_at`	time (nullable)	`HH:MM` or empty (whole day_type)
`ends_at`	time (nullable)	`HH:MM` or empty
`hourly_rate`	decimal	Local currency (SGD for SG companies)
`is_default`	bool	`true` for exactly one row per `(remote_company_id, role_slug)`
`label`	string	Generated: "Weekday Day", "PH Night", "Default"
`confidence`	enum	`high` (≥10 slots) / `medium` (3–9) / `low` (1–2) / `inferred` (company default) / `manual` (admin-added)
`source_slot_count`	int or string	Backing slot count, or `inferred` / `manual`
`notes`	string (optional)	Free text, e.g. "PH inferred from premium pattern"

Worked example — NTUC FairPrice (Hyper)

Validated draft seed: statistics/2026-05-18-seed-ntuc-fp-hyper.csv (682 rows, before the band-collapse rule was finalised).

After the band-collapse rule (Stage 4 of the production pipeline):

Stage	Count
Legacy paid completed slots (input)	6,230
Distinct raw titles	151
Distinct normalised roles	27
`(outlet, role, day_type)` groups	121
Seed rows (post band-collapse)	~148
└ Outlet-specific PayRate rows	~121
└ Company-wide default rows (one per role)	27
8 outlets covered (1 outlet had zero paid slots)

Per-role and per-outlet breakdowns are in the pipeline record: scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.

Expected production seed across all 69 companies: roughly 3,000–5,000 rows — manageable for a one-time bulk insert.

Importer behaviour

The seed CSV is read by an importer (separate concern from the consolidation pipeline). For each row, in dependency order:

Resolve company: Org::Company by legacy_company_id mapping. Fail loud if missing.
Find-or-create Org::JobRole by (org_company_id, role_title, employment_type). First row for a role creates it; subsequent rows reference it.
Resolve outlet by legacy_location_id, or NULL for company-wide.
Find-or-create Gig::PayRate by (org_company_id, org_outlet_id, org_job_role_id, day_type, starts_at, ends_at). The Manager's no-overlap validation runs here.
Set is_default. The Manager enforces exactly one default per (company, role).
Set currency from the company's geo_country.currency.

Wrap each company's import in a transaction. If any row fails, roll back that company's entire seed and surface the failures for admin review before retry.

Operational decisions (decided 2026-05-18)

Gig::TempJobs::JobTitleNormalizer is the canonical canonicalisation — do not invent a parallel normaliser.
Band-collapse rule: collapse to NULL/NULL band when rates are uniform across bands for the same (outlet, role, day_type). Only keep banded rows when rates genuinely vary by more than 20%.
Most-recent-rate wins for temporal variation (one rate revised multiple times over 12 months).
Calendar lookup, not heuristic, for day_type derivation. Geo::PublicHoliday must be seeded before production runs (SG holidays ready in this folder).
Eager seeding, magic feel — the rate-management UI shows a "review your auto-seeded rates" banner for low-confidence rows; high-confidence rows are silently active from day 1.
No retroactive propagation — a later edit to Gig::CompanySetting does not modify existing Gig::OutletSetting rows or existing Gig::PayRate rows. The migration produces a snapshot; the snapshot is independent thereafter.

Open issues and known limitations

Abbreviation expansion not yet implemented. Result: RA Dry and Retail Assistant Dry land as parallel Org::JobRole rows in the seed for companies with mixed usage. Tracked in jodapp-api #1644. Once resolved and the sync is re-run, the next iteration of the seed will collapse these correctly.
Non-breaking space (U+00A0) handling. Two NTUC entries contain non-breaking spaces between words. The normaliser does not currently squish these to regular spaces. Low-impact; flag if it surfaces beyond NTUC.
Geo::PublicHoliday not yet in production. Defined in gig.dbml as geo_public_holidays. SG holiday data is in this folder (sg-public-holidays-2025-2026.csv) ready to seed. Indonesia holidays need to be added before ID launch.
20% rate-spread threshold for band preservation is a starting heuristic. Revisit after the pipeline runs against McDonald's data (the canonical case for real night-premium banding) to confirm or adjust.
Single-outlet Org::Company cases. The company-wide default row + an outlet-specific row are functionally redundant for single-outlet companies. Acceptable overhead in v1; could be optimised later.

Files in this folder

File	Purpose
`migration-gig-pay-rate.md`	This file. Source of truth.
`sg-public-holidays-2025-2026.csv`	Singapore Ministry of Manpower holidays. Seed source for `Geo::PublicHoliday` and `day_type` derivation input.

References

Specs

models/gig-pay-rate.md — the Gig::PayRate model spec (target state)
33-org/models/org-job-role.md — the Org::JobRole model spec, including UC-8 (bulk import)
models/gig-company-setting.md, models/gig-outlet-setting.md — settings model specs (not affected by this migration)

Decision notes

notes/2026-05-07-decision-gig-pay-rates-org-job-roles.md — why Gig::PayRate + Org::JobRole instead of alternatives
notes/2026-05-18-settings-cascade.md — settings pattern (eager + explicit)

Analysis artefacts

statistics/2026-05-18-legacy-pay-rate-bands.csv — full legacy extract (10,166 deduped rows; evidence base)
scratchpad/2026-05-18-legacy-pay-rate-bands-query.md — the SQL query used
scratchpad/2026-05-18-legacy-pay-rate-bands-validation.md — v1 validation report (CONFIRMED)
scratchpad/2026-05-18-legacy-pay-rate-bands-validation-v2.md — v2 (with role columns) validation report (CONFIRMED)
statistics/2026-05-18-seed-ntuc-fp-hyper.csv — draft seed CSV for NTUC FP Hyper (one company)
scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md — pipeline record incl. Python port of the normaliser
statistics/job-templates-pay-rate.md — original "template explosion" analysis that drove the Gig::PayRate design

Code

jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb — the canonical title normaliser
jodapp-api/app/domains/gig/temp_jobs/sync_service.rb — the live sync that runs the normaliser
jodapp-api/docs/db/gig.dbml — DBML for gig_pay_rates, gig_company_settings, gig_outlet_settings, geo_public_holidays

Purpose​

What we're migrating​

Validated findings from data analysis​

1. Data volumes (validated)​

2. Per-outlet variation is the norm, not the exception​

3. Title pollution is severe but already solved in Rails​

4. Bands are mostly typing noise, not rate-driven (the key seeding insight)​

5. The legacy platform taxonomy is effectively abandoned​

Seeding pipeline (the production design)​

Stage 1 — Extract​

Stage 2 — Title normalisation​

Stage 3 — day_type derivation​

Stage 4 — Rate consolidation (with band collapse)​

Stage 5 — Company-wide default selection​

Stage 6 — Output seed CSV​

Seed CSV format​

Worked example — NTUC FairPrice (Hyper)​

Importer behaviour​

Operational decisions (decided 2026-05-18)​

Open issues and known limitations​

Files in this folder​

References​

Specs​

Decision notes​

Analysis artefacts​

Code​