Migration: Legacy Pay-Rate Data → Gig::PayRate
Source of truth for migrating legacy jodgig pay-rate data into the new Gig::PayRate + Org::JobRole system. Living document — updated as findings, decisions, and pipeline rules evolve.
Purpose
When a client moves from the legacy jodgig PHP system to the new jodapp Rails system, their Gig::PayRate rate cards must be pre-seeded from their last 12 months of activity. The client should log into the new system on day 1 and see their roles, their outlets, and their rate cards already configured — no manual setup required. This document records the validated findings from data analysis and the seeding pipeline that delivers that experience.
What we're migrating
From (legacy):
jodgig.jod_jobs— one row per shift (admins create new rows per shift, not per role)jodgig.slots— the actual shift timing inside eachjod_jobsrowjodgig.payments— gig payment records (status:processedmeans rate was actually applied)jodgig.job_templates— per-company template metadata, including cleanerjob_titlejodgig.job_types— platform-wide industry taxonomy (Retail, F&B, Hotel, etc.)
To (new system):
Org::JobRole— per-company role taxonomy (title,employment_type,status)Gig::PayRate— rate cards per(org_company, org_outlet, org_job_role, day_type, starts_at, ends_at)Gig::CompanySetting/Gig::OutletSetting— gig configuration (seeded with system defaults, not from legacy)
Validated findings from data analysis
Every number here was reproduced independently by a second engineer. See the validation files in scratchpad/ for the per-claim audit.
1. Data volumes (validated)
| Metric | Value |
|---|---|
| Legacy paid completed slots (last 12 months) | 50,746 |
Unique (company, location, band, rate) tuples | 7,286 |
Unique (company, location, role, band, rate) tuples | 10,166 |
| Active companies in scope | 69 |
| Active outlets in scope | 384 |
Distinct (company, raw role title) pairs | 2,231 |
Source: statistics/2026-05-18-legacy-pay-rate-bands.csv.
2. Per-outlet variation is the norm, not the exception
38 of 40 multi-outlet companies have different band-sets across their outlets. McDonald's has 48 outlets with 48 distinct band-sets. NTUC FairPrice regions similarly diverge. The strategic accounts are exactly the ones where per-outlet variation dominates.
Implication: per-outlet Gig::PayRate rows (and per-outlet Gig::OutletSetting rows) are required. Company-only PayRates would lose 90%+ of the variation in real customer data.
3. Title pollution is severe but already solved in Rails
Raw legacy titles are 86–97% polluted at top accounts (admins encoding worker names, dates, time ranges, and outlet codes into the job_title field). Examples:
Service Crew (Leonard 1/6/25)Food Packer (Hui Min 25 Feb 11.30-15.30)F&B Kitchen Assistant 9/1/2026 kenjiCustomer Support Admin (Irfaan)— ~55 distinct variants for one effective role for one worker
The Rails sync already normalises these. Gig::TempJobs::JobTitleNormalizer (in jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb) strips suffixes, dates, parens, quantifiers, and modifiers, then titleizes the result. It is the canonical canonicalisation layer.
For NTUC FP Hyper: 151 raw titles → 27 normalised roles (5.6× compression).
Known gap: the normaliser does not yet expand abbreviations. RA Dry and Retail Assistant Dry land as parallel rows. Tracked in jodapp-api #1644.
4. Bands are mostly typing noise, not rate-driven (the key seeding insight)
81% of (outlet, role, day_type) groups have a single hourly_rate across all bands. Admins type bespoke time windows per shift (07:45–16:05, 07:50–16:00, 08:00–17:00…) but every one of those bands for the same outlet+role+day_type pays the same rate.
The remaining 19% of groups with multiple rates are not band-driven either:
- Most show temporal rate revisions (
$11.41 → $12.10 → $12.91over the 12-month window) - Some show PH rates leaking into weekday/weekend (
$12.10 + $24.20clean 2× multiplier)
Zero examples found of "rate differs between 09:00–17:00 and 14:00–22:00 within the same day_type". The bands themselves are typing noise.
Seeding rule that follows: collapse to one row per (outlet, role, day_type) with starts_at: NULL, ends_at: NULL (meaning "applies to whole day_type"), using the most recent rate. Keep banded rows only when rates genuinely vary by more than ~20% across bands for the same key — rare in practice, but the schema supports it for the companies that do (e.g. McDonald's at outlets with a real night premium).
Source: NTUC FP Hyper analysis — statistics/2026-05-18-seed-ntuc-fp-hyper.csv and scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.
5. The legacy platform taxonomy is effectively abandoned
Of 57 defined job_types (the legacy platform-wide taxonomy: Cashier, Waiter, Service Crew, etc.), only 7 appear in 12 months of completed-and-paid shifts, and the 7 in use are coarse industry buckets:
job_type_id | job_type_title |
|---|---|
| 51 | Retail |
| 52 | F&B |
| 53 | Hotel |
| 54 | Healthcare |
| 55 | Logistics |
| 56 | Events |
| 57 | Others |
The fine-grained roles (IDs 1–30) are dead. job_type_id is useless as a per-company role discriminator. The per-company role taxonomy lives entirely in the (now normalised) jod_jobs.job_title → Org::JobRole.
Seeding pipeline (the production design)
Six stages. Stages 1–2 already exist in production (the gig_temp_jobs Rails sync). Stages 3–6 are the new migration-specific logic that produces the seed CSV.
[Legacy MySQL]
│
│ (1) Extract — last 12 months, active, completed, processed
│ SQL filter: 12 mo, active company+location, jod_jobs.status=3, payment.status=2
▼
[Raw legacy rows with date]
│
│ (2) Normalise titles — Gig::TempJobs::JobTitleNormalizer
│ Strip suffixes / dates / parens / quantifiers / modifiers
│ Titleize + capitalise abbreviations
│ Output: job_title_normalized + role_slug
▼
[Normalised legacy rows]
│
│ (3) Derive day_type — calendar lookup (NO heuristic)
│ Geo::PublicHoliday → public_holiday
│ DAYOFWEEK ∈ (Sat, Sun) → weekend
│ else → weekday
▼
[Rows with day_type assigned]
│
│ (4) Consolidate rates — group + collapse
│ Group by (company, outlet, role, day_type)
│ If single rate across all bands → one row, NULL/NULL band
│ If rates vary >20% across bands → keep banded rows (rare)
│ Otherwise (temporal variation) → pick most recent rate, NULL/NULL band
│ Set confidence: high (≥10 slots) / medium (3–9) / low (1–2)
▼
[Consolidated rate cards]
│
│ (5) Generate defaults — one per (company, role)
│ org_outlet_id = NULL, starts_at = NULL, ends_at = NULL
│ hourly_rate = most common rate across all outlets for that role
│ is_default = true
▼
[Seed-ready rows]
│
│ (6) Output seed CSV — 13 columns, importable
▼
[gig-pay-rates-seed.csv]
│
│ Importer (separate concern)
▼
[Org::JobRole + Gig::PayRate rows in production]
Stage 1 — Extract
SQL query reference: scratchpad/2026-05-18-legacy-pay-rate-bands-query.md. Includes the slot date for downstream day_type derivation.
Filters (validated):
jod_jobs.created_atwithin the last 12 months (relative to migration date)- Active companies and locations (
status = 1or appropriate perjodgigschema) jod_jobs.status = 3(:completed)payment.status = 2(:processed)
Stage 2 — Title normalisation
Source: jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb.
The normaliser already runs as part of the gig_temp_jobs sync. For migration seeding, two options:
- Read from
gig_temp_jobsdirectly if the sync is current.job_title_normalizedandrole_slugcolumns are already populated per(remote_company_id, remote_location_id, normalised title). - Re-run the normaliser in-process (Ruby or a Python port of the rules) over a fresh legacy extract. Useful for one-off analyses where
gig_temp_jobsis not available.
Known gap, tracked in jodapp-api #1644: the normaliser does not yet expand RA → Retail Assistant, FP → FairPrice, etc. Until that lands, parallel naming survives normalisation and admins will see both RA Dry and Retail Assistant Dry as separate roles in the seed output. Acceptable for v1; will be cleaner once the issue is resolved and the sync is re-run.
Stage 3 — day_type derivation
Calendar lookup. No heuristic.
Input data: migration/sg-public-holidays-2025-2026.csv. This file holds Singapore Ministry of Manpower official holidays for 2025 and 2026 and is the seed source for the production Geo::PublicHoliday table.
Logic per slot:
- If
slot.dateexists inGeo::PublicHolidayfor SG →public_holiday - Else if
DAYOFWEEK(slot.date)is Saturday or Sunday →weekend - Else →
weekday
The 1.5× rate-ratio heuristic used in the first NTUC FP Hyper draft is deprecated — only used while Geo::PublicHoliday is unpopulated. For NTUC it happened to work (clean 2× multiplier); for McDonald's (~17% PH premium) it would silently mis-classify every PH slot as weekday.
Stage 4 — Rate consolidation (with band collapse)
The load-bearing stage. For each (remote_company_id, remote_location_id, role_slug, day_type) group:
- Collect all distinct
hourly_ratevalues across all bands within the group. - If single rate: emit one row with
starts_at: NULL, ends_at: NULL(whole day_type),hourly_rate= the single value,source_slot_count= sum across all bands in the group. Confidence by slot count. - If multiple rates AND the spread is ≤ 20% (temporal variation): emit one row with
starts_at: NULL, ends_at: NULL,hourly_rate= the most recent rate. Older rates are dropped (they would have been history entries in the new system). Confidence is medium at best. - If multiple rates AND the spread is > 20% (genuine time-of-day variation): emit one row per band, preserving
starts_at/ends_at. Each row'shourly_rateis the most recent for that exact band. This case is rare — only companies like McDonald's with real night premiums hit it.
The 20% threshold is a starting heuristic. Adjust based on cross-company analysis once the pipeline runs at scale.
Stage 5 — Company-wide default selection
For each (remote_company_id, role_slug):
- Across all outlet-specific rows already emitted (from Stage 4), find the most common
hourly_rate. - Emit one additional row with
remote_location_id: NULL,starts_at: NULL,ends_at: NULL,hourly_rate= the most common rate,is_default: true,day_type: weekday(the canonical fallback),confidence: inferred.
This row is the fallback used when a new outlet (created post-migration) has no outlet-specific rate yet. Honours the invariant "exactly one is_default per (company, role)".
Stage 6 — Output seed CSV
Schema: 13 columns. See the seed CSV format section below.
Seed CSV format
File: gig-pay-rates-seed.csv (one file per company, or one consolidated file with all companies).
| Column | Type | Notes |
|---|---|---|
remote_company_id | int | For resolving to org_company_id via the legacy → new mapping |
remote_location_id | int (nullable) | Empty for company-wide defaults |
role_title | string | Canonical, from JobTitleNormalizer.normalize |
role_slug | string | Parameterized form; find-or-create key for Org::JobRole |
employment_type | string | Always gig for this seed |
day_type | string | weekday, weekend, or public_holiday |
starts_at | time (nullable) | HH:MM or empty (whole day_type) |
ends_at | time (nullable) | HH:MM or empty |
hourly_rate | decimal | Local currency (SGD for SG companies) |
is_default | bool | true for exactly one row per (remote_company_id, role_slug) |
label | string | Generated: "Weekday Day", "PH Night", "Default" |
confidence | enum | high (≥10 slots) / medium (3–9) / low (1–2) / inferred (company default) / manual (admin-added) |
source_slot_count | int or string | Backing slot count, or inferred / manual |
notes | string (optional) | Free text, e.g. "PH inferred from premium pattern" |
Worked example — NTUC FairPrice (Hyper)
Validated draft seed: statistics/2026-05-18-seed-ntuc-fp-hyper.csv (682 rows, before the band-collapse rule was finalised).
After the band-collapse rule (Stage 4 of the production pipeline):
| Stage | Count |
|---|---|
| Legacy paid completed slots (input) | 6,230 |
| Distinct raw titles | 151 |
| Distinct normalised roles | 27 |
(outlet, role, day_type) groups | 121 |
| Seed rows (post band-collapse) | ~148 |
| └ Outlet-specific PayRate rows | ~121 |
| └ Company-wide default rows (one per role) | 27 |
| 8 outlets covered (1 outlet had zero paid slots) |
Per-role and per-outlet breakdowns are in the pipeline record: scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.
Expected production seed across all 69 companies: roughly 3,000–5,000 rows — manageable for a one-time bulk insert.
Importer behaviour
The seed CSV is read by an importer (separate concern from the consolidation pipeline). For each row, in dependency order:
- Resolve company:
Org::Companybylegacy_company_idmapping. Fail loud if missing. - Find-or-create
Org::JobRoleby(org_company_id, role_title, employment_type). First row for a role creates it; subsequent rows reference it. - Resolve outlet by
legacy_location_id, or NULL for company-wide. - Find-or-create
Gig::PayRateby(org_company_id, org_outlet_id, org_job_role_id, day_type, starts_at, ends_at). The Manager's no-overlap validation runs here. - Set
is_default. The Manager enforces exactly one default per(company, role). - Set
currencyfrom the company'sgeo_country.currency.
Wrap each company's import in a transaction. If any row fails, roll back that company's entire seed and surface the failures for admin review before retry.
Operational decisions (decided 2026-05-18)
Gig::TempJobs::JobTitleNormalizeris the canonical canonicalisation — do not invent a parallel normaliser.- Band-collapse rule: collapse to NULL/NULL band when rates are uniform across bands for the same
(outlet, role, day_type). Only keep banded rows when rates genuinely vary by more than 20%. - Most-recent-rate wins for temporal variation (one rate revised multiple times over 12 months).
- Calendar lookup, not heuristic, for
day_typederivation.Geo::PublicHolidaymust be seeded before production runs (SG holidays ready in this folder). - Eager seeding, magic feel — the rate-management UI shows a "review your auto-seeded rates" banner for low-confidence rows; high-confidence rows are silently active from day 1.
- No retroactive propagation — a later edit to
Gig::CompanySettingdoes not modify existingGig::OutletSettingrows or existingGig::PayRaterows. The migration produces a snapshot; the snapshot is independent thereafter.
Open issues and known limitations
- Abbreviation expansion not yet implemented. Result:
RA DryandRetail Assistant Dryland as parallelOrg::JobRolerows in the seed for companies with mixed usage. Tracked in jodapp-api #1644. Once resolved and the sync is re-run, the next iteration of the seed will collapse these correctly. - Non-breaking space (U+00A0) handling. Two NTUC entries contain non-breaking spaces between words. The normaliser does not currently squish these to regular spaces. Low-impact; flag if it surfaces beyond NTUC.
Geo::PublicHolidaynot yet in production. Defined ingig.dbmlasgeo_public_holidays. SG holiday data is in this folder (sg-public-holidays-2025-2026.csv) ready to seed. Indonesia holidays need to be added before ID launch.- 20% rate-spread threshold for band preservation is a starting heuristic. Revisit after the pipeline runs against McDonald's data (the canonical case for real night-premium banding) to confirm or adjust.
- Single-outlet
Org::Companycases. The company-wide default row + an outlet-specific row are functionally redundant for single-outlet companies. Acceptable overhead in v1; could be optimised later.
Files in this folder
| File | Purpose |
|---|---|
migration-gig-pay-rate.md | This file. Source of truth. |
sg-public-holidays-2025-2026.csv | Singapore Ministry of Manpower holidays. Seed source for Geo::PublicHoliday and day_type derivation input. |
References
Specs
models/gig-pay-rate.md— theGig::PayRatemodel spec (target state)33-org/models/org-job-role.md— theOrg::JobRolemodel spec, including UC-8 (bulk import)models/gig-company-setting.md,models/gig-outlet-setting.md— settings model specs (not affected by this migration)
Decision notes
notes/2026-05-07-decision-gig-pay-rates-org-job-roles.md— whyGig::PayRate+Org::JobRoleinstead of alternativesnotes/2026-05-18-settings-cascade.md— settings pattern (eager + explicit)
Analysis artefacts
statistics/2026-05-18-legacy-pay-rate-bands.csv— full legacy extract (10,166 deduped rows; evidence base)scratchpad/2026-05-18-legacy-pay-rate-bands-query.md— the SQL query usedscratchpad/2026-05-18-legacy-pay-rate-bands-validation.md— v1 validation report (CONFIRMED)scratchpad/2026-05-18-legacy-pay-rate-bands-validation-v2.md— v2 (with role columns) validation report (CONFIRMED)statistics/2026-05-18-seed-ntuc-fp-hyper.csv— draft seed CSV for NTUC FP Hyper (one company)scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md— pipeline record incl. Python port of the normaliserstatistics/job-templates-pay-rate.md— original "template explosion" analysis that drove theGig::PayRatedesign
Code
jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb— the canonical title normaliserjodapp-api/app/domains/gig/temp_jobs/sync_service.rb— the live sync that runs the normaliserjodapp-api/docs/db/gig.dbml— DBML forgig_pay_rates,gig_company_settings,gig_outlet_settings,geo_public_holidays