Skip to main content

Migration: Legacy Pay-Rate Data → Gig::PayRate

Source of truth for migrating legacy jodgig pay-rate data into the new Gig::PayRate + Org::JobRole system. Living document — updated as findings, decisions, and pipeline rules evolve.

Purpose

When a client moves from the legacy jodgig PHP system to the new jodapp Rails system, their Gig::PayRate rate cards must be pre-seeded from their last 12 months of activity. The client should log into the new system on day 1 and see their roles, their outlets, and their rate cards already configured — no manual setup required. This document records the validated findings from data analysis and the seeding pipeline that delivers that experience.

What we're migrating

From (legacy):

  • jodgig.jod_jobs — one row per shift (admins create new rows per shift, not per role)
  • jodgig.slots — the actual shift timing inside each jod_jobs row
  • jodgig.payments — gig payment records (status :processed means rate was actually applied)
  • jodgig.job_templates — per-company template metadata, including cleaner job_title
  • jodgig.job_types — platform-wide industry taxonomy (Retail, F&B, Hotel, etc.)

To (new system):

  • Org::JobRole — per-company role taxonomy (title, employment_type, status)
  • Gig::PayRate — rate cards per (org_company, org_outlet, org_job_role, day_type, starts_at, ends_at)
  • Gig::CompanySetting / Gig::OutletSetting — gig configuration (seeded with system defaults, not from legacy)

Validated findings from data analysis

Every number here was reproduced independently by a second engineer. See the validation files in scratchpad/ for the per-claim audit.

1. Data volumes (validated)

MetricValue
Legacy paid completed slots (last 12 months)50,746
Unique (company, location, band, rate) tuples7,286
Unique (company, location, role, band, rate) tuples10,166
Active companies in scope69
Active outlets in scope384
Distinct (company, raw role title) pairs2,231

Source: statistics/2026-05-18-legacy-pay-rate-bands.csv.

2. Per-outlet variation is the norm, not the exception

38 of 40 multi-outlet companies have different band-sets across their outlets. McDonald's has 48 outlets with 48 distinct band-sets. NTUC FairPrice regions similarly diverge. The strategic accounts are exactly the ones where per-outlet variation dominates.

Implication: per-outlet Gig::PayRate rows (and per-outlet Gig::OutletSetting rows) are required. Company-only PayRates would lose 90%+ of the variation in real customer data.

3. Title pollution is severe but already solved in Rails

Raw legacy titles are 86–97% polluted at top accounts (admins encoding worker names, dates, time ranges, and outlet codes into the job_title field). Examples:

  • Service Crew (Leonard 1/6/25)
  • Food Packer (Hui Min 25 Feb 11.30-15.30)
  • F&B Kitchen Assistant 9/1/2026 kenji
  • Customer Support Admin (Irfaan) — ~55 distinct variants for one effective role for one worker

The Rails sync already normalises these. Gig::TempJobs::JobTitleNormalizer (in jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb) strips suffixes, dates, parens, quantifiers, and modifiers, then titleizes the result. It is the canonical canonicalisation layer.

For NTUC FP Hyper: 151 raw titles → 27 normalised roles (5.6× compression).

Known gap: the normaliser does not yet expand abbreviations. RA Dry and Retail Assistant Dry land as parallel rows. Tracked in jodapp-api #1644.

4. Bands are mostly typing noise, not rate-driven (the key seeding insight)

81% of (outlet, role, day_type) groups have a single hourly_rate across all bands. Admins type bespoke time windows per shift (07:45–16:05, 07:50–16:00, 08:00–17:00…) but every one of those bands for the same outlet+role+day_type pays the same rate.

The remaining 19% of groups with multiple rates are not band-driven either:

  • Most show temporal rate revisions ($11.41 → $12.10 → $12.91 over the 12-month window)
  • Some show PH rates leaking into weekday/weekend ($12.10 + $24.20 clean 2× multiplier)

Zero examples found of "rate differs between 09:00–17:00 and 14:00–22:00 within the same day_type". The bands themselves are typing noise.

Seeding rule that follows: collapse to one row per (outlet, role, day_type) with starts_at: NULL, ends_at: NULL (meaning "applies to whole day_type"), using the most recent rate. Keep banded rows only when rates genuinely vary by more than ~20% across bands for the same key — rare in practice, but the schema supports it for the companies that do (e.g. McDonald's at outlets with a real night premium).

Source: NTUC FP Hyper analysis — statistics/2026-05-18-seed-ntuc-fp-hyper.csv and scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.

5. The legacy platform taxonomy is effectively abandoned

Of 57 defined job_types (the legacy platform-wide taxonomy: Cashier, Waiter, Service Crew, etc.), only 7 appear in 12 months of completed-and-paid shifts, and the 7 in use are coarse industry buckets:

job_type_idjob_type_title
51Retail
52F&B
53Hotel
54Healthcare
55Logistics
56Events
57Others

The fine-grained roles (IDs 1–30) are dead. job_type_id is useless as a per-company role discriminator. The per-company role taxonomy lives entirely in the (now normalised) jod_jobs.job_titleOrg::JobRole.

Seeding pipeline (the production design)

Six stages. Stages 1–2 already exist in production (the gig_temp_jobs Rails sync). Stages 3–6 are the new migration-specific logic that produces the seed CSV.

[Legacy MySQL]

│ (1) Extract — last 12 months, active, completed, processed
│ SQL filter: 12 mo, active company+location, jod_jobs.status=3, payment.status=2

[Raw legacy rows with date]

│ (2) Normalise titles — Gig::TempJobs::JobTitleNormalizer
│ Strip suffixes / dates / parens / quantifiers / modifiers
│ Titleize + capitalise abbreviations
│ Output: job_title_normalized + role_slug

[Normalised legacy rows]

│ (3) Derive day_type — calendar lookup (NO heuristic)
│ Geo::PublicHoliday → public_holiday
│ DAYOFWEEK ∈ (Sat, Sun) → weekend
│ else → weekday

[Rows with day_type assigned]

│ (4) Consolidate rates — group + collapse
│ Group by (company, outlet, role, day_type)
│ If single rate across all bands → one row, NULL/NULL band
│ If rates vary >20% across bands → keep banded rows (rare)
│ Otherwise (temporal variation) → pick most recent rate, NULL/NULL band
│ Set confidence: high (≥10 slots) / medium (3–9) / low (1–2)

[Consolidated rate cards]

│ (5) Generate defaults — one per (company, role)
│ org_outlet_id = NULL, starts_at = NULL, ends_at = NULL
│ hourly_rate = most common rate across all outlets for that role
│ is_default = true

[Seed-ready rows]

│ (6) Output seed CSV — 13 columns, importable

[gig-pay-rates-seed.csv]

│ Importer (separate concern)

[Org::JobRole + Gig::PayRate rows in production]

Stage 1 — Extract

SQL query reference: scratchpad/2026-05-18-legacy-pay-rate-bands-query.md. Includes the slot date for downstream day_type derivation.

Filters (validated):

  • jod_jobs.created_at within the last 12 months (relative to migration date)
  • Active companies and locations (status = 1 or appropriate per jodgig schema)
  • jod_jobs.status = 3 (:completed)
  • payment.status = 2 (:processed)

Stage 2 — Title normalisation

Source: jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb.

The normaliser already runs as part of the gig_temp_jobs sync. For migration seeding, two options:

  • Read from gig_temp_jobs directly if the sync is current. job_title_normalized and role_slug columns are already populated per (remote_company_id, remote_location_id, normalised title).
  • Re-run the normaliser in-process (Ruby or a Python port of the rules) over a fresh legacy extract. Useful for one-off analyses where gig_temp_jobs is not available.

Known gap, tracked in jodapp-api #1644: the normaliser does not yet expand RA → Retail Assistant, FP → FairPrice, etc. Until that lands, parallel naming survives normalisation and admins will see both RA Dry and Retail Assistant Dry as separate roles in the seed output. Acceptable for v1; will be cleaner once the issue is resolved and the sync is re-run.

Stage 3 — day_type derivation

Calendar lookup. No heuristic.

Input data: migration/sg-public-holidays-2025-2026.csv. This file holds Singapore Ministry of Manpower official holidays for 2025 and 2026 and is the seed source for the production Geo::PublicHoliday table.

Logic per slot:

  1. If slot.date exists in Geo::PublicHoliday for SG → public_holiday
  2. Else if DAYOFWEEK(slot.date) is Saturday or Sunday → weekend
  3. Else → weekday

The 1.5× rate-ratio heuristic used in the first NTUC FP Hyper draft is deprecated — only used while Geo::PublicHoliday is unpopulated. For NTUC it happened to work (clean 2× multiplier); for McDonald's (~17% PH premium) it would silently mis-classify every PH slot as weekday.

Stage 4 — Rate consolidation (with band collapse)

The load-bearing stage. For each (remote_company_id, remote_location_id, role_slug, day_type) group:

  1. Collect all distinct hourly_rate values across all bands within the group.
  2. If single rate: emit one row with starts_at: NULL, ends_at: NULL (whole day_type), hourly_rate = the single value, source_slot_count = sum across all bands in the group. Confidence by slot count.
  3. If multiple rates AND the spread is ≤ 20% (temporal variation): emit one row with starts_at: NULL, ends_at: NULL, hourly_rate = the most recent rate. Older rates are dropped (they would have been history entries in the new system). Confidence is medium at best.
  4. If multiple rates AND the spread is > 20% (genuine time-of-day variation): emit one row per band, preserving starts_at / ends_at. Each row's hourly_rate is the most recent for that exact band. This case is rare — only companies like McDonald's with real night premiums hit it.

The 20% threshold is a starting heuristic. Adjust based on cross-company analysis once the pipeline runs at scale.

Stage 5 — Company-wide default selection

For each (remote_company_id, role_slug):

  1. Across all outlet-specific rows already emitted (from Stage 4), find the most common hourly_rate.
  2. Emit one additional row with remote_location_id: NULL, starts_at: NULL, ends_at: NULL, hourly_rate = the most common rate, is_default: true, day_type: weekday (the canonical fallback), confidence: inferred.

This row is the fallback used when a new outlet (created post-migration) has no outlet-specific rate yet. Honours the invariant "exactly one is_default per (company, role)".

Stage 6 — Output seed CSV

Schema: 13 columns. See the seed CSV format section below.

Seed CSV format

File: gig-pay-rates-seed.csv (one file per company, or one consolidated file with all companies).

ColumnTypeNotes
remote_company_idintFor resolving to org_company_id via the legacy → new mapping
remote_location_idint (nullable)Empty for company-wide defaults
role_titlestringCanonical, from JobTitleNormalizer.normalize
role_slugstringParameterized form; find-or-create key for Org::JobRole
employment_typestringAlways gig for this seed
day_typestringweekday, weekend, or public_holiday
starts_attime (nullable)HH:MM or empty (whole day_type)
ends_attime (nullable)HH:MM or empty
hourly_ratedecimalLocal currency (SGD for SG companies)
is_defaultbooltrue for exactly one row per (remote_company_id, role_slug)
labelstringGenerated: "Weekday Day", "PH Night", "Default"
confidenceenumhigh (≥10 slots) / medium (3–9) / low (1–2) / inferred (company default) / manual (admin-added)
source_slot_countint or stringBacking slot count, or inferred / manual
notesstring (optional)Free text, e.g. "PH inferred from premium pattern"

Worked example — NTUC FairPrice (Hyper)

Validated draft seed: statistics/2026-05-18-seed-ntuc-fp-hyper.csv (682 rows, before the band-collapse rule was finalised).

After the band-collapse rule (Stage 4 of the production pipeline):

StageCount
Legacy paid completed slots (input)6,230
Distinct raw titles151
Distinct normalised roles27
(outlet, role, day_type) groups121
Seed rows (post band-collapse)~148
└ Outlet-specific PayRate rows~121
└ Company-wide default rows (one per role)27
8 outlets covered (1 outlet had zero paid slots)

Per-role and per-outlet breakdowns are in the pipeline record: scratchpad/2026-05-18-seed-ntuc-fp-hyper-pipeline.md.

Expected production seed across all 69 companies: roughly 3,000–5,000 rows — manageable for a one-time bulk insert.

Importer behaviour

The seed CSV is read by an importer (separate concern from the consolidation pipeline). For each row, in dependency order:

  1. Resolve company: Org::Company by legacy_company_id mapping. Fail loud if missing.
  2. Find-or-create Org::JobRole by (org_company_id, role_title, employment_type). First row for a role creates it; subsequent rows reference it.
  3. Resolve outlet by legacy_location_id, or NULL for company-wide.
  4. Find-or-create Gig::PayRate by (org_company_id, org_outlet_id, org_job_role_id, day_type, starts_at, ends_at). The Manager's no-overlap validation runs here.
  5. Set is_default. The Manager enforces exactly one default per (company, role).
  6. Set currency from the company's geo_country.currency.

Wrap each company's import in a transaction. If any row fails, roll back that company's entire seed and surface the failures for admin review before retry.

Operational decisions (decided 2026-05-18)

  1. Gig::TempJobs::JobTitleNormalizer is the canonical canonicalisation — do not invent a parallel normaliser.
  2. Band-collapse rule: collapse to NULL/NULL band when rates are uniform across bands for the same (outlet, role, day_type). Only keep banded rows when rates genuinely vary by more than 20%.
  3. Most-recent-rate wins for temporal variation (one rate revised multiple times over 12 months).
  4. Calendar lookup, not heuristic, for day_type derivation. Geo::PublicHoliday must be seeded before production runs (SG holidays ready in this folder).
  5. Eager seeding, magic feel — the rate-management UI shows a "review your auto-seeded rates" banner for low-confidence rows; high-confidence rows are silently active from day 1.
  6. No retroactive propagation — a later edit to Gig::CompanySetting does not modify existing Gig::OutletSetting rows or existing Gig::PayRate rows. The migration produces a snapshot; the snapshot is independent thereafter.

Open issues and known limitations

  1. Abbreviation expansion not yet implemented. Result: RA Dry and Retail Assistant Dry land as parallel Org::JobRole rows in the seed for companies with mixed usage. Tracked in jodapp-api #1644. Once resolved and the sync is re-run, the next iteration of the seed will collapse these correctly.
  2. Non-breaking space (U+00A0) handling. Two NTUC entries contain non-breaking spaces between words. The normaliser does not currently squish these to regular spaces. Low-impact; flag if it surfaces beyond NTUC.
  3. Geo::PublicHoliday not yet in production. Defined in gig.dbml as geo_public_holidays. SG holiday data is in this folder (sg-public-holidays-2025-2026.csv) ready to seed. Indonesia holidays need to be added before ID launch.
  4. 20% rate-spread threshold for band preservation is a starting heuristic. Revisit after the pipeline runs against McDonald's data (the canonical case for real night-premium banding) to confirm or adjust.
  5. Single-outlet Org::Company cases. The company-wide default row + an outlet-specific row are functionally redundant for single-outlet companies. Acceptable overhead in v1; could be optimised later.

Files in this folder

FilePurpose
migration-gig-pay-rate.mdThis file. Source of truth.
sg-public-holidays-2025-2026.csvSingapore Ministry of Manpower holidays. Seed source for Geo::PublicHoliday and day_type derivation input.

References

Specs

Decision notes

Analysis artefacts

Code

  • jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb — the canonical title normaliser
  • jodapp-api/app/domains/gig/temp_jobs/sync_service.rb — the live sync that runs the normaliser
  • jodapp-api/docs/db/gig.dbml — DBML for gig_pay_rates, gig_company_settings, gig_outlet_settings, geo_public_holidays