Seed pipeline record — NTUC FairPrice Hyper (company_id 78)

Date: 2026-05-18 Database: jodgig_2026_clean Window: 2025-05-18 → 2026-05-18 (last 12 months, matching the prior legacy bands extract) Output CSV: /docs/30-49-domains/35-gig/statistics/2026-05-18-seed-ntuc-fp-hyper.csv Source extract: based on the verified query in 2026-05-18-legacy-pay-rate-bands-query.md, restricted to company_id = 78 and with slot_date, job_template_id, and template_job_title projected in.

Purpose

Produce a draft Gig::PayRate seed CSV for one familiar company so the team can judge the shape before any pipeline is built. NTUC FP Hyper is the chosen case study: 8 active outlets, 151 distinct raw job_title values, 6,230 paid completed slots in the window.

Pipeline steps

Extract slot-level data from jodgig_2026_clean (SQL below).
Normalise role title with a Python port of Gig::TempJobs::JobTitleNormalizer, preferring job_templates.job_title over jod_jobs.job_title (the Ruby class's "Selection Strategy").
Derive day_type per slot. Weekend = Sat/Sun. PH = heuristic (rate ≥ 1.5× the modal weekday rate for the same outlet/role/band).
Consolidate to seed rows by (remote_company_id, remote_location_id, role_slug, day_type, starts_at, ends_at). Most-recent rate wins (using slot date for recency). source_slot_count = group size. Confidence = high (≥10), medium (3–9), low (1–2).
Add one company-wide default row per (company, role_slug). Rate = mode of the outlet-specific rates for that role. is_default = true, source_slot_count = inferred.
Generate human labels using the lookup defined in the task spec.
Write the 13-column CSV.

Extraction SQL

The query is the same shape as 2026-05-18-legacy-pay-rate-bands-query.md but adds slot_date, job_template_id, and template_job_title so the normaliser can apply its "prefer template" rule, and the PH heuristic + recency logic have a date to work with.

SELECT
  jj.id              AS jod_job_id,
  jj.company_id,
  jj.location_id,
  l.name             AS location_name,
  jj.job_title       AS jod_job_title,
  jj.job_template_id,
  jt.job_title       AS template_job_title,
  s.id               AS slot_id,
  DATE(s.slot_start_date) AS slot_date,
  TIME(s.slot_start_date) AS starts_at,
  TIME(s.slot_end_date)   AS ends_at,
  jj.hourly_rate
FROM jod_jobs jj
JOIN companies     c  ON c.id  = jj.company_id  AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations     l  ON l.id  = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
LEFT JOIN job_templates jt ON jt.id = jj.job_template_id
JOIN slots         s  ON s.jod_job_id  = jj.id   AND s.deleted_at IS NULL
JOIN slot_user     su ON su.slot_id    = s.id    AND su.deleted_at IS NULL
JOIN payments      p  ON p.id          = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18'
  AND jj.deleted_at IS NULL
  AND jj.status = 3
  AND jj.company_id = 78
ORDER BY jj.location_id, jod_job_title, slot_date, starts_at, ends_at;

Raw rows returned: 6,230.
Active outlets touched: 8 (603, 676, 681 if present, 682, 692, 693, 694, 695, 731 — note 681 has zero paid completed slots in the window, so it does not appear).
Distinct raw effective_title values (template or job title): 151.

job_types/job_type_id is intentionally not joined: the normaliser works off the free-text title (with template fallback). job_type_id is the legacy platform taxonomy and is not what Org::JobRole will key on.

Python normaliser port

Pasted in full so a reviewer (or the v1 validator) can verify it matches the Ruby original at jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb. Step numbers are kept in 1:1 correspondence with the Ruby source.

import re

JOB_TITLE_MODIFIERS_TO_REMOVE = [
    # --- time related ---
    'cny', 'full shift', 'full time', 'full-time', 'night shift',
    'part time', 'part-time', 'weekday',
    # --- experience related  ---
    'experienced', 'junior', 'senior',
    # --- context/ random words/chars ---
    'the bar experienced', 'unity experienced', 'x',
]

CAPITALIZED_ABBREVIATIONS = {
    '4d': '4D', 'boh': 'BOH', 'cmx': 'CMX', 'ffs': 'FFS', 'fp': 'FP',
    'jr': 'JR', 'jt': 'JT', "mcdonald's": "McDonald's", 'ntuc': 'NTUC',
    'pt': 'PT', 'ra': 'RA', 'sco': 'SCO',
}


def _ruby_titleize(s: str) -> str:
    # ActiveSupport titleize: capitalize every word. After the previous cleanup
    # steps reduce the input to [a-z0-9 '&/]+ this matches the Ruby behaviour
    # exactly. Apostrophes are kept inside the word so "Mcdonald'S" -> "Mcdonald's"
    # rounds back via the abbreviation pass.
    def cap(m):
        w = m.group(0)
        return w[:1].upper() + w[1:].lower()
    return re.sub(r"[A-Za-z][A-Za-z0-9']*", cap, s)


def normalize_title(job_title_raw, template_title_raw=None):
    # 1. Selection Strategy: Prefer Template, Fallback to Raw
    raw_string = template_title_raw if (template_title_raw and template_title_raw.strip()) else job_title_raw
    if raw_string is None or not raw_string.strip():
        return ''

    title = raw_string.lower()

    # 2. Sanitize: Smart quotes to straight quotes
    title = re.sub(r"[‘’]", "'", title)

    # 3. Remove Quantifiers at the START (e.g., "4x", "X5 ")
    title = re.sub(r"^\s*\d+x\s*", '', title)
    title = re.sub(r"^\s*[xX]\s*\d+\s*", '', title)

    # 4. Remove Quantifiers (e.g., "x2", "X 5")
    title = re.sub(r"\s*[xX]\s*\d+\b", '', title)

    # 5. Remove Date/Time Ranges
    title = re.sub(
        r"\d{1,2}(?::\d{2})?\s*(?:am|pm)?\s*(?:to|-)\s*\d{1,2}(?::\d{2})?\s*(?:am|pm)?",
        '', title, flags=re.IGNORECASE,
    )

    # 6. Remove Dates
    title = re.sub(r"\d{1,2}[\/\-\.]\d{1,2}(?:[\/\-\.]\d{2,4})?", '', title)

    # 7. Remove Parentheses AND everything inside them (Ruby uses the unclosed form)
    title = re.sub(r"\s*\(.*$", '', title)

    # 8. Remove " - " and everything after it
    title = re.sub(r"\s+-\s*.*$", '', title)

    # 9. Remove location names after @
    title = re.sub(r"\s*@.*$", '', title)

    # 10. Remove Common Trailing Extraneous Words/Phrases
    for word in JOB_TITLE_MODIFIERS_TO_REMOVE:
        title = re.sub(rf"\s+{re.escape(word)}\s*$", '', title)

    # 11. Remove Common Leading Qualifier Phrases
    for phrase in JOB_TITLE_MODIFIERS_TO_REMOVE:
        title = re.sub(rf"^{re.escape(phrase)}\s+", '', title)

    # 12. Clean up special chars (ALLOW '&' HERE)
    title = re.sub(r"[^a-z0-9\s'&/]", ' ', title)

    # 13. Whitespace cleanup (squish)
    title = re.sub(r"\s+", ' ', title).strip()

    # 14. Capitalize first letter (titleize)
    title = _ruby_titleize(title)

    # 15. Apply Specific Capitalization Fixes (Post-titleize)
    for lower, correct in CAPITALIZED_ABBREVIATIONS.items():
        titled = _ruby_titleize(lower)
        title = re.sub(rf"\b{re.escape(titled)}\b", correct, title)

    return title


def slugify(title):
    # Approximates Rails' String#parameterize
    s = title.lower()
    s = re.sub(r"[^a-z0-9]+", '-', s)
    return s.strip('-')

Translation notes (where Ruby semantics needed thought)

titleize in ActiveSupport runs underscore → humanize → word-by-word capitalize. Because by step 12 the input is restricted to [a-z0-9 '&/], the only behaviour that matters is "capitalize every word, keep apostrophes intact". _ruby_titleize above does exactly that.
Step 7 in Ruby is title.gsub!(/\s*\(.*$/, ''). Note the unclosed ( form, not the commented-out (...) form. The Python port mirrors this: anything from the first ( to end-of-string is removed.
Smart-quote substitution in step 2 keeps the same Unicode characters (‘ U+2018 and ’ U+2019).
No ambiguity needed escalation to the team. The Ruby class is small enough and each step is self-contained.

Normaliser sanity check

A few representative samples from company 78:

Raw	Normalised	Slug
`RA Cashiering`	`RA Cashiering`	`ra-cashiering`
`Experienced FP Cashier x4`	`FP Cashier`	`fp-cashier`
`Experienced FP Cashier (Jun Hui)`	`FP Cashier`	`fp-cashier`
`Experienced FP Cashier x 3`	`FP Cashier`	`fp-cashier`
`Retail Assistant - Dry`	`Retail Assistant`	`retail-assistant`
`RA Dairy & Frozen`	`RA Dairy & Frozen`	`ra-dairy-frozen`

Note: Retail Assistant - Dry collapses to Retail Assistant because step 8 removes " - " and everything after it. The data still contains Retail Assistant Dry, Retail Assistant Frozen, etc. (no " - ") which normalise to their full names. Both flavours appear in the CSV — see the retail-assistant-* slugs in the per-role table below. This is an artefact of the legacy admins' inconsistent naming, not a normaliser bug.

Distinct role count

Stage	Distinct values
Raw `effective_title` (template or job)	151
After normalisation (per `(company, role_slug)`)	27

27 lands inside the expected 5–15 ± "messy real data" window. The team's mental model was that NTUC has roughly 5 base roles (Cashier, Retail Assistant by department, Bartender, Trolley, Night Refill). The 27 we get reflects two real sources of inflation that the normaliser correctly preserves:

Department suffixes on Retail Assistant — RA Dairy & Frozen, RA Dry, RA Deli, RA Pharmacy, RA Trolley, RA Fruits & Vegetables, RA Bartendering, RA Cashiering, RA Customer Service, RA Lotte, RA Food Halal, RA Meat & Seafood, RA Night Refill. These are genuinely different roles with different rate structures and the design supports keeping them distinct (Org::JobRole is per-company).
Coexistence of two naming conventions — RA Dry (RA-prefix) and Retail Assistant Dry (full-word). The team will likely merge these in admin Stage-2 review; the normaliser correctly preserves both because they are distinct strings.

Consolidation rules applied

Group key: (remote_company_id, remote_location_id, role_slug, day_type, starts_at, ends_at).
Rate selection: mode of the rates at the most-recent slot date in the group. Older rates are not seeded — they become history once the system is live.
source_slot_count = total group size (across all dates and historical rates).
Confidence: high if ≥ 10 slots, medium if 3–9, low if 1–2.
Company-wide default per (company, role_slug): one row, remote_location_id='', day_type='weekday', both bands NULL, is_default=true, hourly_rate = mode of the outlet-specific rates for that role, source_slot_count='inferred', confidence medium.

`day_type` derivation

Per slot, parse slot_start_date → Python date. weekday >= 5 (Saturday or Sunday) → weekend; otherwise weekday.
Public holiday heuristic (documented in the row's notes): for each (outlet, role_slug, starts_at, ends_at) we compute the modal weekday rate (the baseline). Any weekday slot whose rate is >= 1.5 × baseline is reclassified to public_holiday. The threshold was chosen because Singapore PH multipliers are typically 2.0× and the NTUC PH rate in the data is $24.20 (exactly 2.0× the modal $12.10 weekday rate, well above the 1.5× floor).
Weekend slots are NOT reclassified by the heuristic. A weekend can also be a PH (e.g. CNY landing on a Saturday) but our signal is rate-vs-baseline, and weekends don't have a separate baseline. The risk is a small under-count of weekend-PH rows; we accept this in the draft and surface it as an oddity below.

The heuristic is a stopgap until Geo::PublicHoliday lands. Every row created via the heuristic includes public_holiday inferred via heuristic ... in its notes column so admin Stage-2 review can sort on it.

Data oddities encountered

Two naming conventions for Retail Assistant (see Distinct role count, point 2). Resolved by keeping both as distinct roles; flag for admin merge.
Retail Assistant - Dry collapses to bare Retail Assistant because of normaliser step 8. There are 2 such slots and they show up under the retail-assistant slug if any, otherwise they merge into the RA * rows. Looking at the data this didn't actually create a retail-assistant slug in the seed (the " - " rows were all Retail Assistant - <Dept> patterns that exist alongside hyphenless Retail Assistant <Dept> rows). Admin should review.
Singleton bands. 338 of the 682 seed rows (50 %) are low confidence (1–2 slots backing them). NTUC's per-slot starts_at/ends_at are admin-typed and rarely repeat exactly (07:45–16:05, 07:50–16:00, 08:00–17:00, …). The seed faithfully reflects this but the operational expectation is that admin will consolidate near-duplicate bands during onboarding — most NTUC bands cluster around two patterns: ~08:00–~17:00 (day) and ~14:00–~22:00 (afternoon-to-close).
Outlet skew. Outlet 731 (Parkway Parade) contributes 407 of the 655 outlet-specific seed rows — 62 %. The other 7 outlets share 248 rows. This matches the underlying slot distribution: 731 is by far the most active outlet in the window.
Outlet 681 (Jurong Point) is configured but has zero paid completed slots in the window and thus has no rows in the seed. Active-outlet count from locations (9) ≠ outlets actually used (8). Brief was correct.
PH heuristic is conservative. Only 32 rows tagged public_holiday. The real PH count is likely higher (PH multipliers slightly below 1.5× would be missed, weekend-PH overlap is not detected). Acceptable for a draft; admin will refine once Geo::PublicHoliday lands.

Reproducing

# 1. Extract
mysql -uroot jodgig_2026_clean -B -e "<extraction SQL above>" \
  > /tmp/pay_rate_seed/c78_raw.tsv

# 2. Build the seed CSV
python3 /tmp/pay_rate_seed/build_seed.py

The script writes to /docs/30-49-domains/35-gig/statistics/2026-05-18-seed-ntuc-fp-hyper.csv directly.

Purpose​

Pipeline steps​

Extraction SQL​

Python normaliser port​

Translation notes (where Ruby semantics needed thought)​

Normaliser sanity check​

Distinct role count​

Consolidation rules applied​

day_type derivation​

Data oddities encountered​

Reproducing​