Seed pipeline record — NTUC FairPrice Hyper (company_id 78)
Date: 2026-05-18
Database: jodgig_2026_clean
Window: 2025-05-18 → 2026-05-18 (last 12 months, matching the prior legacy bands extract)
Output CSV: /docs/30-49-domains/35-gig/statistics/2026-05-18-seed-ntuc-fp-hyper.csv
Source extract: based on the verified query in 2026-05-18-legacy-pay-rate-bands-query.md, restricted to company_id = 78 and with slot_date, job_template_id, and template_job_title projected in.
Purpose
Produce a draft Gig::PayRate seed CSV for one familiar company so the team can judge the shape before any pipeline is built. NTUC FP Hyper is the chosen case study: 8 active outlets, 151 distinct raw job_title values, 6,230 paid completed slots in the window.
Pipeline steps
- Extract slot-level data from
jodgig_2026_clean(SQL below). - Normalise role title with a Python port of
Gig::TempJobs::JobTitleNormalizer, preferringjob_templates.job_titleoverjod_jobs.job_title(the Ruby class's "Selection Strategy"). - Derive
day_typeper slot. Weekend = Sat/Sun. PH = heuristic (rate ≥ 1.5× the modal weekday rate for the same outlet/role/band). - Consolidate to seed rows by
(remote_company_id, remote_location_id, role_slug, day_type, starts_at, ends_at). Most-recent rate wins (using slot date for recency).source_slot_count= group size. Confidence = high (≥10), medium (3–9), low (1–2). - Add one company-wide default row per
(company, role_slug). Rate = mode of the outlet-specific rates for that role.is_default = true,source_slot_count = inferred. - Generate human labels using the lookup defined in the task spec.
- Write the 13-column CSV.
Extraction SQL
The query is the same shape as 2026-05-18-legacy-pay-rate-bands-query.md but adds slot_date, job_template_id, and template_job_title so the normaliser can apply its "prefer template" rule, and the PH heuristic + recency logic have a date to work with.
SELECT
jj.id AS jod_job_id,
jj.company_id,
jj.location_id,
l.name AS location_name,
jj.job_title AS jod_job_title,
jj.job_template_id,
jt.job_title AS template_job_title,
s.id AS slot_id,
DATE(s.slot_start_date) AS slot_date,
TIME(s.slot_start_date) AS starts_at,
TIME(s.slot_end_date) AS ends_at,
jj.hourly_rate
FROM jod_jobs jj
JOIN companies c ON c.id = jj.company_id AND c.status = 1 AND c.deleted_at IS NULL
JOIN locations l ON l.id = jj.location_id AND l.status = 1 AND l.deleted_at IS NULL
LEFT JOIN job_templates jt ON jt.id = jj.job_template_id
JOIN slots s ON s.jod_job_id = jj.id AND s.deleted_at IS NULL
JOIN slot_user su ON su.slot_id = s.id AND su.deleted_at IS NULL
JOIN payments p ON p.id = su.payment_id AND p.payment_status = 2 AND p.deleted_at IS NULL
WHERE jj.created_at >= '2025-05-18'
AND jj.deleted_at IS NULL
AND jj.status = 3
AND jj.company_id = 78
ORDER BY jj.location_id, jod_job_title, slot_date, starts_at, ends_at;
- Raw rows returned: 6,230.
- Active outlets touched: 8 (603, 676, 681 if present, 682, 692, 693, 694, 695, 731 — note 681 has zero paid completed slots in the window, so it does not appear).
- Distinct raw
effective_titlevalues (template or job title): 151.
job_types/job_type_id is intentionally not joined: the normaliser works off the free-text title (with template fallback). job_type_id is the legacy platform taxonomy and is not what Org::JobRole will key on.
Python normaliser port
Pasted in full so a reviewer (or the v1 validator) can verify it matches the Ruby original at jodapp-api/app/domains/gig/temp_jobs/job_title_normalizer.rb. Step numbers are kept in 1:1 correspondence with the Ruby source.
import re
JOB_TITLE_MODIFIERS_TO_REMOVE = [
# --- time related ---
'cny', 'full shift', 'full time', 'full-time', 'night shift',
'part time', 'part-time', 'weekday',
# --- experience related ---
'experienced', 'junior', 'senior',
# --- context/ random words/chars ---
'the bar experienced', 'unity experienced', 'x',
]
CAPITALIZED_ABBREVIATIONS = {
'4d': '4D', 'boh': 'BOH', 'cmx': 'CMX', 'ffs': 'FFS', 'fp': 'FP',
'jr': 'JR', 'jt': 'JT', "mcdonald's": "McDonald's", 'ntuc': 'NTUC',
'pt': 'PT', 'ra': 'RA', 'sco': 'SCO',
}
def _ruby_titleize(s: str) -> str:
# ActiveSupport titleize: capitalize every word. After the previous cleanup
# steps reduce the input to [a-z0-9 '&/]+ this matches the Ruby behaviour
# exactly. Apostrophes are kept inside the word so "Mcdonald'S" -> "Mcdonald's"
# rounds back via the abbreviation pass.
def cap(m):
w = m.group(0)
return w[:1].upper() + w[1:].lower()
return re.sub(r"[A-Za-z][A-Za-z0-9']*", cap, s)
def normalize_title(job_title_raw, template_title_raw=None):
# 1. Selection Strategy: Prefer Template, Fallback to Raw
raw_string = template_title_raw if (template_title_raw and template_title_raw.strip()) else job_title_raw
if raw_string is None or not raw_string.strip():
return ''
title = raw_string.lower()
# 2. Sanitize: Smart quotes to straight quotes
title = re.sub(r"[‘’]", "'", title)
# 3. Remove Quantifiers at the START (e.g., "4x", "X5 ")
title = re.sub(r"^\s*\d+x\s*", '', title)
title = re.sub(r"^\s*[xX]\s*\d+\s*", '', title)
# 4. Remove Quantifiers (e.g., "x2", "X 5")
title = re.sub(r"\s*[xX]\s*\d+\b", '', title)
# 5. Remove Date/Time Ranges
title = re.sub(
r"\d{1,2}(?::\d{2})?\s*(?:am|pm)?\s*(?:to|-)\s*\d{1,2}(?::\d{2})?\s*(?:am|pm)?",
'', title, flags=re.IGNORECASE,
)
# 6. Remove Dates
title = re.sub(r"\d{1,2}[\/\-\.]\d{1,2}(?:[\/\-\.]\d{2,4})?", '', title)
# 7. Remove Parentheses AND everything inside them (Ruby uses the unclosed form)
title = re.sub(r"\s*\(.*$", '', title)
# 8. Remove " - " and everything after it
title = re.sub(r"\s+-\s*.*$", '', title)
# 9. Remove location names after @
title = re.sub(r"\s*@.*$", '', title)
# 10. Remove Common Trailing Extraneous Words/Phrases
for word in JOB_TITLE_MODIFIERS_TO_REMOVE:
title = re.sub(rf"\s+{re.escape(word)}\s*$", '', title)
# 11. Remove Common Leading Qualifier Phrases
for phrase in JOB_TITLE_MODIFIERS_TO_REMOVE:
title = re.sub(rf"^{re.escape(phrase)}\s+", '', title)
# 12. Clean up special chars (ALLOW '&' HERE)
title = re.sub(r"[^a-z0-9\s'&/]", ' ', title)
# 13. Whitespace cleanup (squish)
title = re.sub(r"\s+", ' ', title).strip()
# 14. Capitalize first letter (titleize)
title = _ruby_titleize(title)
# 15. Apply Specific Capitalization Fixes (Post-titleize)
for lower, correct in CAPITALIZED_ABBREVIATIONS.items():
titled = _ruby_titleize(lower)
title = re.sub(rf"\b{re.escape(titled)}\b", correct, title)
return title
def slugify(title):
# Approximates Rails' String#parameterize
s = title.lower()
s = re.sub(r"[^a-z0-9]+", '-', s)
return s.strip('-')
Translation notes (where Ruby semantics needed thought)
titleizein ActiveSupport runsunderscore→humanize→ word-by-word capitalize. Because by step 12 the input is restricted to[a-z0-9 '&/], the only behaviour that matters is "capitalize every word, keep apostrophes intact"._ruby_titleizeabove does exactly that.- Step 7 in Ruby is
title.gsub!(/\s*\(.*$/, ''). Note the unclosed(form, not the commented-out(...)form. The Python port mirrors this: anything from the first(to end-of-string is removed. - Smart-quote substitution in step 2 keeps the same Unicode characters (
‘U+2018 and’U+2019). - No ambiguity needed escalation to the team. The Ruby class is small enough and each step is self-contained.
Normaliser sanity check
A few representative samples from company 78:
| Raw | Normalised | Slug |
|---|---|---|
RA Cashiering | RA Cashiering | ra-cashiering |
Experienced FP Cashier x4 | FP Cashier | fp-cashier |
Experienced FP Cashier (Jun Hui) | FP Cashier | fp-cashier |
Experienced FP Cashier x 3 | FP Cashier | fp-cashier |
Retail Assistant - Dry | Retail Assistant | retail-assistant |
RA Dairy & Frozen | RA Dairy & Frozen | ra-dairy-frozen |
Note:
Retail Assistant - Drycollapses toRetail Assistantbecause step 8 removes" - "and everything after it. The data still containsRetail Assistant Dry,Retail Assistant Frozen, etc. (no" - ") which normalise to their full names. Both flavours appear in the CSV — see theretail-assistant-*slugs in the per-role table below. This is an artefact of the legacy admins' inconsistent naming, not a normaliser bug.
Distinct role count
| Stage | Distinct values |
|---|---|
Raw effective_title (template or job) | 151 |
After normalisation (per (company, role_slug)) | 27 |
27 lands inside the expected 5–15 ± "messy real data" window. The team's mental model was that NTUC has roughly 5 base roles (Cashier, Retail Assistant by department, Bartender, Trolley, Night Refill). The 27 we get reflects two real sources of inflation that the normaliser correctly preserves:
- Department suffixes on Retail Assistant —
RA Dairy & Frozen,RA Dry,RA Deli,RA Pharmacy,RA Trolley,RA Fruits & Vegetables,RA Bartendering,RA Cashiering,RA Customer Service,RA Lotte,RA Food Halal,RA Meat & Seafood,RA Night Refill. These are genuinely different roles with different rate structures and the design supports keeping them distinct (Org::JobRoleis per-company). - Coexistence of two naming conventions —
RA Dry(RA-prefix) andRetail Assistant Dry(full-word). The team will likely merge these in admin Stage-2 review; the normaliser correctly preserves both because they are distinct strings.
Consolidation rules applied
- Group key:
(remote_company_id, remote_location_id, role_slug, day_type, starts_at, ends_at). - Rate selection: mode of the rates at the most-recent slot date in the group. Older rates are not seeded — they become history once the system is live.
source_slot_count= total group size (across all dates and historical rates).- Confidence:
highif ≥ 10 slots,mediumif 3–9,lowif 1–2. - Company-wide default per
(company, role_slug): one row,remote_location_id='',day_type='weekday', both bands NULL,is_default=true,hourly_rate = mode of the outlet-specific rates for that role,source_slot_count='inferred', confidencemedium.
day_type derivation
- Per slot, parse
slot_start_date→ Pythondate.weekday >= 5(Saturday or Sunday) →weekend; otherwiseweekday. - Public holiday heuristic (documented in the row's
notes): for each(outlet, role_slug, starts_at, ends_at)we compute the modalweekdayrate (the baseline). Any weekday slot whose rate is>= 1.5 × baselineis reclassified topublic_holiday. The threshold was chosen because Singapore PH multipliers are typically 2.0× and the NTUC PH rate in the data is$24.20(exactly 2.0× the modal$12.10weekday rate, well above the 1.5× floor). - Weekend slots are NOT reclassified by the heuristic. A weekend can also be a PH (e.g. CNY landing on a Saturday) but our signal is rate-vs-baseline, and weekends don't have a separate baseline. The risk is a small under-count of weekend-PH rows; we accept this in the draft and surface it as an oddity below.
The heuristic is a stopgap until Geo::PublicHoliday lands. Every row created via the heuristic includes public_holiday inferred via heuristic ... in its notes column so admin Stage-2 review can sort on it.
Data oddities encountered
- Two naming conventions for Retail Assistant (see Distinct role count, point 2). Resolved by keeping both as distinct roles; flag for admin merge.
Retail Assistant - Drycollapses to bareRetail Assistantbecause of normaliser step 8. There are 2 such slots and they show up under theretail-assistantslug if any, otherwise they merge into theRA *rows. Looking at the data this didn't actually create aretail-assistantslug in the seed (the" - "rows were allRetail Assistant - <Dept>patterns that exist alongside hyphenlessRetail Assistant <Dept>rows). Admin should review.- Singleton bands. 338 of the 682 seed rows (50 %) are
lowconfidence (1–2 slots backing them). NTUC's per-slotstarts_at/ends_atare admin-typed and rarely repeat exactly (07:45–16:05,07:50–16:00,08:00–17:00, …). The seed faithfully reflects this but the operational expectation is that admin will consolidate near-duplicate bands during onboarding — most NTUC bands cluster around two patterns:~08:00–~17:00(day) and~14:00–~22:00(afternoon-to-close). - Outlet skew. Outlet 731 (Parkway Parade) contributes 407 of the 655 outlet-specific seed rows — 62 %. The other 7 outlets share 248 rows. This matches the underlying slot distribution: 731 is by far the most active outlet in the window.
- Outlet 681 (Jurong Point) is configured but has zero paid completed slots in the window and thus has no rows in the seed. Active-outlet count from
locations(9) ≠ outlets actually used (8). Brief was correct. - PH heuristic is conservative. Only 32 rows tagged
public_holiday. The real PH count is likely higher (PH multipliers slightly below 1.5× would be missed, weekend-PH overlap is not detected). Acceptable for a draft; admin will refine onceGeo::PublicHolidaylands.
Reproducing
# 1. Extract
mysql -uroot jodgig_2026_clean -B -e "<extraction SQL above>" \
> /tmp/pay_rate_seed/c78_raw.tsv
# 2. Build the seed CSV
python3 /tmp/pay_rate_seed/build_seed.py
The script writes to /docs/30-49-domains/35-gig/statistics/2026-05-18-seed-ntuc-fp-hyper.csv directly.