Site Reliability Engineer

Ho Chi Minh City, Ho Chi Minh, Vietnam | Engineering | Full-time

Trusting Social is an AI Fintech pioneer that's revolutionizing credit access in emerging markets. Our mission is "Advancing AI to Meet the Financial Needs of Everyday Consumers with Empathy." We've assessed over 1 billion consumers across four countries, and we aim to provide 100 million credit lines using the power of AI and Big Data.

How You'll Make an Impact

Keep Sophia Voice Bot — our customer-facing AI voice product — reliable, observable, and recoverable under real production load. You own the SLOs, the on-call response, and the release-safety guardrails for Sophia and its supporting AI pipeline (ASR, LLM, TTS, RAG, telephony). Reliability is the deliverable; AI is the workload you make reliable.

What You'll Do

  • SLOs and SLIs for Sophia Voice Bot: availability, p95/p99 latency, time-to-first-audio, ASR/TTS quality proxies, retrieval quality signals, call completion rate, error budget tracking.
  • Incident response: on-call rotation, paging hygiene, runbook authoring, postmortem facilitation, follow-up action tracking, reliability reviews.
  • Observability for Sophia: metrics, logs, traces across ASR / LLM / TTS / RAG / telephony; prompt-response logging, token and cost dashboards, model-version-aware views, end-to-end call trace visibility.
  • Release safety for Sophia: canary releases, progressive rollouts, regression detection between model and prompt versions, automated rollback, load testing voice endpoints before launch.
  • Production hygiene: alerting tuned to user impact (not noise), capacity and quota monitoring for model providers, dependency health checks, SLO-aligned dashboards.
  • AI-assisted SRE tooling: building or improving internal AI copilots for alert triage, log analysis, postmortem drafting, and runbook generation.

What We're Looking For

  • 3–5 years running production systems (SRE or strong DevOps). You can debug a flaky service end to end on your own.
  • SRE practice: you've carried a pager and run incidents, written postmortems, and defined and monitored SLOs/SLIs with error-budget conversations.
  • Observability (Prometheus, Grafana, OpenTelemetry, Datadog, or similar): queries, dashboards, alerts tuned for signal.
  • Extensive experience with Kubernetes: networking, deployment rollouts, and graceful shutdown.
  • Platform fundamentals: CI/CD (GitHub Actions, ArgoCD, or similar) with rollback and progressive delivery; infrastructure-as-code (Terraform preferred); a major cloud (AWS, GCP, or Azure) with reliability and cost awareness.
  • You write tooling in Python, Go, or Bash, not just config.
  • You use AI/LLM tools (Claude, ChatGPT, Cursor, Copilot) daily and can point to specific time savings.

Nice to Have

  • Familiarity with inference servers (Triton, vLLM, etc.) or LLM gateways.
  • Experience building internal tooling with LLMs.
  • DevSecOps awareness: container scanning, secret management, dependency hygiene, SBOMs.
  • Vietnamese–English bilingual working ability.

What You'll Learn Here

  • How to apply AI to the SRE workflow itself, a meta-skill that compounds across your career.
  • Direct mentorship from senior and staff SREs who own the platform layer, with a clear path to senior IC.

What We Offer

Join our vibrant team and enjoy:

  • Opportunity to work with and learn from one of the best and brightest technology teams in Vietnam
  • Be part of a winning team that is growing rapidly across the region and recruiting world-class talent
  • Competitive compensation package, including 13th-month salary and performance bonuses
  • Comprehensive health care coverage for you and your dependents
  • Generous leave policies, including annual and sick leave, plus flexible work hours
  • Convenient central District 1 office location, next to a future metro station
  • Onsite lunch with multiple options, including vegetarian
  • Grab for work allowance and fully equipped workstations
  • Fun and engaging team-building activities, sponsored sports clubs, and happy hour every Thursday
  • Unlimited free coffee, tea, snacks, and fruit to keep you energized
  • An opportunity to make a social impact by helping to democratize credit access in emerging markets

At Trusting Social, we live by ownership, integrity, and agility in execution. We believe in doing what's right, what's best, and what's innovative. If you're smart, driven, and want to make a difference in the world with the most advanced and fascinating technology, come join our team. We offer the runway to truly make an impact.

Learn more about us:
https://trustingsocial.com
https://www.youtube.com/watch?v=inAEDGvOcL8&t=29s