As a Senior MLOps & AI Infrastructure Engineer at dLocal, you will be a key individual contributor in the team that builds and operates our ML and AI platform, with a strong focus on Feature Store and MLOps workflows.
You will implement and evolve the components that Data Science and AI teams use every day to take models and AI‑powered services from idea to production: feature pipelines, training and deployment workflows, observability and automation.
A core part of this role is using agents and AI services to automate as much of our MLOps work as possible, from feature store and platform operations to fraud/anomaly workflows and ML cost optimization, working side by side with the AI Team and the MLOps Technical Referent.
What will I be doing?
- Implement and maintain online and offline feature pipelines that feed our enterprise Feature Store, combining:
  - Flink‑based streaming jobs ingesting large volumes of events from multiple sources (payments, fraud, anomaly, etc.) into online stores.
  - Databricks / Spark pipelines for offline feature computation, backfills and training datasets.
- Ensure:
  - Point‑in‑time correctness for offline training and backtesting.
  - Low‑latency, high‑throughput online feature serving with clear SLAs, TTL semantics and multi‑tenant safety.
- Contribute to the feature catalog and specs:
  - Define entities, feature views, schemas, SLAs, PII classification and owners.
  - Help data scientists and domain teams onboard new features safely and consistently across Flink and Databricks.
- Develop tooling for:
  - Backfills and materialization coordination between Flink and Databricks (Lakehouse / Delta).
  - Offline–online parity checks, data quality, drift and freshness monitoring for critical feature groups.
  - Unified feature retrieval APIs (online/offline/batch) and SDK/CLI usage from models and services.
- Implement and improve training and evaluation pipelines:
  - Reproducible workflows, experiment tracking and model registry integration.
  - Promotion flows from dev → staging → production, following platform standards.
- Work on online and batch inference paths:
  - Model packaging and deployment.
  - Rollout strategies (canary, shadow, rollback) aligned with SRE/Infra.
- Instrument pipelines and services with metrics, logs and traces:
  - Integrate with our observability stack (e.g. OpenTelemetry, Coralogix).
  - Expose dashboards and alerts for ML components (latency, errors, drift, freshness).
- Integrate and extend agents and AI services (built by the AI Team and MLOps) to automate key parts of the Feature Store and MLOps workflows (health checks, drift and quality analysis, documentation/specs, incident triage, FinOps suggestions, etc.).
- Design these automations with clear guardrails: observable, auditable and easy to roll back, always keeping humans in control of production decisions.
- Implement changes that respect platform standards around:
  - Access control, secrets management and PII handling in features and models.
  - Environment separation and change management for ML/AI components.
- Participate in on‑call rotations or escalation paths for ML pipelines and feature infrastructure:
  - Diagnose and fix incidents.
  - Contribute improvements to playbooks, dashboards and tests.
Collaboration and technical contribution
- Work closely with:
  - MLOps Technical Referent to align on architecture and technical direction.
  - Data Science squads and the AI Team to understand requirements and unblock use cases.
  - Fraud, Anomaly and other product squads as consumers of features and models.
- Contribute to internal documentation, RFCs, examples and onboarding guides so other engineers and data scientists can adopt the platform more easily.
- Mentor mid‑level engineers on good practices in pipelines, testing, observability and automation.
What skills do I need?
- Solid experience as a Senior Engineer working on MLOps, data platforms, or large‑scale backend / distributed systems.
- Hands‑on experience with big data / streaming technologies (e.g. Spark, Flink, Kafka, Kinesis, or similar).
- Proven track record building production‑grade ML pipelines:
  - Experiment tracking and reproducible training flows.
  - CI/CD for models and data pipelines.
  - Online and batch inference at scale.
- Familiarity with cloud‑based ML platforms and containerized deployments (e.g. Databricks, SageMaker, Vertex AI, or equivalent).
- Strong understanding of observability:
  - Metrics, logs and traces.
  - Data and model drift, freshness and quality checks.
- Ability to write clean, maintainable code and collaborate through reviews, design docs and pairing sessions.
- Ability to communicate with Data Scientists, ML Engineers and Infra/SRE, and to translate requirements into concrete technical solutions.
- Experience working with or around Feature Stores (Feast, Databricks Feature Store, custom implementations, etc.).
- Exposure to LLMs, agents and AI assistants, especially applied to:
  - Developer productivity (code/infra copilots).
  - Log/metric/incident analysis or documentation generation.
- Experience in Fintech, risk, fraud or anomaly detection environments.
- Contributions to internal standards, RFCs, runbooks or technical talks.
