Job Summary
We are seeking a hands-on AI Data Engineer to
build the high-performance data infrastructure required to power autonomous
AI agents. You won't just be moving data from A to B; you will be
architecting Dynamic Context Windows, managing Real-time Semantic Indexes,
and building Self-Cleaning Data Pipelines that feed our "Super
Employee" agents.
Job Responsibilities
· Vector & Graph ETL: Design and maintain
pipelines that transform unstructured data (PDFs, emails, logs, chats) into
optimized embeddings for Vector Databases (Pinecone, Weaviate, Milvus); see
the first sketch after this list.
· Semantic Data Modeling: Engineer data
structures that optimize for Retrieval-Augmented Generation (RAG), ensuring
agents find the "needle in the haystack" in milliseconds.
· Knowledge Graph Construction: Build and scale
Knowledge Graphs (Neo4j) to represent complex relationships in our trading
and support data that standard vector search misses.
· Automated Data Labeling & Synthetic Data:
Implement pipelines using LLMs to auto-label datasets or generate synthetic
edge cases for agent training and evaluation.
· Stream Processing for Agents: Build real-time
data "listeners" (Kafka/Flink) that feed live context to agents,
allowing them to react to market or support events as they happen.
· Data Reliability & "Drift" Detection: Build monitoring for "Embedding
Drift", identifying when the statistical distribution of your data changes
and the agent's "knowledge" becomes stale (see the drift-check sketch after
this list).
Essential Skills
· Vector Database Mastery: Expert-level
configuration of HNSW indexes, scalar quantization, and metadata filtering
strategies within Pinecone, Milvus, or Qdrant (see the illustrative settings
after this list).
· Advanced Python & Rust: Proficiency in
Python for AI logic and Rust (or C++) for high-performance data processing
and custom embedding functions.
· Big Data Ecosystem: Hands-on experience with
Apache Spark, Flink, and Kafka in a high-throughput environment
(Trading/FinTech preferred).
· LLM Data Tooling: Deep experience with
Unstructured.io, LlamaIndex, or LangChain for document parsing and chunking
strategy optimization.
· MLOps & DataOps: Mastery of DVC (Data
Version Control) and Airflow/Prefect for managing complex, non-linear AI data
workflows.
· Embedding Models: Understanding of how to
fine-tune embedding models (e.g., BGE, Cohere, or OpenAI) to better represent
domain-specific (Trading) terminology.
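For a sense of the tuning surface behind the first skill above, here is an
illustrative set of HNSW parameters. The names follow common HNSW
conventions; exact keys and defaults differ between Pinecone, Milvus, and
Qdrant, so treat this as a sketch rather than any engine's real configuration.

# Illustrative HNSW settings -- check your engine's docs for exact keys.
hnsw_config = {
    "m": 32,              # graph degree: higher -> better recall, more memory
    "ef_construct": 256,  # build-time beam width: better graph, slower ingest
    "ef_search": 128,     # query-time beam width: the main recall/latency dial
}

# A metadata filter applied alongside the vector query, so the index only
# scores the relevant chunks (field names here are hypothetical):
metadata_filter = {"source": "support-tickets", "year": 2024}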
Additional Qualifications
· Chunking Strategy Architect: You don't just
"split text." You implement Semantic Chunking and Parent-Child retrieval
strategies to maximize LLM context relevance (see the sketch after this list).
· Cold/Warm/Hot Storage Strategy: Managing cost
and latency by tiering data between Vector DBs (Hot), SQL/NoSQL (Warm), and
S3/Data Lakes (Cold).
· Privacy & Redaction Pipelines: Building
automated PII (Personally Identifiable Information) redaction into the
ingestion layer to ensure agents never "see" or "leak" sensitive user data
(see the redaction sketch after this list).
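As a sketch of the Parent-Child pattern named above: small child chunks are
what get embedded and searched, and each carries a pointer back to a larger
parent chunk that is handed to the LLM at query time. The sizes and ID scheme
here are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # set on child chunks only

def parent_child_chunks(doc_id: str, text: str,
                        parent_size: int = 2000, child_size: int = 400):
    # Large "parent" chunks preserve surrounding context; small "child"
    # chunks are embedded and searched. A matching child resolves to its
    # parent, which is what the LLM actually reads.
    parents, children = [], []
    for p in range(0, len(text), parent_size):
        parent = Chunk(f"{doc_id}:p{p}", text[p:p + parent_size])
        parents.append(parent)
        for c in range(0, len(parent.text), child_size):
            children.append(Chunk(f"{parent.chunk_id}:c{c}",
                                  parent.text[c:c + child_size],
                                  parent_id=parent.chunk_id))
    return parents, children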
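And a minimal redaction pass of the kind the last qualification describes,
run before any text reaches the embedding or indexing stage. The two patterns
(e-mail addresses and long digit runs) are illustrative; a production
pipeline would combine many more patterns with an NER model.

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "NUMBER": re.compile(r"\b\d{9,16}\b"),  # account/card-number-like runs
}

def redact(text: str) -> str:
    # Replace each match with its label so agents never see the raw value.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com, card 4111111111111111."))
# -> Contact [EMAIL], card [NUMBER].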
Background check required (no criminal record).
Others
Work mode: Hybrid (3 days per week from the office)
Office location: Rai Durg, Hyderabad
Interview process: 3-4 rounds of interviews