Job Summary:
Zintro is looking for an experienced Data Engineer specializing in large language models (LLMs) to design, build, and maintain data pipelines while ensuring data quality. The role supports LLM development initiatives and requires expertise in both data engineering and machine learning systems, particularly models such as GPT and BERT.
Key Responsibilities:
Design and optimize scalable data pipelines for LLM training and inference.
Collaborate with cross-functional teams to integrate LLMs into production.
Ensure data integrity, preprocessing, and feature engineering for model training.
Develop efficient data architectures, monitor pipeline performance, and implement quality checks.
Support data and model versioning and stay current with advancements in LLMs and data engineering.
Requirements:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Proven experience in large-scale data engineering with expertise in LLM technologies (GPT, BERT, T5) and NLP data preprocessing.
Proficiency with data tools such as Apache Kafka, Spark, Hadoop, and Airflow.
Strong programming skills in Python, Java, or Scala, with experience in Pandas, NumPy, and TensorFlow/PyTorch.
Knowledge of relational and NoSQL databases, cloud platforms (AWS, GCP, Azure), and data versioning tools (MLflow, DVC).
Understanding of data privacy, security, debugging, and performance optimization.
Excellent communication skills, strong teamwork abilities, and keen attention to detail.