Key Responsibilities:
- Design, develop, and maintain scalable ETL pipelines using PySpark on Hadoop or Cloudera Data Platform (CDP).
- Implement data ingestion from various sources (databases, APIs, file systems) into data lakes or warehouses.
Data Transformation & Processing
- Cleanse, transform, and enrich large datasets using PySpark.
- Optimize Spark jobs for performance and resource efficiency.
Data Quality & Monitoring
- Implement validation checks and monitoring to ensure data accuracy and reliability.
- Troubleshoot and resolve data pipeline issues in real time.
Automation & Orchestration
- Automate workflows using tools like Apache Oozie, Airflow, or similar.
- Schedule and monitor batch jobs and data refresh cycles.
- Work closely with data scientists, analysts, and product teams to understand data needs.
- Document data flows, pipeline configurations, and engineering processes.
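To illustrate the ingestion, transformation, and data-quality responsibilities above, here is a minimal PySpark sketch of an ingest, cleanse, validate, and write job. The source table raw.customer_events, the output path, and the column names are illustrative assumptions, not details from this posting.

# Minimal sketch, assuming a Hive-backed table raw.customer_events on HDFS;
# the table, path, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("customer_events_etl")   # hypothetical job name
    .enableHiveSupport()
    .getOrCreate()
)

# Ingest: read raw records from a Hive table.
raw = spark.table("raw.customer_events")

# Cleanse and enrich: deduplicate, standardise timestamps, derive a date column.
curated = (
    raw.dropDuplicates(["event_id"])
       .filter(F.col("event_id").isNotNull())
       .withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("event_date", F.to_date("event_ts"))
)

# Basic data-quality check: fail fast if a required field is missing.
bad_rows = curated.filter(F.col("customer_id").isNull()).count()
if bad_rows > 0:
    raise ValueError(f"{bad_rows} rows are missing customer_id")

# Write the curated dataset, partitioned by date, as Parquet on HDFS.
(curated.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("/data/curated/customer_events"))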
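For the automation and orchestration responsibility, a job like the sketch above could be scheduled with an Apache Airflow DAG along the following lines. This assumes a recent Airflow 2.x installation with the apache-airflow-providers-apache-spark package; the DAG id, schedule, connection, and application path are hypothetical.

# Minimal sketch, assuming Airflow 2.4+ and the Spark provider package;
# all identifiers below are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="customer_events_etl",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",                    # nightly batch refresh at 02:00
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_pyspark_etl",
        application="/opt/jobs/customer_events_etl.py",  # hypothetical script path
        conn_id="spark_default",
        executor_memory="4g",
        num_executors=4,
    )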
Required Skills & Qualifications:
- Bachelor's or Master's degree in Computer Science, Information Systems, or a related field.
- 4+ years of experience in data engineering roles.
- Strong hands-on experience with PySpark, Hadoop, Hive, HDFS, and Cloudera.
- Proficiency in SQL, Python, and data modeling.
- Experience with data warehousing, ETL frameworks, and big data ecosystems.
- Familiarity with data governance, security, and compliance standards.
Preferred Skills:
- Experience with cloud platforms (AWS, Azure, or GCP).
- Knowledge of Kafka, HBase, Apache NiFi, or Delta Lake.
- Exposure to machine learning pipelines or GenAI integration is a plus.
- Strong problem-solving and communication skills.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology