Spark/PySpark Developer - Bihar, India - ATech
Description
Job Profile: Spark (PySpark) Developer
Industry Type: IT Services
Job description:
- The developer must have sound knowledge of Apache Spark and Python programming.
- Deep experience developing data processing tasks in PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading it into target data destinations (a minimal PySpark sketch follows this list).
- Experience in deploying and operationalizing code is an added advantage.
- Knowledge of and skills in DevOps, version control, and containerization.
- Deployment knowledge is preferable.
- Create Spark jobs for data transformation and aggregation
- Produce unit tests for Spark transformations and helper methods (see the unit-test sketch after this list)
- Write Scaladoc-style documentation for all code
- Design data processing pipelines to perform batch and real-time/stream analytics on structured and unstructured data
- Spark query tuning and performance optimization
- Good understanding of different file formats (ORC, Parquet, Avro) and compression techniques to optimize queries and processing
- SQL database integration (Microsoft SQL Server, Oracle, PostgreSQL, and/or MySQL)
- Experience working with storage and database systems such as HDFS, S3, Cassandra, and/or DynamoDB
- Deep understanding of distributed systems (e.g. CAP theorem, partitioning, replication, consistency, and consensus)
- Experience building scalable, high-performance data lake solutions in the cloud
- Hands-on expertise in cloud services such as AWS and/or Microsoft Azure.
- As a Spark developer, you will manage the development of the scalable distributed architecture defined by the architect or tech lead on our team.
- Analyse and assemble large data sets designed to meet functional and non-functional requirements.
- You will develop ETL scripts for big data sources.
- Identify, design, and optimise data processing automation for reports and dashboards.
- You will be responsible for workflow, data, and ETL optimization as per the requirements defined by the team.
- Work with stakeholders such as product managers, technical leads, and service-layer engineers to ensure end-to-end requirements are addressed.
- Strong team player who adheres to the Software Development Life Cycle (SDLC) and produces the documentation needed to represent every stage of the SDLC.
- Hands-on working experience with any of the data engineering/analytics platforms (Hortonworks, Cloudera, MapR, AWS); AWS preferred
- Hands-on experience with data ingestion tools such as Apache NiFi, Apache Airflow, Sqoop, and Oozie
- Hands-on working experience with data processing at scale using event-driven systems and message queues (Kafka, Flink, Spark Streaming); see the streaming sketch after this list
- Hands-on working experience with AWS services such as EMR, Kinesis, S3, CloudFormation, Glue, API Gateway, and Lake Formation
- Hands-on working experience with AWS Athena
- Data warehouse exposure with Apache NiFi, Apache Airflow, and Kylo
- Operationalization of ML models on AWS (e.g. deployment, scheduling, model monitoring)
- Feature engineering and data processing to be used for model development
- Experience gathering and processing raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, etc.)
- Experience building data pipelines for structured/unstructured data, real-time/batch processing, and synchronous/asynchronous events using MQ, Kafka, and stream processing
- Hands-on working experience in analysing source system data and data flows, working with structured and unstructured data
- Must be very strong in writing SQL queries
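
The bullets above describe a typical PySpark ETL flow: read from an external source, enrich by joining reference data, aggregate, and load the result into a target destination in a columnar, compressed format. Below is a minimal sketch of such a job; every path, dataset, and column name is hypothetical and used purely for illustration.

```python
# Minimal illustrative ETL sketch; all paths, datasets, and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Read raw data from an external source (a CSV landing area, used only as an example).
orders = spark.read.option("header", True).csv("s3://example-bucket/raw/orders/")

# Enrich by joining against a reference dataset.
customers = spark.read.parquet("s3://example-bucket/reference/customers/")
enriched = orders.join(customers, on="customer_id", how="left")

# Transform and aggregate.
daily_totals = (
    enriched
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date", "customer_region")
    .agg(
        F.sum("order_amount").alias("total_amount"),
        F.count("*").alias("order_count"),
    )
)

# Load into the target destination as compressed Parquet, partitioned for downstream queries.
(
    daily_totals.write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")
    .parquet("s3://example-bucket/curated/daily_totals/")
)
```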
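
For the unit-testing bullet, a common pattern is to run a local SparkSession inside pytest and assert on the output of a transformation helper. The sketch below assumes a hypothetical helper add_order_total; it is not part of any existing codebase.

```python
# Minimal illustrative unit-test sketch; the add_order_total helper is hypothetical.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_order_total(df):
    """Example transformation: order_total = quantity * unit_price."""
    return df.withColumn("order_total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # A small local SparkSession is usually enough for transformation tests.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_add_order_total(spark):
    input_df = spark.createDataFrame([(2, 10.0), (3, 5.0)], ["quantity", "unit_price"])
    result = add_order_total(input_df).collect()
    assert [row.order_total for row in result] == [20.0, 15.0]
```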
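
For the event-driven/streaming bullet, the sketch below reads a Kafka topic with Spark Structured Streaming, parses the JSON payload, and writes to a checkpointed sink. The broker address, topic, schema, and output paths are hypothetical, and the job assumes the spark-sql-kafka connector is available on the cluster.

```python
# Minimal illustrative Structured Streaming sketch; broker, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("example-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read an event stream from Kafka and parse the JSON payload in the message value.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Write the parsed stream to a Parquet sink with checkpointing for fault tolerance.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/streaming/orders/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
    .start()
)
query.awaitTermination()
```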