Sai Krishna M.

Hadoop/PySpark Developer with AWS ETL

Schaumburg, United States

Experience: 10 Years

140000 USD / Year

  • Immediate: Available

About Me

  • Cloud/Big Data Engineer with around 4 years of cloud experience and over 10 years of overall software development experience, spanning Big Data/Hadoop development, Azure, AWS, and ETL-related technologies.
  • Hands-on experience with major components of the Hadoop ecosystem, including HDFS, YARN, Hive, HBase, Pig, Sqoop, Flume, Apache Spark, Apache Kafka, Cloudera Data Science Workbench, StreamSets, Apache NiFi, AWS Glue, Amazon Redshift, and Amazon Kinesis.
  • Used sbt to build Scala-based Spark projects and executed them with spark-submit.
  • Implemented core Spark operators such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the PySpark sketch after this list).
  • Experienced with the Spark ecosystem, using Spark SQL and Scala to query data in formats such as text and CSV files.
  • Experience implementing Spark RDDs in Python (PySpark) and Scala.
  • Exposure to machine learning algorithms such as linear regression, logistic regression, Naive Bayes, SVM, decision trees, random forests, boosting, and k-means; work closely with the data science (DS) team.
  • Extensive experience creating, configuring, and performance-tuning EMR clusters on AWS.
  • Developed star-schema data models (cubes) using the TimeXtender ETL tool on Azure and published them to Power BI reports.
  • Set up standards and processes for Hadoop-based application design and development.
  • AWS Certified Solutions Architect - Associate.
  • Used Azure Data Factory (ADF) to develop ETL/ELT pipelines.
  • Developed REST APIs for the front-end application using Java, Spring, and Hibernate.
  • Experience performing data enrichment, cleansing, analytics, and aggregation using Hive and Spark SQL.
  • Set up CDSW in the Hadoop environment and assisted in developing models in R, Python, and Scala.
  • Experience designing, developing, and implementing connectivity products that allow efficient data exchange between the core database engine and the Hadoop ecosystem.
  • Experience with disaster recovery and backup activities, multi-node setup, performance tuning and benchmarking, security integration, and monitoring, maintenance, and troubleshooting of Hadoop clusters.
  • Knowledgeable in implementing data security using Kerberos authentication, user privileges, process IDs, etc.
  • Experience in batch data ingestion: importing and exporting data between HDFS and relational database systems, and loading flat files (e.g., .csv, .txt), using Sqoop, StreamSets, and NiFi pipelines.
  • Developed near-real-time data streaming pipelines using Kafka, Flume, and Spark Streaming.
  • Experience in data analytics using HiveQL, Apache Spark (with R, Python, and Scala), and Spark SQL.
  • Knowledge of Kubernetes for deploying applications and managing workloads and services with declarative configuration and automation.
  • Served on the production support team and resolved tickets.
  • Worked with the NoSQL database HBase to store large volumes of web logs.
  • Knowledge of data warehousing (DW) and SQL databases.
  • Deployed instances on AWS: provisioned EC2 and S3 buckets, configured security groups, and set up the Cloudera Hadoop ecosystem.
  • Knowledge of the Ab Initio ETL tool; migrated its ETL logic to Spark with Scala.
  • Created AWS Lambda functions and assigned roles to run Python scripts and perform event-driven processing.
  • Experience with data mining and business intelligence tools such as Tableau.
  • Knowledge of CI/CD pipelines using tools such as Git and Jenkins.
  • Knowledge of memory management in Spark and Hive performance tuning.
  • Exposure to Splunk for log analysis.
  • Experience in application development using Core Java, RDBMS, and Linux shell scripting.
  • Major strengths: familiarity with multiple software systems; the ability to learn new technologies quickly and adapt to new environments; self-motivated, focused, adaptive team player and quick learner with excellent interpersonal, technical, and communication skills.
  • Good communication skills and work ethic, with the ability to work efficiently in a team and good leadership skills.
  • Knowledge of the Agile methodology; Platinum and Silver Agile certified (at State Street Corporation) with cross-team functionality.
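A minimal PySpark sketch of the RDD operators mentioned above (map, flatMap, filter, reduceByKey, aggregateByKey); the input path and record layout are illustrative assumptions, not taken from any actual project.

```python
# Minimal PySpark sketch of common RDD operators (illustrative path and fields assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operators-sketch").getOrCreate()
sc = spark.sparkContext

# Assume a plain-text file where each line is "user_id,amount".
lines = sc.textFile("hdfs:///data/sample/transactions.txt")

pairs = (
    lines
    .map(lambda line: line.split(","))        # map: split each record into fields
    .filter(lambda f: len(f) == 2)            # filter: drop malformed rows
    .map(lambda f: (f[0], float(f[1])))       # key each record by user_id
)

# reduceByKey: total amount per user
totals = pairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey: (sum, count) per user, then average
sum_count = pairs.aggregateByKey(
    (0.0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two accumulators
)
averages = sum_count.mapValues(lambda t: t[0] / t[1])

# flatMap: explode each line into individual tokens
tokens = lines.flatMap(lambda line: line.split(","))

print(totals.take(5), averages.take(5), tokens.take(5))
```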

Portfolio Projects

Description

  • Designed and developed ETL integration patterns using PySpark (Python on Spark) on Cloudera, AWS EMR, and Azure Databricks.
  • Leveraged AWS technologies such as EMR (Spark clusters), S3, Glue, Athena, and Redshift to build data pipelines and make data available for analytics.
  • Built a data lake to store data from different source systems (EMR/EHR, registries, unstructured notes) from multiple practices, stored in multiple formats (CSV/XML/Parquet/JSON).
  • Performed batch data ingestion, importing and exporting data between HDFS and relational database systems and loading flat files (e.g., .csv, .txt), using NiFi pipelines.
  • Used Spark and Spark SQL to read Parquet data and create Hive tables via the Scala API.
  • Implemented Spark jobs in Scala, using DataFrames and the Spark SQL API for faster data processing.
  • Used Spark SQL and Scala APIs to query and transform data in Hive using DataFrames.
  • Worked closely with the customer to address solutions for all issues.
  • Developed Spark applications in Scala and Python (PySpark) and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Knowledge of Amazon EC2 Spot integration and Amazon S3 integration.
  • Optimized EMRFS so that Hadoop reads from and writes to AWS S3 directly, in parallel, and with good performance.
  • Optimized the performance of ingestion and consumption.
  • Developed PySpark scripts and deployed them as Azure Databricks jobs.
  • Performed ETL/ELT operations using Azure Data Factory (ADF) pipelines with copy, web, lookup, and get-metadata activities, using linked services to connect the source Blob Storage to the sink Azure SQL Database, and created triggers to execute the pipeline on a schedule.
  • Used Blob Storage and Azure Data Lake Storage (ADLS).
  • Served on the production support team and resolved tickets with root cause analysis.
  • Worked closely with the data science team and provided the data marts required for their predictive analytics.
  • Knowledge of machine learning algorithms such as linear regression, logistic regression, Naive Bayes, SVM, decision trees, random forests, boosting, and k-means.
  • Developed AWS Lambda functions in Python, triggered by S3 events, for ad hoc data engineering requirements (see the Lambda sketch after this list).
  • Worked the complete lifecycle, i.e., modeling, ingestion, transformation, aggregation, and the data access layer, through PySpark.
  • Designed and developed a concurrency framework to simulate parallel-connection load testing.
  • Designed and developed highly scalable, fault-tolerant systems serving 20 million records per day.
  • Developed Spark Streaming jobs consuming static and streaming data from different sources.
  • Handled monitoring, resource allocation, and configuration for Spark applications.
  • Performed ETL/ELT operations using ADF pipelines with copy, web, lookup, and get-metadata activities.
  • Scheduled jobs and grouped them into pools based on priority.
  • Administered the cluster and tuned memory based on RDD usage.
  • Implemented data ingestion and handled clusters for real-time processing using Kafka.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming (see the streaming sketch after this list).
  • Deployed Spark Streaming applications with an optimized number of executors, write-ahead logs, and checkpoint configurations.
  • Worked on the Kerberos token authentication and delegation token mechanism to implement Spark security.
  • Participated actively in the team and troubleshot issues.
  • Participated in technical discussions to convert business requirements into technical stories.
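A minimal sketch of the Kafka-to-Spark streaming pattern described above, written against the Structured Streaming API; the broker addresses, topic name, sink path, and checkpoint location are assumptions, and the spark-sql-kafka-0-10 package is assumed to be on the Spark classpath.

```python
# Minimal PySpark Structured Streaming sketch: consume server logs from Kafka and
# land them on HDFS with checkpointing. Brokers, topic, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-log-stream-sketch").getOrCreate()

logs = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # assumed brokers
    .option("subscribe", "server-logs")                              # assumed topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string").alias("log_line"))
)

query = (
    logs.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/server_logs")               # assumed sink path
    .option("checkpointLocation", "hdfs:///checkpoints/server_logs")  # enables recovery
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```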
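A minimal sketch of an S3-triggered AWS Lambda handler of the kind described above; the processing step is a placeholder, and the event shape follows the standard S3 ObjectCreated notification.

```python
# Minimal sketch of an S3-triggered AWS Lambda handler for ad hoc data engineering
# tasks (bucket/key handling only; the actual transformation logic is assumed).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Invoked by an S3 ObjectCreated event; reads each new object and records its size."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # Placeholder for the real transformation/enrichment step.
        results.append({"bucket": bucket, "key": key, "bytes": len(body)})

    return {"statusCode": 200, "body": json.dumps(results)}
```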

Description

  • Developed ETL pipelines using Hadoop, Spark, Scala, PySpark, Hive, Azure Databricks, and Azure SQL, performing data cleansing, validation, transformation, and analysis across data lakes and data marts.
  • Designed and implemented data ingestion pipeline jobs that load data into Hive tables using Sqoop.
  • Scaled EMR and Spark jobs to process billions of clinical interaction records daily.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the PySpark sketch after this list).
  • Designed and developed ETL integration patterns using PySpark (Python on Spark) on Cloudera and Azure Databricks.
  • Read and wrote multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Collected LMS data into the Hadoop cluster using Sqoop.
  • Wrote ETL jobs in Hive and processed the data according to business logic.
  • Knowledge of the Ab Initio ETL tool; migrated its ETL logic to Spark with Scala.
  • Infrastructure runs on AWS EMR; all Hadoop jobs ran on the EMR cluster.
  • Worked on optimizing the EMR cluster.
  • Extensive experience with AWS services such as S3, EMR, Amazon Redshift, DynamoDB, and Lambda functions in Python.
  • Created Oozie workflows to schedule jobs that generate reports on daily, weekly, and monthly cycles.
  • Familiar with Hadoop cluster setup and configuration.
  • Worked with Spark Streaming and Apache Kafka to fetch live stream data.
  • Developed Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets (see the serverless sketch after this list).
  • Worked closely with the Agile team and product owner to deliver high-quality products incrementally.
  • Created mappings and flows based on mapping documents.
  • Developed ETL mappings and transformations using Informatica PowerCenter.
  • Used Router, Filter, Sequence Generator, Joiner, Aggregator, and Expression transformations and mapplets for data migration in Informatica Designer.
  • Designed and created data cleansing, validation, and loading scripts for Oracle data.
  • Prepared unit test cases for the mappings.
  • Involved in performance tuning of mappings.
  • Worked with different sources, including relational databases and flat files.
  • Extracted, transformed, and loaded data from source to staging and from staging to target according to business requirements.
  • Designed and created data cleansing, validation, and loading scripts for the Oracle data warehouse using the Informatica ETL tool.
  • Developed ADF pipelines using copy, web, lookup, and get-metadata activities, with linked services connecting the source Blob Storage to the sink Azure SQL Database, plus triggers to execute the pipeline on a schedule, and performed data ingestion into the databases.
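A minimal PySpark sketch of loading CSV files with differing schemas into a Hive ORC table, as described above; the source paths, target table, and column names are illustrative assumptions, and both sources are assumed to share the selected columns.

```python
# Minimal PySpark sketch: load CSV files with differing schemas and write them to a
# Hive ORC table. Paths, table name, and columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc-sketch")
    .enableHiveSupport()              # needed to write managed Hive tables
    .getOrCreate()
)

# Each source drops files with its own column layout; infer the schema per source.
claims = spark.read.option("header", True).option("inferSchema", True) \
    .csv("hdfs:///landing/claims/*.csv")
visits = spark.read.option("header", True).option("inferSchema", True) \
    .csv("hdfs:///landing/visits/*.csv")

# Align to a common target layout before writing (assumes both sources carry these columns).
target_cols = ["patient_id", "event_date", "source_system"]
claims_std = claims.withColumn("source_system", lit("claims")).select(target_cols)
visits_std = visits.withColumn("source_system", lit("visits")).select(target_cols)

unified = claims_std.unionByName(visits_std)

# Write as ORC into a Hive table, partitioned by source system.
(
    unified.write
    .mode("append")
    .format("orc")
    .partitionBy("source_system")
    .saveAsTable("analytics.patient_events")   # assumed database and table
)
```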
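A minimal sketch of the serverless pattern above (an API Gateway request handled by Lambda with a DynamoDB lookup); the table name, key, and route are assumptions.

```python
# Minimal sketch of API Gateway -> Lambda -> DynamoDB. Table and key names are assumed.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("report_metadata")   # assumed table name


def lambda_handler(event, context):
    """Handle a GET /reports/{report_id} request proxied through API Gateway."""
    report_id = (event.get("pathParameters") or {}).get("report_id")
    if not report_id:
        return {"statusCode": 400, "body": json.dumps({"error": "report_id is required"})}

    response = table.get_item(Key={"report_id": report_id})
    item = response.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```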

Description

  • Installed and configured Hortonworks HDP 2.x and Cloudera (CDH 5.5.1) clusters, including capacity planning, in development and production environments.
  • Configured high availability for the NameNode in HDP 2.1.
  • Configured Kerberos authentication in the cluster.
  • Configured queues in the Capacity Scheduler.
  • Integrated Cloudera Data Science Workbench (CDSW) with a web application.
  • Developed R scripts to support analytical models with FTP, ODBC, and other provisions.
  • Developed scripts enabling FTP connectivity from an internal file storage application to CDSW.
  • Performed Hive query performance tuning and worked on memory management in Spark.
  • Developed data marts through Hive and Impala.
  • Involved in infrastructure setup for data access and data security.
  • Enabled Hadoop connectivity for third-party tools such as RStudio and Tableau.
  • Coordinated with business users to collect user requirements.
  • Worked cross-functionally across teams; awarded Bronze and Platinum Agile certifications.
  • Developed NiFi pipelines for data ingestion from SQL Server and flat files to HDFS for data aggregation.
  • Developed custom NiFi processors using Spark and Scala.
  • Loaded data to Kafka producers from REST endpoints and transferred the data to Kafka brokers (see the producer sketch after this list).
  • Imported real-time weblogs using Kafka as a messaging system and ingested the data into Spark Streaming.
  • Automated workflows using Airflow (see the DAG sketch after this list).
  • Deployed code using CI/CD tools such as GitHub and Jenkins.
  • Participated actively in the team and troubleshot issues.
  • Participated in technical discussions to convert business requirements into technical stories.
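A minimal Airflow 2.x DAG sketch for the kind of scheduled workflow mentioned above; the DAG id, schedule, and spark-submit job paths are assumptions.

```python
# Minimal Airflow DAG sketch: run a daily ingestion job followed by an aggregation job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    ingest = BashOperator(
        task_id="ingest_sql_server_extracts",
        bash_command="spark-submit /opt/jobs/ingest_sql_server.py",  # assumed job path
    )

    aggregate = BashOperator(
        task_id="aggregate_to_hive",
        bash_command="spark-submit /opt/jobs/aggregate_to_hive.py",  # assumed job path
    )

    ingest >> aggregate
```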
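A minimal sketch of pushing weblog lines to a Kafka topic with the kafka-python client, in the spirit of the producer work above; the broker address, topic, and log path are assumptions.

```python
# Minimal Kafka producer sketch: send weblog lines to a topic using kafka-python.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],             # assumed broker
    value_serializer=lambda v: v.encode("utf-8"),   # plain-text payloads
)

with open("/var/log/httpd/access.log") as log_file:       # assumed log location
    for line in log_file:
        producer.send("weblogs", value=line.rstrip("\n"))  # assumed topic name

producer.flush()
producer.close()
```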

Description

  • Applied Hive query optimization techniques.
  • Involved in Hadoop (Cloudera) cluster setup on AWS.
  • Automated the entire CI/CD process using Git and Jenkins.
  • Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
  • Used HDFS commands to import files into HDFS.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs (Resilient Distributed Datasets) and Scala (see the sketch after this list).
  • Performed data analysis through Hive.
  • Designed and developed the data ingestion component.
  • Provided cluster coordination services through ZooKeeper.
  • Imported data using Sqoop from Oracle to HDFS.
  • Imported and exported data using Sqoop between HDFS and the Teradata relational database.
  • Created executors for every partition in the Kafka direct stream.
  • Developed business logic using the Kafka direct stream in Spark Streaming and implemented business transformations.
  • Developed POCs on Apache Spark with Kafka, and Kafka with Flume.
  • Implemented a Flume, Spark, and Spark Streaming framework POC for real-time data processing.
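A minimal PySpark sketch of rewriting a Hive aggregate query as RDD transformations, analogous to the Hive-to-Spark conversions above (the original work used Scala); the table, columns, and query are illustrative assumptions.

```python
# Minimal sketch: rewrite a Hive aggregate query as RDD transformations.
#
# Hive version (assumed):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-rdd-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Equivalent RDD pipeline: read the Hive table, key by department, count per key.
dept_counts = (
    spark.table("employees").rdd        # rows from the Hive table as an RDD
    .map(lambda row: (row["dept"], 1))  # key each employee by department
    .reduceByKey(lambda a, b: a + b)    # count per department
)

for dept, count in dept_counts.collect():
    print(dept, count)
```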
