Sai Krishna M.

Hadoop/PySpark Developer with AWS ETL

Schaumburg, United States

Experience: 10 Years

140000 USD / Year

  • Immediate: Available

About Me

  • Cloud/Big Data Engineer with around 4 years of cloud experience and over 10 years of overall software development experience, spanning Big Data/Hadoop development, Azure, AWS, and ETL-related technologies.
  • Hands-on experience with major components of the Hadoop ecosystem, including HDFS, YARN, Hive, HBase, Pig, Sqoop, Flume, Apache Spark, Apache Kafka, Cloudera Data Science Workbench, StreamSets, Apache NiFi, AWS Glue, Amazon Redshift, and Amazon Kinesis.
  • Used sbt to build Scala-based Spark projects and executed them with spark-submit.
  • Implemented core Spark operators such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the PySpark sketch after this list).
  • Experienced with the Spark ecosystem, using Spark SQL and Scala to query data in formats such as text and CSV files.
  • Experience implementing Spark RDDs in Python (PySpark) and Scala.
  • Exposure to machine learning algorithms such as linear regression, logistic regression, Naive Bayes, SVM, decision trees, random forests, boosting, and k-means; work closely with the data science (DS) team.
  • Extensive experience creating, configuring, and performance-tuning EMR clusters on AWS.
  • Developed star-schema data models (cubes) using the TimeXtender ETL tool on Azure and published them to Power BI reports.
  • Set up standards and processes for Hadoop-based application design and development.
  • AWS Certified Solutions Architect - Associate.
  • Used Azure Data Factory (ADF) to develop ETL/ELT pipelines.
  • Developed REST APIs for the front-end application using Java, Spring, and Hibernate.
  • Experience performing data enrichment, cleansing, analytics, and aggregation using Hive and Spark SQL.
  • Set up CDSW in the Hadoop environment and assisted in developing models in R, Python, and Scala.
  • Experience designing, developing, and implementing connectivity products that allow efficient data exchange between the core database engine and the Hadoop ecosystem.
  • Experience with disaster recovery and backup activities, multi-node setup, performance tuning and benchmarking, security integration, and monitoring, maintenance, and troubleshooting of Hadoop clusters.
  • Knowledgeable in implementing data security using Kerberos authentication, user privileges, process IDs, etc.
  • Experience in batch data ingestion: importing and exporting data between HDFS and relational database systems, and loading flat files (e.g., .csv, .txt), using Sqoop, StreamSets, and NiFi pipelines.
  • Developed near-real-time data streaming pipelines using Kafka, Flume, and Spark Streaming.
  • Experience in data analytics using HiveQL, Apache Spark (with R, Python, and Scala), and Spark SQL.
  • Knowledge of Kubernetes for deploying applications and managing workloads and services with declarative configuration and automation.
  • Served on the production support team and resolved tickets.
  • Worked with the NoSQL database HBase to store large volumes of web logs.
  • Knowledge of data warehousing (DW) and SQL databases.
  • Deployed instances on AWS: provisioned EC2 and S3 buckets, configured security groups, and set up the Cloudera Hadoop ecosystem.
  • Knowledge of the Ab Initio ETL tool; migrated its ETL logic to Spark with Scala.
  • Created AWS Lambda functions and assigned roles to run Python scripts and perform event-driven processing.
  • Experience with data mining and business intelligence tools such as Tableau.
  • Knowledge of CI/CD pipelines using tools such as Git and Jenkins.
  • Knowledge of memory management in Spark and Hive performance tuning.
  • Exposure to Splunk for log analysis.
  • Experience in application development using Core Java, RDBMS, and Linux shell scripting.
  • Major strengths: familiarity with multiple software systems; the ability to learn new technologies quickly and adapt to new environments; self-motivated, focused, adaptive team player and quick learner with excellent interpersonal, technical, and communication skills.
  • Good communication skills and work ethic, with the ability to work efficiently in a team and good leadership skills.
  • Knowledge of the Agile methodology; Platinum and Silver Agile certified (at State Street Corporation) with cross-team functionality.
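A minimal PySpark sketch of the RDD operators mentioned above (map, flatMap, filter, reduceByKey, aggregateByKey); the input path and record layout are illustrative assumptions, not taken from any actual project.

```python
# Minimal PySpark sketch of common RDD operators (illustrative path and fields assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operators-sketch").getOrCreate()
sc = spark.sparkContext

# Assume a plain-text file where each line is "user_id,amount".
lines = sc.textFile("hdfs:///data/sample/transactions.txt")

pairs = (
    lines
    .map(lambda line: line.split(","))        # map: split each record into fields
    .filter(lambda f: len(f) == 2)            # filter: drop malformed rows
    .map(lambda f: (f[0], float(f[1])))       # key each record by user_id
)

# reduceByKey: total amount per user
totals = pairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey: (sum, count) per user, then average
sum_count = pairs.aggregateByKey(
    (0.0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two accumulators
)
averages = sum_count.mapValues(lambda t: t[0] / t[1])

# flatMap: explode each line into individual tokens
tokens = lines.flatMap(lambda line: line.split(","))

print(totals.take(5), averages.take(5), tokens.take(5))
```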

Portfolio Projects

Description

  • Designed and developed ETL integration patterns using PySpark (Python on Spark) on Cloudera, AWS EMR, and Azure Databricks.
  • Leveraged AWS technologies such as EMR (Spark clusters), S3, Glue, Athena, and Redshift to build data pipelines and make data available for analytics.
  • Built a data lake to store data from different source systems (EMR/EHR, registries, unstructured notes) from multiple practices, stored in multiple formats (CSV/XML/Parquet/JSON).
  • Performed batch data ingestion, importing and exporting data between HDFS and relational database systems and loading flat files (e.g., .csv, .txt), using NiFi pipelines.
  • Used Spark and Spark SQL to read Parquet data and create Hive tables via the Scala API.
  • Implemented Spark jobs in Scala, using DataFrames and the Spark SQL API for faster data processing.
  • Used Spark SQL and Scala APIs to query and transform data in Hive using DataFrames.
  • Worked closely with the customer to address solutions for all issues.
  • Developed Spark applications in Scala and Python (PySpark) and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Knowledge of Amazon EC2 Spot integration and Amazon S3 integration.
  • Optimized EMRFS so that Hadoop reads from and writes to AWS S3 directly, in parallel, and with good performance.
  • Optimized the performance of ingestion and consumption.
  • Developed PySpark scripts and deployed them as Azure Databricks jobs.
  • Performed ETL/ELT operations using Azure Data Factory (ADF) pipelines with copy, web, lookup, and get-metadata activities, using linked services to connect the source Blob Storage to the sink Azure SQL Database, and created triggers to execute the pipeline on a schedule.
  • Used Blob Storage and Azure Data Lake Storage (ADLS).
  • Served on the production support team and resolved tickets with root cause analysis.
  • Worked closely with the data science team and provided the data marts required for their predictive analytics.
  • Knowledge of machine learning algorithms such as linear regression, logistic regression, Naive Bayes, SVM, decision trees, random forests, boosting, and k-means.
  • Developed AWS Lambda functions in Python, triggered by S3 events, for ad hoc data engineering requirements (see the Lambda sketch after this list).
  • Worked the complete lifecycle, i.e., modeling, ingestion, transformation, aggregation, and the data access layer, through PySpark.
  • Designed and developed a concurrency framework to simulate parallel-connection load testing.
  • Designed and developed highly scalable, fault-tolerant systems serving 20 million records per day.
  • Developed Spark Streaming jobs consuming static and streaming data from different sources.
  • Handled monitoring, resource allocation, and configuration for Spark applications.
  • Performed ETL/ELT operations using ADF pipelines with copy, web, lookup, and get-metadata activities.
  • Scheduled jobs and grouped them into pools based on priority.
  • Administered the cluster and tuned memory based on RDD usage.
  • Implemented data ingestion and handled clusters for real-time processing using Kafka.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming (see the streaming sketch after this list).
  • Deployed Spark Streaming applications with an optimized number of executors, write-ahead logs, and checkpoint configurations.
  • Worked on the Kerberos token authentication and delegation token mechanism to implement Spark security.
  • Participated actively in the team and troubleshot issues.
  • Participated in technical discussions to convert business requirements into technical stories.
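A minimal sketch of the Kafka-to-Spark streaming pattern described above, written against the Structured Streaming API; the broker addresses, topic name, sink path, and checkpoint location are assumptions, and the spark-sql-kafka-0-10 package is assumed to be on the Spark classpath.

```python
# Minimal PySpark Structured Streaming sketch: consume server logs from Kafka and
# land them on HDFS with checkpointing. Brokers, topic, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-log-stream-sketch").getOrCreate()

logs = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # assumed brokers
    .option("subscribe", "server-logs")                              # assumed topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string").alias("log_line"))
)

query = (
    logs.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/server_logs")               # assumed sink path
    .option("checkpointLocation", "hdfs:///checkpoints/server_logs")  # enables recovery
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```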
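A minimal sketch of an S3-triggered AWS Lambda handler of the kind described above; the processing step is a placeholder, and the event shape follows the standard S3 ObjectCreated notification.

```python
# Minimal sketch of an S3-triggered AWS Lambda handler for ad hoc data engineering
# tasks (bucket/key handling only; the actual transformation logic is assumed).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Invoked by an S3 ObjectCreated event; reads each new object and records its size."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # Placeholder for the real transformation/enrichment step.
        results.append({"bucket": bucket, "key": key, "bytes": len(body)})

    return {"statusCode": 200, "body": json.dumps(results)}
```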

Description

  • Developed ETL pipelines using Hadoop, Spark, Scala, PySpark, Hive, Azure Databricks, and Azure SQL, performing data cleansing, validation, transformation, and analysis across data lakes and data marts.
  • Designed and implemented data ingestion pipeline jobs that load data into Hive tables using Sqoop.
  • Scaled EMR and Spark jobs to process billions of clinical interaction records daily.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the PySpark sketch after this list).
  • Designed and developed ETL integration patterns using PySpark (Python on Spark) on Cloudera and Azure Databricks.
  • Read and wrote multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Collected LMS data into the Hadoop cluster using Sqoop.
  • Wrote ETL jobs in Hive and processed the data according to business logic.
  • Knowledge of the Ab Initio ETL tool; migrated its ETL logic to Spark with Scala.
  • Infrastructure runs on AWS EMR; all Hadoop jobs ran on the EMR cluster.
  • Worked on optimizing the EMR cluster.
  • Extensive experience with AWS services such as S3, EMR, Amazon Redshift, DynamoDB, and Lambda functions in Python.
  • Created Oozie workflows to schedule jobs that generate reports on daily, weekly, and monthly cycles.
  • Familiar with Hadoop cluster setup and configuration.
  • Worked with Spark Streaming and Apache Kafka to fetch live stream data.
  • Developed Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets (see the serverless sketch after this list).
  • Worked closely with the Agile team and product owner to deliver high-quality products incrementally.
  • Created mappings and flows based on mapping documents.
  • Developed ETL mappings and transformations using Informatica PowerCenter.
  • Used Router, Filter, Sequence Generator, Joiner, Aggregator, and Expression transformations and mapplets for data migration in Informatica Designer.
  • Designed and created data cleansing, validation, and loading scripts for Oracle data.
  • Prepared unit test cases for the mappings.
  • Involved in performance tuning of mappings.
  • Worked with different sources, including relational databases and flat files.
  • Extracted, transformed, and loaded data from source to staging and from staging to target according to business requirements.
  • Designed and created data cleansing, validation, and loading scripts for the Oracle data warehouse using the Informatica ETL tool.
  • Developed ADF pipelines using copy, web, lookup, and get-metadata activities, with linked services connecting the source Blob Storage to the sink Azure SQL Database, plus triggers to execute the pipeline on a schedule, and performed data ingestion into the databases.
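A minimal PySpark sketch of loading CSV files with differing schemas into a Hive ORC table, as described above; the source paths, target table, and column names are illustrative assumptions, and both sources are assumed to share the selected columns.

```python
# Minimal PySpark sketch: load CSV files with differing schemas and write them to a
# Hive ORC table. Paths, table name, and columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc-sketch")
    .enableHiveSupport()              # needed to write managed Hive tables
    .getOrCreate()
)

# Each source drops files with its own column layout; infer the schema per source.
claims = spark.read.option("header", True).option("inferSchema", True) \
    .csv("hdfs:///landing/claims/*.csv")
visits = spark.read.option("header", True).option("inferSchema", True) \
    .csv("hdfs:///landing/visits/*.csv")

# Align to a common target layout before writing (assumes both sources carry these columns).
target_cols = ["patient_id", "event_date", "source_system"]
claims_std = claims.withColumn("source_system", lit("claims")).select(target_cols)
visits_std = visits.withColumn("source_system", lit("visits")).select(target_cols)

unified = claims_std.unionByName(visits_std)

# Write as ORC into a Hive table, partitioned by source system.
(
    unified.write
    .mode("append")
    .format("orc")
    .partitionBy("source_system")
    .saveAsTable("analytics.patient_events")   # assumed database and table
)
```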
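A minimal sketch of the serverless pattern above (an API Gateway request handled by Lambda with a DynamoDB lookup); the table name, key, and route are assumptions.

```python
# Minimal sketch of API Gateway -> Lambda -> DynamoDB. Table and key names are assumed.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("report_metadata")   # assumed table name


def lambda_handler(event, context):
    """Handle a GET /reports/{report_id} request proxied through API Gateway."""
    report_id = (event.get("pathParameters") or {}).get("report_id")
    if not report_id:
        return {"statusCode": 400, "body": json.dumps({"error": "report_id is required"})}

    response = table.get_item(Key={"report_id": report_id})
    item = response.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```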

Description

  • Installed and configured Hortonworks HDP 2.x and Cloudera (CDH 5.5.1) clusters, including capacity planning, in development and production environments.
  • Configured high availability for the NameNode in HDP 2.1.
  • Configured Kerberos authentication in the cluster.
  • Configured queues in the Capacity Scheduler.
  • Integrated Cloudera Data Science Workbench (CDSW) with a web application.
  • Developed R scripts to support analytical models with FTP, ODBC, and other provisions.
  • Developed scripts enabling FTP connectivity from an internal file storage application to CDSW.
  • Performed Hive query performance tuning and worked on memory management in Spark.
  • Developed data marts through Hive and Impala.
  • Involved in infrastructure setup for data access and data security.
  • Enabled Hadoop connectivity for third-party tools such as RStudio and Tableau.
  • Coordinated with business users to collect user requirements.
  • Worked cross-functionally across teams; awarded Bronze and Platinum Agile certifications.
  • Developed NiFi pipelines for data ingestion from SQL Server and flat files to HDFS for data aggregation.
  • Developed custom NiFi processors using Spark and Scala.
  • Loaded data to Kafka producers from REST endpoints and transferred the data to Kafka brokers (see the producer sketch after this list).
  • Imported real-time weblogs using Kafka as a messaging system and ingested the data into Spark Streaming.
  • Automated workflows using Airflow (see the DAG sketch after this list).
  • Deployed code using CI/CD tools such as GitHub and Jenkins.
  • Participated actively in the team and troubleshot issues.
  • Participated in technical discussions to convert business requirements into technical stories.
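A minimal Airflow 2.x DAG sketch for the kind of scheduled workflow mentioned above; the DAG id, schedule, and spark-submit job paths are assumptions.

```python
# Minimal Airflow DAG sketch: run a daily ingestion job followed by an aggregation job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    ingest = BashOperator(
        task_id="ingest_sql_server_extracts",
        bash_command="spark-submit /opt/jobs/ingest_sql_server.py",  # assumed job path
    )

    aggregate = BashOperator(
        task_id="aggregate_to_hive",
        bash_command="spark-submit /opt/jobs/aggregate_to_hive.py",  # assumed job path
    )

    ingest >> aggregate
```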
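A minimal sketch of pushing weblog lines to a Kafka topic with the kafka-python client, in the spirit of the producer work above; the broker address, topic, and log path are assumptions.

```python
# Minimal Kafka producer sketch: send weblog lines to a topic using kafka-python.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],             # assumed broker
    value_serializer=lambda v: v.encode("utf-8"),   # plain-text payloads
)

with open("/var/log/httpd/access.log") as log_file:       # assumed log location
    for line in log_file:
        producer.send("weblogs", value=line.rstrip("\n"))  # assumed topic name

producer.flush()
producer.close()
```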

Description

  • Applied Hive query optimization techniques.
  • Involved in Hadoop (Cloudera) cluster setup on AWS.
  • Automated the entire CI/CD process using Git and Jenkins.
  • Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
  • Used HDFS commands to import files into HDFS.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs (Resilient Distributed Datasets) and Scala (see the sketch after this list).
  • Performed data analysis through Hive.
  • Designed and developed the data ingestion component.
  • Provided cluster coordination services through ZooKeeper.
  • Imported data using Sqoop from Oracle to HDFS.
  • Imported and exported data using Sqoop between HDFS and the Teradata relational database.
  • Created executors for every partition in the Kafka direct stream.
  • Developed business logic using the Kafka direct stream in Spark Streaming and implemented business transformations.
  • Developed POCs on Apache Spark with Kafka, and Kafka with Flume.
  • Implemented a Flume, Spark, and Spark Streaming framework POC for real-time data processing.
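A minimal PySpark sketch of rewriting a Hive aggregate query as RDD transformations, analogous to the Hive-to-Spark conversions above (the original work used Scala); the table, columns, and query are illustrative assumptions.

```python
# Minimal sketch: rewrite a Hive aggregate query as RDD transformations.
#
# Hive version (assumed):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-rdd-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Equivalent RDD pipeline: read the Hive table, key by department, count per key.
dept_counts = (
    spark.table("employees").rdd        # rows from the Hive table as an RDD
    .map(lambda row: (row["dept"], 1))  # key each employee by department
    .reduceByKey(lambda a, b: a + b)    # count per department
)

for dept, count in dept_counts.collect():
    print(dept, count)
```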
