Skills
Database
Web Development
Programming Language
Operating System
Others
Portfolio Projects
Company
Real-time Streaming Data Pipeline
Description:
The objective of this project is to build a real-time streaming data pipeline that loads data into MongoDB, where it is used by data scientists to build their predictive models.
Responsibilities:
- Worked with a CDH environment comprising a 126-node production cluster and a 64-node development cluster.
- Responsible for coordinating end-to-end project-related activities.
- Involved in all phases of the SDLC including development, testing, and deployment of the application in the Hadoop cluster.
- Developed a Logstash filter to filter incoming data from Splunk Enterprise based on domain and service names.
- Loaded the filtered data into a Kafka topic.
- Developed a Scala script to subscribe to the Kafka topic and load the data into an HDFS location (see the sketch after this list).
- Developed a Scala script to read the HDFS data and load it into MongoDB.
- Monitored Oozie-scheduled jobs in Ambari, captured failed Sqoop ingestion and Spark jobs, and took corrective action.
- Used sbt to compile and package the Scala code into a JAR and deployed it to the cluster using spark-submit (a build sketch follows this list).
- Performed end-to-end performance testing of the real-time data flow from Splunk to MongoDB.
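A minimal sketch of the two Scala jobs described above, assuming Spark Structured Streaming with the spark-sql-kafka connector and the MongoDB Spark Connector; the broker list, topic, HDFS paths, database, and collection names are placeholders, not the project's actual values.

```scala
import org.apache.spark.sql.SparkSession

object StreamingPipelineSketch {

  // Job 1: subscribe to the Kafka topic and append the raw events to an HDFS location.
  def kafkaToHdfs(spark: SparkSession): Unit = {
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")    // placeholder broker list
      .option("subscribe", "filtered_events")               // placeholder topic name
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    events.writeStream
      .format("parquet")                                     // assumed landing format
      .option("path", "hdfs:///data/streaming/events")       // placeholder HDFS location
      .option("checkpointLocation", "hdfs:///checkpoints/streaming_events")
      .start()
      .awaitTermination()
  }

  // Job 2: read the landed HDFS data and load it into MongoDB for the data scientists' models.
  // Uses MongoDB Spark Connector v10.x options; older connector versions use format "mongo".
  def hdfsToMongo(spark: SparkSession): Unit = {
    spark.read.parquet("hdfs:///data/streaming/events")
      .write
      .format("mongodb")
      .option("spark.mongodb.write.connection.uri", "mongodb://mongo-host:27017") // placeholder URI
      .option("spark.mongodb.write.database", "analytics")                        // placeholder database
      .option("spark.mongodb.write.collection", "events")                         // placeholder collection
      .mode("append")
      .save()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("realtime-pipeline-sketch").getOrCreate()
    kafkaToHdfs(spark) // in practice each job ran as its own spark-submit application
  }
}
```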
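The build-and-deploy step can be sketched with a build.sbt like the following; the project name, Scala and Spark versions, and connector coordinates are assumptions rather than the project's actual values.

```scala
// build.sbt (sketch; names and versions are assumptions)
name := "realtime-pipeline"
version := "0.1"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"             % "3.3.2"  % Provided,
  "org.apache.spark"  %% "spark-sql-kafka-0-10"  % "3.3.2"  % Provided,
  "org.mongodb.spark" %% "mongo-spark-connector" % "10.1.1"
)
```

Running `sbt package` produces the application JAR under target/scala-2.12/, which would then be deployed with a command along the lines of `spark-submit --master yarn --class <main-class> realtime-pipeline_2.12-0.1.jar`, where `<main-class>` stands for the application's entry point.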
Tools
IntelliJ IDEA
Company
Historical Data Processing Using PySpark
Description:
The objective of this project is to create Hive external tables from master tables, preprocess the data, and store it back into Hive external tables, where it is used by data scientists to train their models.
Responsibilities:
- Worked with a CDH environment comprising a 64-node production cluster and a 36-node development cluster.
- Responsible for coordinating end-to-end project-related activities.
- Involved in all phases of the SDLC including development, testing, and deployment of the application in the Hadoop cluster.
- Gathered business requirements from the data scientists and loaded data for a specific duration from the master Hive external table into another partitioned Hive external table (see the sketch after this list).
- After a successful load into the Hive table, the data was accessed through PySpark.
- Developed a PySpark preprocessing script and loaded the results into Hive tables for the data scientists to train their models.
- Used sbt to compile and package the Scala code into a JAR and deployed it to the cluster using spark-submit.
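The partition-scoped load described above was done in PySpark; the sketch below shows the equivalent Spark SQL, wrapped in Scala to match the other sketches here. The database, table and column names, partition column, and date window are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object HistoricalLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("historical-load-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partition inserts into the target Hive table.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Copy a specific duration from the master table into a partitioned external table.
    spark.sql(
      """INSERT OVERWRITE TABLE analytics_db.training_events PARTITION (event_date)
        |SELECT customer_id, feature_1, feature_2, event_date
        |FROM   analytics_db.master_events
        |WHERE  event_date BETWEEN '2019-01-01' AND '2019-06-30'
        |""".stripMargin)

    // The preprocessed result is then read back by the model-training jobs.
    val training = spark.table("analytics_db.training_events")
    println(s"Loaded ${training.count()} rows for model training")
  }
}
```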
Tools
IntelliJ IDEA
Company
Migration Project
Description:
The objective of this project is to develop Spark applications that convert Informatica workflows and load the results into Hive ORC tables for use by downstream systems. The end-to-end flow is scheduled using Talend.
Responsibilities:
- Worked with an HDP environment comprising a 64-node production cluster and a 36-node development cluster.
- Responsible for coordinating end-to-end project-related activities.
- Involved in all phases of the SDLC including analysis, design, development, testing, and deployment of the application in the Hadoop cluster.
- Developed shell scripts to pull files from the Informatica server into Hadoop.
- Responsible for creating Hive external and internal (ORC) tables with zlib compression (see the first sketch after this list).
- Developed a Spark ingestion framework to load data from the Hive external tables into the internal tables in one shot.
- Studied the Informatica mappings and documented the business logic.
- Developed Spark code for the consumption layer, implementing the Informatica logic and loading the data into Hive fact and dimension tables (see the second sketch after this list).
- Used sbt to compile and package the Scala code into a JAR and deployed it to the cluster using spark-submit.
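A minimal sketch of the table creation and one-shot ingestion steps described above, assuming Spark with Hive support; the database, table, and column names, delimiter, and landing path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object OrcIngestionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("orc-ingestion-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // External staging table over the files pulled from the Informatica server (placeholder DDL).
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS stage_db.orders_ext (
        |  order_id STRING, customer_id STRING, amount DOUBLE, order_date STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION 'hdfs:///landing/orders'
        |""".stripMargin)

    // Internal (managed) ORC table with zlib compression.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS core_db.orders_orc (
        |  order_id STRING, customer_id STRING, amount DOUBLE, order_date STRING)
        |STORED AS ORC
        |TBLPROPERTIES ('orc.compress' = 'ZLIB')
        |""".stripMargin)

    // One-shot load from the external staging table into the ORC internal table.
    spark.table("stage_db.orders_ext")
      .write
      .mode("overwrite")
      .insertInto("core_db.orders_orc")
  }
}
```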
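A minimal sketch of the consumption layer, assuming the documented Informatica logic reduces to joins and derived columns before the fact and dimension loads; all table names and the business rule shown are placeholders, not the project's actual mappings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ConsumptionLayerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("consumption-layer-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val orders    = spark.table("core_db.orders_orc")     // placeholder source tables
    val customers = spark.table("core_db.customers_orc")

    // Dimension table: one row per customer with a surrogate key.
    val dimCustomer = customers
      .select("customer_id", "customer_name", "region")
      .dropDuplicates("customer_id")
      .withColumn("customer_key", monotonically_increasing_id())
    dimCustomer.write.mode("overwrite").saveAsTable("mart_db.dim_customer")

    // Fact table: orders joined to the dimension, with a derived measure
    // standing in for the documented mapping logic.
    val factOrders = orders
      .join(dimCustomer, Seq("customer_id"))
      .withColumn("net_amount", col("amount") * lit(0.9))  // placeholder business rule
      .select("customer_key", "order_id", "order_date", "amount", "net_amount")
    factOrders.write.mode("overwrite").saveAsTable("mart_db.fact_orders")
  }
}
```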
Tools
Scala IDE