About Me
Lead Data Scientist with 17 years of total IT experience: 1 year in Tuning, 5 years in Data Science (Machine Learning, Deep Learning and NLP with R & Python), 1.5 years in Big Data (MapReduce, Pig & Hive) and 9.5 years in Java. About 5.6 years into Bank...
Portfolio Projects
Description
Project #1: CASA Loss Prediction
For a bank, one of the cheapest sources of funds is savings and current account balances (CASA balance). The goal was to identify customers whose CASA balance would drop by 60% or more in the next 6 months, so the bank could target those customers with corrective actions to avoid the CASA balance loss. Data from May 2018 to Oct 2018 was used to train the model, and a 6-month window from Nov 2018 to Apr 2019 was used to validate it. The model captured 52% of leads in the top 3 deciles (how this capture rate is computed is sketched after the stack line below). The top predictors were total AUM, cash inflow, cash outflow, tenure, debit transactions and the customer's savings-balance trend over the last 6 months.
Client: 2nd largest bank in Malaysia.
Responsibilities:
Performed exploratory analysis and drew inferences by visualising the data
Removed insignificant variables using dimensionality reduction techniques
Used Logistic Regression, Random Forest and XGBoost.
Achievements:
Identified RM 2.6 billion of CASA at risk of loss over the next 6 months. The bank acted to save this amount by targeting those customers with offers such as bonds to retain the CASA balance.
Environment/Technology Stack: Citrix Server, Big Data environment, Hue for Hive & Jupyter Notebook.
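The 52% figure above is a decile-capture (gains) measure: score all customers, sort them into deciles by predicted probability, and count how many actual CASA-loss cases land in the top 3 deciles. A minimal sketch of that check is below; the column names and helper function are illustrative, not the project's actual code.

    import pandas as pd

    def capture_in_top_deciles(scores: pd.Series, actuals: pd.Series, top: int = 3) -> float:
        """Share of all positive cases that fall in the `top` highest-score deciles."""
        df = pd.DataFrame({"score": scores, "actual": actuals})
        # Decile 1 = highest predicted probability of CASA loss
        ranks = df["score"].rank(method="first", ascending=False)
        df["decile"] = pd.qcut(ranks, 10, labels=False) + 1
        return df.loc[df["decile"] <= top, "actual"].sum() / df["actual"].sum()

    # Hypothetical usage on the out-of-time validation frame:
    # capture_in_top_deciles(valid["pred_proba"], valid["casa_loss_60pct"])  # ~0.52 reported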
Description
Project #2: Customer FD Price Sensitivity
The bank wanted to know which customers are price insensitive, so that it could offer them a better (preferred) interest rate than the normal (board) rates. It also wanted an FD propensity model to predict which existing or new customers will take their next FD. Data from Jan 2018 to Dec 2018 was used to train the model, and a 2-month window from Jan 2019 to Feb 2019 was used to validate it (the out-of-time split is sketched below). The model captured 46% of leads in the top 3 deciles, so the business can take just the top 3 deciles and identify 46% of price-insensitive customers. The top 5 predictors were average online debit transactions over the last 1 year, age, customer segment, the customer's SA balance mean over the initial 6 months relative to the mean over the next 6 months, and average branch debit transactions over the last 1 year. This allows targeting an outflow of RM 4.8 billion in FD balances as of 2018.
Client: 2nd largest bank in Malaysia.
Responsibilities:
Performed exploratory analysis and drew inferences by visualising the data
Removed insignificant variables using dimensionality reduction techniques
Used Logistic Regression, Random Forest and XGBoost.
Random Forest gave the best results.
Environment/Technology Stack: Citrix Server, Big Data environment, Hue for Hive & Jupyter Notebook.
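A minimal sketch of the out-of-time split and Random Forest fit described above, assuming a hypothetical snapshot file and column names (price_insensitive as the target); it illustrates the setup rather than the project's actual code.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("fd_price_sensitivity.csv", parse_dates=["snapshot_date"])  # hypothetical extract

    train = df[(df["snapshot_date"] >= "2018-01-01") & (df["snapshot_date"] <= "2018-12-31")]
    valid = df[(df["snapshot_date"] >= "2019-01-01") & (df["snapshot_date"] <= "2019-02-28")]

    features = [c for c in df.columns if c not in ("price_insensitive", "snapshot_date")]
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(train[features], train["price_insensitive"])

    # Out-of-time performance check on the Jan-Feb 2019 window
    print("Validation AUC:", roc_auc_score(valid["price_insensitive"],
                                           model.predict_proba(valid[features])[:, 1]))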
Description
Project #3: Inbound Call Depletion
The bank wanted to reduce the number of inbound calls to its call centre. The inbound calls relate to debit card, savings account, current account and credit card reasons. Data from Feb 2016 to Feb 2019 (3 years) was used to train the model, and one month of data (Apr 2019) was used to validate it. The top 5 predictors used to build the model were tenure, total savings account balance, active click user, age group and transaction amount on day 2. The model captured 61% of leads in the top 3 deciles.
Client: 2nd largest bank in Malaysia.
Responsibilities:
Performed exploratory analysis and drew inferences by visualising the data
Removed insignificant variables using dimensionality reduction techniques (see the feature-reduction sketch below)
Used Logistic Regression and Random Forest.
Random Forest gave the best results.
Environment/Technology Stack: Citrix Server, Big Data environment, Hue for Hive & Jupyter Notebook.
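The variable-reduction step referenced in the responsibilities can be done in several ways; the sketch below drops near-constant and highly correlated numeric features, with the thresholds chosen purely for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    def reduce_features(X: pd.DataFrame, corr_threshold: float = 0.9) -> pd.DataFrame:
        """Drop near-constant columns, then one column from each highly correlated pair."""
        keep = VarianceThreshold(threshold=1e-4).fit(X).get_support()
        X = X.loc[:, keep]
        corr = X.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
        to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
        return X.drop(columns=to_drop)

    # Hypothetical usage: X_reduced = reduce_features(call_features)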
Description
Project #4: Transaction Fraud Detection
Bank transaction data was explored and insights were drawn from it. Labels are provided for the given transactions, and the model learns the fraud patterns so it can predict whether a given transaction is fraudulent or not (an illustrative setup is sketched after the stack line below).
Responsibilities:
Performed exploratory analysis and drew inferences by visualising the data
Removed insignificant variables using dimensionality reduction techniques
Used Logistic Regression and XGBoost.
Environment/Technology Stack: Windows Server 2012 & Machine Learning with Python
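Fraud labels are usually heavily imbalanced, so a weighted XGBoost is a natural fit; the sketch below uses synthetic data as a stand-in for the bank's transactions and only illustrates the setup, not the project's actual code.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Synthetic stand-in for the labelled transactions (the real data stays with the bank)
    X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

    model = XGBClassifier(
        n_estimators=400,
        max_depth=5,
        learning_rate=0.05,
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),  # up-weight rare fraud class
        eval_metric="aucpr",  # precision-recall AUC suits rare-event detection
    )
    model.fit(X_train, y_train)
    fraud_scores = model.predict_proba(X_valid)[:, 1]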
Description
Project #5: Credit Card Sanction Predictor
Applicant and credit bureau data is available, and the model predicts whether a customer will repay their credit card bill or default. This helped the bank decide whether to sanction a credit card to an applicant based on the given domestic and credit bureau data (see the pipeline sketch below).
Responsibilities:
Performed exploratory analysis and drew inferences by visualising the data
Removed insignificant variables using dimensionality reduction techniques
Used Logistic Regression and Random Forest models.
Environment/Technology Stack: Windows Server 2012 & Machine Learning with Python
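A minimal sketch of a preprocessing-plus-Logistic-Regression pipeline for mixed applicant and bureau data; the file name, target column and feature handling are assumptions for illustration.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("credit_applications.csv")  # hypothetical combined applicant/bureau extract
    X, y = df.drop(columns=["defaulted"]), df["defaulted"]

    numeric = X.select_dtypes(include="number").columns
    categorical = X.select_dtypes(exclude="number").columns

    pre = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])

    model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
    model.fit(X, y)  # score new applicants via model.predict_proba(...)[:, 1]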
Description
Project #6: Telecom Churn
Identified churning customers by analysing customer data, and identified the best model out of KNN, Naive Bayes and Logistic Regression.
Responsibilities:
Explored the data using different visualization techniques.
Improved the quality of the data by removing inconsistent records, missing values & outliers (see the cleaning sketch below).
Used algorithms like KNN, Naive Bayes and Logistic Regression.
Environment/Technology Stack: Windows XP & Machine Learning with R.
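The original work was done in R; for illustration, the cleaning step (duplicates, missing values and IQR-based outliers) could look like the Python sketch below, with the thresholds as assumptions.

    import pandas as pd

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Drop duplicates, impute numeric gaps with the median, trim IQR outliers."""
        df = df.drop_duplicates()
        num_cols = df.select_dtypes(include="number").columns
        df[num_cols] = df[num_cols].fillna(df[num_cols].median())
        # Keep rows within 1.5 * IQR of the quartiles for every numeric column
        q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
        iqr = q3 - q1
        inside = ~((df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
        return df[inside]

    # Hypothetical usage: churn_clean = clean(churn_raw)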
Description
Project #7: Spark Funds Investment
Helped the Spark Funds investment company identify the geographies and sectors for its investments to maximise returns in the start-up ecosystem.
Responsibilities:
Extracted the data from the client and built an understanding of it
Performed exploratory analysis and cleansed the data
Identified the top countries with high investments
Identified the top sectors to invest in (see the aggregation sketch below)
Environment/Technology Stack: Windows XP & Machine Learning with R.
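The original analysis was done in R; an equivalent aggregation in Python is sketched below, with the file and column names (country_code, main_sector, raised_amount_usd) as assumptions about the investment dataset.

    import pandas as pd

    investments = pd.read_csv("investments.csv")  # hypothetical extract of funding rounds

    top_countries = (investments.groupby("country_code")["raised_amount_usd"]
                     .sum().sort_values(ascending=False).head(10))
    top_sectors = (investments.groupby("main_sector")["raised_amount_usd"]
                   .sum().sort_values(ascending=False).head(10))
    print(top_countries, top_sectors, sep="\n\n")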
Description
Project #8: SPC
This application is a widely implemented strategy for managing sales & logistics. It uses technology to organise, automate and synchronise business processes, principally logistics activities but also sales. The overall goals are to handle the logistics and sales processes efficiently and reduce the costs involved in logistics & sales. It includes a management system for tracking and recording every stage from initial logistics to final sales.
Responsibilities:
Worked on a live 20-node Hadoop cluster running CDH.
Extracted the data from Oracle RDBMS into HDFS using Sqoop.
Created and worked with Sqoop jobs to populate Hive External tables.
Developed Hive scripts for end user / analyst requirements to perform ad hoc analysis.
Developed Oozie workflow for scheduling the ETL process.
Environment/Technology Stack: RHEL, Hadoop, HDFS, MapReduce, Hive and HBase.
Description
Bank transaction data was explored and insights were drawn from it. Labels are provided for the given transactions, and the model learns the fraud patterns so it can predict whether a given transaction is fraudulent or not. The fraud transactions are classified into High, Medium and Low risk; the high-level fraud transactions are the focus, while low-level fraud transactions can be closed.
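A minimal sketch of how model scores could be banded into High / Medium / Low fraud risk; the cut-offs are assumptions that would in practice be agreed with the fraud team.

    import pandas as pd

    def risk_band(fraud_probability: float) -> str:
        if fraud_probability >= 0.8:
            return "High"    # investigate
        if fraud_probability >= 0.4:
            return "Medium"
        return "Low"         # can be closed

    scored = pd.DataFrame({"txn_id": [101, 102, 103],
                           "fraud_probability": [0.93, 0.55, 0.07]})
    scored["risk_band"] = scored["fraud_probability"].apply(risk_band)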
Description
The objective of this tool is to develop AI models using different algorithms such as KNN, Naive Bayes, Logistic Regression, Random Forest and XGBoost. The tool takes input data from either a CSV file or a database, then performs preprocessing, feature selection, feature engineering, model development and performance validation, compares all the algorithms' results, and returns the results of the best algorithm for the given dataset.
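The core loop of such a tool could look like the sketch below: cross-validate each candidate algorithm and keep the best scorer. A scikit-learn toy dataset stands in for the CSV/database input, and the scoring choice is an assumption.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)  # stand-in for the CSV/DB input

    candidates = {
        "KNN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=5000),
        "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
        "XGBoost": XGBClassifier(eval_metric="logloss"),
    }
    scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print("Best algorithm:", best, scores[best])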
Description
The objective is to identify the customers who will take up the FD product, so that those customers can be targeted. About 2 years of data (Jan 2018 to Nov 2019) was used to train the model and 2 months of data (Dec 2019 to Jan 2020) was used to validate it. The top 5 features are average online debit transactions over the last 1 year, age, customer segment, the customer's SA balance mean over the initial 6 months relative to the mean over the next 6 months, and average branch debit transactions over the last 1 year. The model captured 58% of FDs from the top 30% of customers targeted. Algorithms used are Logistic Regression, Random Forest and XGBoost.
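The "top 5 features" above come from the model's importance ranking; a minimal sketch of extracting such a ranking from a fitted XGBoost model is below, with synthetic data and the feature names used purely as placeholders.

    import pandas as pd
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    # Synthetic stand-in for the FD training frame; the feature names are placeholders
    X, y = make_classification(n_samples=5000, n_features=8, random_state=0)
    X = pd.DataFrame(X, columns=[
        "avg_online_debit_txn_1y", "age", "customer_segment_code",
        "sa_balance_ratio_6m", "avg_branch_debit_txn_1y",
        "tenure", "total_aum", "cash_inflow"])

    model = XGBClassifier(n_estimators=300, eval_metric="logloss").fit(X, y)
    top5 = (pd.Series(model.feature_importances_, index=X.columns)
            .sort_values(ascending=False).head(5))
    print(top5)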
Description
Today the bank's vendor maintains the ATMs' cash manually, and the bank loses interest on unused cash sitting in the ATMs. The bank wants to keep only the required amount of cash, based on the ATM and the specific day of the month, so the withdrawal amount needs to be predicted for each ATM 30 days ahead. Two years of data from Oct 2017 to Sept 2019 was used to train the model and two months of data from Nov to Dec 2019 was used to validate it.
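A minimal sketch of a per-ATM forecasting setup using lag and calendar features; the file, column names and model choice are assumptions, and forecasting a full 30 days ahead would in practice need recursive or direct multi-step prediction.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    daily = pd.read_csv("atm_withdrawals.csv", parse_dates=["date"])  # hypothetical: atm_id, date, amount
    daily = daily.sort_values(["atm_id", "date"])

    # Lag features of past withdrawals plus simple calendar features
    for lag in (1, 7, 30):
        daily[f"lag_{lag}"] = daily.groupby("atm_id")["amount"].shift(lag)
    daily["day_of_week"] = daily["date"].dt.dayofweek
    daily["day_of_month"] = daily["date"].dt.day
    daily = daily.dropna()

    features = ["lag_1", "lag_7", "lag_30", "day_of_week", "day_of_month"]
    train = daily[daily["date"] <= "2019-09-30"]   # Oct 2017 - Sept 2019
    valid = daily[daily["date"] >= "2019-11-01"]   # Nov - Dec 2019

    model = GradientBoostingRegressor(random_state=0)
    model.fit(train[features], train["amount"])
    valid_pred = model.predict(valid[features])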
Description
The objective is to extract sentiment about the bank from social media sites, especially Twitter. We need to identify which tweets carry positive sentiment and which carry negative sentiment, explore the negative sentiment in more depth, and derive actionable insights from it. We also need to pick out the part of the tweet (word or phrase) that reflects the sentiment, i.e. identify which words in a tweet support a positive, negative or neutral label. The data has 4 columns: text ID, text, selected text and sentiment. For each text ID, we need to find the selected text that characterises the sentiment.
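A minimal baseline for the sentiment-labelling part of the task is sketched below (TF-IDF plus Logistic Regression); the file name is hypothetical, and extracting the supporting word span itself would need a token-level or span-prediction model on top of this.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    tweets = pd.read_csv("tweets.csv")  # hypothetical: textID, text, selected_text, sentiment
    X_train, X_test, y_train, y_test = train_test_split(
        tweets["text"].fillna(""), tweets["sentiment"], random_state=0)

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("Sentiment accuracy:", clf.score(X_test, y_test))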