About Me
Experienced Data Scientist and Data Solution Architect, designing and developing enterprise-class systems. Hands-on experience in data collection, feature extraction, feature collection, machine learning (SVM, Random Forest, Apriori, Regression, Classif...
Portfolio Projects
Machine Learning: LightGBM, Support Vector Machine, and neural networks such as CNN and RNN
Technology: Python, Apache Spark, PyTorch, Apache Kafka, Flume, MongoDB, Jenkins, Git
Supported Platforms: Ubuntu 17.0, CentOS
Team Size: 12
Duration: Dec 2017 to June 2019
Managed Detection and Response combines technology and skills to deliver advanced threat detection, deep threat analytics, global threat intelligence, faster incident mitigation, and collaborative breach response on a 24x7 basis.
The endpoints (IoT devices and enterprise servers) are scanned by an EMS system or scan component. Apache Flume agents capture the logs and send them to a topic in Apache Kafka.
The queue is consumed by an Apache Spark Streaming component. The data is reduced and stored in Cassandra for machine learning.
The data is then processed with machine learning for threat detection. The output is stored in MongoDB and displayed in a dashboard.
Machine learning techniques, including ensemble learning and boosting, are applied with models such as Support Vector Machine, LightGBM, CNN, and RNN to obtain the best possible result.
Role & Responsibilities:
· Part of the Product Architecture Team.
· Led the development of the data ingestion and log processing components using Apache Spark, Flume, Kafka, HDFS, and MongoDB.
· Feature selection and engineering for web attacks, network attacks, and malware attacks, using LightGBM, Support Vector Machine, and neural networks such as CNN and RNN.
· Collaborated with cross-functional teams in an agile environment.
Supported Platforms: Windows and Linux.
Team Size: 4
Duration: Apr 2017 to Dec 2017
IGA is an integrated access management and governance product that covers the entire life cycle of employee engagement (onboarding and exit). During onboarding, the employee ID is created and access to the different systems is granted after approval. During exit, all access rights and the ID are revoked.
Machine Learning
The data is collected from multiple systems, such as the attendance system, leave portal, access management, training system, appraisal system, and other client systems.
The collected data is cleansed, parsed, and validated. Feature selection and engineering and exploratory data analysis are then applied to derive multiple metrics. A dashboard is built on top of these metrics using Tableau.
Machine learning techniques, including ensemble learning and boosting, are applied with models such as Support Vector Machine and LightGBM to obtain the best possible result.
Role & Responsibilities:
· Part of the Product Architecture Team.
· Model creation, data pre-processing, and data cleaning.
· Feature selection and engineering.
· Implemented machine learning techniques, including ensemble learning and boosting, with models such as Support Vector Machine and LightGBM, to obtain the best possible result.
· Collaborated with cross-functional teams in a fast-paced, agile environment.
Supported Platforms: Windows and Linux.
Team Size: 8
Duration: Jan 2017 to Nov 2017
Endpoint Detection and Protection Response detects, protects against, and responds to cyberattacks. Securing the enterprise is complex because each point product adds its own agent to the endpoint and is often managed independently of the other security technologies present on that endpoint.
Machine Learning: This involves feature extraction and feature engineering for malware, based on static analysis of PE and PDF file types.
The metadata is extracted from malware samples. Data pre-processing and data cleaning are then performed.
Based on exploratory analysis, the model is regularly created or updated and then validated.
Machine learning techniques, including ensemble learning and boosting, are applied with models such as Support Vector Machine and LightGBM to obtain the best possible result.
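As an illustration of static feature extraction, the sketch below derives a few simple features from the raw bytes of a sample without executing it. The features shown (file size, byte entropy, header checks) and the function names are assumptions chosen for illustration; the project description does not state which metadata fields were actually extracted.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Hypothetical static features computed from file contents only."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        # PE files start with the DOS "MZ" signature.
        "has_mz_header": data[:2] == b"MZ",
        # PDF files start with "%PDF-".
        "has_pdf_header": data[:5] == b"%PDF-",
    }
```

A dictionary like this could form one row of the training table after pre-processing and cleaning, for example `static_features(open(path, "rb").read())`, where `path` points to a sample file.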
Role & Responsibilities:
· Part of the Product Architecture Team.
· Model creation, data pre-processing, and data cleaning.
· Feature selection and engineering.
· Implemented machine learning techniques, including ensemble learning and boosting, with models such as Support Vector Machine and LightGBM, to obtain the best possible result.
· Collaborated with cross-functional teams in a fast-paced, agile environment.
· Machine Learning: YOLO, Neural Network, Google Colab
· The project detects red blood cells in a blood sample. The model was trained on the following dataset: https://github.com/cosmicad/dataset
· The dataset contains blood images and annotation files for training.
· The YOLO algorithm was used for training on the dataset. YOLO is an extremely fast, real-time, multi-object detection algorithm. It looks at the image just once, dividing it into a grid of 13 by 13 cells. Each cell is responsible for predicting 5 bounding boxes, each of which describes the rectangle that encloses an object. YOLO also outputs a confidence score that tells us how certain it is that the predicted bounding box actually encloses some object. This score says nothing about what kind of object is in the box, only whether the shape of the box is any good. For each bounding box, the cell also predicts a class. This works just like a classifier: it gives a probability distribution over all the possible classes. YOLO was trained on the PASCAL VOC dataset. The confidence score for the bounding box and the class prediction are combined into one final score that gives the probability that the bounding box contains a specific type of object. Since there are 13×13 = 169 grid cells and each cell predicts 5 bounding boxes, there are 845 bounding boxes in total. Most of these boxes have very low confidence scores, so only the boxes whose final score is 30% or more are kept (this threshold can be changed depending on how accurate the detector needs to be).
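The box arithmetic and score filtering described above can be sketched in a few lines of plain Python. This is only an illustration of the post-processing step, not the network itself: the function names and the input format (a box confidence paired with a list of class probabilities) are assumptions made for the example.

```python
GRID_SIZE = 13          # the image is divided into 13 x 13 cells
BOXES_PER_CELL = 5      # each cell predicts 5 bounding boxes
SCORE_THRESHOLD = 0.30  # keep boxes whose final score is 30% or more


def total_boxes(grid_size=GRID_SIZE, boxes_per_cell=BOXES_PER_CELL):
    """Number of bounding boxes predicted for one image."""
    return grid_size * grid_size * boxes_per_cell


def final_scores(box_confidence, class_probs):
    """Combine box confidence with each class probability."""
    return [box_confidence * p for p in class_probs]


def keep_detections(predictions, threshold=SCORE_THRESHOLD):
    """Return (class_index, score) for every box that passes the threshold.

    `predictions` is a list of (box_confidence, class_probs) pairs,
    a hypothetical format used only for this sketch.
    """
    kept = []
    for box_confidence, class_probs in predictions:
        scores = final_scores(box_confidence, class_probs)
        best_class = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_class] >= threshold:
            kept.append((best_class, scores[best_class]))
    return kept
```

With the default settings, `total_boxes()` gives 13 × 13 × 5 = 845. A box with confidence 0.9 and class probabilities `[0.8, 0.2]` has a final score of about 0.72 for class 0 and is kept, while a box with confidence 0.2 falls below the 30% threshold and is dropped.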
Classification can help an organisation meet legal and regulatory requirements for retrieving specific information within a set time frame, and this is often the motivation behind implementing data classification.
Initially, text files were taken as input and read using Python. Later, PDF and docx files were read using Python and third-party libraries such as pdfminer.
After extraction, the data was passed through an NLP pipeline: tokenization, stemming and lemmatization, removal of stop words and punctuation, conversion of text to numbers, computation of term frequencies or tf-idf, and clustering. Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. All of these steps were done using NLTK.
Multiple machine learning models (Regression, Random Forest, SVM, K-Means, XGBoost) were trained, evaluated for accuracy, and tuned. Each model was evaluated with K-fold cross-validation, which iteratively trains the model on different subsets of the data. A confusion matrix was used to show the discrepancies between predicted and actual labels.
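To make the tf-idf weight concrete, here is a minimal sketch in plain Python. It uses one common textbook definition (term frequency as the count divided by document length, inverse document frequency as the natural log of the number of documents over the number of documents containing the term). This is an assumption for illustration: libraries often use smoothed variants, and the project's exact formula is not stated.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    `docs` is a list of token lists (already tokenized, stemmed, and
    stripped of stop words). Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

For example, with `docs = [["loan", "policy", "policy"], ["loan", "report"]]`, the term "loan" appears in every document and gets weight 0, while "policy" appears in only one document and gets a positive weight, which is the behaviour that makes tf-idf useful for separating documents before clustering.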
Breast cancer (BC) is one of the most common cancers, accounting for a large share of new cancer cases and cancer-related deaths according to global statistics, which makes it a significant public health problem in today's society. Early diagnosis of BC can significantly improve the prognosis and chance of survival, as it allows timely clinical treatment. Accurate classification of benign tumours can also spare patients unnecessary treatment. The diagnosis of BC and the classification of patients into malignant or benign groups is therefore an important problem. Because of its advantages in detecting critical features in complex BC data sets, machine learning (ML) is widely recognized as the methodology of choice for BC pattern classification and forecast modelling.
Exploratory Data Analysis: bar chart, histogram, and heat map for correlation. Data preparation covered outlier treatment, handling of null or blank data, and conversion of string data to numeric values using one-hot encoding.
Machine Learning: Regression, SVM, Naive Bayes, K-Means, XGBoost
The model was trained on the following dataset: http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29
Multiple machine learning models (Regression, Random Forest, SVM, K-Means, XGBoost) were trained, evaluated for accuracy, and tuned. Each model was evaluated with K-fold cross-validation, which iteratively trains the model on different subsets of the data. A confusion matrix was used to show the discrepancies between predicted and actual labels.
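The evaluation approach named above, K-fold cross-validation plus a confusion matrix, can be sketched without any library. The helper names below are hypothetical, and the split is deliberately unshuffled to keep the example short; in practice the data would normally be shuffled first, and a library implementation would typically be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix
```

For a benign/malignant task with `labels = ["B", "M"]`, the off-diagonal cells of the matrix count the misclassified cases: the top-right cell holds benign cases predicted as malignant, and the bottom-left cell holds malignant cases predicted as benign.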
The project automates the loan eligibility process in real time, based on the customer details provided in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, the task is to identify the customer segments that are eligible for a loan amount, so that these customers can be specifically targeted.
Exploratory Data Analysis: univariate, bivariate, and multivariate analysis. The analysis covers loan approval by gender, marital status, employment, credit history, dependents, education, and so on. Data preparation covered outlier treatment, handling of null or blank data, and conversion of string data to numeric values using one-hot encoding.
Multiple machine learning models (Regression, Random Forest, SVM, K-Means, XGBoost) were trained, evaluated for accuracy, and tuned. Each model was evaluated with K-fold cross-validation, which iteratively trains the model on different subsets of the data. A confusion matrix was used to show the discrepancies between predicted and actual labels.
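The conversion of string data to numeric values mentioned above can be illustrated with a small one-hot encoding sketch. The function name, the "Unknown" placeholder for blank values, and the sample rows are assumptions made for the example; the project's actual preprocessing code is not shown in this profile.

```python
def one_hot_encode(rows, column, missing="Unknown"):
    """Replace a categorical column with one 0/1 column per category.

    `rows` is a list of dicts. Null or blank values are mapped to the
    `missing` placeholder before encoding.
    """
    def value_of(row):
        value = row.get(column)
        return value if value not in (None, "") else missing

    categories = sorted({value_of(row) for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if value_of(row) == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical applicant records, for illustration only.
applicants = [
    {"Gender": "Male", "Income": 5000},
    {"Gender": "Female", "Income": 4200},
    {"Gender": "", "Income": 3100},
]
encoded_applicants = one_hot_encode(applicants, "Gender")
```

Each applicant ends up with exactly one of `Gender_Female`, `Gender_Male`, or `Gender_Unknown` set to 1, so the column can be fed to models that expect numeric input.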
