About Me
- Proficient in Python and PySpark for machine learning (Amazon Web Services - AWS, GitHub, Databricks, TensorFlow, PyTorch, scikit-learn, H2O, LIME, Keras, pandas, NLP).
- Proficient in SQL, SAS, and Python Spark with advanced statistical techniques.
- Proficient in database-driven Excel workbooks, pivot tables, dashboard design, macros, Business Intelligence, and VBA.
- Proficient in Teradata/DB2/Omniture with strong UNIX experience. ETL/OLAP experience handling datasets in excess of hundreds of millions of records. Deep experience with natural language processing, Tableau, Spark, and Hadoop.
- Excellent communication skills, presenting technical analysis to non-technical personnel in simple, understandable terms.
- Work experience with CVS Health Corporate, SunTrust Bank, Home Depot, Pfizer, and Philip Morris.
Portfolio Projects
Company
Developed predictive models for E-Trade sentiment analysis with Natural Language Processing (NLP)
Role
Data Scientist
Description
Developed different predictive models for E-Trade sentiment analysis using NLP (Natural Language Processing) deep learning with GloVe, WordNet, and Punkt, including Keras LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network), CNN + GRU (Gated Recurrent Units), Bidirectional GRU, and GloVe word embeddings. Also applied spaCy industrial-strength NLP with SMOTE for the imbalanced data to build models using H2O and LIME. Set callback functions to stop training early and to keep the best model for the application, as sketched below.
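A minimal sketch of one of the architectures described above: a Keras LSTM over pre-trained GloVe embeddings with early-stopping callbacks. The GloVe file path, vocabulary size, and sequence length are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

MAX_WORDS, MAX_LEN, EMB_DIM = 20000, 100, 100  # assumed sizes

def load_glove(path, word_index):
    """Build an embedding matrix from a GloVe vectors file (assumed path)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((MAX_WORDS, EMB_DIM))
    for word, i in word_index.items():
        if i < MAX_WORDS and word in vectors:
            matrix[i] = vectors[word]
    return matrix

def build_model(embedding_matrix):
    """LSTM sentiment classifier over frozen GloVe embeddings."""
    model = Sequential([
        Embedding(MAX_WORDS, EMB_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False),
        LSTM(64),
        Dense(1, activation="sigmoid"),  # binary sentiment output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Callbacks: stop early on stalled validation loss and keep the best weights.
callbacks = [
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ModelCheckpoint("best_model.keras", save_best_only=True),
]
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=20, batch_size=128, callbacks=callbacks)
```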
Tools
Databricks, NumPy, Spark SQL, PyCharm
Company
Developed a Random Forest classification model of priority-drug prescribing for marketing purposes
Role
Data Scientist
Description
Performed data engineering, including data validation and feature generation with Spark SQL; mounted an AWS S3 bucket through Databricks to pull datasets: three years of physician information mainly from LAAD data and ten years of patient information mainly from Optum data. Developed a Random Forest model using GridSearchCV to understand drivers of physician prescribing for a priority drug, considering commercial actions, physician and patient characteristics, and other market forces, using various internal and external data sources such as LAAD and Optum. Applied SHAP values to explain the features driving physicians' prescribing (sketched below).
Target drugs: HUMIRA, SIMPONI, TALTZ, SILIQ, ENBREL, TREMFYA, STELARA, ILUMYA, CIMZIA, COSENTYX, OTEZLA, and REMICADE
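A minimal sketch of the modeling and explanation steps described above: a scikit-learn Random Forest tuned with GridSearchCV, then SHAP values for the prescribing drivers. The synthetic feature matrix and the parameter grid are illustrative assumptions standing in for the LAAD/Optum data.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the physician/patient feature matrix built in
# Spark SQL; y = 1 when the physician prescribed the priority drug.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Tune the forest with cross-validated grid search (grid is illustrative).
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# SHAP values show which features drive the prescribing prediction.
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_test)
# Output format varies by shap version for binary classifiers; take the
# positive (prescribed) class if a per-class list is returned.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(vals, X_test)
```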
Tools
AWS, GitHub, PyCharm, Databricks
Company
Analyzed data to generate monthly sales reports for different segments with visualization
Role
Data Scientist
Description
Created dashboards, KPIs, and visualization reports, focusing especially on measuring how loyalty programs improved total sales (year over year, month over month) across different customer segments. Performed cohort analysis across years to build customer insight into the best customer groups. Ran analysis of variance (ANOVA) by building a linear regression to measure how multiple factors in the historical transactional data impacted total margin (a sketch follows).
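A minimal sketch of the ANOVA-via-regression step described above, using statsmodels on a small illustrative transactional dataframe; the column names (segment, channel, margin) are assumptions about the schema, not the actual one.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Illustrative stand-in: one row per transaction, with the factors under
# study (customer segment, sales channel) and the resulting margin.
df = pd.DataFrame({
    "segment": ["loyal", "new", "loyal", "lapsed", "new", "loyal",
                "lapsed", "new"],
    "channel": ["web", "store", "store", "web", "web", "store",
                "store", "web"],
    "margin":  [120.0, 80.0, 150.0, 40.0, 95.0, 160.0, 55.0, 90.0],
})

# Fit a linear regression with categorical factors, then read the ANOVA
# table to see which factors significantly move total margin.
model = smf.ols("margin ~ C(segment) + C(channel)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```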
Skills
Data Science, NumPy, Matplotlib, Pandas, SQL
Tools
NumPy, Oracle Database
Company
Built a mortgage refinance cash-out model with logistic regression for marketing
Role
Data Scientist
Description
Collected customer data from the company website across different channels such as email, direct mail, social media, paid_agents, and other_ad. Applied fuzzy-logic matching on customers' name, address, phone number, or email against existing databases to find five years of customer history, such as checking or savings accounts, account balances, account transactions, retirement accounts, income, liquid assets, and any other loan information. Created and validated datasets with SQL, setting the target to one when customers cashed out (took extra money) when refinancing their mortgage within five years, and to zero when they did not. Created customer features based on business logic. Extracted a random 60% of the data to build a logistic regression model that scores every customer's cash-out probability, holding out 20% for validation and the remaining 20% for testing. Ranked customers by score, created decile groups, and estimated KS values to evaluate model performance; the KS values from the three datasets (modeling, validation, and testing) should be quite close and all above 35% before the model is applied to the mortgage campaign. Customers in the top three decile groups by score who have not cashed out are listed for marketing and campaign purposes.
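A minimal sketch of the scoring and KS-evaluation workflow described above. The 60/20/20 split and the 35% KS threshold follow the description; the synthetic data, feature set, and helper function are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ks_statistic(y_true, scores):
    """Max gap between the cumulative cash-out and no-cash-out distributions."""
    order = np.argsort(-scores)
    y = np.asarray(y_true)[order]
    cum_pos = np.cumsum(y) / y.sum()
    cum_neg = np.cumsum(1 - y) / (len(y) - y.sum())
    return np.max(np.abs(cum_pos - cum_neg))

# Synthetic stand-in for the matched 5-year customer history features;
# y = 1 when the customer cashed out while refinancing, else 0.
X, y = make_classification(n_samples=10000, n_features=15,
                           weights=[0.8], random_state=42)

# 60% modeling, 20% validation, 20% testing, as in the description.
X_model, X_rest, y_model, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_model, y_model)
for name, Xs, ys in [("modeling", X_model, y_model),
                     ("validation", X_valid, y_valid),
                     ("testing", X_test, y_test)]:
    ks = ks_statistic(ys, clf.predict_proba(Xs)[:, 1])
    print(f"{name}: KS = {ks:.1%}")  # all three should be close and above 35%

# Decile groups: top 3 deciles by score who did not cash out -> campaign list.
scores = clf.predict_proba(X)[:, 1]
deciles = pd.qcut(scores, 10, labels=False)  # 0 = lowest, 9 = highest decile
campaign_list = np.where((deciles >= 7) & (y == 0))[0]
```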