About Me
- Proficient in Python and PySpark for machine learning (Amazon Web Services - AWS, GitHub, Databricks, TensorFlow, PyTorch, scikit-learn, H2O, LIME, Keras, pandas, NLP).
- Proficient in SQL, SAS, and Python Spark with advanced statistical techniques.
- Proficient in database-driven Excel workbooks, pivot tables, dashboard design, macros, Business Intelligence, and VBA.
- Proficient in Teradata/DB2/Omniture with strong UNIX experience. ETL/OLAP experience handling datasets in excess of hundreds of millions of records. Deep experience with natural language processing, Tableau, Spark, and Hadoop.
- Excellent communication skills, presenting technical analysis to non-technical personnel in simple, understandable terms.
- Work experience with CVS Health Corporate, SunTrust Bank, Home Depot, Pfizer, and Philip Morris.
Portfolio Projects
Company
Developed predictive models for E-Trade sentiment analysis with Natural Language Processing (NLP)
Role
Data Scientist
Description
Developed different predictive models for E-Trade sentiment analysis using NLP (Natural Language Processing) deep learning with GloVe, WordNet, and Punkt, including Keras LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network), CNN + GRU (Gated Recurrent Units), Bidirectional GRU, and GloVe word embeddings. Also applied spaCy industrial-strength NLP with SMOTE for the imbalanced data to build models using H2O and LIME. Set callback functions to stop training early and to keep the best model for the application, as sketched below.
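A minimal sketch of one of the architectures described above: a Keras LSTM over pre-trained GloVe embeddings with early-stopping callbacks. The GloVe file path, vocabulary size, and sequence length are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

MAX_WORDS, MAX_LEN, EMB_DIM = 20000, 100, 100  # assumed sizes

def load_glove(path, word_index):
    """Build an embedding matrix from a GloVe vectors file (assumed path)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((MAX_WORDS, EMB_DIM))
    for word, i in word_index.items():
        if i < MAX_WORDS and word in vectors:
            matrix[i] = vectors[word]
    return matrix

def build_model(embedding_matrix):
    """LSTM sentiment classifier over frozen GloVe embeddings."""
    model = Sequential([
        Embedding(MAX_WORDS, EMB_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False),
        LSTM(64),
        Dense(1, activation="sigmoid"),  # binary sentiment output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Callbacks: stop early on stalled validation loss and keep the best weights.
callbacks = [
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ModelCheckpoint("best_model.keras", save_best_only=True),
]
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=20, batch_size=128, callbacks=callbacks)
```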
Tools
Databricks, NumPy, Spark SQL, PyCharm
Company
Developed a Random Forest classification model of priority-drug prescribing for marketing purposes
Role
Data Scientist
Description
Performed data engineering, including data validation and feature generation with Spark SQL; mounted an AWS S3 bucket through Databricks to pull datasets: three years of physician information mainly from LAAD data and ten years of patient information mainly from Optum data. Developed a Random Forest model using GridSearchCV to understand drivers of physician prescribing for a priority drug, considering commercial actions, physician and patient characteristics, and other market forces, using various internal and external data sources such as LAAD and Optum. Applied SHAP values to explain the features driving physicians' prescribing (sketched below).
Target drugs: HUMIRA, SIMPONI, TALTZ, SILIQ, ENBREL, TREMFYA, STELARA, ILUMYA, CIMZIA, COSENTYX, OTEZLA, and REMICADE
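A minimal sketch of the modeling and explanation steps described above: a scikit-learn Random Forest tuned with GridSearchCV, then SHAP values for the prescribing drivers. The synthetic feature matrix and the parameter grid are illustrative assumptions standing in for the LAAD/Optum data.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the physician/patient feature matrix built in
# Spark SQL; y = 1 when the physician prescribed the priority drug.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Tune the forest with cross-validated grid search (grid is illustrative).
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# SHAP values show which features drive the prescribing prediction.
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_test)
# Output format varies by shap version for binary classifiers; take the
# positive (prescribed) class if a per-class list is returned.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(vals, X_test)
```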
Tools
AWS, GitHub, PyCharm, Databricks
Company
Analyzed data to generate monthly sales reports for different segments with visualization
Role
Data Scientist
Description
Created dashboards, KPIs, and visualization reports, focusing especially on measuring how loyalty programs improved total sales (year over year, month over month) across different customer segments. Performed cohort analysis across years to build customer insight into the best customer groups. Ran analysis of variance (ANOVA) by building a linear regression to measure how multiple factors in the historical transactional data impacted total margin (a sketch follows).
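A minimal sketch of the ANOVA-via-regression step described above, using statsmodels on a small illustrative transactional dataframe; the column names (segment, channel, margin) are assumptions about the schema, not the actual one.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Illustrative stand-in: one row per transaction, with the factors under
# study (customer segment, sales channel) and the resulting margin.
df = pd.DataFrame({
    "segment": ["loyal", "new", "loyal", "lapsed", "new", "loyal",
                "lapsed", "new"],
    "channel": ["web", "store", "store", "web", "web", "store",
                "store", "web"],
    "margin":  [120.0, 80.0, 150.0, 40.0, 95.0, 160.0, 55.0, 90.0],
})

# Fit a linear regression with categorical factors, then read the ANOVA
# table to see which factors significantly move total margin.
model = smf.ols("margin ~ C(segment) + C(channel)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```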
Skills
Data Science, NumPy, Matplotlib, Pandas, SQL
Tools
NumPy, Oracle Database
Company
Built a mortgage refinance cash-out model with logistic regression for marketing
Role
Data Scientist
Description
Collected customer data from the company website across different channels such as email, direct mail, social media, paid_agents, and other_ad. Applied fuzzy-logic matching on customers' name, address, phone number, or email against existing databases to find five years of customer history, such as checking or savings accounts, account balances, account transactions, retirement accounts, income, liquid assets, and any other loan information. Created and validated datasets with SQL, setting the target to one when customers cashed out (took extra money) when refinancing their mortgage within five years, and to zero when they did not. Created customer features based on business logic. Extracted a random 60% of the data to build a logistic regression model that scores every customer's cash-out probability, holding out 20% for validation and the remaining 20% for testing. Ranked customers by score, created decile groups, and estimated KS values to evaluate model performance; the KS values from the three datasets (modeling, validation, and testing) should be quite close and all above 35% before the model is applied to the mortgage campaign. Customers in the top three decile groups by score who have not cashed out are listed for marketing and campaign purposes.
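A minimal sketch of the scoring and KS-evaluation workflow described above. The 60/20/20 split and the 35% KS threshold follow the description; the synthetic data, feature set, and helper function are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ks_statistic(y_true, scores):
    """Max gap between the cumulative cash-out and no-cash-out distributions."""
    order = np.argsort(-scores)
    y = np.asarray(y_true)[order]
    cum_pos = np.cumsum(y) / y.sum()
    cum_neg = np.cumsum(1 - y) / (len(y) - y.sum())
    return np.max(np.abs(cum_pos - cum_neg))

# Synthetic stand-in for the matched 5-year customer history features;
# y = 1 when the customer cashed out while refinancing, else 0.
X, y = make_classification(n_samples=10000, n_features=15,
                           weights=[0.8], random_state=42)

# 60% modeling, 20% validation, 20% testing, as in the description.
X_model, X_rest, y_model, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_model, y_model)
for name, Xs, ys in [("modeling", X_model, y_model),
                     ("validation", X_valid, y_valid),
                     ("testing", X_test, y_test)]:
    ks = ks_statistic(ys, clf.predict_proba(Xs)[:, 1])
    print(f"{name}: KS = {ks:.1%}")  # all three should be close and above 35%

# Decile groups: top 3 deciles by score who did not cash out -> campaign list.
scores = clf.predict_proba(X)[:, 1]
deciles = pd.qcut(scores, 10, labels=False)  # 0 = lowest, 9 = highest decile
campaign_list = np.where((deciles >= 7) & (y == 0))[0]
```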