About Me
20 years of experience including data ETL, business intelligence and reporting, data analytics, big data and web development, and predictive analytics, with practical experience in risk management solutions, finance and investment banking, billing systems, comme...
Skills
Portfolio Projects
Description
When the transportation analytics and operations research (TAOR) department acquired new physical machines for its SQL Servers and upgraded to a new SQL Server edition, database migrations and new data synchronizations were required. I performed the database migrations and resolved migration issues caused by database version differences, cross-domain network security restrictions, data mismatches, performance degradation due to virtual machine configuration, and restricted migration time windows.
Description
Whenever slow performance is found in SQL Server databases, I perform performance tuning, which includes creating suitable indexes to speed up queries, modifying stored procedures to restructure queries, and converting stored procedures into SSIS packages to implement parallel execution.
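To illustrate the index-creation step, here is a minimal Python/pyodbc sketch; the connection string and the table and column names (dbo.PackageVolume, HubId, ScanDate, Volume) are hypothetical placeholders, not the actual production schema.

import pyodbc

# Hypothetical connection string; replace with the real server and database.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=taor-sql01;DATABASE=Forecast;Trusted_Connection=yes;"
)

# A nonclustered covering index so that frequent "volume by hub and date"
# queries can be answered from the index alone instead of scanning the table.
CREATE_INDEX_SQL = """
IF NOT EXISTS (
    SELECT 1 FROM sys.indexes
    WHERE name = 'IX_PackageVolume_Hub_Date'
      AND object_id = OBJECT_ID('dbo.PackageVolume')
)
CREATE NONCLUSTERED INDEX IX_PackageVolume_Hub_Date
    ON dbo.PackageVolume (HubId, ScanDate)
    INCLUDE (Volume);
"""

with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    conn.execute(CREATE_INDEX_SQL)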
Description
The TAOR department maintains an algorithm pool for demand-forecast time series, and we forecast package volume weekly for each of more than 4,000 hubs. Some of the forecasts have low accuracy due to low-quality or non-stationary data. Based on machine learning techniques and time series algorithms, I developed two new algorithms/models to improve forecast accuracy for these bad-data cases. Testing showed the two algorithms produced better results than off-the-shelf algorithms for some cases.
Description
The target was to produce daily forecasts of small-package volume for different service types for the next year for each hub (more than 4,000 hubs in total). Applying time series algorithms directly to this task is not feasible because of the huge amount of computation and very poor accuracy. The strategy and algorithms were therefore developed in two stages: time series algorithms forecast the monthly volume for each hub, and then an allocation algorithm I developed, based on historical data, derives the allocation percentage for each hub for each future day. The new algorithm allocates monthly volume to daily volume, improving accuracy and drastically reducing the computation cost.
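A simplified pandas sketch of the allocation idea follows; the DataFrame layout (columns hub, date, volume, month, forecast_volume) is assumed for illustration only. A real implementation would also need to handle months of different lengths and other calendar effects; the sketch only shows the core allocate-by-historical-share step.

import pandas as pd

def allocate_monthly_to_daily(history: pd.DataFrame, monthly_forecast: pd.DataFrame) -> pd.DataFrame:
    # history: columns [hub, date, volume]; monthly_forecast: columns [hub, month, forecast_volume].
    hist = history.copy()
    hist["month"] = hist["date"].dt.to_period("M")
    # Share of each day within its month, learned from history per hub.
    monthly_totals = hist.groupby(["hub", "month"])["volume"].transform("sum")
    hist["day_share"] = hist["volume"] / monthly_totals
    # Average the daily share by hub and day-of-month across historical months.
    hist["dom"] = hist["date"].dt.day
    shares = hist.groupby(["hub", "dom"])["day_share"].mean().reset_index()
    # Re-normalize so each hub's shares sum to 1, then spread the monthly forecast.
    shares["day_share"] /= shares.groupby("hub")["day_share"].transform("sum")
    daily = monthly_forecast.merge(shares, on="hub")
    daily["daily_forecast"] = daily["forecast_volume"] * daily["day_share"]
    return daily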
Description
This project dealt with medical information related to doctors, nurses, physician assistants, hospitals, medical offices, drug stores, and medical licenses and policies. I worked on an HPCC (High Performance Computing Cluster) system with 400 nodes to build a big data ETL solution that filters, cleanses, compares, and transforms data into the specific formats required by different clients. Delivered solutions for 12 different clients (CVS, Walgreens, and many hospitals). The language used was ECL (Enterprise Control Language), developed by LexisNexis; the HPCC platform itself is comparable to Hadoop.
Description
Collecting data from hundreds of data sources and using 400 servers for MapReduce computation, an unsupervised clustering model was established based on feature-matching algorithms over company names, addresses, contact information, business owners, and other published public information. The model groups data records into clusters so that each cluster represents a business identity in the USA. This was a big data team project, and I was one of the major contributors. The clustered data serves as the base for credit score evaluation, fraud detection, auto insurance modeling, and many other fields.
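The actual matching logic is far more elaborate; as a toy stand-in, a minimal Python sketch that groups records by a normalized (name, address) key could look like the following, where the field names are hypothetical.

import re
from collections import defaultdict

def normalize(text: str) -> str:
    # Crude canonical form: lowercase, strip punctuation and common suffixes.
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    text = re.sub(r"\b(inc|llc|corp|co|company)\b", "", text)
    return " ".join(text.split())

def cluster_records(records):
    # records: iterable of dicts with hypothetical 'name' and 'address' fields.
    # Records sharing a normalized (name, address) key are treated as the same
    # business identity.
    clusters = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        clusters[key].append(rec)
    return list(clusters.values())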
Description
Completed three database migrations using T-SQL stored procedures and SSIS packages to centralize and normalize customer data, substantially reducing data-maintenance costs. The major challenges were: the source databases had different table schemas than the target database, so I had to ensure records remained logical and consistent after migration; working with business analysts to clean many conflicting records; and completing the migration process within limited time windows so that normal business was not interrupted, which required high migration performance.
Description
Designed database tables for very large data volumes using table partitioning, index partitioning, careful selection of column data types, careful index design, and aggregate tables. In the implementation phase, stored procedures were carefully developed and performance tuned. The sliding-window technique was applied to phase data in and out of partitioned tables.
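As an illustration of the sliding-window step, a minimal Python/pyodbc sketch is shown below; the partition function, scheme, table names, and boundary dates (pfMonthly, psMonthly, dbo.FactVolume, dbo.FactVolume_Stage) are hypothetical.

import pyodbc

# Hypothetical object names: dbo.FactVolume is partitioned by month via
# partition function pfMonthly and scheme psMonthly; dbo.FactVolume_Stage is an
# empty table with an identical schema used to age data out.
SLIDING_WINDOW_STEPS = [
    # Phase out: move the oldest month out of the fact table, then drop its boundary.
    "ALTER TABLE dbo.FactVolume SWITCH PARTITION 1 TO dbo.FactVolume_Stage",
    "TRUNCATE TABLE dbo.FactVolume_Stage",
    "ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('20200101')",
    # Phase in: prepare an empty partition for the next incoming month.
    "ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY]",
    "ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE ('20210201')",
]

with pyodbc.connect("DSN=Warehouse", autocommit=True) as conn:
    for step in SLIDING_WINDOW_STEPS:
        conn.execute(step)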
Description
Designed and led a team implementing the EmailSniper system, which works with the Port25 email software. EmailSniper is used to send campaign and transactional emails, track the status of sent emails, and generate statistics and status reports. The system handles up to 100,000 emails per hour without noticeable impact on database performance.
Description
FinCast (expense/budget/revenue forecast strategy) projects. The application was developed with the OWC web component so that the web page looks just like an Excel worksheet and also supports Excel formula functionality. My responsibilities included design, setting coding standards, code review, deciding technical solutions, and providing technical support for the offshore team.
Description
Developed an innovative organization-chart-style fund allocation application that uses drag-and-drop to set up fund structures and computes the allocation of fund returns to investors. Compared to the old Excel application, it delivered a severalfold improvement in performance and ease of use. The main technology used was Microsoft Vector Markup Language (VML).
Description
Designed and developed most of the department's SSIS (ETL) packages, and deployed and scheduled them as automated jobs. The data sources include SQL Server databases, Oracle databases, CSV files, Microsoft Access, and JSON data from REST API calls. These SSIS packages are used for database migrations, daily or weekly data imports and synchronization, and many machine learning processes including classification, clustering, and time series forecasting. Forecast detail logs and forecast data are also collected by these packages for accuracy analyses and reports.
Description
Using both Python and R, I developed two forecast algorithms for volume-demand time series. The first is based on a weekday mean-value model that treats holidays as an indicator variable and handles peak and non-peak seasons separately, in a linear regression style. This model improves some time series forecasts that previously had low accuracy. The second algorithm is based on classical STL decomposition but with a linear approximation of the trend, outlier detection and removal, and holiday-volume compensation for the removed outliers. Some tests show this algorithm performs best, compared with common time series algorithms and some machine learning algorithms, for long-horizon daily-level forecasts. Both algorithms are candidates for future production release.
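A minimal Python sketch of the first (weekday mean-value) model is shown below; the column names (date, volume, is_holiday, is_peak) are assumptions for illustration, not the production schema.

import pandas as pd

def fit_weekday_mean_model(history: pd.DataFrame) -> pd.DataFrame:
    # history: columns [date, volume, is_holiday, is_peak]. The fitted value is
    # the mean volume for each (weekday, holiday, peak-season) cell, which is
    # equivalent to a regression on those indicator variables.
    hist = history.copy()
    hist["weekday"] = hist["date"].dt.dayofweek
    return (
        hist.groupby(["weekday", "is_holiday", "is_peak"])["volume"]
        .mean()
        .rename("fitted_volume")
        .reset_index()
    )

def forecast(model: pd.DataFrame, future_days: pd.DataFrame) -> pd.DataFrame:
    # future_days: columns [date, is_holiday, is_peak] for the horizon to predict.
    days = future_days.copy()
    days["weekday"] = days["date"].dt.dayofweek
    return days.merge(model, on=["weekday", "is_holiday", "is_peak"], how="left")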
Description
Truck-load data for different time periods across day and night shifts is imported from an Access database and transformed into a pivot table. A neighbor-smoothing algorithm was then developed to smooth load spikes by distributing large loads in some time periods to their neighboring periods. Applying the smoothing to historical data reveals the load patterns, so that truck loads for future time periods can be forecast and planned.
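A minimal sketch of the neighbor-smoothing idea follows; the hourly buckets, the fixed threshold, and the even split of excess load to neighbors are simplifying assumptions, not the exact production rule.

import numpy as np

def smooth_spikes(loads: np.ndarray, threshold: float, spread: int = 1) -> np.ndarray:
    # loads: truck load per time period. Any load above `threshold` is capped and
    # the excess is distributed evenly to up to `spread` neighbors on each side.
    smoothed = loads.astype(float).copy()
    n = len(smoothed)
    for i in range(n):
        excess = smoothed[i] - threshold
        if excess <= 0:
            continue
        neighbors = [j for j in range(i - spread, i + spread + 1) if j != i and 0 <= j < n]
        smoothed[i] = threshold
        for j in neighbors:
            smoothed[j] += excess / len(neighbors)
    return smoothed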
Description
Used a powerful machine learning tool, DataRobot, to generate multiple machine learning models for classification and time series tasks. The process includes generating new features, computing feature-target correlations, feature efficiency tests, feature selection, model validation, model ranking, and error estimation. Many neural-network-derived features can be generated and are very powerful; combined with the raw input features, they produce high-accuracy models. Selecting the high-accuracy models and adding them to the existing production algorithm pool will greatly improve forecasting capability, so that cases with low-quality data become forecastable and low-accuracy cases become high-accuracy.
Description
The lead sheet consists of potential customer candidates among manufacturing buyers and sellers. The target is to find the candidates most likely to become customers and then use phone calls and emails to sell the company's product. Statistical analysis was applied to the candidates' company size, history, industry sector, geographic location, and other features to find patterns, and a decision tree algorithm is used to classify the candidates.
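A minimal scikit-learn sketch of the classification step is shown below; the file name lead_sheet.csv, the feature columns, and the 'converted' target label are hypothetical placeholders.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lead-sheet layout; 'converted' marks candidates that became customers.
leads = pd.read_csv("lead_sheet.csv")
features = pd.get_dummies(
    leads[["company_size", "years_in_business", "industry_sector", "region"]]
)
target = leads["converted"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
print("Hold-out accuracy:", clf.score(X_test, y_test))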