Business Problem

UKG Ready primarily operates in the Small and Medium Business (SMB) space, so many customers are forced to operate and make key business decisions with less Workforce Management (WFM) / Human Capital Management (HCM) data. In addition to volume, SMB customers lack the variety of data needed to create a dynamic and agile organization. This puts SMB at a major disadvantage compared to larger segments.

Project Goals

The People Insights module is committed to surfacing insights to customers in the context of their day-to-day duties and aiding their decision making. Given the SMB data limitations mentioned above, the goal of this project was to create a global dataset that augments individual customer data to bring less obvious, yet important, information to light.

Challenges

UKG Ready is a highly configurable application that gives customers the opportunity to build solutions on a platform that meets their specific business needs. High configurability gives customers great flexibility in their usage of the software, but it makes it nearly impossible to create a global dataset for machine learning and data insights. UKG Ready manages just under 4 million US workers across some 30,000+ customers. Despite the large overall employee dataset, machine learning models that are specific to a customer are starved for data because each individual customer has a relatively small employee population. Does that mean we cannot support our SMB customers' decision making with ML?

Result

Partnering with Google, we developed an approach that allowed us to standardize various domain entities (pay categories, time-off codes, job titles, etc.) so that we could build a global dataset to augment SMB customer data. Using machine learning, we built a common vocabulary across our customer base. This common vocabulary encapsulates the nuances of how our customers manage their business, yet is generalized and standardized so that the data can be aggregated over the variety of customer configurations. This allows us to serve practical insights to customers through various use cases. Our partnership allowed us to leverage Google Cloud services to meet the needs of our complex machine learning models, distributed datasets, and CI/CD processes.

How

UKG Ready decided to partner with Google for an end-to-end solution for the analytics offering. This allowed us to focus on our core business logic without having to worry about the platform, environment configurations, or the performance and scalability of the entire solution. We make use of various Google Cloud services such as Cloud Triggers, Cloud Storage, Cloud Functions, Cloud Composer, Cloud Dataflow, BigQuery, Vertex AI, and Cloud Pub/Sub to host our analytics solution. Jenkins manages the CI/CD pipelines, and cloud environments are configured and deployed using Terraform.

The standardization of business entities problem was solved in three distinct steps.

Step 1: Collecting aggregated data

We needed an approach to collect aggregated data from our highly distributed, sharded, multi-tenant data sources. We developed a custom solution that extracts data aggregated at the source (for PII and GDPR considerations) and transfers it to Google Cloud Storage in the fastest manner possible. The data is then transformed and stored in BigQuery. Services used: Cloud Storage (GCS), Cloud Functions, Dataflow, Cloud Composer, and BigQuery. All processes are orchestrated with Cloud Composer, and detailed logging is available in Cloud Logging (formerly Stackdriver). A minimal orchestration sketch follows below.
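To make the orchestration concrete, here is a minimal Cloud Composer (Airflow) sketch of this kind of pipeline. It is an illustration only: the DAG id, bucket, Dataflow template path, and BigQuery table are hypothetical placeholders, and the real pipeline is a custom solution with more stages.

```python
# Minimal Cloud Composer (Airflow) sketch of the Step 1 flow: transform raw
# aggregated extracts landed in GCS with Dataflow, then load them into
# BigQuery. All resource names below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="entity_aggregation_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a (hypothetical) Dataflow template that cleans and reshapes
    # the raw extracts sitting in Cloud Storage.
    transform = DataflowTemplatedJobStartOperator(
        task_id="transform_extracts",
        job_name="transform-entity-extracts",
        template="gs://example-templates/transform_entities",
        location="us-central1",
    )

    # Load the transformed files into BigQuery for downstream modeling.
    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="example-extract-bucket",
        source_objects=["transformed/*.avro"],
        source_format="AVRO",
        destination_project_dataset_table="analytics.entity_aggregates",
        write_disposition="WRITE_TRUNCATE",
    )

    transform >> load
```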
Step 2: Applying NLP (Natural Language Processing)

Once we had the variety of customer configurations, i.e. the business entities, available, we applied NLP algorithms to categorize and standardize them into buckets. This approach assumes that customers use natural language for configurations like job titles, pay codes, etc.

String Preparation

The input to the string preparation process is an entity string, or several strings that describe one entity object (such as a name-description pair or a code-name pair). The output is a set of tokens that can be used to run a classification/clustering model. String preparation tokenizes the strings, replaces shortcuts, expands abbreviations, translates tokens, and handles grammatical errors and mistypings.

ML Models

Statistical Model

The idea of this model is to define target classes (clusters) and assign several tokens (anchors) to each of them; an entity containing any of those tokens is "attracted" to the corresponding class. All other tokens are weighted according to how frequently they occur in entities that contain anchor tokens. Using the anchor tokens, we build a kind of Word2Vec in which the dimensionality of each token's vector equals the number of target classes: the higher the value in a specific dimension (cluster), the higher the probability that the entity belongs to that cluster. The final prediction score of an entity's token list for a specific class is the sum of the weights of all its tokens, and the predicted cluster is the one with the maximal prediction score.

Lexical Model

We managed to generate a reasonable amount of labeled data during statistical model implementation and testing. That opened up the possibility of building a "classical" NLP model that uses the labeled data to train a classification neural network, with pretrained layers producing token embeddings or even string embeddings. We started experimenting with pre-trained models like GloVe and got good results with single words and bi-grams, but ran into issues handling longer n-grams. Our Google account team came to our rescue and recommended some white papers that helped formulate our strategy. We now use the TensorFlow nnlm-en-dim128 model to produce string embeddings; it was trained on the English Google News 200B corpus and produces a 128-dimensional vector for each input string. On top of that, we use several Dense and Dropout layers to build a classification model.

Ensembling

To perform ensembling, all model scores for each class are cast to probabilities using a softmax transformation with scale normalization. The final predicted probability is the maximal per-class average of the two models' scores, and that class is the predicted class.

The machine learning models are deployed on Vertex AI and used for batch predictions. Model performance is captured at every prediction boundary and monitored for quality in production. Minimal sketches of each of these stages follow below.
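First, string preparation. The sketch below shows the kind of normalization described above; the tiny lookup tables are illustrative stand-ins for the real shortcut, abbreviation, and spelling handling, and the translation step is omitted.

```python
# Minimal sketch of the string-preparation stage: tokenize an entity string,
# expand known shortcuts/abbreviations, and normalize obvious misspellings.
# The lookup tables here are tiny illustrative stand-ins for the real ones.
import re

ABBREVIATIONS = {"sr": "senior", "jr": "junior", "mgr": "manager", "pto": "paid time off"}
SPELL_FIXES = {"managr": "manager", "anual": "annual"}

def prepare(entity_strings):
    """Turn one entity (e.g. a code-name pair) into a set of normalized tokens."""
    tokens = set()
    for s in entity_strings:
        for tok in re.split(r"[^a-z0-9]+", s.lower()):
            if not tok:
                continue
            tok = ABBREVIATIONS.get(tok, tok)  # expand shortcuts/abbreviations
            tok = SPELL_FIXES.get(tok, tok)    # handle simple mistypings
            tokens.update(tok.split())         # multi-word expansions add several tokens
    return tokens

print(prepare(["SR-MANAGR", "Senior Manager, Operations"]))
# -> {'senior', 'manager', 'operations'}
```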
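Second, the statistical anchor model. This sketch shows the core idea with made-up classes and data: tokens are weighted by how often they appear alongside anchor tokens, and an entity's class score is the sum of its token weights. The production weighting scheme is more sophisticated than this.

```python
# Minimal sketch of the statistical anchor model: each target class has a few
# anchor tokens; other tokens get weights from co-occurrence with anchors.
# An entity's score per class is the sum of its token weights; the predicted
# class is the argmax. All classes and data below are illustrative.
from collections import defaultdict

ANCHORS = {
    "nurse": {"nurse", "rn"},
    "driver": {"driver", "cdl"},
}

def token_weights(entities):
    """entities: list of token sets. Weight each token per class by how often
    it occurs in entities that contain that class's anchor tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in entities:
        for cls, anchors in ANCHORS.items():
            if tokens & anchors:            # entity is "attracted" to this class
                for tok in tokens:
                    counts[tok][cls] += 1.0
    return counts

def predict(tokens, weights):
    scores = {cls: sum(weights[t][cls] for t in tokens) for cls in ANCHORS}
    return max(scores, key=scores.get), scores

corpus = [
    {"registered", "nurse"},
    {"rn", "night", "shift"},
    {"delivery", "driver"},
    {"cdl", "truck", "driver"},
]
weights = token_weights(corpus)
print(predict({"night", "shift", "registered"}, weights))
# -> ('nurse', {'nurse': 3.0, 'driver': 0.0})
```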
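Third, the lexical model. The sketch below uses the real nnlm-en-dim128 module from TensorFlow Hub to embed each entity string, followed by Dense and Dropout layers as described; the number of classes and the layer sizes are illustrative choices, not our production values.

```python
# Minimal sketch of the lexical model: the pre-trained nnlm-en-dim128 module
# maps each raw input string to a 128-dimensional embedding, and Dense/Dropout
# layers on top perform the classification. Layer sizes are illustrative.
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 50  # illustrative number of target vocabulary classes

embed = hub.KerasLayer(
    "https://tfhub.dev/google/nnlm-en-dim128/2",
    input_shape=[],        # each example is a single raw string
    dtype=tf.string,
    trainable=False,
)

model = tf.keras.Sequential([
    embed,                                        # string -> 128-d embedding
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```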
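Fourth, ensembling. This sketch shows one way to realize the description above: scale-normalize each model's class scores, cast them to probabilities with softmax, average per class, and predict the class with the maximal averaged probability. The scores are made up, and the exact normalization used in production may differ.

```python
# Minimal sketch of the ensembling step: softmax with scale normalization,
# then a per-class average of the two models' probabilities; the predicted
# class is the one with the maximal averaged probability.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

def ensemble(statistical_scores, lexical_scores):
    probs = []
    for scores in (statistical_scores, lexical_scores):
        scores = np.asarray(scores, dtype=float)
        scale = np.abs(scores).max() or 1.0  # scale normalization
        probs.append(softmax(scores / scale))
    avg = np.mean(probs, axis=0)             # average the two distributions
    return int(np.argmax(avg)), float(avg.max())

# Illustrative scores for 3 classes from each model:
print(ensemble([3.0, 9.0, 1.0], [0.2, 0.7, 0.1]))
```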
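Finally, batch prediction on Vertex AI. This is a minimal sketch using the google-cloud-aiplatform SDK; the project, region, model resource name, and GCS paths are placeholders.

```python
# Minimal sketch of a Vertex AI batch prediction job against a registered
# model. All resource names and paths below are illustrative placeholders.
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="us-central1")

model = aiplatform.Model(
    "projects/example-project/locations/us-central1/models/1234567890"
)

job = model.batch_predict(
    job_display_name="entity-standardization-batch",
    gcs_source="gs://example-bucket/batch_inputs/entities.jsonl",
    gcs_destination_prefix="gs://example-bucket/batch_outputs/",
    machine_type="n1-standard-4",
    sync=True,  # block until the job completes so results can be inspected
)
print(job.state)
```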
Step 3: Making the common vocabulary available

Having the standardized vocabulary, we then needed a mechanism to make the results available in UKG Ready reports and in customer-specific models like Flight Risk and Fatigue. For this we again used Google Cloud services for orchestration, data transformation, and data storage, utilizing our proven existing technology choices in GCP. Once the modeling is complete, the customer-specific models built on the architecture above are made available in Reports, as sketched below.
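As an illustration of the serving side, a standardized vocabulary table in BigQuery can be joined to raw customer entities so that reports read the common labels. This is a sketch under assumed names: the dataset, table, and column names are hypothetical, not our actual schema.

```python
# Minimal sketch of the serving step: join raw customer entity names to the
# standardized vocabulary in BigQuery so reports can read the common labels.
# The dataset/table/column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
SELECT e.customer_id,
       e.entity_code,
       v.standard_label          -- the common-vocabulary class for this entity
FROM `analytics.customer_entities` AS e
JOIN `analytics.standard_vocabulary` AS v
  ON e.entity_id = v.entity_id
"""

for row in client.query(query).result():
    print(row.customer_id, row.entity_code, row.standard_label)
```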
Results

We were able to build a common vocabulary of our customers' business entities with good confidence, and to act as an expert advisor to our SMB customers in their decision making using machine learning. With the advice of our Google account team and the use of Google Cloud services, we can add value to our product in a relatively short amount of time. And we are not done! We continue to use this platform for new use cases, complex business problems, and innovative machine learning solutions.

Sample result: [figure omitted]

Special thanks to Kanchana Patlolla, AI Specialist, Google, for the collaboration in bringing this to light.

Source: Google Cloud Platform