How to provide better search results with AI ranking

Every IT team wants to get the right information to employees and vendors as quickly as possible. Yet the task keeps getting harder as more information becomes available and results go stale. Disparate internal systems hold vital information, search capabilities are not consistent across tools, and no universal system exists. Even inside Google we can't simply reuse our web search technology, because it assumes a fully public dataset, a lot of traffic, and more active content owners. Internal search is hard to get right because each person has individual goals, access levels and needs. All too often, this Sisyphean task ends up requiring huge amounts of manual labor, or leads to inferior results and frustrated people. At Google we transitioned our internal search to rank results using machine learning models. We found this helps surface the most relevant resources to employees, even when needs change rapidly and new information becomes available.

Sudden change
Our internal search site, Moma, is Googlers' primary way to source information. It covers a large number of data sources, from internal sites to engineering documentation to the files our employees collaborate on. More than 130,000 users issue queries each week, both to get their job done and to learn the latest about what's going on at Google. With COVID-19 and working from home changing so much so rapidly, lots of new content and guidance for Googlers was created quickly and needed to be easily accessible and discoverable by all employees. But how do you make sure it gets shown?

Manual tweaking
Before adopting ML for search ranking, we used to tweak ranking formulas with literally hundreds of individual weights and factors for different data sources and signals. Adding new corpora of information and teaching the search engine new terminology was always possible, but laborious in practice. Synonyms, for example, relied on separate datasets that needed manual updating, for instance to make sure that searches for "Covid19", "Covid", and "Coronavirus" all return the relevant pages. The human effort involved in carefully crafting changes, validating them and deploying them often meant that new content for new topics was slow to rank highly. Even then, search results could be hit-or-miss depending on how users formulated their queries, as writers often wouldn't know exactly which keywords to use in their content, especially in situations where trends emerge quickly and the terminology evolves in real time.

Automated scoring
We now use ML to score and rank results based on many signals, and our model learns quickly because we continuously train on our own usage logs from the last four weeks. Our team integrated this ranking method in 2018, and it has served us well through recent shifts in search patterns. When new content becomes available for new needs, the model can pick up new patterns and correlations that would otherwise have taken careful manual modelling. This is the fruit of our investments over the last years, including automatic model releases and validation, measurement and experimentation, which allowed us to get to daily ranking model rollouts.

Create training data
Creating training sets is the prerequisite for any application of machine learning, and in this case it's actually pretty straightforward: generate the training data from search logs that capture which results were clicked for which queries. Choosing an initial simple set of model features helps to keep complexity low and makes the model robust. Click-through rate for pages by query and a simple topicality score like TF-IDF can serve as starting points. Each click on a document gets a label of 1, everything else a label of 0, and each search impression that receives a click becomes a training example for the ML model. Don't do any aggregations on queries or the like; the model will learn these by itself. Then feed the training data into an ML ranking model, such as one built with TensorFlow Ranking (the tensorflow_ranking library).
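To make that concrete, here is a minimal sketch of what such a preparation step could look like in Python. The log file name and column names are hypothetical, and the tiny pointwise Keras scorer stands in for the listwise models you would typically build with the TensorFlow Ranking library.

```python
# Minimal sketch: turn raw click logs into labeled training examples and fit a
# simple pointwise scoring model. The log format and column names are
# hypothetical; real systems use richer features and listwise ranking losses.
import pandas as pd
import tensorflow as tf

# One row per (query, document) impression from the search logs.
logs = pd.read_csv("search_impressions.csv")  # columns: query, doc_id, tfidf, clicked

# Feature 1: historical click-through rate of each document for each query.
# In practice you would compute this from an earlier time window to avoid
# leaking the label into the feature.
ctr = (logs.groupby(["query", "doc_id"])["clicked"]
           .mean()
           .rename("ctr")
           .reset_index())
examples = logs.merge(ctr, on=["query", "doc_id"])

# Label: 1 if the document was clicked for this impression, 0 otherwise.
features = examples[["ctr", "tfidf"]].to_numpy(dtype="float32")
labels = examples["clicked"].to_numpy(dtype="float32")

# A tiny pointwise scorer; TensorFlow Ranking would replace this with a
# listwise model trained per query.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(features, labels, batch_size=256, epochs=3, validation_split=0.1)

# At serving time, score the candidate documents for a query with the model
# and sort them by the predicted score.
```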
Measurement
Once the basics are working, you'll want to gauge the performance of the model and improve it. We combine offline analysis, replaying queries from logs and measuring whether the clicked results ranked higher on average, with live experimentation, where we divert a share of traffic to a different ranking model for direct comparison. Robust search quality analysis is key. In practice it helps to account for the fact that higher-up results always get more clicks (position bias), and that not all clicks are good: when users immediately come back to the search results page to click on something different, that indicates the page wasn't what they were looking for.
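As a rough illustration of the offline half of that analysis, the sketch below replays logged queries and computes the mean reciprocal rank of the clicked result under a candidate ranking function. The session format and the rank_results callable are hypothetical placeholders, and the sketch deliberately ignores position bias and unsatisfied clicks, which a real evaluation would need to account for.

```python
# Minimal sketch of offline evaluation: replay logged queries and check how
# highly a candidate ranker would have placed the result the user clicked.
# The session structure and rank_results() are hypothetical placeholders.
from statistics import mean

def mean_reciprocal_rank(logged_sessions, rank_results):
    """logged_sessions: iterable of (query, candidate_docs, clicked_doc)."""
    reciprocal_ranks = []
    for query, candidates, clicked in logged_sessions:
        ranking = rank_results(query, candidates)  # ranker under test
        if clicked in ranking:
            reciprocal_ranks.append(1.0 / (ranking.index(clicked) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return mean(reciprocal_ranks)

# Compare the production ranking against an experimental model by running
# both over the same replayed sessions:
# mrr_baseline = mean_reciprocal_rank(sessions, baseline_ranker)
# mrr_candidate = mean_reciprocal_rank(sessions, experimental_ranker)
```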
Expanding the model
With more signals and page attributes available, you can train more sophisticated models that consider, for example, page popularity, freshness, content type, data source, or even user attributes such as job role. When structured data is available, it can make for powerful features, too. Word embeddings can outperform manually defined synonyms while reducing reliance on human curation, especially on the "long tail" of search queries. Running machine learning in production with regular model training, validation and deployment isn't trivial, and comes with quite a learning curve for teams new to the technology. TFX does a lot of the heavy lifting for you, helping you follow best practices and focus on model performance rather than infrastructure.

Positive impact
The ML-driven approach allows us to have a relatively small team that doesn't have to tweak ranking formulas and perform manual optimizations. We can operate driven by usage data alone, and we don't employ human raters for internal search. This ultimately enabled us to focus our energy on identifying user needs and emerging query patterns from search logs in real time, using statistical modelling and clustering techniques. Equipped with these insights, we consulted partner teams across the company on their content strategy and delivered tailor-made, personalized search features (called Instant Answers) to get the most helpful responses in front of Googlers where they needed them most. For example, we could spot skyrocketing demand for (and issues with!) virtual machines and work-from-home IT equipment early, influencing policy, spurring content creation and informing custom, rich promotions in search for topical queries. As a result, 4 out of 5 Googlers said they find it easy to find the right information on Covid-19, working from home, and updated company services.

Give it a try
Interested in improving your own search results? Good! Let's put the pieces together. To get started you'll need:

- Detailed logging, ranking quality measurements and integrated A/B testing capabilities. These are the foundations for training models and evaluating their performance. Frameworks like Apache Beam can be very helpful for processing raw logs and generating useful signals from them.
- A ranking model built with TensorFlow Ranking, based on usage signals. In many open source search systems, such as Elasticsearch or Apache Solr, you can modify, extend or override scoring functions, which allows you to plug your model into an existing system.
- Production pipelines for model training, validation and deployment using TFX.

We want to acknowledge Anton Krohmer, Senior Software Engineer, who contributed technical insight and expertise to this post.
Source: Google Cloud Platform

How Wunderkind scales up to 200K requests per second using Google Cloud

Editor's note: We're hearing here how martech provider Wunderkind easily met the scaling demands of its growing customer base across multiple use cases with Cloud Bigtable and other Google Cloud data solutions.

Wunderkind is a performance marketing channel, and we mostly have two kinds of customers: online retailers, and publishers like Gizmodo Media Group, Reader's Digest, The New York Post and more. We help retailers boost their e-commerce revenue through real-time messaging solutions designed for email, SMS, onsite, and advertising. Brands want to provide a one-to-one experience to more of their customers, and we use our extensive history with best practices in email marketing and technology to help brands reach more customers through targeted messaging and personalized shopping experiences. With publishers, it's a different value proposition: we use the same platform to provide a non-disruptive, personalized ad experience on their website. For example, if you are on their site and then leave, we might show an ad tailored to you when you come back later, depending on the campaign. After running into limitations with our legacy database system, we turned to Cloud Bigtable and Google Cloud, which helped us become more flexible, easily scale for high traffic demand (which can be a steady 40,000 requests per second) and meet the needs of our growing number of data use cases.

Three different databases power our core product
In our core offering, companies send us user events from their websites. We store these events and later decide (using our secret sauce) if and how to reach out to those users on behalf of our customers. Because many of our customers are retailers, Black Friday and Cyber Monday are big traffic days for us. On such days, we can get 31 billion events, sometimes as many as 200K events per second. We show 1.6 billion impressions across close to 1 billion pageviews. And at the end of all this, we securely send about 100 million emails. We noticed the same thing around election time; traffic reached the same high volume. We need scalable solutions to support this level of traffic, as well as the elasticity to pay only for what we use, and that's where Google Cloud comes in.

So how does this work? Our externally facing APIs, which run on Google Kubernetes Engine, receive those user events, up to hundreds of thousands per second. All the components in our architecture need to be able to handle this demand. From our APIs, those events go to Pub/Sub and Dataflow, and from there they are written to Bigtable and to BigQuery, Google Cloud's serverless, highly scalable data warehouse. This business user activity data underpins almost all our products. Events can be things like product views or additions to shopping carts. When we store this data in Bigtable, we use a combination of email address and customer ID as the Bigtable row key and record the event details in that row.

What do we do with this information next? It's important to mention that we also mark the last time we received an event about a user in Memorystore for Redis, Google Cloud's fully managed Redis service. This matters because another service periodically checks Memorystore for users that have not been active for a campaign-specific period of time (it can be 30 minutes, for example) and then decides whether to reach out to them. How we decide when to reach out is an intelligent part of our product offering, based on the channel, message, product and more.
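As an illustration of what that ingestion path could look like with the Python client libraries, the sketch below writes an event to Bigtable under an email-plus-customer-ID row key and records a last-seen timestamp in Redis. The project, instance, table, column family and key layout are assumptions made for the example, not Wunderkind's actual schema.

```python
# Illustrative sketch only: store a user event in Bigtable keyed by
# email + customer ID, and track last activity in Memorystore for Redis.
# Project, instance, table, and column names are hypothetical.
import json
import time

import redis
from google.cloud import bigtable

bigtable_client = bigtable.Client(project="my-project")
table = bigtable_client.instance("events-instance").table("user-events")
cache = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore endpoint

def record_event(email: str, customer_id: str, event: dict) -> None:
    row_key = f"{email}#{customer_id}".encode("utf-8")

    # Write the event details to the user's row in Bigtable.
    row = table.direct_row(row_key)
    row.set_cell("events", event["type"].encode("utf-8"),
                 json.dumps(event).encode("utf-8"))
    row.commit()

    # Mark the last time we saw this user; a separate service scans these
    # keys to find users who have been inactive for, say, 30 minutes.
    cache.set(f"last_seen:{email}#{customer_id}", int(time.time()))

record_event("jane@example.com", "cust-42",
             {"type": "add_to_cart", "sku": "SKU-123"})
```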
When we do reach out, we use Memorystore for Redis as a rate limiter, or token bucket. In order not to overwhelm the email or texting providers we send API requests to, we throttle those requests using Memorystore. (We prefer to preemptively throttle the outgoing API requests rather than handle errors later.)

When we do reach out, we often need details for a specific product, say, if the website belongs to a retailer. We usually get that information from the retailer through various channels, and we store product information in Cloud SQL for MySQL. We pull that information when we need to send an email with product details, and we use Memorystore for Redis to cache it, since many of the products are requested repeatedly. Our Cloud SQL instance has 16 vCPUs, 60 GB of memory and 0.5 TB of disk space, and when we perform those product information updates, we handle about a thousand write transactions per second. We are also in the process of migrating some tables from a self-managed MySQL instance, and we keep those tables synchronized with Cloud SQL using Datastream.

Our user history database was originally stored in AWS DynamoDB, but we were running into problems with how the data was structured, and we'd often get hot shards with no way to determine how or why. That led to our decision to migrate to Bigtable. We set up the migration by first writing the data to both locations from Pub/Sub, performed some backfill of data until that was up and running, and then started working on the reads. We completed this over a few short months, then switched everything to Bigtable. So, as mentioned, we are using Bigtable for multiple databases. The instance that stores our user events holds about 30 TB on about 50 nodes.

Profile management
A second use case for Bigtable is user profile management, where we track, for example, user attributes based on subscription activity and whether they've opted in or out of various lists, and where we apply list-specific rules that determine which targeted emails we send out to users.

Our very own URL shortener
Our third use case for Bigtable is our URL shortener. When our customers build out campaigns and choose a URL, we append tracking information to the query string and the URLs become long. Many times we send them via SMS texts, so the URLs need to be short. We originally used an external solution, but determined that it couldn't support our future demands. Our calls tend to be very bursty in nature, and we needed to plan for a future state of supporting higher throughput. We use a separate Bigtable table for the shortened URLs. We generate a base62-encoded short slug and use it as the row key. We store the long URL as a Protobuf-encoded data structure in one of the row cells, and we also have a cell that counts how many times the slug was used; we use Bigtable's atomic increment to update that counter. When a user receives a text message on their phone and clicks the short URL, the request comes to us, we expand it to the long URL (from Bigtable) and redirect them to the appropriate site location. Obviously, for the URL shortener use case, we need to make that conversion very quickly, and Bigtable's low latency helps us meet that demand and scale up to higher throughput.
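A highly simplified sketch of that shortener pattern with the Python Bigtable client might look as follows. The slug length, table and column names are assumptions, and the long URL is stored as plain text here rather than the Protobuf message described above, just to keep the example self-contained.

```python
# Illustrative URL-shortener sketch: base62 slug as the Bigtable row key,
# the long URL in one cell, and an atomically incremented click counter.
# Instance, table, and column names are hypothetical.
import secrets

from google.cloud import bigtable

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def new_slug(length: int = 7) -> str:
    """Random base62 slug, e.g. 'a9Zk3Qp'."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

client = bigtable.Client(project="my-project")
table = client.instance("links-instance").table("short-urls")

def shorten(long_url: str) -> str:
    slug = new_slug()
    row = table.direct_row(slug.encode("utf-8"))
    # The production system stores a Protobuf message here; plain UTF-8
    # keeps the sketch self-contained.
    row.set_cell("link", b"long_url", long_url.encode("utf-8"))
    row.commit()
    return slug

def expand(slug: str) -> str:
    # Atomically bump the usage counter, then read back the long URL.
    counter = table.append_row(slug.encode("utf-8"))
    counter.increment_cell_value("link", b"clicks", 1)
    counter.commit()

    row = table.read_row(slug.encode("utf-8"))
    if row is None:
        raise KeyError(f"unknown slug: {slug}")
    return row.cells["link"][b"long_url"][0].value.decode("utf-8")
```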
Meeting the future with Google Cloud
Our business has grown considerably, and as we keep signing up new clients we need to scale up accordingly, and Bigtable has met our scaling demands easily. With Bigtable and other Google Cloud products powering our data architecture, we've handled the demand of incredibly high traffic days in the last year, including Black Friday and Cyber Monday. Traffic for these events went much higher than expected, and Bigtable was there, helping us easily scale on demand. We are working toward a more cloud-native approach using Google Cloud managed services like GKE, Dataflow, Pub/Sub, Cloud SQL, Memorystore, BigQuery and more. Google offers these first-party products, and we don't see the value in rolling out or self-managing such solutions ourselves. Thanks to Google Cloud, we now have reliable and flexible data solutions that will help us meet the needs of our growing customer base and delight their users with fast, responsive, personalized shopping messaging and experiences.

Learn more about Wunderkind and Cloud Bigtable. Or check out our recent blog exploring the differences between Bigtable and BigQuery.
Source: Google Cloud Platform

What is a Secure Software Supply Chain and Why Should I Care?

Recently there has been an increase in attacks that have compromised well-known software companies' supply chains, enabling attackers to gain access to customer systems by injecting their own malicious code or backdoor capabilities into third-party systems. These third-party systems (along with the malicious code) then get incorporated into software or other digital products. These … Continued
Source: Mirantis