GPU LLM Inference
How Promoted uses large language models (LLMs) and GPU inference in live delivery.
Promoted uses an ensemble of large language models (LLMs) to generate content embeddings for use in all of Promoted's models, including those predicting semantic relevance, clicks, conversions, and ad bidding decisions. Instead of relying on a single LLM or embedding model, Promoted uses a collection of the latest LLMs to balance trade-offs in complexity and domain specificity and to reduce inference variance and dependencies on any particular model version.
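As a rough illustration of the ensemble idea, the sketch below encodes text with several open-source sentence-embedding models, L2-normalizes each output, and concatenates the results. The model names and the concatenation strategy are assumptions for illustration only, not Promoted's actual ensemble.

```python
# Minimal sketch of ensembling content embeddings from multiple models.
# Model names are illustrative placeholders, not Promoted's production ensemble.
import numpy as np
from sentence_transformers import SentenceTransformer

MODEL_NAMES = [
    "sentence-transformers/all-MiniLM-L6-v2",                        # small, fast
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",   # multilingual
]

_models = [SentenceTransformer(name) for name in MODEL_NAMES]

def ensemble_embed(texts: list[str]) -> np.ndarray:
    """Encode texts with each model, L2-normalize, and concatenate.

    Concatenation keeps each model's signal separate; downstream ranking
    models learn how to weight the components.
    """
    parts = []
    for model in _models:
        emb = model.encode(texts, normalize_embeddings=True)
        parts.append(np.asarray(emb))
    return np.concatenate(parts, axis=1)

# Usage: embed a query and an item title, then score their similarity.
vecs = ensemble_embed(["waterproof hiking boots", "Men's Gore-Tex Trail Boot"])
# Because each model's component is unit-normalized, this dot product is the
# sum of per-model cosine similarities; dividing gives the average.
print(float(vecs[0] @ vecs[1]) / len(MODEL_NAMES))
```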
While some content embeddings can be cached (including item titles, descriptions, cover images, and common queries), "true" long-tail semantic search and recommendations require the query, context, and live user history to be converted to semantic embeddings in live production. CPUs can handle this embedding for simple sentence embedding models suited to "fuzzy string matching," but GPUs are required to achieve reasonably low inference latencies with larger, modern multimedia and multilingual embedding models.
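A minimal sketch of that cache-versus-live split, assuming a simple in-process cache and an open-source multilingual embedding model as stand-ins for Promoted's production cache and models: common (head) queries are served from the cache, while long-tail queries fall through to live inference, GPU-backed when a GPU is available.

```python
# Sketch of a cache-first embedding lookup with live GPU fallback for
# long-tail queries. The cache, model name, and key scheme are assumptions.
import hashlib
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# Larger multilingual model; runs on GPU when one is available.
_device = "cuda" if torch.cuda.is_available() else "cpu"
_model = SentenceTransformer("intfloat/multilingual-e5-large", device=_device)

_cache: dict[str, np.ndarray] = {}  # stand-in for a real embedding cache

def embed_query(query: str) -> np.ndarray:
    """Return a cached embedding for common queries; otherwise run live inference."""
    key = hashlib.sha1(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit is not None:
        return hit
    emb = _model.encode([query], normalize_embeddings=True)[0]
    _cache[key] = emb  # head queries warm the cache; the long tail stays live
    return emb
```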
Promoted enables GPU inference for customers whose search queries and user recommendations are complex enough to require it to achieve our business metric objectives within our latency and cost service level agreements.
Clients with their own embeddings can cache them for queries and items via our CMS system (see Sending Embeddings). Clients who would like to run their custom LLM models in live delivery should contact Promoted for more information.
Coming soon, Promoted will use LLMs in live delivery to generate descriptive text about contextual relevance.
For example, Promoted builds domain-relevant query expansions to generate additional numeric, categorical, and string-matching features for our feature expansion system and for use in blender rules, as sketched below.
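The sketch below shows one way LLM-produced query expansions could be turned into numeric and string-matching features. `expand_query` is a hypothetical placeholder for the LLM call, and the feature names are illustrative, not Promoted's actual feature schema.

```python
# Sketch of turning LLM query expansions into string-matching features.
# `expand_query` stands in for an LLM call; feature names are illustrative.
from typing import Dict, List

def expand_query(query: str) -> List[str]:
    # Placeholder for an LLM call returning domain-relevant expansions,
    # e.g. "running shoes" -> ["trainers", "sneakers", "jogging footwear"].
    return ["trainers", "sneakers", "jogging footwear"]

def expansion_features(query: str, item_text: str) -> Dict[str, float]:
    """Numeric features measuring how well an item matches the expanded query."""
    expansions = expand_query(query)
    item_tokens = set(item_text.lower().split())
    matches = [e for e in expansions if set(e.lower().split()) & item_tokens]
    return {
        "expansion_count": float(len(expansions)),
        "expansion_match_count": float(len(matches)),
        "expansion_match_ratio": len(matches) / max(len(expansions), 1),
    }

# Usage: score an item description against an expanded query.
print(expansion_features("running shoes", "Lightweight mesh sneakers for jogging"))
```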