Foundation models are artificial intelligence (AI) models trained on massive amounts of unlabeled data. This allows them to learn general patterns and relationships in the data, and the models can then be fine-tuned for specific tasks such as patent prior-art searches, patent classification, and other patent analysis tasks. They are called "foundation" models because they serve as a basis for building other AI models.

Key characteristics of foundation models:

  • Trained on massive datasets: Foundation models are trained on very large datasets, often containing billions or even trillions of data points. This allows them to learn complex patterns and relationships that would be difficult to learn from smaller datasets.
  • General purpose: Foundation models are designed to be general purpose, meaning they can be used for a wide range of tasks. This is in contrast to traditional AI models, which are often designed for specific tasks.
  • Adaptable: Foundation models can be adapted to specific tasks by fine-tuning them on smaller datasets that are specific to the task. This allows them to be used for a wide range of tasks without having to be retrained from scratch each time.
  • Powerful: Foundation models have demonstrated impressive performance in natural language processing, which is at the heart of any patent analysis task.

Foundation models that are best suited for patent analysis are those that can process and understand large amounts of textual data. These include, for example:

Transformer-based models:

  • BERT (Bidirectional Encoder Representations from Transformers): This model is adept at understanding the context and nuances of language, making it well suited for analyzing complex patent claims and identifying relevant prior art. Variants include SciBERT (trained on scientific text), LegalBERT (trained on legal text), and RoBERTa (Robustly Optimized BERT Pretraining Approach), an improved version of BERT trained on more data for longer. A minimal encoding sketch follows this list.
  • GPT-3, GPT-4 (Generative Pre-trained Transformer): While primarily known for text generation, OpenAI's GPT models' strong language understanding can be leveraged for prior art search tasks, especially for generating relevant keywords or summarizing patent documents (see the keyword-generation sketch after this list).
  • T5 (Text-to-Text Transfer Transformer): T5 casts every NLP task in a text-to-text format, so its flexible architecture can be fine-tuned for specific prior art search tasks, such as retrieving documents with similar claims or classifying patents into relevant categories.
  • Llama 2 (Large Language Model Meta AI): Llama 2 is designed to be a powerful, scalable model suitable for a variety of natural language processing tasks, including those involving legal and technical text. Compared to some larger models, Llama 2 can also be more computationally efficient.
  • Falcon: An open-source family of state-of-the-art large language models (LLMs) developed by the Technology Innovation Institute (TII) in Abu Dhabi. It comes in several sizes (7B, 40B, and 180B parameters), catering to diverse computational needs and use cases. The Falcon-40B model, in particular, has been praised for its performance relative to its size.
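To make the BERT bullet above more concrete, the sketch below encodes a single patent claim with a BERT variant through the Hugging Face transformers library. It is a minimal sketch, not a prescribed pipeline: the checkpoint allenai/scibert_scivocab_uncased, the example claim text, and the mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch: encode a patent claim with SciBERT via Hugging Face transformers.
# Assumptions: "transformers" and "torch" are installed; the checkpoint
# "allenai/scibert_scivocab_uncased" is one publicly available SciBERT model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "allenai/scibert_scivocab_uncased"  # a LegalBERT or RoBERTa checkpoint would work the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

claim = ("A rechargeable battery pack comprising a plurality of lithium-ion cells "
         "connected in series and a thermal management layer.")

inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one vector for the claim
# (one common choice; [CLS] pooling is another).
mask = inputs["attention_mask"].unsqueeze(-1)
claim_vector = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(claim_vector.shape)  # e.g. torch.Size([1, 768])
```

Vectors produced this way can be compared with cosine similarity to score candidate prior art against the claim.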
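The GPT bullet mentions keyword generation and summarization. The sketch below shows one way that could look with OpenAI's Python client (openai >= 1.0); the prompt wording and the model name are placeholders chosen for illustration, not recommendations from this text.

```python
# Illustrative sketch: ask a GPT model to propose prior-art search keywords.
# Assumptions: the "openai" package (>= 1.0) is installed and OPENAI_API_KEY is
# set in the environment; the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

claim = ("A rechargeable battery pack comprising a plurality of lithium-ion cells "
         "connected in series and a thermal management layer.")

response = client.chat.completions.create(
    model="gpt-4",  # any capable chat model could be substituted
    messages=[
        {"role": "system", "content": "You assist with patent prior-art searches."},
        {"role": "user", "content": "Suggest 10 search keywords or phrases for prior art "
                                    "relevant to this claim:\n\n" + claim},
    ],
)
print(response.choices[0].message.content)
```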

Sentence Transformer models:

  • These models are specifically designed for generating meaningful sentence embeddings, which can be used to calculate semantic similarity between patent claims and prior art documents. This allows for more accurate and efficient identification of relevant prior art.
  • For example, SBERT (Sentence-BERT) is a modification of BERT that is optimized for generating sentence embeddings, making it ideal for semantic similarity tasks. See also domain-adapted variants such as SciBERT-Sentence and LegalBERT-Sentence. A minimal similarity sketch follows this list.
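As a concrete illustration of the semantic-similarity idea, the sketch below uses the sentence-transformers library to rank candidate prior-art passages against a claim. The checkpoint all-MiniLM-L6-v2 and the example texts are assumptions made for illustration; a patent- or science-tuned sentence model would be a more realistic choice in practice.

```python
# Sketch: rank candidate prior-art passages by cosine similarity to a patent claim.
# Assumptions: "sentence-transformers" is installed; "all-MiniLM-L6-v2" is a
# generic public checkpoint used purely for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

claim = ("A rechargeable battery pack comprising a plurality of lithium-ion cells "
         "connected in series and a thermal management layer.")
candidates = [
    "A battery module with serially connected lithium-ion cells and a cooling plate.",
    "A method for brewing coffee using a pressurized water chamber.",
    "A thermal interface material placed between battery cells and a heat sink.",
]

claim_emb = model.encode(claim, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Higher cosine similarity suggests a more relevant prior-art candidate.
scores = util.cos_sim(claim_emb, cand_embs)[0]
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```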

Other techniques:

  • Word2Vec and GloVe: While not strictly foundation models, these word embedding algorithms can be used as a starting point for building custom prior art search systems (see the gensim sketch after this list).
    • They can help identify semantically similar words and phrases, which can be used to broaden the search and uncover relevant prior art that might otherwise be missed.
    • They are used to represent words as vectors and capture semantic relationships between words based on their context in a large corpus of text.
    • Word2Vec is generally faster, focuses on local context windows, and uses a shallow neural network, while GloVe considers global co-occurrence statistics and uses matrix factorization.
  • GloVe (Global Vectors for Word Representation):
    • GloVe is a count-based model. It constructs a large matrix of word co-occurrences, where each cell records how frequently a pair of words appears together within a certain window in a corpus. It then applies matrix factorization techniques to reduce this high-dimensional co-occurrence matrix into a lower-dimensional representation. The output is a set of dense vector representations of words in which the distance between vectors captures semantic similarity.
    • Training GloVe uses a weighted least squares model to learn embeddings that minimize the difference between the dot product of word vectors and the logarithm of their co-occurrence count.
    • GloVe captures both global statistics and local context, leading to potentially more meaningful representations. It is efficient for creating large vocabulary embeddings.
  • Word2Vec (Word to Vector):
    • Word2Vec is a predictive model that learns word embeddings either by predicting a target word from its context (Continuous Bag of Words, or CBOW) or by predicting context words from a target word (Skip-gram).
    • Word2Vec focuses on local context, considering only the words surrounding the target word within a fixed window size.
    • Word2Vec is computationally efficient and can be trained on large corpora relatively quickly.
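To ground the Word2Vec and GloVe bullets, here is a small gensim sketch: it trains a Word2Vec model on a toy corpus (skip-gram vs. CBOW is a single flag, and the window parameter is the local context size discussed above) and loads pretrained GloVe vectors through gensim's downloader. The toy corpus and the "glove-wiki-gigaword-100" vectors are illustrative assumptions; a real prior-art system would train on millions of claims and abstracts.

```python
# Sketch: word embeddings with gensim. The toy corpus and the pretrained GloVe
# vectors ("glove-wiki-gigaword-100") are illustrative assumptions only.
import gensim.downloader as api
from gensim.models import Word2Vec

# Toy corpus of tokenized patent-style sentences.
corpus = [
    ["rechargeable", "battery", "pack", "lithium", "ion", "cells"],
    ["thermal", "management", "layer", "for", "battery", "cells"],
    ["semiconductor", "device", "with", "gate", "electrode"],
]

# sg=1 selects skip-gram (predict context from target); sg=0 selects CBOW.
# window controls the size of the local context window.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
print(w2v.wv.most_similar("battery", topn=3))

# Pretrained GloVe vectors (built from global co-occurrence statistics) can be
# loaded via gensim's downloader and queried the same way.
glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar("battery", topn=3))
```

Terms that score as similar to a claim's key vocabulary can then be added to a query to broaden the prior art search, as noted above.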

Fine-tuning for patent data:

It's important to note that while these foundation models are powerful, they generally need to be fine-tuned on patent-specific data to achieve optimal performance for prior art searching. This involves training the models on large datasets of patents and non-patent literature to improve their understanding of patent-specific language and terminology.
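As a hedged illustration of what such fine-tuning might look like in practice, the sketch below fine-tunes a BERT-style checkpoint for patent classification (for example, predicting a CPC section) with the Hugging Face Trainer. The two-example toy dataset, the bert-base-uncased checkpoint, the eight-label assumption, and the hyperparameters are placeholders, not a prescription; a real setup would use a large labeled patent corpus and held-out evaluation data.

```python
# Hedged sketch: fine-tune a BERT-style model to classify patent text into CPC
# sections with the Hugging Face Trainer. The toy dataset, model ID, label
# mapping, and hyperparameters are illustrative assumptions only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "bert-base-uncased"  # SciBERT or LegalBERT checkpoints are drop-in alternatives
NUM_LABELS = 8                  # e.g. one label per CPC section A-H (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

# Toy dataset; a real system would use a large corpus of labeled patents.
data = Dataset.from_dict({
    "text": [
        "A rechargeable battery pack with serially connected lithium-ion cells.",
        "A pharmaceutical composition comprising a monoclonal antibody.",
    ],
    "label": [7, 0],  # assumed mapping, e.g. H = electricity, A = human necessities
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="patent-classifier",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args, train_dataset=data, tokenizer=tokenizer)
trainer.train()
```

The same pattern extends to other fine-tuning targets mentioned in this section, such as training sentence embedding models on pairs of claims and known prior art.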