Exploring Techniques for Matching Related Text Content in JavaScript
In today's digital age, understanding the relationship between pieces of text is crucial for tasks like data mining, recommendation systems, and natural language processing. Whether you're comparing product descriptions, analyzing user reviews, or exploring other text-based data, matching related content can offer valuable insights. In this blog post, we'll explore various techniques in JavaScript to achieve this, ranging from simple keyword matching to sophisticated word embeddings using TensorFlow.js.
1. Keyword Matching
If you have specific keywords or phrases and want to see if they appear in multiple pieces of text, keyword matching is an effective and straightforward approach. Here's a quick example:
const string1 = "The quick brown fox jumps over the lazy dog.";
const string2 = "A fast brown fox leaps across the sleeping dog.";
const keywords = ["quick", "fox", "dog"];

keywords.forEach(keyword => {
  if (string1.includes(keyword) && string2.includes(keyword)) {
    console.log(`Keyword '${keyword}' found in both strings.`);
  }
});
This simple script checks if the given keywords are present in both strings and logs the matching results. While keyword matching is efficient, it's limited in its ability to understand the deeper semantics of the text.
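One caveat: `String.prototype.includes` performs plain substring matching, so the keyword "dog" would also match "dogma". If that matters for your data, a stricter variant can anchor each keyword to word boundaries with a regular expression. Here's a minimal sketch (the `escape` and `hasWord` helpers are my own illustrative names, not part of a library):

```javascript
const string1 = "The quick brown fox jumps over the lazy dog.";
const string2 = "A fast brown fox leaps across the sleeping dogma.";
const keywords = ["quick", "fox", "dog"];

// Escape regex metacharacters so each keyword is matched literally.
const escape = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// \b anchors the match to word boundaries; the "i" flag makes it case-insensitive.
const hasWord = (text, word) => new RegExp(`\\b${escape(word)}\\b`, "i").test(text);

keywords.forEach(keyword => {
  if (hasWord(string1, keyword) && hasWord(string2, keyword)) {
    console.log(`Keyword '${keyword}' found in both strings.`);
  }
});
// "dog" no longer matches "dogma" in the second string.
```

With this variant, only "fox" is reported as shared, since "dog" appears as a whole word in the first string but not the second.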
2. Cosine Similarity with TF-IDF
For more advanced textual analysis, you can calculate the cosine similarity of Term Frequency-Inverse Document Frequency (TF-IDF) vectors. This method allows you to quantify the similarity between two texts based on the frequency of terms. Here's how you can implement TF-IDF and cosine similarity in JavaScript:
Step 1: Tokenize and Calculate TF-IDF
function getTFIDF(strings) {
  const termFrequency = {};
  const documentFrequency = {};
  const tfidf = [];

  // First pass: count term frequencies per document and document
  // frequencies across the corpus.
  strings.forEach((str, idx) => {
    // filter(Boolean) drops the empty token produced by trailing punctuation.
    const tokens = str.toLowerCase().split(/\W+/).filter(Boolean);
    termFrequency[idx] = {};
    tokens.forEach(token => {
      termFrequency[idx][token] = (termFrequency[idx][token] || 0) + 1;
    });
    // Document frequency counts how many *documents* contain a token,
    // so each unique token is counted at most once per document.
    new Set(tokens).forEach(token => {
      documentFrequency[token] = (documentFrequency[token] || 0) + 1;
    });
  });

  // Second pass: weight each term as TF * IDF, where IDF = log(N / DF).
  strings.forEach((str, idx) => {
    tfidf[idx] = {};
    Object.keys(termFrequency[idx]).forEach(token => {
      const tf = termFrequency[idx][token];
      const idf = Math.log(strings.length / documentFrequency[token]);
      tfidf[idx][token] = tf * idf;
    });
  });

  return tfidf;
}
Step 2: Calculate Cosine Similarity
const strings = [
  "The quick brown fox jumps over the lazy dog.",
  "A fast brown fox leaps across the sleeping dog."
];
const tfidf = getTFIDF(strings);
function cosineSimilarity(tfidf1, tfidf2) {
  let dotProduct = 0;
  let magnitude1 = 0;
  let magnitude2 = 0;

  for (const token in tfidf1) {
    dotProduct += tfidf1[token] * (tfidf2[token] || 0);
    magnitude1 += tfidf1[token] ** 2;
  }
  for (const token in tfidf2) {
    magnitude2 += tfidf2[token] ** 2;
  }

  magnitude1 = Math.sqrt(magnitude1);
  magnitude2 = Math.sqrt(magnitude2);

  // Guard against division by zero when either vector is empty.
  if (magnitude1 === 0 || magnitude2 === 0) return 0;
  return dotProduct / (magnitude1 * magnitude2);
}
const similarity = cosineSimilarity(tfidf[0], tfidf[1]);
console.log("Cosine Similarity:", similarity);
This method provides a numerical value representing the similarity between the two texts, giving you a more nuanced understanding of their relationship than simple keyword matching.
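In practice, you often want the best match among many candidates rather than a single pairwise score. As a sketch (the function names here are my own, not from any library), you can rank sparse term-weight vectors, such as those returned by `getTFIDF` above, against a query vector:

```javascript
// Cosine similarity over sparse vectors represented as plain objects
// mapping token -> weight (the shape getTFIDF produces).
function cosine(v1, v2) {
  let dot = 0, m1 = 0, m2 = 0;
  for (const t in v1) { dot += v1[t] * (v2[t] || 0); m1 += v1[t] ** 2; }
  for (const t in v2) { m2 += v2[t] ** 2; }
  return m1 && m2 ? dot / Math.sqrt(m1 * m2) : 0;
}

// Score every candidate against the query and sort best-first.
function rankBySimilarity(queryVec, candidateVecs) {
  return candidateVecs
    .map((vec, index) => ({ index, score: cosine(queryVec, vec) }))
    .sort((a, b) => b.score - a.score);
}
```

Sorting descending by score means `result[0].index` points at the most similar document, which is usually all you need for "find related content" features.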
3. Word Embeddings with TensorFlow.js
For capturing even more nuanced semantic relationships, you can use pre-trained word embeddings via TensorFlow.js. Word embeddings provide a dense representation of words in a high-dimensional space where semantically similar words are closer together.
First, you'll need to install TensorFlow.js:
npm install @tensorflow/tfjs @tensorflow-models/universal-sentence-encoder
Then, you can use the following code to process your text using a pre-trained model like the Universal Sentence Encoder:
Step 1: Load TensorFlow and the Pre-trained Model
import * as tf from '@tensorflow/tfjs';
import * as use from '@tensorflow-models/universal-sentence-encoder';

async function getEmbeddings(strings) {
  // Load the Universal Sentence Encoder and embed each string
  // into a 512-dimensional vector.
  const model = await use.load();
  return model.embed(strings);
}
Step 2: Compute Cosine Similarity
const strings = [
  "The quick brown fox jumps over the lazy dog.",
  "A fast brown fox leaps across the sleeping dog."
];

getEmbeddings(strings).then(embeddings => {
  const embeddingsArray = embeddings.arraySync();

  const cosineSimilarity = (a, b) => {
    const dotProduct = tf.dot(a, b).dataSync()[0];
    const magnitudeA = tf.norm(a).dataSync()[0];
    const magnitudeB = tf.norm(b).dataSync()[0];
    return dotProduct / (magnitudeA * magnitudeB);
  };

  const similarity = cosineSimilarity(embeddingsArray[0], embeddingsArray[1]);
  console.log("Cosine Similarity:", similarity);
});
Using TensorFlow.js and pre-trained models allows you to leverage state-of-the-art neural network techniques to understand the semantic relationships between different pieces of text.
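Once you have the rows of `embeddings.arraySync()`, grouping related sentences reduces to thresholding pairwise similarity. A minimal sketch in plain JavaScript (the 0.7 threshold is an arbitrary starting point you'd tune for your data, and the function names are illustrative):

```javascript
// Plain-array cosine similarity, usable directly on rows of
// embeddings.arraySync() without keeping tensors around.
const cosine = (a, b) => {
  let dot = 0, ma = 0, mb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    ma += a[i] ** 2;
    mb += b[i] ** 2;
  }
  return dot / (Math.sqrt(ma) * Math.sqrt(mb));
};

// Collect index pairs whose similarity clears the threshold.
function relatedPairs(vectors, threshold = 0.7) {
  const pairs = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      if (cosine(vectors[i], vectors[j]) >= threshold) pairs.push([i, j]);
    }
  }
  return pairs;
}
```

Because this works on plain number arrays, the same helper applies equally to TF-IDF vectors you've densified or to embeddings from any other model.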
Conclusion
Matching related content in JavaScript can range from simple keyword matching to complex semantic analysis using machine learning. Depending on your specific needs, you can choose the method that best suits your requirements. For basic tasks, keyword matching might suffice, but for more intricate analyses, TF-IDF and word embeddings offer powerful tools to uncover deeper relationships within your text data.
Feel free to experiment with these methods and adapt them to your unique use case. Happy coding!