Text Summarization#



Introduction#

In this tutorial, we show how to use HuggingFace models in EvaDB to summarize and classify text. We first load PDF documents into EvaDB using the LOAD PDF statement. The text of each paragraph in the PDF is automatically stored in the data column of a table for subsequent analysis. We then run text summarization and text classification AI queries on the data column obtained from the loaded PDF documents.

EvaDB makes it easy to process text using its built-in support for HuggingFace.

Prerequisites#

To follow along, you will need to set up a local instance of EvaDB via pip.

Connect to EvaDB#

After installing EvaDB, use the following Python code to establish a connection and obtain a cursor for running EvaQL queries.

import evadb
cursor = evadb.connect().cursor()

We will assume that the input pdf_sample1 PDF is already loaded into EvaDB as the MyPDFs table used in the queries below. To download the PDF and load it into EvaDB, see the complete text summarization notebook on Colab.
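For readers who prefer to load the file themselves, here is a minimal sketch of that step. It assumes the PDF has been saved locally as `pdf_sample1.pdf` (the file name and location are assumptions) and reuses the `MyPDFs` table name from the query later on this page. `build_load_query` and `load_pdf` are hypothetical helper names introduced only for illustration.

```python
def build_load_query(pdf_path: str, table_name: str) -> str:
    """Build an EvaQL LOAD PDF statement for the given file and table."""
    return f"LOAD PDF '{pdf_path}' INTO {table_name};"


def load_pdf(cursor, pdf_path: str = "pdf_sample1.pdf", table_name: str = "MyPDFs"):
    """Run the LOAD PDF statement and return its result as a DataFrame.

    `cursor` is the object returned by evadb.connect().cursor().
    The default file name is an assumption; point it at your own download.
    """
    return cursor.query(build_load_query(pdf_path, table_name)).df()
```

Calling `load_pdf(cursor)` with the cursor created above should issue the statement `LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs;`.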

Create Text Summarization and Classification Functions#

To create custom TextSummarizer and TextClassifier functions, use the CREATE FUNCTION statement. In these queries, we leverage EvaDB’s built-in support for HuggingFace models. We only need to specify the task and the model parameters in the query to create these functions:

CREATE FUNCTION IF NOT EXISTS TextSummarizer
TYPE HuggingFace
TASK 'summarization'
MODEL 'facebook/bart-large-cnn';

CREATE FUNCTION IF NOT EXISTS TextClassifier
TYPE HuggingFace
TASK 'text-classification'
MODEL 'distilbert-base-uncased-finetuned-sst-2-english';

Note

EvaDB has built-in support for a wide range of HuggingFace models.
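The two CREATE FUNCTION statements can also be issued from Python through the cursor. The sketch below assumes the cursor obtained earlier; `build_hf_function_query`, `FUNCTION_SPECS`, and `register_functions` are hypothetical names used for illustration, while the function names, tasks, and models are the ones from the queries above.

```python
def build_hf_function_query(name: str, task: str, model: str) -> str:
    """Build a CREATE FUNCTION statement for a HuggingFace-backed function."""
    return (
        f"CREATE FUNCTION IF NOT EXISTS {name}\n"
        f"TYPE HuggingFace\n"
        f"TASK '{task}'\n"
        f"MODEL '{model}';"
    )


# (function name, HuggingFace task, HuggingFace model) taken from the queries above.
FUNCTION_SPECS = [
    ("TextSummarizer", "summarization", "facebook/bart-large-cnn"),
    (
        "TextClassifier",
        "text-classification",
        "distilbert-base-uncased-finetuned-sst-2-english",
    ),
]


def register_functions(cursor) -> None:
    """Register every function in FUNCTION_SPECS using the given EvaDB cursor."""
    for name, task, model in FUNCTION_SPECS:
        # .df() executes the query and materializes its result.
        cursor.query(build_hf_function_query(name, task, model)).df()
```

Keeping the specifications in a list makes it straightforward to try a different model: only the tuple changes, not the query text.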

AI Query Using Registered Functions#

After registering these two functions, we use them in a single AI query over the data column. The query selects the paragraphs with negative sentiment from the loaded PDF documents, summarizes them, and stores the results in a new table:

CREATE TABLE text_summary AS
SELECT data, TextSummarizer(data)
FROM MyPDFs
WHERE page = 1
      AND paragraph >= 1 AND paragraph <= 3
      AND TextClassifier(data).label = 'NEGATIVE';

Here, the TextClassifier function is applied to the data column of the pdf_sample1 PDF loaded into EvaDB, and its output is used to select the subset of paragraphs with negative sentiment.

EvaDB’s query optimizer automatically applies the earlier predicates on page number and paragraph number (e.g., page = 1) first, to avoid running the expensive TextClassifier function on all the rows in the table. After selecting the matching paragraphs, EvaDB applies the TextSummarizer function to derive their summaries.
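The benefit of evaluating cheap predicates before an expensive function can be illustrated in plain Python. This is only a sketch of the idea, not EvaDB's optimizer: `filter_rows` and `stub_classifier` are hypothetical names, and the stub stands in for the real TextClassifier so that the number of model calls can be counted.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def filter_rows(rows: List[Row], classify: Callable[[str], str]) -> List[Row]:
    """Keep rows on page 1, paragraphs 1 to 3, whose text is labeled NEGATIVE."""
    selected = []
    for row in rows:
        # Cheap predicates first: these are simple column comparisons.
        if row["page"] != 1:
            continue
        if not (1 <= row["paragraph"] <= 3):
            continue
        # Expensive predicate last: only surviving rows reach the classifier.
        if classify(row["data"]) == "NEGATIVE":
            selected.append(row)
    return selected


calls: List[str] = []


def stub_classifier(text: str) -> str:
    """Toy stand-in for a sentiment model; records every call it receives."""
    calls.append(text)
    return "NEGATIVE" if "bad" in text else "POSITIVE"


rows = [
    {"page": 1, "paragraph": 1, "data": "bad news"},
    {"page": 1, "paragraph": 2, "data": "good news"},
    {"page": 1, "paragraph": 4, "data": "bad again"},
    {"page": 2, "paragraph": 1, "data": "bad elsewhere"},
]
result = filter_rows(rows, stub_classifier)
print(len(calls), "classifier calls for", len(rows), "rows")
```

In this toy table, only the two rows that pass the page and paragraph checks are sent to the classifier; the other two are discarded without a model call.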

Here is the query’s output DataFrame:

+--------------------------------------------------------------+--------------------------------------------------------------+
|                         mypdfs.data                          |                 textsummarizer.summary_text                  |
+--------------------------------------------------------------+--------------------------------------------------------------+
| DEFINATION  Specialized connective tissue with          ... | Specialized connective tissue with fluid matrix. Erythro ... |
| PHYSICAL CHARACTERISTICS ( 1 )  COLOUR   -- Red  ( 2 )  ... | The temperature is 38° C / 100.4° F. The body weight is  ... |
+--------------------------------------------------------------+--------------------------------------------------------------+

Leverage Text Processing AI Engines with EvaDB#

By integrating databases and AI engines using EvaDB, developers can extract insights from text data with just a few EvaQL queries. Natural language processing (NLP) models from OpenAI and HuggingFace are capable of complex text processing tasks (e.g., answering complex questions with context obtained from a column in a table).

EvaDB makes it easy for developers to incorporate powerful NLP capabilities into their AI-powered applications while saving time and resources compared to traditional AI development pipelines.

What’s Next?#

👋 If you are excited about our vision of bringing AI inside databases, consider:



