Text Summarization#
Introduction#
In this tutorial, we show how to use HuggingFace models in EvaDB to summarize and classify text. We first load PDF documents into EvaDB using the LOAD PDF statement. The text of each paragraph in the PDF is automatically stored in the data column of a table for subsequent analysis. We then run text summarization and text classification AI queries on the data column obtained from the loaded PDF documents.
EvaDB makes it easy to process text using its built-in support for HuggingFace.
Prerequisites#
To follow along, you will need to set up a local instance of EvaDB via pip.
Connect to EvaDB#
After installing EvaDB, use the following Python code to establish a connection and obtain a cursor for running EvaQL queries.
import evadb
cursor = evadb.connect().cursor()
We will assume that the input pdf_sample1 PDF is loaded into EvaDB. To download the PDF and load it into EvaDB, see the complete text summarization notebook on Colab.
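For readers who already have the PDF on disk, the loading step might look roughly like the sketch below. It is not the notebook's exact code: the file name `pdf_sample1.pdf` is an assumption, the table name `MyPDFs` is taken from the query later in this tutorial, and the helper names `load_pdf_query` and `load_pdf` are hypothetical.

```python
def load_pdf_query(pdf_path: str, table_name: str) -> str:
    """Build the EvaQL statement that loads one PDF into a table."""
    return f"LOAD PDF '{pdf_path}' INTO {table_name};"


def load_pdf(cursor, pdf_path: str, table_name: str):
    """Run LOAD PDF through an EvaDB cursor and return the result as a DataFrame.

    `cursor` is expected to be the object returned by evadb.connect().cursor().
    """
    return cursor.query(load_pdf_query(pdf_path, table_name)).df()
```

With the cursor obtained above, calling `load_pdf(cursor, "pdf_sample1.pdf", "MyPDFs")` would issue `LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs;`.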
Create Text Summarization and Classification Functions#
To create custom TextSummarizer and TextClassifier functions, use the CREATE FUNCTION statement. In these queries, we leverage EvaDB’s built-in support for HuggingFace models. We only need to specify the task and the model parameters in the query to create these functions:
CREATE FUNCTION IF NOT EXISTS TextSummarizer
TYPE HuggingFace
TASK 'summarization'
MODEL 'facebook/bart-large-cnn';
CREATE FUNCTION IF NOT EXISTS TextClassifier
TYPE HuggingFace
TASK 'text-classification'
MODEL 'distilbert-base-uncased-finetuned-sst-2-english';
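The same two statements can also be issued from Python through the cursor. The sketch below generates them from a small table of task and model names, which can help when registering several HuggingFace functions at once. The helper names `create_hf_function_query` and `register_functions` and the `FUNCTIONS` dictionary are hypothetical; only the statement text comes from the queries above.

```python
def create_hf_function_query(name: str, task: str, model: str) -> str:
    """Build a CREATE FUNCTION statement for a HuggingFace-backed function."""
    return (
        f"CREATE FUNCTION IF NOT EXISTS {name}\n"
        f"TYPE HuggingFace\n"
        f"TASK '{task}'\n"
        f"MODEL '{model}';"
    )


# Function name -> (task, model), mirroring the two statements in this section.
FUNCTIONS = {
    "TextSummarizer": ("summarization", "facebook/bart-large-cnn"),
    "TextClassifier": (
        "text-classification",
        "distilbert-base-uncased-finetuned-sst-2-english",
    ),
}


def register_functions(cursor) -> None:
    """Register every function in FUNCTIONS using an EvaDB cursor."""
    for name, (task, model) in FUNCTIONS.items():
        # .df() forces execution and returns the status as a DataFrame.
        cursor.query(create_hf_function_query(name, task, model)).df()
```

Calling `register_functions(cursor)` with the cursor from the connection step would run both statements in order.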
Note
EvaDB has built-in support for a wide range of HuggingFace models.
AI Query Using Registered Functions#
After registering these two functions, we use them in a single AI query over the data column to select the paragraphs with negative sentiment from the loaded PDF documents and summarize them:
CREATE TABLE text_summary AS
SELECT data, TextSummarizer(data)
FROM MyPDFs
WHERE page = 1
AND paragraph >= 1 AND paragraph <= 3
AND TextClassifier(data).label = 'NEGATIVE';
Here, the TextClassifier function is applied to the data column of the pdf_sample1 PDF loaded into EvaDB, and its output is used to select the subset of paragraphs with negative sentiment.
EvaDB’s query optimizer automatically applies the cheaper predicates on page number and paragraph number (e.g., page = 1) first, which avoids running the expensive TextClassifier function on all the rows in the table. After selecting this subset of paragraphs, EvaDB applies the TextSummarizer function to derive their summaries.
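The benefit of evaluating the cheap predicates first can be illustrated in plain Python. This is only a toy model of the idea, not EvaDB's optimizer: `CountingClassifier` is a hypothetical stand-in for the expensive model call, and the rows and labels are made up for the illustration.

```python
class CountingClassifier:
    """Stand-in for an expensive model call; counts how often it is invoked."""

    def __init__(self, negative_paragraphs):
        self.calls = 0
        self.negative = set(negative_paragraphs)

    def __call__(self, row):
        self.calls += 1
        key = (row["page"], row["paragraph"])
        return "NEGATIVE" if key in self.negative else "POSITIVE"


def filter_cheap_first(rows, classifier):
    """Check page and paragraph first; `and` short-circuits before the model call."""
    return [
        row for row in rows
        if row["page"] == 1
        and 1 <= row["paragraph"] <= 3
        and classifier(row) == "NEGATIVE"
    ]


def filter_expensive_first(rows, classifier):
    """Call the model on every row, then apply the cheap predicates."""
    return [
        row for row in rows
        if classifier(row) == "NEGATIVE"
        and row["page"] == 1
        and 1 <= row["paragraph"] <= 3
    ]


# 3 pages x 5 paragraphs = 15 made-up rows.
rows = [{"page": p, "paragraph": n} for p in (1, 2, 3) for n in (1, 2, 3, 4, 5)]
negatives = {(1, 1), (1, 2), (2, 4)}

cheap = CountingClassifier(negatives)
naive = CountingClassifier(negatives)

# Both orderings select the same rows; only the number of model calls differs.
assert filter_cheap_first(rows, cheap) == filter_expensive_first(rows, naive)
print(cheap.calls, naive.calls)  # prints: 3 15
```

Both orderings return the same paragraphs, but the cheap-first version invokes the classifier only on the three rows that pass the page and paragraph checks.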
Here is the query’s output DataFrame:
+--------------------------------------------------------------+--------------------------------------------------------------+
| mypdfs.data | textsummarizer.summary_text |
+--------------------------------------------------------------+--------------------------------------------------------------+
| DEFINATION Specialized connective tissue with ... | Specialized connective tissue with fluid matrix. Erythro ... |
| PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- Red ( 2 ) ... | The temperature is 38° C / 100.4° F. The body weight is ... |
+--------------------------------------------------------------+--------------------------------------------------------------+
Leverage Text Processing AI Engines with EvaDB#
By integrating databases and AI engines using EvaDB, developers can extract insights from text data with just a few EvaQL queries. The natural language processing (NLP) models from OpenAI and HuggingFace can handle complex text processing tasks (e.g., answering questions with context obtained from a column in a table).
EvaDB lets developers incorporate these NLP capabilities into their AI-powered applications while saving time and resources compared to traditional AI development pipelines.
What’s Next?#
👋 If you are excited about our vision of bringing AI inside databases, consider:
📟 joining our Slack: https://evadb.ai/slack
🐙 following us on GitHub: https://evadb.ai/github
🐦 following us on Twitter: https://evadb.ai/twitter
📝 following us on Medium: https://evadb.ai/blog
🖥️ contributing to EvaDB: https://evadb.ai/github
