Text Summarization Benchmark
In this benchmark, we compare the runtime performance of EvaDB and MindsDB on a text summarization application operating on a news dataset. In particular, we focus on the CNN-DailyMail News dataset.
All the relevant files are located in the text summarization benchmark folder on GitHub.
Prepare Dataset
cd benchmark/text_summarization
bash download_dataset.sh
Use EvaDB for Text Summarization
Note: Install ray along with EvaDB to speed up the queries:
pip install "evadb[ray]"
cd benchmark/text_summarization
python text_summarization_with_evadb.py
Loading Data Into EvaDB
CREATE TABLE IF NOT EXISTS cnn_news_test(
id TEXT(128),
article TEXT(4096),
highlights TEXT(1024)
);
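The statement above only creates the table. A minimal sketch of how the table might then be populated from Python follows, assuming the dataset script produces a file named cnn_news_test.csv (the name used in the SQLite import step) and that EvaDB's LOAD CSV statement is used. The helper name load_news_csv is hypothetical and not part of the benchmark scripts.

```python
def load_news_csv(cursor, csv_path="cnn_news_test.csv", table="cnn_news_test"):
    """Create the news table if needed, then load the CSV file into it.

    `cursor` is expected to be an EvaDB cursor; only its
    `query(...).df()` method is used here.
    """
    create_stmt = (
        f"CREATE TABLE IF NOT EXISTS {table}("
        "id TEXT(128), article TEXT(4096), highlights TEXT(1024));"
    )
    # Assumption: the CSV columns match the table columns declared above.
    load_stmt = f"LOAD CSV '{csv_path}' INTO {table};"
    for stmt in (create_stmt, load_stmt):
        cursor.query(stmt).df()  # .df() executes the statement
    return [create_stmt, load_stmt]


# Usage (requires EvaDB to be installed):
#   import evadb
#   cursor = evadb.connect().cursor()
#   load_news_csv(cursor)
```

The function returns the statements it ran, which makes it easy to log exactly what was sent to EvaDB.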
Creating Text Summarization Function in EvaDB
CREATE FUNCTION IF NOT EXISTS TextSummarizer
TYPE HuggingFace
TASK 'summarization'
MODEL 'sshleifer/distilbart-cnn-12-6'
MIN_LENGTH 5
MAX_LENGTH 100;
Tuning EvaDB for Maximum GPU Utilization
cursor._evadb.config.update_value("executor", "batch_mem_size", 300000)
cursor._evadb.config.update_value("executor", "gpu_ids", [0,1])
cursor._evadb.config.update_value("experimental", "ray", True)
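The three settings above are specific to a two-GPU server. As a rough sketch, they could be wrapped in a small helper so the same script can run on machines with a different number of GPUs. The helper name apply_gpu_tuning and its defaults are illustrative assumptions; it only uses the update_value calls shown above.

```python
def apply_gpu_tuning(cursor, gpu_ids=(0, 1), batch_mem_size=300000, use_ray=True):
    """Apply the tuning settings from this section to an EvaDB cursor.

    Returns the list of (category, key, value) triples that were set.
    """
    settings = [
        ("executor", "batch_mem_size", batch_mem_size),  # larger batches per call
        ("executor", "gpu_ids", list(gpu_ids)),          # GPUs EvaDB may use
        ("experimental", "ray", use_ray),                # enable Ray-based execution
    ]
    config = cursor._evadb.config  # same private attribute used in the text above
    for category, key, value in settings:
        config.update_value(category, key, value)
    return settings


# Usage on a single-GPU machine (requires EvaDB with Ray installed):
#   apply_gpu_tuning(cursor, gpu_ids=[0])
```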
Text Summarization Query in EvaDB
CREATE TABLE IF NOT EXISTS cnn_news_summary AS
SELECT TextSummarizer(article) FROM cnn_news_test;
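Since the benchmark reports wall-clock runtime, the query has to be timed. The source does not say how the timing was done, so the following is only one possible approach: a sketch using time.perf_counter around the query call. The helper name time_query is hypothetical.

```python
import time


def time_query(cursor, query):
    """Run `query` on an EvaDB-style cursor and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = cursor.query(query).df()  # .df() forces execution of the query
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage (requires EvaDB and the table/function defined above):
#   summary_query = """
#       CREATE TABLE IF NOT EXISTS cnn_news_summary AS
#       SELECT TextSummarizer(article) FROM cnn_news_test;
#   """
#   _, seconds = time_query(cursor, summary_query)
#   print(f"Summarization took {seconds / 60:.1f} minutes")
```

perf_counter is used instead of time.time because it is monotonic and intended for measuring intervals.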
Use MindsDB for Text Summarization
Setup SQLite Database
sqlite3 cnn_news_test.db
> .mode csv
> .import cnn_news_test.csv cnn_news_test
> .exit
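For scripted runs, the same import can be approximated with Python's standard library instead of the interactive sqlite3 shell. This is a sketch, not part of the benchmark: it assumes the first CSV row is a header and stores every column as TEXT, which mirrors what .import does when the table does not exist yet. The function name import_csv_to_sqlite is hypothetical.

```python
import csv
import sqlite3


def import_csv_to_sqlite(csv_path, db_path, table="cnn_news_test"):
    """Load a headered CSV file into a SQLite table and return its row count."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row supplies the column names
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
            # The remaining rows are streamed straight from the CSV reader.
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
            conn.commit()
            return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        finally:
            conn.close()


# Usage:
#   import_csv_to_sqlite("cnn_news_test.csv", "cnn_news_test.db")
```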
Install MindsDB
Follow the MindsDB installation guide to install it via pip.
Note: You will need to manually run pip install evaluate for the HuggingFace model to work in MindsDB.
After installation, use the MySQL client to connect to MindsDB. Update the port number if needed.
mysql -h 127.0.0.1 --port 47335 -u mindsdb -p
Benchmark MindsDB
Connect MindsDB to the SQLite database we created before:
CREATE DATABASE sqlite_datasource
WITH ENGINE = 'sqlite',
PARAMETERS = {
"db_file": "cnn_news_test.db"
};
Create a text summarization model and wait for it to be ready.
CREATE MODEL mindsdb.hf_bart_sum_20
PREDICT PRED
USING
engine = 'huggingface',
task = 'summarization',
model_name = 'sshleifer/distilbart-cnn-12-6',
input_column = 'article',
min_output_length = 5,
max_output_length = 100;
DESCRIBE mindsdb.hf_bart_sum_20;
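Waiting for the model can be automated by re-running the DESCRIBE statement until it reports that the model is ready. The sketch below keeps the database access abstract: fetch_status is a hypothetical callable that runs DESCRIBE and returns the status string. The status values "complete" and "error" are assumptions about what MindsDB reports and should be checked against your installation.

```python
import time


def wait_until_ready(fetch_status, poll_seconds=10.0, timeout_seconds=3600.0,
                     sleep=time.sleep):
    """Poll `fetch_status()` until the model is ready.

    Returns the final status, raises RuntimeError if the model failed,
    and TimeoutError if it is still not ready after `timeout_seconds`.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "complete":   # assumed "ready" status
            return status
        if status == "error":      # assumed failure status
            raise RuntimeError("model creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"model still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage, with a hypothetical helper that runs
# "DESCRIBE mindsdb.hf_bart_sum_20;" and returns its status column:
#   wait_until_ready(lambda: describe_status("mindsdb.hf_bart_sum_20"))
```

The sleep function is a parameter so the loop can be exercised quickly without real delays.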
Use the text summarization model to summarize the CNN news dataset:
CREATE OR REPLACE TABLE sqlite_datasource.cnn_news_summary (
SELECT PRED
FROM mindsdb.hf_bart_sum_20
JOIN sqlite_datasource.cnn_news_test
);
Benchmarking Results
Here are the key runtime metrics for the Text Summarization benchmark.
The experiment was conducted on a server with 56 Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz and two Quadro P6000 GPUs.
| | MindsDB (off-the-shelf) | EvaDB (off-the-shelf) | EvaDB (tuned for maximum GPU utilization) |
|---|---|---|---|
| Runtime | 4 hours 45 mins | 1 hour 10 mins | 43 mins |
| Speedup | 1x | 4.1x | 6.3x |
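Speedup here is read as the MindsDB runtime divided by the runtime of each setup. The short sketch below recomputes it from the rounded runtimes in the table. With those rounded values it gives about 4.1x for off-the-shelf EvaDB and about 6.6x for the tuned setup; the source does not say how the reported 6.3x was derived, so the reported figures are left as published.

```python
def speedup(baseline_minutes, candidate_minutes):
    """Speedup of a candidate over the baseline: baseline time / candidate time."""
    return baseline_minutes / candidate_minutes


# Runtimes in minutes, taken from the table above.
runtimes = {
    "MindsDB (off-the-shelf)": 4 * 60 + 45,  # 285 minutes
    "EvaDB (off-the-shelf)": 1 * 60 + 10,    # 70 minutes
    "EvaDB (tuned)": 43,
}

baseline = runtimes["MindsDB (off-the-shelf)"]
speedups = {
    name: round(speedup(baseline, minutes), 1)
    for name, minutes in runtimes.items()
}

for name, value in speedups.items():
    print(f"{name}: {value}x")
```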