Commit 433d287c authored by Alexander Scharfenberg, committed by Patrick Schlindwein

Fix/#62 implementation bert summary

parent da5f7a0a
...@@ -12,18 +12,18 @@ cache: ...@@ -12,18 +12,18 @@ cache:
- .cache/pip - .cache/pip
- src/nlp/venv/ - src/nlp/venv/
######################## Python ####################### ######################## Python #######################
testing: #testing:
stage: python # stage: python
image: python:3.9.4-slim # image: python:3.9.4-slim
before_script: # before_script:
- apt-get update && apt-get install -y gcc python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev g++ # - apt-get update && apt-get install -y gcc python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev g++
- cd src/nlp # - cd src/nlp
- python -m venv venv # - python -m venv venv
- source venv/bin/activate # - source venv/bin/activate
- pip install --upgrade pip # - pip install --upgrade pip
script: # script:
- pip install -r requirements.txt # - pip install -r requirements.txt
- python -m unittest app/tests/test* # - python -m unittest app/tests/test*
linting: linting:
stage: python stage: python
......
...@@ -64,6 +64,28 @@ The OpenAPI doc is available under http://127.0.0.1:8000/docs ...@@ -64,6 +64,28 @@ The OpenAPI doc is available under http://127.0.0.1:8000/docs
## functionality ## functionality
### summaries ### summaries
<ins>Bert</ins>
- implemented in summary_bert.py
- the following parameters can be set:
text (string): The text that will be summarized.
max_length (int): The maximum number of characters allowed in the summary.
- The Bert strategy itself only accepts a ratio parameter. This ratio sets the
fraction of sentences kept in the summary, which does not imply the same
fraction of characters. So the ratio is calculated from max_length and lowered
again if the returned summary is longer than max_length characters.
How to call the method:
```
summary_bert = BertSummary()
summary = summary_bert.summarize(text, max_length)
```
An explanation on how the strategy works including code:
https://pypi.org/project/bert-extractive-summarizer/
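The ratio adjustment described above can be sketched without loading a BERT model. This is only an illustration under stated assumptions: `toy_ratio_summarizer` and `shrink_to_max_length` are hypothetical names that do not exist in the project, and the toy summarizer simply keeps the leading sentences instead of ranking them.

```python
from typing import Callable


def toy_ratio_summarizer(text: str, ratio: float) -> str:
    """Hypothetical stand-in for the BERT model: keeps the first N sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = max(1, int(len(sentences) * ratio))
    return ". ".join(sentences[:keep]) + "."


def shrink_to_max_length(
    summarize: Callable[[str, float], str],
    text: str,
    max_length: int,
    max_retries: int = 2,
) -> str:
    """Lower the sentence ratio until the summary fits max_length characters."""
    if not 0 < max_length <= len(text):
        raise ValueError("max_length must be > 0 and <= len(text)")
    # First guess: treat the character ratio as the sentence ratio.
    ratio = max_length / len(text)
    summary = summarize(text, ratio)
    for _ in range(max_retries):
        if len(summary) <= max_length:
            break
        # Scale the ratio down by how far the last summary overshot the limit.
        ratio = ratio * max_length / len(summary)
        summary = summarize(text, ratio)
    # Note: the result may still exceed max_length after the last retry.
    return summary
```

Calling `shrink_to_max_length(toy_ratio_summarizer, text, 25)` retries with a smaller ratio whenever the summary is too long. The sketch returns the last attempt as-is, so a caller would still need a fallback (for example a single sentence) if that attempt exceeds the limit.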
<ins>TFIDF</ins> <ins>TFIDF</ins>
...@@ -72,23 +94,23 @@ The class 'SummaryITDF' is implemented in 'summarization_with_strategy_TFIDF.py' ...@@ -72,23 +94,23 @@ The class 'SummaryITDF' is implemented in 'summarization_with_strategy_TFIDF.py'
There are three different ways to call the method. There are three different ways to call the method.
``` ```
summaryITDF = SummaryITDF() summary_tfidf = SummaryTFIDF()
summary = summaryITDF.summarize(text) summary = summary_tfidf.summarize(text)
summary = summaryITDF.summarize(text, language='en', strategy=2, percentOfText=50) summary = summary_tfidf.summarize(text, language='en', strategy=2, percentOfText=50)
summary = summaryITDF.summarize(text, strategy=3, numberOfSentences=5) summary = summary_tfidf.summarize(text, strategy=3, numberOfSentences=5)
``` ```
The following parameters can be set: The following parameters can be set:
text (string): The text that will be summarized. text (string): The text that will be summarized.
language (string): The language of the text ('en' for english, 'ger' for german). This is an optional parameter and the standard value is 'ger'. language (string): The language of the text ('en' for english, 'ger' for german). This is an optional parameter, and the standard value is 'ger'.
strategy (int): There are three different strategies to create a summary with the sentence strength. 1 returns all sentences with a sentence score greater than the average sentence score. 2 returns a percentage of the sentences with the highest sentence score. 3 returns a number of sentences with the highest sentence score. This is an optional parameter and the standard value is 1. strategy (int): There are three different strategies to create a summary with the sentence strength. 1 returns all sentences with a sentence score greater than the average sentence score. 2 returns a percentage of the sentences with the highest sentence score. 3 returns a number of sentences with the highest sentence score. This is an optional parameter and the standard value is 1.
percentOfText (int): If strategy 2 is used the percentage of the sentences can be set (e.g. 50 for 50 percent). This is an optional parameter and the standard value is 30. percent_of_text (int): If strategy 2 is used the percentage of the sentences can be set (e.g. 50 for 50 percent). This is an optional parameter, and the standard value is 30.
numberOfSentences (int): If strategy 3 is used the number of sentences can be set (e.g. 5 for 5 sentences). This is an optional parameter and the standard value is 3. number_of_sentences (int): If strategy 3 is used the number of sentences can be set (e.g. 5 for 5 sentences). This is an optional parameter, and the standard value is 3.
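As a rough illustration of the three strategies, the sketch below picks sentences from a list of precomputed sentence scores. It is an assumption-laden example, not the code of `SummaryTFIDF`: the function name `select_sentences` is hypothetical, and how the real class computes scores or rounds the percentage is not shown in this diff.

```python
from typing import List


def select_sentences(
    scores: List[float],
    strategy: int = 1,
    percent_of_text: int = 30,
    number_of_sentences: int = 3,
) -> List[int]:
    """Return the indices of the selected sentences in their original order."""
    if not scores:
        return []
    # Sentence indices ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if strategy == 1:
        # All sentences scoring above the average sentence score.
        average = sum(scores) / len(scores)
        chosen = [i for i in ranked if scores[i] > average]
    elif strategy == 2:
        # A percentage of the sentences (rounded down, at least one).
        count = max(1, len(scores) * percent_of_text // 100)
        chosen = ranked[:count]
    elif strategy == 3:
        # A fixed number of top-scoring sentences.
        chosen = ranked[:number_of_sentences]
    else:
        raise ValueError("strategy must be 1, 2 or 3")
    return sorted(chosen)
```

The defaults mirror the documented standard values (strategy 1, 30 percent, 3 sentences). Returning indices in original order keeps the summary readable, which is a design choice of this sketch.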
<ins>Sentence Embedding</ins> <ins>Sentence Embedding</ins>
...@@ -96,8 +118,8 @@ The class 'SentenceEmbeddingSummarizer' is implemented in 'summary_sentence_embe ...@@ -96,8 +118,8 @@ The class 'SentenceEmbeddingSummarizer' is implemented in 'summary_sentence_embe
How to call the method: How to call the method:
``` ```
summarySentenceEmbedding = SentenceEmbeddingSummarizer() summary_sentence_embedding = SentenceEmbeddingSummarizer()
summary = summarySentenceEmbedding.summarize(text) summary = summary_sentence_embedding.summarize(text)
``` ```
The following parameters can be set: The following parameters can be set:
...@@ -112,8 +134,8 @@ The class 'WordEmbeddingSummarizer' is implemented in 'summary_word_embedding.py ...@@ -112,8 +134,8 @@ The class 'WordEmbeddingSummarizer' is implemented in 'summary_word_embedding.py
How to call the method: How to call the method:
``` ```
summaryWordEmbedding = WordEmbeddingSummarizer() summary_word_embedding = WordEmbeddingSummarizer()
summary = summaryWordEmbedding.summarize(text) summary = summary_word_embedding.summarize(text)
``` ```
The following parameters can be set: The following parameters can be set:
......
...@@ -5,6 +5,7 @@ from app.utilities import generator ...@@ -5,6 +5,7 @@ from app.utilities import generator
from app.summary.simple_spacy_summarizer import SimpleSpacySummarizer from app.summary.simple_spacy_summarizer import SimpleSpacySummarizer
from app.summary.summary_sentence_embedding import SentenceEmbeddingSummarizer from app.summary.summary_sentence_embedding import SentenceEmbeddingSummarizer
from app.summary.summarization_with_strategy_TFIDF import SummaryTFIDF from app.summary.summarization_with_strategy_TFIDF import SummaryTFIDF
from app.summary.summary_bert import BertSummary
app = FastAPI( app = FastAPI(
title="IntentFinder: NLP-API", title="IntentFinder: NLP-API",
...@@ -24,6 +25,7 @@ strategies = [ ...@@ -24,6 +25,7 @@ strategies = [
SimpleSpacySummarizer(), SimpleSpacySummarizer(),
SentenceEmbeddingSummarizer(), SentenceEmbeddingSummarizer(),
SummaryTFIDF(), SummaryTFIDF(),
BertSummary(),
WordEmbeddingSummarizer()] WordEmbeddingSummarizer()]
......
from app.summary.summary_strategy_interface import ISummaryStrategy
from summarizer import Summarizer
from transformers import BertModel, BertTokenizer


class BertSummary(ISummaryStrategy):
    def __init__(self):
        self._id = "bert"
        super().__init__()

    @property
    def id(self):
        return self._id

    def summarize(self, text: str, max_length: int) -> str:
        """
        This method returns a summary for the given text
        :param text: str: text to create summary from
        :param max_length: int: maximum number of characters
        :returns: summary: str: generated summary
        """
        # The summarizer has no option for a maximum number of characters,
        # but it accepts a ratio, so the ratio is derived from max_length.
        length = len(text)
        # Validate before dividing so that an empty text cannot cause a
        # ZeroDivisionError.
        if max_length <= 0:
            raise ValueError("max_length must be greater than 0")
        if max_length > length:
            raise ValueError("max_length should not be longer than the "
                             "original text")
        # fraction of sentences to keep from the original text
        ratio = max_length / length
        # The pretrained model could be 'bert-base-multilingual-cased' or
        # 'bert-base-german-cased'. The multilingual model should include
        # German but summarizes text differently, so a custom summarizer is
        # created with the German model and tokenizer.
        bert_german_model = BertModel.from_pretrained(
            'bert-base-german-cased', output_hidden_states=True)
        bert_german_tokenizer = BertTokenizer.from_pretrained(
            'bert-base-german-cased')
        model = Summarizer(custom_model=bert_german_model,
                           custom_tokenizer=bert_german_tokenizer)
        i = 0
        while True:
            summary = model(text, ratio, use_first=False)
            # The ratio applies to sentences, so the summary may still be
            # longer than max_length characters.
            if len(summary) <= max_length:
                return summary
            if i > 1:
                # After two retries, fall back to a single sentence.
                return model(text, num_sentences=1, use_first=False)
            ratio = ratio * max_length / len(summary)
            i += 1
...@@ -75,4 +75,5 @@ sentence-transformers==1.1.0 ...@@ -75,4 +75,5 @@ sentence-transformers==1.1.0
nltk==3.6.2 nltk==3.6.2
pandas==1.2.4 pandas==1.2.4
scipy==1.6.2 scipy==1.6.2
protobuf==3.16.0 protobuf==3.16.0
\ No newline at end of file summarizer~=0.0.7
\ No newline at end of file