paper-cutter/{{cookiecutter.paper_repo_name}}/helpers/get-sentences.sh
Filip Gralinski bc7b4fdf89 Generalize script for extracting sentences
Can extract text using detex now (as an option).
2022-06-06 19:54:30 +02:00


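#!/bin/bash
# Usage: get-sentences.sh INPUT_FILE [from-tex]
# Prints the sentences found in INPUT_FILE, one per line.
input_file="$1"
method="$2"

# Extract plain text, either from TeX source (via detex) or from a PDF.
extract_text() {
    if [[ "$method" == "from-tex" ]]
    then
        # Drop blank lines and lines containing 'unsrt', then trim surrounding whitespace.
        detex "$input_file" | grep -E '\S' | grep -v 'unsrt' | perl -pe 's/^\s+| +$//g'
    else
        # Strip the reference list, then join all lines into one space-separated line.
        bash helpers/pdf-to-plain-text.sh "$input_file" | perl helpers/strip-references.pl | perl -pe 'chomp $_; $_.=" "'
    fi
}

# Segment the text into sentences with syntok and drop empty lines.
extract_text | python3 -m syntok.segmenter | grep -E '\S'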