
Text corpus

Based on the latest English Wikipedia articles dump (enwiki-latest-pages-articles.xml.bz2); created with the wiki.sh script.

Code

  • include only the article text between the marks
  • exclude lines that contain no alphabet letters
  • collapse excessive whitespace
  • remove special characters
  • remove unintentional empty lines
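
The steps above can be sketched as a small shell filter. This is not the actual wiki.sh, only an illustration under assumptions: the function name clean_corpus is hypothetical, the "marks" are assumed to be the <text ...> and </text> tags of the XML dump, and "special characters" is assumed to mean anything other than letters, digits, whitespace and basic punctuation.

```shell
#!/bin/sh
# Hypothetical sketch of the cleaning steps; NOT the real wiki.sh.
# Reads the XML dump on stdin and writes cleaned text to stdout.
#
# Pipeline stages, in order:
#   1. keep only the lines from an opening mark to a closing mark
#      (assumed to be <text ...> and </text>); an article whose two
#      marks share a single line would need extra handling
#   2. strip the marks themselves (any XML-style tag)
#   3. remove special characters (assumed: everything except letters,
#      digits, whitespace and . , ; : ! ? -)
#   4. collapse runs of whitespace and trim both ends of each line
#   5. keep only lines that contain at least one letter, which also
#      drops lines left empty by the earlier stages
clean_corpus() {
  sed -n '/<text/,/<\/text>/p' |
    sed 's/<[^>]*>//g' |
    sed 's/[^[:alnum:][:space:].,;:!?-]//g' |
    sed 's/[[:space:]][[:space:]]*/ /g; s/^ //; s/ $//' |
    grep '[[:alpha:]]'
}
```

With such a function defined, the corpus could be produced with something like bzcat enwiki-latest-pages-articles.xml.bz2 | clean_corpus | bzip2 > enwiki-latest-corpus.txt.bz2.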

Stats

Number of lines

bzcat enwiki-latest-corpus.txt.bz2 | wc -l

Size

du -sh enwiki-latest-corpus.txt.bz2

Head of file

bzcat enwiki-latest-corpus.txt.bz2 | head -n 5

Random lines from file

bzcat enwiki-latest-corpus.txt.bz2 | shuf -n 5