Text corpus
Based on the latest English Wikipedia articles dump, enwiki-latest-pages-articles.xml.bz2, and created with the wiki.sh script.
Code
- keep only the article text found between the text marks
- drop lines that contain no alphabetic letters
- collapse excessive whitespace
- remove special characters
- remove unintended empty lines
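
The steps above could be sketched as a single shell filter. This is only an illustration, not the contents of wiki.sh: the function name `clean_corpus` is hypothetical, the "marks" are assumed to be the `<text ...>` and `</text>` tags of the dump XML, and the set of characters kept in the "special characters" step (ASCII letters, digits, whitespace, and `.,-`) is an assumption.

```shell
#!/bin/sh
# Hypothetical filter (not the actual wiki.sh): reads dump XML on stdin
# and writes cleaned corpus lines on stdout.
clean_corpus() {
  # Keep only lines from an opening <text ...> tag through the closing
  # </text> tag (assumed markers); also handles both tags on one line.
  awk '
    /<text/    { inside = 1 }
    inside     { print }
    /<\/text>/ { inside = 0 }
  ' |
  # Strip the marker tags themselves.
  sed -e 's/<text[^>]*>//' -e 's/<\/text>//' |
  # Replace every byte outside the assumed "allowed" set with a space.
  # LC_ALL=C makes this byte-wise, so non-ASCII letters are dropped too.
  LC_ALL=C tr -c '[:alnum:][:space:].,-' ' ' |
  # Collapse whitespace runs and trim both ends of each line.
  sed -e 's/[[:space:]][[:space:]]*/ /g' -e 's/^ //' -e 's/ $//' |
  # Drop lines without letters; this also removes empty lines.
  grep '[[:alpha:]]'
}
```

Under the same assumptions, it might be used as `bzcat enwiki-latest-pages-articles.xml.bz2 | clean_corpus | bzip2 > enwiki-latest-corpus.txt.bz2`. Filtering on letters last means lines left empty by the earlier steps are removed in the same pass.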
Stats
Number of lines
bzcat enwiki-latest-corpus.txt.bz2 | wc -l
Size
du -sh enwiki-latest-corpus.txt.bz2
Head of file
bzcat enwiki-latest-corpus.txt.bz2 | head -n 5
Random lines from file
bzcat enwiki-latest-corpus.txt.bz2 | shuf -n 5