# Text corpus

Based on the latest English Wikipedia articles dump, `enwiki-latest-pages-articles.xml.bz2`.
Created with the `wiki.sh` script.

# Code
- include only article text between `<text>` tags
- exclude lines without alphabetic letters
- collapse excessive whitespace
- remove special characters
- remove unintended empty lines
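The steps above could be sketched roughly as the following filter. This is only an illustration, not the contents of `wiki.sh`: the function name `clean_wiki_text` is hypothetical, and the set of characters kept (ASCII letters, digits, whitespace, `.` and `,`) is an assumption, since the list does not say which characters count as special.

```bash
# Hypothetical sketch of the cleaning steps; NOT the actual wiki.sh.
# Reads a MediaWiki XML dump on stdin, writes cleaned text lines to stdout.
clean_wiki_text() {
  # Keep only lines from a <text ...> tag through </text>, and strip the
  # tags themselves. A one-line <text>...</text> element is handled too.
  awk '
    /<text/    { inside = 1 }
    inside     { line = $0; gsub(/<\/?text[^>]*>/, "", line); print line }
    /<\/text>/ { inside = 0 }
  ' |
  # Exclude lines that contain no (ASCII) alphabetic letter.
  LC_ALL=C grep '[[:alpha:]]' |
  # Assumption: anything outside letters, digits, whitespace, "." and ","
  # is a special character and is replaced by a space. Whitespace runs are
  # then collapsed, line edges trimmed, and any empty lines dropped.
  LC_ALL=C sed -e 's/[^[:alnum:][:space:].,]/ /g' \
               -e 's/[[:space:]]\{1,\}/ /g' \
               -e 's/^ //' -e 's/ $//' \
               -e '/^$/d'
}
```

Under these assumptions, a run such as `bzcat enwiki-latest-pages-articles.xml.bz2 | clean_wiki_text | bzip2 > enwiki-latest-corpus.txt.bz2` would produce a corpus file with the name used in the Stats section. Special characters are replaced before whitespace is collapsed so that the replacement spaces do not leave double gaps.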

# Stats

## Number of lines
```bash
bzcat enwiki-latest-corpus.txt.bz2 | wc -l
```

## Size
```bash
du -sh enwiki-latest-corpus.txt.bz2
```

## Head of file
```bash
bzcat enwiki-latest-corpus.txt.bz2 | head -n 5
```

## Random lines from file
```bash
bzcat enwiki-latest-corpus.txt.bz2 | shuf -n 5
```