Gutenberg poetry corpus

From: https://github.com/aparrish/gutenberg-poetry-corpus

!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52.2M  100 52.2M    0     0  4073k      0  0:00:13  0:00:13 --:--:-- 4693k

# Decompress and parse the newline-delimited JSON (one JSON object per line)
import gzip, json
raw_data = []
with gzip.open('gutenberg-poetry-v001.ndjson.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        raw_data.append(json.loads(line))

raw_data[100:110]

[{'s': 'Through their palisades of pine-trees,', 'gid': '19'},
 {'s': 'And the thunder in the mountains,', 'gid': '19'},
 {'s': 'Whose innumerable echoes', 'gid': '19'},
 {'s': 'Flap like eagles in their eyries;--', 'gid': '19'},
 {'s': 'Listen to these wild traditions,', 'gid': '19'},
 {'s': 'To this Song of Hiawatha!', 'gid': '19'},
 {'s': "Ye who love a nation's legends,", 'gid': '19'},
 {'s': 'Love the ballads of a people,', 'gid': '19'},
 {'s': 'That like voices from afar off', 'gid': '19'},
 {'s': 'Call to us to pause and listen,', 'gid': '19'}]

# Group the lines by 'gid' (the ID of the Project Gutenberg text each line comes from) so that consecutive lines can be joined back into whole texts
poems_dict = {}
for record in raw_data:  # 'record' avoids shadowing the built-in 'object'
    if record['gid'] not in poems_dict:
        poems_dict[record['gid']] = record['s']
    else:
        poems_dict[record['gid']] += f"\n{record['s']}"

print(poems_dict['19'][5000:5200])

here the tangled barberry-bushes
Hang their tufts of crimson berries
Over stone walls gray with mosses,
Pause by some neglected graveyard,
For a while to muse, and ponder
On a half-effaced inscription
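
The grouping step above can also be sketched with a `defaultdict` of lists that is joined once per text, which avoids growing a string one line at a time. This is only an illustrative alternative: the helper name `group_lines_by_gid` and the three-record sample are made up for the example, and the records are assumed to have the same `'s'` and `'gid'` keys as `raw_data`.

```python
from collections import defaultdict

def group_lines_by_gid(records):
    """Collect each record's line under its 'gid', then join once per text."""
    lines_by_gid = defaultdict(list)
    for record in records:
        lines_by_gid[record['gid']].append(record['s'])
    # One join per text instead of repeated string concatenation
    return {gid: '\n'.join(lines) for gid, lines in lines_by_gid.items()}

# Tiny made-up sample in the same shape as raw_data
sample = [
    {'s': 'first line', 'gid': '1'},
    {'s': 'other text', 'gid': '2'},
    {'s': 'second line', 'gid': '1'},
]
grouped = group_lines_by_gid(sample)
print(grouped['1'])
```

Calling `group_lines_by_gid(raw_data)` should give the same mapping as `poems_dict`, since lines keep their original order within each `gid`.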

# Count the texts and estimate their average length in words (an estimate, since words are split on whitespace and punctuation is still attached)
poems_count = len(poems_dict)
total_word_count = sum(len(v.split()) for v in poems_dict.values())
print('Total poems:', poems_count)
print('Average poem word length:', total_word_count // poems_count)

Total poems: 1191
Average poem word length: 18438

# Save the entire corpus as one .txt file
with open('gutenberg_poems.txt', 'w', encoding='utf-8') as f:
    for v in poems_dict.values():
        f.write(v + '\n')

# Listing every character other than ASCII letters and spaces, with counts, most frequent first
!grep -oE "[^a-zA-Z ]" gutenberg_poems.txt | sort | uniq -c | sort -k1 -nr

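# Normalizing some characters that should be kept
# (as displayed, the first substitution maps ';' to itself; the pattern was presumably a look-alike such as U+037E, the Greek question mark)
!sed -i 's/;/;/g' gutenberg_poems.txt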
!sed -i 's/…/\.\.\./g' gutenberg_poems.txt
!sed -i 's/[—─–]/-/g' gutenberg_poems.txt
!sed -i "s/[\`\’\‘\᾽\´\΄]/\'/g" gutenberg_poems.txt

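# Removing everything except ASCII letters, spaces and the punctuation , . ' ; ! : ? - (depending on the locale, accented letters may also be removed here, before the unidecode step below can transliterate them)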
!sed -i "s/[^a-zA-Z\ \,\.\'\;\!\:\?\-]//g" gutenberg_poems.txt

# Removing any remaining multiple spaces
!sed -i "s/\ \ */\ /g" gutenberg_poems.txt

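# Finally transliterating any remaining non-ASCII letters to ASCII, which strips diacritic marks (unidecode is the command-line tool from the Unidecode Python package)
!unidecode < gutenberg_poems.txt > gutenberg_poems_clean.txt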

# Now it looks a lot better - only alphabetic characters, spaces and chosen punctuation are kept
!grep -oE "[^a-zA-Z ]" gutenberg_poems_clean.txt | sort | uniq -c | sort -k1 -nr
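
For comparison, the sed and unidecode steps above can be approximated in pure Python. This is a rough sketch, not the pipeline used here: `normalize_line`, `REPLACEMENTS`, `DISALLOWED` and `MULTISPACE` are names invented for the example, and Unicode NFKD decomposition stands in for unidecode, so letters without a decomposition (for example ß or æ) are dropped rather than transliterated. Diacritics are stripped before the filter so that accented letters survive as their base letters.

```python
import re
import unicodedata

# Characters mapped to an ASCII equivalent before filtering
REPLACEMENTS = str.maketrans({
    '\u2026': '...',                                # horizontal ellipsis
    '\u2014': '-', '\u2500': '-', '\u2013': '-',    # em dash, box-drawing line, en dash
    '`': "'", '\u2019': "'", '\u2018': "'", '\u00b4': "'",  # quote-like marks
})
DISALLOWED = re.compile(r"[^a-zA-Z ,.';!:?-]")     # everything outside the kept set
MULTISPACE = re.compile(r" {2,}")                   # runs of two or more spaces

def normalize_line(line):
    line = line.translate(REPLACEMENTS)
    # Decompose accented letters and drop the combining marks
    # (an assumption: NFKD is used here instead of unidecode)
    decomposed = unicodedata.normalize('NFKD', line)
    line = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    line = DISALLOWED.sub('', line)
    return MULTISPACE.sub(' ', line)

print(normalize_line('Caf\u00e9 \u2014 na\u00efve\u2026  \u201cyes\u201d 42!'))
# prints: Cafe - naive... yes !
```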

# Some basic processed file statistics:
!echo -n "Lines: "
!wc -l < gutenberg_poems_clean.txt
!echo -n "Words: "
!wc -w < gutenberg_poems_clean.txt
!echo -n "Characters: "
!wc -c < gutenberg_poems_clean.txt
!echo -n "Size: "
!ls -lh gutenberg_poems_clean.txt | awk '{print $5}'

Lines: 3085117
Words: 21938739
Characters: 120840262
Size: 116M

# Some random lines from the file (the "Broken pipe" messages are harmless: head exits after 10 lines while shuf is still writing; shuf -n 10 gutenberg_poems_clean.txt would avoid them)
!cat gutenberg_poems_clean.txt | shuf | head -10

Sae aft around him flung,
A thing so dark that moments of pain
A mother and daughter stood together
He hath heathen gifts of silver and gold,
at secura quies et nescia fallere uita,
The grim dim thrones of the east Ep. .
Ah tamen illa scelus non lavat unda tuum!
A strong emotion on her cheek!
Byron sang its funeral dirge. But tenderness, and heroism, and
Which now upon my fingers thoughtfully
shuf: write error: Broken pipe
shuf: write error
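
The same kind of random sample can be drawn in Python without a pipe, and without holding the whole file in memory, using reservoir sampling. A minimal sketch follows; `sample_lines` is a hypothetical helper, and the demo runs on generated lines rather than on the corpus file.

```python
import random

def sample_lines(lines, k, seed=None):
    """Reservoir sampling: return up to k lines chosen uniformly from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (index + 1)
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir

# Demo on in-memory lines; for the corpus, pass an open file object instead
demo = [f'line {i}' for i in range(1000)]
picked = sample_lines(demo, 10, seed=0)
print(len(picked))
# prints: 10
```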

# Simple top 10 frequency histogram of letters (takes a while to run)
!grep -oE "\w" gutenberg_poems_clean.txt | sort | uniq -c | sort -k1 -nr | head

# Simple top 10 frequency histogram of words (takes a while to run; counts are case-sensitive, so "the" and "The" are tallied separately; the "Broken pipe" messages appear because head exits early and are harmless)
!cat gutenberg_poems_clean.txt | tr ' ' '\n' | sort | uniq -c | sort -k1 -nr | head

1110344 the
 526755 and
 477187 of
 367204 to
 309477 a
 294277 And
 283595 in
 243898 I
 198621 The
 182639 his
sort: write failed: 'standard output': Broken pipe
sort: write error
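
A case-insensitive version of the word count above could look like this in Python, using `collections.Counter`. It is a sketch under assumptions: `top_words` is a made-up helper, words are split on whitespace so punctuation stays attached as in the shell version, and the two demo lines are taken from the outputs shown earlier.

```python
from collections import Counter

def top_words(lines, n=10, lowercase=True):
    """Count whitespace-separated words, optionally folding case first."""
    counts = Counter()
    for line in lines:
        if lowercase:
            line = line.lower()
        counts.update(line.split())
    return counts.most_common(n)

demo_lines = ['The grim dim thrones of the east', 'And the thunder in the mountains,']
print(top_words(demo_lines, n=1))
# prints: [('the', 4)]
```

Passing an open file object for `gutenberg_poems_clean.txt` as `lines` would merge pairs such as "the"/"The" and "and"/"And" that appear separately in the listing above.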

# Compressing the file for uploading
!xz -v gutenberg_poems_clean.txt

gutenberg_poems_clean.txt (1/1)
  100 %        34.6 MiB / 115.2 MiB = 0.300   1.6 MiB/s       1:10
