Mikołaj Pokrywka 2023-03-14 18:27:43 +01:00
commit 0c249dab9a
4 changed files with 117 additions and 0 deletions

BIN
News-Commentary-v16.xz Normal file

Binary file not shown.

73
README.md Normal file

@ -0,0 +1,73 @@
# Corpus
Filtered News-Commentary v16 corpus: News-Commentary-v16.
# Statistics
## After filtering:
### Number of lines
`wc -l`
632985
### Size
83.0 MiB
## Filtering
### The opusfilter library was used
`opusfilter filter_config.yaml`
1. Duplicate removal (2.40% duplicate lines)
2. The following filters were applied:
```
filters:
  - LengthFilter:
      unit: word
      min_length: 1
      max_length: 300
  - LengthRatioFilter:
      unit: word
      threshold: 3
  - LongWordFilter:
      threshold: 40
```
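To make the effect of these settings concrete, here is a minimal plain-Python sketch of the length checks for a single monolingual line. It does not call opusfilter; the function name `passes_length_filters` is hypothetical, and the comparison rules (inclusive word-count bounds, words must be strictly shorter than the long-word threshold) are assumptions about how the filters behave. `LengthRatioFilter` is left out because it compares lengths between parallel segments, and this sketch assumes a single input.

```python
def passes_length_filters(line, min_length=1, max_length=300, long_word_threshold=40):
    """Approximate LengthFilter and LongWordFilter for one line (assumed semantics)."""
    words = line.split()
    # LengthFilter with unit "word": word count must lie within [min_length, max_length].
    if not (min_length <= len(words) <= max_length):
        return False
    # LongWordFilter: reject the line if any word reaches the threshold length.
    if any(len(word) >= long_word_threshold for word in words):
        return False
    return True


if __name__ == "__main__":
    print(passes_length_filters("But progress is nonetheless being made."))
    print(passes_length_filters("x" * 40))
```

With the values from the config, an empty line and a line containing a 40-character token would both be rejected under these assumptions.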
### The filter.py script was used, which:
1. Removes lines that do not contain a single Unicode letter
2. Removes lines consisting of a single word that is a link or is alphanumeric
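The two rules above can be sketched as a single predicate that is easy to test in isolation. This is an illustration, not the script itself: the name `keep_line` is hypothetical, and the letter check uses the standard library's `str.isalpha()` (true for characters in the Unicode letter categories) in place of the `regex` module's `\p{L}`, so that the sketch has no third-party dependency.

```python
def keep_line(line: str) -> bool:
    """Return True if the line survives both rules described above."""
    line = line.strip()
    # Rule 1: the line must contain at least one Unicode letter.
    if not any(ch.isalpha() for ch in line):
        return False
    # Rule 2: one-word lines are dropped if they look like a link
    # or consist only of alphanumeric characters.
    if len(line.split()) == 1:
        if "http" in line or line.isalnum():
            return False
    return True


if __name__ == "__main__":
    samples = ["12345 !!!", "https://example.com", "Hello", "The Greek people deserve some real choices."]
    for sample in samples:
        print(repr(sample), keep_line(sample))
```

Note that a single word followed by punctuation, such as `Hello.`, is kept, because `isalnum()` is false once the string contains a non-alphanumeric character.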
## Before filtering
### Number of lines
`wc -l`
648886
### Size
84.8 MiB
### 10 random example sentences from the corpus:
`lm shuf corpora.eng | head`
```
1 The crash is followed by a flight to safety, which is followed by a steep fall in the velocity of money as investors hoard cash.
2 In this sense, the pandemic represents a unique opportunity to advance European integration like never before.
3 As depositors flee from a weak bank, they can destroy the banks liquidity.
4 But progress is nonetheless being made.
5 Critics of the growth model argue that it is imperative to redistribute income and wealth as soon as possible.
6 All told, countries that have pursued greater economic openness have enjoyed improved nutritional, health, and educational outcomes, as well as higher productivity and incomes.
7 The periods around World War I and World War II are routinely overlooked in discussions that focus on deregulation of capital markets since the 1980s.
8 The Greek people deserve some real choices in the near future.
9 LONDON The outbreak of the Zika virus, like Ebola before it, has highlighted the risk that infectious diseases can pose to the health of entire countries and the importance of vaccines to the fight against fast-moving epidemics.
10 Controls may even require curbing individual freedoms, like accessing hospitals or getting on airplanes.
```

17
filter.py Normal file

@ -0,0 +1,17 @@
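import sys

import regex  # third-party module; supports Unicode property classes such as \p{L}

for line in sys.stdin:
    line = line.strip()
    sent_len = len(line.split())
    # Drop lines that contain no Unicode letter at all.
    if not regex.search(r'\p{L}', line):
        continue
    # Drop one-word lines that are links or purely alphanumeric tokens.
    if sent_len == 1:
        if 'http' in line:
            continue
        if line.isalnum():
            continue
    print(line)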

27
filter_config.yaml Normal file

@ -0,0 +1,27 @@
steps:
  - type: remove_duplicates
    parameters:
      inputs:
        - monolingual_data/corpora.eng.gz
      outputs:
        - monolingual_data/buff_deduped.eng.gz
  - type: filter
    parameters:
      inputs:
        - monolingual_data/buff_deduped.eng.gz
      outputs:
        - filtered_mono_data/filtered.eng.gz
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 300
        - LengthRatioFilter:
            unit: word
            threshold: 3
        - LongWordFilter:
            threshold: 40