# Corpus
News-Commentary v16: a filtered version of the News-Commentary-v16 corpus.
# Statistics
## After filtering
### Number of lines
`wc -l`
### Size
83.0 MiB
## Filtering
### The opusfilter library was used
`opusfilter filter_config.yaml`
1. Duplicate removal (2.40% duplicate lines)
2. The following filters were applied:
   - LengthFilter:
     - unit: word
     - min_length: 1
     - max_length: 300
   - LengthRatioFilter:
     - unit: word
     - threshold: 3
   - LongWordFilter:
     - threshold: 40
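
As a rough illustration of what these steps check, the sketch below reimplements the duplicate removal and the three filters in plain Python. It does not call opusfilter. The function names (`dedupe`, `length_ok`, `length_ratio_ok`, `long_word_ok`) are hypothetical, and the boundary handling (for example, whether a threshold is inclusive) is an assumption that may differ from the library.

```python
from typing import Iterable, List, Sequence, Tuple

Segments = Tuple[str, ...]  # one entry per language side (a 1-tuple for monolingual data)


def dedupe(rows: Iterable[Segments]) -> Tuple[List[Segments], float]:
    """Keep the first occurrence of each row; also return the duplicate share in percent."""
    seen = set()
    kept: List[Segments] = []
    total = 0
    for row in rows:
        total += 1
        if row not in seen:
            seen.add(row)
            kept.append(row)
    dup_pct = 100.0 * (total - len(kept)) / total if total else 0.0
    return kept, dup_pct


def length_ok(segments: Sequence[str], min_length: int = 1, max_length: int = 300) -> bool:
    # LengthFilter with unit: word -> every segment has min_length..max_length words.
    return all(min_length <= len(s.split()) <= max_length for s in segments)


def length_ratio_ok(segments: Sequence[str], threshold: float = 3) -> bool:
    # LengthRatioFilter with unit: word -> longest / shortest word count below threshold.
    lengths = [len(s.split()) for s in segments]
    if min(lengths) == 0:
        return False  # assumption: an empty segment always fails the ratio check
    return max(lengths) / min(lengths) < threshold


def long_word_ok(segments: Sequence[str], threshold: int = 40) -> bool:
    # LongWordFilter -> no whitespace-separated word reaches threshold characters.
    return all(len(word) < threshold for s in segments for word in s.split())


def passes_all(segments: Sequence[str]) -> bool:
    return length_ok(segments) and length_ratio_ok(segments) and long_word_ok(segments)
```

A row is kept only if `passes_all` returns `True`; `dedupe` reports the kind of percentage quoted in step 1.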
### The filter.py script was used, which:
1. Removes lines that contain no Unicode letter
2. Removes lines consisting of a single word that is a link or is alphanumeric
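
The source of filter.py is not reproduced here, so the following is only a minimal sketch of the two rules as described. It assumes that "Unicode letter" means `str.isalpha()`, that "alphanumeric" means `str.isalnum()`, and that a link is a token starting with `http://`, `https://`, or `www.`. All function names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: a "link" is a single token that starts with a common URL prefix.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def has_letter(line: str) -> bool:
    """Rule 1: the line must contain at least one Unicode letter."""
    return any(ch.isalpha() for ch in line)


def is_single_link_or_alnum(line: str) -> bool:
    """Rule 2: the line is exactly one word, and that word is a link or alphanumeric."""
    words = line.split()
    if len(words) != 1:
        return False
    word = words[0]
    return bool(URL_RE.match(word)) or word.isalnum()


def keep_line(line: str) -> bool:
    return has_letter(line) and not is_single_link_or_alnum(line)


def filter_lines(lines: Iterable[str]) -> List[str]:
    return [line for line in lines if keep_line(line)]
```

Under this reading, any one-word line made only of letters and digits is dropped, while a one-word line with punctuation (such as `Hello.`) is kept; the real script may draw that line differently.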
## Before filtering
### Number of lines
`wc -l`
### Size
84.8 MiB
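
For reference, the two statistics reported here (line count as given by `wc -l`, and size in MiB) can be reproduced with a short helper. This is a sketch only; `corpus_stats` is a hypothetical name, and the path you pass to it is your own.

```python
import os
from typing import Tuple


def corpus_stats(path: str) -> Tuple[int, float]:
    """Return (line_count, size_in_mib) for a file.

    Like `wc -l`, the line count is the number of newline bytes in the file.
    """
    line_count = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large corpora are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            line_count += chunk.count(b"\n")
    size_mib = os.path.getsize(path) / (1024 * 1024)
    return line_count, size_mib
```

Formatting the second value with `f"{size_mib:.1f} MiB"` gives figures in the same form as the sizes listed in this section.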
### 10 random sample sentences from the corpus:
`lm shuf corpora.eng | head`
1. The crash is followed by a flight to safety, which is followed by a steep fall in the velocity of money as investors hoard cash.
2. In this sense, the pandemic represents a unique opportunity to advance European integration like never before.
3. As depositors flee from a weak bank, they can destroy the bank’s liquidity.
4. But progress is nonetheless being made.
5. Critics of the growth model argue that it is imperative to redistribute income and wealth as soon as possible.
6. All told, countries that have pursued greater economic openness have enjoyed improved nutritional, health, and educational outcomes, as well as higher productivity and incomes.
7. The periods around World War I and World War II are routinely overlooked in discussions that focus on deregulation of capital markets since the 1980s.
8. The Greek people deserve some real choices in the near future.
9. LONDON – The outbreak of the Zika virus, like Ebola before it, has highlighted the risk that infectious diseases can pose to the health of entire countries – and the importance of vaccines to the fight against fast-moving epidemics.
10. Controls may even require curbing individual freedoms, like accessing hospitals or getting on airplanes.