Mikołaj Pokrywka 3dba635416 all done 2023-03-14 18:28:34 +01:00
News-Commentary-v16.xz all done 2023-03-14 18:27:43 +01:00
README.md all done 2023-03-14 18:28:34 +01:00
filter.py all done 2023-03-14 18:27:43 +01:00
filter_config.yaml all done 2023-03-14 18:27:43 +01:00

README.md

Corpus

Filtered News-Commentary v16 corpus: News-Commentary-v16.xz

Statistics

After filtering:

Number of lines

wc -l

632985

Size

83.0 MiB

Filtering

The opusfilter library was used:

opusfilter filter_config.yaml

  1. Duplicate removal (2.40% of lines were duplicates)
  2. The following filters were applied:
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 300

        - LengthRatioFilter:
            unit: word
            threshold: 3

        - LongWordFilter:
            threshold: 40
        
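The three filters configured above can be approximated in plain Python to make their effect concrete. This is a minimal sketch, not opusfilter's own code: the function names are hypothetical, whitespace splitting stands in for `unit: word`, and the exact boundary behaviour (strict versus inclusive comparisons, handling of empty segments) is an assumption.

```python
def word_lengths(segments):
    """Length of each segment in whitespace-separated words."""
    return [len(segment.split()) for segment in segments]


def length_filter(segments, min_length=1, max_length=300):
    # Keep only if every segment has between min_length and max_length words.
    return all(min_length <= n <= max_length for n in word_lengths(segments))


def length_ratio_filter(segments, threshold=3):
    # Keep only if the longest segment is less than `threshold` times
    # as long as the shortest one.
    lengths = word_lengths(segments)
    if len(lengths) < 2:
        return True  # assumption: nothing to compare for a single segment
    shortest, longest = min(lengths), max(lengths)
    if shortest == 0:
        return longest == 0  # assumption: avoid division by zero
    return longest / shortest < threshold


def long_word_filter(segments, threshold=40):
    # Keep only if no single word reaches `threshold` characters.
    return all(len(word) < threshold
               for segment in segments
               for word in segment.split())


def keep(segments):
    """Apply the three checks with the thresholds from filter_config.yaml."""
    return (length_filter(segments)
            and length_ratio_filter(segments)
            and long_word_filter(segments))
```

For example, `keep(["But progress is nonetheless being made."])` passes all three checks, while a line containing a 40-character token would be rejected by the long-word check.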

The filter.py script was then applied. It:

  1. Removes lines that do not contain a single Unicode letter
  2. Removes lines consisting of a single word that is a link or is alphanumeric
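The two rules above could be sketched as follows. The contents of filter.py are not shown here, so this is an illustrative reconstruction: the function names and the URL pattern are assumptions, and "alphanumeric" is interpreted as Python's `str.isalnum()`.

```python
import re

# Assumed notion of a "link": a single token starting with http(s):// or www.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def has_letter(line):
    """True if the line contains at least one Unicode letter."""
    return any(ch.isalpha() for ch in line)


def is_single_token_noise(line):
    """True if the line is one word that is a link or purely alphanumeric."""
    tokens = line.split()
    if len(tokens) != 1:
        return False
    token = tokens[0]
    return bool(URL_RE.match(token)) or token.isalnum()


def keep_line(line):
    # Rule 1: drop lines without any letter.
    # Rule 2: drop one-word lines that are links or alphanumeric.
    return has_letter(line) and not is_single_token_noise(line)
```

Under this reading, a one-word line with trailing punctuation such as `Indeed.` is kept, because `str.isalnum()` is false for it, while `Untitled` or `http://example.com/page` is dropped.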

Before filtering

Number of lines

wc -l

648886

Size

84.8 MiB
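As a quick check on the line counts reported in this README, the overall reduction can be computed directly. The variable names are illustrative only.

```python
lines_before = 648_886  # wc -l before filtering
lines_after = 632_985   # wc -l after filtering

removed = lines_before - lines_after
removed_pct = 100 * removed / lines_before

print(f"{removed} lines removed ({removed_pct:.2f}%)")
# prints "15901 lines removed (2.45%)"
```

The total reduction of about 2.45% is only slightly above the 2.40% duplicate rate reported by opusfilter, which suggests that deduplication accounts for most of the removed lines.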

10 random example sentences from the corpus:

lm shuf corpora.eng | head

     1  The crash is followed by a flight to safety, which is followed by a steep fall in the velocity of money as investors hoard cash.
     2  In this sense, the pandemic represents a unique opportunity to advance European integration like never before.
     3  As depositors flee from a weak bank, they can destroy the banks liquidity.
     4  But progress is nonetheless being made.
     5  Critics of the growth model argue that it is imperative to redistribute income and wealth as soon as possible.
     6  All told, countries that have pursued greater economic openness have enjoyed improved nutritional, health, and educational outcomes, as well as higher productivity and incomes.
     7  The periods around World War I and World War II are routinely overlooked in discussions that focus on deregulation of capital markets since the 1980s.
     8  The Greek people deserve some real choices in the near future.
     9  LONDON  The outbreak of the Zika virus, like Ebola before it, has highlighted the risk that infectious diseases can pose to the health of entire countries  and the importance of vaccines to the fight against fast-moving epidemics.
    10  Controls may even require curbing individual freedoms, like accessing hospitals or getting on airplanes.
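A similar random sample can be drawn in Python without decompressing the corpus to disk first. This is a sketch under the assumption that News-Commentary-v16.xz is a plain xz-compressed UTF-8 text file with one sentence per line; the function names are hypothetical, and reservoir sampling is used here in place of `shuf` so that the whole file never has to be held in memory.

```python
import lzma
import random


def sample_lines(lines, k=10, seed=None):
    """Reservoir-sample up to k items from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Replace a kept item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


def sample_from_xz(path, k=10, seed=None):
    """Sample k lines from an xz-compressed text file (assumed format)."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        stripped = (line.rstrip("\n") for line in handle)
        return sample_lines(stripped, k=k, seed=seed)
```

Calling `sample_from_xz("News-Commentary-v16.xz")` would then return ten sentences; passing a `seed` makes the sample reproducible.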