Adam Wojdyla 2023-03-16 02:29:34 +01:00
parent 0e913cb3e9
commit 893f4409e5
3 changed files with 95 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1 @@
.DS_Store


@@ -0,0 +1,40 @@
# Corpus
**EUR LEX**: a collection of all European Union regulations
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0EGYWY
## Statistics
**Number of lines** (`wc -l`): 737767
**File size (before compression)**: 666 MB
**File size (after compression)**: 147 MB
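
The line count and the uncompressed size can also be checked from Python. The sketch below is only an illustration: `count_lines` and `size_mb` are hypothetical helpers that are not part of the repository, and it assumes that 1 MB means 1,000,000 bytes.

```python
import os
import tempfile


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes, as `wc -l` does, reading the file in chunks."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total


def size_mb(path):
    """File size in megabytes (assumption: 1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


# Tiny demonstration on a temporary file with three lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
demo_lines = count_lines(tmp.name)
demo_size = size_mb(tmp.name)
os.remove(tmp.name)
```

Reading in fixed-size chunks keeps memory use flat, which matters for a corpus file of several hundred megabytes.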
## Cleaning script
- Splitting long paragraphs into shorter sentences based on keywords or markers.
- Removing selected patterns, e.g. "(3)", "8.4.2", "p. 4."
- Removing paragraphs that contain 5 or more non-ASCII characters; any remaining non-ASCII characters are then dropped.
- Removing duplicate lines.
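
The steps above can be sketched on a small input as follows. This is a minimal sketch and not the code in clean.py: `SPLIT_RE`, `MARKER_RE` and `split_and_strip` are hypothetical names, the patterns are simplified, the standard-library `re` module is used, and duplicates are removed in first-seen order rather than with a `set`.

```python
import re

# Simplified, hypothetical variants of the patterns used for cleaning.
# Split on whitespace that follows a full stop and precedes "(3)", "4." or "Article 5".
SPLIT_RE = re.compile(r"(?<=\.)\s+(?=\(\d+\)|\d+\.|Article \d+)")
# Markers to strip: "(3) " style references and "1. " / "8.4.2. " style numbering.
MARKER_RE = re.compile(r"\(\d+\)\s+|(?:\d+\.)+\s")


def split_and_strip(text):
    """Split text at numbering markers, strip the markers, and drop duplicates."""
    pieces = [MARKER_RE.sub("", piece).strip() for piece in SPLIT_RE.split(text)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(piece for piece in pieces if piece))


sample = (
    "(1) Member States shall inform the Commission. "
    "(2) The Commission shall publish the list. "
    "(1) Member States shall inform the Commission."
)
result = split_and_strip(sample)
print(result)
```

For this sample, `result` holds two sentences: the repeated first sentence is kept once and the "(1)" and "(2)" markers are gone.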
## Running the script
Run the Python script clean.py. The input is a CSV file with an `act_raw_text` column. The script produces the file out-merged.txt, which contains the pre-cleaned corpus.
```python clean.py --filepath {path_to_file}```
## Head
```From 15 February to 15 April each year, it shall be prohibited to use bottom trawls, longlines and static nets within an area enclosed by sequentially joining with rhumb lines the following coordinates, which shall be measured according to the WGS84 system: 60 58.76 N, 27 27.32 W 60 56.02 N, 27 31.16 W 60 59.76 N, 27 43.48 W 61 03.00 N, 27 39.41 W 60 58.76 N, 27 27.32 W.
It shall be permitted to fish within the area defined in point 7.1 with: static nets and/or hand lines; demersal trawls, Danish seines or other similar towed nets, with a mesh size greater than 80 mm.
On the basis of the best available scientific advice a Member State may, for vessels flying its flag, put in place mitigation measures or restrictions on the use of certain gear. Such measures shall minimise, and where possible eliminate, the catches of the species referred to in paragraph 1 of this Article and shall be compatible with the objectives set out in Article 2 of Regulation (EU) No 1380/2013 and be at least as stringent as technical measures applicable under Union law.
The catch percentages referred to in paragraph 2 may be calculated on the basis of one or more representative samples.
The application of the conditions in relation to the mesh size specifications set out in Article 27 and in Part B of Annexes V to XI shall not lead to a deterioration of selectivity standards, in particular in terms of an increase in the catches of juveniles, existing on 14 August 2019, and shall aim at achieving the objectives and targets set out in Articles 3 and 4.
A joint recommendation submitted for the purpose of adopting the measures referred to in Article 15(2), in relation to moving-on provisions, shall include: (a) the species and threshold levels that trigger an obligation to move; (b) the distance by which a vessel is to move away from its previous fishing position.
25.7.2019 EN Official Journal of the European Union L 198/105 REGULATION (EU) 2019/1241 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 June 2019 on the conservation of fisheries resources and the protection of marine ecosystems through technical measures, amending Council Regulations (EC) No 1967/2006, (EC) No 1224/2009 and Regulations (EU) No 1380/2013, (EU) 2016/1139, (EU) 2018/973, (EU) 2019/472 and (EU) 2019/1022 of the European Parliament and of the Council, and repealing Council Regulations (EC) No 894/97, (EC) No 850/98, (EC) No 2549/2000, (EC) No 254/2002, (EC) No 812/2004 and (EC) No 2187/2005 THE EUROPEAN PARLIAMENT AND THE COUNCIL OF THE EUROPEAN UNION, Having regard to the Treaty on the Functioning of the European Union, and in particular Article 43 thereof, Having regard to the proposal from the European Commission, After transmission of the draft legislative act to the national parliaments, Having regard to the opinion of the European Economic and Social Committee (1), Having regard to the opinion of the Committee of the Regions (2), Acting in accordance with the ordinary legislative procedure (3), Whereas: Regulation (EU) No 1380/2013 of the European Parliament and of the Council establishes a Common Fisheries Policy (CFP) for the conservation and sustainable exploitation of fisheries resources.
Where the direct restocking or transplantation is carried out in the waters of another Member State or Member States, the Commission and all those Member States shall be informed, at least 20 calendar days in advance, of the intention to conduct such fishing operations. CHAPTER V CONDITIONS IN RELATION TO MESH SIZE SPECIFICATIONS Article 27 Conditions in relation to mesh size specifications The catch percentages referred to in the Annexes V to VIII shall mean the maximum percentage of species allowed so as to qualify for the specific mesh sizes set out in those Annexes. Such percentages shall be without prejudice to the obligation to land catches in Article 15 of Regulation (EU) No 1380/2013.
It shall be prohibited to have on board or set more than 2 500 m of combined gillnets and trammel nets and 6 000 m of any gillnet, entangling net or trammel net.
Article 11 Catches of marine mammals, seabirds and marine reptiles The catching, retention on board, transhipment or landing of marine mammals or marine reptiles referred to in Annexes II and IV to Directive 92/43/EEC and of species of seabirds covered by Directive 2009/147/EC shall be prohibited.```


@@ -0,0 +1,54 @@
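import argparse

import pandas
import regex as re  # third-party "regex" module: supports \p{L}, unlike the stdlib "re"

parser = argparse.ArgumentParser()
parser.add_argument("--filepath", required=True, help="path to the input CSV file")
args = parser.parse_args()
FILE_PATH = args.filepath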
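def is_letter_sentence(text):
    # True when letters outnumber the non-letter, non-space characters by more than 2:1.
    return len(re.findall(r"\p{L}", text)) > len(re.findall(r"[^\p{L}\s]", text)) * 2

def is_asci(text):
    # Count characters outside the printable ASCII range (space through tilde).
    nonasci = len(re.findall(r"[^ -~]", text))
    return nonasci < 5

def filter_line(line):
    return line is not None and len(line) > 30 and is_letter_sentence(line) and is_asci(line)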
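def clean_with_regex(text):
    # Non-ASCII characters are dropped up front, so is_asci() in filter_line sees already-stripped text.
    text = str(text).encode("ascii", "ignore").decode("utf-8")
    # Split on whitespace that follows a sentence end and precedes "(3)", "4." or "Article 5".
    regex_pattern = r"(?<=..\.)(\s+)(?=\(\d+\))|(?<=..\.)(\s+)(?=\d\.)|(?<=..\.)(\s+)(?=Article \d+)"
    try:
        out = re.split(regex_pattern, text)
    except TypeError:
        return []
    # re.split also returns the captured separators (or None); filter_line discards them.
    out = [item for item in out if filter_line(item)]
    # Strip reference markers such as "(3)" and numbering such as "1. ".
    out = [re.sub(r"(?<=\d)(\(\d+\))(?=\s+)|(\(\d+\)\s+)|(\d+\.)+\s", "", item) for item in out]
    if out:
        out.pop()  # the last remaining fragment is always discarded
    return out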
def print_text(text, sort=False):
    # Debug helper: print each line with its word count and length, optionally sorted by word count.
    if sort:
        text = sorted(text, key=lambda item: len(item.split(" ")))
    for i, line in enumerate(text):
        print(f'-----------------LINE {i}, words: {len(line.split(" "))}, length: {len(line)}-----------------')
        print(line)
def save_to_file(paragraph_list, file_name):
    # Append mode: lines from successive calls accumulate in the same file.
    with open(file_name, 'a') as f:
        for line in paragraph_list:
            f.write("%s\n" % line.strip())
print(f"Cleaning file: {FILE_PATH}")
csv_file = pandas.read_csv(FILE_PATH)
file_directives = csv_file['act_raw_text']
for directive in file_directives:
    paragraphs = clean_with_regex(directive)
    # Deduplicate within a single act; set() does not preserve order.
    paragraphs = [*set(paragraphs)]
    save_to_file(paragraphs, 'out-merged.txt')