kubapok/modelowanie-jezykowe-aitech-cw


Jakub Pokrywka 68241366ba points 40b b-> 70 in 2

2022-03-14 10:31:57 +01:00


Language Modeling

2. Language [exercises]

Jakub Pokrywka (2022)


import random
import plotly.express as px
import numpy as np
import pandas as pd
import nltk

https://github.com/sdadas/polish-nlp-resources

ps = nltk.stem.PorterStemmer()

for w in ["program", "programs", "programmer", "programming", "programmers"]:
    print(w, " : ", ps.stem(w))

program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/kuba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/kuba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

text = """Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library."""
nltk.tokenize.word_tokenize(text)

['Python',
 'is',
 'dynamically-typed',
 'and',
 'garbage-collected',
 '.',
 'It',
 'supports',
 'multiple',
 'programming',
 'paradigms',
 ',',
 'including',
 'structured',
 '(',
 'particularly',
 ',',
 'procedural',
 ')',
 ',',
 'object-oriented',
 'and',
 'functional',
 'programming',
 '.',
 'It',
 'is',
 'often',
 'described',
 'as',
 'a',
 '``',
 'batteries',
 'included',
 "''",
 'language',
 'due',
 'to',
 'its',
 'comprehensive',
 'standard',
 'library',
 '.']
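A token list like the one above can feed the word-level counts requested in Task 1. A minimal sketch using only the standard library, assuming a short hand-made token list (`tokens` is a hypothetical sample) and treating any token without a letter or digit as punctuation:

```python
import statistics

# Hypothetical sample: a few tokens in the shape word_tokenize returns.
tokens = ["Python", "is", "dynamically-typed", "and", "garbage-collected", ".",
          "It", "is", "a", "language", "."]

# Keep tokens that contain at least one letter or digit, lowercased.
words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

unique_words = set(words)
lengths = [len(w) for w in words]

word_stats = {
    "unique_words": len(unique_words),
    "min_len": min(lengths),
    "max_len": max(lengths),
    "mean_len": statistics.mean(lengths),
    "median_len": statistics.median(lengths),
}
print(word_stats)
```

Whether hyphenated compounds such as 'dynamically-typed' count as one word is a design choice; here they are kept whole, as the tokenizer returns them.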

nltk.tokenize.sent_tokenize(text)

['Python is dynamically-typed and garbage-collected.',
 'It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming.',
 'It is often described as a "batteries included" language due to its comprehensive standard library.']
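Sentence-level statistics can be derived from a sentence list in the same way. The sketch below assumes a hypothetical list `sentences` (with one deliberate duplicate) and uses whitespace splitting as a rough word count; a real solution would more likely tokenize each sentence properly.

```python
import statistics

# Hypothetical sample: sentences in the shape sent_tokenize returns.
sentences = [
    "Python is dynamically-typed and garbage-collected.",
    "It supports multiple programming paradigms.",
    "It supports multiple programming paradigms.",
    "It is often described as a batteries included language.",
]

# Assumption: whitespace splitting is a good enough word count for a sketch.
words_per_sentence = [len(s.split()) for s in sentences]

sentence_stats = {
    "sentences": len(sentences),
    "unique_sentences": len(set(sentences)),
    "min_words": min(words_per_sentence),
    "max_words": max(words_per_sentence),
    "mean_words": statistics.mean(words_per_sentence),
    "median_words": statistics.median(words_per_sentence),
}

# min/max with a key return the first sentence that reaches the extreme.
shortest = min(sentences, key=lambda s: len(s.split()))
longest = max(sentences, key=lambda s: len(s.split()))
print(sentence_stats)
print(shortest, "|", longest)
```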

nltk.corpus.stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 'hier',
 'hin',
 'hinter',
 'ich',
 'mich',
 'mir',
 'ihr',
 'ihre',
 'ihrem',
 'ihren',
 'ihrer',
 'ihres',
 'euch',
 'im',
 'in',
 'indem',
 'ins',
 'ist',
 'jede',
 'jedem',
 'jeden',
 'jeder',
 'jedes',
 'jene',
 'jenem',
 'jenen',
 'jener',
 'jenes',
 'jetzt',
 'kann',
 'kein',
 'keine',
 'keinem',
 'keinen',
 'keiner',
 'keines',
 'können',
 'könnte',
 'machen',
 'man',
 'manche',
 'manchem',
 'manchen',
 'mancher',
 'manches',
 'mein',
 'meine',
 'meinem',
 'meinen',
 'meiner',
 'meines',
 'mit',
 'muss',
 'musste',
 'nach',
 'nicht',
 'nichts',
 'noch',
 'nun',
 'nur',
 'ob',
 'oder',
 'ohne',
 'sehr',
 'sein',
 'seine',
 'seinem',
 'seinen',
 'seiner',
 'seines',
 'selbst',
 'sich',
 'sie',
 'ihnen',
 'sind',
 'so',
 'solche',
 'solchem',
 'solchen',
 'solcher',
 'solches',
 'soll',
 'sollte',
 'sondern',
 'sonst',
 'über',
 'um',
 'und',
 'uns',
 'unsere',
 'unserem',
 'unseren',
 'unser',
 'unseres',
 'unter',
 'viel',
 'vom',
 'von',
 'vor',
 'während',
 'war',
 'waren',
 'warst',
 'was',
 'weg',
 'weil',
 'weiter',
 'welche',
 'welchem',
 'welchen',
 'welcher',
 'welches',
 'wenn',
 'werde',
 'werden',
 'wie',
 'wieder',
 'will',
 'wir',
 'wird',
 'wirst',
 'wo',
 'wollen',
 'wollte',
 'würde',
 'würden',
 'zu',
 'zum',
 'zur',
 'zwar',
 'zwischen']
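A stopword list like this is typically used to filter tokens before counting. A minimal sketch, assuming a small hand-picked subset of the German list above and a hypothetical tokenized sentence; storing the stopwords in a set keeps each membership check cheap:

```python
from collections import Counter

# A small subset of the German stopword list shown above; in practice the
# full list from nltk.corpus.stopwords.words('german') would be used.
stopwords = {"der", "die", "das", "und", "ist", "sind", "nicht", "im"}

# Hypothetical, already tokenized sample.
tokens = ["Der", "Hund", "und", "die", "Katze", "sind", "nicht", "im", "Haus",
          ".", "Der", "Hund", "ist", "im", "Garten", "."]

# Lowercase, drop punctuation-only tokens, then drop stopwords.
words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]
content_words = [w for w in words if w not in stopwords]

top_all = Counter(words).most_common(3)
top_content = Counter(content_words).most_common(3)
print(top_all)
print(top_content)
```

Lowercasing before the lookup matters, because the NLTK stopword lists are lowercase and sentence-initial 'Der' would otherwise slip through.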

nltk_tokens = nltk.word_tokenize(text)
print(list(nltk.bigrams(nltk_tokens)))

[('Python', 'is'), ('is', 'dynamically-typed'), ('dynamically-typed', 'and'), ('and', 'garbage-collected'), ('garbage-collected', '.'), ('.', 'It'), ('It', 'supports'), ('supports', 'multiple'), ('multiple', 'programming'), ('programming', 'paradigms'), ('paradigms', ','), (',', 'including'), ('including', 'structured'), ('structured', '('), ('(', 'particularly'), ('particularly', ','), (',', 'procedural'), ('procedural', ')'), (')', ','), (',', 'object-oriented'), ('object-oriented', 'and'), ('and', 'functional'), ('functional', 'programming'), ('programming', '.'), ('.', 'It'), ('It', 'is'), ('is', 'often'), ('often', 'described'), ('described', 'as'), ('as', 'a'), ('a', '``'), ('``', 'batteries'), ('batteries', 'included'), ('included', "''"), ("''", 'language'), ('language', 'due'), ('due', 'to'), ('to', 'its'), ('its', 'comprehensive'), ('comprehensive', 'standard'), ('standard', 'library'), ('library', '.')]
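To rank bigrams by frequency, the pairs can be fed into `collections.Counter`. The sketch below is a standard-library variant on a hypothetical token list; `zip(tokens, tokens[1:])` produces the same adjacent pairs as `nltk.bigrams` does for a list.

```python
from collections import Counter

# Hypothetical, already tokenized and lowercased sample.
tokens = ["it", "is", "a", "language", ".", "it", "is", "often", "used", "."]

# Pair each token with its successor and count the pairs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

top_bigrams = bigram_counts.most_common(2)
print(top_bigrams)
```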

df = pd.DataFrame([['ma', 20], ['ala', 15], ['psa', 10], ['kota', 10]], columns=['słowo', 'liczba'])
fig = px.bar(df, x="słowo", y="liczba")
fig.show()

df = pd.DataFrame([[random.choice(['ang','polski','hiszp']), np.random.geometric(0.2)]  for i in range(5000) ], columns=['jezyk', 'dlugosc'])
fig = px.histogram(df, x="dlugosc",facet_row='jezyk',nbins=50, hover_data=df.columns)
fig.show()

?px.histogram

TASK 1

(40 points)

Find sample texts from the same domain (1_000_000 words), or even translations of one another, in:

English
Polish
a Romance language

Suggested tools:

nltk
plotly express
the collections library
spacy (optional)

For each language:

count the number of unique lowercased words (with and without stemming)
count the number of characters
count the number of unique characters
count the number of sentences
count the number of unique sentences
report the min, max, mean, and median number of characters per word
report the min, max, mean, and median number of words per sentence; find the shortest and longest sentences
generate a word cloud (as is and after removing stopwords)
list the 20 most frequent words (as is and after removing stopwords) (lowercase)
list the 20 most frequent bigrams (as is and after removing stopwords)
plot word frequencies (histogram or line plot) so that the plot is readable; try a logarithmic scale on the x axis (but not on the y axis for now), dropping words below an occurrence threshold, etc.
same as above, but for bigrams
same as above, but for characters
draw a bar plot of parts of speech (PART OF SPEECH TAGS, top level of the tag hierarchy only)
for a sample of 10000 sentences, check how often langdetect https://pypi.org/project/langdetect/ is wrong, and in what way
illustrate Zipf's law (px.line with markers)
write conclusions (10-50 sentences)
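For the Zipf's law item, the data behind the plot is a rank-frequency table: under Zipf's law frequency falls off roughly as a power of rank, so the points lie close to a straight line on log-log axes. A minimal sketch, assuming a hypothetical toy corpus constructed so that frequency equals 60 / rank; the helper names `rank_frequency` and `loglog_slope` are illustrative, not part of any library:

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, frequency) pairs, rank 1 being the most frequent word."""
    counts = Counter(words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical toy corpus built so that frequency = 60 / rank.
words = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
pairs = rank_frequency(words)
slope = loglog_slope(pairs)
print(pairs)
print(round(slope, 3))
```

On real text the slope is only approximately -1, and the head and tail of the distribution usually bend away from the line. The ranks and frequencies can then be passed as x and y to px.line with logarithmic axes.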

START OF TASK

END OF TASK

TASK 2

(30 points)

Find texts in Polish (each category should consist of 5 separate documents of varying lengths):

legal text
scientific text
text from a Polish novel (e.g. Wolne Lektury)
text from the Polish internet (reddit, wykop, comments)
transcription of spoken text

TASKS:

plot the Gunning fog index (y axis) against the mean sentence length (x axis) on a single chart for all texts, marking domains with colors (px.scatter); for Polish, treat words of more than 3 syllables as long words; you can use https://pyphen.org/ to count syllables
illustrate Heaps' law for all texts on a single chart, marking domains with colors (px.scatter)
write conclusions (10-50 sentences)
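Both plots reduce to a few numbers per document. The Gunning fog index is 0.4 * (words per sentence + 100 * share of long words), and Heaps' law describes how the number of distinct words grows with the number of tokens read. The sketch below is a rough illustration under stated assumptions: `count_syllables` is a crude vowel-counting heuristic standing in for pyphen, the function names are hypothetical, and the input is a hand-made two-sentence sample.

```python
POLISH_VOWELS = set("aąeęioóuy")

def count_syllables(word):
    """Rough heuristic (an assumption, not pyphen): one syllable per vowel,
    except an 'i' directly followed by another vowel."""
    word = word.lower()
    count = 0
    for i, ch in enumerate(word):
        if ch not in POLISH_VOWELS:
            continue
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "i" and nxt in POLISH_VOWELS:
            continue
        count += 1
    return count

def gunning_fog(sentences, long_word_syllables=3):
    """sentences: list of word lists. A word is long when it has more than
    long_word_syllables syllables, as the task suggests for Polish."""
    words = [w for s in sentences for w in s]
    long_words = [w for w in words if count_syllables(w) > long_word_syllables]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(long_words) / len(words))

def heaps_curve(words):
    """Return (tokens seen, distinct words seen) after each token."""
    seen = set()
    curve = []
    for n, w in enumerate(words, start=1):
        seen.add(w.lower())
        curve.append((n, len(seen)))
    return curve

# Hypothetical two-sentence sample, already tokenized, punctuation removed.
sentences = [["Ala", "ma", "kota"], ["Kot", "ma", "niepodległość"]]
fog = gunning_fog(sentences)
curve = heaps_curve([w for s in sentences for w in s])
print(round(fog, 2))
print(curve)
```

Each document then contributes one (mean sentence length, fog index) point to the first scatter plot, and either its whole curve or its final (tokens, distinct words) point to the Heaps' law plot.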

START OF TASK

END OF TASK

COMPLETING THE TASKS

Follow the instructions in 01_Kodowanie_tekstu.ipynb
